Reference

okscraper.base

class okscraper.base.BaseScraper(*args, **kwargs)

Abstract scraper class. Concrete scrapers should extend it.

Subclasses must define the following:

def __init__(self, *args, **kwargs):
    # an instance of a class derived from okscraper.sources.BaseSource
    self.source = ...
    # an instance of a class derived from okscraper.storages.BaseStorage
    self.storage = ...

def _scrape(self):
    # do the actual scraping here: read from self.source and store to self.storage
    ...
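
A minimal sketch of a concrete scraper that follows the contract above. The BaseSource, BaseStorage and BaseScraper classes below are simplified stand-ins written for this example so that it runs on its own; they are not the okscraper implementations. The public scrape() wrapper and the StaticSource, WordListStorage and WordScraper names are assumptions made for illustration.

```python
class BaseSource(object):
    """Stand-in for okscraper.sources.BaseSource (assumed shape)."""

    def fetch(self, *args, **kwargs):
        raise NotImplementedError


class BaseStorage(object):
    """Stand-in for okscraper.storages.BaseStorage (assumed shape)."""

    def store(self, *args, **kwargs):
        raise NotImplementedError


class BaseScraper(object):
    """Stand-in for okscraper.base.BaseScraper (assumed shape)."""

    def scrape(self, *args, **kwargs):
        # Assumption: a public entry point that delegates to _scrape().
        self._scrape(*args, **kwargs)
        return self.storage


class StaticSource(BaseSource):
    """Hypothetical source that returns a fixed string."""

    def __init__(self, text):
        self._text = text

    def fetch(self):
        return self._text


class WordListStorage(BaseStorage):
    """Hypothetical storage that keeps stored values in a list."""

    def __init__(self):
        self.items = []

    def store(self, value):
        self.items.append(value)


class WordScraper(BaseScraper):
    """Hypothetical scraper: splits the fetched text into words."""

    def __init__(self, text):
        self.source = StaticSource(text)
        self.storage = WordListStorage()

    def _scrape(self):
        # Read from the source, write each word to the storage.
        for word in self.source.fetch().split():
            self.storage.store(word)


storage = WordScraper("hello scraping world").scrape()
print(storage.items)
```

The scraper itself only wires a source to a storage; swapping either one should not require changes to _scrape as long as fetch and store keep the same shape.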

class okscraper.base.ParsingFromFileTestCase(methodName='runTest')

Base class for testing scrapers with input read from a file.

Minimal implementation sample:

class MyScraperTestCase(ParsingFromFileTestCase):
    def _getScraperClass(self):
        return MyScraper

    def _getFilename(self):
        # this is a file containing test data
        return 'my_data_<<id>>.xml'

    def testParsing(self):
        self.assertScrape(
            args=(220,),
            expectedData={'id': 220, 'name': 'Hello World'}
        )

okscraper.storages

class okscraper.storages.BaseStorage

Abstract class. Implementing classes define the following methods:

store - store data
commit - (optional) commit the stored data
get - (optional) return the stored data, or a pointer to the stored data
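
A minimal sketch of a storage that implements all three methods. The BaseStorage class here is a stand-in for okscraper.storages.BaseStorage, and BufferedListStorage is a hypothetical name; the idea that commit moves pending items into the committed set is an assumption chosen to show why the two steps are separate.

```python
class BaseStorage(object):
    """Stand-in for okscraper.storages.BaseStorage (assumed shape)."""

    def store(self, *args, **kwargs):
        raise NotImplementedError

    def commit(self):
        # Optional: the default does nothing.
        pass

    def get(self):
        # Optional: the default returns nothing.
        return None


class BufferedListStorage(BaseStorage):
    """Hypothetical storage that buffers items until commit() is called."""

    def __init__(self):
        self._pending = []
        self._committed = []

    def store(self, item):
        # Stored items are only buffered at this point.
        self._pending.append(item)

    def commit(self):
        # Move everything buffered so far into the committed list.
        self._committed.extend(self._pending)
        self._pending = []

    def get(self):
        # Return a copy so callers cannot mutate internal state.
        return list(self._committed)


storage = BufferedListStorage()
storage.store({"id": 220, "name": "Hello World"})
before_commit = storage.get()
storage.commit()
after_commit = storage.get()
```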

class okscraper.storages.DictStorage: Storage for dict data

class okscraper.storages.ListStorage: Storage for list data

okscraper.sources

class okscraper.sources.BaseSource

Abstract base class for sources.

Extending classes must implement a fetch method that returns the input data.
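
A minimal sketch of a custom source. BaseSource below is a stand-in for okscraper.sources.BaseSource, and LocalTextSource is a hypothetical class that reads a text file; it is not the library's FileSource, whose behaviour is not described here.

```python
import os
import tempfile


class BaseSource(object):
    """Stand-in for okscraper.sources.BaseSource (assumed shape)."""

    def fetch(self, *args, **kwargs):
        raise NotImplementedError


class LocalTextSource(BaseSource):
    """Hypothetical source that returns the contents of a text file."""

    def __init__(self, path):
        self._path = path

    def fetch(self):
        # The input data is the whole file, decoded as UTF-8.
        with open(self._path, "r", encoding="utf-8") as handle:
            return handle.read()


# Write a small sample file, then fetch it back through the source.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.xml")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("<item id='220'>Hello World</item>")
    fetched = LocalTextSource(sample_path).fetch()
```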

class okscraper.sources.UrlSource(source_string): fetch data from a URL

class okscraper.sources.FileSource(source_string): fetch data from a file

class okscraper.sources.ScraperSource(scraper): fetch data from another okscraper scraper

okscraper.cli.runner

class okscraper.cli.runner.Runner(module_name, scraper_class_name=None, *args, **kwargs)

Provides functionality for running a scraper from the command line.

It takes a module_name and looks for a scrapers module under that module name.

For example, if module_name = lobbyists, the scrapers module is lobbyists.scrapers.

It then looks for a MainScraper class in that module and runs that scraper.

Alternatively, if scraper_class_name is provided, it uses that scraper class instead.

Any additional args and kwargs are passed on to the scraper.
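
The lookup described above can be sketched with importlib. This is an illustration of the described behaviour, not the Runner source: resolve_scraper_class is a hypothetical helper, and the in-memory lobbyists package is built only so the example runs without any files on disk.

```python
import importlib
import sys
import types


def resolve_scraper_class(module_name, scraper_class_name=None):
    """Hypothetical helper mirroring the lookup the Runner is described as doing."""
    # The scrapers module lives under the given module name.
    scrapers_module = importlib.import_module(module_name + ".scrapers")
    # Fall back to MainScraper when no class name is given.
    return getattr(scrapers_module, scraper_class_name or "MainScraper")


class MainScraper(object):
    """Placeholder default scraper."""


class OtherScraper(object):
    """Placeholder alternative scraper."""


# Register a throwaway "lobbyists" package with a "scrapers" submodule.
package = types.ModuleType("lobbyists")
package.__path__ = []  # an empty __path__ marks the module as a package
scrapers = types.ModuleType("lobbyists.scrapers")
scrapers.MainScraper = MainScraper
scrapers.OtherScraper = OtherScraper
sys.modules["lobbyists"] = package
sys.modules["lobbyists.scrapers"] = scrapers

default_class = resolve_scraper_class("lobbyists")
named_class = resolve_scraper_class("lobbyists", "OtherScraper")
```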

class okscraper.cli.runner.LogRunner(*args, **kwargs): Adds logging capabilities to the Runner class

class okscraper.cli.runner.DbLogRunner(*args, **kwargs): Adds database logging capabilities to the LogRunner class