Mining Engine

This is the main driver for connecting to a site and gathering the information located by a centaurminer.PageLocations object.

class centaurminer.MiningEngine(site_locations: centaurminer.DOM_elements.PageLocations, driver_path=None, headless=True)

A simple mining engine to gather information from article-hosting websites.

The locations in the DOM for each element are given by a PageLocations object in the constructor. This class then uses those instructions to gather the info.

To give instructions to mine data for a location stored in PageLocations.xyz, you should create/override the get_xyz() function, like so:

def get_xyz(self, element):
    '''
    Short description

    Arguments
    ---------
    element: :class:`centaurminer.Element`
        Location of the given element in the page.
    '''
    # Add instructions here, using self.wd and self.get
    # Below is equivalent to what happens without this override function
    return self.get(element)
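
The override convention above can be illustrated with a small, browser-free sketch. It is not the library's implementation: SketchEngine, TitleCaseEngine, the locations dictionary, and the page dictionary are hypothetical stand-ins, and the getattr-based lookup is only an assumption about how a get_xyz() override might be preferred over the default get().

```python
class SketchEngine:
    """Hypothetical stand-in for MiningEngine; not the library's code."""

    def __init__(self, locations, page):
        self.locations = locations  # field name -> locator string
        self.page = page            # locator string -> text, standing in for the DOM
        self.results = {}

    def get(self, element):
        # Default extraction: look the locator up, or return None if absent.
        return self.page.get(element)

    def gather(self):
        for name, element in self.locations.items():
            # Prefer a get_<name> override when the subclass defines one,
            # otherwise fall back to the default get().
            handler = getattr(self, f"get_{name}", self.get)
            self.results[name] = handler(element)
        return self.results


class TitleCaseEngine(SketchEngine):
    def get_title(self, element):
        # Custom processing on top of the default extraction.
        raw = self.get(element)
        return raw.strip().title() if raw else raw


engine = TitleCaseEngine(
    locations={"title": "h1", "abstract": "div.abstract"},
    page={"h1": "  mining the web  ", "div.abstract": "A short summary."},
)
results = engine.gather()
print(results)
```

Only the title field goes through the override here; abstract has no get_abstract() method, so it falls through to get() unchanged.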

Parameters

site_locations (centaurminer.PageLocations) – A class reference to a subclass of centaurminer.PageLocations.
driver_path (str, optional) – The location of your webdriver. If this is not specified, one will be installed/cached automatically.
headless (bool, optional) – If False, the webdriver will open a GUI as it performs its tasks. Defaults to True (no GUI).

site

Stores the location of the elements you want to extract information from.

Type: PageLocations

wd

The webdriver used to connect to and collect DOM elements from the URL you specify.

Type: selenium.webdriver

results

Storage dictionary for the results from data gathering.

Type: dict

gather(url)

Gather the information denoted in self.site from a single page.

Parameters

url (str) – URL of the page you want to mine data from.

get(element, several=False)

Default method for extracting an element from the page.

Handles errors gracefully and waits for the element to become visible before grabbing it. When writing a custom get_* method, call this function to grab the data from the element before doing any additional processing on it.

Parameters

element (centaurminer.Element) – The location of the element to mine data from.
several (bool, optional) – If True, get all elements matching this location instead of only the first one on the page. Defaults to False.

Note

You should only use several=True in a custom get_* method, so you can do more processing after getting this list of elements.
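
As a rough illustration of the note above, the sketch below shows a custom get_* method that requests every match and then post-processes the list. FakeEngine, KeywordEngine, get_keywords, and the "span.kw" locator are hypothetical, and the assumption that several=True yields a list of strings is made only for this example; the real get() may return something different.

```python
class FakeEngine:
    """Hypothetical stand-in for MiningEngine, backed by a dict instead of a browser."""

    def __init__(self, page):
        self.page = page  # locator string -> list of texts found at that locator

    def get(self, element, several=False):
        matches = self.page.get(element, [])
        if several:
            return list(matches)                # every match, for further processing
        return matches[0] if matches else None  # first match only


class KeywordEngine(FakeEngine):
    def get_keywords(self, element):
        # Fetch all matches, then clean, de-duplicate and join them.
        raw = self.get(element, several=True)
        cleaned = [k.strip().lower() for k in raw if k and k.strip()]
        unique = list(dict.fromkeys(cleaned))  # keeps first-seen order
        return "; ".join(unique)


engine = KeywordEngine({"span.kw": ["Genomics ", "virology", "genomics", ""]})
keywords = engine.get_keywords("span.kw")
first_only = engine.get("span.kw")
print(keywords)
```

Keeping several=True inside the custom method means the list never leaves the method unprocessed; callers still receive a single value per field.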

get_authors(element)

Override for author collection instructions.

Gets the list of authors and joins them together using TagList().

Parameters: element (centaurminer.Element) – The location of the element to mine data from. In this case, it is several elements located with the same identifier.