Mining Engine

This is the main driver for connecting to a site and gathering the information specified by a centaurminer.PageLocations object.

class centaurminer.MiningEngine(site_locations: centaurminer.DOM_elements.PageLocations, driver_path=None, headless=True)

A simple mining engine to gather information from article-hosting websites.

The DOM location of each element is given by the PageLocations object passed to the constructor. The engine then uses those locations to gather the information.

To give instructions for mining the data at a location stored in PageLocations.xyz, create or override a get_xyz() method, like so:

def get_xyz(self, element):
    '''
    Short description

    Arguments
    ---------
    element: :class:`centaurminer.Element`
        Location of the given element in the page.
    '''
    # Add instructions here, using self.wd and self.get
    # Below is equivalent to what happens without this override function
    return self.get(element)

Parameters
  • site_locations (centaurminer.PageLocations) – A class reference to a subclass of centaurminer.PageLocations.

  • driver_path (str, optional) – The location of your webdriver. If this is not specified, one will be installed/cached automatically.

  • headless (bool, optional) – If False, the webdriver will open a GUI as it performs its tasks. Defaults to True (no GUI).

site

Stores the location of the elements you want to extract information from.

Type

PageLocations

wd

The webdriver used to connect to and collect DOM elements from the URL you specify.

Type

selenium.webdriver

results

Storage dictionary for the results from data gathering.

Type

dict

gather(url)

Gather the information denoted in self.site from a single page.

Parameters
  • url (str) – URL of the page you want to mine data from.
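
The relationship between gather(), get() and the get_xyz() overrides can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: SketchEngine, FakeLocations, the string locators and the in-memory page dictionary are all hypothetical stand-ins for the real engine, a PageLocations subclass, centaurminer.Element objects and the webdriver. It only shows one plausible way to look up a get_<name> method for each location and fall back to get() when none is defined.

```python
class FakeLocations:
    """Hypothetical stand-in for a PageLocations subclass."""
    title = "h1.title"        # plain strings replace centaurminer.Element here
    authors = "span.author"


class SketchEngine:
    """Illustrative stand-in, not centaurminer.MiningEngine itself."""

    def __init__(self, site_locations, fake_pages):
        self.site = site_locations
        self.results = {}
        # Assumption: url -> {locator: [text, ...]} replaces a real webdriver.
        self._pages = fake_pages
        self._page = {}

    def get(self, element, several=False):
        """Return the first value for a locator, or all of them if several=True."""
        values = self._page.get(element, [])
        if several:
            return values
        return values[0] if values else None

    def get_authors(self, element):
        """Override for 'authors': collect every match and join into one string."""
        return "; ".join(self.get(element, several=True))

    def gather(self, url):
        """Fill self.results with one entry per public attribute of self.site."""
        self._page = self._pages[url]
        self.results = {}
        for name, element in vars(self.site).items():
            if name.startswith("_"):
                continue  # skip dunder entries such as __module__ and __doc__
            handler = getattr(self, f"get_{name}", None)
            # Use the override when it exists, otherwise the default get().
            self.results[name] = handler(element) if handler else self.get(element)


pages = {
    "https://example.org/article": {
        "h1.title": ["A Title"],
        "span.author": ["Ada", "Bob"],
    }
}
engine = SketchEngine(FakeLocations, pages)
engine.gather("https://example.org/article")
print(engine.results)
```

In this sketch, title has no override and goes through get(), while authors is handled by get_authors().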

get(element, several=False)

Default method for extracting an element from the page.

Handles errors gracefully and waits for the element to become visible before grabbing it. When creating custom get_* methods, use this method to grab the data from the element before doing any additional processing on it.

Parameters
  • element (centaurminer.Element) – The location of the element to mine data from.

  • several (bool, optional) –

    If True, get all elements matching this location instead of only the first one on the page. Defaults to False.

    Note

    You should only use several=True in a custom get_* method, so you can do more processing after getting this list of elements.
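
As a rough illustration of the note above, the sketch below shows a custom get_* method that calls get(..., several=True) and then post-processes the list. Everything here is an assumption made for the example: SketchEngine is a stand-in with the same get(element, several=False) shape, get_keywords is a hypothetical method name, and get() is assumed to return plain text values, which the real engine may not do.

```python
class SketchEngine:
    """Stand-in exposing the documented get(element, several=False) shape."""

    def __init__(self, page):
        # Assumption: locator -> list of text values, replacing a live page.
        self._page = page

    def get(self, element, several=False):
        values = self._page.get(element, [])
        if several:
            return values
        return values[0] if values else None

    def get_keywords(self, element):
        """Hypothetical override: fetch all matches, then clean and deduplicate."""
        raw = self.get(element, several=True) or []
        seen = []
        for value in raw:
            cleaned = value.strip().lower()
            if cleaned and cleaned not in seen:
                seen.append(cleaned)  # keep first occurrence, preserve order
        return ", ".join(seen)


engine = SketchEngine({"meta.keyword": [" Virus", "virus ", "", "Vaccine"]})
print(engine.get_keywords("meta.keyword"))
```

Keeping several=True inside the override means the list of matches is reduced to a single value before the result is stored.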

get_authors(element)

Override for author collection instructions.

Gets the list of authors and joins them together using TagList().

Parameters
  • element (centaurminer.Element) – The location of the element to mine data from; in this case, several elements located with the same identifier.