Mining Engine
This is the main driver for connecting to a site and gathering the information whose locations are specified in a centaurminer.PageLocations object.
-
class centaurminer.MiningEngine(site_locations: centaurminer.DOM_elements.PageLocations, driver_path=None, headless=True)
A simple mining engine to gather information from article-hosting websites.
The locations in the DOM for each element are given by a PageLocations object in the constructor. This class then uses those instructions to gather the info.
To give instructions to mine data for a location stored in PageLocations.xyz, you should create or override the get_xyz() function, like so:

    def get_xyz(self, element):
        '''
        Short description

        Arguments
        ---------
        element: :class:`centaurminer.Element`
            Location of the given element in the page.
        '''
        # Add instructions here, using self.wd and self.get
        # Below is equivalent to what happens without this override function
        return self.get(element)
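The override pattern described above can be tried out without a webdriver. The following is a minimal sketch, not the library's implementation: StubEngine is a hypothetical stand-in for MiningEngine, a plain dict plays the role of PageLocations, and a second dict plays the role of the page. It assumes the engine looks for a get_<name> method for each location and falls back to get() when none exists.

```python
class StubEngine:
    """Hypothetical stand-in for MiningEngine (no webdriver involved)."""

    def __init__(self, locations, page):
        self.locations = locations  # name -> locator, stand-in for PageLocations
        self.page = page            # locator -> text, stand-in for the DOM
        self.results = {}

    def get(self, element):
        # Default extraction: return the text at the locator, or None if absent.
        return self.page.get(element)

    def gather(self):
        for name, element in self.locations.items():
            # Assumed dispatch rule: use get_<name> if defined, else self.get.
            handler = getattr(self, f"get_{name}", self.get)
            self.results[name] = handler(element)
        return self.results


class TitleCaseEngine(StubEngine):
    def get_title(self, element):
        # Fetch with the default method first, then post-process the value.
        raw = self.get(element)
        return raw.strip().title() if raw else None


engine = TitleCaseEngine(
    locations={"title": "h1.title", "doi": "span.doi"},
    page={"h1.title": "  deep sea mining  ", "span.doi": "10.1000/xyz"},
)
results = engine.gather()
```

Here only the title goes through custom processing; the doi location has no get_doi method, so it is collected by the default get().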
- Parameters
site_locations (centaurminer.PageLocations) – A class reference to a subclass of centaurminer.PageLocations.
driver_path (str, optional) – The location of your webdriver. If this is not specified, one will be installed/cached automatically.
headless (bool, optional) – If False, the webdriver will open a GUI as it performs its tasks. Defaults to True (no GUI).
-
site
Stores the locations of the elements you want to extract information from.
- Type
-
wd
The webdriver used to connect to and collect DOM elements from the URL you specify.
- Type
selenium.webdriver
-
results
Storage dictionary for the results from data gathering.
- Type
dict
-
gather(url)
Gather the information specified in self.site from a single page.
Arguments:
url: (string) URL for the site you want to mine data from.
-
get(element, several=False)
Default method for extracting an element from the page.
Handles errors gracefully and waits for the element to become visible before grabbing it. When creating custom get_* functions, use this function to grab the data from the element before doing additional processing on it.
- Parameters
element (centaurminer.Element) – The location of the element to mine data from.
several (bool) – Use to indicate that we should get all elements of this type, instead of the first one on the page.
Note
You should only use several=True in a custom get_* method, so you can do more processing after getting this list of elements.
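To illustrate the note above, here is a small sketch of a custom get_* method that requests every match and then post-processes the list. It does not use the real library: StubEngine, its dict-backed page, and the get_keywords name are all hypothetical, and the stub's get() only mimics the documented behavior (first match by default, all matches with several=True).

```python
class StubEngine:
    """Hypothetical stand-in for MiningEngine; the page is a plain dict."""

    def __init__(self, page):
        self.page = page  # locator -> list of matching texts

    def get(self, element, several=False):
        matches = self.page.get(element, [])
        if several:
            return matches  # every match, for further processing
        return matches[0] if matches else None  # only the first match

    def get_keywords(self, element):
        # Ask for all matches, then normalize and de-duplicate them in order.
        raw = self.get(element, several=True)
        seen = []
        for keyword in raw:
            keyword = keyword.strip().lower()
            if keyword and keyword not in seen:
                seen.append(keyword)
        return seen


engine = StubEngine({"li.kw": ["Selenium", " scraping ", "selenium"]})
keywords = engine.get_keywords("li.kw")
first = engine.get("li.kw")
```

The list returned with several=True never reaches the results unprocessed; the custom method reduces it to a clean value first, which is the use case the note describes.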
Override for author collection instructions.
Simply gets the list of authors and joins them together, using TagList().
- Parameters
element (centaurminer.Element) – The location of the element to mine data from. In this case, it is several elements located with the same identifier.