3. Critical coding - web-scraping utilities¶
This episode is highly incomplete and does not do a good job of demonstrating critical coding...
There are missing tests; the code is still quite messy (e.g. the exposed API could be improved and exceptions are relied on for some control flow); several concepts are glossed over; and the
## annotations are not coherent.
On reflection, the complexity of interacting with a web-browser (and everything that can go wrong in doing so) and the level of prior knowledge required means that this was probably a poor choice of task to pick for discussing critical coding.
A lot of the modern web relies on Javascript to do just about everything, if we just use requests then many web-pages will not render content => use Selenium
Selenium is meant to be used to test applications (implying prior knowledge of the website structure - e.g. the page is loaded when we can find a <div> with a specific class/ID - but we are using it to scrape thousands of websites we have no prior knowledge of which adds significant complexity, e.g.:
- We don't know if the website is reachable
- The website might crash the browser
- We don't know when a webpage can be considered loaded. There could be requests for content that take ages or timeout - these could be requests for crucial content or a tiny script snippet to load an ad.
Because of these sources of complexity and the fact that we want to build a scraping pipeline that is robust to failure (we don't want one unreachable website or a (recoverable) selenium driver crash to cause the whole pipeline to fail) we've elected to create a small utility library to help make the task of using Selenium for web-scraping a little easier.
kuebiko/utils/url.pykuebiko/utils/selenium/constants.py- Defines constants, e.g. default timeouts and chrome network error codesexceptions.py- Defines two new exception types we need to handle errorsdriver.py- Provides a wrapper for SeleniumWebDrivers that enables recovery from crasheserror_handling.py- Defines error handling functions that map from the errors that Chrome/Selenium give us, to a smaller set of errors closer to the domain of scrapingutils.py- General utility functions for user-code use (bad catch-all naming)get_page.py- User functions ...
kuebiko/utils/{tests/,selenium/tests}- Tests