
Web Scraping
Wednesday, 9-OCT-2019
HTTP, Requests, and screen scraping
Learning Objectives
- Explain the essential structure of HTML pages and how they store structured information. Write a basic HTML page about a life interest.
- Employ the urllib and BeautifulSoup libraries to pull down an HTML page and find a particular element, and extract the data in that element.
- Systematically store the result of a screen scraping endeavor into a text file that can be shared and used as inputs by other programs or tools.
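The three objectives above can be sketched in a few lines. This is a minimal example, not a full scraper: the HTML here is an inline sample standing in for a downloaded page (in practice you would get it from `urllib.request.urlopen(url).read()`), and the element names are invented for illustration.

```python
# Parse an HTML page, find a particular element, extract its data,
# and store the result in a shareable text file.
from bs4 import BeautifulSoup

# Inline sample HTML; a real scraper would fetch this with urllib.
html = """
<html>
  <body>
    <h1>My Bookshelf</h1>
    <ul id="books">
      <li>The Hobbit</li>
      <li>Dune</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element; find_all() returns all of them.
book_list = soup.find("ul", id="books")
titles = [li.get_text(strip=True) for li in book_list.find_all("li")]

# Store the result in a plain text file other programs can read.
with open("books.txt", "w") as out:
    out.write("\n".join(titles))

print(titles)  # ['The Hobbit', 'Dune']
```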
Internal Resources
External Resources
- Official Python documentation on the built-in string methods
- Beautiful Soup is an HTML-parsing library for breaking down HTML pages into their constituent elements. This is an invaluable library for all screen scrapers out there.
- The urllib package in the Python 3 standard library provides a suite of tools for fetching URL data. (It is not built on the Requests library; Requests is a separate, third-party alternative with a friendlier API.)
- Goodreads.com is a book repository that returns simple, parse-able HTML from URL-encoded queries. This is our sample site for screen scraping.
- For a primer on HTML and web technology in general, visit the Mozilla Foundation's "Learn Web Development" page for HTML, CSS, and JavaScript tutorials, references, examples, and tips. Where to start? HTML basics
- RFC 2616: the original HTTP/1.1 specification (since superseded by RFCs 7230-7235), heavy reading, nice to know about
- Selenium, a browser-automation testing tool which can help overcome JavaScript barriers to data access on a target page
Screencasts
Screen cast of Spring 2020 online session Part 1
Screen cast of Spring 2020 online session Part 2
Lesson Sequence
- Structure of the WWW: Requests and responses in browsers
- Using the request and response tools in urllib
- Structuring documents in trees! HTML basics
- Using Beautiful Soup to parse a simple HTML file
- Run through screen scraping example on GoodReads.com. Learn how to use urllib and BeautifulSoup
- Work time on screen scraping project
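The "documents as trees" idea in the sequence above can be made concrete without Beautiful Soup at all: the standard library's `html.parser` module walks the same tag structure that Beautiful Soup builds its friendlier tree on top of. A small sketch:

```python
# Print an HTML document's tags indented by nesting depth, showing
# that HTML is a tree of elements inside elements.
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at its current depth, then descend one level.
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag climbs back up one level.
        self.depth -= 1

p = TreePrinter()
p.feed("<html><body><ul><li>one</li><li>two</li></ul></body></html>")
print("\n".join(p.lines))
```

The two `li` tags print at the same depth because they are siblings inside the same `ul`; that sibling structure is exactly what a scraper loops over.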
HTML and HTTP notes
Internet ~ Early 1970s
- Connect computers running compatible operating systems
- Remote logins via a dedicated data network
- Strongly coupled sub-networks that were largely incompatible with one another
World Wide Web ~ 1990s
- Runs on HTTP, on top of the Internet; the Web is a SUBSET of what the Internet carries, not a synonym for it
- HTTP - Hypertext Transfer Protocol V.1.1: Rules for transmitting data on this network
- HTML - Hypertext Markup Language: Format for encoding documents exchanged on the WWW using HTTP
- CSS - Cascading style sheets: Providing browsers with formatting information beyond the built-in stylesheet included with each browser
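To see the request/response cycle from Python, urllib lets you build a request object and inspect it before anything is sent. The URL below is a placeholder; building the `Request` does not touch the network, and the commented-out `urlopen` call is what would actually send it.

```python
# Build an HTTP GET request with urllib and inspect its parts.
import urllib.request

req = urllib.request.Request(
    "http://example.com/index.html",
    headers={"User-Agent": "scraping-lesson/0.1"},
)

print(req.get_method())              # GET
print(req.full_url)                  # http://example.com/index.html
print(req.get_header("User-agent"))  # scraping-lesson/0.1

# Actually sending the request (commented out to avoid network access here):
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.headers["Content-Type"])
#     html = resp.read().decode("utf-8")
```

Setting a User-Agent header is a small courtesy that identifies your scraper to the server; some sites refuse requests that have none.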
Products to Produce
- Code to the specification below. Then upload your Python files and any related documents to your GitHub account. I suggest creating a subdirectory in your git repository called "scraping" or something like that.
Program objective: Create a program that uses urllib and BeautifulSoup to grab HTML code from a public source, parses that source into meaningful bits of data, and spits those meaningful bits out in some form that can be transferred into another tool, such as a CSV for slurping up into a database, a JSON file for use on the web, etc.
Suggestions for good pages to parse: Choose a website whose page content is retrieved with some sort of query, such as a URL-encoded search query. This will allow your system to programmatically tinker with the results you get back, and can be scaled to process lots more data than just a single, static page. Pages with tables of data are great, since that allows us to loop over trees of page elements and process their data one at a time.
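Building those URL-encoded queries by hand is error-prone (spaces, punctuation, and non-ASCII characters all need escaping); `urllib.parse.urlencode` handles it. The base URL and parameter names below are hypothetical; check your target site's actual search URL format in your browser's address bar.

```python
# Programmatically generate URL-encoded search queries so one scraper
# can walk through many result pages.
from urllib.parse import urlencode

base = "https://www.example.com/search"  # hypothetical search endpoint
urls = []
for term in ["python", "web scraping"]:
    query = urlencode({"q": term, "page": 1})  # escapes spaces etc.
    urls.append(f"{base}?{query}")

print(urls[0])  # https://www.example.com/search?q=python&page=1
print(urls[1])  # https://www.example.com/search?q=web+scraping&page=1
```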
Use methods! When possible, please structure your code in discrete methods that accomplish a single task, returning useful values to the caller. This helps reduce code repetition, allows for modular re-use, and makes the code generally more readable than blobs of lines in a heap.
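One way that structure might look for this project, sketched under assumptions: the page layout (table rows of title/author cells) and the function names are invented for illustration, and only the fetch step touches the network, so the parse and save steps can be tested on plain strings.

```python
# A scraper split into single-purpose functions: fetch, parse, save.
import csv
import urllib.request
from bs4 import BeautifulSoup

def fetch_html(url):
    """Download a page and return its HTML as a string."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

def parse_books(html):
    """Return (title, author) tuples from a hypothetical page whose
    rows look like <tr><td>title</td><td>author</td></tr>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 2:
            rows.append(tuple(cells))
    return rows

def save_csv(rows, path):
    """Write rows to a CSV file a database or spreadsheet can slurp up."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "author"])
        writer.writerows(rows)

# Demo on an inline sample instead of a live fetch:
sample = "<table><tr><td>Dune</td><td>Frank Herbert</td></tr></table>"
books = parse_books(sample)
save_csv(books, "books.csv")
print(books)  # [('Dune', 'Frank Herbert')]
```

Because `parse_books` takes a string rather than a URL, you can point the same pipeline at a live page simply by calling `parse_books(fetch_html(url))`.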