
Web Scraping
Wednesday, 9-OCT-2019
HTTP, Requests, and screen scraping
Learning Objectives
- Explain the essential structure of HTML pages and how they store structured information. Write a basic HTML page about a life interest.
- Employ the urllib and BeautifulSoup libraries to pull down an HTML page and find a particular element, and extract the data in that element.
- Systematically store the result of a screen scraping endeavor into a text file that can be shared and used as inputs by other programs or tools.
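The three objectives above can be sketched in a few lines. This is a minimal example, not a full scraper: the HTML here is an inline sample standing in for a downloaded page (in practice you would get it from `urllib.request.urlopen(url).read()`), and the element names are invented for illustration.

```python
# Parse an HTML page, find a particular element, extract its data,
# and store the result in a shareable text file.
from bs4 import BeautifulSoup

# Inline sample HTML; a real scraper would fetch this with urllib.
html = """
<html>
  <body>
    <h1>My Bookshelf</h1>
    <ul id="books">
      <li>The Hobbit</li>
      <li>Dune</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element; find_all() returns all of them.
book_list = soup.find("ul", id="books")
titles = [li.get_text(strip=True) for li in book_list.find_all("li")]

# Store the result in a plain text file other programs can read.
with open("books.txt", "w") as out:
    out.write("\n".join(titles))

print(titles)  # ['The Hobbit', 'Dune']
```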
Internal Resources
External Resources
- Official Python documentation on the built-in string methods
- Beautiful Soup is an HTML-parsing library for breaking down HTML pages into their constituent elements. This is an invaluable library for all screen scrapers out there.
- The urllib package in the Python 3 standard library provides a suite of tools for fetching URL data. (It is not built on the Requests library; Requests is a separate, third-party alternative with a friendlier API.)
- Goodreads.com is a book repository that returns simple, parse-able HTML from URL-encoded queries. This is our sample site for screen scraping.
- For a primer on HTML and web technology in general, visit the Mozilla Foundation's "Learn Web Development" page for HTML, CSS, and JavaScript tutorials, references, examples, and tips. Where to start? HTML basics
- RFC 2616: the original HTTP/1.1 specification (since superseded by RFCs 7230-7235), heavy reading, nice to know about
- Selenium, a browser-automation testing tool which can help overcome JavaScript barriers to data access on a target page
Screencasts
Screen cast of Spring 2020 online session Part 1
Screen cast of Spring 2020 online session Part 2
Lesson Sequence
- Structure of the WWW: Requests and responses in browsers
- Using the request and response tools in urllib
- Structuring documents in trees! HTML basics
- Using Beautiful Soup to parse a simple HTML file
- Run through screen scraping example on GoodReads.com. Learn how to use urllib and BeautifulSoup
- Work time on screen scraping project
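The "documents as trees" idea in the sequence above can be made concrete without Beautiful Soup at all: the standard library's `html.parser` module walks the same tag structure that Beautiful Soup builds its friendlier tree on top of. A small sketch:

```python
# Print an HTML document's tags indented by nesting depth, showing
# that HTML is a tree of elements inside elements.
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at its current depth, then descend one level.
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag climbs back up one level.
        self.depth -= 1

p = TreePrinter()
p.feed("<html><body><ul><li>one</li><li>two</li></ul></body></html>")
print("\n".join(p.lines))
```

The two `li` tags print at the same depth because they are siblings inside the same `ul`; that sibling structure is exactly what a scraper loops over.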
HTML and HTTP notes
Internet ~ Early 1970s
- Connect computers running compatible operating systems
- Remote logins via a dedicated data network
- Strongly coupled sub-networks that were largely incompatible with one another
World Wide Web ~ 1990s
- Runs on HTTP, on top of the Internet; the Web is a SUBSET of what the Internet carries, not a synonym for it
- HTTP - Hypertext Transfer Protocol V.1.1: Rules for transmitting data on this network
- HTML - Hypertext Markup Language: Format for encoding documents exchanged on the WWW using HTTP
- CSS - Cascading style sheets: Providing browsers with formatting information beyond the built-in stylesheet included with each browser
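To see the request/response cycle from Python, urllib lets you build a request object and inspect it before anything is sent. The URL below is a placeholder; building the `Request` does not touch the network, and the commented-out `urlopen` call is what would actually send it.

```python
# Build an HTTP GET request with urllib and inspect its parts.
import urllib.request

req = urllib.request.Request(
    "http://example.com/index.html",
    headers={"User-Agent": "scraping-lesson/0.1"},
)

print(req.get_method())              # GET
print(req.full_url)                  # http://example.com/index.html
print(req.get_header("User-agent"))  # scraping-lesson/0.1

# Actually sending the request (commented out to avoid network access here):
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.headers["Content-Type"])
#     html = resp.read().decode("utf-8")
```

Setting a User-Agent header is a small courtesy that identifies your scraper to the server; some sites refuse requests that have none.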
Products to Produce
- Code to the specification below. Then upload your Python files and any related documents to your GitHub account. I suggest creating a subdirectory in your git repository called "scraping" or something like that.
Program objective: Create a program that uses urllib and BeautifulSoup to grab HTML code from a public source, parses that source into meaningful bits of data, and spits those meaningful bits out in some form that can be transferred into another tool, such as a CSV for slurping up into a database, a JSON file for use on the web, etc.
Suggestions for good pages to parse: Choose a website whose page content is retrieved with some sort of query, such as a URL-encoded search query. This will allow your system to programmatically tinker with the results you get back, and can be scaled to process lots more data than just a single, static page. Pages with tables of data are great, since that allows us to loop over trees of page elements and process their data one at a time.
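Building those URL-encoded queries by hand is error-prone (spaces, punctuation, and non-ASCII characters all need escaping); `urllib.parse.urlencode` handles it. The base URL and parameter names below are hypothetical; check your target site's actual search URL format in your browser's address bar.

```python
# Programmatically generate URL-encoded search queries so one scraper
# can walk through many result pages.
from urllib.parse import urlencode

base = "https://www.example.com/search"  # hypothetical search endpoint
urls = []
for term in ["python", "web scraping"]:
    query = urlencode({"q": term, "page": 1})  # escapes spaces etc.
    urls.append(f"{base}?{query}")

print(urls[0])  # https://www.example.com/search?q=python&page=1
print(urls[1])  # https://www.example.com/search?q=web+scraping&page=1
```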
Use methods! When possible, please structure your code in discrete methods that accomplish a single task, returning useful values to the caller. This helps reduce code repetition, allows for modular re-use, and makes the code generally more readable than blobs of lines in a heap.
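One way that structure might look for this project, sketched under assumptions: the page layout (table rows of title/author cells) and the function names are invented for illustration, and only the fetch step touches the network, so the parse and save steps can be tested on plain strings.

```python
# A scraper split into single-purpose functions: fetch, parse, save.
import csv
import urllib.request
from bs4 import BeautifulSoup

def fetch_html(url):
    """Download a page and return its HTML as a string."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

def parse_books(html):
    """Return (title, author) tuples from a hypothetical page whose
    rows look like <tr><td>title</td><td>author</td></tr>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 2:
            rows.append(tuple(cells))
    return rows

def save_csv(rows, path):
    """Write rows to a CSV file a database or spreadsheet can slurp up."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "author"])
        writer.writerows(rows)

# Demo on an inline sample instead of a live fetch:
sample = "<table><tr><td>Dune</td><td>Frank Herbert</td></tr></table>"
books = parse_books(sample)
save_csv(books, "books.csv")
print(books)  # [('Dune', 'Frank Herbert')]
```

Because `parse_books` takes a string rather than a URL, you can point the same pipeline at a live page simply by calling `parse_books(fetch_html(url))`.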