Check this page for updates on upcoming classes, our learning goals, and the lesson modules we'll use to get there.
DAT 102: Introduction to Data Analytics
Course Schedule and Lesson Guides
Each class session has its own box:
Week 1: Sat 20 Jan 18 - Friday 26 Jan
Data schemas, data collection, and essential analysis
- Organizing structured data in tables
- Approaches to data sampling
- Collecting (Messy) data
- Using spreadsheets for first-pass analysis
- Design a coding schema for basic data table work
- Conduct basic analysis on structured data
- Respond intelligently to the question: what is data analytics?
- Learn about one another with a structured data creation exercise
- Grading performance analysis
- Data ratings algorithms and checking them
Doing Data Science code set
Columbia Data Science Program
Western PA Regional Data Center
Vehicle study columns:
- Visible use of mobile computer device/phone
- Driver not looking at road ("The Look")
- # passengers
- Youth in car(boolean)
- Sloppy driving (swerving, crossing lane lines, running lights)
- Vehicle type: passenger car/pickup, commercial vehicle (tractor trailer, delivery, etc.), mass transit (bus, large van), motorcycle
Please note: weather, location, speed limit, street quality, surrounding features, time of day, temp deg F, method of observation
Week 2: Sat 27 Jan - Friday 2 Feb
Data table design & manipulation | Python crash-course | File conversions
- Create data set tables designed for analytic purposes: clear column headers, data dictionaries, etc.
- Conduct field gathering of data and entering of that data based on data dictionary
- Implement data cleaning procedures to address observational inconsistencies and general "messiness"
- Create a data manipulation and work process log for our distracted driving data
- Export data as a CSV, verify the export, and read that data into python
- Diagram essential python data structures: dictionaries and data frames
- Conduct basic data set manipulations in Python
- Prepare to work on these steps by Tuesday evening and give Eric feedback on sticking points or areas for additional learning by midnight on Tuesday. Either call 412.894.3020 (leave a VM) or, if you really despise talking, send him an email.
- Research and find/purchase/borrow/rent a statistics text that you can study happily and work exercises. I have used Statistics for Business and Economics by McClave and am working on studying the classic Introductory Probability and Statistical Applications by Meyer.
- Install and test Python 3 on your computer.
- Install and test the Anaconda 3 library of data-related python tools.
- Try running Jupyter Notebook, which is a python tool for interactive programming and saving python scripts. It is a package already installed with Anaconda 3--no extra installs needed, just figuring out how to run the server and access it in a web browser.
- Download and install LibreOffice if you ever anticipate not having full access to MS Office products. Once installed, create a spreadsheet, export it as a CSV, and import it into, say, Google Sheets, to test your work.
- Devote 3-4 hours to tinkering with CSV imports using the python tutorials linked here and found online. Place a priority on reading about any fundamentals you may not have in your tool set. Use this as a time to assess your programming acumen and interest in going more in depth into Python.
- Work through basic pandas tutorial--focusing on reading data in and printing out the object information
- Explore one other package in the python standard library by importing it and creating the object and calling a few methods on that object.
- Review our sampling of car data on the computer
- Discuss sampling methods and their importance in the foundational value of our data sets
- Conduct a group observational experiment of car data
- Format and process this data in CSV format and export into Python
- Distracted Driving Data Shared Drive
- Python File resources on this site
- Python Documentation
- Reading on Sampling from the University of Texas at Dallas
- Use a non-archiving search engine like DuckDuckGo to find related resources to the weekly to-dos.
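As a starting point for the CSV import practice above, here is a minimal sketch using only Python's standard `csv` module. The column names (`device_visible`, `passengers`, `vehicle_type`) and the sample rows are made up for illustration and are not our class data; the in-memory text stands in for a file you exported from your spreadsheet.

```python
import csv
import io

# Stand-in for an exported CSV file; these columns and rows are hypothetical.
SAMPLE_CSV = """device_visible,passengers,vehicle_type
yes,0,passenger car
no,2,pickup
yes,1,mass transit
"""

def read_rows(handle):
    """Read a CSV with a header row into a list of dictionaries."""
    return list(csv.DictReader(handle))

# With a real export, pass open("my_export.csv", newline="") instead of StringIO.
rows = read_rows(io.StringIO(SAMPLE_CSV))

# Every value arrives as text, so convert numeric columns explicitly.
for row in rows:
    row["passengers"] = int(row["passengers"])

print(len(rows), "rows; columns:", list(rows[0].keys()))

# Filter the list of dictionaries on one column.
distracted = [r for r in rows if r["device_visible"] == "yes"]
print("Drivers with a visible device:", len(distracted))
```

Each row is a dictionary keyed by column header, which is one reason clear headers with no spaces pay off once the data leaves the spreadsheet.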
Week 3: Sat 3 Feb - Fri 9 Feb
Back to basics with Pivot tables and basic python
- Extract inquiry questions from a dataset of interest
- Using our spreadsheet skills, generate a pivot table and pivot chart to investigate these questions
- Write python code to read in a CSV file and conduct basic processing operations on that set
- Direct the output of a python processing operation to a text file that can be read back by a spreadsheet
- Locate a dataset of interest and brainstorm relevant questions that can be answered with a pivot table
- Explore pivot table basics in MS Excel or in Libre Office
- Answer the inquiry questions using that Pivot table and make a parallel pivot chart
- Write up your mini-analysis in this shared google doc. Use the Table of Contents on page 1 to jump down to your dedicated page. NOTE: you must be logged in to ANY google account to paste in screen shot images of your pivot tables!
- Using your own dataset, use python to do that CSV processing
- Libre Office Calc version of our Allegheny County Jail census data and its pivot table
- WPRDC data set
- MS Excel Pivot Tables tutorial
- Libre Office pivot tables tutorial
- Function/Formula overview for MS Excel
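The two Python goals for this week (read a CSV, then send processed output to a file a spreadsheet can open) might look something like the sketch below. It assumes a hypothetical `facility` column and uses in-memory text in place of real files so it runs anywhere; swap in `open(...)` calls for your own dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical input; replace with open("your_data.csv", newline="").
raw = io.StringIO("name,facility\nA,North\nB,South\nC,North\n")

# Count how many rows fall under each value of the chosen column,
# which is the same tally a one-field pivot table produces.
counts = Counter(row["facility"] for row in csv.DictReader(raw))

# Write the summary as CSV text so a spreadsheet can read it back.
# With a real file, use open("summary.csv", "w", newline="").
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["facility", "row_count"])
for value, n in sorted(counts.items()):
    writer.writerow([value, n])

print(out.getvalue())
```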
Mid-week Exercises: Spreadsheet Madness
Based on last week's class, we have growth needs in spreadsheet fundamentals. These exercises will ask you to process some data in a spreadsheet using a series of functions. You'll be given links to various core skills in spreadsheets. We'll review them in class on Saturday.
Note that these steps will provide you links for resources on using functions in Libre Office Calc since this is a free and open source program anybody can download. You can find similar documentation on MS Excel with an internet search. Many of the functions have the same name across programs.
- Download a CSV version of the non-traffic citations issued in Allegheny county on the WPRDC.
- Open this file in a spreadsheet (either Libre office or Microsoft Corporation's Excel) and save the file as a native spreadsheet file (i.e. either a .ods file or .xlsx). This allows you to use the full features of the spreadsheet and preserve any pivot tables you create.
- Let's make a backup of our data so we can always go back and restart with little hassle. Do this by duplicating the tab that contains the raw data. Give the two tabs logical names so that one is clearly our master copy of the original data and the other is the processed data.
- Take some time to browse the data, using the data dictionary as your guide. Ask yourself: what kind of questions can I ask about this data? Is it clean? Do I see fields that I can easily crunch or do they need to be formatted at all? Are there data fields that I might need to adjust to be easily processable?
- This step sequence will help you develop spreadsheet skills necessary to answer these brainstormed questions, which we'll then discuss in person on Saturday:
We need to do some trimming of this data, particularly the date column and our ugly offenses field. Start by generating a new column next to the existing "CITEDTIME" column. Use the formula called LEFT to extract only the year, month, and day, discarding the time.
Now let's wrestle with the "OFFENSES" column. First, the data are ugly--some are all caps, some are not. The section and subsections are not a consistent length, so we can't just extract those with a nice left() or right() function call. Let's start by getting them all to be in lowercase for consistency (this way it doesn't look like some offenses are more important than others just because they are in all caps). Scan the list of text functions in Libre office for an appropriate function to convert all that text to lowercase. Remember, you'll want to create new columns for each adjustment to the original data. We can always hide the original data.
Before we process the offense data, let's make sure we don't have any trailing spaces at the end of our offenses. Find the name of the function that will automatically remove non-printing characters from the beginning and end of data in a particular cell. You get to do the searching for this one!
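For comparison, here are Python counterparts to the three text operations in the steps so far: taking the leftmost characters, lowercasing, and stripping surrounding whitespace. The sample values, including the timestamp layout, are assumptions for illustration; check the actual format in your download.

```python
# Hypothetical cell values; the real CITEDTIME layout may differ.
cited_time = "2016-02-17T21:30:00"
offense = "  5511(c)(1) CRUELTY TO ANIMALS  "

# Like a left() call with a length of 10: keep only the date portion.
cited_date = cited_time[:10]

# Strip whitespace from both ends, then lowercase for consistency.
offense_clean = offense.strip().lower()

print(cited_date)
print(offense_clean)
```

As in the spreadsheet, the results go into new variables and the original values stay untouched.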
This is a tricky step and it will take time and PATIENCE. It may take you up to an hour to get this working right--but it is a rite of passage in spreadsheets since this formula is so powerful. We would have liked to be given an offense ID that is a consistent length that we could use for sorting and displaying. But we weren't given that data--only the ugly legal code section, subsection, etc. What's worse, those are not in a consistent format. Some have () in them, or more than one pair of (). Some have decimal places, others don't. We want to generate our own set of offense IDs and attach these IDs to existing offense columns. We're going to do this with a special function called VLOOKUP(). The idea behind this function is that we can create a key table that associates unique offense names and our own offense IDs that are pretty, like 1000 and 1001. VLOOKUP will search our data column called OFFENSES and see if there is a matching value in our little lookup table. If a match is found in our lookup table, our own offense ID is returned as the output of the VLOOKUP() function call. This is handy, but tricky to get to work correctly.
Start by reading the documentation on VLOOKUP from LibreOffice help. This is not a very helpful guide on this function. But we always start with the official documentation and go from there. This page looks ugly, but is one of the most complete discussions of using VLOOKUP in LibreOffice. You can also find some video tutorials out there, but I think videos are tedious and not very good as references, which I generally want.
Go ahead and try to generate a new column in our data table called "offenseID" and populate it with the offense IDs that you created in a reference table that VLOOKUP uses. (The analog is matching up the letter A with a value of 90 to 100 in a final grade percent table--we are looking up offense IDs instead of letter grades). When creating your lookup table, you'll want to list all of the unique values found in the original data in one column, and give them a pretty ID number of your choosing. So your first row in this lookup table will be something like "5511(c)(1) CRUELTY TO ANIMALS" in column 1 and a number like 1000 in column 2.
When building your lookup table, you need to determine all the unique values in the OFFENSES column. Here is a discussion of some ways to do this in Libre Office calc.
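The lookup-table idea behind VLOOKUP can also be sketched in Python, where a dictionary plays the role of the key table. The offense strings and the starting ID of 1000 are illustrative assumptions.

```python
# Hypothetical values from an OFFENSES column.
offenses = [
    "5511(c)(1) CRUELTY TO ANIMALS",
    "5503(a)(1) DISORDERLY CONDUCT",
    "5511(c)(1) CRUELTY TO ANIMALS",
]

# Build the lookup table: each unique offense gets its own ID.
# sorted() makes the assignment repeatable from run to run.
lookup = {name: 1000 + i for i, name in enumerate(sorted(set(offenses)))}

# The lookup step: find each row's offense in the table and return its ID.
# .get() returns None when there is no match, much like VLOOKUP's #N/A.
offense_ids = [lookup.get(name) for name in offenses]

print(lookup)
print(offense_ids)
```

The match must be exact, which is why the cleanup steps (consistent case, no stray spaces) come before the lookup.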
Whew! With your new offense IDs generated, you're ready to start analyzing the data with filters, pivot tables, and charts.
Filter all the columns such that you can sort by unique values, and exclude values that you don't want to work with. Here's the Ahuka tutorial.
Use your pivot table skills to mush the data around to look for answers to the above inquiry questions or ones you generate on your own. Prepare to discuss these in person on Saturday.
- Which neighborhood has the highest incident frequency?
- In the neighborhood with the highest incident frequency, which offense type was the most frequent? What conclusions can we draw about severity of offense and frequency of offense?
- Are black folks more likely to be cited than white folks?
- Are younger folks more likely to be cited for a certain crime than older people? Which ones?
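The first two inquiry questions amount to a count by one field followed by a count within the top group. A minimal Python sketch, assuming made-up records and hypothetical field names `neighborhood` and `offense`:

```python
from collections import Counter

# Made-up records for illustration only; not real citation data.
records = [
    {"neighborhood": "Neighborhood A", "offense": "disorderly conduct"},
    {"neighborhood": "Neighborhood A", "offense": "public drunkenness"},
    {"neighborhood": "Neighborhood A", "offense": "disorderly conduct"},
    {"neighborhood": "Neighborhood B", "offense": "disorderly conduct"},
]

# Question 1: which neighborhood appears most often?
by_neighborhood = Counter(r["neighborhood"] for r in records)
top_neighborhood, top_count = by_neighborhood.most_common(1)[0]

# Question 2: within that neighborhood, which offense is most frequent?
by_offense = Counter(
    r["offense"] for r in records if r["neighborhood"] == top_neighborhood
)
top_offense, offense_count = by_offense.most_common(1)[0]

print(top_neighborhood, top_count)
print(top_offense, offense_count)
```

Raw counts like these do not account for how many people live in each neighborhood, so treat them as a starting point for discussion, not a conclusion.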
Week 4: Sat 10 Feb - Fri 16 Feb
Return of the spreadsheet | OpenRefine Magic
We must never underestimate the importance and value of spreadsheets as the foundation for data analysis. This lesson will review essential spreadsheet operations. We'll also introduce a tool designed for more powerful data cleaning and replacement (but weaker on analysis tools) called OpenRefine. The mid-week exercise will involve producing a small data analysis project through its lifecycle.
- Confidently implement essential spreadsheet text and numeric functions to process data for display
- Create pivot tables and pivot charts after initial processing of spreadsheet data
- Conduct basic faceting and filtering in Open Refine
- Critique a data journalism/report created by a data scientist and published online
- OpenRefine's GitHub Account with install instructions
- OpenRefine's Expression Language Reference
- The master data repo: Data is Plural Blog Archive and repo links (Make a copy into your own Google Drive account)
- Eric's data processing example: Nuclear Blast record
- Review the following data journalism piece by the local outfit: PublicSource. Data journalism example: Let's Talk About race and statistics every Pittsburgher should know. Consider the following questions:
Let's stretch our spreadsheet muscles with school-related civil rights (discipline) data located on our server here. Download this spreadsheet and follow these analysis guidelines.
Review the data guides for this CRDC data published by the Office for Civil Rights.
In your spreadsheet, copy the initial raw data table. Rename the initial tab something like "rawData". In your copied tab, delete unwanted columns to isolate the variables of interest.
Shorten column names, remove spaces, etc.
Cut out the rows to isolate the level of analysis you are targeting: county, district, or school.
Calculate a per enrolled student metric (field / total enrollment) for each of your columns of interest.
Generate inquiry questions related to the connection between school size or school type and your field of interest.
Use min/max/stdev functions to compute summary statistics for each of your fields of interest (5-6)
Create a pivot chart to help you answer your inquiry questions. Once you have isolated the data you want, generate a few charts from the pivot table that shed light on your conclusions.
Publish your results in our shared google doc here
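The per-student metric and summary statistics from the steps above can be cross-checked in Python with the standard `statistics` module. The school names, enrollment figures, and counts below are invented for illustration.

```python
import statistics

# Invented rows: (school, total enrollment, count for a field of interest).
schools = [
    ("School A", 400, 20),
    ("School B", 1000, 30),
    ("School C", 250, 25),
]

# Per enrolled student metric: field / total enrollment.
rates = {name: count / enrollment for name, enrollment, count in schools}

values = list(rates.values())
summary = {
    "min": min(values),
    "max": max(values),
    # Sample standard deviation, the usual meaning of a spreadsheet STDEV.
    "stdev": statistics.stdev(values),
}

print(rates)
print(summary)
```

Note that School B has the largest raw count but the lowest rate, which is the reason for dividing by enrollment before comparing schools of different sizes.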
- What makes the graphs presented effective? How do you feel about the "drawing" effect used in the graphics?
- How did Public Source use honest journalism principles in their article?
- This article didn't present any analysis or political opinions explicitly--but are there implicit messages given in this data? How do you feel about this approach (sharing politicized data without analysis)?
- What is the source of this data? Does knowing the source add or detract from the value of this article?
Open Refine Practice
- Acquire our CSV of nuclear explosion data from our server
- Load this data into OpenRefine
- Use the Facet functions to clean up the blast size field so it's all potentially numeric
- Use the toNumber() function to convert this field to numeric values
- Categorize the blasts based on their size into three classes: small, medium, and large blasts
- Export data to CSV and open in a spreadsheet
- Develop inquiry questions based on this data
- Follow standard analysis procedures in a spreadsheet to uncover the answers. Prepare to share.
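The cleaning and classing steps above have a direct Python analog. The sketch below assumes hypothetical raw values and arbitrary class cutoffs (under 20 and under 1000); pick thresholds that make sense for the actual data and its units.

```python
def to_number(text):
    """Convert a messy text value to float, or None if it is not numeric."""
    cleaned = text.strip().replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None

def size_class(value, small_max=20.0, medium_max=1000.0):
    """Bucket a numeric size; the cutoffs are illustrative assumptions."""
    if value is None:
        return "unknown"
    if value < small_max:
        return "small"
    if value < medium_max:
        return "medium"
    return "large"

# Hypothetical raw field values, as they might appear before cleaning.
raw_sizes = ["15", " 150 ", "1,200", "n/a"]
classes = [size_class(to_number(s)) for s in raw_sizes]
print(classes)
```

Keeping an explicit "unknown" class makes unconvertible values visible instead of silently dropping them or counting them as zero.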
- Make sure to have week 3 pivot table practice done and uploaded to the google drive file
- Conclude and make your Civil Rights data analysis presentable
NEW Products to Produce Mid-Week
- Find a "data journalism" article and prepare to share your findings: what was concluded? Was it presented with obvious bias? Are the sources of the data accessible? Write a short email to the author sharing your findings and links to the analysis google doc here.
Week 5: Sat 17 Feb - Fri 23 Feb
Visualization tools | US Census and ACS data processing
- Access US Census data through American FactFinder and navigate the download tool to extract two years of data
- Clean US Census data to isolate variables of interest in a spreadsheet and OpenRefine
- Create a database to conduct a join on the data
- Export Joined data to a spreadsheet and create basic chart visualizations of that data
- US Census data is all accessible through the American FactFinder portal located here.
- Directory listing of class files: including sample database and chart spreadsheet
- Let's begin with a discussion of a fascinating use of census data to visualize distributions of people by race in the USA. Investigate this "Racial Dot Map" tool created by researchers at the University of Virginia. Consider these discussion questions:
Browse the American Community Survey data to find two geographies of interest over two time periods in which you can investigate change in some set of variables. You'll need to make sure that you have data on the same geography level for the tables you choose in both years.
Download both tables and import them into a spreadsheet
Clean the columns by developing sensible column names and deleting the columns you don't want (probably margins of error for this practice activity). Remember--no strange columns and no spaces in variable names!
Save this file and import into OpenRefine to clean up fields. Delete records in which there is very little data. Replace no-value markers with 0 so we can use numeric functions on the fields
Export the data from Open Refine back into a spreadsheet
Copy the cleaned data into LibreOffice base and create a master table with joined data on a key column for export
Extract data from the database back into the spreadsheet for visualization
Visualize the data you've gathered and do the write-up in the shared google doc located here.
Prepare to give a short presentation on this data at the start of the next class.
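The database join step can also be prototyped with Python's built-in `sqlite3` module, which runs the same kind of SQL join you would set up in LibreOffice Base. The table names, the `geo_id` key, and the population figures are placeholders, not real ACS values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary database, nothing saved to disk
conn.execute("CREATE TABLE acs_year1 (geo_id TEXT PRIMARY KEY, population INTEGER)")
conn.execute("CREATE TABLE acs_year2 (geo_id TEXT PRIMARY KEY, population INTEGER)")

# Placeholder rows; geography "G3" exists only in the first year.
conn.executemany("INSERT INTO acs_year1 VALUES (?, ?)",
                 [("G1", 1000), ("G2", 500), ("G3", 750)])
conn.executemany("INSERT INTO acs_year2 VALUES (?, ?)",
                 [("G1", 1100), ("G2", 450)])

# Inner join on the key column: only geographies present in both years survive.
joined = conn.execute(
    """
    SELECT a.geo_id, a.population, b.population,
           b.population - a.population AS pop_change
    FROM acs_year1 AS a
    JOIN acs_year2 AS b ON a.geo_id = b.geo_id
    ORDER BY a.geo_id
    """
).fetchall()
conn.close()

for row in joined:
    print(row)
```

Rows without a match on the key drop out of an inner join, which is why the exercise asks for the same geography level in both years.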
- Where did you explore first? What were your first impressions of this data?
- What makes this an effective data visualization tool? What are its limitations?
- Explore the researchers' data portal by clicking the "what am I looking at?" link. What principles of good data analysis are exhibited?
- What additional layers of data would you like to add to this racial dot map? What conclusions or ideas would adding this data allow viewers to consider or conclude?
- Complete the in-class census data exercise and make sure your write-up is solid and presentable next week
Products to Produce
- Completed, thorough, and presentable US Census data mini-project in the google spreadsheet linked above
Week 6: Sat 24 Feb - Fri 2 March
Mapping Fundamentals | Spatial data analysis | CartoDB magic
Products to Produce
Week 7: Sat 3 March - Fri 9 March
Exposure to Python and Using python scripts | Taste of regression | Project Design
Products to Produce
Week 8 [LAST SESSION]: Sat 10 March
Share final analysis projects with client | Celebrate the promise of data analytics
Products to Produce