DAT 102: Introduction to Data Analytics
Course Schedule and Lesson Guides
Each class session has its own box:
Week 1: Sat 20 Jan 2018 - Fri 26 Jan
Data schemas, data collection, and essential analysis
Study Topics:
Learning Objectives
Learning Exercises:
Learning Resources
Western PA Regional Data Center vehicle spacing study columns:
Please note: weather, location, speed limit, street quality, surrounding features, time of day, temp deg F, method of observation
Week 2: Sat 27 Jan - Fri 2 Feb
Data table design & manipulation | Python crash-course | File conversions
Learning Objectives
Mid-week to-dos
Learning Exercises:
Learning Resources
Week 3: Sat 3 Feb - Fri 9 Feb
Back to basics with Pivot tables and basic python
Learning Objectives
- Extract inquiry questions from a dataset of interest
- Using our spreadsheet skills, generate a pivot table and pivot chart to investigate these questions
- Write python code to read in a CSV file and conduct basic processing operations on that set
- Direct the output of a python processing operation to a text file that can be read back by a spreadsheet
Learning Exercises:
- Locate a dataset of interest and brainstorm relevant questions that can be answered with a pivot table
- Explore pivot table basics in MS Excel or in Libre Office
- Answer the inquiry questions using that Pivot table and make a parallel pivot chart
- Write up your mini-analysis in this shared google doc. Use the Table of Contents on page 1 to jump down to your dedicated page. NOTE: you must be logged in to ANY google account to paste in screen shot images of your pivot tables!
- Using your own dataset, use python to do that CSV processing
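The CSV-processing objective above can be sketched with Python's standard-library csv module. This is a minimal example, not the course's official solution; the sample column names and values are invented for illustration:

```python
import csv
import io

def clean_rows(src, dst):
    """Read CSV text from src, strip stray whitespace in every field, write to dst."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        writer.writerow({k: v.strip() for k, v in row.items()})

# Demo with an in-memory sample; for a real dataset you would pass
# open("mydata.csv", newline="") and open("out.csv", "w", newline="") instead.
sample = io.StringIO("name,age\n  Ada ,36\nGrace, 45 \n")
out = io.StringIO()
clean_rows(sample, out)
print(out.getvalue())
```

Because the output is a plain CSV text file, a spreadsheet can read it straight back in, which is exactly the round trip the exercise asks for.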
Learning Resources
- Libre Office Calc version of our Allegheny County Jail census data and its pivot table
- WPRDC data set
- MS Excel Pivot Tables tutorial
- Libre Office pivot tables tutorial
- Function/Formula overview for MS Excel
Mid-week Exercises: Spreadsheet Madness
Last week's class showed that we need to strengthen our spreadsheet fundamentals. These exercises ask you to process some data in a spreadsheet using a series of functions. You'll be given links to various core skills in spreadsheets. We'll review them in class on Saturday.
Note that these steps will provide you links for resources on using functions in Libre Office Calc since this is a free and open source program anybody can download. You can find similar documentation on MS Excel with an internet search. Many of the functions have the same name across programs.
- Download a CSV version of the non-traffic citations issued in Allegheny county on the WPRDC.
- Open this file in a spreadsheet (either Libre office or Microsoft Corporation's Excel) and save the file as a native spreadsheet file (i.e. either a .ods file or .xlsx). This allows you to use the full features of the spreadsheet and preserve any pivot tables you create.
- Let's make a backup of our data so we can always go back and restart with little hassle. Do this by duplicating the tab that contains the raw data. Give the two tabs logical names so that one is clearly the master copy of the original data and the other holds the processed data.
- Take some time to browse the data, using the data dictionary as your guide. Ask yourself: what kind of questions can I ask about this data? Is it clean? Do I see fields that I can easily crunch, or do they need to be formatted? Are there data fields that I might need to adjust to be easily processable?
- This step sequence will help you develop spreadsheet skills necessary to answer these brainstormed questions, which we'll then discuss in person on Saturday:
- Which neighborhood has the highest incident frequency?
- Of the neighborhood with the highest incident frequency, which offense type was the most frequent? What conclusions can we draw about severity of offense and frequency of offense?
- Are black folks more likely to be cited than white folks?
- Are younger folks more likely to be cited for a certain crime than older people? Which ones?
- We need to do some trimming of this data, particularly the date column and our ugly offenses field. Start by generating a new column next to the existing "CITEDTIME" column. Use the LEFT() function to extract only the year, month, and day, discarding the time (for example, =LEFT(B2, 10) if the date occupies the first ten characters of cell B2 -- adjust to match the actual format).
- Now let's wrestle with the "OFFENSES" column. First, the data are ugly--some entries are all caps, some are not. The section and subsection numbers are not a consistent length, so we can't just extract them with a nice LEFT() or RIGHT() function call. Let's start by converting everything to lowercase for consistency (this way it doesn't look like some offenses are more important than others just because they are in all caps). Scan the list of text functions in Libre Office for an appropriate function to convert all that text to lowercase. Remember, you'll want to create new columns for each adjustment to the original data. We can always hide the original columns.
- Before we process the offense data, let's make sure we don't have any stray spaces at the beginning or end of our offense text. Find the name of the function that will automatically remove extra spaces from the start and end of the text in a particular cell. You get to do the searching for this one!
- This is a tricky step and it will take time and PATIENCE. It may take you up to an hour to get this working right--but it is a rite of passage in spreadsheets, since this formula is so powerful. We would have liked to be given an offense ID of consistent length that we could use for sorting and displaying. But we weren't given that data--only the ugly legal code section, subsection, etc. What's worse, those are not in a consistent format. Some have () in them, or more than one pair of (). Some have decimal places, others don't. We want to generate our own set of offense IDs and attach these IDs to the existing offense columns. We're going to do this with a special function called VLOOKUP(). The idea behind this function is that we create a key table that associates each unique offense name with an offense ID of our own that is pretty, like 1000 or 1001. VLOOKUP() will search our data column called OFFENSES and see if there is a matching value in our little lookup table. If a match is found, our own offense ID is returned as the output of the VLOOKUP() function call. This is handy, but tricky to get working correctly.
- Start by reading the documentation on VLOOKUP in the Libre Office help. It is not an especially friendly guide to this function, but we always start with the official documentation and go from there. The page looks ugly, but it is one of the most complete discussions of using VLOOKUP in Libre Office. You can also find video tutorials out there, but videos are tedious and not very good as references, which is what I generally want.
- Go ahead and try to generate a new column in our data table called "offenseID" and populate it with the offense IDs that you created in a reference table that VLOOKUP uses. (The analog is matching up the letter A with a value of 90 to 100 in a final grade percent table--we are looking up offense IDs instead of letter grades). When creating your lookup table, you'll want to list all of the unique values found in the original data in one column, and give them a pretty ID number of your choosing. So your first row in this lookup table will be something like "5511(c)(1) CRUELTY TO ANIMALS" in column 1 and a number like 1000 in column 2.
- When building your lookup table, you need to determine all the unique values in the OFFENSES column. Here is a discussion of some ways to do this in Libre Office calc.
- Whew! With your new offense IDs generated, you're ready to start analyzing the data with filters, pivot tables, and charts.
- Filter all the columns such that you can sort by unique values, and exclude values that you don't want to work with. Here's the Ahuka tutorial.
- Use your pivot table skills to mush the data around to look for answers to the above inquiry questions or ones you generate on your own. Prepare to discuss these in person on Saturday.
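For reference, the whole cleaning pipeline above -- date extraction, lowercasing, trimming, and the VLOOKUP-style ID lookup -- can be pictured in plain Python. The offense labels and ID numbers in this sketch are invented for illustration, not taken from the real dataset:

```python
# A Python sketch of the spreadsheet cleaning steps above. The offense names
# and ID numbers in this lookup table are invented for illustration.
offense_ids = {  # the "key table" that VLOOKUP() would search
    "5511(c)(1) cruelty to animals": 1000,
    "5503(a)(1) disorderly conduct": 1001,
}

def clean_row(cited_time, offense):
    date_only = cited_time[:10]         # like LEFT(cell, 10): keep YYYY-MM-DD
    tidy = offense.strip().lower()      # like TRIM() + LOWER()
    offense_id = offense_ids.get(tidy)  # like VLOOKUP(): None when no match
    return date_only, tidy, offense_id

print(clean_row("2018-01-20 14:35", "  5511(C)(1) CRUELTY TO ANIMALS "))
```

Note that the trimming and lowercasing happen before the lookup -- just as in the spreadsheet, VLOOKUP-style matching only works when both the data and the key table are in the same tidy form.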
Week 4: Sat 10 Feb - Fri 16 Feb
Return of the spreadsheet | OpenRefine Magic
We must never underestimate the importance and value of spreadsheets as the foundation for data analysis. This lesson will review essential spreadsheet operations. We'll also introduce OpenRefine, a tool designed for more powerful data cleaning and replacement (but weaker on analysis tools). The mid-week exercise will involve carrying a small data analysis project through its full lifecycle.
Learning Objectives
- Confidently implement essential spreadsheet text and numeric functions to process data for display
- Create pivot tables and pivot charts after initial processing of spreadsheet data
- Conduct basic faceting and filtering in Open Refine
- Critique a piece of data journalism created by a data scientist and published online
Learning Resources
- OpenRefine's GitHub Account with install instructions
- OpenRefine's Expression Language Reference
- The master data repo: Data is Plural Blog Archive and repo links (Make a copy into your own Google Drive account)
- Eric's data processing example: Nuclear Blast record
Learning Exercises:
- Review the following data journalism piece by the local outfit PublicSource: "Let's Talk About race and statistics every Pittsburgher should know." Consider the following questions:
- What makes the graphs presented effective? How do you feel about the "drawing" effect used in the graphics?
- How did PublicSource use honest journalism principles in their article?
- This article didn't present any analysis or political opinions explicitly--but are there implicit messages in this data? How do you feel about this approach (sharing politicized data without analysis)?
- What is the source of this data? Does knowing the source add or detract from the value of this article?
- Let's stretch our spreadsheet muscles with school-related civil rights (discipline) data located on our server here. Download this spreadsheet and follow these analysis guidelines.
- Review the data guides for this CRDC data published by the office of civil rights.
- In your spreadsheet, copy the initial raw data table. Rename the initial tab something like "rawData". In your copied tab, delete unwanted columns to isolate the variables of interest.
- Shorten column names, remove spaces, etc.
- Cut out the rows to isolate the level of analysis you are targeting: county, district, or school.
- Calculate a per enrolled student metric (field / total enrollment) for each of your columns of interest.
- Generate inquiry questions related to the connection between school size or school type and your field of interest.
- Use min/max/stdev functions to compute summary statistics for each of your fields of interest (5-6)
- Create a pivot chart to help you answer your inquiry questions. Once you have isolated the data you want, generate a few charts from the pivot table that shed light on your conclusions.
- Publish your results in our shared google doc here
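The per-student metric and summary-statistics steps above can be sketched in a few lines of Python. The school names and counts here are made up purely to show the arithmetic:

```python
import statistics

# Illustrative records: (school, total enrollment, suspensions).
# The numbers are invented for this sketch.
schools = [
    ("School A", 400, 12),
    ("School B", 950, 19),
    ("School C", 620, 31),
]

# Per-enrolled-student metric: field of interest / total enrollment.
rates = [suspensions / enrolled for _, enrolled, suspensions in schools]

print("min:", min(rates))
print("max:", max(rates))
print("stdev:", statistics.stdev(rates))
```

Dividing by enrollment is what makes schools of different sizes comparable -- the raw suspension counts alone would mostly just rank schools by size.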
Open Refine Practice
- Acquire our CSV of nuclear explosion data from our server
- Load this data into OpenRefine
- Use the Facet functions to clean up the blast size field so it's all potentially numeric
- Use the toNumber() function to convert this field to numeric values
- Categorize the blasts based on their size into three classes: small, medium, and large blasts
- Export data to CSV and open in a spreadsheet
- Develop inquiry questions based on this data
- Follow standard analysis procedures in a spreadsheet to uncover the answers. Prepare to share.
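The convert-then-categorize steps above can be pictured in Python before doing them in OpenRefine. The kiloton thresholds below are arbitrary examples, not part of the exercise:

```python
# Python sketch of the OpenRefine steps: coerce a messy text field to a
# number, then bucket it. The kiloton thresholds are arbitrary examples.
def blast_class(raw):
    try:
        kilotons = float(raw.strip())  # like OpenRefine's toNumber()
    except ValueError:
        return None                    # cell wasn't numeric after all
    if kilotons < 20:
        return "small"
    if kilotons < 1000:
        return "medium"
    return "large"

print([blast_class(v) for v in [" 15 ", "250", "5000", "unknown"]])
```

The None branch is the important part: cleaning with facets first is what keeps the conversion from silently losing rows that aren't yet numeric.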
Mid-week to-dos
- Make sure to have week 3 pivot table practice done and uploaded to the google drive file
- Conclude and make your Civil Rights data analysis presentable
NEW Products to Produce Mid-Week
- Find a "data journalism" article and prepare to share your findings: What was concluded? Was it presented with obvious bias? Are the sources of the data accessible? Write a short email to the author sharing your findings and links to the analysis google doc here.
Week 5: Sat 17 Feb - Fri 23 Feb
Visualization tools | US Census and ACS data processing
Learning Objectives
- Access US Census data through American FactFinder and navigate the download tool to extract two years of data
- Clean US Census data to isolate variables of interest in a spreadsheet and OpenRefine
- Create a database to conduct a join on the data
- Export joined data to a spreadsheet and create basic chart visualizations of that data
Learning Resources
- US Census data is all accessible through the American FactFinder portal located here.
- Directory listing of class files: including sample database and chart spreadsheet
Learning Exercises:
- Let's begin with a discussion of a fascinating use of census data to visualize distributions of people by race in the USA. Investigate this "Racial Dot Map" tool created by researchers at the University of Virginia. Consider these discussion questions:
- Where did you explore first? What were your first impressions of this data?
- What makes this an effective data visualization tool? What are its limitations?
- Explore the researcher's data portal by clicking the "what am I looking at?" link. What principles of good data analysis are exhibited?
- What additional layers of data would you like to add to this racial dot map? What conclusions or ideas would adding this data allow viewers to consider or conclude?
- Browse the American Community Survey data to find two geographies of interest over two time periods in which you can investigate change in some set of variables. You'll need to make sure that you have data on the same geography level for the tables you choose in both years.
- Download both tables and import them into a spreadsheet
- Clean the columns by developing sensible column names and deleting the columns you don't want (probably the margins of error for this practice activity). Remember--no strange column names and no spaces in variable names!
- Save this file and import into OpenRefine to clean up fields. Delete records in which there is very little data. Replace no-value markers with 0 so we can use numeric functions on the fields
- Export the data from Open Refine back into a spreadsheet
- Copy the cleaned data into LibreOffice Base and create a master table with joined data on a key column for export
- Extract data from the database back into the spreadsheet for visualization
- Visualize the data you've gathered and do the write-up in the shared google doc located here.
- Prepare to give a short presentation on this data at the start of the next class.
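Before building the join in LibreOffice Base, it can help to see the idea in miniature. This Python sketch joins two years of records on a shared key column; the GEOID keys and population values are invented for illustration:

```python
# A miniature stand-in for the LibreOffice Base join: combine two years of
# census rows on a shared GEOID key. All values here are invented.
year1 = {"42003": {"pop_2010": 1223000}, "42129": {"pop_2010": 207800}}
year2 = {"42003": {"pop_2015": 1230000}, "42129": {"pop_2015": 211000}}

joined = {
    geoid: {**year1[geoid], **year2[geoid]}
    for geoid in year1.keys() & year2.keys()  # inner join: GEOIDs in both years
}
print(joined["42003"])
```

This is why both downloads must use the same geography level: the join only produces rows for keys that appear in both years' tables.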
Mid-week to-dos
- Complete the in-class census data exercise and make sure your write-up is solid and presentable next week
Products to Produce
- Completed, thorough, and presentable US Census data mini-project in the google spreadsheet linked above
Week 6: Sat 24 Feb - Fri 2 March
Mapping Fundamentals | Spatial data analysis | QGIS Glory | CartoDB magic
Learning Objectives
- Demonstrate the essential elements of mapping: coordinate planes, projections, datums, and spatial joins
- Assemble a geospatial US Census data set using data from American FactFinder and TIGER shapefiles
- Use basic geospatial analysis tools in QGIS to answer meaningful inquiry questions
- Export map data from QGIS into CartoDB for online sharing
External Resources
- Cartography on Wikipedia: Essential concepts and interesting history on the field of mapping and geospatial analysis
- QGIS main page: The free and open source competitor to ESRI's $1000+ ArcMap software. Its documentation is extensive and amazing.
- US Census data is all accessible through the American FactFinder portal located here.
- The actual map polygons are all published through the TIGER shapefiles program
- Carto (formerly CartoDB) is a free tool for creating and publishing a variety of map types online. Great for sharing and visualizing and embedding but lacking the spatial join tools available in QGIS.
Learning Exercises:
- Explore the essential elements of map creation with the physical mapping projection activity (enter old-school overhead transparency films!)
- Check out examples of cool online maps: Eric's CRDC black/white suspension ratios dot map and CMU Students for Urban Data Systems Police jurisdiction maps and Mark Egge's bus bunching analysis and map. What principles of honest data journalism are employed? How do qualitative and quantitative data blend or not blend in these examples?
- Navigating mapping with Carto online: Visualizing and tinkering with Allegheny County Property Assessment Data accessed through the Western Pennsylvania Regional Data Center's Property Information Extractor. We'll cook up a data set to export and dump it into Carto to tinker with map basics: choropleth map analysis, layers, annotations, presentation.
- Explore QGIS as a fully featured mapping tool with support for layer building, data joining, advanced geospatial analysis, and exporting/sharing.
- Build a map of our US Census data we analyzed last week using QGIS and export up to Carto.
Map analysis in QGIS assembly sequence
- Generate compelling inquiry questions that you hope to answer (at least partially) through your analysis.
- Create a blank map in QGIS
- Assemble the "flat file" of US Census data that you'd like to map. Make sure to preserve the geographic entity id (usually called GEOID) since that's what we'll use to join the flat file data to the geospatial features.
- Download the appropriate TIGER shapefile (resource link above) and import into QGIS. Make sure the data seem reasonable--are these indeed the geospatial features you'd like? Ensure a correspondence between the year of your CENSUS data and the year the Shapefile data was published (the US Census adjusts the shapes of the geospatial features every 10 years). Do a sample lookup of a GEOID in both the flat file table and the map features to make sure the join is likely to find matches for some or all of the features you'd like to visualize data for.
- Join the flat file data to the map features using the data join process described in this tutorial
- Be sure to open your data table for the geospatial layer and make sure the join worked.
- In the symbology tab of the polygon layer choose appropriate data fields to visualize and tinker with the bin size, etc. of the coloring to make sure you're drawing out the essential data that speak to your inquiry questions.
- Export your map as an image and load into our weekly project assembly document.
- Thoughtfully explain your data analysis process, your inquiry questions, and the conclusions your data can (and cannot) support in your section.
Mid-week to-dos
- Complete the mapping sequence described above and your full, thoughtful write-up in this week's work product document.
Products to Produce
- A completed section in the work product document linked above for sharing FIRST THING next week in class.
Week 7: Sat 3 March - Fri 9 March
Exposure to Python and Using python scripts | Taste of regression | Project Design
Final Project Share Directory
Please compile all of your project-related files into a single location and upload them to your specific sub-directory in our shared drive linked below.
Required Components
- Original/raw data sets labeled as such, in their OWN directory
- Processed data files labeled appropriately
- Project outline as a PDF: This should be a document with nice headers and carefully proofed text that includes: background on your project, inquiry questions, analysis steps, conclusions, and ideas for further exploration. Include snapshots of any visualizations you use.
- Any graphs or charts or images used--properly titled.
- A work process log that is a step-by-step list of what you did to analyze your data: tools used, files moved, functions/libraries, etc.
- The source code of any scripts (such as those generated by OpenRefine) used in the project
- A file called readme.txt that lists the files in this directory and why they are included.
Week 8 [LAST SESSION]: Sat 10 March
Share final analysis projects with client | Celebrate the promise of data analytics
Learning Objectives
- Present a coherent narrative of your project's birth, your guiding questions, analysis, and conclusions
- Converse with a professional data analyst about the field at large, specific data analysis procedures and approaches, and recommendations for skill development.
Special Guest Conversant
Our class had the pleasure of hosting professional data analyst Mark Egge, a consultant at High Street Consulting, based in Pittsburgh, PA. Mr. Egge lives in Bozeman, MT.
Products to Produce
- Completed final project directory as described and linked in last week's guide