We began this week by contacting the Career Center, Alumni Relations, the Admissions Office, the Dean of Students, Gould Library, and the Institutional Research & Assessment Office. While we heard back from many of the offices, they unfortunately did not have the data we were looking for, although many of them expressed support for the project.
Since we were not able to obtain the data from the administration, we had to design ways of collecting it on our own. We scraped the Alumni Directory by job field and exported the data to a CSV file, capturing locations, majors, class years, and industries. We also ran Optical Character Recognition (OCR) on the Zoobooks from 1970 to 2020 and exported that data to a CSV file. This gave us our two datasets: one on alumni locations and one on student origins.
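For readers curious about the export step, here is a minimal sketch of writing scraped records out to CSV with Python's standard library. The field names and sample records are illustrative assumptions, not the directory's actual schema.

```python
import csv

# Hypothetical records as they might look after parsing the directory pages;
# names, fields, and values here are made up for illustration.
records = [
    {"name": "A. Carl", "class_year": "1998", "major": "Biology",
     "location": "Saint Paul, MN 55105, USA", "industry": "Health Care"},
    {"name": "B. Carl", "class_year": "2010", "major": "Economics",
     "location": "Seattle, WA 98101, USA", "industry": "Technology"},
]

def write_records(records, path):
    """Write scraped records to a CSV file with a fixed column order."""
    fieldnames = ["name", "class_year", "major", "location", "industry"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)

write_records(records, "alumni_locations.csv")
```

Keeping a fixed column order makes later cleaning steps (in OpenRefine or Excel) much more predictable.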
Even though we were able to gather the data, it was extremely messy, and we had to do extensive cleaning using OpenRefine, Python, and manual edits in Excel. For each dataset, we had to delete rows that had not been processed correctly or were missing information; separate the geographic locations into city, state, zip code, and country; and reinsert spaces into multi-word city names that had been run together. We are currently geocoding the data in Google Sheets so that we can map the locations by their coordinates in ArcGIS.
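Two of the cleaning steps above can be sketched in a few lines of Python. This is a simplified sketch, not our actual cleaning script: it assumes locations arrive in a "City, ST 55105, USA" shape and that run-together city names follow a lowercase-then-uppercase pattern.

```python
import re

def split_location(raw):
    """Split 'City, ST 55105, USA' into (city, state, zip, country).

    Returns None for rows that don't match the expected shape, so they
    can be dropped as incorrectly processed.
    """
    m = re.match(r"^(.+?),\s*([A-Z]{2})\s+(\d{5}),\s*(.+)$", raw.strip())
    if not m:
        return None
    return m.groups()

def add_spaces(city):
    """Reinsert spaces into run-together names, e.g. 'SaintPaul' -> 'Saint Paul'."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", city)
```

A row like "SaintPaul, MN 55105, USA" would split cleanly and then have its city name repaired, while a garbled OCR row that doesn't match the pattern would be flagged for deletion.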
Regarding the problems we have faced, the speed at which we can gather, clean, and analyze the data has been limited by a lack of replies from the relevant offices; as a result, we have relied more heavily on scraped data, such as the Alumni Directory. These data come with their own challenges: we are left with an incomplete sample of alumni and are ultimately unable to complete as comprehensive an analysis as we would like. Another issue concerns mapping the data points we have extracted. In particular, it is not possible to geocode our zip codes into coordinates that can be mapped cohesively in the mapping software we are using. Our proposed solution is to split the current data into two CSV files: one for records with extracted coordinates and one for records with only zip codes. We would then import these files into ArcGIS separately and merge them into a single layer.
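The proposed split into two CSV files could look something like the sketch below. The column names ("latitude", "longitude", "zip") are assumptions about our cleaned schema, and the demo rows are invented for illustration.

```python
import csv

def split_by_coordinates(in_path, coords_path, zips_path):
    """Route each row to one of two files: rows with latitude/longitude
    go to coords_path; rows with only a zip code go to zips_path."""
    with open(in_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        with open(coords_path, "w", newline="", encoding="utf-8") as cf, \
             open(zips_path, "w", newline="", encoding="utf-8") as zf:
            cw = csv.DictWriter(cf, fieldnames=reader.fieldnames)
            zw = csv.DictWriter(zf, fieldnames=reader.fieldnames)
            cw.writeheader()
            zw.writeheader()
            for row in reader:
                if row.get("latitude") and row.get("longitude"):
                    cw.writerow(row)
                else:
                    zw.writerow(row)

# Tiny demo input; the records are made up.
with open("alumni_all.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["name", "latitude", "longitude", "zip"])
    w.writerow(["A. Carl", "44.9537", "-93.0900", ""])
    w.writerow(["B. Carl", "", "", "98101"])

split_by_coordinates("alumni_all.csv", "with_coords.csv", "zips_only.csv")
```

The two output files could then be imported into ArcGIS separately (one by coordinates, one by zip code) and merged into a single layer.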
The project is currently on track, with only minor changes to the plan. Our updated timeline of deliverables: geocoding and importing CSV files into ArcGIS by February 26th; visualizing the data by March 3rd; compiling the story map by March 6th; and completing the presentation by March 8th. Visualization will be done in ArcGIS, Palladio, and Flourish.
My blog post
Here is a link to my personal blog post!
https://blog.enriquezd.sites.carleton.edu/digital-humanities/final-project-first-weeks-progress/
Here’s my personal post: https://hh2021.amason.sites.carleton.edu/uncategorized/final-project-week-8-update/
Here’s my blog post: http://brindleg.sites.carleton.edu/hacking-the-humanities/8th-week-my-final-project-contributions/
Personal blog post: https://dgah110.rayxu.sites.carleton.edu/final-project/final-project-update/
Team Carls Around the World,
You all have been making great progress on the data acquisition and cleaning side, and from what I've seen in Slack you are also well on your way to great visualizations. This post does a good job of outlining your initial goals and the places where access or data quality forced compromises, which you will definitely want to elaborate on in your final project website. Addressing the missing data and acknowledging how it might have altered your interpretations and conclusions will be a key aspect of the final analysis.
Your presentation website already has a clean theme and some containers to hold the content, and I look forward to seeing it start to fill in.
Personal blog post (from last week) https://hh2021.amason.sites.carleton.edu/uncategorized/final-project-personal-update/