Migrate external Huggingface data to 311 Data Huggingface repo
Overview
We need to port the 2016-2022 data into the 311-Data HF repo so that users have access to all available 311 request data.
Action Items
- [ ] create 2016-2022 repos on 311's Hugging Face repo
- [ ] scrub the 2016-2022 CSVs into parquet files using a one-time Python script (could probably be done locally; see the sketch after this list)
- [ ] move the 2016-2022 scrubbed parquet data files into 311's Hugging Face repo
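As a rough illustration of the scrub step, here is a minimal one-time conversion sketch. It assumes per-year CSVs named `2016.csv` through `2022.csv` in a local `raw_csvs/` folder; the file names, paths, and per-year scrubbing rules are assumptions, not the actual 311-Data layout.

```python
# Hypothetical one-time CSV -> parquet conversion; paths and file names are assumptions.
from pathlib import Path

import pandas as pd

RAW_DIR = Path("raw_csvs")      # assumed location of the downloaded per-year CSVs
OUT_DIR = Path("parquet_out")   # where the scrubbed parquet files will be written
OUT_DIR.mkdir(exist_ok=True)

for year in range(2016, 2023):
    csv_path = RAW_DIR / f"{year}.csv"
    df = pd.read_csv(csv_path, low_memory=False)
    # Per-year scrubbing (column renames, dropping corrupt rows) would go here.
    df.to_parquet(OUT_DIR / f"{year}.parquet", index=False)
    print(f"{year}: wrote {len(df)} rows")
```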
Resources/Instructions
- Edwin's repo: https://huggingface.co/edwinjue
- 311 Data repo: https://huggingface.co/311-data
- Huggingface token: ask @ryanfchase or @Skydodle
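For the upload step (third action item above), a minimal sketch using `huggingface_hub` might look like the following; the dataset repo id, local folder, and token handling are assumptions and would need to match however the 311-Data org is actually organized.

```python
# Hypothetical upload of the scrubbed parquet files; repo_id and paths are assumptions.
from pathlib import Path

from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # token available from @ryanfchase or @Skydodle

for parquet_path in sorted(Path("parquet_out").glob("*.parquet")):
    api.upload_file(
        path_or_fileobj=str(parquet_path),
        path_in_repo=parquet_path.name,
        repo_id="311-data/requests",  # placeholder dataset repo id
        repo_type="dataset",
    )
    print(f"uploaded {parquet_path.name}")
```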
This ticket is ready to be picked up
ETA: Sunday 5/19. Availability: F/Sat/Sun 6-9pm
Updating the ETA to Sunday 6/1. Availability: F/Sat/Sun 6-9pm
Added a PR that enables 2022 data for now. Waiting for reviews to make sure there are no issues before continuing to add the other years with the same implementation.
The most recent PR only enables 2022 data; reopening this issue to continue migrating the older years.
The PR is approved; we're OK to shelve this ticket until we decide we need even earlier data.
@Skydodle I just wanted to get a paper trail on our reasoning for fully closing this. Is it correct that data from 2019 and prior would require a serious amount of cleaning in order to integrate smoothly with the 2020-2024 data? Could you outline some of the technical hurdles you encountered when looking at those datasets?
@ryanfchase
- There were some structural changes prior to 2020. For example, the CSV column names and values may differ from what we have right now. We can't change how we extract and apply mutations to the data with our current FE setup, because that would break how we display recent years' data; therefore, for prior years we need to examine each file case by case and transform the abnormal columns into the form we currently accept (see the sketch after this list).
- Data corruption: some columns may be corrupt or missing values. For example, in the 2021 data an entire column was missing values in both the file from Edwin's HF and the source file from the LA data site.
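To make the first point concrete, here is a minimal sketch of the kind of per-year normalization that would be needed, assuming a hypothetical column mapping (the names below are illustrative, not the real 311 schema):

```python
# Hypothetical normalization of a pre-2020 CSV to the current column names.
import pandas as pd

# Illustrative mapping of assumed legacy names to the names the FE expects today.
COLUMN_MAP_2019 = {
    "ServiceRequestNumber": "SRNumber",
    "DateCreated": "CreatedDate",
    "RequestType": "RequestType",
}

def normalize_2019(df: pd.DataFrame) -> pd.DataFrame:
    """Rename legacy columns and keep only the columns the current FE accepts."""
    df = df.rename(columns=COLUMN_MAP_2019)
    expected = list(COLUMN_MAP_2019.values())
    missing = [col for col in expected if col not in df.columns]
    if missing:
        raise ValueError(f"missing expected columns after rename: {missing}")
    return df[expected]
```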
What would consume the most time is that anomalies in the CSVs would most likely not be detected until the data has been transformed to parquet, uploaded to 311's HF, and configured to display on the UI; only then would we see some data displaying incorrectly or not at all, and we'd have to backtrack, make the correction, and redo the entire process.
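One way to catch some of these anomalies before the full transform/upload/display cycle would be a quick pre-flight check on each CSV. This is a sketch, assuming entirely empty columns (like the 2021 case) are the main thing to flag:

```python
# Hypothetical pre-flight check: flag columns with no values at all before converting.
import pandas as pd

def report_empty_columns(csv_path: str) -> list[str]:
    """Return the names of columns that contain no non-null values."""
    df = pd.read_csv(csv_path, low_memory=False)
    empty = [col for col in df.columns if df[col].isna().all()]
    if empty:
        print(f"{csv_path}: entirely empty columns -> {empty}")
    return empty
```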
I've created some tools for debugging in PR #1747