Migrate external Huggingface data to 311 Data Huggingface repo
Overview
We need to port the 2016-2022 data into the 311-Data HF repo so that users have access to all available 311 request data.
Action Items
- [ ] create 2016-2022 repos on 311's Hugging Face repo
- [ ] scrub the 2016-2022 CSVs into parquet files using a one-time Python script (could probably be done locally; see the sketch after this list)
- [ ] move the 2016-2022 scrubbed parquet data files into 311's Hugging Face repo
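As a rough illustration of the scrub step, here is a minimal one-time conversion sketch. It assumes per-year CSVs named `2016.csv` through `2022.csv` in a local `raw_csvs/` folder; the file names, paths, and per-year scrubbing rules are assumptions, not the actual 311-Data layout.

```python
# Hypothetical one-time CSV -> parquet conversion; paths and file names are assumptions.
from pathlib import Path

import pandas as pd

RAW_DIR = Path("raw_csvs")      # assumed location of the downloaded per-year CSVs
OUT_DIR = Path("parquet_out")   # where the scrubbed parquet files will be written
OUT_DIR.mkdir(exist_ok=True)

for year in range(2016, 2023):
    csv_path = RAW_DIR / f"{year}.csv"
    df = pd.read_csv(csv_path, low_memory=False)
    # Per-year scrubbing (column renames, dropping corrupt rows) would go here.
    df.to_parquet(OUT_DIR / f"{year}.parquet", index=False)
    print(f"{year}: wrote {len(df)} rows")
```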
Resources/Instructions
- Edwin's repo: https://huggingface.co/edwinjue
- 311 Data repo: https://huggingface.co/311-data
- Huggingface token: ask @ryanfchase or @Skydodle
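For the upload step (third action item above), a minimal sketch using `huggingface_hub` might look like the following; the dataset repo id, local folder, and token handling are assumptions and would need to match however the 311-Data org is actually organized.

```python
# Hypothetical upload of the scrubbed parquet files; repo_id and paths are assumptions.
from pathlib import Path

from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # token available from @ryanfchase or @Skydodle

for parquet_path in sorted(Path("parquet_out").glob("*.parquet")):
    api.upload_file(
        path_or_fileobj=str(parquet_path),
        path_in_repo=parquet_path.name,
        repo_id="311-data/requests",  # placeholder dataset repo id
        repo_type="dataset",
    )
    print(f"uploaded {parquet_path.name}")
```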
This ticket is ready to be picked up
ETA: Sunday 5/19. Availability: F/Sat/Sun 6-9pm
Updating the ETA to Sunday 6/1. Availability: F/Sat/Sun 6-9pm
Added a PR that enables 2022 data for now. Waiting for reviews to make sure there are no issues before continuing to add the other years with the same implementation.
The most recent PR only enables 2022 data; reopening this issue to continue migrating the older years.
The PR is approved; we're OK to shelve this ticket until we decide we need even earlier data.
@Skydodle I just wanted to get a paper trail on our reasoning for fully closing this. Is it correct that data from 2019 and prior would require a serious amount of cleaning in order to integrate smoothly with the 2020-2024 data? Could you outline some of the technical hurdles you encountered when looking at those datasets?
@ryanfchase
- There were some structural changes prior to 2020. For example, the CSV column names and values may differ from what we have right now. We can't change how we extract and apply mutations to the data with our current FE setup, because that would break how we display recent years' data; therefore, for prior years we need to examine each file case by case and transform the abnormal columns into the form we currently accept (see the sketch after this list).
- Data corruption: some columns may be corrupt or missing values. For example, in the 2021 data an entire column was missing values in both the file from Edwin's HF and the source file from the LA data site.
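To make the first point concrete, here is a minimal sketch of the kind of per-year normalization that would be needed, assuming a hypothetical column mapping (the names below are illustrative, not the real 311 schema):

```python
# Hypothetical normalization of a pre-2020 CSV to the current column names.
import pandas as pd

# Illustrative mapping of assumed legacy names to the names the FE expects today.
COLUMN_MAP_2019 = {
    "ServiceRequestNumber": "SRNumber",
    "DateCreated": "CreatedDate",
    "RequestType": "RequestType",
}

def normalize_2019(df: pd.DataFrame) -> pd.DataFrame:
    """Rename legacy columns and keep only the columns the current FE accepts."""
    df = df.rename(columns=COLUMN_MAP_2019)
    expected = list(COLUMN_MAP_2019.values())
    missing = [col for col in expected if col not in df.columns]
    if missing:
        raise ValueError(f"missing expected columns after rename: {missing}")
    return df[expected]
```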
What would consume the most time is that anomalies in the CSVs would most likely not be detected until the data has been transformed to parquet, uploaded to 311's HF, and configured to display on the UI; only then would we see some data displaying incorrectly or not at all, and we'd have to backtrack, make the correction, and redo the entire process.
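One way to catch some of these anomalies before the full transform/upload/display cycle would be a quick pre-flight check on each CSV. This is a sketch, assuming entirely empty columns (like the 2021 case) are the main thing to flag:

```python
# Hypothetical pre-flight check: flag columns with no values at all before converting.
import pandas as pd

def report_empty_columns(csv_path: str) -> list[str]:
    """Return the names of columns that contain no non-null values."""
    df = pd.read_csv(csv_path, low_memory=False)
    empty = [col for col in df.columns if df[col].isna().all()]
    if empty:
        print(f"{csv_path}: entirely empty columns -> {empty}")
    return empty
```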
I've created some tools for debugging in PR #1747