Create Python script to clean 2015 through 2019 data
Overview
We need to clean the 2016 through 2019 data from the city's 311 Service Request APIs so that we can access them through our Search and Filters modal.
Action Items
We want to know how to clean the data from 2015 to 2019 in order to make it consistent with our 2021-2024 data. To achieve this, complete the following:
- [ ] Review relevant code from our scripts folder (see R1)
- [ ] For each year (2015 through 2020), do the following...
- Download the dataset for that year and compare it to our 2024 data
- In a comment, answer the following questions:
    - q1: Does the dataset for this year have the same columns as the 2024 dataset? (can use `check_column_count.py` from R1)
    - q2: Does the dataset for this year contain any "problematic rows", e.g. rows that contain cells that might confuse our DuckDb interpreter? (see `inspect_csv.py` from R1; an illustrative sketch follows this list)
- Based on answers from q1 and q2, determine (if any) what steps should be taken to clean the dataset for that particular year
- [ ] If any significant differences or problem rows were identified in the previous step, write a Python script to fix those issues, creating a cleaned dataset as a byproduct
- [ ] Commit those scripts under a new folder: `311-data/scripts/clean-2015-through-2020`
- [ ] Attend a 311 Data weekly general meeting or 311 Data weekly engineering meeting to review your work with the other devs
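For reference, the sketch below illustrates the kind of q2 check described above. It is not the actual `inspect_csv.py` from R1: the filename `2016.csv` and the helper name `find_problem_rows` are hypothetical, and it only flags rows whose field count differs from the header, one common way a CSV can confuse DuckDb.

```python
import csv

def find_problem_rows(path, sample_limit=10):
    """Flag rows whose field count differs from the header -- one common
    cause of CSV parse errors in DuckDb (e.g. stray commas or quotes)."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        expected = len(header)
        # Note: this counts CSV records, not physical lines, so quoted
        # multi-line fields are handled by csv.reader itself.
        for recno, row in enumerate(reader, start=2):
            if len(row) != expected:
                problems.append((recno, len(row)))
                if len(problems) >= sample_limit:
                    break
    return expected, problems

if __name__ == "__main__":
    expected, problems = find_problem_rows("2016.csv")  # hypothetical local filename
    print(f"Header has {expected} columns")
    for recno, count in problems:
        print(f"record {recno}: {count} fields (expected {expected})")
```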
Resources/Instructions
R1: Relevant files and functions used in our build process, as well as tools used to determine where and how our datasets needed to be cleaned:
- [ ] Review `311-data/scripts/migrateOldHfDataset.py`
  - Note: this script is very similar to `311-data/scripts/updateHfDataset.py`, except that it allows you to pass in a year as an argument.
  - [ ] Review `dlData(year)`: you'll notice we are drawing data from a personal repo, see Resources to know where that data comes from
  - [ ] Review `hfClean(year)`: this method has two parts (a minimal sketch of this flow follows this list):
    - it opens a local file (e.g. 2024.csv), creates a new output file (e.g. 2024-fixed.csv), and does a string replacement on the input before writing to the output file
    - it opens a connection to DuckDb, creates a temporary table with the data from our fixed csv, then converts it to Parquet
  - Skip `hfUpload(year)`, this won't be needed for the purposes of this ticket
  - Skip `process_data(...)`, this is our main control flow for the script, which is determined by command line arguments
- [ ] Review `311-data/scripts/csv_debug_tools`
  - [ ] Review `check_column_count.py`, read the documentation at the top of the file
  - [ ] Review `inspect_csv.py`, read the documentation at the top of the file
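For orientation, here is a minimal sketch of the two-part flow described for `hfClean(year)`. It is not the actual code from `migrateOldHfDataset.py`: the string being replaced, the filenames, and the table name are placeholders, not the values the real script uses.

```python
import duckdb

def clean_and_convert(year: int) -> None:
    src = f"{year}.csv"          # hypothetical raw input filename
    fixed = f"{year}-fixed.csv"  # output of the string-replacement pass

    # Part 1: plain string replacement on the raw CSV.
    # The pattern replaced here is a placeholder; the real hfClean()
    # targets whatever string was confusing DuckDb.
    with open(src, encoding="utf-8") as fin, open(fixed, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line.replace('\\"', '""'))  # placeholder replacement

    # Part 2: load the fixed CSV into a temporary DuckDb table,
    # then write it back out as Parquet.
    con = duckdb.connect()
    con.execute(f"CREATE TEMP TABLE requests AS SELECT * FROM read_csv_auto('{fixed}')")
    con.execute(f"COPY requests TO '{year}.parquet' (FORMAT PARQUET)")
    con.close()

clean_and_convert(2024)
```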
R2: Data Sources
- the 311 Data project's official HuggingFace repo listing
- example: 311-data/2024
- Our previous dev lead kept a copy of each year's data as CSV. See: Edwin Huggingface repo listing
- Generally, data can be obtained from the City of LA's open data portal. See: listing of 311 datasets for each year.
- example: MyLA311 Service Request Data 2016 data preview
This ticket is ready to be picked up
Hi @mru-hub ,
Please leave a comment with the following items:
- updated ETA
- progress from the last week (if applicable)
- availability for communications during the week
Hi @mru-hub,
We are checking in because we haven't heard back since our last request for updates. Are you still working on this? If you need to put it down or ask for clarification, that's OK. Otherwise, please provide the following:
- updated ETA
- progress from the last week (if applicable)
- availability for communications during the week
@mru-hub, please provide an update.
Instructions
- Progress: "What is the current status of your project? What have you completed and what is left to do?"
- Blockers: "Difficulties or errors encountered."
- Availability: "How much time will you have this week to work on this issue?"
- ETA: "When do you expect this issue to be completed?"
- Pictures (if necessary): "Add any pictures that will help illustrate what you are working on."
- Progress:
- Blockers:
- Availability:
- ETA:
- Pictures (if necessary):
Hi @ryanfchase, I have a few questions:
- Should we use LA City's open data portal or Edwin's dataset on Hugging Face? It looks like Edwin modified the dataset before saving it—do we have a preference or any historical reasoning for choosing one over the other?
- Should we clean data from 2015–2019 or 2015–2020? The action items mention cleaning 2016–2019, but the steps outline reviewing data from 2015 through 2020. Can we clarify the exact range we need to process?
- For the data cleaning process, are we only handling string replacement to ensure compatibility with DuckDB, as per script 'migrateOldHfDataset.py', or do we also need to perform full data cleaning steps, such as removing duplicates, null values, and unnecessary columns? Can we clarify the exact scope of cleaning required?
Missing Info from the ticket
- "For each year (2015 through 2020), do the following..." this needs to be "2015 through 2019"
Answers:
- Use Open Data portal
- Check data 2015 through 2019
- We do not need to perform full data cleaning steps (removing dups, nulls, unnecessary columns)
Here's a comparison of columns between the 2015-2019 datasets (LA open data) and 2024:

| Year | Columns in year | Columns in 2024 | Missing vs. 2024 | Additional vs. 2024 |
|------|-----------------|-----------------|------------------|---------------------|
| 2015 | 33 | 34 | CreatedByUserOrganization | None |
| 2016 | 33 | 34 | CreatedByUserOrganization | None |
| 2017 | 33 | 34 | CreatedByUserOrganization | None |
| 2018 | 33 | 34 | CreatedByUserOrganization | None |
| 2019 | 34 | 34 | None | None |
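Based on the comparison above, the only structural difference for 2015-2018 is the missing `CreatedByUserOrganization` column. Below is a minimal sketch of how those years could be aligned with the 2024 schema; it assumes pandas is acceptable, and the filenames and the choice of an empty placeholder value are assumptions, not a decision the team has made.

```python
import pandas as pd

YEARS_MISSING_COLUMN = [2015, 2016, 2017, 2018]

# Use the 2024 header as the target column order.
target_columns = pd.read_csv("2024.csv", nrows=0).columns.tolist()

for year in YEARS_MISSING_COLUMN:
    df = pd.read_csv(f"{year}.csv", dtype=str, low_memory=False)
    # Add the column that only exists in 2024, then match 2024's column order.
    if "CreatedByUserOrganization" not in df.columns:
        df["CreatedByUserOrganization"] = pd.NA
    df = df.reindex(columns=target_columns)
    df.to_csv(f"{year}-aligned.csv", index=False)
```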
Progress: The year-wise data comparison is completed and updated in the comments as per the instructions. The next step is to finalize the data cleaning approach.
Blockers: No major blockers; resolved previous issues and progressing.
Availability: 24 hours this week.
ETA: Expected completion by March 2nd.
Pictures: N/A
Reviewed the R1 script and decided to write a comparison script for the 2015 to 2019 data. (Note: the R1 script is still useful for cleaning special strings in the 'ZipCode' column.)
The R1 script can be modified to include any additional cleaning requirements, but it has many other functions for converting data to Parquet and running a cron job, which might not be necessary here.
If further data cleaning is required, please see the Jupyter Notebook:
I reviewed check_column_count.py and inspect_csv.py and decided to go with a new script for comparing the columns.
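For context, a minimal sketch of the kind of header comparison that produces a summary like the one above might look like the following. It is not the exact script from this ticket, and the local filenames are assumptions.

```python
import pandas as pd

def compare_columns(year_csv: str, base_csv: str = "2024.csv") -> None:
    """Compare the header of a given year's CSV against the 2024 header."""
    year_cols = set(pd.read_csv(year_csv, nrows=0).columns)
    base_cols = set(pd.read_csv(base_csv, nrows=0).columns)
    print(f"Total columns in {year_csv}: {len(year_cols)}")
    print(f"Total columns in {base_csv}: {len(base_cols)}")
    print(f"Columns missing: {base_cols - year_cols or None}")
    print(f"Additional columns: {year_cols - base_cols or None}")

for year in range(2015, 2020):
    compare_columns(f"{year}.csv")  # hypothetical local filenames
```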
While one note on the script was left unaddressed, the overall work on this ticket can be considered done.
The results of the script have been summarized here: Comparison of 2015 through 2019 service request data