
Create Python script to clean 2015 through 2019 data

Open · ryanfchase opened this issue

Overview

We need to clean the 2016 through 2019 data from the 311 Data Service Request APIs from the city so that we can access them through our Search and Filters modal.

Action Items

We want to know how to clean the data from 2015 to 2019 in order to make it consistent with our 2021-2024 data. To achieve this, complete the following:

  • [ ] Review relevant code from our scripts folder (see R1)
  • [ ] For each year (2015 through 2020), do the following...
    1. Download the dataset for that year and compare it to our 2024 data
    2. In a comment, answer the following questions:
      • q1: Does the dataset for this year have the same columns as the 2024 dataset? (can use check_column_count.py from R1; a minimal sketch covering q1 and q2 follows this list)
      • q2: Does the dataset for this year contain any "problematic rows", e.g. rows with cells that might confuse our DuckDB interpreter (see inspect_csv.py from R1)
    3. Based on answers from q1 and q2, determine (if any) what steps should be taken to clean the dataset for that particular year
  • [ ] If any significant differences or problem rows were identified in the previous step, write a Python script to fix those issues, creating a cleaned dataset as a byproduct
    • [ ] commit those scripts under a new folder: 311-data/scripts/clean-2015-through-2020
    • [ ] attend a 311 Data weekly general meeting or 311 Data weekly engineering meeting to review your work with the other devs
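
For q1 and q2, here is a minimal sketch of the kinds of checks those questions ask for. check_column_count.py and inspect_csv.py in R1 remain the authoritative tools; the file names below (2015.csv, 2024.csv) are assumptions about how the yearly exports are saved locally.

```python
import csv

def compare_to_2024(year_csv: str, baseline_csv: str = "2024.csv") -> None:
    """q1: report column differences between a yearly export and the 2024 baseline."""
    def header(path: str) -> set[str]:
        with open(path, newline="", encoding="utf-8") as f:
            return set(next(csv.reader(f)))

    year_cols, base_cols = header(year_csv), header(baseline_csv)
    print(f"{year_csv}: {len(year_cols)} columns, {baseline_csv}: {len(base_cols)} columns")
    print("Missing vs. 2024:", base_cols - year_cols or None)
    print("Additional vs. 2024:", year_cols - base_cols or None)

def find_problematic_rows(year_csv: str) -> None:
    """q2: flag rows whose field count differs from the header, a common sign of
    stray quotes or embedded delimiters that can confuse a CSV reader."""
    with open(year_csv, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        expected = len(next(reader))
        for lineno, row in enumerate(reader, start=2):  # header was line 1
            if len(row) != expected:
                print(f"line {lineno}: expected {expected} fields, got {len(row)}")

if __name__ == "__main__":
    compare_to_2024("2015.csv")
    find_problematic_rows("2015.csv")
```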

Resources/Instructions

R1: Relevant files and functions used in our build process, as well as tools used to determine where and how our datasets needed to be cleaned:

  • [ ] Review 311-data/scripts/migrateOldHfDataset.py
    • Note: this script is very similar to 311-data/scripts/updateHfDataset.py, except that it allows you to pass in a year as an argument.
    • [ ] Review dlData(year): you'll notice we are drawing data from a personal repo, see resources to know where that data comes from
    • [ ] Review hfClean(year): this method has two parts (a sketch of this flow follows this list):
      1. it opens a local file (e.g. 2024.csv), creates a new output file (e.g. 2024-fixed.csv), and does a string replacement on the input before writing to the output file
      2. it opens a connection to DuckDB, creates a temporary table from our fixed CSV, then converts it to Parquet
    • Skip hfUpload(year), this won't be needed for the purposes of this ticket
    • Skip process_data(...), this is our main control flow for the script, which is determined by command line arguments
  • [ ] Review 311-data/scripts/csv_debug_tools
    • [ ] Review check_column_count.py and read the documentation at the top of the file
    • [ ] Review inspect_csv.py and read the documentation at the top of the file
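
For orientation, a minimal sketch of the two-part hfClean(year) flow described above. The replacement target and the file naming below are placeholders, not the actual values used in migrateOldHfDataset.py; only the shape of the flow (string-fix the CSV, then convert with DuckDB) is taken from this ticket.

```python
import duckdb

# Placeholder for the string that hfClean replaces; the real pattern lives in
# migrateOldHfDataset.py and is not reproduced here.
BAD_TOKEN = "<placeholder>"

def clean_year(year: int) -> None:
    """Sketch of hfClean's two parts: fix the raw CSV, then convert it to Parquet."""
    src = f"{year}.csv"
    fixed = f"{year}-fixed.csv"
    parquet = f"{year}.parquet"

    # Part 1: string replacement on the raw file before DuckDB reads it.
    with open(src, encoding="utf-8") as fin, open(fixed, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line.replace(BAD_TOKEN, ""))

    # Part 2: load the fixed CSV into a temporary DuckDB table, then write Parquet.
    con = duckdb.connect()
    con.execute(f"CREATE TEMP TABLE requests AS SELECT * FROM read_csv_auto('{fixed}')")
    con.execute(f"COPY requests TO '{parquet}' (FORMAT PARQUET)")
    con.close()
```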

R2: Data Sources

ryanfchase · Oct 17 '24

This ticket is ready to be picked up

ryanfchase · Oct 19 '24

Hi @mru-hub,

Please leave a comment with the following items:

  • updated ETA
  • progress from the last week (if applicable)
  • availability for communications during the week

ryanfchase · Feb 01 '25

Hi @mru-hub,

We are checking in because we haven't heard back since our last request for updates. Are you still working on this? If you need to put it down or ask for clarification, that's OK. Otherwise, please provide the following:

  • updated ETA
  • progress from the last week (if applicable)
  • availability for communications during the week

ryanfchase · Feb 09 '25

@mru-hub Please provide update

Instructions
  1. Progress: "What is the current status of your project? What have you completed and what is left to do?"
  2. Blockers: "Difficulties or errors encountered."
  3. Availability: "How much time will you have this week to work on this issue?"
  4. ETA: "When do you expect this issue to be completed?"
  5. Pictures (if necessary): "Add any pictures that will help illustrate what you are working on."
  1. Progress:
  2. Blockers:
  3. Availability:
  4. ETA:
  5. Pictures (if necessary):

ExperimentsInHonesty · Feb 14 '25

Hi @ryanfchase, I have a few questions:

  1. Should we use LA City's open data portal or Edwin's dataset on Hugging Face? It looks like Edwin modified the dataset before saving it—do we have a preference or any historical reasoning for choosing one over the other?
  2. Should we clean data from 2015–2019 or 2015–2020? The action items mention cleaning 2016–2019, but the steps outline reviewing data from 2015 through 2020. Can we clarify the exact range we need to process?
  3. For the data cleaning process, are we only handling string replacement to ensure compatibility with DuckDB, as per script 'migrateOldHfDataset.py', or do we also need to perform full data cleaning steps, such as removing duplicates, null values, and unnecessary columns? Can we clarify the exact scope of cleaning required?

mru-hub · Feb 15 '25

Missing Info from the ticket

  • "For each year (2015 through 2020), do the following..." this needs to be "2015 through 2019"

ryanfchase · Feb 15 '25

Answers:

  1. Use Open Data portal
  2. Check data 2015 through 2019
  3. We do not need to perform full data cleaning steps (removing dups, nulls, unnecessary columns)

ryanfchase · Feb 15 '25

Here's a comparison of columns between 2015-2019 (LA open data) and 2024:

| Year | Total columns | Total columns in 2024 | Missing vs. 2024 | Additional vs. 2024 |
|------|---------------|------------------------|------------------|---------------------|
| 2015 | 33 | 34 | CreatedByUserOrganization | None |
| 2016 | 33 | 34 | CreatedByUserOrganization | None |
| 2017 | 33 | 34 | CreatedByUserOrganization | None |
| 2018 | 33 | 34 | CreatedByUserOrganization | None |
| 2019 | 34 | 34 | None | None |
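
Based on these results, the only schema difference is the missing CreatedByUserOrganization column in 2015-2018. A minimal sketch of how the older files could be aligned with the 2024 schema (using pandas here, which is an assumption; the build scripts use DuckDB):

```python
import pandas as pd

def align_with_2024(year_csv: str, out_csv: str) -> None:
    """Add the one column that the 2015-2018 exports lack so every year shares
    the 2024 schema when loaded side by side."""
    df = pd.read_csv(year_csv, low_memory=False)
    if "CreatedByUserOrganization" not in df.columns:
        # Fill with nulls; the older exports simply never recorded this field.
        df["CreatedByUserOrganization"] = pd.NA
    df.to_csv(out_csv, index=False)
```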

mru-hub · Feb 15 '25

Progress: The year-wise data comparison is completed and updated in the comments as per the instructions. The next step is to finalize the data cleaning approach.

Blockers: No major blockers; resolved previous issues and progressing.

Availability: 24 hours this week.

ETA: Expected completion by March 2nd.

Pictures: N/A

mru-hub · Feb 15 '25

Reviewed the R1 script and decided to write a comparison script for the 2015 to 2019 data. (Note: the R1 script is still useful for cleaning special strings in the 'ZipCode' column.)

The R1 script can be modified to include any cleaning requirement, but it has many other functions for converting data to Parquet and running a cron job, which might not be necessary.

If the data needs further cleaning, please see the Jupyter notebook:

311-data/CSV_files/DataLoading_Script.ipynb

mru-hub · Feb 16 '25

I reviewed check_column_count.py and inspect_csv.py and decided to go with a new script for comparing the columns.

mru-hub · Feb 16 '25

While we had one note for the script that wasn't completed, the overall work on this ticket can be considered done.

The results of the script have been summarized here: Comparison of 2015 through 2019 service request data

ryanfchase · Mar 15 '25