Create 311 data CSV files that can be accessed through a Jupyter notebook
Overview
We want to download 311 data and split it by year, then month, so each file is under 100MB and we can host an append-only data warehouse on GitHub.
Action Items
- [x] Get cleaning rules from the 311-data repo and add a link to the rules to Resources below.
- [x] Get city data
- [x] Split by year, then by month
- [x] Outline what you did to clean the data in a comment below
- [x] Create Jupyter notebook to access the data and add notes explaining the cleaning rules
- [x] Create a website (ideally ghpages) that can display the jupyter notebook so that people don't have to know how to download and install one.
Resources/Instructions
Cleaning Rules: https://github.com/hackforla/data-science/blob/main/311-data/CSV_files/Docs/CleaningRules.txt
City Data: https://data.lacity.org/browse?q=311%20data%20%2C%202024&sortBy=relevance (Please update the filter for the year 2024 based on the requirements.)
Website (ghpages): https://hackforla.github.io/311-data-jupyter-notebooks/lab (navigate to folder: 311_Data_CleaningScript)
Google Colab: Implemented an alternative using Google Colab, allowing easy execution of the notebook and direct access to raw and monthly CSV files without relying on GitHub Pages. Link to Colab notebook: https://colab.research.google.com/drive/1_HFqnSOIDqDCtF3jmslmzkZ82eho10lY?usp=sharing
https://www.google.com/search?q=jupyter+notebook+ghpages&oq=jupyter+notebook+ghpages&aqs=chrome..69i57j0i22i30j0i390i650l3j69i60.9524j0j15&sourceid=chrome&ie=UTF-8#ip=1
I made this repo for @chelseybeck to see if it's feasible to use Jupyter Notebook with ghpages: https://github.com/hackforla/jupyter-ghpages-test
I am going to create another repo for the 311 data to go into
https://discourse.jupyter.org/t/run-jupyter-notebooks-on-github-with-reporting-to-a-static-website/14982
Outline of Data Cleaning Steps
Data cleaning was essential to prepare the 311 service request data for analysis. The following steps were undertaken:
1. Removing Duplicates
- Action: Used `data.drop_duplicates(inplace=True)` to eliminate duplicate rows.
- Reason: Duplicates can lead to biased results in analysis and modeling by over-representing certain data points.
2. Identifying Missing Values
- Action: Checked for missing values using `data.isnull().sum()`.
- Reason: Knowing which columns have missing data is essential for deciding how to handle them, whether through imputation or deletion.
3. Converting Date Columns
- Action: Converted `CreatedDate`, `UpdatedDate`, `ServiceDate`, and `ClosedDate` to datetime format using `pd.to_datetime()`.
- Reason: Proper date formats are critical for any time-based analysis, such as trend analysis or date-based filtering.
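A minimal sketch of this step (the raw file path and the `errors="coerce"` option are illustrative assumptions, not taken from the actual notebook):

```python
import pandas as pd

data = pd.read_csv("Raw_csvfiles/2024.csv")  # hypothetical raw file path

date_columns = ["CreatedDate", "UpdatedDate", "ServiceDate", "ClosedDate"]

# Convert each date column to datetime; unparseable values become NaT
for col in date_columns:
    data[col] = pd.to_datetime(data[col], errors="coerce")

print(data[date_columns].dtypes)
```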
4. Analyzing Categorical Variables
- Action: Analyzed the frequency of values in `CD` & `CDMember`, and `NC` & `NCName`.
- Reason: This helps identify redundant columns or combined values, which can simplify the dataset and improve analysis accuracy.
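One way this frequency check could look in pandas (a sketch, assuming the hypothetical file path below and that the column pairs exist as named):

```python
import pandas as pd

data = pd.read_csv("Raw_csvfiles/2024.csv")  # hypothetical raw file path

# Frequency of each council district code and member name
print(data["CD"].value_counts().head())
print(data["CDMember"].value_counts().head())

# If every CD maps to exactly one CDMember (and likewise for NC / NCName),
# one column in each pair is redundant and can be dropped
print(data.groupby("CD")["CDMember"].nunique().max())
```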
5. Dropping Unnecessary Columns
- Action: Removed columns like `SRNumber`, `MobileOS`, and others using `data.drop(columns=unnecessary_columns, inplace=True)`.
- Reason: These columns were irrelevant, redundant, or unique identifiers, and dropping them simplifies the dataset and improves processing efficiency.
6. Standardizing Categorical Data
- Action: Converted categorical text columns to lowercase using `data[cat_columns] = data[cat_columns].apply(lambda x: x.str.lower())`.
- Reason: Standardization reduces inconsistencies and errors in data analysis, especially in text-based operations.
7. Handling Missing Data
- Action:
  - Dropped rows missing key geographical data.
  - Filled missing `ServiceDate` and `ClosedDate` based on `Status` and `UpdatedDate`.
- Reason: Ensures critical data is complete and logical, particularly for location-based and time-based analysis.
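A hedged sketch of what this step could look like; the specific fill rules (which geographical columns are required, and that closed requests borrow `UpdatedDate`) are assumptions for illustration, the actual rules are documented in CleaningRules.txt:

```python
import pandas as pd

data = pd.read_csv("Raw_csvfiles/2024.csv")  # hypothetical raw file path

# Assumption: Latitude/Longitude are the key geographical columns
data = data.dropna(subset=["Latitude", "Longitude"])

# Assumption: requests marked closed but missing ClosedDate are backfilled
# from UpdatedDate
closed_missing = data["Status"].str.lower().eq("closed") & data["ClosedDate"].isna()
data.loc[closed_missing, "ClosedDate"] = data.loc[closed_missing, "UpdatedDate"]

# Assumption: a missing ServiceDate falls back to CreatedDate
data["ServiceDate"] = data["ServiceDate"].fillna(data["CreatedDate"])
```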
8. Cleaning the ZipCode Column
- Action: Removed invalid entries from the `ZipCode` column.
- Reason: Ensures only valid postal codes are used, which is essential for accurate location analysis.
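As an illustration only, invalid entries could be filtered with a simple pattern check; the five-digit rule below is an assumption, the real validation is in CleaningRules.txt:

```python
import pandas as pd

# Read ZipCode as text so leading zeros and blanks survive the CSV load
data = pd.read_csv("Raw_csvfiles/2024.csv", dtype={"ZipCode": "string"})  # hypothetical path

# Assumption: a valid ZipCode is exactly five digits
valid_zip = data["ZipCode"].str.fullmatch(r"\d{5}")
data = data[valid_zip.fillna(False)]
```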
9. Saving Cleaned Data
- Action: Saved the cleaned data into monthly CSV files, grouped by `CreatedDate`.
- Reason: Organizing the data by month makes it easier to perform time-series analysis and manage large datasets.
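A minimal sketch of the monthly split (the output folder name is illustrative; assumes `CreatedDate` parses as a date):

```python
import pandas as pd
from pathlib import Path

data = pd.read_csv("Raw_csvfiles/2024.csv")  # hypothetical raw file path
data["CreatedDate"] = pd.to_datetime(data["CreatedDate"], errors="coerce")

out_dir = Path("Monthly_csvfiles")  # hypothetical output folder
out_dir.mkdir(exist_ok=True)

# One CSV per calendar month of CreatedDate, e.g. 2024-01.csv, 2024-02.csv, ...
for period, month_df in data.groupby(data["CreatedDate"].dt.to_period("M")):
    month_df.to_csv(out_dir / f"{period}.csv", index=False)
```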
@bonniewolfe: @mru-hub is asking for clarification on this issue. Do we have a GitHub page already for Hack for LA? Should she create a new page or add her work here: https://github.com/hackforla/311-data-jupyter-notebooks? She also mentioned: "We have one for our organization, which was created by Bonnie. Also, the project page in the above URL has '311-data', so I think we have one project page for our repository too. If this is true, I have to use the same URL for the current ghpage purpose."
I answered this in the data science meeting on 2024-09-16. Basically, the repository is the work for this issue, but it needs updated data files.
Started working on ghpages. Website: https://hackforla.github.io/311-data-jupyter-notebooks/lab (navigate to folder: 311_Data_CleaningScript). I've made some initial updates to the script and will continue working on integrating it for the ghpages.
The Jupyter Notebook 311-data/CSV_files/DataLoading_Script.ipynb, which has been developed and made available on GitHub, is not functioning as expected on GitHub Pages due to kernel (Pyodide)-related issues. The code needs modifications to work properly on GitHub Pages. I will collaborate with the team to investigate and implement the necessary changes to resolve the issue.
Hi @mru-hub are there any updates to this issue?
I'm currently working on resolving the kernel-related issues with DataLoading_Script.ipynb on GitHub Pages. The notebook is still not functioning as expected due to Pyodide limitations. I'm collaborating with Sophia to investigate the problem and implement a fix.
Except for this part, everything else on this ticket is done and ready to use. We'll post another update once we’ve made more progress.
@mru-hub Sofia said she would not be able to keep helping with this, so could you write in a comment what kind of support you need, so I can create an open role for someone with the specific skills / tool experience that you need.
Please provide an update by 9am PST Monday, June 16th, so that I can review and respond if you are having any blockers or need anything else. I won't be able to attend the Data Science Community of Practice this Monday because I have onboarding.
Instructions
- Progress: "What is the current status of your project? What have you completed and what is left to do?"
- Blockers: "Difficulties or errors encountered."
- Availability: "How much time will you have this week to work on this issue?"
- ETA: "When do you expect this issue to be completed?"
- Pictures (if necessary): "Add any pictures that will help illustrate what you are working on."
You can use this template
1. Progress:
2. Blockers:
3. Availability:
4. ETA:
5. Pictures (if necessary):
@ExperimentsInHonesty : As the current implementation relies on downloading raw data from public URLs, this creates issues when running the notebook through GitHub Pages or Pyodide-based environments. These environments operate entirely in-browser and are subject to browser constraints.
Issue: Public URLs may not be accessed reliably due to:
- CORS (Cross-Origin Resource Sharing) restrictions
- Browser memory limitations
As a result, downloading data from external sources fails during in-browser execution.
Proposed Workaround: To enable full functionality without relying on external downloads, I propose the following approach:
- Users manually upload the raw data file (e.g., 2024.csv) using the in-browser interface.
- The notebook is then executed in-browser to process and generate individual monthly CSV files.
- Users can update the input path in the code to reflect the uploaded file:
  `input_file = "./Raw_csvfiles/2024.csv"  # Replace with your uploaded file name`
This workaround enables the notebook to run entirely within the browser environment without dependency on external data sources. However, due to file size, users might face some performance issues.
If this approach is acceptable, I will proceed to implement the necessary code (including the 2024.csv file as an example) and documentation updates.
Alternative: Google Colab looks like a better alternative for running the notebook because it:
- Supports direct downloading of data from public URLs without CORS restrictions.
- Provides a full Python environment with access to external packages.
- Simplifies file handling and data processing.
If this sounds good, I will try executing the notebook on Colab and provide an update accordingly.
Google Colab sounds good
@mru-hub In case you didn't see my last message "Google Colab sounds good"
Please provide update
Instructions
- Progress: "What is the current status of your project? What have you completed and what is left to do?"
- Blockers: "Difficulties or errors encountered."
- Availability: "How much time will you have this week to work on this issue?"
- ETA: "When do you expect this issue to be completed?"
- Pictures (if necessary): "Add any pictures that will help illustrate what you are working on."
You can use this template
1. Progress:
2. Blockers:
3. Availability:
4. ETA:
5. Pictures (if necessary):
Update: The Google Colab–based solution has been implemented and is ready to use.
Key points:
- Notebook runs fully in Colab without CORS or browser memory issues.
- Raw data (2024.csv) and generated monthly CSV files are visible in Colab’s left-side file browser under /content/ while the session is active.
- Files in /content/ are temporary — once the session is closed, they are deleted.
- If users need to keep the files, they can download them from the file browser during the session or modify the notebook to save them to Google Drive.
Link to Colab notebook: https://colab.research.google.com/drive/1_HFqnSOIDqDCtF3jmslmzkZ82eho10lY?usp=sharing
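For the Google Drive option mentioned above, a minimal sketch using Colab's drive helper (the Drive folder and file names are assumptions for illustration):

```python
import os
import shutil
from google.colab import drive

# Mount Google Drive so saved files persist after the Colab session ends
drive.mount("/content/drive")

# Hypothetical destination folder inside the user's Drive
output_dir = "/content/drive/MyDrive/311_data"
os.makedirs(output_dir, exist_ok=True)

# Example: copy one generated monthly CSV out of the temporary /content area
# (the file name is illustrative; match it to the notebook's actual outputs)
shutil.copy("/content/2024-01.csv", os.path.join(output_dir, "2024-01.csv"))
```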
A future improvement is to add prior years as options to the Colab file.
@mru-hub we also need to add a page to the wiki and link it to this page: https://github.com/hackforla/data-science/wiki/Our-Work. Can you look at the other pages under Our Work and draft the markdown that should be added as a new page? You can just add the draft to a comment on this issue, so we can review it.
### Draft wiki page that explains the project
As discussed, extend the Colab notebook to include prior years (2020–2025) in the URLs dictionary. The user can select the desired year by assigning it to the `year` variable. The notebook could then save cleaned data both as a single CSV for the entire year and/or as separate CSVs for each month, making it fully dynamic.
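A rough sketch of how that could look; the URL values and variable names below are placeholders for illustration, not the notebook's actual contents:

```python
import pandas as pd

# Hypothetical mapping of year -> export URL on the LA open data portal
urls = {
    2020: "https://data.lacity.org/resource/EXAMPLE-2020.csv",
    2024: "https://data.lacity.org/resource/EXAMPLE-2024.csv",
    # add the remaining years 2020-2025 here
}

year = 2024  # the user selects the desired year here
data = pd.read_csv(urls[year])

# Save one CSV for the whole year, plus one per month
data.to_csv(f"{year}.csv", index=False)
data["CreatedDate"] = pd.to_datetime(data["CreatedDate"], errors="coerce")
for period, month_df in data.groupby(data["CreatedDate"].dt.to_period("M")):
    month_df.to_csv(f"{period}.csv", index=False)
```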
I’ll first review the existing pages under Our Work to understand the format and structure. Once I’m familiar with it, I’ll draft the markdown for the new page and share it here for review.
311 Data Cleaning & Hosting Project Summary
Background
The 311 service request dataset is very large and challenging to host or query in in-browser environments. To make this data more accessible and usable, the project aims to process, clean, and split it into manageable files, enabling users to work with the data efficiently.
Tools Used
- Python & Jupyter Notebook – for data cleaning and transformation
- Pandas – for processing large datasets efficiently
- Colab – for providing user access to temporary dataset files
Objective
Build a reproducible pipeline that:
- Downloads raw 311 Service Request data.
- Cleans the dataset according to standardized rules for consistency and quality.
- Splits the data by year, then by month, with each file around 100MB in size.
- Provides cleaned and split datasets via Colab for direct download by users (instead of publishing the datasets to GitHub as an append-only warehouse).
Process
1. Data Acquisition
- Downloaded 311 Service Request data from the city’s open data portal.
- Stored raw files to ensure reproducibility.
2. Data Cleaning
Data cleaning was essential to prepare the 311 service request data for analysis. The following high-level steps were performed:
- Removing duplicates: Eliminated duplicate rows to prevent biased results.
- Handling missing values: Identified missing data and addressed critical gaps, including filling or dropping values based on context.
- Converting date columns: Standardized all date fields to proper datetime format to support time-based analysis.
- Analyzing categorical variables: Reviewed key categorical fields to identify redundancies and simplify the dataset.
- Dropping unnecessary columns: Removed irrelevant or unique identifier columns to improve processing efficiency.
- Standardizing categorical data: Converted text columns to lowercase for consistency and to reduce errors in analysis.
- Cleaning geographical data: Addressed missing or invalid entries in location-related fields such as ZipCode to ensure accurate analysis.
- Saving cleaned data: Partitioned and saved the cleaned dataset into monthly files for easier time-series analysis and handling of large datasets.
3. Data Splitting
- Partitioned cleaned datasets by year, then by month.
- Organized files in a clear folder hierarchy for easy access.
4. Jupyter Notebook/ Colab Notebook
- Created a documented notebook to:
- Download raw data
- Apply cleaning rules
- Automate splitting logic
- Save outputs in the required folder structure
Notebook includes detailed notes explaining the purpose of each cleaning step.
Deliverables
- Annotated Jupyter Notebook with the full data pipeline
- Cleaned and partitioned datasets available as runtime temporary files on Colab
- Documentation of cleaning rules: https://github.com/hackforla/data-science/blob/main/311-data/CSV_files/Docs/CleaningRules.txt
@mru-hub Did you mean to format this part as
4. Jupyter Notebook/ Colab Notebook
- Created a documented notebook to:
- Download raw data
- Apply cleaning rules (notebook includes detailed notes explaining the purpose of each cleaning step).
- Automate splitting logic
- Save outputs in the required folder structure
Please add any links to the Colab files and folder to the document above, plus any formatting changes, and I'll look at this again + we will try to get a peer review (I added the issue to the peer-review column).
@mru-hub Andrew brought up a good question: should we put the Colab notebook into the GitHub directory? I think we should, along with instructions on how to load it. If you're at the next data science meeting on September 8th, you can talk about it with Andrew.
I started a conversation with ChatGPT but it started to ask questions I did not have the answer to, in order to create the instructions. https://chatgpt.com/share/68b62f0b-d6c0-8008-b0f8-7baee180179e
@mru-hub Please provide update
Instructions
- Progress: "What is the current status of your project? What have you completed and what is left to do?"
- Blockers: "Difficulties or errors encountered."
- Availability: "How much time will you have this week to work on this issue?"
- ETA: "When do you expect this issue to be completed?"
- Pictures (if necessary): "Add any pictures that will help illustrate what you are working on."
You can use this template
1. Progress:
2. Blockers:
3. Availability:
4. ETA:
5. Pictures (if necessary):
@ExperimentsInHonesty
I saw Andrew’s point about putting the Colab notebook into the GitHub repo. Right now, we already have a Jupyter notebook there that works locally, and the Colab version was meant to make it system-independent. I just want to confirm the expectation—do we also want to add the Colab notebook to the repo with instructions on how to open it, or is the local notebook enough?
Link to github repo: https://github.com/hackforla/data-science/tree/main/311-data/CSV_files
@mru-hub is going to make:
- a "how to use the Colab" draft to be added to the wiki
- a link in the readme to the "how to use" page for the repo
- a readme draft for the repository, which includes:
  - an inventory of what is in the repo
  - instructions on how to use it
  - a link to the wiki page and a brief mention:
    Please see the [X] wiki page for background on this project and details of how you can use our Google Colab notebook instead of this repo, if space is limited on your machine or you just want to preview it without forking this repo or downloading its contents.
README:
311 Data Cleaning & Hosting Project
Overview This project provides a reproducible pipeline for processing the large 311 Service Request dataset. The pipeline downloads, cleans, and splits the data into smaller, manageable files, making it easier to work with in local or in-browser environments.
For background, methodology, and details on how to use Google Colab instead of this repo, please see the Project Wiki
Repository Contents
- 311_data_cleaning.ipynb – Jupyter Notebook for acquisition, cleaning, and splitting
- CSV_files/Docs/CleaningRules.txt – Documentation of cleaning rules
- README.md – This file.
How to Use

Run Locally
- Clone this repository
  - git clone https://github.com/hackforla/data-science.git
  - cd data-science/311-data
- Open the Jupyter Notebook
  - 311-data/CSV_files
  - jupyter notebook 311_data_cleaning.ipynb
- Run the notebook
  - Download raw 311 Service Request data (provide open data URL for the desired year).
  - Apply cleaning rules by running the notebook
  - Automatically split and save the dataset into monthly files

Run in Google Colab
Use the hosted notebook (no fork or download required): Open in Colab

References
- Project Wiki
- Documentation of cleaning rules
README:
311 Data Cleaning & Hosting Project
Overview
This project provides a reproducible pipeline for processing the large 311 Service Request dataset. The pipeline downloads, cleans, and splits the data into smaller, manageable files, making it easier to work with in local or in-browser environments.
For background, methodology, and details on how to use Google Colab instead of this repo, please see the Project Wiki
Repository Contents
- 311_data_cleaning.ipynb – Jupyter Notebook for acquisition, cleaning, and splitting
- CSV_files/Docs/CleaningRules.txt – Documentation of cleaning rules
- README.md – This file.
How to Use
Run Locally
- Fork this repository
  - git clone https://github.com/hackforla/data-science.git
  - cd data-science/311-data
- Open the Jupyter Notebook
  - 311-data/CSV_files
  - jupyter notebook 311_data_cleaning.ipynb
- Run the notebook
- Download raw 311 Service Request data (provide open data URL for the desired year).
- Apply cleaning rules by running the notebook
- Automatically split and save the dataset into monthly files
OR
Run in Google Colab
Use the hosted notebook (no fork or download required):
References
- Project Wiki
- Documentation of cleaning rules
- GitHub Tutorials
- https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo
README:
311 Data Cleaning & Hosting Project
Overview
This project provides a reproducible pipeline for processing the large 311 Service Request dataset. The pipeline downloads, cleans, and splits the data into smaller, manageable files, making it easier to work with in local or in-browser environments. For background, methodology, and details on how to use Google Colab instead of this repo, please see the Project Wiki
Repository Contents
- 311_data_cleaning.ipynb – Jupyter Notebook for acquisition, cleaning, and splitting
- CSV_files/Docs/CleaningRules.txt – Documentation of cleaning rules
- README.md – This file.
How to Use
Run Locally
- Fork this repository on GitHub.
- Clone your forked copy:
  - git clone https://github.com/<your-username>/data-science.git # clone your fork
  - cd data-science/311-data # move into 311-data folder
- Open the Jupyter Notebook
  - 311-data/CSV_files
  - jupyter notebook 311_data_cleaning.ipynb
- Run the notebook
- Download raw 311 Service Request data (provide open data URL for the desired year).
- Apply cleaning rules by running the notebook
- Automatically split and save the dataset into monthly files
OR
Run in Google Colab
Use the hosted notebook (no fork or download required):
References
- Project Wiki
- Documentation of cleaning rules
- GitHub Tutorials
- https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo
Please provide update @mru-hub
Instructions
- Progress: "What is the current status of your project? What have you completed and what is left to do?"
- Blockers: "Difficulties or errors encountered."
- Availability: "How much time will you have this week to work on this issue?"
- ETA: "When do you expect this issue to be completed?"
- Pictures (if necessary): "Add any pictures that will help illustrate what you are working on."
You can use this template
1. Progress:
2. Blockers:
3. Availability:
4. ETA:
5. Pictures (if necessary):
@chinaexpert1 Please find the README file (README.md).