Create 311 data CSV files that can be accessed through a Jupyter notebook
Overview
We want to download 311 data and split it by year, then month, so each file is under 100MB and we can host an append-only data warehouse on GitHub.
Action Items
- [x] Get cleaning rules from the 311-data repo and add a link to the rules to Resources below.
- [x] Get city data
- [x] Split by year, then by month
- [x] Outline what you did to clean the data in a comment below
- [x] Create Jupyter notebook to access the data and add notes explaining the cleaning rules
- [x] Create a website (ideally ghpages) that can display the jupyter notebook so that people don't have to know how to download and install one.
Resources/Instructions
Cleaning Rules: https://github.com/hackforla/data-science/blob/main/311-data/CSV_files/Docs/CleaningRules.txt
City Data: https://data.lacity.org/browse?q=311%20data%20%2C%202024&sortBy=relevance (Please update the filter for the year 2024 based on the requirements.)
Website (ghpages): https://hackforla.github.io/311-data-jupyter-notebooks/lab (navigate to folder: 311_Data_CleaningScript)
Google Colab: Implemented an alternative using Google Colab, allowing easy execution of the notebook and direct access to raw and monthly CSV files without relying on GitHub Pages. Link to Colab notebook: https://colab.research.google.com/drive/1_HFqnSOIDqDCtF3jmslmzkZ82eho10lY?usp=sharing
https://www.google.com/search?q=jupyter+notebook+ghpages&oq=jupyter+notebook+ghpages&aqs=chrome..69i57j0i22i30j0i390i650l3j69i60.9524j0j15&sourceid=chrome&ie=UTF-8#ip=1
I made this repo for @chelseybeck to see if it's feasible to use Jupyter Notebook with ghpages: https://github.com/hackforla/jupyter-ghpages-test
I am going to create another repo for the 311 data to go into
https://discourse.jupyter.org/t/run-jupyter-notebooks-on-github-with-reporting-to-a-static-website/14982
Outline of Data Cleaning Steps
Data cleaning was essential to prepare the 311 service request data for analysis. The following steps were undertaken:
1. Removing Duplicates
- Action: Used `data.drop_duplicates(inplace=True)` to eliminate duplicate rows.
- Reason: Duplicates can lead to biased results in analysis and modeling by over-representing certain data points.
2. Identifying Missing Values
- Action: Checked for missing values using `data.isnull().sum()`.
- Reason: Knowing which columns have missing data is essential for deciding how to handle them, whether through imputation or deletion.
3. Converting Date Columns
- Action: Converted `CreatedDate`, `UpdatedDate`, `ServiceDate`, and `ClosedDate` to datetime format using `pd.to_datetime()`.
- Reason: Proper date formats are critical for any time-based analysis, such as trend analysis or date-based filtering.
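A minimal sketch of this step (the raw file path and the `errors="coerce"` option are illustrative assumptions, not taken from the actual notebook):

```python
import pandas as pd

data = pd.read_csv("Raw_csvfiles/2024.csv")  # hypothetical raw file path

date_columns = ["CreatedDate", "UpdatedDate", "ServiceDate", "ClosedDate"]

# Convert each date column to datetime; unparseable values become NaT
for col in date_columns:
    data[col] = pd.to_datetime(data[col], errors="coerce")

print(data[date_columns].dtypes)
```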
4. Analyzing Categorical Variables
- Action: Analyzed the frequency of values in `CD` & `CDMember`, and `NC` & `NCName`.
- Reason: This helps identify redundant columns or combined values, which can simplify the dataset and improve analysis accuracy.
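One way this frequency check could look in pandas (a sketch, assuming the hypothetical file path below and that the column pairs exist as named):

```python
import pandas as pd

data = pd.read_csv("Raw_csvfiles/2024.csv")  # hypothetical raw file path

# Frequency of each council district code and member name
print(data["CD"].value_counts().head())
print(data["CDMember"].value_counts().head())

# If every CD maps to exactly one CDMember (and likewise for NC / NCName),
# one column in each pair is redundant and can be dropped
print(data.groupby("CD")["CDMember"].nunique().max())
```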
5. Dropping Unnecessary Columns
- Action: Removed columns like `SRNumber`, `MobileOS`, and others using `data.drop(columns=unnecessary_columns, inplace=True)`.
- Reason: These columns were irrelevant, redundant, or unique identifiers, and dropping them simplifies the dataset and improves processing efficiency.
6. Standardizing Categorical Data
- Action: Converted categorical text columns to lowercase using `data[cat_columns] = data[cat_columns].apply(lambda x: x.str.lower())`.
- Reason: Standardization reduces inconsistencies and errors in data analysis, especially in text-based operations.
7. Handling Missing Data
- Action:
  - Dropped rows missing key geographical data.
  - Filled missing `ServiceDate` and `ClosedDate` based on `Status` and `UpdatedDate`.
- Reason: Ensures critical data is complete and logical, particularly for location-based and time-based analysis.
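A hedged sketch of what this step could look like; the specific fill rules (which geographical columns are required, and that closed requests borrow `UpdatedDate`) are assumptions for illustration, the actual rules are documented in CleaningRules.txt:

```python
import pandas as pd

data = pd.read_csv("Raw_csvfiles/2024.csv")  # hypothetical raw file path

# Assumption: Latitude/Longitude are the key geographical columns
data = data.dropna(subset=["Latitude", "Longitude"])

# Assumption: requests marked closed but missing ClosedDate are backfilled
# from UpdatedDate
closed_missing = data["Status"].str.lower().eq("closed") & data["ClosedDate"].isna()
data.loc[closed_missing, "ClosedDate"] = data.loc[closed_missing, "UpdatedDate"]

# Assumption: a missing ServiceDate falls back to CreatedDate
data["ServiceDate"] = data["ServiceDate"].fillna(data["CreatedDate"])
```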
8. Cleaning the ZipCode Column
- Action: Removed invalid entries from the `ZipCode` column.
- Reason: Ensures only valid postal codes are used, which is essential for accurate location analysis.
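As an illustration only, invalid entries could be filtered with a simple pattern check; the five-digit rule below is an assumption, the real validation is in CleaningRules.txt:

```python
import pandas as pd

# Read ZipCode as text so leading zeros and blanks survive the CSV load
data = pd.read_csv("Raw_csvfiles/2024.csv", dtype={"ZipCode": "string"})  # hypothetical path

# Assumption: a valid ZipCode is exactly five digits
valid_zip = data["ZipCode"].str.fullmatch(r"\d{5}")
data = data[valid_zip.fillna(False)]
```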
9. Saving Cleaned Data
- Action: Saved the cleaned data into monthly CSV files, grouped by `CreatedDate`.
- Reason: Organizing the data by month makes it easier to perform time-series analysis and manage large datasets.
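A minimal sketch of the monthly split (the output folder name is illustrative; assumes `CreatedDate` parses as a date):

```python
import pandas as pd
from pathlib import Path

data = pd.read_csv("Raw_csvfiles/2024.csv")  # hypothetical raw file path
data["CreatedDate"] = pd.to_datetime(data["CreatedDate"], errors="coerce")

out_dir = Path("Monthly_csvfiles")  # hypothetical output folder
out_dir.mkdir(exist_ok=True)

# One CSV per calendar month of CreatedDate, e.g. 2024-01.csv, 2024-02.csv, ...
for period, month_df in data.groupby(data["CreatedDate"].dt.to_period("M")):
    month_df.to_csv(out_dir / f"{period}.csv", index=False)
```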
@bonniewolfe: @mru-hub is asking for clarification on this issue. Do we have a GitHub page already for Hack for LA? Should she create a new page or add her work here: https://github.com/hackforla/311-data-jupyter-notebooks? She also mentioned: "We have one for our organization, which was created by Bonnie. Also, the project page in the above URL has '311-data', so I think we have one project page for our repository too. If this is true, I have to use the same URL for the current ghpage purpose."
I answered this in the data science meeting on 2024-09-16. Basically, the repository is the work for this issue, but it needs updated data files.
Started working on ghpages. Website: https://hackforla.github.io/311-data-jupyter-notebooks/lab (navigate to folder: 311_Data_CleaningScript). I've made some initial updates to the script and will continue working on integrating it for the ghpages.
The Jupyter Notebook 311-data/CSV_files/DataLoading_Script.ipynb, which has been developed and made available on GitHub, is not functioning as expected on GitHub Pages due to kernel (Pyodide)-related issues. The code needs modifications to work properly on GitHub Pages. I will collaborate with the team to investigate and implement the necessary changes to resolve the issue.
Hi @mru-hub are there any updates to this issue?
I'm currently working on resolving the kernel-related issues with DataLoading_Script.ipynb on GitHub Pages. The notebook is still not functioning as expected due to Pyodide limitations. I'm collaborating with Sophia to investigate the problem and implement a fix.
Except for this part, everything else on this ticket is done and ready to use. We'll post another update once we’ve made more progress.
@mru-hub Sofia said she would not be able to keep helping with this, so could you write in a comment what kind of support you need, so I can create an open role for someone with the specific skills / tool experience that you need.
Please provide an update by 9am PST Monday, June 16th, so that I can review and respond if you are having any blockers or need anything else. I won't be able to attend the Data Science Community of Practice this Monday because I have onboarding.
Instructions
- Progress: "What is the current status of your project? What have you completed and what is left to do?"
- Blockers: "Difficulties or errors encountered."
- Availability: "How much time will you have this week to work on this issue?"
- ETA: "When do you expect this issue to be completed?"
- Pictures (if necessary): "Add any pictures that will help illustrate what you are working on."
You can use this template
1. Progress:
2. Blockers:
3. Availability:
4. ETA:
5. Pictures (if necessary):
@ExperimentsInHonesty : As the current implementation relies on downloading raw data from public URLs, this creates issues when running the notebook through GitHub Pages or Pyodide-based environments. These environments operate entirely in-browser and are subject to browser constraints.
Issue: Public URLs may not be accessed reliably due to:
- CORS (Cross-Origin Resource Sharing) restrictions
- Browser memory limitations
As a result, downloading data from external sources fails during in-browser execution.
Proposed Workaround: To enable full functionality without relying on external downloads, I propose the following approach:
- Users manually upload the raw data file (e.g., 2024.csv) using the in-browser interface.
- The notebook is then executed in-browser to process and generate individual monthly CSV files.
- Users can update the input path in the code to reflect the uploaded file:
  `input_file = "./Raw_csvfiles/2024.csv"  # Replace with your uploaded file name`
This workaround enables the notebook to run entirely within the browser environment without dependency on external data sources. However, due to file size, users might face some performance issues.
If this approach is acceptable, I will proceed to implement the necessary code (including the 2024.csv file as an example) and documentation updates.
Alternative: Google Colab looks like a better alternative for running the notebook because it:
- Supports direct downloading of data from public URLs without CORS restrictions.
- Provides a full Python environment with access to external packages.
- Simplifies file handling and data processing.
If this sounds good, I will try executing the notebook on Colab and provide an update accordingly.
Google Colab sounds good
@mru-hub In case you didn't see my last message "Google Colab sounds good"
Please provide update
Instructions
- Progress: "What is the current status of your project? What have you completed and what is left to do?"
- Blockers: "Difficulties or errors encountered."
- Availability: "How much time will you have this week to work on this issue?"
- ETA: "When do you expect this issue to be completed?"
- Pictures (if necessary): "Add any pictures that will help illustrate what you are working on."
You can use this template
1. Progress:
2. Blockers:
3. Availability:
4. ETA:
5. Pictures (if necessary):
Update: The Google Colab–based solution has been implemented and is ready to use.
Key points:
- Notebook runs fully in Colab without CORS or browser memory issues.
- Raw data (2024.csv) and generated monthly CSV files are visible in Colab’s left-side file browser under /content/ while the session is active.
- Files in /content/ are temporary — once the session is closed, they are deleted.
- If users need to keep the files, they can download them from the file browser during the session or modify the notebook to save them to Google Drive.
Link to Colab notebook: https://colab.research.google.com/drive/1_HFqnSOIDqDCtF3jmslmzkZ82eho10lY?usp=sharing
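For the Google Drive option mentioned above, a minimal sketch using Colab's drive helper (the Drive folder and file names are assumptions for illustration):

```python
import os
import shutil
from google.colab import drive

# Mount Google Drive so saved files persist after the Colab session ends
drive.mount("/content/drive")

# Hypothetical destination folder inside the user's Drive
output_dir = "/content/drive/MyDrive/311_data"
os.makedirs(output_dir, exist_ok=True)

# Example: copy one generated monthly CSV out of the temporary /content area
# (the file name is illustrative; match it to the notebook's actual outputs)
shutil.copy("/content/2024-01.csv", os.path.join(output_dir, "2024-01.csv"))
```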
A future improvement is to add prior years as options to the Colab file.
@mru-hub we also need to add a page to the wiki and link it to this page: https://github.com/hackforla/data-science/wiki/Our-Work. Can you look at the other pages under Our Work and draft the markdown that should be added as a new page? You can just add the draft to a comment on this issue, so we can review it.
### Draft wiki page that explains the project
As discussed, extend the Colab notebook to include prior years (2020–2025) in the URLs dictionary. The user can select the desired year by assigning it to the `year` variable. The notebook could then save cleaned data both as a single CSV for the entire year and/or as separate CSVs for each month, making it fully dynamic.
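A rough sketch of how that could look; the URL values and variable names below are placeholders for illustration, not the notebook's actual contents:

```python
import pandas as pd

# Hypothetical mapping of year -> export URL on the LA open data portal
urls = {
    2020: "https://data.lacity.org/resource/EXAMPLE-2020.csv",
    2024: "https://data.lacity.org/resource/EXAMPLE-2024.csv",
    # add the remaining years 2020-2025 here
}

year = 2024  # the user selects the desired year here
data = pd.read_csv(urls[year])

# Save one CSV for the whole year, plus one per month
data.to_csv(f"{year}.csv", index=False)
data["CreatedDate"] = pd.to_datetime(data["CreatedDate"], errors="coerce")
for period, month_df in data.groupby(data["CreatedDate"].dt.to_period("M")):
    month_df.to_csv(f"{period}.csv", index=False)
```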
I’ll first review the existing pages under Our Work to understand the format and structure. Once I’m familiar with it, I’ll draft the markdown for the new page and share it here for review.
311 Data Cleaning & Hosting Project Summary
Background
The 311 service request dataset is very large and challenging to host or query in in-browser environments. To make this data more accessible and usable, the project aims to process, clean, and split it into manageable files, enabling users to work with the data efficiently.
Tools Used
- Python & Jupyter Notebook – for data cleaning and transformation
- Pandas – for processing large datasets efficiently
- Colab – for providing user access to temporary dataset files
Objective
Build a reproducible pipeline that:
- Downloads raw 311 Service Request data.
- Cleans the dataset according to standardized rules for consistency and quality.
- Splits the data by year, then by month, with each file around 100MB in size.
- Provides cleaned and split datasets via Colab for direct download by users (instead of publishing the datasets to GitHub as an append-only warehouse).
Process
1. Data Acquisition
- Downloaded 311 Service Request data from the city’s open data portal.
- Stored raw files to ensure reproducibility.
2. Data Cleaning
Data cleaning was essential to prepare the 311 service request data for analysis. The following high-level steps were performed:
- Removing duplicates: Eliminated duplicate rows to prevent biased results.
- Handling missing values: Identified missing data and addressed critical gaps, including filling or dropping values based on context.
- Converting date columns: Standardized all date fields to proper datetime format to support time-based analysis.
- Analyzing categorical variables: Reviewed key categorical fields to identify redundancies and simplify the dataset.
- Dropping unnecessary columns: Removed irrelevant or unique identifier columns to improve processing efficiency.
- Standardizing categorical data: Converted text columns to lowercase for consistency and to reduce errors in analysis.
- Cleaning geographical data: Addressed missing or invalid entries in location-related fields such as ZipCode to ensure accurate analysis.
- Saving cleaned data: Partitioned and saved the cleaned dataset into monthly files for easier time-series analysis and handling of large datasets.
3. Data Splitting
- Partitioned cleaned datasets by year, then by month.
- Organized files in a clear folder hierarchy for easy access.
4. Jupyter Notebook/ Colab Notebook
- Created a documented notebook to:
- Download raw data
- Apply cleaning rules
- Automate splitting logic
- Save outputs in the required folder structure
Notebook includes detailed notes explaining the purpose of each cleaning step.
Deliverables
- Annotated Jupyter Notebook with the full data pipeline
- Cleaned and partitioned datasets available as runtime temporary files on Colab
- Documentation of cleaning rules: https://github.com/hackforla/data-science/blob/main/311-data/CSV_files/Docs/CleaningRules.txt
@mru-hub Did you mean to format this part as
4. Jupyter Notebook/ Colab Notebook
- Created a documented notebook to:
- Download raw data
- Apply cleaning rules (notebook includes detailed notes explaining the purpose of each cleaning step).
- Automate splitting logic
- Save outputs in the required folder structure
Please add any links to the Colab files and folder to the document above, plus any formatting changes, and I'll look at this again + we will try to get a peer review (I added the issue to the peer-review column).
@mru-hub Andrew brought up a good question: should we put the Colab notebook into the GitHub directory? I think we should, along with instructions on how to load it. If you're at the next data science meeting on September 8th, you can talk about it with Andrew.
I started a conversation with ChatGPT but it started to ask questions I did not have the answer to, in order to create the instructions. https://chatgpt.com/share/68b62f0b-d6c0-8008-b0f8-7baee180179e
@mru-hub Please provide update
Instructions
- Progress: "What is the current status of your project? What have you completed and what is left to do?"
- Blockers: "Difficulties or errors encountered."
- Availability: "How much time will you have this week to work on this issue?"
- ETA: "When do you expect this issue to be completed?"
- Pictures (if necessary): "Add any pictures that will help illustrate what you are working on."
You can use this template
1. Progress:
2. Blockers:
3. Availability:
4. ETA:
5. Pictures (if necessary):
@ExperimentsInHonesty
I saw Andrew’s point about putting the Colab notebook into the GitHub repo. Right now, we already have a Jupyter notebook there that works locally, and the Colab version was meant to make it system-independent. I just want to confirm the expectation—do we also want to add the Colab notebook to the repo with instructions on how to open it, or is the local notebook enough?
Link to github repo: https://github.com/hackforla/data-science/tree/main/311-data/CSV_files
@mru-hub is going to make:
- a "how to use the Colab" draft to be added to the wiki
- a link in the readme to the "how to use" page for the repo
- a readme draft for the repository, which includes:
  - an inventory of what is in the repo
  - instructions on how to use it
  - a link to the wiki page and a brief mention:
    Please see the [X] wiki page for background on this project and details of how you can use our Google Colab notebook instead of this repo, if space is limited on your machine or you just want to preview it without forking this repo or downloading its contents.
README:
311 Data Cleaning & Hosting Project
Overview This project provides a reproducible pipeline for processing the large 311 Service Request dataset. The pipeline downloads, cleans, and splits the data into smaller, manageable files, making it easier to work with in local or in-browser environments.
For background, methodology, and details on how to use Google Colab instead of this repo, please see the Project Wiki
Repository Contents
- 311_data_cleaning.ipynb – Jupyter Notebook for acquisition, cleaning, and splitting
- CSV_files/Docs/CleaningRules.txt – Documentation of cleaning rules
- README.md – This file.
How to Use

Run Locally
- Clone this repository
  - git clone https://github.com/hackforla/data-science.git
  - cd data-science/311-data
- Open the Jupyter Notebook
  - 311-data/CSV_files
  - jupyter notebook 311_data_cleaning.ipynb
- Run the notebook
  - Download raw 311 Service Request data (provide open data URL for the desired year).
  - Apply cleaning rules by running the notebook
  - Automatically split and save the dataset into monthly files

Run in Google Colab
Use the hosted notebook (no fork or download required): Open in Colab

References
- Project Wiki
- Documentation of cleaning rules
README:
311 Data Cleaning & Hosting Project
Overview
This project provides a reproducible pipeline for processing the large 311 Service Request dataset. The pipeline downloads, cleans, and splits the data into smaller, manageable files, making it easier to work with in local or in-browser environments.
For background, methodology, and details on how to use Google Colab instead of this repo, please see the Project Wiki
Repository Contents
- 311_data_cleaning.ipynb – Jupyter Notebook for acquisition, cleaning, and splitting
- CSV_files/Docs/CleaningRules.txt – Documentation of cleaning rules
- README.md – This file.
How to Use
Run Locally
- Fork this repository
  - git clone https://github.com/hackforla/data-science.git
  - cd data-science/311-data
- Open the Jupyter Notebook
  - 311-data/CSV_files
  - jupyter notebook 311_data_cleaning.ipynb
- Run the notebook
- Download raw 311 Service Request data (provide open data URL for the desired year).
- Apply cleaning rules by running the notebook
- Automatically split and save the dataset into monthly files
OR
Run in Google Colab
Use the hosted notebook (no fork or download required):
References
- Project Wiki
- Documentation of cleaning rules
- GitHub Tutorials
- https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo
README:
311 Data Cleaning & Hosting Project
Overview
This project provides a reproducible pipeline for processing the large 311 Service Request dataset. The pipeline downloads, cleans, and splits the data into smaller, manageable files, making it easier to work with in local or in-browser environments. For background, methodology, and details on how to use Google Colab instead of this repo, please see the Project Wiki
Repository Contents
- 311_data_cleaning.ipynb – Jupyter Notebook for acquisition, cleaning, and splitting
- CSV_files/Docs/CleaningRules.txt – Documentation of cleaning rules
- README.md – This file.
How to Use
Run Locally
- Fork this repository on GitHub.
- Clone your forked copy:
  - git clone https://github.com/<your-username>/data-science.git # clone your fork
  - cd data-science/311-data # move into 311-data folder
- Open the Jupyter Notebook
  - 311-data/CSV_files
  - jupyter notebook 311_data_cleaning.ipynb
- Run the notebook
- Download raw 311 Service Request data (provide open data URL for the desired year).
- Apply cleaning rules by running the notebook
- Automatically split and save the dataset into monthly files
OR
Run in Google Colab
Use the hosted notebook (no fork or download required):
References
- Project Wiki
- Documentation of cleaning rules
- GitHub Tutorials
- https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo
Please provide update @mru-hub
Instructions
- Progress: "What is the current status of your project? What have you completed and what is left to do?"
- Blockers: "Difficulties or errors encountered."
- Availability: "How much time will you have this week to work on this issue?"
- ETA: "When do you expect this issue to be completed?"
- Pictures (if necessary): "Add any pictures that will help illustrate what you are working on."
You can use this template
1. Progress:
2. Blockers:
3. Availability:
4. ETA:
5. Pictures (if necessary):
@chinaexpert1 Please find the README file (README.md).