access-the-data
Exploratory Analysis on MyLA311 Service Request Data 2020
Overview
Analyze MyLA311 Service Request Data 2020, a public dataset available for preview and download (1,491,773 records, 34 feature columns), to understand what kinds of stories (insights) we could tell based on it.
Action Items
- [x] Dataset preprocessing, overview check NA, duplicated records, and looking for any potential issues with the dataset
- [x] EDA on interesting features
**Possible next-step analysis: focus on graffiti (NEW)**
Put the data into context and dig in to understand graffiti through in-depth data analysis (data literacy); this requires finding other datasets:
- [ ] relation with income level (wealth, safety, crime)
- [ ] relation with population (maybe only useful to compare similar areas (define "area") )
- [ ] aggregate by council districts
- [ ] possibly tag some areas (university, tourism, downtown, residency, commercial) -> could be difficult to find the dictionary
Analysis & Findings
- 104 NCs and 203 NC names appear in the data, more than the actual 99 NCs
- Find that `ServiceDate` and `ClosedDate` have NA values
- Find one request that has a typo in `ServiceDate` and `ClosedDate` (year=3020)

------- (Analysis below is based on requests (94.96%) with valid CreatedDate, UpdatedDate, ServiceDate and ClosedDate) -------

- Compare distributions of `CreatedDate`, `UpdatedDate`, `ServiceDate` and `ClosedDate` by hour
  - 1,278,193 (90.23%) requests have `ServiceDate` during 00:00 to 00:59
- Compute new columns: duration between `CreatedDate` and each of `UpdatedDate`, `ServiceDate` and `ClosedDate`
  - Find that there are requests with processing hours < 0
- Compare
  - ratios of requests with `ServiceDate` on the same day as `CreatedDate` to total requests of a type
  - ratios of requests of a type to total requests
- Analysis on Graffiti Removal, the request type that most often has `ServiceDate` later than `CreatedDate`
  - by processing hours
  - by processing days
- Counts of `RequestType`, `RequestSource` and `Status`
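The duration checks above (negative processing hours, `ServiceDate` values clustered in the 00:00 hour) can be sketched with pandas. Column names follow the dataset; the toy records below are made up for illustration:

```python
import pandas as pd

# Toy records mimicking the MyLA311 timestamp columns (values are made up).
df = pd.DataFrame({
    "CreatedDate": pd.to_datetime(["2020-03-01 10:15", "2020-03-02 09:00", "2020-03-03 14:30"]),
    "ServiceDate": pd.to_datetime(["2020-03-01 00:00", "2020-03-01 08:00", "2020-03-04 00:00"]),
})

# Hour-of-day distribution: many ServiceDate values fall in the 00:00-00:59 bin.
service_hour_counts = df["ServiceDate"].dt.hour.value_counts()

# Processing hours from creation to service; negative values flag bad records.
df["hours_to_service"] = (df["ServiceDate"] - df["CreatedDate"]).dt.total_seconds() / 3600
negative = df[df["hours_to_service"] < 0]
```

On the real data, the same mask would surface the requests with processing hours < 0 mentioned above.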
Stories
- Make suggestions to the database (based merely on analyzing whether the values of certain features make sense):
- NC names need more consistency: e.g. "GRANADA HILLS NORTH NC" & "Granada Hills North"; "SYLMAR NC" & "Sylmar"
- Convert timestamp to 24-hour time to facilitate data analysis
- New mechanism for ServiceDate & ClosedDate (possibly, some systems record the ServiceDate to 00:00 on the date of service by default, creating problems when considering processing days or hours if w/o explanations)
- RequestSource: Driver Self Report, Voicemail
- RequestType: Graffiti Removal, Multiple Streetlight Issue, Single Streetlight Issue
- Status: all
- PolicePrecinct: CENTRAL, NEWTON
- AssignTo:
- Owner: BSL, OCB
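The NC-name consistency suggestion could be prototyped with a small normalization helper. This is a hypothetical sketch (`normalize_nc_name` is not from the project's code); the example names come from the findings above:

```python
# Hypothetical helper to reconcile inconsistent NC names such as
# "GRANADA HILLS NORTH NC" vs "Granada Hills North".
def normalize_nc_name(name: str) -> str:
    cleaned = name.strip().title()
    # Drop a trailing "NC" token left over from names like "SYLMAR NC".
    if cleaned.endswith(" Nc"):
        cleaned = cleaned[: -len(" Nc")]
    return cleaned
```

Applied to the NC name column before grouping, this would collapse the 203 observed name variants toward one spelling per NC.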
- Graffiti Removal processing days:
- Understand how to read a boxplot and find outliers by the IQR rule
- Aggregate hours to days to have more insights
- 91.40% of Graffiti Removal requests are processed within 2 days of the day the request was created
- 67.68% are processed on the same day
- 3.2% are processed after a week
- 0.79% are processed after a month
- 0.09% are processed after a year
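The percentage buckets above are straightforward to compute once processing days are available. A minimal sketch with made-up data (the real analysis uses days from `CreatedDate` to `ServiceDate` for Graffiti Removal requests):

```python
import pandas as pd

# Made-up processing-day values for illustration only.
days = pd.Series([0, 0, 0, 1, 2, 2, 5, 8, 40, 400])

pct_same_day = (days == 0).mean() * 100       # processed on the same day
pct_within_2_days = (days <= 2).mean() * 100  # processed within 2 days
pct_after_week = (days > 7).mean() * 100      # processed after a week
```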
------------------------------------UPDATE------------------------------------
Data cleaning
each record has 4 timestamp columns: CreatedDate, UpdatedDate, ServiceDate, ClosedDate
- valid data: all the timestamp columns are in 2020, ServiceDate > CreatedDate, and ClosedDate > CreatedDate
- NA data: records with NA (missing values)
- NC: some NCs are not in the LA NC area but still appear in the data, and 1 NC is misclassified; remove the NCs outside the LA NC area and correct the misclassified one
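The valid/complementary split described above can be sketched as a boolean mask over the timestamp columns (toy data; column names follow the dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "CreatedDate": pd.to_datetime(["2020-01-05", "2020-02-01", "2020-03-01"]),
    "ServiceDate": pd.to_datetime(["2020-01-06", "2019-12-31", None]),
    "ClosedDate":  pd.to_datetime(["2020-01-07", "2020-02-02", "2020-03-05"]),
})

# All timestamp columns must fall in 2020 (NaT years compare as False).
in_2020 = df[["CreatedDate", "ServiceDate", "ClosedDate"]].apply(
    lambda col: col.dt.year == 2020
).all(axis=1)
# Service and close must happen after creation.
ordered = (df["ServiceDate"] > df["CreatedDate"]) & (df["ClosedDate"] > df["CreatedDate"])

valid = df[in_2020 & ordered]       # the "valid data" set
invalid = df[~(in_2020 & ordered)]  # the complementary set, including NA rows
```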
Analytical concepts:
- how outliers affect data and the difficulty of dealing with it; average, median
- how to define, find and remove outliers
- by comparing simple bar plots, we could derive many insights
- differences between a bar plot and histogram
- how to deal with missing values and invalid data: don't make assumptions about missing data before knowing why it is missing; missing values in categorical features can be treated as another level, which can be used to derive insights
- understanding meaning of a column before analysis
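The IQR rule mentioned above (define, find, and remove outliers) is a standard recipe; a minimal sketch, also showing how one outlier skews the mean but not the median:

```python
import pandas as pd

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
s = pd.Series([1, 2, 2, 3, 3, 4, 4, 5, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]        # here: the single value 100
trimmed = s[(s >= lower) & (s <= upper)]       # data with outliers removed
```

Note how the single outlier pulls the mean of `s` above 13 while the median stays at 3, which is why the analysis compares medians before and after outlier removal.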
Google Colab Notebook
Things we need:
- [x] Where are we getting our data from / how are we handling data limitations
- [x] Do we need an export from the 311data team?
- [ ] Are we using the visualizations in the tool? or some other tool?
https://github.com/hackforla/access-the-data/issues/89
Updated content
@ShikaZzz can you link any material that you have for this issue. We will review today
@snooravi The updated content is from my analysis of the dataset in my local Jupyter notebook. I can share my screen and organize the results into a Google Colab notebook later if necessary
That would be great, thanks!
@ShikaZzz please always add your work to a Google Colab or commit your code in a draft PR, in case you get hit by a bus. It's the data science equivalent of only working on documents that are stored in our shared drive instead of a person's My Drive.
Thanks!!!
May pause this for now and switch to using the 311-data.org tools and the Alpha Report Tool to derive insights
Alpha Report Tool:
- lacks timestamps (no date-level data)
311-data.org:
- when using a customized date range: need to clear Start Date & End Date to select dates outside the start-end range
- due to the auto-adjusted time interval:
  - <= 24 days (e.g. 11/05-11/28): 1 day
  - > 24 days: 2 days
  - 6 months: 2 weeks
  - a year: 1 month
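The auto-adjusted intervals above could be reproduced offline with pandas resampling. This is a sketch under assumptions: `bin_frequency` is a hypothetical helper approximating the tool's rule as noted, not its actual implementation:

```python
import pandas as pd

def bin_frequency(n_days: int) -> str:
    """Approximate the tool's auto-adjusted interval rule (from the notes above)."""
    if n_days <= 24:
        return "1D"   # <= 24 days: 1-day bins
    if n_days < 182:
        return "2D"   # > 24 days: 2-day bins
    if n_days < 365:
        return "2W"   # ~6 months: 2-week bins
    return "1M"       # ~a year: 1-month bins

# Example: one request per day over the 11/05-11/28 range from the notes.
counts = pd.Series(1, index=pd.date_range("2020-11-05", "2020-11-28", freq="D"))
freq = bin_frequency((counts.index[-1] - counts.index[0]).days)
binned = counts.resample(freq).sum()
```

Controlling the bin width yourself avoids the con noted below, where the tool silently changes granularity as the date range grows.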
stories: detect data anomalies
- compare by request type
- compare by NC:
- e.g. last month, Graffiti, North East Valley: 2 outstanding peaks in the frequency plot, one on Oct 29-30 and one on Nov 20-21; after removing Sylmar, the data trend changes dramatically
- other metrics: way of contact
- con:
1. need to manually select NC to filter out the one that affects the data trend
2. cannot compare different request types between different NC's unless by con 1
Availability: 15 hours
ETA: by next week
analytics concepts:
- normalization/standardization: not only comparing the total number but also comparing the fraction; by population
- stacked barplot vs barplot
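The normalization idea above (compare fractions or per-population rates, not just totals) can be sketched as follows. The NC names are real, but the counts and populations are made-up placeholders:

```python
import pandas as pd

# Made-up counts: raw totals favor large NCs; per-capita rates remove that bias.
df = pd.DataFrame({
    "NC": ["Sylmar", "Granada Hills North"],
    "graffiti_requests": [900, 300],
    "population": [90000, 10000],  # hypothetical populations
})
df["requests_per_1000"] = df["graffiti_requests"] / df["population"] * 1000
```

Here Sylmar leads on raw counts, but per 1,000 residents the smaller NC has three times the request rate, which is the kind of reversal normalization is meant to expose.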
challenge:
- data preprocessing: consistency
@ShikaZzz Please add update using this template
Progress: "What is the current status of your project? What have you completed and what is left to do?"
Blockers: "Difficulties or errors encountered."
Availability: "How much time will you have this week to work on this issue?"
ETA: "When do you expect this issue to be completed?"
Pictures: "Add any pictures of the visual changes made to the site so far."
If you need help, be sure to either:
- ask for help at your next meeting
- put a "Status: Help Wanted" label on your issue and pull request
- put up a request for assistance on the #access-the-data channel.
Progress:
- Completed: data cleaning by dividing the data into 2 general sets: one set with only valid data (ServiceDate/ClosedDate is in 2020 and after CreatedDate) and one complementary data set
- Doing: working on analysis that focuses on median days to ServiceDate, by:
  - overview of various features
  - graffiti

  For the median days, there will be a comparison of before vs. after removing outliers using the IQR rule, which could also be used as one of the highlighted analytical "points" (story: the number of requests of different types is an indicator of various underlying factors, e.g. why there is more graffiti in a specific NC)
- To-Do:
- share Jupyter Notebook to Google Colab
- More analysis for the later workshop: comparative analysis of invalid data
Blockers:
- cross-checking NC names and NC IDs against the certified NC list and Google Maps
- creating standardized metrics to remove bias towards population
Availability: 20 hours
ETA: ~7-8 hours for the "doing" item
Progress:
- Did:
- uploaded notebooks to Google Colab (see the end of the main issue content)
- updated main content: data cleaning, analytical concepts, google colab notebooks
- cleaned the data
- visualizations of:
- Hour of a day when a request is created: total number of requests, request types and comparison of valid data and other invalid data
- processing days (days from request created to service) by various features (feature can be easily visualized by using created function) and comparison of before removing outliers vs after removing outliers
- exploration of linear relationships: A: between MobileOS, RequestType and days to ServiceDate; B: between 'CreatedByUserOrganization', 'RequestType' and 'days_toServiceDate'
- total number of requests by Request Type
- Graffiti: number of requests vs hour of a day; number of requests vs date; number of requests vs different features ('RequestSource','RequestType','CreatedByUserOrganization','MobileOS','ActionTaken','Status','AssignTo','NC')
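The reusable function for visualizing a feature, mentioned in the visualization list above, might look like the sketch below. This is a hypothetical helper (the notebook's actual function is not shown here); it returns the per-category counts a bar plot would draw:

```python
import pandas as pd

def feature_counts(df: pd.DataFrame, feature: str, top_n: int = 10) -> pd.Series:
    """Counts per category for one feature, ready to feed a bar plot.
    Hypothetical helper, not the project's actual code."""
    return df[feature].value_counts().head(top_n)

# Toy usage with made-up RequestSource values.
df = pd.DataFrame({"RequestSource": ["Call", "Call", "Mobile App", "Driver Self Report"]})
counts = feature_counts(df, "RequestSource")
# counts.plot.bar() would render the comparison described above.
```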
- To-do: creating stories (?), working on feedback from meetings; possibly creating context by adding new datasets, which are not easy to find
Blockers: how to define outliers: different features have different distributions of processing days
Availability: 10 hours
ETA: 1-2 weeks
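One way around the blocker above (different features have different processing-day distributions) is to compute IQR outlier bounds per group instead of one global threshold. A sketch with made-up values:

```python
import pandas as pd

# Each request type gets its own outlier bounds from its own distribution.
df = pd.DataFrame({
    "RequestType": ["Graffiti"] * 5 + ["Bulky Items"] * 5,
    "days_to_service": [0, 0, 1, 2, 30, 3, 4, 5, 6, 90],
})

def iqr_bounds(s: pd.Series) -> pd.Series:
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return pd.Series({"lower": q1 - 1.5 * iqr, "upper": q3 + 1.5 * iqr})

# One (lower, upper) pair per request type.
bounds = df.groupby("RequestType")["days_to_service"].apply(iqr_bounds).unstack()
```

A value flagged as an outlier for Graffiti (fast turnaround) may be perfectly normal for a slower request type, which is exactly why a single global cutoff does not work.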
@ShikaZzz Please add update using this template (even if you have a pull request)
Progress: "What is the current status of your project? What have you completed and what is left to do?"
Blockers: "Difficulties or errors encountered."
Availability: "How much time will you have this week to work on this issue?"
ETA: "When do you expect this issue to be completed?"
Pictures: "Add any pictures of the visual changes made to the site so far."
If you need help, be sure to either:
- ask for help at your next meeting
- put a "Status: Help Wanted" label on your issue and pull request
- put up a request for assistance on the #access-the-data channel.
Updated links for the EDA and data cleaning Colab notebooks in the access-the-data drive