access-the-data
Exploratory Analysis on MyLA311 Service Request Data 2020
Overview
Analyze MyLA311 Service Request Data 2020, a public dataset available for preview and download (1,491,773 records, 34 feature columns), to understand what kinds of stories (insights) we could tell based on it.
Action Items
- [x] Dataset preprocessing, overview check NA, duplicated records, and looking for any potential issues with the dataset
- [x] EDA on interesting features
**Possible next-step analysis: focus on graffiti (NEW)**
Put the data into context and dig in to understand graffiti through in-depth data analysis (data literacy); this requires finding other datasets:
- [ ] relation with income level (wealth, safety, crime)
- [ ] relation with population (maybe only useful to compare similar areas (define "area") )
- [ ] aggregate by council districts
- [ ] possibly tag some areas (university, tourism, downtown, residency, commercial) -> could be difficult to find the dictionary
Analysis & Findings
- 104 NCs and 203 NC names appear in the data, more than the actual 99 NCs
- Find that `ServiceDate` and `ClosedDate` have NA values
- Find one request that has a typo in `ServiceDate` and `ClosedDate` (year=3020)

------- (Analysis below is based on requests (94.96%) with valid CreatedDate, UpdatedDate, ServiceDate and ClosedDate) -------

- Compare distributions of `CreatedDate`, `UpdatedDate`, `ServiceDate` and `ClosedDate` by hour
  - 1,278,193 (90.23%) requests have `ServiceDate` during 00:00 to 00:59
- Compute new columns: duration between `CreatedDate` and each of `UpdatedDate`, `ServiceDate` and `ClosedDate`
  - Find that there are requests with processing hours < 0
- Compare
  - ratios of requests with `ServiceDate` on the same day as `CreatedDate` to total requests of a type
  - ratios of requests of a type to total requests
- Analysis on Graffiti Removal, the request type that most often has `ServiceDate` later than `CreatedDate`
  - by processing hours
  - by processing days
- Counts of `RequestType`, `RequestSource` and `Status`
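The duration checks above (negative processing hours, `ServiceDate` values clustered in the 00:00 hour) can be sketched with pandas. Column names follow the dataset; the toy records below are made up for illustration:

```python
import pandas as pd

# Toy records mimicking the MyLA311 timestamp columns (values are made up).
df = pd.DataFrame({
    "CreatedDate": pd.to_datetime(["2020-03-01 10:15", "2020-03-02 09:00", "2020-03-03 14:30"]),
    "ServiceDate": pd.to_datetime(["2020-03-01 00:00", "2020-03-01 08:00", "2020-03-04 00:00"]),
})

# Hour-of-day distribution: many ServiceDate values fall in the 00:00-00:59 bin.
service_hour_counts = df["ServiceDate"].dt.hour.value_counts()

# Processing hours from creation to service; negative values flag bad records.
df["hours_to_service"] = (df["ServiceDate"] - df["CreatedDate"]).dt.total_seconds() / 3600
negative = df[df["hours_to_service"] < 0]
```

On the real data, the same mask would surface the requests with processing hours < 0 mentioned above.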
Stories
- Make suggestions to the database (based merely on analyzing whether the values of certain features make sense):
- NC names need more consistency: e.g. "GRANADA HILLS NORTH NC" & "Granada Hills North"; "SYLMAR NC" & "Sylmar"
- Convert timestamp to 24-hour time to facilitate data analysis
- New mechanism for ServiceDate & ClosedDate (possibly, some systems record the ServiceDate to 00:00 on the date of service by default, creating problems when considering processing days or hours if w/o explanations)
- RequestSource: Driver Self Report, Voicemail
- RequestType: Graffiti Removal, Multiple Streetlight Issue, Single Streetlight Issue
- Status: all
- PolicePrecinct: CENTRAL, NEWTON
- AssignTo:
- Owner: BSL, OCB
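The NC-name consistency suggestion could be prototyped with a small normalization helper. This is a hypothetical sketch (`normalize_nc_name` is not from the project's code); the example names come from the findings above:

```python
# Hypothetical helper to reconcile inconsistent NC names such as
# "GRANADA HILLS NORTH NC" vs "Granada Hills North".
def normalize_nc_name(name: str) -> str:
    cleaned = name.strip().title()
    # Drop a trailing "NC" token left over from names like "SYLMAR NC".
    if cleaned.endswith(" Nc"):
        cleaned = cleaned[: -len(" Nc")]
    return cleaned
```

Applied to the NC name column before grouping, this would collapse the 203 observed name variants toward one spelling per NC.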
- Graffiti Removal processing days:
- Understand how to read a boxplot and find outliers by the IQR rule
- Aggregate hours to days to have more insights
- 91.40% of Graffiti Removal requests are processed within 2 days of the day the request was created
- 67.68% are processed on the same day
- 3.2% are processed after a week
- 0.79% are processed after a month
- 0.09% are processed after a year
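The percentage buckets above are straightforward to compute once processing days are available. A minimal sketch with made-up data (the real analysis uses days from `CreatedDate` to `ServiceDate` for Graffiti Removal requests):

```python
import pandas as pd

# Made-up processing-day values for illustration only.
days = pd.Series([0, 0, 0, 1, 2, 2, 5, 8, 40, 400])

pct_same_day = (days == 0).mean() * 100       # processed on the same day
pct_within_2_days = (days <= 2).mean() * 100  # processed within 2 days
pct_after_week = (days > 7).mean() * 100      # processed after a week
```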
------------------------------------UPDATE------------------------------------
Data cleaning
each record has 4 timestamp columns: CreatedDate, UpdatedDate, ServiceDate, ClosedDate
- valid data: all the timestamp columns are in 2020, ServiceDate > CreatedDate, and ClosedDate > CreatedDate
- NA data: records with NA (missing values)
- NC: some NCs are not in the LA NC area but still appear in the data, and 1 NC is misclassified; remove the NCs outside the LA NC area and correct the misclassified one
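The valid/complementary split described above can be sketched as a boolean mask over the timestamp columns (toy data; column names follow the dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "CreatedDate": pd.to_datetime(["2020-01-05", "2020-02-01", "2020-03-01"]),
    "ServiceDate": pd.to_datetime(["2020-01-06", "2019-12-31", None]),
    "ClosedDate":  pd.to_datetime(["2020-01-07", "2020-02-02", "2020-03-05"]),
})

# All timestamp columns must fall in 2020 (NaT years compare as False).
in_2020 = df[["CreatedDate", "ServiceDate", "ClosedDate"]].apply(
    lambda col: col.dt.year == 2020
).all(axis=1)
# Service and close must happen after creation.
ordered = (df["ServiceDate"] > df["CreatedDate"]) & (df["ClosedDate"] > df["CreatedDate"])

valid = df[in_2020 & ordered]       # the "valid data" set
invalid = df[~(in_2020 & ordered)]  # the complementary set, including NA rows
```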
Analytical concepts:
- how outliers affect data and the difficulty of dealing with it; average, median
- how to define, find and remove outliers
- by comparing simple bar plots, we could derive many insights
- differences between a bar plot and histogram
- how to deal with missing values and invalid data: don't make assumptions about missing data before knowing why it is missing; missing values in categorical features can be treated as another level, which can be used to derive insights
- understanding meaning of a column before analysis
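The IQR rule mentioned above (define, find, and remove outliers) is a standard recipe; a minimal sketch, also showing how one outlier skews the mean but not the median:

```python
import pandas as pd

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
s = pd.Series([1, 2, 2, 3, 3, 4, 4, 5, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]        # here: the single value 100
trimmed = s[(s >= lower) & (s <= upper)]       # data with outliers removed
```

Note how the single outlier pulls the mean of `s` above 13 while the median stays at 3, which is why the analysis compares medians before and after outlier removal.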
Google Colab Notebook
Things we need:
- [x] Where are we getting our data from / how are we handling data limitations
- [x] Do we need an export from the 311data team?
- [ ] Are we using the visualizations in the tool? or some other tool?
https://github.com/hackforla/access-the-data/issues/89
Updated content
@ShikaZzz can you link any material that you have for this issue. We will review today
@snooravi The updated content is from my analysis of the dataset in my local Jupyter notebook. I can share my screen and organize the results into a Google Colab notebook later if necessary
That would be great, thanks!
@ShikaZzz please always add your work to a Google Colab or commit your code in a draft PR, in case you get hit by a bus. It's the data science equivalent of only working on documents that are stored in our shared drive instead of a person's My Drive.
Thanks!!!
May pause this for now and switch to using the 311-data.org tools and the Alpha Report Tool to derive insights
Alpha Report Tool:
- lacks timestamps (no date-level data)
311-data.org:
- when using a customized date range: need to clear Start Date & End Date to select dates outside the start-end range
- due to the auto-adjusted time interval:
  - <= 24 days (e.g. 11/05-11/28): 1 day
  - > 24 days: 2 days
  - 6 months: 2 weeks
  - a year: 1 month
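The auto-adjusted intervals above could be reproduced offline with pandas resampling. This is a sketch under assumptions: `bin_frequency` is a hypothetical helper approximating the tool's rule as noted, not its actual implementation:

```python
import pandas as pd

def bin_frequency(n_days: int) -> str:
    """Approximate the tool's auto-adjusted interval rule (from the notes above)."""
    if n_days <= 24:
        return "1D"   # <= 24 days: 1-day bins
    if n_days < 182:
        return "2D"   # > 24 days: 2-day bins
    if n_days < 365:
        return "2W"   # ~6 months: 2-week bins
    return "1M"       # ~a year: 1-month bins

# Example: one request per day over the 11/05-11/28 range from the notes.
counts = pd.Series(1, index=pd.date_range("2020-11-05", "2020-11-28", freq="D"))
freq = bin_frequency((counts.index[-1] - counts.index[0]).days)
binned = counts.resample(freq).sum()
```

Controlling the bin width yourself avoids the con noted below, where the tool silently changes granularity as the date range grows.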
stories: detect data anomalies
- compare by request type
- compare by NC:
- e.g. last month, Graffiti, North East Valley: 2 outstanding peaks in the frequency plot, one on Oct 29-30 and one on Nov 20-21; after removing Sylmar, the data trend changes dramatically
- other metrics: way of contact
- con:
1. need to manually select NC to filter out the one that affects the data trend
2. cannot compare different request types between different NC's unless by con 1
Availability: 15 hours
ETA: by next week
analytics concepts:
- normalization/standardization: not only comparing the total number but also comparing the fraction; by population
- stacked barplot vs barplot
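The normalization idea above (compare fractions or per-population rates, not just totals) can be sketched as follows. The NC names are real, but the counts and populations are made-up placeholders:

```python
import pandas as pd

# Made-up counts: raw totals favor large NCs; per-capita rates remove that bias.
df = pd.DataFrame({
    "NC": ["Sylmar", "Granada Hills North"],
    "graffiti_requests": [900, 300],
    "population": [90000, 10000],  # hypothetical populations
})
df["requests_per_1000"] = df["graffiti_requests"] / df["population"] * 1000
```

Here Sylmar leads on raw counts, but per 1,000 residents the smaller NC has three times the request rate, which is the kind of reversal normalization is meant to expose.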
challenge:
- data preprocessing: consistency
@ShikaZzz Please add update using this template
Progress: "What is the current status of your project? What have you completed and what is left to do?"
Blockers: "Difficulties or errors encountered."
Availability: "How much time will you have this week to work on this issue?"
ETA: "When do you expect this issue to be completed?"
Pictures: "Add any pictures of the visual changes made to the site so far."
If you need help, be sure to either:
- ask for help at your next meeting
- put a "Status: Help Wanted" label on your issue and pull request
- put up a request for assistance on the #access-the-data channel.
Progress:
- Completed: data cleaning by dividing the data into 2 general sets: one set with only valid data (ServiceDate/ClosedDate is in 2020 and after CreatedDate) and one complementary data set
- Doing: working on analysis that focuses on median days to ServiceDate, by:
  - overview of various features
  - graffiti

  For the median days, there will be a comparison of before vs. after removing outliers using the IQR rule, which could also be used as one of the highlighted analytical "points" (story: the number of requests of different types is an indicator of various underlying factors, e.g. why there is more graffiti in a specific NC)
- To-Do:
- share Jupyter Notebook to Google Colab
- More analysis for the later workshop: comparative analysis of invalid data
Blockers:
- cross-checking NC names and NC IDs against the certified NC list and Google Maps
- creating standardized metrics to remove bias towards population
Availability: 20 hours
ETA: ~7-8 hours for the "doing" item
Progress:
- Did:
- uploaded notebooks to Google Colab (see the end of the main issue content)
- updated main content: data cleaning, analytical concepts, google colab notebooks
- cleaned the data
- visualizations of:
- Hour of a day when a request is created: total number of requests, request types and comparison of valid data and other invalid data
- processing days (days from request created to service) by various features (feature can be easily visualized by using created function) and comparison of before removing outliers vs after removing outliers
- exploration of linear relationships: A: between MobileOS, RequestType and days to ServiceDate; B: between 'CreatedByUserOrganization', 'RequestType' and 'days_toServiceDate'
- total number of requests by Request Type
- Graffiti: number of requests vs hour of a day; number of requests vs date; number of requests vs different features ('RequestSource','RequestType','CreatedByUserOrganization','MobileOS','ActionTaken','Status','AssignTo','NC')
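The reusable function for visualizing a feature, mentioned in the visualization list above, might look like the sketch below. This is a hypothetical helper (the notebook's actual function is not shown here); it returns the per-category counts a bar plot would draw:

```python
import pandas as pd

def feature_counts(df: pd.DataFrame, feature: str, top_n: int = 10) -> pd.Series:
    """Counts per category for one feature, ready to feed a bar plot.
    Hypothetical helper, not the project's actual code."""
    return df[feature].value_counts().head(top_n)

# Toy usage with made-up RequestSource values.
df = pd.DataFrame({"RequestSource": ["Call", "Call", "Mobile App", "Driver Self Report"]})
counts = feature_counts(df, "RequestSource")
# counts.plot.bar() would render the comparison described above.
```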
- To-do: creating stories (?), working on feedback from meetings; possibly creating context by adding new datasets, which are not easy to find
Blockers: how to define outliers: different features have different distributions of processing days
Availability: 10 hours
ETA: 1-2 weeks
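One way around the blocker above (different features have different processing-day distributions) is to compute IQR outlier bounds per group instead of one global threshold. A sketch with made-up values:

```python
import pandas as pd

# Each request type gets its own outlier bounds from its own distribution.
df = pd.DataFrame({
    "RequestType": ["Graffiti"] * 5 + ["Bulky Items"] * 5,
    "days_to_service": [0, 0, 1, 2, 30, 3, 4, 5, 6, 90],
})

def iqr_bounds(s: pd.Series) -> pd.Series:
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return pd.Series({"lower": q1 - 1.5 * iqr, "upper": q3 + 1.5 * iqr})

# One (lower, upper) pair per request type.
bounds = df.groupby("RequestType")["days_to_service"].apply(iqr_bounds).unstack()
```

A value flagged as an outlier for Graffiti (fast turnaround) may be perfectly normal for a slower request type, which is exactly why a single global cutoff does not work.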
@ShikaZzz Please add update using this template (even if you have a pull request)
Progress: "What is the current status of your project? What have you completed and what is left to do?"
Blockers: "Difficulties or errors encountered."
Availability: "How much time will you have this week to work on this issue?"
ETA: "When do you expect this issue to be completed?"
Pictures: "Add any pictures of the visual changes made to the site so far."
If you need help, be sure to either:
- ask for help at your next meeting
- put a "Status: Help Wanted" label on your issue and pull request
- put up a request for assistance on the #access-the-data channel.
Updated links for the EDA and data cleaning Colab notebooks in the access-the-data drive