access-the-data icon indicating copy to clipboard operation
access-the-data copied to clipboard

Exploratory Analysis on MyLA311 Service Request Data 2020

Open snooravi opened this issue 4 years ago • 14 comments

Overview

Analyze MyLA311 Service Request Data 2020, a public dataset available for preview and download (1,491,773 records, 34 feature columns), to understand what kind of stories (insight) we could tell based on it

Action Items

  • [x] Dataset preprocessing, overview check NA, duplicated records, and looking for any potential issues with the dataset
  • [x] EDA on interesting features

** possible next step analysis: focus on graffiti (NEW) **

put the data into context & dig in to understand graffiti from "in-depth" (data literacy) data analysis: (need to find other datasets)

  • [ ] relation with income level (wealth, safety, crime)
  • [ ] relation with population (maybe only useful to compare similar areas (define "area") )
  • [ ] aggregate by council districts
  • [ ] possibly tag some areas (university, tourism, downtown, residency, commercial) -> could be difficult to find the dictionary

Analysis & Findings

  1. 104 NC, 203 NC names, more than the actual 99 NC's
  2. Find that ServiceDate and ClosedDate have NA values
  3. Find one request that has a typo of ServiceDate and ClosedDate (year=3020)

------- (Analysis below is based on requests (94.96%) with valid CreatedDate, UpdatedDate, ServiceDate and ClosedDate )-------

  1. Compare distribution of CreatedDate, UpdatedDate, ServiceDate and ClosedDate by hours
    • 1,278,193 (90.23%) requests have ServiceDate during 00:00 to 00:59
  2. Compute new columns: duration between CreatedDate and each of UpdatedDate, ServiceDate and ClosedDate
  3. Find that there are requests with process hours <0
  4. Compare
    • ratios of requests with ServiceData on the same day as CreatedDay in total requests of a type
    • ratios of requests of a type in total requests
  5. Analysis on Graffiti Removal, the request type that has the most ServiceDate later than CreatedDate
    • by processing hours
    • by processing days
  6. Count of RequestType, RequestSource and Status

Stories

  • Make Suggestions to the database (based on merely analyzing if the values of certain features make sense):

    1. NC names need more consistency: e.g ''GRANADA HILLS NORTH NC" & "Granada Hills North" ; "SYLMAR NC" & "Sylmar"
    2. Convert timestamp to 24-hour time to facilitate data analysis
    3. New mechanism for ServiceDate & ClosedDate (possibly, some systems record the ServiceDate to 00:00 on the date of service by default, creating problems when considering processing days or hours if w/o explanations)
    • RequestSource: Driver Self Report, Voicemail
    • RequestType: Graffiti Removal, Multiple Streetlight Issue, Single Streetlight Issue
    • Status: all
    • PolicePrecinct: CENTRAL, NEWTON
    • AssignTo:
    • Owner: BSL, OCB
  • Graffiti Removal Processing Days:

    • Understand how to look at boxplot and find outliers by IQR
    • Aggregate hours to days to have more insights
    • 91.40% Graffiti Removal requests are processed within 2 days after the day request created
    • 67.68% Graffiti Removal requests are processed on the same day
    • 3.2% Graffiti Removal requests are processed after a week
    • 0.79% Graffiti Removal requests are processed after a month
    • 0.09% Graffiti Removal requests are processed after a year

------------------------------------UPDATE------------------------------------

Data cleaning

each record has 4 timestamp columns: CreatedDate, UpdatedDate, ServiceDate, ClosedDate

  • valid data: all the timestamp columns are in 2020 and ServiceDate>CreatedDate and ClosedDate>CreatedDate
  • NA data: records with NA (missing value)
  • NC: some NC's are not in the LA NC's area but still in the data and 1 NC is misclassified: remove NC's not in the LA NC's area and correct the misclassified one

Analytical concepts:

  • how outliers affect data and the difficulty of dealing with it; average, median
  • how to define, find and remove outliers
  • by comparing simple bar plots, we could derive many insights
  • differences between a bar plot and histogram
  • how to deal with missing values and invalid data: don't make assumpting of missing data before knowing why they are missing. missing values in categorical features could be treated as another level, which can be used to derive insights
  • understanding meaning of a column before analysis

Google Colab Notebook

data cleaning EDA

snooravi avatar Oct 14 '21 00:10 snooravi

Things we need:

  • [x] Where are we getting our data from / how are we handling data limitations
  • [x] Do we need an export from the 311data team?
  • [ ] Are we using the visualizations in the tool? or some other tool?

snooravi avatar Oct 19 '21 01:10 snooravi

https://github.com/hackforla/access-the-data/issues/89

snooravi avatar Nov 11 '21 01:11 snooravi

Updated content

ShikaZzz avatar Nov 16 '21 02:11 ShikaZzz

@ShikaZzz can you link any material that you have for this issue. We will review today

snooravi avatar Nov 16 '21 22:11 snooravi

@snooravi The updated content is from my analysis of the dataset on my local Jupyter notebook. I can share my screen and organize the result into a Google colab notebook later if necessary

ShikaZzz avatar Nov 16 '21 22:11 ShikaZzz

That would be great, thanks!

snooravi avatar Nov 16 '21 22:11 snooravi

@ShikaZzz please always add your work to a Google colab or commit your code in a draft pr, in case you git hit by a bus. Its the data science equivalent of only working on documents that are stored in our shared drive instead of a persons mydrive.

Thanks!!!

ExperimentsInHonesty avatar Nov 17 '21 17:11 ExperimentsInHonesty

may pause for now and change to how to use 311-data.org tools and the Alpha Report Tool to derive insights

Alpha Report Tool:

  • lack timestamp (no date level data)

311-data.org:

  • when using customized date range: need to clear Start Date & End Date to select date out of the start-end date range
  • due to the auto-adjusted time interval
    • <=24 days (e.g 11/05-11/28): 1 day
    • >24 days: 2 days
    • 6 months: 2 weeks
    • a year: 1 month

stories: detect data anomalies

  1. compare by request type
  2. compare by NC:
    • e.g last month, Graffiti, North East Valley: 2 outstanding peaks in the frequency plot: 1 on Oct 29-30, 1 on Nov 20-21 removing Sylmar, data trend dramatically changes
    • other metrics: way of contact
- con: 
  1. need to manually select NC to filter out the one that affects the data trend
  2. cannot compare different request types between different NC's unless by con 1

Availability: 15 hours ETA: by next week

ShikaZzz avatar Nov 29 '21 22:11 ShikaZzz

analytics concepts:

  • normalization/standardization: not only comparing the total number but also comparing the fraction; by population
  • stacked barplot vs barplot

challenge:

  • data preprocessing: consistency,

ShikaZzz avatar Nov 30 '21 02:11 ShikaZzz

@ShikaZzz Please add update using this template

Progress: "What is the current status of your project? What have you completed and what is left to do?" Blockers: "Difficulties or errors encountered." Availability: "How much time will you have this week to work on this issue?" ETA: "When do you expect this issue to be completed?" Pictures: "Add any pictures of the visual changes made to the site so far."

If you need help, be sure to either:

  1. ask for help at your next meeting
  2. put a "Status: Help Wanted" label on your issue and pull request
  3. put up a request for assistance on the #access-the-data channel.

lrchang2 avatar Dec 07 '21 22:12 lrchang2

Progress:

  • Completed: data cleaning by dividing into 2 general sets: 1 set has only valid data (ServiceDate/ClosedDate is in 2020 and after CreatedDate); 1 complimentary data set

  • Doing: Working on analysis that focuses on median days to ServiceDate by:

    1. overview of various features
    2. graffiti For the median days, there will be a comparison on before vs after outliers by using the IQR rule, which could also be used as one of the highlighted analytical "points" (story: number of requests of different types is an indicator of various underlying factors: e.g why there are more graffiti in a specific NC)
  • To-Do:

    1. share Jupyter Notebook to Google Colab
    2. More analysis for the later workshop: comparative analysis of invalid data

Blockers:

  • NC Names and NC ID cross-checking NC certified and Google Maps
  • creating standardized metrics to remove bias towards population

Availability: 20 hours ETA: ~7-8 hours for "doing" item

ShikaZzz avatar Dec 09 '21 01:12 ShikaZzz

Progress:

  • did:

    • uploaded notebooks to Google Colab (see the end of the main)
    • updated main content: data cleaning, analytical concepts, google colab notebooks
    • cleaned the data
    • visualizations of:
      1. Hour of a day when a request is created: total number of requests, request types and comparison of valid data and other invalid data
      2. processing days (days from request created to service) by various features (feature can be easily visualized by using created function) and comparison of before removing outliers vs after removing outliers
      3. exploration of linear relationship: A: between MobileOS, Request Type and days to Service"; B: between 'CreatedByUserOrganization','RequestType','days_toServiceDate'
      4. total number of requests by Request Type
      5. Graffiti: number of requests vs hour of a day; number of requests vs date; number of requests vs different features ('RequestSource','RequestType','CreatedByUserOrganization','MobileOS','ActionTaken','Status','AssignTo','NC')
  • To-do: creating stories (?), working on feedback from meetings; possibly creating context by adding new datasets, which are not easy to find

Blockers: how to define outliers: different features have different distributions of processing days

Availability: 10 hours

ETA: 1-2 weeks

ShikaZzz avatar Dec 13 '21 23:12 ShikaZzz

@ShikaZzz Please add update using this template (even if you have a pull request)

Progress: "What is the current status of your project? What have you completed and what is left to do?" Blockers: "Difficulties or errors encountered." Availability: "How much time will you have this week to work on this issue?" ETA: "When do you expect this issue to be completed?" Pictures: "Add any pictures of the visual changes made to the site so far."

If you need help, be sure to either:

  1. ask for help at your next meeting
  2. put a "Status: Help Wanted" label on your issue and pull request
  3. put up a request for assistance on the #access-the-data channel.

lrchang2 avatar Jan 04 '22 22:01 lrchang2

updated links for EDA and data cleaning to the collab notebooks in the access the data drive

snooravi avatar Jan 13 '22 01:01 snooravi