
Investigate the relationship between crime and 311 requests

Open nichhk opened this issue 1 year ago • 17 comments

Overview

It would be interesting to see if crime is positively correlated with 311 requests--i.e., as crime increases, there are more quality-of-life issues to report. It's also possible to see the reverse, a negative correlation--perhaps more 311 requests are indicative of a more engaged/active community, and crime is lower. And of course, maybe there's no relationship at all.

@Michaeldavidng to fill out more details. Please include the datasets that you're using!

nichhk avatar Jul 08 '22 21:07 nichhk

Michael has shared his work on a Tableau dashboard over the past couple weeks. @Michaeldavidng, could you share what sort of analyses you ran, and what your findings were? As well as the next steps?

nichhk avatar Jul 22 '22 23:07 nichhk

Hello:

Day 1: On the first day, I ran a regression between the amount of crime that happened in each precinct and the number of 311 requests in each precinct.

The analysis indicated that crime and 311 requests were not correlated.

I was asked to narrow the area, since the location might have been too broad, and to filter out the "bulky items" category.

Day 2: Re-ran the regression from Day 1 with "bulky items" filtered out. The results produced an even lower R-squared value, again indicating a lack of correlation.

Created a new column combining longitude and latitude for the crime data (the 311 dataset already had one). I planned to run another regression between these two columns but didn't know how.

Graphed all instances of crime on an actual map and placed it side by side with a map showing 311 requests. The analysis showed that the concentration of crime did not match the concentration of 311 requests.

I was told this was not good enough, as latitude/longitude data might be too granular.

I was asked to work with another person to find a way to graph the data using geospatial blocks.

7/22/2022: Met with a 311 member via Google. She was kind enough to discuss techniques for merging datasets in Python. She specifically mentioned that the 311 dataset has a "neighborhood council" column, so once we merge it with the crime dataset, we can break crime down by neighborhood council.

Specifically, each police precinct is divided into smaller blocks based on neighborhood council, so I can pull trends from more localized data.

She then discussed the possibility of combining the 311 dataset and the crime dataset based on geospatial data, but needed time to do it.
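For reference, here is a hedged sketch of what such a geospatial combination could look like using geopandas; this is an assumption on my part, not code anyone has shared in this thread, and the points and polygon below are toy stand-ins for the real data.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Toy crime points and one toy neighborhood-council polygon.
crime_points = gpd.GeoDataFrame(
    {"crime_id": [1, 2]},
    geometry=[Point(-118.44, 34.05), Point(-118.25, 34.04)],
    crs="EPSG:4326",
)
nc_polygons = gpd.GeoDataFrame(
    {"nc_name": ["Toy NC"]},
    geometry=[Polygon([(-118.5, 34.0), (-118.4, 34.0), (-118.4, 34.1), (-118.5, 34.1)])],
    crs="EPSG:4326",
)

# Assign each crime point to the neighborhood-council polygon that contains it.
crime_with_nc = gpd.sjoin(crime_points, nc_polygons, how="inner", predicate="within")
print(crime_with_nc[["crime_id", "nc_name"]])
```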

Next steps: I will attempt to merge the two datasets via Python and extract or plot a regression analysis, and will wait for the kind 311 member to follow up on her geospatial idea.
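A minimal sketch of the planned precinct-level regression, assuming CSV exports of the two datasets (the file names below are placeholders), an 'AREA NAME' precinct column in the crime data, a 'PolicePrecinct' column in the 311 data, and precinct names that already match between the two:

```python
import pandas as pd
from scipy import stats

crime = pd.read_csv("crime_data.csv")         # placeholder file name
requests = pd.read_csv("311_requests.csv")    # placeholder file name

# Count rows per precinct in each dataset.
crime_counts = crime.groupby("AREA NAME").size().rename("crime_count")
request_counts = requests.groupby("PolicePrecinct").size().rename("request_count")

# Line up the per-precinct counts on the precinct name and drop non-matches.
merged = pd.concat([crime_counts, request_counts], axis=1).dropna()

# Simple linear regression: 311 request count vs. crime count per precinct.
result = stats.linregress(merged["request_count"], merged["crime_count"])
print(f"R-squared: {result.rvalue ** 2:.3f}, slope: {result.slope:.3f}")
```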

Michaeldavidng avatar Jul 23 '22 00:07 Michaeldavidng

7/25/2022: Attempted to use the code given by the other 311 member.

The specific code merging the two datasets on the shared column had an issue: the merged dataset contained only column headers, with no data in the table.

Spent most of the afternoon writing code and exploring other Python approaches to merge the two datasets. Encountered an issue where joining the two columns fails because of mismatched data types: https://stackoverflow.com/questions/57795399/trouble-with-df-join-valueerror-you-are-trying-to-merge-on-object-and-int64

Also revamped the datasets by giving the 'PolicePrecinct' columns the same header and values, i.e. both now use "West Los Angeles" instead of "West LA", and so on.

Notified the fellow 311 member about the code issue.

Next step: thinking of using Microsoft Access and joining the two datasets via SQL.

(Screenshots attached: 7/25/2022, parts 1-3.)

Michaeldavidng avatar Jul 25 '22 23:07 Michaeldavidng

Hi Michael, when I looked at the rough script that I shared with you on Friday, I realized that I missed the 'c' in 'PolicePrecinct'. That is why the merge was not working out. I sent you the updated script, so pd.merge() works now.

Also, when I looked at the 'PolicePrecinct' column in the crime data, the strings are title case, whereas the same strings in the 311 data are upper case. You will have to fix this before merging the 2 dataframes, as you will otherwise end up with NaN values after merging them the way they are right now.
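A minimal sketch of that fix, assuming both dataframes already share a 'PolicePrecinct' column as in the shared script; the toy frames below stand in for the real data:

```python
import pandas as pd

# Toy stand-ins for the real dataframes.
crime_df = pd.DataFrame({"PolicePrecinct": ["West Los Angeles", "Harbor"], "crimes": [10, 4]})
calls_df = pd.DataFrame({"PolicePrecinct": ["WEST LOS ANGELES", "HARBOR"], "requests": [25, 7]})

# Normalize casing so the title-case and upper-case spellings match before merging.
for df in (crime_df, calls_df):
    df["PolicePrecinct"] = df["PolicePrecinct"].str.upper().str.strip()

merged = pd.merge(crime_df, calls_df, on="PolicePrecinct", how="inner")
print(merged)
```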

priyakalyan avatar Jul 26 '22 03:07 priyakalyan

7/26/2022 (Note: even though I'm writing this on 7/27, this work took place yesterday.) Got correspondence back from fellow 311 member (Anupriya). I was told the issue was a spelling error in the code, and also of an additional issue: the merging columns (i.e. PolicePrecinct) have differences that result in NaN values. However, I fixed both issues yesterday on 7/25.

The 311 member sent me additional code that trims unnecessary data from the 311 request dataset. I used this new code, but the merging process kept complaining about a memory issue.

Became frustrated with Python, since I was unable to visualize the data or produce the analysis. Decided it was ultimately a dead end and went to Tableau, as it's easier to use.

In Tableau, I used Anupriya's advice to merge the two datasets via PolicePrecinct (updated so it won't generate NaN or null values due to mismatched names), then graphed crime by Neighborhood Council (NC).

Created 5 graphs:

  1. First graph: crime in each neighborhood council, using a horizontal bar chart to show the count of crimes for each council.
  2. Second graph: 311 requests in each neighborhood council, in the same format as the first graph.
  3. Third graph: the first and second graphs overlaid on top of each other to see the difference.
  4. Fourth graph: a regression performed by Tableau, count of crimes vs. count of 311 requests.
  5. Fifth graph: a side-by-side comparison of the first and second graphs.

Corresponded with Anupriya that I wouldn't be using Python for the project due to an unresolvable memory issue.

Next steps:

  1. Try running Anupriya's and my code on Google Colab to see if it works.
  2. Add NC "pages" for the regression, to show a regression for each council.

https://public.tableau.com/app/profile/michael.david.ng/viz/311correlationwithcrime/Sheet1

Michaeldavidng avatar Jul 27 '22 23:07 Michaeldavidng

7/28/2022 https://public.tableau.com/app/profile/michael.david.ng/viz/311correlationwithcrime/Sheet4

I shared my findings in the official meeting. Overall, the group agreed with my data.

Conclusion: there is no discernible relationship or correlation between 311 requests and crime.

Normally, I would have closed this issue, but another 311 member has expressed interest in finding additional crime-related questions to investigate (i.e. a deeper dive).

Michaeldavidng avatar Jul 29 '22 02:07 Michaeldavidng

Wanted to add some thoughts on ways to explore this issue. Feel free to ignore all of this if there are other approaches you think work best.

Before that, let's clarify some of the objectives of this deep dive from my understanding:

  • Investigate the relationship between crime and 311 requests; attempt to reveal any patterns between the two datasets
  • If there are any relationships, what are some actionable items we can recommend to improve the existing situation

Below are some steps we could consider when conducting this analysis:

  • Gather the relevant datasets over the years (MyLA 311 data and crime data from lacity, please see documentation)
  • Decide on a specific granularity at which to aggregate the datasets (Michael tried the NC level; perhaps we can drill further down into specific decimal places of lat/long, or even the NC level over the years). We need to do this step in order to combine both datasets. In implementation, we group by the chosen level of granularity (e.g. NC level) and take the count of (specific) requests and (specific) crimes
  • Combine the aggregated LA 311 data with the aggregated crime data such that the granularity levels match (e.g. the Arleta 311 aggregate matches the Arleta crime data, or if you want to go deeper, Arleta 311 requests in June 2015 paired with Arleta crime in June 2015)
  • Conduct any statistical analysis you desire (see the sketch after this list). The simplest form is a regression analysis with crime as the dependent variable and any features you want to incorporate as independent variables, such as one-hot encodings of different request types, the number of 311 requests, the datetime, or other feature-engineered variables. Ultimately, we want to see the size of the effect and how well the data fits. From there, we can generate some insights as to which features are important and recommend that the NCs focus resources on those
  • Another thing to consider is the possibility of reverse causation - try using the number of 311 requests as the dependent variable. If you really want to go fancy, the Durbin-Wu-Hausman endogeneity test works too.
  • After the analysis, what are some actionable steps the NCs could take to improve the existing situation? If there are none, what are some existing patterns? What have you learned from this analysis?
  • After that, try to find a way to present your results, whether it's a scatter plot, a bar chart, or other dashboard visuals - whatever is most helpful for explaining the result to a five-year-old.
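To make the regression step concrete, here is a rough sketch assuming the datasets have already been aggregated to one row per NC (or NC-month) with a crime count and per-request-type counts; all column names and numbers below are made-up placeholders, not real data:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical aggregated data: one row per neighborhood council.
nc_agg = pd.DataFrame({
    "crime_count":      [120, 95, 143, 88, 200, 60],
    "bulky_items":      [300, 150, 410, 90, 520, 75],
    "graffiti_removal": [40, 22, 65, 18, 90, 12],
    "illegal_dumping":  [15, 9, 30, 7, 45, 5],
})

# OLS regression: crime count on the per-request-type count features.
X = sm.add_constant(nc_agg[["bulky_items", "graffiti_removal", "illegal_dumping"]])
y = nc_agg["crime_count"]

model = sm.OLS(y, X).fit()
print(model.summary())  # coefficients (effect sizes) and R-squared
```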

Again, feel free to ignore this - I hope this can serve as a reference and at least help generate some ideas!

joshuayhwu avatar Jul 30 '22 02:07 joshuayhwu

I didn't mean to open this pull request as a way to close the entire issue - just to add analysis to back up the hypothesis that work crews were seeing graffiti and cleaning it on the fly, which led to the 16% of records where the Service Date was before the Create date.

lgewirtz avatar Sep 29 '22 01:09 lgewirtz

9/28/2022

Decided to look at crime and call data from 2019, since that year is before lockdown and makes a good baseline to compare against.

  • Filtered to the year 2019 using the DATE OCC column; dropped the Date Rptd column and other columns that don't seem relevant.
  • Dropped 4773 duplicate rows

Cleaned up Call data from 2019

  • Dropped irrelevant columns
  • Dropped rows that had nulls in PolicePrecinct (1322), because I will be merging on that column
  • Dropped rows with ClosedDate in 2017, 2018 & 2109 (very small percentage)
  • Renamed 'PolicePrecinct' to 'AREA NAME' to match the crime dataset; also renamed values in the column so they all match

Currently having trouble merging the 2 datasets - MemoryError: output exceeds the size limit. Working on lowering the amount of memory the call data dataframe uses.
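One hedged way to shrink the call dataframe before the merge, assuming a CSV export (the file name and column list below are placeholders), is to load only the needed columns and store the repetitive strings as categoricals:

```python
import pandas as pd

calls = pd.read_csv(
    "311_calls_2019.csv",                                             # placeholder file name
    usecols=["CreatedDate", "PolicePrecinct", "RequestType"],         # keep only needed columns
    dtype={"PolicePrecinct": "category", "RequestType": "category"},  # shrink repeated strings
    parse_dates=["CreatedDate"],
)
print(calls.memory_usage(deep=True).sum() / 1e6, "MB")
```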

lgewirtz avatar Sep 29 '22 01:09 lgewirtz

@lgewirtz Hey Liz, thanks for the updates! A couple questions come into mind:

  • What is the percentage of duplicate rows? And which columns did you use to determine duplicate rows?
  • What are the columns that you kept?
  • Are there police precincts that exist in the requests data but not in the crime data, or vice versa?
  • What's the reasoning for dropping rows with ClosedDate in 2017/18/19?

I think both datasets are too big to merge as-is. I would suggest first aggregating each dataset at the level you're trying to merge on (e.g. group by month and precinct) and taking the total request count before merging. Would you mind uploading your code to your own GitHub repo so I can take a look?
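A minimal sketch of that aggregate-first approach; the toy frames below stand in for the real 2019 tables, and the column names are the ones used elsewhere in this thread:

```python
import pandas as pd

calls = pd.DataFrame({
    "PolicePrecinct": ["HARBOR", "HARBOR", "WEST LOS ANGELES"],
    "CreatedDate": pd.to_datetime(["2019-01-03", "2019-01-20", "2019-02-05"]),
})
crime = pd.DataFrame({
    "AREA NAME": ["HARBOR", "WEST LOS ANGELES", "WEST LOS ANGELES"],
    "DATE OCC": pd.to_datetime(["2019-01-10", "2019-02-01", "2019-02-14"]),
})

# Aggregate each table to (precinct, month) counts before merging.
calls_agg = (
    calls.assign(month=calls["CreatedDate"].dt.to_period("M"))
         .groupby(["PolicePrecinct", "month"]).size().reset_index(name="request_count")
)
crime_agg = (
    crime.assign(month=crime["DATE OCC"].dt.to_period("M"))
         .groupby(["AREA NAME", "month"]).size().reset_index(name="crime_count")
)

# Merge the small aggregated tables and check the correlation of the counts.
merged = calls_agg.merge(
    crime_agg, left_on=["PolicePrecinct", "month"], right_on=["AREA NAME", "month"]
)
print(merged[["request_count", "crime_count"]].corr())
```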

joshuayhwu avatar Oct 01 '22 15:10 joshuayhwu

Anupriya figured out that the memory error occurs because, when merging on the crime data column 'AREA NAME' and the 311 calls column 'PolicePrecinct', the 'AREA NAME' values repeat, so the merge tries to create too many rows.

I tried to fix this by merging on 2 columns. The second merge column was created from the crime data column 'Date Rptd' and the 311 calls column 'DateCreated'.

It did merge, but it still created too many rows: 38,176,837. The number of rows in the LA calls dataset for 2019 is 1,306,771 and the number of rows in the crime dataset for 2019 is 220,556.

Here is the notebook: Merge on Two Column Jupyter Notebook

The next step may be aggregating the LA calls dataset by neighborhood council. While aggregating the crime dataset by PolicePrecinct (or 'AREA NAME' before the rename) will fix the merge problem, a lot of the text-based information will be lost.

lgewirtz avatar Oct 06 '22 02:10 lgewirtz

@lgewirtz thanks for sharing the notebooks! In the future, would you mind tagging my name in your comments? Also, would you mind elaborating on why you want to merge the two datasets?

From the MergeTwoCol notebook, it seems like you attempted to join the two datasets by police precinct and it returned ~40 million rows. I noticed that you commented that this is too many rows, but whether it is depends on your end goal. Regardless, there are two options to reduce the size:

    1. Aggregation: group the 311 request data by police precinct and month and get the total number of requests, do the same with the crime data, then merge on police precinct and month. This enables you to run correlation / regression analysis.
    2. Sampling: take a random subset of the 311 requests and crime data (maybe 10%), then continue with your existing method.

The sampling approach could also apply to the spatial join, if that's your preferred method. The essential idea is to reduce the size of the original tables so that the combination of the two tables won't have as many rows as it currently does. Shoot me a message on Slack if you want to schedule a session to walk through it.
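A quick sketch of the sampling option, assuming the full 2019 exports are loaded from CSV (file names are placeholders) and keeping the existing merge:

```python
import pandas as pd

calls = pd.read_csv("311_calls_2019.csv")   # placeholder file names
crime = pd.read_csv("crime_2019.csv")

# Take a 10% random sample of each table before merging (frac and seed are arbitrary).
calls_sample = calls.sample(frac=0.10, random_state=42)
crime_sample = crime.sample(frac=0.10, random_state=42)

merged_sample = calls_sample.merge(
    crime_sample, left_on="PolicePrecinct", right_on="AREA NAME", how="inner"
)
print(len(merged_sample))
```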

joshuayhwu avatar Oct 14 '22 03:10 joshuayhwu

@joshuayhwu thanks for looking at my notebooks.

> Also, would you mind elaborating on why you want to merge the two datasets?

I want to merge the 311 Service Requests with the Crime Data so that I can analyze them together. Is there a different way to do it?

> From the MergeTwoCol Notebook, it seems like you attempted to join the two datasets by police Precinct and it return ~40 million rows. I noticed that you commented there are too many rows but it would depend on your end goal.

I'm not particularly worried about the size itself. My thought process is that if 2 data frames are merged, the resulting data frame shouldn't contain more rows than the larger data frame (unless it's a cross merge). So I interpreted the resulting ~40 million rows as an error. (Maybe I'm not using unique enough criteria? Should I be merging on more columns for a better merge key?) Please tell me if I'm wrong and my resulting data frame is correct!

> Aggregation: 311 request data via groupby police precinct and month and get the total number of requests, do the same with crime, then merge on police precinct and month. This enable you to run correlation / regression

This is definitely an option. My concern about this is losing too much categorical data in the groupby. Thoughts?

> Sampling: take a random subset of the 311 requests and crime data (maybe 10%) then continue with your existing method
>
> Sampling part could apply to the spatial join if that's your prefer method. The essential idea is to reduce the size of the original tables so that the combination of two tables won't have as many rows as it currently is.

My concern about this is that if the way I'm merging is incorrect, a smaller dataset won't fix the problem.

> Shoot me a message on slack if you want to schedule a session to walk through it.

Thank you very much for offering this. I keep feeling like I'm right on the edge, and Anupriya and I are tossing ideas back and forth. If I'm not done with this merge issue by the end of the weekend, are you free on Monday?

lgewirtz avatar Oct 14 '22 19:10 lgewirtz

@lgewirtz Thanks for the response! See comments below:

  1. Merging two tables to do analysis on both is fine. The next question is at what "granularity" you want to match the two tables. For instance, every row in the 311 request table represents one request, while every row in the crime table represents one crime record. Generally, we first aggregate the two tables onto the same level (e.g. police precinct, neighborhood council, city, or some long/lat aggregate) before merging them. I believe that's what you are trying to accomplish by joining on the police precincts.

  2. For the dataframe size, your intuition is correct only if you are doing a left join or right join. Refer here for the different types of join. The reason you have more rows in the resulting table than in the original tables is that every 311 request (or police precinct) can have multiple crime records with the same police precinct (i.e. a one-to-many relationship). Repeating that for every 311 request gets you 40 million rows - which is fine if that's the methodology you decide to settle on. For quicker computation or lower memory cost, however, I'd recommend you either aggregate the tables before joining or sample.

  3. For the cost of aggregation, there's always the concern of losing more granular data. One way to overcome this is to count the number of rows in which each category appears (e.g. the number of "felony" rows in the crime data or the number of "bulky item" requests); see the sketch after this list.

  4. The way you're merging is fine if you're trying to do a spatial join. For police precinct, you'd need to aggregate first, though.

  5. I'm available Monday evening after 6pm. Shoot me a message on Slack if you need help!
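A small sketch of point 3, assuming a crime dataframe with a precinct column and a crime-description column (the names and rows below are toy placeholders):

```python
import pandas as pd

crime = pd.DataFrame({
    "AREA NAME": ["HARBOR", "HARBOR", "WEST LOS ANGELES"],
    "Crm Cd Desc": ["BURGLARY", "VANDALISM", "BURGLARY"],
})

# Count how many rows of each category exist per precinct, so the granular
# categories survive the aggregation instead of being dropped.
category_counts = pd.crosstab(crime["AREA NAME"], crime["Crm Cd Desc"])
print(category_counts)
```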

joshuayhwu avatar Oct 15 '22 15:10 joshuayhwu

@joshuayhwu As suggested, I've used groupby and dummy variables on the ServiceRequest and Crime datasets for 2019. I used the following code: `dummydf = pd.get_dummies(df, columns=['dumVar1', 'dumVar2']).groupby(['mergeCol1', 'mergeCol2'], as_index=False).sum()`. Then I did an inner merge on 2 columns: CreatedDate and PolicePrecinct.
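An expanded, hedged version of that one-liner plus the merge step; the toy frames and the 'RequestType' / 'Crm Cd Desc' column names below are stand-ins for the real datasets, not the exact notebook code:

```python
import pandas as pd

def dummify_and_group(df, dummy_cols, key_cols):
    """One-hot encode the category columns, then sum the dummies per merge key."""
    return (
        pd.get_dummies(df, columns=dummy_cols, dtype=int)
          .groupby(key_cols, as_index=False)
          .sum(numeric_only=True)
    )

sr_df = pd.DataFrame({
    "CreatedDate": ["2019-01-03", "2019-01-03", "2019-02-10"],
    "PolicePrecinct": ["HARBOR", "HARBOR", "WEST LOS ANGELES"],
    "RequestType": ["Bulky Items", "Graffiti Removal", "Bulky Items"],
})
crime_df = pd.DataFrame({
    "CreatedDate": ["2019-01-03", "2019-02-10"],
    "PolicePrecinct": ["HARBOR", "WEST LOS ANGELES"],
    "Crm Cd Desc": ["BURGLARY", "VANDALISM"],
})

sr_grouped = dummify_and_group(sr_df, ["RequestType"], ["CreatedDate", "PolicePrecinct"])
crime_grouped = dummify_and_group(crime_df, ["Crm Cd Desc"], ["CreatedDate", "PolicePrecinct"])

# Inner merge on the two shared key columns, as in the notebooks above.
merged = pd.merge(sr_grouped, crime_grouped, on=["CreatedDate", "PolicePrecinct"], how="inner")
print(merged)
```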

  • Crime data - Dummy_Crime_2019.ipynb
  • Service Request data - Dummy_SR_2019.ipynb
  • Merge the files - Merge_SR_CR_Grpd_Dummy.ipynb

lgewirtz avatar Oct 27 '22 00:10 lgewirtz

@joshuayhwu Updated the Service Request notebook - Dummy_SR_2019 - so that NC_ID and zipcodes are not dummified. Then re-merged with the Crime dataset in the Merge_SR_CR_Grpd_Dummy notebook.

lgewirtz avatar Nov 02 '22 20:11 lgewirtz