
NLP analysis of Reddit data

Open gregpawin opened this issue 4 years ago • 7 comments

Dependencies

None

Overview

To help understand the needs of people living in Los Angeles, one way to gather more user data is to analyze discussion boards such as Reddit. The reddit-analysis branch contains tools to help download Reddit data about parking issues in Los Angeles.

Action Items

  • [x] Download the reddit-analysis branch, or download and set up PRAW directly (a minimal sketch follows this list).
  • [ ] Try out different search terms to optimize retrieval of relevant information about people's parking issues in Los Angeles.
  • [x] Document exploratory data analysis using Jupyter notebooks:
    • [ ] Starter code: https://github.com/hackforla/lucky-parking/blob/reddit-analysis/notebooks/1.0-gp-initial_eda.ipynb
  • [x] Create some code to clean the data
  • [ ] Do some classic NLP analysis, e.g., TF-IDF (a sketch follows the Resources/Instructions section).
  • [ ] (Optional) Use more modern NLP toolsets, e.g., spaCy.
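
For the PRAW setup and search-term items above, here is a minimal sketch of pulling parking-related posts. The credentials are placeholders, and the subreddit and query are examples rather than anything prescribed by this issue:

```python
# Minimal PRAW sketch: fetch post titles matching a search term.
# client_id/client_secret are placeholders -- create an app at
# https://www.reddit.com/prefs/apps to get real values.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder
    client_secret="YOUR_CLIENT_SECRET",  # placeholder
    user_agent="lucky-parking-nlp/0.1",  # any descriptive string works
)

subreddit = reddit.subreddit("LosAngeles")
for submission in subreddit.search("parking", time_filter="all", limit=25):
    print(submission.title)
```

Swapping the query string ("parking", "parking ticket", "street sweeping", ...) is the quickest way to experiment with the search-term action item.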

Resources/Instructions

  • Reddit-analysis branch
  • Starter notebook
  • Python Reddit API Wrapper
  • How to use PRAW
  • NLP Cleaning
  • Wikipedia
  • Using NLTK
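
For the classic TF-IDF action item, a minimal scikit-learn sketch; the sample titles are invented for illustration:

```python
# Minimal TF-IDF sketch over a handful of post titles (illustrative data).
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "Street parking near downtown is impossible after 6pm",
    "Got a parking ticket on street sweeping day",
    "Where do you park overnight in Koreatown?",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(titles)

# Print each term's weight in the first document.
for term, weight in zip(vectorizer.get_feature_names_out(), tfidf.toarray()[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```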

gregpawin · Dec 28 '20

Created two Reddit-scraping functions and began gathering a collection of subreddits/keywords of interest for TF-IDF analysis. Requested subreddit/keyword suggestions from the group, which can be added to this google sheet. Will be taking a short break and will return on 07/07.
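
The two functions themselves are not shown in this thread; a hypothetical sketch of what a subreddit/keyword scraper might look like (the function name, columns, and use of pandas are assumptions, and `reddit` is an authenticated praw.Reddit instance):

```python
# Hypothetical sketch: collect posts matching a keyword into a DataFrame
# so the text can feed a later TF-IDF analysis.
import pandas as pd

def scrape_keyword(reddit, subreddit_name, keyword, limit=100):
    """Return a DataFrame of posts matching `keyword` in `subreddit_name`."""
    rows = []
    for post in reddit.subreddit(subreddit_name).search(keyword, limit=limit):
        rows.append({
            "subreddit": subreddit_name,
            "keyword": keyword,
            "title": post.title,
            "selftext": post.selftext,
            "created_utc": post.created_utc,
        })
    return pd.DataFrame(rows)
```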

KarinaLopez19 · Jun 22 '21

@KarinaLopez19 @zhao-li-github This issue has not had an update since 2021-08-03. If you are no longer working on this issue, please let us know. If you can give any closing comments on why work on this issue stopped, or add any notes that never made it into the issue, we would appreciate it. If you are still working on the issue, please provide an update using these guidelines:

  1. Progress: "What is the current status of your project? What have you completed and what is left to do?"
  2. Blockers: "Difficulties or errors encountered."
  3. Availability: "How much time will you have this week to work on this issue?"
  4. ETA: "When do you expect this issue to be completed?"
  5. Pictures (if necessary): "Add any pictures that will help illustrate what you are working on."

ExperimentsInHonesty · Mar 10 '22

@gregpawin Please reformat the Overview on this issue to conform to our new template for Lucky Parking:

### Dependencies
ANY ISSUE NUMBERS THAT ARE BLOCKERS OR OTHER REASONS WHY THIS WOULD LIVE IN THE ICEBOX

### Overview
WE NEED TO DO X FOR Y REASON

### Action Items
A STEP BY STEP LIST OF ALL THE TASK ITEMS THAT YOU CAN THINK OF NOW. EXAMPLES INCLUDE: Research, reporting, etc.

### Resources/Instructions
REPLACE THIS TEXT - If there is a website with documentation that helps with this issue, provide the link(s) here.

ExperimentsInHonesty · Apr 29 '22

@gregpawin Is this issue still in progress? I'm looking for projects for the Data Science team and this looks like a good one to assign if it's available.

akhaleghi · Oct 27 '22

→ Used stemming to reduce the bag-of-words to word stems → used this list of stems to visualise the top 20 words.
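
The stemming code itself isn't in the thread; a minimal sketch of that step, assuming NLTK's PorterStemmer and stopword list:

```python
# Sketch: stem tokens and count the most common stems (assumes NLTK).
from collections import Counter
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

text = "Parking near the parked cars was impossible; parkers kept circling."
tokens = [t.strip(".,;!?").lower() for t in text.split()]
stems = [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stops]
print(Counter(stems).most_common(20))  # top-20 stems for visualisation
```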

→ `parking_subreddit = subreddit.search('parking', time_filter='all')` As suggested, I tried changing the time filter to 'all' to get older Reddit data. I did not see much change in the output (top 20 words); the words mostly point to a shooting incident rather than parking issues.

→ Used different search criteria to scrape Reddit data and then visualise the top 20 words: `parking_subreddit = subreddit.search('vandwellers', time_filter='all')`. Results seem relevant to the search criteria.

→ Going through the spaCy tutorial and redoing the bag-of-words and TF-IDF using spaCy.
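
For reference, a minimal spaCy sketch of the same preprocessing (assumes the en_core_web_sm model is installed via `python -m spacy download en_core_web_sm`):

```python
# Sketch: lemmatize with spaCy and count lemmas (bag-of-words).
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Parking near the parked cars was impossible; parkers kept circling."
doc = nlp(text)
lemmas = [tok.lemma_.lower() for tok in doc if tok.is_alpha and not tok.is_stop]
print(Counter(lemmas).most_common(20))
```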

PratibhaNagesh · Mar 17 '23

Progress: Removed the park names and related titles from the list of words. Working on n-grams. Blockers: Getting parking-related words as the top words for further analysis. Availability: 6 hrs. ETA: This week.
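
The n-gram step can be sketched with NLTK's `ngrams` helper (the tokens are illustrative):

```python
# Sketch: unigrams, bigrams, and trigrams from a token list (NLTK).
from nltk.util import ngrams

tokens = ["street", "parking", "ticket", "street", "sweeping", "day"]
for n in (1, 2, 3):
    print(n, list(ngrams(tokens, n)))
```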

PratibhaNagesh · Mar 26 '23

Progress: Generated n-grams (unigrams, bigrams, and trigrams) before the stopwords were removed. Blockers: Working on understanding LDA topic modelling. Availability: 6 hrs. ETA: This week.
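
A minimal LDA topic-modelling sketch with gensim; the toy corpus and parameters are illustrative, not from this project:

```python
# Sketch: LDA topic modelling over a toy tokenized corpus (assumes gensim).
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["parking", "ticket", "street", "sweeping"],
    ["overnight", "parking", "permit", "neighborhood"],
    ["van", "dwellers", "parking", "overnight"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)
for topic_id, terms in lda.print_topics(num_words=5):
    print(topic_id, terms)
```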

PratibhaNagesh · Apr 07 '23