Play around with the Yelp dataset in Python (work in progress; the repo is still very messy)
Yelp Dataset Challenge for Python
Repository for downloading and reading the Yelp Dataset Challenge
round 6 data in Pandas pickle format. This repository makes it easy for anyone who wants to explore the Yelp data using Python.
The yelp_util Python package provides the read and download functions.
Datasets repository
The S3 bucket is structured as follows:
science-of-science-bucket
└─yelp_academic_dataset
  ├───yelp_academic_dataset_business.pickle (61k rows)
  ├───yelp_academic_dataset_review.pickle (1.5M rows)
  ├───yelp_academic_dataset_user.pickle (366k rows)
  ├───yelp_academic_dataset_checkin.pickle (45k rows)
  └───yelp_academic_dataset_tip.pickle (495k rows)
You can download the data directly from the AWS S3 bucket as follows:
import yelp_util
yelp_util.download(file_list=["yelp_academic_dataset_business.pickle",
                              "yelp_academic_dataset_review.pickle",
                              "yelp_academic_dataset_user.pickle",
                              "yelp_academic_dataset_checkin.pickle",
                              "yelp_academic_dataset_tip.pickle"])
The files will be downloaded to the data folder. Once the download has finished, you can read
a pickle as follows:
import pandas as pd
review = pd.read_pickle('data/yelp_academic_dataset_review.pickle')
review.head()
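Once loaded, the review table is a regular DataFrame. As a minimal sketch, the snippet below counts reviews per star rating. It uses a small made-up frame with the same `stars` and `text` column names so that it runs without the download; the helper `star_distribution` is hypothetical and not part of yelp_util.

```python
import pandas as pd


def star_distribution(review):
    """Return the number of reviews for each star rating, sorted by rating."""
    return review["stars"].value_counts().sort_index()


# Tiny stand-in for the real review table (made-up rows, same column names).
toy_review = pd.DataFrame({
    "stars": [5, 4, 5, 1, 4, 5],
    "text": ["great", "good", "loved it", "bad", "nice", "superb"],
})

dist = star_distribution(toy_review)
print(dist)
```

With the real data, `star_distribution(review)` should work the same way, assuming the `stars` column is numeric.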
Structure of Datasets
User: table of user information (366k rows)
| average_stars | compliments | elite | fans | friends | name | review_count | type | user_id | votes | yelping_since |
|---|---|---|---|---|---|---|---|---|---|---|
Business: table of businesses with their location and city (61k rows)
| attributes | business_id | categories | city | full_address | hours | latitude | longitude | name | neighborhoods | open | review_count | stars | state | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Review: reviews made by users (1.5M rows)
| business_id | date | review_id | stars | text | type | user_id | votes_cool | votes_funny | votes_useful |
|---|---|---|---|---|---|---|---|---|---|
Checkin: check-in table (45k rows)
| business_id | checkin_info | type |
|---|---|---|
Tip: tip table (495k rows)
| business_id | date | likes | text | type | user_id |
|---|---|---|---|---|---|
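The tables share key columns: `business_id` appears in Business, Review, Checkin, and Tip, and `user_id` appears in User, Review, and Tip. As a sketch of how they can be combined, the example below joins made-up review rows to made-up business rows and averages the review stars per city. The toy data and the helper name `mean_stars_by_city` are illustrative assumptions.

```python
import pandas as pd


def mean_stars_by_city(review, business):
    """Join reviews to businesses on business_id and average review stars per city."""
    # Keep only the key and the city so the business-level 'stars' column
    # does not clash with the review-level one.
    merged = review.merge(
        business[["business_id", "city"]], on="business_id", how="inner"
    )
    return merged.groupby("city")["stars"].mean()


# Made-up rows that only mimic the column names of the real tables.
toy_business = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "city": ["Phoenix", "Las Vegas"],
})
toy_review = pd.DataFrame({
    "business_id": ["b1", "b1", "b2"],
    "stars": [4, 2, 5],
})

city_means = mean_stars_by_city(toy_review, toy_business)
print(city_means)
```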
Cluster businesses according to how they are tagged
Read the business data:
import pandas as pd
import yelp_util
business = pd.read_pickle('data/yelp_academic_dataset_business.pickle')
tags = business.categories.tolist()
Then transform the tags into a count matrix:
tag_countmatrix = yelp_util.taglist_to_matrix(tags)
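The internals of `yelp_util.taglist_to_matrix` are not shown here. As a rough illustration of what a tag count matrix is, the sketch below builds one by hand: one row per business, one column per distinct tag, and a count in each cell. The function `tags_to_count_matrix` is a hypothetical stand-in, and the real helper may return a different type (for example a sparse matrix).

```python
def tags_to_count_matrix(tag_lists):
    """Return (matrix, vocabulary) where matrix[i][j] counts tag j in business i."""
    # Sorted vocabulary gives every distinct tag a stable column index.
    vocabulary = sorted({tag for tags in tag_lists for tag in tags})
    column = {tag: j for j, tag in enumerate(vocabulary)}
    matrix = []
    for tags in tag_lists:
        row = [0] * len(vocabulary)
        for tag in tags:
            row[column[tag]] += 1
        matrix.append(row)
    return matrix, vocabulary


# Made-up category lists in the same shape as business.categories.tolist().
toy_tags = [["Bars", "Nightlife"], ["Pizza", "Restaurants"], ["Bars", "Restaurants"]]
matrix, vocabulary = tags_to_count_matrix(toy_tags)
print(vocabulary)
print(matrix)
```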
The count matrix can then be used to cluster businesses:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3)
km.fit(tag_countmatrix)
business['cluster'] = km.predict(tag_countmatrix)
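To interpret the clusters, it helps to look at the most frequent tags in each one. The sketch below does this with the standard library only; `top_tags_per_cluster` is a hypothetical helper, and the labels and tags are made up rather than real KMeans output.

```python
from collections import Counter, defaultdict


def top_tags_per_cluster(labels, tag_lists, n=2):
    """Map each cluster label to its n most common tags."""
    counters = defaultdict(Counter)
    for label, tags in zip(labels, tag_lists):
        counters[label].update(tags)
    return {label: [tag for tag, _ in counter.most_common(n)]
            for label, counter in counters.items()}


# Made-up labels and tags standing in for business['cluster'] and tags.
toy_labels = [0, 0, 1, 1]
toy_tags = [["Bars", "Nightlife"], ["Bars", "Pubs"],
            ["Pizza", "Restaurants"], ["Pizza", "Italian"]]
summary = top_tags_per_cluster(toy_labels, toy_tags, n=1)
print(summary)
```

With the real data, a call such as `top_tags_per_cluster(business['cluster'], tags)` would give a quick, if crude, label for each cluster.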
Train word2vec model
review = pd.read_pickle('data/yelp_academic_dataset_review.pickle')
yelp_review_sample = list(review.text.iloc[10000:20000])
model = yelp_util.create_word2vec_model(yelp_review_sample) # word2vec model
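Word2vec trains on tokenized sentences, so the raw review text has to be split first. The internals of `yelp_util.create_word2vec_model` are not shown here; the sketch below only illustrates that preprocessing step, using a simple regular expression in place of nltk's punkt tokenizer. `tokenize_reviews` is a hypothetical helper.

```python
import re


def tokenize_reviews(texts):
    """Split each review into sentences, then each sentence into lowercase tokens."""
    sentences = []
    for text in texts:
        # Naive sentence split on ., ! or ? (punkt handles abbreviations better).
        for sentence in re.split(r"[.!?]+", text):
            tokens = re.findall(r"[a-z0-9']+", sentence.lower())
            if tokens:
                sentences.append(tokens)
    return sentences


toy_reviews = ["Great pizza! Will come back.", "Service was slow."]
sentences = tokenize_reviews(toy_reviews)
print(sentences)

# The token lists could then be passed to gensim, for example:
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences, min_count=1)
```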
Django runserver
The Django project lives in the random_reviews folder. Get started by running python manage.py migrate.
Then, to run the project on your local machine (mainly useful for customizing the CSS files), use python manage.py runserver.
Dependencies
- pandas
- scikit-learn
- nltk with punkt (nltk.download('punkt'))
- gensim
- unidecode