PyTorchNLPBook

YELP raw_train.csv file no longer available on Google Drive, please provide alternate source

Open richlysakowski opened this issue 3 years ago • 2 comments

raw_train.csv

https://drive.google.com/open?id=1xeUnqkhuzGGzZKThzPeXe2Vf6Uu_g_xM gives a 404 error

Please provide an updated link to the exact dataset used in the book, or to an entirely new set of Yelp CSV-formatted datasets (train, test, and reviews_with_splits_lite).

richlysakowski, Nov 13 '22 17:11

@richlysakowski -- I had the same problem. I think this one on Kaggle is identical -- that's what I'm going to use: https://www.kaggle.com/datasets/ilhamfp31/yelp-review-dataset

ajhergenroeder, Feb 06 '23 02:02

@richlysakowski Here's what worked for me in a Jupyter notebook (Google Colab, June 2023). First, make sure ~/.kaggle/kaggle.json exists with 600 permissions.

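from pathlib import Path

# Paste the contents of the kaggle.json API token downloaded from Kaggle.com
creds = 'your JSON credentials from Kaggle.com'
cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    # parents=True avoids an error if an intermediate directory is missing
    cred_path.parent.mkdir(parents=True, exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)  # owner read/write only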
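
If you would rather build the credentials file from its two fields than paste the raw JSON, here is a minimal sketch. It assumes the usual kaggle.json layout with a "username" and a "key" field; the helper names write_kaggle_credentials and is_owner_only are hypothetical, not part of the Kaggle package, and the permission check assumes a POSIX system such as Colab.

```python
import json
import stat
from pathlib import Path


def write_kaggle_credentials(username: str, key: str, cred_path: Path) -> Path:
    """Write a kaggle.json-style file that only its owner can read or write."""
    cred_path.parent.mkdir(parents=True, exist_ok=True)
    # Assumed layout: {"username": ..., "key": ...}
    cred_path.write_text(json.dumps({"username": username, "key": key}))
    cred_path.chmod(0o600)  # owner read/write only
    return cred_path


def is_owner_only(cred_path: Path) -> bool:
    """Return True if neither group nor others have any access to the file."""
    mode = stat.S_IMODE(cred_path.stat().st_mode)
    return mode & 0o077 == 0
```

You could call it as write_kaggle_credentials('your_username', 'your_api_key', Path('~/.kaggle/kaggle.json').expanduser()) and then confirm the result with is_owner_only before authenticating.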

Then download directly through the Kaggle API:

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

dataset_slug = 'ilhamfp31/yelp-review-dataset'
api.dataset_download_files(dataset_slug, unzip=True)

You may have to rename a few files and folders:

mkdir -p data/yelp
mv yelp_review_polarity_csv/* data/yelp/
mv data/yelp/test.csv data/yelp/raw_test.csv
mv data/yelp/train.csv data/yelp/raw_train.csv
rm -r yelp_review_polarity_csv/
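
If you prefer to stay inside the notebook, the same renames can be sketched in Python. This assumes the archive unpacked into a yelp_review_polarity_csv folder containing train.csv and test.csv, as in the shell commands above; arrange_yelp_files is a hypothetical helper name. Unlike the shell glob, it also moves hidden files.

```python
import shutil
from pathlib import Path


def arrange_yelp_files(base: Path = Path(".")) -> Path:
    """Move the unpacked Kaggle files into data/yelp and add the raw_ prefix."""
    src = base / "yelp_review_polarity_csv"
    dest = base / "data" / "yelp"
    dest.mkdir(parents=True, exist_ok=True)

    # Move every entry from the unpacked folder into data/yelp
    for item in src.iterdir():
        shutil.move(str(item), str(dest / item.name))

    # Rename train.csv and test.csv to raw_train.csv and raw_test.csv
    for name in ("train", "test"):
        (dest / f"{name}.csv").rename(dest / f"raw_{name}.csv")

    src.rmdir()  # the source folder is empty at this point
    return dest
```

Calling arrange_yelp_files() with no argument works relative to the current working directory.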

You should be able to run the rest of the Yelp notebooks as normal.
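
Before running the notebooks, a quick sanity check on the renamed files can catch a bad download. This is a minimal sketch that assumes the CSVs have no header row and carry the class label in the first column and the review text in the second; check your own copy, since that layout is an assumption here. label_counts is a hypothetical helper name.

```python
import csv
from collections import Counter
from pathlib import Path
from typing import Optional


def label_counts(csv_path: Path, limit: Optional[int] = None) -> Counter:
    """Count first-column labels, optionally reading only the first `limit` rows."""
    counts: Counter = Counter()
    with csv_path.open(newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.reader(f)):
            if limit is not None and i >= limit:
                break
            if not row:  # skip blank lines
                continue
            counts[row[0]] += 1
    return counts
```

For example, label_counts(Path('data/yelp/raw_train.csv'), limit=1000) gives a quick look at which labels appear without reading the whole file.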

photomz, Jun 11 '23 02:06