
AmazonQA Data Size

Open jasonwu0731 opened this issue 4 years ago • 2 comments

Hi,

I have downloaded the Amazon data (38 files) and ran create_data.py with:

python amazon_qa/create_data.py --file_pattern AmazonQA/* --output_dir AmazonQA/processed/ --runner DirectRunner --temp_location AmazonQA/processed/temp --staging_location AmazonQA/processed/staging --dataset_format JSON

It results in 100 train*.json and 100 test*.json files under the AmazonQA/processed/ folder. After reading all the data, the training set has 158,974 samples and the test set has 16,763.
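For reference, a minimal sketch of one way to count the samples, assuming the JSON output is newline-delimited (one serialized example per line); the glob patterns and helper name are just illustrative, following the output_dir above:

```python
import glob

def count_lines(pattern):
    # Assumes each line of a JSON shard is one serialized example.
    total = 0
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            total += sum(1 for _ in f)
    return total

print("train:", count_lines("AmazonQA/processed/train*.json"))
print("test:", count_lines("AmazonQA/processed/test*.json"))
```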

What is the number of samples you used in the paper? 3M or 158.9K? I am confused because it is different from the number listed in the repo.

P.S. I saw that some filtering is done in the create_data.py file.

Below are some statistics of the conversational dataset:

Input files: 38
Number of QA dictionaries: 1,569,513
Number of tuples: 4,035,625
Number of de-duplicated tuples: 3,689,912
Train set size: 3,316,905
Test set size: 373,007
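A quick arithmetic check on these published numbers (just verifying their internal consistency, not part of the pipeline):

```python
dedup, train, test = 3_689_912, 3_316_905, 373_007
assert train + test == dedup                  # the split covers all de-duplicated tuples
print(f"test fraction: {test / dedup:.3f}")   # ~0.101, i.e. roughly a 90/10 split
```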

Thank you in advance for your kind reply.

jasonwu0731 avatar Mar 24 '20 09:03 jasonwu0731

Hi Jason, the training set size should be 3.3M. Maybe check that there are indeed 38 input files?

TOTAL: 38 objects, 1935927109 bytes (1.8 GiB)

I just re-ran the pipeline (with the Google Cloud DataflowRunner and JSON output) and can confirm these numbers. A quick check: wc -l data/test-00099-of-00100.json gives 3729.

matthen avatar Mar 24 '20 10:03 matthen

That's strange. I do have 38 files totalling around 1.8 GB. Could the issue be with using --runner DirectRunner?

When I ran wc AmazonQA/processed/test-00099-of-00100.json I got 167 6503 39507 AmazonQA/processed/test-00099-of-00100.json. I also found that my AmazonQA/processed/ folder is only 41 MB in total.
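As a rough consistency check on these numbers (a sketch using only figures from this thread):

```python
# The last test shard has 167 lines and there are 100 test shards, which lines
# up with the 16,763 test samples I counted -- internally consistent, but
# about 22x smaller than the expected test set size.
print(167 * 100)    # ~16,700  vs my 16,763 test samples
print(3729 * 100)   # ~372,900 vs the expected 373,007 test set size
```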

Thanks for helping.

jasonwu0731 avatar Mar 26 '20 07:03 jasonwu0731