conversational-datasets
AmazonQA Data Size
Hi,
I have downloaded the Amazon data (38 files) and ran create_data.py with:
python amazon_qa/create_data.py --file_pattern AmazonQA/* --output_dir AmazonQA/processed/ --runner DirectRunner --temp_location AmazonQA/processed/temp --staging_location AmazonQA/processed/staging --dataset_format JSON
It produces 100 train*.json and 100 test*.json files under the AmazonQA/processed/ folder. After reading all the data in, I count 158,974 samples in the training set and 16,763 in the test set.
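In case it matters, this is roughly how I am counting the samples (a quick sketch, assuming each output shard is JSON-lines with one example per line, and using my local output paths):

import glob

def count_examples(pattern):
    # Each shard is newline-delimited JSON, so the number of non-empty
    # lines equals the number of serialized examples.
    total = 0
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            total += sum(1 for line in f if line.strip())
    return total

print("train:", count_examples("AmazonQA/processed/train-*-of-00100.json"))
print("test:", count_examples("AmazonQA/processed/test-*-of-00100.json"))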
What is the number of samples you used in the paper? 3M or 158.9K? I am confused because it is different from the number listed in the repo.
P.S. I saw that some filtering is applied in the create_data.py file.
For reference, below are the statistics listed in the repo for this dataset:
Input files: 38
Number of QA dictionaries: 1,569,513
Number of tuples: 4,035,625
Number of de-duplicated tuples: 3,689,912
Train set size: 3,316,905
Test set size: 373,007
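(For what it's worth, 373,007 / 3,689,912 ≈ 10.1%, and my own counts give 16,763 / (158,974 + 16,763) ≈ 9.5%, so the train/test split ratio looks consistent; it is the overall volume that is roughly 20x smaller than expected.)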
Thank you in advance for your kind reply.
Hi Jason,
The training set size should be 3.3M. Maybe check that there are indeed 38 input files? For reference, my input is:
TOTAL: 38 objects, 1935927109 bytes (1.8 GiB)
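If it helps, something along these lines should confirm the input side (just a sketch, assuming the raw files sit in a local AmazonQA/ directory matching the --file_pattern above):

import glob
import os

# Count the raw input files and their total size, to compare against the
# expected 38 objects / ~1.8 GiB.
paths = [p for p in glob.glob("AmazonQA/*") if os.path.isfile(p)]
total_bytes = sum(os.path.getsize(p) for p in paths)
print("files:", len(paths))
print("total bytes: %d (%.2f GiB)" % (total_bytes, total_bytes / 2**30))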
I just re-ran the pipeline (with the Google Cloud DataflowRunner and JSON output) and can confirm these numbers. A quick check is
wc -l data/test-00099-of-00100.json
which gives 3729.
That's strange. I do have 38 files totalling around 1.8G. So could it be an issue with using --runner DirectRunner?
When I run
wc AmazonQA/processed/test-00099-of-00100.json
I get
167 6503 39507 AmazonQA/processed/test-00099-of-00100.json
so that shard has only 167 lines (roughly 16.7K test examples across the 100 shards, which matches my count above). Also, my AmazonQA/processed/ folder is only 41M in total.
Thanks for helping.