astro-sdk

Introduce multiple file type options for a dataset of a specific size.

Open utkarsharma2 opened this issue 3 years ago • 0 comments

Please describe the feature you'd like to see

Currently, we have only one dataset option per file type and size. We should have multiple options for a particular size, since there are cases where a dataset is not consumable by the native or default path, which causes problems when running the benchmark.

Describe the solution you'd like

If we have multiple dataset options for each size, we can probably run the benchmarking script smoothly.

Ideal datasets

  1. Clean dataset
  2. Available in all supported file formats
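
To illustrate the request, here is a minimal sketch of keeping several candidate datasets per size and falling back to the next one when a load fails. The `DATASETS` catalog, the `"trimmed"` size label, and the `run_with_fallback` helper are hypothetical names chosen for illustration and are not part of the benchmarking script; `load` stands in for whatever function the benchmark uses to load a file, and the two paths are the ones from Case 1 below.

```python
from typing import Callable, Dict, List

# Hypothetical catalog: several candidate datasets per size label.
# The "trimmed" label is an assumption taken from the S3 prefix.
DATASETS: Dict[str, List[str]] = {
    "trimmed": [
        "s3://astro-sdk-test/benchmark/trimmed/stackoverflow/stackoverflow_posts.ndjson",
        "s3://astro-sdk-test/benchmark/trimmed/test/stackoverflow_posts.ndjson",
    ],
}


def run_with_fallback(
    size: str,
    load: Callable[[str], object],
    catalog: Dict[str, List[str]] = DATASETS,
) -> str:
    """Try each candidate dataset for `size` in order; return the first path that loads."""
    errors: List[str] = []
    for path in catalog[size]:
        try:
            load(path)  # stand-in for the real load step (default or native path)
            return path
        except Exception as exc:  # record the failure and try the next candidate
            errors.append(f"{path}: {exc}")
    raise RuntimeError(
        f"No usable dataset for size '{size}':\n" + "\n".join(errors)
    )
```

With a catalog like this, a dataset that one load path cannot consume would be skipped rather than stopping the benchmark run, and the collected errors still show which datasets need cleaning.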

Steps to Reproduce

Case 1 - default path:

Using the default path, pandas complained about the dataset s3://astro-sdk-test/benchmark/trimmed/stackoverflow/stackoverflow_posts.ndjson. It worked with s3://astro-sdk-test/benchmark/trimmed/test/stackoverflow_posts.ndjson.

Case 2 - Native path: Using the dataset below resulted in an invalid argument error in the BigQuery Data Transfer Service: s3://astro-sdk-test/benchmark/trimmed/pypi/

Case 3 - Native path: Using the dataset below resulted in an invalid argument error in the BigQuery Data Transfer Service: s3://astro-sdk-test/benchmark/trimmed/github/github-archive/

Run the benchmarking script with one of the above-mentioned datasets, using a BigQuery table as the output table.

utkarsharma2, Jul 26 '22 12:07