Utku Demir

Results: 82 comments of Utku Demir

[Amazon Customer Reviews Dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html): 30-50 GB, TSV & snappy-compressed Parquet, available over HTTP & S3

[CommonCrawl](https://commoncrawl.org/the-data/get-started/): petabytes (partitioned by date), a mix of uncommon formats (WARC, WAT, WET, ...), available over HTTP & S3

[Global Database of Events, Language, and Tone (GDELT)](https://www.gdeltproject.org/)

It might be an interesting experiment to implement a dataset of Bitcoin transactions, if we have a way to process them in a partitioned way.
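
One natural partitioning would be fixed-size block-height ranges, so each worker fetches and decodes its own range independently (e.g. via a full node's RPC interface). A minimal sketch of the range computation; the chunk size and tip height are made-up parameters, and the actual fetching is left out:

```haskell
type BlockHeight = Int

-- Split the interval [0, tip] into chunks of 'chunkSize' consecutive
-- blocks; each range could back one partition of a Dataset.
blockRanges :: BlockHeight -> Int -> [(BlockHeight, BlockHeight)]
blockRanges tip chunkSize =
  [ (lo, min tip (lo + chunkSize - 1))
  | lo <- [0, chunkSize .. tip]
  ]

main :: IO ()
main =
  -- With ~800k blocks and 10k-block chunks we'd get ~80 partitions.
  mapM_ print (take 5 (blockRanges 800000 10000))
```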

https://www.data.gov/ has a lot of open datasets. They tend to be small in size, but there are probably exceptions.

Tons of taxi trips; the partitioning seems ideal for distributed-dataset. We'll have to investigate how performant the website is. https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
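
The trip records are published as one CSV per month, which maps naturally onto one partition per file. A sketch of enumerating the monthly URLs; the S3 URL pattern is my assumption about how the files were laid out at the time, so it should be checked against the TLC page before use:

```haskell
import Text.Printf (printf)

-- One file per month; each URL could back one partition.
-- NOTE: the URL pattern below is an assumption, not verified.
tripDataUrls :: Int -> [String]
tripDataUrls year =
  [ printf "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_%d-%02d.csv"
      year month
  | month <- [1 .. 12 :: Int]
  ]

main :: IO ()
main = mapM_ putStrLn (tripDataUrls 2018)
```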

PR #26 is working on this.

This looks great! I'll try to find a dataset online that uses Parquet, so we can have a use case and a good example. I guess most Parquet files in...

I found the [Amazon Customer Reviews Dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html), which is provided as partitioned, snappy-compressed Parquet files on S3. It's around 50 GB in total (compressed). I created https://github.com/utdemir/distributed-dataset/issues/27 to gather public datasets...

Yes, it shouldn't be that hard; we should just be able to take `distributed-dataset-aws-lambda` and rewrite the API calls for GCP. Probably the hardest thing would be to define the infrastructure...
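
To illustrate the shape of what a GCP port has to provide: conceptually, a backend takes a serialised closure, runs it on some remote function, and hands back the serialised result; the AWS implementation does this via a Lambda invocation, and a GCP one would call Cloud Functions instead. The `Backend` type below is a simplified stand-in I made up for illustration, not the actual type from `distributed-dataset`:

```haskell
import qualified Data.ByteString as BS

-- Simplified illustration: payload in, result out. The real library's
-- backend abstraction differs in the details.
newtype Backend = Backend
  { runClosure :: BS.ByteString -> IO BS.ByteString }

-- A local stand-in that just echoes the payload, useful for tests.
localEchoBackend :: Backend
localEchoBackend = Backend pure

main :: IO ()
main = do
  out <- runClosure localEchoBackend (BS.pack [1, 2, 3])
  print (BS.length out)
```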