Quentin Lhoest

Results: 416 comments of Quentin Lhoest

Hi @BenoitDalFerro, how do you load your dataset?

I wasn't able to reproduce this on a toy dataset of around 300GB:

```python
import datasets as ds

s = ds.load_dataset("squad", split="train")
s4000 = ds.concatenate_datasets([s] * 4000)
print(ds.utils.size_str(s4000.data.nbytes))  # '295.48...
```

Just tried on Google Colab and got ~1min for a 15GB dataset (only 200 times SQuAD), while it should be instantaneous. The time is spent reading the Apache Arrow table...

Unfortunately no. Thanks for running the benchmark though, it shows that your machine does a lot of read operations. This is not expected: on other machines it does almost no...

Hi ! JSON files containing a list of objects are not supported yet, you can use JSON Lines files instead in the meantime:

```json
{"text": "can I know this?", "intent":...
```

Yes, I think it should raise an error. Currently it looks like it instantiates a custom configuration with the name given by the user: https://github.com/huggingface/datasets/blob/ba27ce33bf568374cf23a07669fdd875b5718bc2/src/datasets/builder.py#L391-L397

Thanks for reporting. I think this can be fixed by improving the `CachedDatasetModuleFactory` and making it look into the `parquet` cache directory (datasets from `push_to_hub` are loaded with the parquet...

We haven't had a chance to fix this yet. If someone would like to give it a try, I'd be happy to give some guidance.

`importable_directory_path` is used to find a **dataset script** that was previously downloaded and cached from the Hub. However, in your case there's no dataset script on the Hub, only parquet...

Hi ! Yes, you can save your dataset locally with `my_dataset.save_to_disk("path/to/local")` and reload it later with `load_from_disk("path/to/local")`. (Removing myself from assignees since I'm not working on this right now.)