Quentin Lhoest

Results: 416 comments of Quentin Lhoest

Hi @BenoitDalFerro, how do you load your dataset?

I wasn't able to reproduce this on a toy dataset of around 300GB:

```python
import datasets as ds

s = ds.load_dataset("squad", split="train")
s4000 = ds.concatenate_datasets([s] * 4000)
print(ds.utils.size_str(s4000.data.nbytes))  # '295.48...
```

Just tried on Google Colab and got ~1min for a 15GB dataset (only 200 times SQuAD), while it should be instantaneous. The time is spent reading the Apache Arrow table...

Unfortunately no. Thanks for running the benchmark though, it shows that your machine does a lot of read operations. This is not expected: on other machines it does almost no...

Hi ! JSON files containing a list of objects are not supported yet, you can use JSON Lines files instead in the meantime:

```json
{"text": "can I know this?", "intent":...
```

Yes, I think it should raise an error. Currently it looks like it instantiates a custom configuration with the name given by the user: https://github.com/huggingface/datasets/blob/ba27ce33bf568374cf23a07669fdd875b5718bc2/src/datasets/builder.py#L391-L397

Thanks for reporting. I think this can be fixed by improving the `CachedDatasetModuleFactory` and making it look into the `parquet` cache directory (datasets from `push_to_hub` are loaded with the parquet...

We haven't had a chance to fix this yet. If someone would like to give it a try, I'd be happy to give some guidance.

`importable_directory_path` is used to find a **dataset script** that was previously downloaded and cached from the Hub. However, in your case there's no dataset script on the Hub, only parquet...

Hi ! Yes, you can save your dataset locally with `my_dataset.save_to_disk("path/to/local")` and reload it later with `load_from_disk("path/to/local")`. (Removing myself from assignees since I'm not working on this right now.)