Hwijeen Ahn comments

Results: 8 comments of Hwijeen Ahn

Slow dataloading with big datasets issue persists

Hi @lhoestq, here is the result. I additionally measured time to `load_from_disk`:

* 60GB
```
loading took: 22.618776321411133
ramdom indexing 100 times took: 0.10214924812316895
```
* 600GB
```
loading took:...
```
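For readers who want to take the same two measurements on their own data, here is a minimal sketch of how they might be taken. It is not the script used in the comment above: the helper names `time_call` and `time_random_indexing` are hypothetical, and the stand-in loader is a plain Python list so the sketch runs without any dataset on disk.

```python
import random
import time


def time_call(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def time_random_indexing(dataset, n=100, seed=0):
    """Time n random single-item lookups on anything with len() and integer indexing."""
    rng = random.Random(seed)
    # Draw the indices up front so only the lookups themselves are timed.
    indices = [rng.randrange(len(dataset)) for _ in range(n)]
    start = time.perf_counter()
    for i in indices:
        _ = dataset[i]
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in loader (assumption): a list built in memory.
    # To measure a saved dataset instead, something like
    #   from datasets import load_from_disk
    #   data, load_s = time_call(load_from_disk, "path/to/dataset")
    # should work, where the path is a placeholder.
    data, load_s = time_call(lambda: list(range(1_000_000)))
    print(f"loading took: {load_s}")
    print(f"random indexing 100 times took: {time_random_indexing(data)}")
```

Fixing the seed keeps the sampled indices identical across runs, which makes the 60GB and 600GB timings easier to compare.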

Slow dataloading with big datasets issue persists

Here are some details of my 600GB dataset. This is the dataset AFTER the `map` function has been applied; once I load this dataset, I do not use `map` anymore in the...

Slow dataloading with big datasets issue persists

Regarding the environment, I am running the code on a cloud server. Here is some info:

```
Ubuntu 18.04.5 LTS # cat /etc/issue
pyarrow 3.0.0 # pip list | grep...
```

Slow dataloading with big datasets issue persists

I am not sure how I could provide you with reproducible code, since the problem only arises when the data is big. For the moment, I will share the...

Slow dataloading with big datasets issue persists

Hi! I just ran the same code with two different datasets (one is 60 GB and the other 600 GB), and the latter runs much slower. The ETA differs by 10x.

Slow dataloading with big datasets issue persists

Hmm that's different from what I got. I was on Ubuntu when reporting the initial issue.

Data Format

Could you give a more detailed explanation of the data format for dependency parsing? You have already provided an example, but I am still not clear on what each column means....

Two tfevent files are being generated for each run of trainer

I am also facing this issue. As a workaround, in case somebody is really annoyed by this, I use this command to delete the duplicated directory:

```bash
find . -type d -name "*[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].[0-9][0-9][0-9][0-9][0-9][0-9]"...
```
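The same name match can be sketched in Python for anyone who would rather preview the directories before touching them. This is only an illustration under assumptions: the rest of the `find` command is cut off above, so the removal step here is inferred from the prose ("delete the duplicated directory"), and the helpers `find_stamped_dirs` and `remove_dirs` are hypothetical names. Removal is off by default.

```python
import shutil
from fnmatch import fnmatchcase
from pathlib import Path

# Same glob as the -name test above: any prefix, 10 digits, a dot, 6 digits.
PATTERN = "*" + "[0-9]" * 10 + "." + "[0-9]" * 6


def find_stamped_dirs(root):
    """Return real (non-symlink) directories under root whose name matches PATTERN."""
    return sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_dir() and not p.is_symlink() and fnmatchcase(p.name, PATTERN)
    )


def remove_dirs(dirs, dry_run=True):
    """Print what would be removed; delete only when dry_run is False."""
    for d in dirs:
        if dry_run:
            print(f"would remove {d}")
        else:
            # ignore_errors covers a match nested inside an already removed match.
            shutil.rmtree(d, ignore_errors=True)


if __name__ == "__main__":
    remove_dirs(find_stamped_dirs("."))  # dry run: only lists the matches
```

Running it as is only lists the matches; passing `dry_run=False` is what actually deletes them, so it is worth checking the printed list first.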