Hwijeen Ahn

8 comments by Hwijeen Ahn

Hi @lhoestq, here is the result. I additionally measured the time taken by `load_from_disk`: * 60GB ``` loading took: 22.618776321411133 ramdom indexing 100 times took: 0.10214924812316895 ``` * 600GB ``` loading took:...

Here are some details of my 600GB dataset. This is a dataset AFTER the `map` function and once I load this dataset, I do not use `map` anymore in the...

Regarding the environment, I am running the code on a cloud server. Here is some info: ``` Ubuntu 18.04.5 LTS # cat /etc/issue pyarrow 3.0.0 # pip list | grep...

I am not sure how I could provide you with the reproducible code, since the problem only arises when the data is big. For the moment, I would share the...

Hi! I just ran the same code with two different datasets (one is 60 GB and the other 600 GB), and the latter runs much slower. The ETA differs by 10x.

Hmm that's different from what I got. I was on Ubuntu when reporting the initial issue.

Could you give a more detailed explanation of the data format for dependency parsing? You have already provided an example, but I am still not clear on what each column means....

I am also facing this issue. As a workaround, in case somebody is really annoyed by this, I use this command to delete the duplicated directories: ```bash find . -type d -name "*[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].[0-9][0-9][0-9][0-9][0-9][0-9]"...