Thomas Proisl

21 comments by Thomas Proisl

The two names [vb.net](http://vb.net) and [asp.net](https://asp.net) are indeed working URLs (though only one is registered by Microsoft). While they are probably used much more frequently as proper names, recognizing them...

It is currently not possible to perfectly reconstruct the input text from the output tokens as SoMaJo will normalize any whitespace to a single space and will discard things like...
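To see why whitespace normalization alone already rules out a perfect round trip, here is a minimal sketch. It does not use SoMaJo itself; `collapse_whitespace` is a hypothetical helper that only mimics the "any whitespace becomes a single space" behaviour described above.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Hypothetical stand-in for the normalization described above:
    every run of whitespace is replaced by one space."""
    return re.sub(r"\s+", " ", text)


# Three different inputs ...
inputs = ["Hello   world", "Hello\tworld", "Hello\nworld"]

# ... collapse to the same normalized string, so the original
# whitespace cannot be recovered from the normalized form.
normalized = {collapse_whitespace(s) for s in inputs}
print(normalized)
```

Because the mapping is many-to-one, reconstructing the input would require storing the original whitespace (or character offsets) alongside each token.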

Wow, I didn't expect SoMaJo to be useful for Malaysian! Unfortunately, I am not able to reproduce the problem. I get the following output which looks fine: [['BE', 'kem', 'pertama',...

I’m facing the same issue when loading a 900GB dataset (stored via `save_to_disk`): `load_from_disk(path_to_dir)` takes 1.5 hours and htop consistently shows high IO rates > 120 M/s.

@lhoestq Thank you! The issue is getting more interesting. The second script is still running, but it's definitely taking much longer than 15 seconds.

Okay, here’s the output: `Blocks read 158396`, `Elapsed time: 529.10s`. Also using datasets 1.6.2. Do you have any ideas how to pinpoint the problem?

The 529.10s was a bit too optimistic. I had cancelled the reading process once before letting it run to completion, so the hard drive cache probably did its work. Here are three consecutive runs...

See #28: > I’ve decided to explicitly add markdown links, so this should be fixed now, with the caveat that it will fail if the link description contains square brackets...

I’ve decided to explicitly add markdown links, so this should be fixed now, with the caveat that it will fail if the link description contains square brackets or if the...
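The square-bracket caveat can be illustrated with a deliberately naive pattern. This is an assumption for illustration only, not SoMaJo's actual regular expression; `MD_LINK` and `find_links` are hypothetical names.

```python
import re

# Naive pattern (an assumption, not SoMaJo's real one): the description
# may contain anything except square brackets, the target anything
# except parentheses and whitespace.
MD_LINK = re.compile(r"\[([^\[\]]+)\]\(([^()\s]+)\)")


def find_links(text: str) -> list[tuple[str, str]]:
    """Return (description, target) pairs for every match of MD_LINK."""
    return [(m.group(1), m.group(2)) for m in MD_LINK.finditer(text)]


# A plain link is found ...
print(find_links("see [vb.net](http://vb.net)"))

# ... but a description containing square brackets is not matched.
print(find_links("see [a [b] c](http://example.org)"))
```

A pattern of this shape is simple and fast, at the cost of missing links whose description itself contains brackets.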

Thank you for reporting this! I’ll just summarize our off-GitHub discussion here for future reference. The root of the problem is how worker processes are created on different platforms. On...
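As general background, Python's standard `multiprocessing` module creates worker processes differently depending on the platform. The sketch below only inspects the start methods available on the current machine; it makes no claim about how SoMaJo configures its workers.

```python
import multiprocessing as mp

# Start methods supported on this platform, e.g. "fork", "spawn",
# "forkserver". Which ones appear depends on the operating system.
available = mp.get_all_start_methods()

# The method that would be used by default if none is set explicitly.
# allow_none=False (the default) fixes the default and returns its name.
default = mp.get_start_method()

print("available:", available)
print("default:", default)
```

With `spawn`, each worker starts a fresh interpreter and re-imports the main module, which is why scripts that create worker processes are usually wrapped in an `if __name__ == "__main__":` guard.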