thomas chaton
Thanks @rom1504. The machine has 32 CPUs, so I thought it should be fine. I am running inside a Docker container, so I am having some issues installing knot resolver. I...
Hey @rom1504 Any idea what I should be looking for on the Docker or cloud provider side as a possible source of issues? Also, should I use knot or bind9?
Thanks @rom1504, I will check this out. I managed to install knot on the host, but it isn't visible inside the container and networking seems broken. Have you ever tried this?
I am also curious what kind of numbers you get without using knot resolver?
Hey @rom1504 I am trying to get it working on https://lightning.ai/, so it runs in docker. Yes, my success rate is far from this. So something is wrong.
@rom1504 Here is the PR I am working on: https://github.com/Lightning-AI/pytorch-lightning/pull/19400 and the API: I am trying to make data processing efficient while easy to hack around. Here is the example...
It seemed that image downloading speeds were quite similar between `optimize` and img2dataset. But I need to be more principled and collect the same metrics to build a more educated comparison....
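For reference, here is a minimal sketch of the two numbers discussed in this thread (success rate and images/sec), so that both tools can be compared on the same footing. The `DownloadStats` class and its field names are hypothetical, not part of img2dataset or Lightning, and the definitions are an assumption about what "the same metrics" would mean here.

```python
from dataclasses import dataclass


@dataclass
class DownloadStats:
    """Hypothetical container for one download run (not a library API)."""

    attempted: int          # number of URLs we tried to fetch
    succeeded: int          # number of images downloaded and decoded successfully
    elapsed_seconds: float  # wall-clock duration of the run

    @property
    def success_rate(self) -> float:
        # Fraction of attempted URLs that produced a usable image.
        return self.succeeded / self.attempted if self.attempted else 0.0

    @property
    def images_per_second(self) -> float:
        # Throughput counted on successful downloads only.
        return self.succeeded / self.elapsed_seconds if self.elapsed_seconds > 0 else 0.0


# Made-up numbers, purely for illustration.
stats = DownloadStats(attempted=1000, succeeded=900, elapsed_seconds=2.0)
print(f"success rate: {stats.success_rate:.2%}, throughput: {stats.images_per_second:.0f} img/s")
```

Counting throughput on successes only is a design choice: a run that fails fast on most URLs would otherwise look faster than it really is.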
The distribution is already fully handled by the `optimize` and `map` operators. Check this example: https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset?view=public&section=data+processing Example to tokenize SlimPajama. ```python import json from pathlib import Path import zstandard as...
Hey @rom1504 I am able to get 1.1k images/sec. I think I have a version of knot resolver that works. I am also using http2 from httpx client and I...
Hey @rom1504, > Be careful with sorting the urls as you risk to dos the hosts. I had randomly shuffled them in laion datasets to mitigate this. Interesting. Yes, I...
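The shuffling point quoted above can be illustrated with a small sketch. It assumes the URLs fit in a Python list; `shuffle_urls` and `longest_same_host_run` are hypothetical helper names, not functions from img2dataset or Lightning, and the `example` hosts are placeholders.

```python
import random
from urllib.parse import urlparse


def shuffle_urls(urls, seed=0):
    """Return a shuffled copy so consecutive requests rarely hit the same host."""
    rng = random.Random(seed)  # seeded for reproducible ordering
    shuffled = list(urls)      # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled


def longest_same_host_run(urls):
    """Length of the longest streak of consecutive URLs that share a host."""
    longest = current = 0
    previous = None
    for url in urls:
        host = urlparse(url).netloc
        current = current + 1 if host == previous else 1
        longest = max(longest, current)
        previous = host
    return longest


# Sorted URLs group every request to one host together.
sorted_urls = sorted(
    f"https://host{h}.example/img{i}.jpg" for h in range(5) for i in range(20)
)
print(longest_same_host_run(sorted_urls))                # 20: one host gets hammered
print(longest_same_host_run(shuffle_urls(sorted_urls)))  # much shorter after shuffling
```

For datasets too large to hold in memory, the same idea would need a different mechanism (for example, shuffling within shards), which this sketch does not cover.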