deeplake
[BUG] "processed" transforms significantly slower than "threaded"
🐛 Bug Report
⚗️ Current Behavior
When running a large transform job, `scheduler="processed"` is significantly slower than `scheduler="threaded"`.
Tested by having the transform upload to S3.
Benchmark results:
https://activeloop.atlassian.net/wiki/spaces/~465006346/pages/95256645/imagenet+transform+-+s3+bottleneck?atlOrigin=eyJpIjoiN2Q2ZmI4NTZmNTA2NDMwMWJlYWNkMTUwNDZmMjQ3ODQiLCJwIjoiYyJ9
@AbhinavTuli
Also, no matter how many workers are allocated, "threaded" always runs at the same speed. Does Hub uploading run synchronously after chunk completion?
My hypothesis: transforms get asynchronously computed into a chunk of size `workers`, and once that chunk is complete it is uploaded synchronously.
A better design may be to have each worker upload its own outputs from its own process, so allocation can happen on the fly rather than only after all workers have finished (see the sketch below).
Another solution would be to not set `chunk_size = workers`, but rather something like `chunk_size = workers * 32`.
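Not the Hub scheduler's actual code, just a minimal sketch of the two designs being contrasted; `transform`, `upload`, and the batching loop are all stand-ins:

```python
import concurrent.futures


def transform(sample):
    # Stand-in for the user's transform; assume a picklable, pure function.
    return sample * 2


def upload(result):
    # Stand-in for the storage write (e.g. an S3 put).
    pass


def run_batched(samples, workers=4):
    # Hypothesized current behavior: compute a batch of `workers` results in
    # parallel, then upload the whole batch synchronously while workers idle.
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as pool:
        for i in range(0, len(samples), workers):
            batch = list(pool.map(transform, samples[i:i + workers]))
            for result in batch:
                upload(result)


def transform_and_upload(sample):
    # Proposed design: each worker uploads its own output as soon as it is
    # ready, so compute and I/O overlap instead of alternating.
    upload(transform(sample))


def run_streaming(samples, workers=4):
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(transform_and_upload, samples))


if __name__ == "__main__":
    run_batched(list(range(100)))
    run_streaming(list(range(100)))
```

In the batched version the upload loop is a serial barrier after every `workers` samples, which would also explain why adding workers doesn't change the threaded throughput.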
@McCrearyD I had experimented with the idea of `samples per shard = workers * 32 * (number of samples that fit in 16 MB)`, but for datasets with really large samples, bigger than 16 MB each (for example intelinair), this led to memory issues.
The problem was that the number of samples that fit in 16 MB came out to be 0; we then took the max of that number and 1, giving us 1 sample. That one sample could be 500 MB, for example, and holding 32 of these could lead to memory problems.
One way to solve this is to multiply by the factor (maybe 32) only when a sample fits within 16 MB, and otherwise continue with the default logic, roughly as sketched below.
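A minimal sketch of that heuristic plus the proposed guard; the constants and the "default logic" fallback here are assumptions for illustration, not Hub's actual implementation:

```python
SHARD_TARGET_BYTES = 16 * 2**20  # 16 MB
FACTOR = 32  # the extra multiplier under discussion


def samples_per_shard(sample_nbytes, workers):
    # Original heuristic: workers * 32 * (samples that fit in 16 MB).
    # For samples larger than 16 MB the inner term is 0, max(..., 1) bumps
    # it to 1, and the shard still holds workers * 32 huge samples.
    samples_in_16mb = max(SHARD_TARGET_BYTES // sample_nbytes, 1)

    if sample_nbytes <= SHARD_TARGET_BYTES:
        # Small samples: apply the extra factor for throughput.
        return workers * FACTOR * samples_in_16mb
    # Huge samples (> 16 MB each): skip the factor so memory stays bounded.
    # The "default logic" is assumed here to be workers * samples_in_16mb.
    return workers * samples_in_16mb


if __name__ == "__main__":
    # With 500 MB samples and 4 workers, the unguarded heuristic would keep
    # 4 * 32 * 1 = 128 samples (~64 GB) in flight; the guarded version keeps 4.
    print(samples_per_shard(500 * 2**20, workers=4))  # -> 4
    print(samples_per_shard(1 * 2**20, workers=4))    # -> 4 * 32 * 16 = 2048
```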
Why do we need to auto-infer this? We don't have to make any decisions here if the user provides a batch size. @AbhinavTuli
Yup, we do have an argument for this, `sample_per_shard`, but ideally we want to auto-infer it. It's unintuitive for the end user.
@McCrearyD Did you identify the root cause? Another explanation is that GIL contention could be more pronounced under multiprocessing on multiple cores for I/O, resulting in thrashing.
@mynameisvinn I hadn't considered this; I'll keep it in mind when tackling the new code base.
Sounds good @McCrearyD, let me know if you want to chat about this. It might also be as simple as "use multithreading when I/O bound, multiprocessing when CPU bound". If that's true, we could infer the appropriate parallelization technique, e.g. along the lines of the sketch below.
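A minimal sketch of that inference, assuming the caller (or a simple benchmark) can say whether the transform is I/O or CPU bound; `run_transform` and its `io_bound` flag are hypothetical, not Hub's API:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def run_transform(fn, samples, workers=4, io_bound=True):
    # Threads overlap well for I/O-bound work such as S3 uploads (the GIL is
    # released while waiting on the network); processes only pay off when the
    # transform itself is CPU bound.
    executor_cls = ThreadPoolExecutor if io_bound else ProcessPoolExecutor
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(fn, samples))
```

Auto-detection could go further, e.g. timing a few samples under each executor, but a user-supplied hint is the simpler starting point.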