
[BUG] "processed" transforms significantly slower than "threaded"

Open verbiiyo opened this issue 3 years ago • 8 comments

🐛 Bug Report

⚗️ Current Behavior

When running a large transform job, scheduler="processed" is significantly slower than scheduler="threaded".

Tested by having the transform upload to S3.

benchmark results:

https://activeloop.atlassian.net/wiki/spaces/~465006346/pages/95256645/imagenet+transform+-+s3+bottleneck?atlOrigin=eyJpIjoiN2Q2ZmI4NTZmNTA2NDMwMWJlYWNkMTUwNDZmMjQ3ODQiLCJwIjoiYyJ9

verbiiyo avatar Mar 30 '21 18:03 verbiiyo

@AbhinavTuli

verbiiyo avatar Mar 30 '21 18:03 verbiiyo

Also, no matter how many workers you allocate, "threaded" always runs at the same speed. Does Hub upload synchronously after each chunk completes?

My hypothesis: transforms are computed asynchronously into a chunk of size workers, and only after that chunk is complete are the results uploaded synchronously.

A better design might be for each worker to upload its own outputs from its own process, so uploads happen on the fly rather than only after all workers have finished.

Another solution would be to not set chunk_size = workers, but rather something like chunk_size = workers * 32.

verbiiyo avatar Mar 30 '21 18:03 verbiiyo
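
To make the hypothesis above concrete, here is a minimal sketch (not Hub's actual code) of the two scheduling patterns being contrasted: compute a batch of workers samples and then upload them synchronously, versus each worker uploading its own output as soon as it is ready. compute_sample and upload_chunk are hypothetical placeholders.

```python
# Illustrative sketch only; compute_sample and upload_chunk are hypothetical
# stand-ins, not Hub internals. The same pattern applies with a process pool.
from concurrent.futures import ThreadPoolExecutor

def compute_sample(i):
    ...  # transform one input sample

def upload_chunk(results):
    ...  # push a chunk of transformed samples to S3

def batched_upload(indices, workers):
    # Hypothesized current behavior: compute `workers` samples in parallel,
    # then block on a synchronous upload before starting the next batch,
    # so compute and upload never overlap.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, len(indices), workers):
            batch = indices[start:start + workers]
            results = list(pool.map(compute_sample, batch))
            upload_chunk(results)  # nothing else runs while this uploads

def streaming_upload(indices, workers):
    # Suggested alternative: each worker uploads its own output as soon as
    # it finishes computing it, so uploads overlap with ongoing compute.
    def compute_and_upload(i):
        upload_chunk([compute_sample(i)])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(compute_and_upload, indices))
```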

@McCrearyD I had experimented with the idea of number of samples per shard = workers * 32 * (number of samples that fit in 16 MB), but for datasets with very large samples, bigger than 16 MB (for example intelinair), this led to memory issues. The number of samples that fit in 16 MB came out to 0, which we then clamped to max(n, 1) = 1 sample. That one sample could be 500 MB, and holding 32 of these at once could cause memory problems.

One way to solve this is to multiply by a factor (maybe 32) only when the sample fits within 16 MB, and otherwise continue with the default logic.

AbhinavTuli avatar Mar 31 '21 05:03 AbhinavTuli
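
As a rough sketch of the heuristic and fix discussed above (illustrative names, not Hub's actual implementation): multiply by the factor only when a sample fits within the 16 MB chunk target, otherwise fall back to the per-worker default.

```python
# Illustrative sketch of the shard-size heuristic discussed above; the names
# and the fallback are assumptions, not Hub's actual code.
CHUNK_BYTES = 16 * 1024 * 1024  # 16 MB target chunk size

def samples_per_shard(workers: int, sample_bytes: int, factor: int = 32) -> int:
    fits_per_chunk = max(CHUNK_BYTES // sample_bytes, 1)
    if sample_bytes > CHUNK_BYTES:
        # Sample does not fit in 16 MB (e.g. a 500 MB image): skip the
        # `factor` multiplier so we never hold workers * 32 huge samples
        # in memory at once.
        return workers * fits_per_chunk  # == workers
    return workers * factor * fits_per_chunk
```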

Why do we need to auto-infer this? We don't have to make any decisions here if the user provides a batch size. @AbhinavTuli

verbiiyo avatar Mar 31 '21 05:03 verbiiyo

Yup, we do have an argument for this, sample_per_shard, but ideally we want to auto-infer it; exposing the argument is unintuitive for the end user.

AbhinavTuli avatar Mar 31 '21 05:03 AbhinavTuli

@McCrearyD Did you identify the root cause? Another explanation is that GIL contention could be more pronounced in multiprocessing on multiple cores for I/O, resulting in thrashing.

mynameisvinn avatar Apr 08 '21 11:04 mynameisvinn

@mynameisvinn I hadn't considered this; I will keep it in mind when tackling the new code base.

verbiiyo avatar Apr 11 '21 05:04 verbiiyo

Sounds good @McCrearyD, let me know if you want to chat about this. It might also be as simple as "use multithreading when I/O bound, multiprocessing when CPU bound". If that is true, we could infer the appropriate parallelization technique.

mynameisvinn avatar Apr 12 '21 13:04 mynameisvinn
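
A rough illustration of that rule of thumb (assumptions, not Hub API: the timing probe and the 0.5 threshold are illustrative): time one sample to estimate whether the transform spends most of its wall time on CPU or waiting on I/O, then pick the pool type accordingly.

```python
# Minimal sketch of the rule of thumb above; pick_executor and its threshold
# are illustrative assumptions, not part of Hub.
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def pick_executor(transform, sample, workers):
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    transform(sample)  # run one sample to estimate where the time goes
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start

    io_bound = wall > 0 and cpu / wall < 0.5
    if io_bound:
        # Mostly waiting on network/disk: the GIL is released during those
        # waits, so threads are cheap and effective.
        return ThreadPoolExecutor(max_workers=workers)
    # Mostly burning CPU: processes sidestep the GIL.
    return ProcessPoolExecutor(max_workers=workers)
```

In practice one would amortize this probe over a few samples, but it captures the kind of inference suggested above.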