[FEATURE] Asynchronous reads

Open joocer opened this issue 3 years ago • 4 comments
To improve read speed, reads should be asynchronous, using asyncio, threading, or multiprocessing. Consider using Plasma to share the data between threads/processes.

discovered

  • [ ] Caching needs to be enabled to work with async
  • [ ] NO_PARALLEL_READ hint needs to be written
  • [ ] and the DOCS updated with this hint
  • [ ] Check there's enough RAM to enable plasma
  • [ ] Test whether the threads are faster with parts of Arrow disabled
  • [ ] Is there a hanging problem? What causes it? Get it fixed.

joocer avatar May 14 '22 08:05 joocer

partially implemented - unreliable due to https://github.com/mabel-dev/opteryx/issues/134

joocer avatar May 21 '22 13:05 joocer

Should implement the following

  • do not do any multithreading on hosts with under 2 GB of RAM
  • do not create a buffer larger than 25% of memory
  • turn off some optimization options in Arrow which may be trying to multithread too
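
A minimal sketch of the first two guards above, assuming the caller already knows the host's total RAM in bytes. The function name `plan_parallel_reads` and its parameters are hypothetical and are not part of opteryx.

```python
GIB = 1024 ** 3


def plan_parallel_reads(total_ram_bytes, min_ram_bytes=2 * GIB, buffer_fraction=0.25):
    """Return (use_threads, max_buffer_bytes) for a host with the given RAM.

    Hypothetical helper: the thresholds mirror the rules in the list above.
    """
    # No multithreading on hosts with under 2 GB of RAM.
    use_threads = total_ram_bytes >= min_ram_bytes
    # Never let the read buffer exceed 25% of total memory.
    max_buffer_bytes = int(total_ram_bytes * buffer_fraction)
    return use_threads, max_buffer_bytes


use_threads, max_buffer = plan_parallel_reads(8 * GIB)
```

For the third point, one option that may apply is passing `use_threads=False` to `pyarrow.parquet.read_table`, so Arrow does not start its own threads on top of ours. Whether that covers every place Arrow multithreads would need testing.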

joocer avatar Aug 24 '22 17:08 joocer

MULTI doesn't use the cache

joocer avatar Aug 26 '22 19:08 joocer

The approach to use here is to go to a batch model rather than try to read ahead.

In a batch model, we async read 4 or more data files simultaneously, process all 4 and then read the next 4 and loop.

This way we read faster (not quite 4x faster, but faster) while avoiding the stalls that come with trying to read ahead. The time to process each file is a tiny fraction of the time to read it, so we'd need to preload 10 or more files to get a meaningful speed increase, which would increase the chance of memory issues.
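
A rough sketch of the batch model, using a thread pool rather than asyncio for brevity. `read_in_batches` and the `read_file` callable are hypothetical names for illustration, not the opteryx implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def read_in_batches(paths, read_file, batch_size=4):
    """Read `batch_size` files concurrently, yield them, then read the next batch.

    `read_file` is any callable taking a path and returning its contents.
    """
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        for start in range(0, len(paths), batch_size):
            batch = paths[start:start + batch_size]
            # map() keeps input order; list() waits until the whole batch is read,
            # so at most `batch_size` files are held in memory at once.
            yield from list(pool.map(read_file, batch))


# Stand-in reader so the sketch runs without real files.
results = list(read_in_batches([f"file_{i}" for i in range(10)], str.upper))
```

Because the next batch is only requested after the consumer has taken every file from the current one, memory use is bounded by the batch size, at the cost of the reader sitting idle while a batch is processed.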

joocer avatar Feb 10 '23 23:02 joocer