dask-ms icon indicating copy to clipboard operation
dask-ms copied to clipboard

ParallelTable Musings

Open JSKenyon opened this issue 3 years ago • 3 comments

  • [ ] Tests added / passed

    $ py.test -v -s daskms/tests
    

    If the pep8 tests fail, the quickest way to correct this is to run autopep8 and then flake8 and pycodestyle to fix the remaining issues.

    $ pip install -U autopep8 flake8 pycodestyle
    $ autopep8 -r -i daskms
    $ flake8 daskms
    $ pycodestyle daskms
    
  • [ ] Fully documented, including HISTORY.rst for all changes and one of the docs/*-api.rst files for new API

    To build the docs locally:

    pip install -r requirements.readthedocs.txt
    cd docs
    READTHEDOCS=True make html
    

This PR is a WIP demonstrating a possible approach for parallel reads from threads. This approach is reliant on https://github.com/casacore/casacore/pull/1167, which allows me to avoid using soft links. Instead, the changes in that PR mean that when a table is opened from multiple threads, it does not share its underlying plain table object.

The approach that I am attempting here is almost certainly imperfect but it is very simple. It defines a ParallelTable class which inherits from pyrap.tables.table. This, unfortunately, introduces some limitations as the base class is defined in C++. That said, doing this allows us to create a ParallelTable object which masquerades as a normal table - the only difference is that when a read method is called, it first checks if the thread has an open instance of the table. If not, the table is opened in the thread and added to the cache. I make use of weakref to ensure that all tables are closed when the ParallelTable object is GCed.

The changes in this PR seem to work although some tests are broken - I suspect this may have to do with subtables, but I have yet to investigate. Note that there is plenty of ugly debugging code in the PR. I will remove it if this coalesces into a stable approach.

One important thing to note is the fact that the cf.ThreadPoolExecutor has been dummied out with a DummyThreadPoolExecutor and DummyFuture. This seems to work for a simple read case, though further testing is needed. This would be a nice simplification as it suggests that we could get away without internal threadpools. That said, the changes in the PR also work with the internal threadpools with the caveat that those threadpools need more than one thread (as otherwise we serialise).

Finally, one thing to note is that using the processes scheduler does not function optimally for both this PR and master. Both will repeatedly open tables for reasons I don't fully understand. I suspect that the caching mechanism on the TableProxy doesn't function as expected in this specific case. What is particularly confusing is that it does seem to operate correctly in the distributed case using a LocalCluster with multiple workers.

JSKenyon avatar Feb 24 '22 14:02 JSKenyon