dask-ms
dask-ms copied to clipboard
ParallelTable Musings
-
[ ] Tests added / passed
$ py.test -v -s daskms/tests
If the pep8 tests fail, the quickest way to correct this is to run
autopep8
and thenflake8
andpycodestyle
to fix the remaining issues.$ pip install -U autopep8 flake8 pycodestyle $ autopep8 -r -i daskms $ flake8 daskms $ pycodestyle daskms
-
[ ] Fully documented, including
HISTORY.rst
for all changes and one of thedocs/*-api.rst
files for new APITo build the docs locally:
pip install -r requirements.readthedocs.txt cd docs READTHEDOCS=True make html
This PR is a WIP demonstrating a possible approach for parallel reads from threads. This approach is reliant on https://github.com/casacore/casacore/pull/1167, which allows me to avoid using soft links. Instead, the changes in that PR mean that when a table is opened from multiple threads, it does not share its underlying plain table object.
The approach that I am attempting here is almost certainly imperfect but it is very simple. It defines a ParallelTable
class which inherits from pyrap.tables.table
. This, unfortunately, introduces some limitations as the base class is defined in C++. That said, doing this allows us to create a ParallelTable
object which masquerades as a normal table - the only difference is that when a read method is called, it first checks if the thread has an open instance of the table. If not, the table is opened in the thread and added to the cache. I make use of weakref
to ensure that all tables are closed when the ParallelTable object is GCed.
The changes in this PR seem to work although some tests are broken - I suspect this may have to do with subtables, but I have yet to investigate. Note that there is plenty of ugly debugging code in the PR. I will remove it if this coalesces into a stable approach.
One important thing to note is the fact that the cf.ThreadPoolExecutor
has been dummied out with a DummyThreadPoolExecutor
and DummyFuture
. This seems to work for a simple read case, though further testing is needed. This would be a nice simplification as it suggests that we could get away without internal threadpools. That said, the changes in the PR also work with the internal threadpools with the caveat that those threadpools need more than one thread (as otherwise we serialise).
Finally, one thing to note is that using the processes
scheduler does not function optimally for both this PR and master
. Both will repeatedly open tables for reasons I don't fully understand. I suspect that the caching mechanism on the TableProxy
doesn't function as expected in this specific case. What is particularly confusing is that it does seem to operate correctly in the distributed case using a LocalCluster
with multiple workers.