Tom White
A Spark executor would be a great addition. I just added some notes about implementing a new executor in #498 if you're interested in having a go at this @rbavery?
@songhan89 you could do this by transforming the Cubed DAG (a NetworkX MultiDiGraph of pipeline objects) into a Spark DAG of RDD objects, then computing the DAG of RDDs in...
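The general idea of walking a DAG in dependency order and building each node from its already-built inputs can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard library's `graphlib` rather than NetworkX, plain Python lists stand in for RDDs, and the names `dag`, `ops`, and `build` are hypothetical and are not part of Cubed or Spark.

```python
from graphlib import TopologicalSorter

# Hypothetical stand-in for the Cubed DAG: each node maps to the set of
# nodes it depends on (its predecessors).
dag = {
    "a": set(),
    "b": set(),
    "add": {"a", "b"},
    "sum": {"add"},
}

# Hypothetical per-node operations. In a real executor each of these would
# build a Spark RDD from its input RDDs; here plain lists stand in for RDDs.
ops = {
    "a": lambda: [1, 2, 3],
    "b": lambda: [10, 20, 30],
    "add": lambda a, b: [x + y for x, y in zip(a, b)],
    "sum": lambda add: [sum(add)],
}


def build(dag, ops):
    """Build every node after all of its predecessors have been built."""
    built = {}
    # static_order() yields each node only after all of its predecessors.
    for node in TopologicalSorter(dag).static_order():
        inputs = {dep: built[dep] for dep in dag[node]}
        built[node] = ops[node](**inputs)
    return built


result = build(dag, ops)
print(result["sum"])
```

With a NetworkX graph, `networkx.topological_sort` would play the role that `TopologicalSorter` plays here.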
@songhan89 Great progress! It looks like the failures are because some of the Cubed unit tests use a small `allowed_mem` setting (100000 - i.e. 100kB), and Spark doesn't allow values...
Thanks for working on this @GenevieveBuckley.

> I need some better examples that more thoroughly cover the space of possible input & output shapes.

The example I mentioned in https://github.com/dask/dask/issues/7847#issuecomment-874749110...
We could also add a link to `vcftools view` to the documentation for [`display_genotypes`](https://sgkit-dev.github.io/sgkit/latest/generated/sgkit.display_genotypes.html#sgkit.display_genotypes) as a suggested alternative.
Superseded by #1264
I've opened https://github.com/dask/dask/issues/11416
Unfortunately, it looks like Dask 2024.10.0 doesn't fix this: https://github.com/sgkit-dev/sgkit/actions/runs/11551276595 is taking 19 minutes to run, rather than 6 (with Dask 2024.08.0).
On further investigation what's happening is that locally defined functions that are passed to Dask `map_blocks` and that wrap Numba functions are being recompiled every time the (genomics) method is...
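One Python behaviour that may be relevant here: a function defined inside another function is a brand-new object on every call, so anything that caches work per function object sees it as new each time. The sketch below shows only that behaviour. It does not use Numba or Dask, `compiled_cache` and `compile_once` are hypothetical stand-ins for a per-function compilation cache, and whether this is the exact mechanism behind the recompilation is an assumption.

```python
def make_local_kernel():
    # A new function object is created every time make_local_kernel() runs.
    def kernel(x):
        return x + 1
    return kernel


def module_level_kernel(x):
    # Defined once at import time, so it is the same object on every use.
    return x + 1


# Hypothetical stand-in for a compilation cache keyed on the function object.
compiled_cache = {}


def compile_once(func):
    """Pretend to compile func, reusing the cached result if it exists."""
    if func not in compiled_cache:
        compiled_cache[func] = f"compiled:{func.__name__}"
    return compiled_cache[func]


# Locally defined: each call produces a distinct key, so the cache never hits.
for _ in range(3):
    compile_once(make_local_kernel())
local_entries = len(compiled_cache)

compiled_cache.clear()

# Module level: the same object is reused, so the work happens only once.
for _ in range(3):
    compile_once(module_level_kernel)
module_entries = len(compiled_cache)

print(local_entries, module_entries)
```

Under that assumption, hoisting the wrapper to module level is the kind of change that would let the cached result be reused.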
I've fixed the non-distance functions in this commit: https://github.com/sgkit-dev/sgkit/pull/1261/commits/e83b52cdf1ef1b305eefdd8bcaca55b437cc4e4b. I'm not sure what to do about the distance functions at this point.