sgkit
Run on Pyodide
Pyodide uses WebAssembly to run Python in the browser. It has support for a lot of the PyData stack, so I wondered how easy it would be to get sgkit running on it. It would be a nice way to share demos and notebooks (see JupyterLite). (This is work I did last year but didn't get round to sharing.)
The following libraries are not supported yet:
- Dask distributed. There is some discussion on https://github.com/dask/dask/issues/7764. The synchronous scheduler does work though, with a small workaround.
- Numba. Ideally numba decorators would be ignored, but for the demo below I just commented them out. (There is a problem with doing this for guvectorize, since it generates code with a new signature, so anything that uses these functions won't work.)
- IO libraries. In principle these could be submitted to Pyodide as a new package.
I created a branch with the above changes (and a few others), then built a wheel and uploaded it to Google Cloud Storage (GCS) so that it could be loaded with micropip. Then, using the console at https://pyodide.org/en/latest/console.html, I managed to create an sgkit Dataset:
Welcome to the Pyodide terminal emulator 🐍
Python 3.9.5 (default, Jan 17 2022 04:07:25) on WebAssembly VM
Type "help", "copyright", "credits" or "license" for more information.
>>> import micropip
>>> import zarr # not sure why this is needed before installing sgkit
>>> import sklearn # needed since sgkit doesn't explicitly declare it as a dependency (need to fix)
>>> await micropip.install("https://storage.googleapis.com/tomwhite_test/sgkit-0.3.1.dev5%2Bg59736c0-py3-none-any.whl")
>>> # needed to import dask, see https://github.com/pyodide/pyodide/issues/1603
>>> import sys
sys.modules['_multiprocessing'] = object
>>>
>>> import dask
>>> dask.config.set(scheduler='synchronous')
<dask.config.set object at 0x347de48>
>>> import sgkit as sg
/lib/python3.9/site-packages/pandas/compat/__init__.py:124: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
warnings.warn(msg)
>>> ds = sg.simulate_genotype_call_dataset(n_variant=1000, n_sample=250, n_contig=23, missing_pct=.1)
>>> ds
<xarray.Dataset>
Dimensions: (variants: 1000, alleles: 2, samples: 250, ploidy: 2)
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
variant_contig (variants) int32 0 0 0 0 0 0 0 ... 22 22 22 22 22 22 22
variant_position (variants) int32 0 1 2 3 4 5 6 ... 36 37 38 39 40 41 42
variant_allele (variants, alleles) |S1 b'G' b'A' b'T' ... b'A' b'T'
sample_id (samples) <U4 'S0' 'S1' 'S2' ... 'S247' 'S248' 'S249'
call_genotype (variants, samples, ploidy) int8 0 0 1 0 1 ... 0 0 0 0 1
call_genotype_mask (variants, samples, ploidy) bool False False ... False
Attributes:
contigs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
>>>
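The `sys.modules['_multiprocessing'] = object` line in the session above works because Python's import system checks sys.modules first, so any placeholder registered under that name satisfies `import _multiprocessing`. A slightly more defensive variation, sketched below, registers an empty module object only when the real import fails. The helper name stub_module is hypothetical, and this is not the exact workaround used in the session.

```python
import sys
import types


def stub_module(name):
    """Ensure `import name` succeeds, registering an empty placeholder if needed.

    Returns whatever ends up in sys.modules under `name`.
    """
    try:
        __import__(name)
    except ImportError:
        # The module is unavailable on this platform: register an empty
        # module so later `import name` statements do not raise.
        sys.modules[name] = types.ModuleType(name)
    return sys.modules[name]
```

On Pyodide this would be called as `stub_module("_multiprocessing")` before `import dask`. The placeholder has no attributes, so it only helps when the importing code never actually uses the module, which is the case with the synchronous scheduler.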
I updated this to use the latest code, and stubbed out some of the numba calls: https://github.com/tomwhite/sgkit/tree/pyodide-latest. This simplifies its usage a bit:
Welcome to the Pyodide terminal emulator 🐍
Python 3.9.5 (default, Jan 17 2022 04:07:25) on WebAssembly VM
Type "help", "copyright", "credits" or "license" for more information.
>>> import micropip
>>> await micropip.install("https://storage.googleapis.com/tomwhite_test/sgkit-0.4.1.dev20%2Bg839eb9a9-py3-none-any.whl")
>>> import sys
sys.modules['_multiprocessing'] = object
>>> import dask
dask.config.set(scheduler='synchronous')
<dask.config.set object at 0x2071538>
>>> import sgkit as sg
/lib/python3.9/site-packages/pandas/compat/__init__.py:124: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
warnings.warn(msg)
>>> sg.simulate_genotype_call_dataset(n_variant=1000, n_sample=250, n_contig=23, missing_pct=.1)
<xarray.Dataset>
Dimensions: (variants: 1000, alleles: 2, samples: 250, ploidy: 2)
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
variant_contig (variants) int32 0 0 0 0 0 0 0 ... 22 22 22 22 22 22 22
variant_position (variants) int32 0 1 2 3 4 5 6 ... 36 37 38 39 40 41 42
variant_allele (variants, alleles) |S1 b'G' b'A' b'T' ... b'A' b'T'
sample_id (samples) <U4 'S0' 'S1' 'S2' ... 'S247' 'S248' 'S249'
call_genotype (variants, samples, ploidy) int8 0 0 1 0 1 ... 0 0 0 0 1
call_genotype_mask (variants, samples, ploidy) bool False False ... False
Attributes:
contigs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
source: sgkit-0.4.1.dev20+g839eb9a9
>>>
I built the wheel with
python setup.py bdist_wheel
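Development builds have a + in the wheel filename (from the local version part, e.g. +g839eb9a9), which has to be percent-encoded as %2B in the URL passed to micropip.install, as in the sessions above. A small sketch of building that URL with the standard library follows; the helper name wheel_url is hypothetical.

```python
from urllib.parse import quote


def wheel_url(base_url, wheel_filename):
    """Join a bucket URL and a wheel filename, percent-encoding the filename."""
    # quote() leaves letters, digits and "-._~" alone and encodes "+" as "%2B".
    return base_url.rstrip("/") + "/" + quote(wheel_filename)


url = wheel_url(
    "https://storage.googleapis.com/tomwhite_test",
    "sgkit-0.4.1.dev20+g839eb9a9-py3-none-any.whl",
)
print(url)
```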
And I used GCS since it makes it easy to set CORS:
gsutil cors set cors.json gs://tomwhite_test
where cors.json is:
[
  {
    "origin": ["https://pyodide.org/"],
    "method": ["GET"],
    "responseHeader": ["Content-Type"],
    "maxAgeSeconds": 3600
  }
]
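If the allowed origins need to change (for example, to serve a demo from a different site), the same file can be generated rather than edited by hand. The sketch below only reuses the four field names from the cors.json above; the function name make_cors_config and its defaults are assumptions for illustration.

```python
import json


def make_cors_config(origins, methods=("GET",),
                     response_headers=("Content-Type",),
                     max_age_seconds=3600):
    """Build a one-rule CORS configuration in the shape shown above."""
    return [
        {
            "origin": list(origins),
            "method": list(methods),
            "responseHeader": list(response_headers),
            "maxAgeSeconds": max_age_seconds,
        }
    ]


config = make_cors_config(["https://pyodide.org/"])
cors_json = json.dumps(config, indent=2)
print(cors_json)
```

Writing cors_json to a file named cors.json gives the input for the gsutil command above.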
One day it might be possible to just use the standard sgkit wheel from PyPI, in which case there would be no need to worry about CORS.