Run on Pyodide

Open tomwhite opened this issue 2 years ago • 1 comments

Pyodide uses WebAssembly to run Python in the browser. It has support for a lot of the PyData stack, so I wondered how easy it would be to get sgkit running on it. It would be a nice way to share demos and notebooks (see JupyterLite). (This is work I did last year but didn't get round to sharing.)

The following libraries are not supported yet:

  • Dask distributed. There is some discussion in https://github.com/dask/dask/issues/7764. The synchronous scheduler does work though, with a small workaround.
  • Numba. Ideally numba decorators would be ignored, but for the demo below I just commented them out. (There is a problem with doing this for guvectorize since it generates code with a new signature, so anything that uses these functions won't work.)
  • IO libraries. In principle these could be submitted to Pyodide as a new package.

I created a branch with the above changes (and a few others), then built a wheel and uploaded it to Google Cloud Storage (GCS) so that it could be loaded with micropip. Then, using https://pyodide.org/en/latest/console.html, I managed to create an sgkit Dataset:

Welcome to the Pyodide terminal emulator 🐍
Python 3.9.5 (default, Jan 17 2022 04:07:25) on WebAssembly VM
Type "help", "copyright", "credits" or "license" for more information.
>>> import micropip
>>> import zarr # not sure why this is needed before installing sgkit
>>> import sklearn # needed since sgkit doesn't explicitly declare it as a dependency (need to fix)
>>> await micropip.install("https://storage.googleapis.com/tomwhite_test/sgkit-0.3.1.dev5%2Bg59736c0-py3-none-any.whl")
>>> # needed to import dask, see https://github.com/pyodide/pyodide/issues/1603
>>> import sys
>>> sys.modules['_multiprocessing'] = object
>>> 
>>> import dask
>>> dask.config.set(scheduler='synchronous')
<dask.config.set object at 0x347de48>
>>> import sgkit as sg
/lib/python3.9/site-packages/pandas/compat/__init__.py:124: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
  warnings.warn(msg)
>>> ds = sg.simulate_genotype_call_dataset(n_variant=1000, n_sample=250, n_contig=23, missing_pct=.1)
>>> ds
<xarray.Dataset>
Dimensions:             (variants: 1000, alleles: 2, samples: 250, ploidy: 2)
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_contig      (variants) int32 0 0 0 0 0 0 0 ... 22 22 22 22 22 22 22
    variant_position    (variants) int32 0 1 2 3 4 5 6 ... 36 37 38 39 40 41 42
    variant_allele      (variants, alleles) |S1 b'G' b'A' b'T' ... b'A' b'T'
    sample_id           (samples) <U4 'S0' 'S1' 'S2' ... 'S247' 'S248' 'S249'
    call_genotype       (variants, samples, ploidy) int8 0 0 1 0 1 ... 0 0 0 0 1
    call_genotype_mask  (variants, samples, ploidy) bool False False ... False
Attributes:
    contigs:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
>>> 
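The `_multiprocessing` workaround in the session above overwrites `sys.modules` unconditionally. A slightly more defensive variant, sketched here under the assumption that the import fails with an `ImportError` on Pyodide, installs the placeholder only when the real module is unavailable, so the same code can run on regular CPython too. The helper name `stub_if_missing` is hypothetical.

```python
import importlib
import sys


def stub_if_missing(module_name, placeholder=object):
    """Register a placeholder in sys.modules if module_name cannot be imported.

    Returns True when the placeholder was installed, False when the real
    module imported successfully.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Same trick as in the console session: any object will do,
        # as long as a later `import` finds an entry in sys.modules.
        sys.modules[module_name] = placeholder
        return True
    return False


stubbed = stub_if_missing("_multiprocessing")
print("stubbed _multiprocessing:", stubbed)
```

This would be called before `import dask`, in place of the direct `sys.modules` assignment.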
 

tomwhite, Jan 17 '22 12:01

I updated this to use the latest code, and stubbed out some of the numba calls: https://github.com/tomwhite/sgkit/tree/pyodide-latest. This simplifies its usage a bit:

Welcome to the Pyodide terminal emulator 🐍
Python 3.9.5 (default, Jan 17 2022 04:07:25) on WebAssembly VM
Type "help", "copyright", "credits" or "license" for more information.
>>> import micropip
>>> await micropip.install("https://storage.googleapis.com/tomwhite_test/sgkit-0.4.1.dev20%2Bg839eb9a9-py3-none-any.whl")
>>> import sys
>>> sys.modules['_multiprocessing'] = object
>>> import dask
>>> dask.config.set(scheduler='synchronous')
<dask.config.set object at 0x2071538>
>>> import sgkit as sg
/lib/python3.9/site-packages/pandas/compat/__init__.py:124: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
  warnings.warn(msg)
>>> sg.simulate_genotype_call_dataset(n_variant=1000, n_sample=250, n_contig=23, missing_pct=.1)
<xarray.Dataset>
Dimensions:             (variants: 1000, alleles: 2, samples: 250, ploidy: 2)
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_contig      (variants) int32 0 0 0 0 0 0 0 ... 22 22 22 22 22 22 22
    variant_position    (variants) int32 0 1 2 3 4 5 6 ... 36 37 38 39 40 41 42
    variant_allele      (variants, alleles) |S1 b'G' b'A' b'T' ... b'A' b'T'
    sample_id           (samples) <U4 'S0' 'S1' 'S2' ... 'S247' 'S248' 'S249'
    call_genotype       (variants, samples, ploidy) int8 0 0 1 0 1 ... 0 0 0 0 1
    call_genotype_mask  (variants, samples, ploidy) bool False False ... False
Attributes:
    contigs:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
    source:   sgkit-0.4.1.dev20+g839eb9a9
>>> 
 

I built the wheel with

python setup.py bdist_wheel

And I used GCS since it makes it easy to set CORS:

gsutil cors set cors.json gs://tomwhite_test

Where cors.json is

[
    {
      "origin": ["https://pyodide.org/"],
      "method": ["GET"],
      "responseHeader": ["Content-Type"],
      "maxAgeSeconds": 3600
    }
]
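If the wheel needs to be loadable from more than one site, the same file could be generated rather than edited by hand. A minimal sketch, where `build_cors_config` is a made-up helper and the keys and values are copied from the cors.json above:

```python
import json


def build_cors_config(origins, max_age_seconds=3600):
    """Build a GCS-style CORS configuration allowing GET from the given origins."""
    return [
        {
            "origin": list(origins),
            "method": ["GET"],
            "responseHeader": ["Content-Type"],
            "maxAgeSeconds": max_age_seconds,
        }
    ]


# Reproduces the configuration shown above; add further origins to the list as needed.
config = build_cors_config(["https://pyodide.org/"])
print(json.dumps(config, indent=4))
```

The printed JSON can be saved as cors.json and applied with the `gsutil cors set` command shown earlier.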

One day it might be possible just to use the standard sgkit wheel from PyPI, in which case there would be no need to worry about CORS.

tomwhite, Jan 17 '22 17:01