
WebAssembly + NPM

Open ritchie46 opened this issue 5 years ago • 54 comments

See if we can support this with an optional feature.

ritchie46 avatar Sep 22 '20 20:09 ritchie46

Good feature to support, and I'd be interested in taking a stab.

Dependency-wise, here is how things look for wasm support:

Out of the box:

num = "^0.2.1"
fnv = "^1.0.7"
unsafe_unwrap = "^0.1.0"
thiserror = "^1.0.16"
itertools = "^0.9.0"
prettytable-rs = { version="^0.8.0", features=["win_crlf"], optional = true, default_features = false}
parquet = {version = "1", optional = true}
packed_simd_2 = "0.3.4"

Require tweaking:

ndarray = {version = "0.13", optional = true, default_features = false}
chrono = {version = "^0.4.13", optional = true}  // via a flag
arrow = {version = "1.0.1", default_features = false} // need to disable pretty-print via a feature flag
rayon = "^1.3.1" // need to use cond_iter or a cfg flag - wasm doesn't have threads

Might have issues:

rand = {version = "0.7", optional = true}               // wasm support being ruled out in 0.8, use getrandom crate instead
rand_distr = {version = "0.3", optional = true}     // similar to rand

All in all, from a dependency standpoint, things look good. I'll see if I can get a PR up that flips the right flags and uses cond_iter instead of the normal rayon iter.

jkelleyrtp avatar Oct 12 '20 07:10 jkelleyrtp

Cool.. I am really excited about this one. I couldn't find anything about cond_iter. Is it also conditional compilation like cfg?

ritchie46 avatar Oct 12 '20 14:10 ritchie46

Cool.. I am really excited about this one. I couldn't find anything about cond_iter. Is it also conditional compilation like cfg?

https://github.com/cuviper/rayon-cond

It's just the same as a conditional compilation with cfg, and by the looks of it, using cfg might be better because rayon-cond is out of date (2 years!)

jkelleyrtp avatar Oct 12 '20 18:10 jkelleyrtp

It's just the same as a conditional compilation with cfg, and by the looks of it, using cfg might be better because rayon-cond is out of date (2 years!)

Yes.. I'd rather have that, as it won't increase the compilation times.

ritchie46 avatar Oct 13 '20 17:10 ritchie46

rayon-cond is runtime-conditional, not at compilation time. The idea was that it might help with dynamic decisions about parallelism, but if you want a static choice, cfg would be a lot better.
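To make the distinction above concrete, here is a minimal sketch (not code from Polars or rayon-cond). It uses scoped threads from the standard library as a stand-in for rayon so that it needs no dependencies, and all function names are hypothetical. The static variant picks its implementation with `cfg` at compile time; the dynamic variant takes the decision as a runtime argument, which is the idea behind rayon-cond.

```rust
/// Sequential implementation: works on every target, including wasm32.
fn sum_squares_seq(values: &[i64]) -> i64 {
    values.iter().map(|v| v * v).sum()
}

/// Threaded implementation (stand-in for a rayon parallel iterator):
/// splits the slice in two and sums each half on its own scoped thread.
fn sum_squares_par(values: &[i64]) -> i64 {
    let (left, right) = values.split_at(values.len() / 2);
    std::thread::scope(|scope| {
        let l = scope.spawn(|| sum_squares_seq(left));
        let r = scope.spawn(|| sum_squares_seq(right));
        l.join().unwrap() + r.join().unwrap()
    })
}

/// Static choice: on native targets only the threaded call is compiled in.
#[cfg(not(target_arch = "wasm32"))]
pub fn sum_squares_static(values: &[i64]) -> i64 {
    sum_squares_par(values)
}

/// Static choice: on wasm32 only the sequential call is compiled in.
#[cfg(target_arch = "wasm32")]
pub fn sum_squares_static(values: &[i64]) -> i64 {
    sum_squares_seq(values)
}

/// Dynamic choice: both paths are compiled and the caller decides at runtime.
/// On a target without threads the caller must always pass `false`.
pub fn sum_squares_dynamic(values: &[i64], parallel: bool) -> i64 {
    if parallel {
        sum_squares_par(values)
    } else {
        sum_squares_seq(values)
    }
}
```

With the static form the unused path adds nothing to the wasm binary, whereas the dynamic form keeps both paths around and needs a caller that knows when threads are available.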

cuviper avatar Oct 20 '20 15:10 cuviper

https://github.com/jkelleyrtp/polars/blob/jk/wasm/wasm-test/src/lib.rs

I had a slight hiccup, but the basic examples work with some tweaking of the feature flags. Disabling pretty and simd makes the basic examples work on wasm. I'm not sure whether I just haven't run into the right conditions to trip it up and cause a panic.

In terms of npm... is the goal to release a "polars.js" package that exposes a rust-based dataframe? It would be nice if the series could be coerced into TypedArrayBuffers with some metadata so that the data could move in and out of wasm with little serialization/deserialization overhead.

jkelleyrtp avatar Oct 20 '20 18:10 jkelleyrtp

Nice work! Do you know why pretty gave problems?

In terms of npm... is the goal to release a "polars.js" package that exposes a rust-based dataframe? It would be nice if the series could be coerced into TypedArrayBuffers with some metadata so that the data could move in and out of wasm with little serialization/deserialization overhead.

My idea was a bit like what I've done in Python: keep as much of the memory and operations as possible in Rust, with an option to get data out to Python. Whether this can be done without a copy depends on the memory layout of the Series. If it is a single chunk and there are no nulls in the array, it can be done zero-copy by giving ownership to Python/numpy. Otherwise I just allocate a new array.

I don't know about JS/WASM. Is everything in Rust memory also in WASM linear memory and thus accessible from JS? If it is, it can be zero-copy when the aforementioned conditions are met.

Getting data in from JS will probably always require a copy, as Arrow memory is 64-byte aligned, which most allocations aren't.
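The zero-copy conditions described above can be sketched roughly as follows. This is an illustration only, not how Polars implements it: `Chunk`, `ChunkedF64`, `to_contiguous`, and `is_64_byte_aligned` are hypothetical names, the validity bitmap is simplified to a `Vec<bool>`, nulls are written out as NaN in the copied case, and the sketch borrows the buffer instead of handing over ownership.

```rust
use std::borrow::Cow;

/// Hypothetical, simplified chunk: a values buffer plus optional validity.
pub struct Chunk {
    pub values: Vec<f64>,
    /// `None` means no nulls; otherwise `validity[i] == false` marks a null.
    pub validity: Option<Vec<bool>>,
}

/// Hypothetical column made of one or more chunks.
pub struct ChunkedF64 {
    pub chunks: Vec<Chunk>,
}

impl ChunkedF64 {
    /// Borrow the data when there is exactly one chunk without nulls;
    /// otherwise copy everything into a new contiguous buffer.
    pub fn to_contiguous(&self) -> Cow<'_, [f64]> {
        match self.chunks.as_slice() {
            [only] if only.validity.is_none() => Cow::Borrowed(only.values.as_slice()),
            chunks => {
                let total: usize = chunks.iter().map(|c| c.values.len()).sum();
                let mut out = Vec::with_capacity(total);
                for chunk in chunks {
                    for (i, v) in chunk.values.iter().enumerate() {
                        let valid = chunk.validity.as_ref().map_or(true, |bits| bits[i]);
                        // Assumption for this sketch: nulls are exported as NaN.
                        out.push(if valid { *v } else { f64::NAN });
                    }
                }
                Cow::Owned(out)
            }
        }
    }
}

/// Check whether a buffer starts on a 64-byte boundary, the alignment
/// mentioned above for Arrow memory.
pub fn is_64_byte_aligned(data: &[f64]) -> bool {
    data.as_ptr() as usize % 64 == 0
}
```

A buffer that arrives from elsewhere and fails the alignment check would have to be copied into a suitably aligned allocation, which matches the expectation above that importing data usually involves a copy.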

ritchie46 avatar Oct 21 '20 07:10 ritchie46

With regards to the question:

I don't know about JS/WASM. Is everything in Rust memory also in WASM linear memory and thus accessible from JS? If it is, it can be zero-copy when the aforementioned conditions are met.

I am not a WebAssembly expert, but I recently did the wasm-pack tutorial (https://rustwasm.github.io/book/game-of-life/implementing.html) and read this, which may be useful for you:

JavaScript's garbage-collected heap — where Objects, Arrays, and DOM nodes are allocated — is distinct from WebAssembly's linear memory space, where our Rust values live. WebAssembly currently has no direct access to the garbage-collected heap (as of April 2018, this is expected to change with the "Interface Types" proposal). JavaScript, on the other hand, can read and write to the WebAssembly linear memory space, but only as an ArrayBuffer of scalar values (u8, i32, f64, etc...). WebAssembly functions also take and return scalar values. These are the building blocks from which all WebAssembly and JavaScript communication is constituted.

So, as I understand it, JavaScript and WebAssembly can only exchange data as arrays of scalars. And since Rust is compiled to WebAssembly, any object that lives in Rust memory also lives in WebAssembly linear memory.
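As a rough illustration of that point, the sketch below hands out a pointer and a length for a Rust-owned buffer and then rebuilds a view from them without copying. `Column` and `view` are hypothetical names, and `view` only stands in for the JavaScript side; in a real wasm build the JS code would instead construct something like `new Float64Array(memory.buffer, ptr, len)` over the module's exported memory.

```rust
/// Hypothetical column of f64 values owned by the Rust (wasm) side.
pub struct Column {
    values: Vec<f64>,
}

impl Column {
    pub fn new(values: Vec<f64>) -> Self {
        Column { values }
    }

    /// Address of the first element. Inside a wasm module this address
    /// is a byte offset into the module's linear memory.
    pub fn ptr(&self) -> *const f64 {
        self.values.as_ptr()
    }

    /// Number of elements, needed by the other side to size its view.
    pub fn len(&self) -> usize {
        self.values.len()
    }
}

/// Stand-in for the JavaScript side: rebuild a view from pointer + length.
///
/// Safety: `ptr` must point to `len` initialized f64 values that stay
/// alive and unmodified for as long as the returned slice is used.
pub unsafe fn view<'a>(ptr: *const f64, len: usize) -> &'a [f64] {
    unsafe { std::slice::from_raw_parts(ptr, len) }
}
```

Only two scalars cross the boundary here, which is why the series data itself does not need to be serialized.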

marioloko avatar Oct 24 '20 10:10 marioloko

Then it seems that we can access all data with minimal overhead, so that's great.

ritchie46 avatar Oct 25 '20 14:10 ritchie46

It seems that in more recent versions, since 8.1, some new dependencies make the wasm compilation more difficult, e.g. comfy-table. It would be nice if polars-core contained only computation dependencies, with all IO, formatting, and compression moved out of it.

jcheype avatar Feb 13 '21 21:02 jcheype

Some dependencies have gotten easier. For example, arrow compiles to wasm with the default configurations now.

domoritz avatar Mar 13 '21 05:03 domoritz

As I understand it, rayon is not an issue anymore, and Polars can be run without SIMD, so the dependencies that still get in the way of compilation can be trivially replaced or turned off.

It seems that in more recent versions, since 8.1, some new dependencies make the wasm compilation more difficult, e.g. comfy-table. It would be nice if polars-core contained only computation dependencies, with all IO, formatting, and compression moved out of it.

I will make sure that IO and all formatting libraries are optional. Formatting could best be done in JS.

ritchie46 avatar Apr 12 '21 16:04 ritchie46

This appears to be mostly working. What tasks do you need help with?

tbro avatar May 18 '21 00:05 tbro

Yes, I made a small POC.

I wanted to mimic the Python API, but some things were not yet possible in wasm-bindgen, such as sending a Vec<Series> to a DataFrame. So I was thinking that we probably want a JavaScript wrapper DataFrame and wrapper Series (I also have that in Python) that can use things like builder patterns under the hood to mimic the Python API.

What it boils down to is that there is quite some work to do, and I think we should split it up into 2 packages.

  • js-polars-core -> backend / core wasm
  • js-polars -> written in js/ts that creates a nice api around js-polars-core
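The builder idea mentioned above could look roughly like this on the core side. It is a sketch under assumptions: `Series`, `DataFrame`, and `DataFrameBuilder` here are simplified, hypothetical types rather than the Polars ones, and the wasm-bindgen annotations are left out so the example has no dependencies. The point is only that columns cross the boundary one call at a time, so no Vec<Series> has to be passed.

```rust
/// Hypothetical, simplified column type.
#[derive(Debug, Clone, PartialEq)]
pub struct Series {
    pub name: String,
    pub values: Vec<f64>,
}

/// Hypothetical, simplified frame type.
#[derive(Debug, PartialEq)]
pub struct DataFrame {
    pub columns: Vec<Series>,
}

/// Collects columns one call at a time and builds the frame at the end.
#[derive(Default)]
pub struct DataFrameBuilder {
    columns: Vec<Series>,
}

impl DataFrameBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Add a single column; all columns must have the same length.
    pub fn add_series(&mut self, series: Series) -> Result<(), String> {
        if let Some(first) = self.columns.first() {
            if first.values.len() != series.values.len() {
                return Err(format!(
                    "column '{}' has {} rows, expected {}",
                    series.name,
                    series.values.len(),
                    first.values.len()
                ));
            }
        }
        self.columns.push(series);
        Ok(())
    }

    /// Consume the builder and produce the frame.
    pub fn finish(self) -> DataFrame {
        DataFrame { columns: self.columns }
    }
}
```

A JS/TS wrapper package could then hide the builder behind a constructor that accepts an array of series and feeds them in one by one.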

ritchie46 avatar May 18 '21 06:05 ritchie46

Very cool. What do you think about returning Arrow from the WASM context to the JS context and then exposing it to users via the Arrow JS library? The idea is to use Arrow as an IPC format between WASM and JS. You could also use Arrow as an IPC between a web worker and the main thread. We've done something similar in another WASM project with great success.

domoritz avatar May 19 '21 16:05 domoritz

@domoritz I was thinking about the WASM solution as a replacement for Arrow JS, but your proposal actually makes sense. There is no need for a duplicate ChunkedArray->Primitive->ChunkedArray implementation in polars; that can be shared with Arrow JS. I assume it's not a hot code path, and when it is (strings, arrays of numbers) it'll have to be in JS land anyway. I'm not sure about the schema, dictionary and recordbatch header handling though. In which library would you handle (parse) it?

alippai avatar May 19 '21 17:05 alippai

Glad you like the proposal. We've actually had our own iterator implementation first as well and then switched to Arrow JS so we don't duplicate work. It's been a good decision and I agree that the performance should be almost unaffected (if not better since we avoid repeated calls into wasm).

I'm not sure about the schema, dictionary and recordbatch header handling though. In which library would you handle (parse) it?

Not sure I understand the question but I'll try to answer it. If you send record batches from wasm to js, arrow js would construct the schema from the IPC.

domoritz avatar May 19 '21 17:05 domoritz

My question was whether we would use the full Arrow IPC for messaging or a simpler, lower-level component: specific "ChunkedArray" types (as pyarrow refers to them). I don't think e.g. pyarrow uses Apache IPC for arrow<->pyarrow communication. Is pyarrow<->arrow communication the wrong model here (it has a similar calling cost; primitives, lists, and complex types are different, etc.)?

alippai avatar May 19 '21 17:05 alippai

You could probably use the arrow vectors (which we are changing to be always chunked) but I'm not sure of the benefits. The difference in Python, I think, is that communication between contexts is cheaper. In WASM, you still would need to get e.g. the schema across the boundary, and Arrow's binary format would be more efficient than say JSON. But I might be wrong. I'd say try the simplest solution first and then see whether there are bottlenecks.

domoritz avatar May 19 '21 17:05 domoritz

This was my question. Does the JS part have to know anything about headers, footers and metadata? I don't think a python<->c++ call is cheaper than JS<->WASM. I might be wrong, but I know that WASM functions are cheaper to call in NodeJS than their C++ implementations.

alippai avatar May 19 '21 19:05 alippai

Very cool. What do you think about returning Arrow from the WASM context to the JS context and then exposing it to users via the Arrow JS library? The idea is to use Arrow as an IPC format between WASM and JS. You could also use Arrow as an IPC between a web worker and the main thread. We've done something similar in another WASM project with great success.

I think that whatever is feasible we should investigate. Ideally I'd like to have a seamless interop with js-arrow and Polars Series similar to how that works in Python Polars.

My question was whether we would use the full Arrow IPC for messaging or a simpler, lower-level component: specific "ChunkedArray" types (as pyarrow refers to them). I don't think e.g. pyarrow uses Apache IPC for arrow<->pyarrow communication. Is pyarrow<->arrow communication the wrong model here (it has a similar calling cost; primitives, lists, and complex types are different, etc.)?

For interop with pyarrow (e.g. C++ arrow) / Rust arrow we use the arrow C data interface. This is zero-copy and we just send some pointers around. I don't think it can get much faster than that.
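For readers unfamiliar with it, "sending some pointers around" can be sketched as below. The `ArrowArray` struct mirrors the layout published in the Arrow C data interface specification; everything else (`Owner`, `export_f64`, `release_f64`) is a hypothetical helper written for this example, not code from Polars or arrow-rs, and the accompanying `ArrowSchema` struct is omitted for brevity.

```rust
use std::os::raw::c_void;
use std::ptr;

/// Rust mirror of the `ArrowArray` struct from the Arrow C data interface.
#[repr(C)]
pub struct ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut ArrowArray,
    pub dictionary: *mut ArrowArray,
    pub release: Option<unsafe extern "C" fn(*mut ArrowArray)>,
    pub private_data: *mut c_void,
}

/// Hypothetical owner that keeps the data and the buffer table alive.
struct Owner {
    _values: Vec<f64>,
    buffers: [*const c_void; 2],
}

/// Release callback: frees the owner and marks the array as released.
unsafe extern "C" fn release_f64(array: *mut ArrowArray) {
    unsafe {
        let array = &mut *array;
        drop(Box::from_raw(array.private_data as *mut Owner));
        array.release = None;
    }
}

/// Export a Vec<f64> without copying its contents (hypothetical helper).
pub fn export_f64(values: Vec<f64>) -> ArrowArray {
    let length = values.len() as i64;
    let data = values.as_ptr() as *const c_void;
    // Buffer 0 is the validity bitmap (null: no nulls), buffer 1 the values.
    let owner = Box::into_raw(Box::new(Owner {
        _values: values,
        buffers: [ptr::null(), data],
    }));
    let buffers = unsafe { ptr::addr_of_mut!((*owner).buffers) } as *mut *const c_void;
    ArrowArray {
        length,
        null_count: 0,
        offset: 0,
        n_buffers: 2,
        n_children: 0,
        buffers,
        children: ptr::null_mut(),
        dictionary: ptr::null_mut(),
        release: Some(release_f64),
        private_data: owner as *mut c_void,
    }
}
```

The consumer reads the values through `buffers[1]` and calls `release` when it is done, so only the struct itself changes hands.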

ritchie46 avatar May 19 '21 19:05 ritchie46

Very cool. What do you think about returning Arrow from the WASM context to the JS context and then exposing it to users via the Arrow JS library? The idea is to use Arrow as an IPC format between WASM and JS. You could also use Arrow as an IPC between a web worker and the main thread. We've done something similar in another WASM project with great success.

I think that whatever is feasible we should investigate. Ideally I'd like to have a seamless interop with js-arrow and Polars Series similar to how that works in Python Polars.

My question was whether we would use the full Arrow IPC for messaging or a simpler, lower-level component: specific "ChunkedArray" types (as pyarrow refers to them). I don't think e.g. pyarrow uses Apache IPC for arrow<->pyarrow communication. Is pyarrow<->arrow communication the wrong model here (it has a similar calling cost; primitives, lists, and complex types are different, etc.)?

For interop with pyarrow (e.g. C++ arrow) / Rust arrow we use the arrow C data interface. This is zero-copy and we just send some pointers around. I don't think it can get much faster than that.

You are right that the C data interface is the best way to interop with languages that can somehow consume these C headers. But arrow-js cannot read the C data interface today. Conceptually, this would require the arrow-js devs to interpret the C headers and all the pointers manually out of your wasm heap which is rather unrealistic. A compromise would be to consume the C data interface on the wasm side and point arrow-js to the right arrays (maybe just dump the schema and all relevant offsets as json or thrift) but that's also code that does not exist today.

I'd recommend to just pack your buffers via the IPC format. That's what we do and it works quite well. It allows you to expose your buffers as real record batch streams and you can still eliminate the explicit ipc packing later without anyone noticing.

Also everything that consumes your buffers will likely be javascript which will quickly engage any handbrakes it can find. This will outweigh the additional IPC packing by a lot.
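To show the general shape of "pack the buffers into one self-describing message", here is a deliberately simplified framing. It is not the Arrow IPC format and is not compatible with arrow-js; `pack`, `unpack`, and the byte layout are made up for this example. A real implementation would use an Arrow IPC writer on the Rust side and the arrow-js reader on the other.

```rust
/// Toy layout: u32 column count, then for each column a u32 name length,
/// the UTF-8 name, a u32 value count, and the values as little-endian f64.
pub fn pack(columns: &[(&str, &[f64])]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(columns.len() as u32).to_le_bytes());
    for (name, values) in columns {
        out.extend_from_slice(&(name.len() as u32).to_le_bytes());
        out.extend_from_slice(name.as_bytes());
        out.extend_from_slice(&(values.len() as u32).to_le_bytes());
        for v in values.iter() {
            out.extend_from_slice(&v.to_le_bytes());
        }
    }
    out
}

/// Take `len` bytes starting at `*pos`, advancing the cursor on success.
fn take<'a>(bytes: &'a [u8], pos: &mut usize, len: usize) -> Option<&'a [u8]> {
    let end = pos.checked_add(len)?;
    let slice = bytes.get(*pos..end)?;
    *pos = end;
    Some(slice)
}

/// Read one little-endian u32 at the cursor.
fn read_u32(bytes: &[u8], pos: &mut usize) -> Option<u32> {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(take(bytes, pos, 4)?);
    Some(u32::from_le_bytes(raw))
}

/// Decode a message produced by `pack`; returns None on malformed input.
pub fn unpack(bytes: &[u8]) -> Option<Vec<(String, Vec<f64>)>> {
    let mut pos = 0usize;
    let n_columns = read_u32(bytes, &mut pos)?;
    let mut columns = Vec::new();
    for _ in 0..n_columns {
        let name_len = read_u32(bytes, &mut pos)? as usize;
        let name = String::from_utf8(take(bytes, &mut pos, name_len)?.to_vec()).ok()?;
        let count = read_u32(bytes, &mut pos)?;
        let mut values = Vec::new();
        for _ in 0..count {
            let mut raw = [0u8; 8];
            raw.copy_from_slice(take(bytes, &mut pos, 8)?);
            values.push(f64::from_le_bytes(raw));
        }
        columns.push((name, values));
    }
    Some(columns)
}
```

Because the message carries its own schema (here just names and lengths), the receiving side needs no out-of-band knowledge of the layout, which is the property the IPC suggestion above relies on.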

ankoh avatar May 19 '21 22:05 ankoh

@ankoh, is there a working alpha of the Javascript wrapper?

stordopoulos avatar Jul 08 '21 13:07 stordopoulos

@ankoh, is there a working alpha of the Javascript wrapper?

Not for polars. We faced a very similar problem with the upcoming WebAssembly version of DuckDB. We're just packing SQL results as arrow record batch streams there, which works fairly well.

ankoh avatar Jul 08 '21 13:07 ankoh

Thanks for the update, @ankoh. I hope there's some substantial progress soon. I've tried any JS-based Dataframe library I could find, and all of them felt underwhelming, if not just a hobby project.

stordopoulos avatar Jul 08 '21 13:07 stordopoulos

Has anyone considered using FFI or Neon for JS bindings instead of WASM? Obviously it wouldn't work with browser-side JS, but I think the larger use case would be for NodeJS anyway.

I'd be happy to start working on some nodejs bindings if that is a direction the core devs would be okay with.

universalmind303 avatar Nov 07 '21 05:11 universalmind303

Y'all might be interested in https://github.com/duckdb/duckdb-wasm and the release post at https://duckdb.org/2021/10/29/duckdb-wasm.html.

domoritz avatar Nov 07 '21 13:11 domoritz

@ritchie46 Was hoping to get some feedback on the POC I suggested when you have some time. I didn't know if you had strong opinions on using WASM and supporting the browser, or if supporting only nodejs was sufficient.

universalmind303 avatar Nov 09 '21 02:11 universalmind303

@ritchie46 Was hoping to get some feedback on the POC I suggested when you have some time. I didn't know if you had strong opinions on using WASM and supporting the browser, or if supporting only nodejs was sufficient.

I will. I had surgery yesterday, need some recovery time. Thank you for the contribution!

ritchie46 avatar Nov 09 '21 06:11 ritchie46

@universalmind303 I'd be glad to help you with Neon bindings. I came here specifically to see if there are efforts going in that direction and saw that you wish to start it.

dashmug avatar Nov 23 '21 17:11 dashmug