polars
WebAssembly + NPM
See if we can support this with an optional feature.
Good feature to support, and I'd be interested in taking a stab at it.
Dependency-wise, here is how things look for wasm support:
Out of the box:
num = "^0.2.1"
fnv = "^1.0.7"
unsafe_unwrap = "^0.1.0"
thiserror = "^1.0.16"
itertools = "^0.9.0"
prettytable-rs = { version="^0.8.0", features=["win_crlf"], optional = true, default_features = false}
parquet = {version = "1", optional = true}
packed_simd_2 = "0.3.4"
Require tweaking:
ndarray = {version = "0.13", optional = true, default_features = false}
chrono = {version = "^0.4.13", optional = true} // via a flag
arrow = {version = "1.0.1", default_features = false} // need to disable pretty-print via a feature flag
rayon = "^1.3.1" // need to use cond_iter or a cfg flag - wasm doesn't have threads
Might have issues:
rand = {version = "0.7", optional = true} // wasm support being ruled out in 0.8, use getrandom crate instead
rand_distr = {version = "0.3", optional = true} // similar to rand
All in all, from a dependency standpoint, things look good. I'll see if I can get a PR up that flips the right flags and uses cond_iter instead of the normal rayon iter.
Cool.. I am really excited about this one. I couldn't find anything about cond_iter. Is it also conditional compilation like cfg?
https://github.com/cuviper/rayon-cond
It's just the same as a conditional compilation with cfg, and by the looks of it, using cfg might be better because rayon-cond is out of date (2 years!)
Yes.. I'd rather have that, as it won't increase the compilation times.
rayon-cond is runtime-conditional, not at compilation time. The idea was that it might help with dynamic decisions about parallelism, but if you want a static choice, cfg would be a lot better.
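A static choice via cfg could look roughly like the sketch below. It uses std scoped threads rather than rayon so that it stays self-contained, and `sum_of_squares` is a hypothetical function, not something from the Polars code base. Only one of the two definitions is compiled for any given target, so the wasm build never pulls in threading code.

```rust
/// Native targets: split the work over a few scoped threads.
/// (std threads stand in for rayon here; this is an assumption for the sketch.)
#[cfg(not(target_arch = "wasm32"))]
pub fn sum_of_squares(values: &[i64]) -> i64 {
    // At most four chunks; `max(1)` keeps `chunks` valid for empty input.
    let chunk_len = ((values.len() + 3) / 4).max(1);
    std::thread::scope(|scope| {
        let handles: Vec<_> = values
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().map(|v| v * v).sum::<i64>()))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .sum::<i64>()
    })
}

/// wasm32: no threads available, so fall back to a plain sequential iterator.
#[cfg(target_arch = "wasm32")]
pub fn sum_of_squares(values: &[i64]) -> i64 {
    values.iter().map(|v| v * v).sum::<i64>()
}
```

Callers use `sum_of_squares` the same way on both targets; the decision is made at compile time and costs nothing at runtime.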
https://github.com/jkelleyrtp/polars/blob/jk/wasm/wasm-test/src/lib.rs
I had a slight hiccup, but the basic examples work with some tweaking of the feature flags. Disabling pretty and simd makes the basic examples work on wasm. I'm not sure whether I just haven't run into the conditions that would trip it up and cause a panic.
In terms of npm... is the goal to release a "polars.js" package that exposes a rust-based dataframe? It would be nice if the series could be coerced into TypedArrayBuffers with some metadata so that the data could move in and out of wasm with little serialization/deserialization overhead.
Nice work! Do you know why pretty gave problems?
My idea was a bit like what I've done in Python: keep as much of the memory and operations as possible in Rust, with an option to get data out to Python. Whether this can be done without a copy depends on the memory layout of the Series. If it is a single chunk and there are no nulls in the array, it can be done zero-copy by giving ownership to Python/numpy. Otherwise I just allocate a new array.
I don't know about JS/WASM. Is everything in Rust memory also in WASM linear memory and thus accessible from JS? If it is, it can be zero-copy if the aforementioned conditions are met.
Getting data in from JS will probably always be copy as Arrow memory is 64-byte aligned which most allocations aren't.
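The zero-copy conditions above can be sketched in plain Rust. `ChunkedColumn`, `to_contiguous`, and `is_64_byte_aligned` are hypothetical names for illustration, not Polars or Arrow types; the sketch only tracks a null count and does not model a validity bitmap.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for a chunked column (not the Polars type).
pub struct ChunkedColumn {
    pub chunks: Vec<Vec<f64>>,
    pub null_count: usize,
}

impl ChunkedColumn {
    /// Borrow the data when it is a single chunk without nulls (zero-copy);
    /// otherwise allocate a new contiguous buffer.
    /// A real implementation would also have to decide how nulls are encoded
    /// in the copied buffer; that is left out here.
    pub fn to_contiguous(&self) -> Cow<'_, [f64]> {
        if self.chunks.len() == 1 && self.null_count == 0 {
            Cow::Borrowed(self.chunks[0].as_slice())
        } else {
            Cow::Owned(self.chunks.concat())
        }
    }
}

/// Returns true when `ptr` sits on a 64-byte boundary, the alignment
/// mentioned above for Arrow buffers.
pub fn is_64_byte_aligned<T>(ptr: *const T) -> bool {
    (ptr as usize) % 64 == 0
}
```

Returning a `Cow` makes the copy/no-copy decision visible to the caller without needing two separate code paths.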
With regards to the question:
I don't know about JS/WASM. Is everything in rust memory also WASM linear memory and thus accessible from JS? If it is, it can be zero copy if before mentioned conditions are right.
I am not a WebAssembly expert, but I recently did the wasm-pack [tutorial](https://rustwasm.github.io/book/game-of-life/implementing.html) and read this passage, which may be useful to you:
JavaScript's garbage-collected heap — where Objects, Arrays, and DOM nodes are allocated — is distinct from WebAssembly's linear memory space, where our Rust values live. WebAssembly currently has no direct access to the garbage-collected heap (as of April 2018, this is expected to change with the "Interface Types" proposal). JavaScript, on the other hand, can read and write to the WebAssembly linear memory space, but only as an ArrayBuffer of scalar values (u8, i32, f64, etc...). WebAssembly functions also take and return scalar values. These are the building blocks from which all WebAssembly and JavaScript communication is constituted.
So, as I understand it, JavaScript and WebAssembly can only communicate through arrays of scalar values. Moreover, since Rust is compiled to WebAssembly, any object that lives in Rust memory also lives in WebAssembly linear memory.
Then it seems that we can access all data with minimal overhead, so that's great.
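As a rough illustration of that access pattern, the Rust side only needs to hand out a pointer and a length; the same approach is used in the wasm-pack tutorial linked above. `Column` is a hypothetical type, and the wasm-bindgen attributes a real export would need are omitted so the sketch compiles on its own.

```rust
/// Hypothetical column buffer living in Rust memory, which is WebAssembly
/// linear memory when compiled to wasm.
pub struct Column {
    values: Vec<f64>,
}

impl Column {
    pub fn new(values: Vec<f64>) -> Self {
        Column { values }
    }

    /// Start of the buffer. On wasm this is an offset into linear memory,
    /// so JS could view the data without copying, e.g. with
    /// `new Float64Array(memory.buffer, ptr, len)`.
    pub fn ptr(&self) -> *const f64 {
        self.values.as_ptr()
    }

    /// Number of f64 elements behind `ptr`.
    pub fn len(&self) -> usize {
        self.values.len()
    }
}
```

The view is only valid while the `Column` is alive and its buffer is not reallocated, so the JS wrapper would have to re-create the typed array after any operation that can grow the buffer or the wasm memory.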
It seems that in more recent versions since 8.1, some new dependencies make wasm compilation more difficult, e.g. comfy-table. It would be nice if polars-core contained only computation dependencies, with all IO, formatting, and compression moved out of it.
Some dependencies have gotten easier. For example, arrow compiles to wasm with the default configurations now.
As I understand it, rayon is not an issue anymore, and Polars can be run without SIMD, so the dependencies that still break compilation can be trivially replaced or turned off.
I will make sure that IO and all formatting libraries are optional. Formatting could best be done in JS.
This appears to be mostly working. What tasks do you need help with?
Yes, I made a small POC.
I wanted to mimic the Python API, but some things were not yet possible in wasm-bindgen, such as sending a Vec<Series> to a DataFrame. So I was thinking that we probably want a JavaScript wrapper DataFrame and wrapper Series (I also have that in Python) that can use things like builder patterns under the hood to mimic the Python API.
What it boils down to is that there is quite some work to do, and I think we should split it up in 2 packages.
- js-polars-core -> backend /core wasm
- js-polars -> written in js/ts that creates a nice api around js-polars-core
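The builder idea for the core package might look something like this minimal sketch. `Series`, `DataFrame`, and `DataFrameBuilder` here are simplified, hypothetical stand-ins, not the Polars types or the POC code; the point is only that columns cross the boundary one at a time instead of as a Vec<Series>.

```rust
/// Hypothetical, simplified stand-ins for the real types.
#[derive(Debug, Clone, PartialEq)]
pub struct Series {
    pub name: String,
    pub values: Vec<f64>,
}

#[derive(Debug)]
pub struct DataFrame {
    columns: Vec<Series>,
}

impl DataFrame {
    /// (rows, columns)
    pub fn shape(&self) -> (usize, usize) {
        let height = self.columns.first().map_or(0, |s| s.values.len());
        (height, self.columns.len())
    }
}

/// Accepts one Series per call, so no Vec<Series> has to cross the boundary.
#[derive(Default)]
pub struct DataFrameBuilder {
    columns: Vec<Series>,
}

impl DataFrameBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Adds a column, rejecting it if its length differs from earlier columns.
    pub fn add_series(&mut self, series: Series) -> Result<(), String> {
        if let Some(first) = self.columns.first() {
            if first.values.len() != series.values.len() {
                return Err(format!("length mismatch for column '{}'", series.name));
            }
        }
        self.columns.push(series);
        Ok(())
    }

    /// Consumes the builder and produces the DataFrame.
    pub fn finish(self) -> DataFrame {
        DataFrame { columns: self.columns }
    }
}
```

A JS/TS wrapper in js-polars could then loop over an array of wrapper Series, call `add_series` for each, and call `finish`, giving users a constructor that feels like the Python one.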
Very cool. What do you think about returning Arrow from the WASM context to the JS context and then exposing it to users via the Arrow JS library? The idea is to use Arrow as an IPC format between WASM and JS. You could also use Arrow as an IPC between a web worker and the main thread. We've done something similar in another WASM project with great success.
@domoritz I was thinking about the WASM solution as a replacement for Arrow JS, but your proposal actually makes sense. There is no need for a duplicate ChunkedArray->Primitive->ChunkedArray implementation in polars; that can be shared with Arrow JS. I assume it's not a hot code path, and when it is (strings, arrays of numbers) it'll have to be in JS land anyway. I'm not sure about the schema, dictionary and recordbatch header handling though. In which library would you handle (parse) it?
Glad you like the proposal. We actually had our own iterator implementation at first and then switched to Arrow JS so we wouldn't duplicate work. It's been a good decision, and I agree that the performance should be almost unaffected (if not better, since we avoid repeated calls into wasm).
Not sure I understand the question but I'll try to answer it. If you send record batches from wasm to js, arrow js would construct the schema from the IPC.
My question was whether we would use the full Arrow IPC for messaging or a simpler, lower-level component, such as specific "ChunkedArray" types (as pyarrow refers to them). I don't think e.g. pyarrow uses Apache IPC for arrow<->pyarrow communication. Is pyarrow<->arrow communication the wrong model here (it has similar calling cost; primitives, lists and complex types are different, etc.)?
You could probably use the arrow vectors (which we are changing to be always chunked) but I'm not sure of the benefits. The difference in Python, I think, is that communication between contexts is cheaper. In WASM, you still would need to get e.g. the schema across the boundary, and Arrow's binary format would be more efficient than say JSON. But I might be wrong. I'd say try the simplest solution first and then see whether there are bottlenecks.
This was my question. Does the JS part have to know anything about headers, footers and metadata? I don't think a python<->c++ call is cheaper than JS<->WASM. I might be wrong, but I know that WASM functions are cheaper to call in NodeJS than their C++ implementations.
I think that whatever is feasible we should investigate. Ideally I'd like to have a seamless interop with js-arrow and Polars Series similar to how that works in Python Polars.
For interop with pyarrow (e.g. C++ arrow) / Rust arrow we use the arrow C data interface. This is zero-copy and we just send some pointers around. I don't think it can get much faster than that.
You are right that the C data interface is the best way to interop with languages that can somehow consume these C headers. But arrow-js cannot read the C data interface today. Conceptually, this would require the arrow-js devs to interpret the C headers and all the pointers manually out of your wasm heap which is rather unrealistic. A compromise would be to consume the C data interface on the wasm side and point arrow-js to the right arrays (maybe just dump the schema and all relevant offsets as json or thrift) but that's also code that does not exist today.
I'd recommend just packing your buffers via the IPC format. That's what we do and it works quite well. It allows you to expose your buffers as real record batch streams, and you can still eliminate the explicit IPC packing later without anyone noticing.
Also everything that consumes your buffers will likely be javascript which will quickly engage any handbrakes it can find. This will outweigh the additional IPC packing by a lot.
@ankoh, is there a working alpha of the Javascript wrapper?
Not for polars. We faced a very similar problem with the upcoming WebAssembly version of DuckDB. We're just packing SQL results as arrow record batch streams there, which works fairly well.
Thanks for the update, @ankoh. I hope there's some substantial progress soon. I've tried every JS-based DataFrame library I could find, and all of them felt underwhelming, if not like hobby projects.
Has anyone considered using FFI or Neon for JS bindings instead of WASM? Obviously it wouldn't work with browser-side JS, but I think the larger use case would be NodeJS anyway.
I'd be happy to start working on some NodeJS bindings if that is a direction the core devs would be okay with.
Y'all might be interested in https://github.com/duckdb/duckdb-wasm and the release post at https://duckdb.org/2021/10/29/duckdb-wasm.html.
@ritchie46 I was hoping to get some feedback on the POC I suggested when you have some time. I didn't know if you had strong opinions on using WASM and supporting the browser, or if supporting only NodeJS was sufficient.
I will. I had surgery yesterday, need some recovery time. Thank you for the contribution!
@universalmind303 I'd be glad to help you with Neon bindings. I came here specifically to see if there are efforts going in that direction and saw that you wish to start it.