dozer
dozer copied to clipboard
Add incipient wasm_udf support
This PR adds incipient support for wasm_udf, as defined in #1653. A new feature is added, so compile with --features wasm, if you want to test it.
I used Wasmtime as the wasm engine to run the wasm code. Wasmer or WasmEdge or some other engine can be chosen as well, with not many modifications, in theory.
The most important part is the type conversion from Dozer types to wasm types, which natively are very few (only i64 and f64 could be used for now). Wasm doesn't natively support i128, and Dozer does not support i32, so there were not many type that matched.
More types can of course be added (Boolean for example is trivial to add, even if it wastes an entire i32). For composite types, like String or Text, binding generation support has to be added.
I could not find the parsing of the dozer-config.yml, so I was inspired by the python_udf module to use environment variables. You can define DOZER_WASM_UDF to point to a .wasm file with the exported functions.
Testplan:
An AssemblyScript example has been added to ./dozer-tests/wasm_udf/assemblyscript. The simples way to test the feature is to use the dozer-samples sql join test.
Change the sql to include a simple wasm_udf call:
sql: |
SELECT t.tpep_pickup_datetime AS pickup_time, z.Zone AS zone, wasm_addf(9.5, 10.7, 'float') as WASM
INTO pickup
FROM trips t JOIN taxi_zone_lookup z ON t.PULocationID = z.LocationID;
Follow the rest of the instructions in that example.
To build the AssemblyScript that contains the wasm_addf function:
cd ./dozer-tests/wasm_udf/assemblyscript
npm install && npm run asbuild
The module is now at ./dozer-tests/wasm_udf/assemblyscript/build/debug.wasm. Set DOZER_WASM_UDF before you run the example:
DOZER_WASM_UDF=./path/to/debug.wasm dozer
/claim #1653
Thanks. Can you review @chubei ?
Thanks. Can you review @chubei ?
Sure.
@tachyonicbytes what do you mean you could not find the parsing of dozer-config.yml? @chubei can guide you here. We are trying to centralize the UDFs definitions. Check out also #1632 as well for UDFs definition
@snork-alt I checked #1632 while developing. Is the definition you refer to the
udfs:
wasm:
my_function: module.wasm
part of the config?
In that case, that is precisely what I didn't get. It's not the parsing itself that gets me @chubei, but I don't know what I should import to get the module path "path/to/debug.wasm" in wasm_udf.rs. The python module didn't have it, it used env vars, and the onnx module is not ready yet, so I couldn't see how they do it.
@chubei, can you help @tachyonicbytes? Thanks
@tachyonicbytes are you planning to implement those changes and close the PR ?
Yes, I am in the process of writing a bigger message
Thank you so much for the review, @chubei and @snork-alt.
Ok, so I understand that, in order to pass the config variables to the wasm_udf module, we need to propagate those values through the necessary functions. I thought that the project had something akin to a global config module that I import and use. That's what I was looking for, and could not find.
Yeah, the python_udf module has been a semi-reference for me, because when I looked at it in more depth, it seemed incomplete. But it was still a nice starting point.
Now, for the data type conversion:
So, for WebAssembly, this is what we work with: i32, i64, f32, f64, v128, and linear memory for arbitrary amounts of memory. In Dozer, we have (please correct me if I am wrong with any of it):
- [ ]
UInt - [ ]
U128 - [x]
Int - [ ]
I128 - [x]
Float - [ ]
Boolean - [ ]
String - [ ]
Text - [ ]
Binary - [ ]
Decimal - [ ]
Timestamp - [ ]
Date - [ ]
Json
Int and Float are done. Boolean is trivial, albeit it wastes 31 bits for every type. UInt should be doable, even though we may have to impose certain restrictions on its usage. String, Text, Binary all involve the linear memory in some capacity, so you kinda have to solve them all at once, so it's only one data type's worth of work, plus idiosyncrasies.
Now, there is a catch. In theory, U128 and I128 are easily solvable with v128, which, even though is a simd type, can support the usual scalar operations that we do on I128 and U128 (with the exception of division, which can be done slower or it can be restricted). But the wasmtime crate does not implement v128. It is in AssemblyScript, but we can't use it because of that. The wasmer crate, on the other hand, supports it.
If this is solved, UInt can be half of a v128, even though that is, again, not fully efficient and I128 and U128 have the caveat above, no or slow division.
I don't know how Decimal works inside Dozer. Is it an Int type that you don't fully use? That's how it was implemented in intro CS classes, if I recall correctly. If so, that is an easy conversion, but it's not natively supported in any langauge, so users will have to implement their own semantics. If it is implemented as a String, then it falls in the linear memory category.
Timestamp and Date should be strings, I guess, so linear memory for them as well.
The last one, Json. This is the one that I would personally suggest skipping at least for now. The problem with WebAssembly and json is not that people don't need it, but it is kinda hard to do without proper reflection. AssemblyScript has no native json support. It has a json package that is great in its own way, but it doesn't support exception handling, so it's kinda hard to validate your json on the udf end. TinyGo does not support the standard library json, although it has some non-reflective packages.
Of course, there is also the possibility of splitting the data type conversion by parameter or return value. We can decide to implement Json for parameters, but not for return values, for example, so you always have valid jsons in your udf code.
So, please let me know which of the types above sound doable for this PR. Please let me know of the wasmtime <-> wasmer decision. And please further explain what is needed of Timestamp and Date.
I'll also repair the python_udf module and add a TinyGo wasm example, if you don't mind.
@tachyonicbytes we just merged a PR that allows return types of SQL UDFs to be specified as function<Type>. This can be useful for WASM functions.
Perfect. Less matching on strings for wasm_udfs as well
@tachyonicbytes are you still working on this ?
Yes. So the config part seems to be more complicated than I anticipated. The problem is that a new section has to be created for wasm udfs. I added it to api_config, because it seems to have more in common with the other api_config options than the others.
Another problem is that, in the ExecutorOptions at dozer-core/src/executor.rs the wasm udfs need to be added. I opted for this: pub udfs: HashMap<String, HashMap<String, String>>. But the HashMap is not serializable with prost, so another solution has to be found, as all the other types are simple types.
Changing that type also means going back to the UdfOptions struct I put in dozer-types/src/models/api_config.rs and change the type there as well.
The problem is that the free format
wasm_udfs:
- module1: function1
- module1: function2
- module2: function3
is hard to serialize to simple types, if you don't store it in a string, for example, in an ad-hoc format, like "module1|function1|module1|function2" etc. But maybe that's an option as well.
Is there a discord server or some synchronous communication mechanism for outsiders that contribute to dozer? I think some of the problems that I face can be easily solved that way.
Hi @tachyonicbytes , let's talk on Discord. https://discord.gg/3eWXBgJaEQ
And we've added udfs config section in https://github.com/getdozer/dozer/pull/1831. You can follow format in that PR.
@tachyonicbytes any plan for.this ?
Hi @tachyonicbytes is this ready for review ? If yes please ask @chubei . Thanks
@snork-alt I talked with @chubei, and the agreed strategy is to wait for this https://github.com/getdozer/dozer/pull/1838 to be merged and then rebase wasm udfs upon it. Otherwise it would contain a lot of duplicated work that would need to be resolved in conflicts.
The PR is functional now. I usually check it with the dozer-samples/sql/join sample:
version: 1
app_name: sql-join-sample
connections:
- config : !LocalStorage
details:
path: data
tables:
- !Table
name: taxi_zone_lookup
config: !CSV
path: zones
extension: .csv
- !Table
name: trips
config: !Parquet
path: trips
extension: .parquet
name: ny_taxi
sql: |
SELECT t.tpep_pickup_datetime AS pickup_time, z.Zone AS zone, fib<Int>(10) as WASM
INTO pickup
FROM trips t JOIN taxi_zone_lookup z ON t.PULocationID = z.LocationID;
sources:
- name: taxi_zone_lookup
table_name: taxi_zone_lookup
connection: ny_taxi
- name: trips
table_name: trips
connection: ny_taxi
endpoints:
- name: pickup
path: /pickup
table_name: pickup
udfs:
- config: !Wasm
path: /path/to/dozer/dozer-tests/wasm_udf/assemblyscript/build/debug.wasm
name: fib
@tachyonicbytes is it possible to infer return type and parameter types from the wasm module itself, rather than having to declare it during the function call ?
I am 90% sure that you can infer it, but I will have to thoroughly check the docs for that. The other thing is that return_type is necessary there so that Dozer requests from the wasm module the actual type it wants, so that there is less confusion during the conversion.
@chubei can you review ?
@chubei can you review ?
I'm leaving it for later of this week.
@tachyonicbytes can we fix the last things so we can merge ? Thanks
@tachyonicbytes shall we close this or you plan to wrap it up so it can be merged and your bounty paid?
Yeah, sorry for the delay, I'm wrapping it up.
Any progress @tachyonicbytes ?
Yes, I've updated to the new json schemas, as well as working on the new errors, like onnx does. With testing and all it should be ready in maybe a couple of days, maybe Sunday.