dozer icon indicating copy to clipboard operation
dozer copied to clipboard

Add incipient wasm_udf support

Open tachyonicbytes opened this issue 2 years ago • 27 comments

This PR adds incipient support for wasm_udf, as defined in #1653. A new feature is added, so compile with --features wasm, if you want to test it.

I used Wasmtime as the wasm engine to run the wasm code. Wasmer or WasmEdge or some other engine can be chosen as well, with not many modifications, in theory.

The most important part is the type conversion from Dozer types to wasm types, which natively are very few (only i64 and f64 could be used for now). Wasm doesn't natively support i128, and Dozer does not support i32, so there were not many type that matched.

More types can of course be added (Boolean for example is trivial to add, even if it wastes an entire i32). For composite types, like String or Text, binding generation support has to be added.

I could not find the parsing of the dozer-config.yml, so I was inspired by the python_udf module to use environment variables. You can define DOZER_WASM_UDF to point to a .wasm file with the exported functions.

Testplan:

An AssemblyScript example has been added to ./dozer-tests/wasm_udf/assemblyscript. The simples way to test the feature is to use the dozer-samples sql join test.

Change the sql to include a simple wasm_udf call:

sql: |
  SELECT t.tpep_pickup_datetime AS pickup_time, z.Zone AS zone, wasm_addf(9.5, 10.7, 'float') as WASM
  INTO pickup
  FROM trips t JOIN taxi_zone_lookup z ON t.PULocationID = z.LocationID;

Follow the rest of the instructions in that example.

To build the AssemblyScript that contains the wasm_addf function:

cd ./dozer-tests/wasm_udf/assemblyscript
npm install && npm run asbuild

The module is now at ./dozer-tests/wasm_udf/assemblyscript/build/debug.wasm. Set DOZER_WASM_UDF before you run the example:

DOZER_WASM_UDF=./path/to/debug.wasm dozer

/claim #1653

tachyonicbytes avatar Jun 19 '23 23:06 tachyonicbytes

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Jun 19 '23 23:06 CLAassistant

Thanks. Can you review @chubei ?

snork-alt avatar Jun 20 '23 05:06 snork-alt

Thanks. Can you review @chubei ?

Sure.

chubei avatar Jun 20 '23 05:06 chubei

@tachyonicbytes what do you mean you could not find the parsing of dozer-config.yml? @chubei can guide you here. We are trying to centralize the UDFs definitions. Check out also #1632 as well for UDFs definition

snork-alt avatar Jun 20 '23 07:06 snork-alt

@snork-alt I checked #1632 while developing. Is the definition you refer to the

udfs:
  wasm:
    my_function: module.wasm

part of the config?

In that case, that is precisely what I didn't get. It's not the parsing itself that gets me @chubei, but I don't know what I should import to get the module path "path/to/debug.wasm" in wasm_udf.rs. The python module didn't have it, it used env vars, and the onnx module is not ready yet, so I couldn't see how they do it.

tachyonicbytes avatar Jun 20 '23 08:06 tachyonicbytes

@chubei, can you help @tachyonicbytes? Thanks

snork-alt avatar Jun 20 '23 13:06 snork-alt

@tachyonicbytes are you planning to implement those changes and close the PR ?

snork-alt avatar Jun 22 '23 15:06 snork-alt

Yes, I am in the process of writing a bigger message

tachyonicbytes avatar Jun 22 '23 15:06 tachyonicbytes

Thank you so much for the review, @chubei and @snork-alt.

Ok, so I understand that, in order to pass the config variables to the wasm_udf module, we need to propagate those values through the necessary functions. I thought that the project had something akin to a global config module that I import and use. That's what I was looking for, and could not find.

Yeah, the python_udf module has been a semi-reference for me, because when I looked at it in more depth, it seemed incomplete. But it was still a nice starting point.

Now, for the data type conversion:

So, for WebAssembly, this is what we work with: i32, i64, f32, f64, v128, and linear memory for arbitrary amounts of memory. In Dozer, we have (please correct me if I am wrong with any of it):

  • [ ] UInt
  • [ ] U128
  • [x] Int
  • [ ] I128
  • [x] Float
  • [ ] Boolean
  • [ ] String
  • [ ] Text
  • [ ] Binary
  • [ ] Decimal
  • [ ] Timestamp
  • [ ] Date
  • [ ] Json

Int and Float are done. Boolean is trivial, albeit it wastes 31 bits for every type. UInt should be doable, even though we may have to impose certain restrictions on its usage. String, Text, Binary all involve the linear memory in some capacity, so you kinda have to solve them all at once, so it's only one data type's worth of work, plus idiosyncrasies.

Now, there is a catch. In theory, U128 and I128 are easily solvable with v128, which, even though is a simd type, can support the usual scalar operations that we do on I128 and U128 (with the exception of division, which can be done slower or it can be restricted). But the wasmtime crate does not implement v128. It is in AssemblyScript, but we can't use it because of that. The wasmer crate, on the other hand, supports it.

If this is solved, UInt can be half of a v128, even though that is, again, not fully efficient and I128 and U128 have the caveat above, no or slow division.

I don't know how Decimal works inside Dozer. Is it an Int type that you don't fully use? That's how it was implemented in intro CS classes, if I recall correctly. If so, that is an easy conversion, but it's not natively supported in any langauge, so users will have to implement their own semantics. If it is implemented as a String, then it falls in the linear memory category.

Timestamp and Date should be strings, I guess, so linear memory for them as well.

The last one, Json. This is the one that I would personally suggest skipping at least for now. The problem with WebAssembly and json is not that people don't need it, but it is kinda hard to do without proper reflection. AssemblyScript has no native json support. It has a json package that is great in its own way, but it doesn't support exception handling, so it's kinda hard to validate your json on the udf end. TinyGo does not support the standard library json, although it has some non-reflective packages.

Of course, there is also the possibility of splitting the data type conversion by parameter or return value. We can decide to implement Json for parameters, but not for return values, for example, so you always have valid jsons in your udf code.

So, please let me know which of the types above sound doable for this PR. Please let me know of the wasmtime <-> wasmer decision. And please further explain what is needed of Timestamp and Date.

I'll also repair the python_udf module and add a TinyGo wasm example, if you don't mind.

tachyonicbytes avatar Jun 22 '23 16:06 tachyonicbytes

@tachyonicbytes we just merged a PR that allows return types of SQL UDFs to be specified as function<Type>. This can be useful for WASM functions.

snork-alt avatar Jun 27 '23 10:06 snork-alt

Perfect. Less matching on strings for wasm_udfs as well

tachyonicbytes avatar Jun 27 '23 11:06 tachyonicbytes

@tachyonicbytes are you still working on this ?

snork-alt avatar Aug 14 '23 16:08 snork-alt

Yes. So the config part seems to be more complicated than I anticipated. The problem is that a new section has to be created for wasm udfs. I added it to api_config, because it seems to have more in common with the other api_config options than the others.

Another problem is that, in the ExecutorOptions at dozer-core/src/executor.rs the wasm udfs need to be added. I opted for this: pub udfs: HashMap<String, HashMap<String, String>>. But the HashMap is not serializable with prost, so another solution has to be found, as all the other types are simple types.

Changing that type also means going back to the UdfOptions struct I put in dozer-types/src/models/api_config.rs and change the type there as well.

The problem is that the free format

wasm_udfs:
    - module1: function1
    - module1: function2
    - module2: function3

is hard to serialize to simple types, if you don't store it in a string, for example, in an ad-hoc format, like "module1|function1|module1|function2" etc. But maybe that's an option as well.

Is there a discord server or some synchronous communication mechanism for outsiders that contribute to dozer? I think some of the problems that I face can be easily solved that way.

tachyonicbytes avatar Aug 15 '23 00:08 tachyonicbytes

Hi @tachyonicbytes , let's talk on Discord. https://discord.gg/3eWXBgJaEQ

And we've added udfs config section in https://github.com/getdozer/dozer/pull/1831. You can follow format in that PR.

chubei avatar Aug 15 '23 02:08 chubei

@tachyonicbytes any plan for.this ?

snork-alt avatar Aug 25 '23 04:08 snork-alt

Hi @tachyonicbytes is this ready for review ? If yes please ask @chubei . Thanks

snork-alt avatar Aug 30 '23 16:08 snork-alt

@snork-alt I talked with @chubei, and the agreed strategy is to wait for this https://github.com/getdozer/dozer/pull/1838 to be merged and then rebase wasm udfs upon it. Otherwise it would contain a lot of duplicated work that would need to be resolved in conflicts.

tachyonicbytes avatar Sep 01 '23 09:09 tachyonicbytes

The PR is functional now. I usually check it with the dozer-samples/sql/join sample:

version: 1
app_name: sql-join-sample
connections:
  - config : !LocalStorage
      details:
        path: data
      tables:
        - !Table
          name: taxi_zone_lookup
          config: !CSV
            path: zones
            extension: .csv
        - !Table
          name: trips
          config: !Parquet
            path: trips
            extension: .parquet
    name: ny_taxi

sql: |
  SELECT t.tpep_pickup_datetime AS pickup_time, z.Zone AS zone, fib<Int>(10) as WASM
  INTO pickup
  FROM trips t JOIN taxi_zone_lookup z ON t.PULocationID = z.LocationID;

sources:
  - name: taxi_zone_lookup
    table_name: taxi_zone_lookup
    connection: ny_taxi
  - name: trips
    table_name: trips
    connection: ny_taxi

endpoints:
  - name: pickup
    path: /pickup
    table_name: pickup

udfs:
  - config: !Wasm
      path: /path/to/dozer/dozer-tests/wasm_udf/assemblyscript/build/debug.wasm
    name: fib

tachyonicbytes avatar Sep 02 '23 17:09 tachyonicbytes

@tachyonicbytes is it possible to infer return type and parameter types from the wasm module itself, rather than having to declare it during the function call ?

snork-alt avatar Sep 03 '23 18:09 snork-alt

I am 90% sure that you can infer it, but I will have to thoroughly check the docs for that. The other thing is that return_type is necessary there so that Dozer requests from the wasm module the actual type it wants, so that there is less confusion during the conversion.

tachyonicbytes avatar Sep 03 '23 18:09 tachyonicbytes

@chubei can you review ?

snork-alt avatar Sep 05 '23 13:09 snork-alt

@chubei can you review ?

I'm leaving it for later of this week.

chubei avatar Sep 05 '23 13:09 chubei

@tachyonicbytes can we fix the last things so we can merge ? Thanks

snork-alt avatar Sep 16 '23 05:09 snork-alt

@tachyonicbytes shall we close this or you plan to wrap it up so it can be merged and your bounty paid?

snork-alt avatar Sep 25 '23 03:09 snork-alt

Yeah, sorry for the delay, I'm wrapping it up.

tachyonicbytes avatar Sep 25 '23 07:09 tachyonicbytes

Any progress @tachyonicbytes ?

snork-alt avatar Sep 28 '23 19:09 snork-alt

Yes, I've updated to the new json schemas, as well as working on the new errors, like onnx does. With testing and all it should be ready in maybe a couple of days, maybe Sunday.

tachyonicbytes avatar Sep 28 '23 21:09 tachyonicbytes