Lukasz Stafiniak
Also, inside `arrayjit`, split off the high-level representation into its own small module. The original plan was to split off a library `nndarray` first, containing `Ndarray` and `Node`, with global state...
Fortunately, by their nature such arrays are not needed on the host, so we just need to make sure they are correctly initialized on the devices (i.e. the `from_host` direction)....
It's meant for both trainable parameters and non-trainable, in-place-updatable inputs.
This assumption simplifies optimizations; we already make it in some cases when inferring memory modes.
We still need to designate one of the devices as hosting the ndarray. A device can refuse to host. The interface is simply the device returning an ndarray -- only uniform-memory...
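A minimal sketch of what such an interface could look like. The `Device` signature, the `offer_hosting` name, and the `float array` stand-in for the real ndarray type are all assumptions for illustration, not the actual API:

```ocaml
(* Hypothetical sketch: a device either agrees to host a tensor's
   ndarray (uniform-memory devices only) or refuses with None. *)
type ndarray = float array (* stand-in for the real Ndarray.t *)

module type Device = sig
  val name : string
  val uniform_memory : bool

  (* Returns the hosted ndarray, or None if this device refuses. *)
  val offer_hosting : dims:int array -> ndarray option
end

module Cpu : Device = struct
  let name = "cpu"
  let uniform_memory = true

  let offer_hosting ~dims =
    if uniform_memory then
      Some (Array.make (Array.fold_left ( * ) 1 dims) 0.0)
    else None
end

(* Designate the first device willing to host the ndarray. *)
let designate_host devices ~dims =
  List.find_map (fun (module D : Device) -> D.offer_hosting ~dims) devices

let () =
  match designate_host [ (module Cpu : Device) ] ~dims:[| 2; 3 |] with
  | Some arr -> assert (Array.length arr = 6)
  | None -> assert false
```

Returning an `option` keeps refusal explicit without a separate capability-query step.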
That's what the doc says, but it works with regular OCaml strings.
[polars-ocaml is a project to provide idiomatic OCaml bindings to the Polars dataframe library.](https://github.com/mt-caret/polars-ocaml)
Environments grow too large -- there are too many row variables. Maybe also consider hashconsing for computing substitutions.
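The hashconsing idea could look roughly like the sketch below: structurally equal terms are allocated once and then compared by physical identity, so substitution results built from the same pieces are shared. The `row` type and constructor names are illustrative, not OCANNL's actual shape-inference representation:

```ocaml
(* Illustrative hashconsing via a table keyed on structural equality. *)
type row =
  | Empty
  | Var of int (* row variable *)
  | Dim of int * row (* a dimension consed onto a row *)

let table : (row, row) Hashtbl.t = Hashtbl.create 256

let hashcons (t : row) : row =
  match Hashtbl.find_opt table t with
  | Some shared -> shared (* reuse the existing allocation *)
  | None ->
      Hashtbl.add table t t;
      t

(* Smart constructor: all Dim cells go through the table. *)
let dim d r = hashcons (Dim (d, hashcons r))

let () =
  let a = dim 3 (Var 1) in
  let b = dim 3 (Var 1) in
  (* Physical equality, not just structural equality. *)
  assert (a == b)
```

With sharing in place, substitution can memoize on physical identity, which is what keeps repeated substitutions from blowing up.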
Implement as many optimizations as reasonable from these posts:
* [Fast Multidimensional Matrix Multiplication on CPU from Scratch](https://siboehm.com/articles/22/Fast-MMM-on-CPU)
* [How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a...
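The first optimization in the CPU post -- reordering the loops from i-j-k to i-k-j so the innermost accesses are contiguous in row-major order -- can be sketched as below. This is a toy illustration on flat `float array`s, not arrayjit's generated code:

```ocaml
(* Loop-reordered matmul: C (m x n) = A (m x k) * B (k x n), all
   row-major. With the p loop in the middle, the innermost j loop
   reads b and writes c contiguously, the first big cache win from
   the linked post. *)
let matmul ~m ~n ~k a b =
  let c = Array.make (m * n) 0.0 in
  for i = 0 to m - 1 do
    for p = 0 to k - 1 do
      let aip = a.((i * k) + p) in
      for j = 0 to n - 1 do
        c.((i * n) + j) <- c.((i * n) + j) +. (aip *. b.((p * n) + j))
      done
    done
  done;
  c

let () =
  (* [[1;2];[3;4]] * [[5;6];[7;8]] = [[19;22];[43;50]] *)
  let c = matmul ~m:2 ~n:2 ~k:2 [| 1.; 2.; 3.; 4. |] [| 5.; 6.; 7.; 8. |] in
  assert (c = [| 19.; 22.; 43.; 50. |])
```

Tiling, vectorization, and the CUDA-specific techniques from the second post layer on top of this same access-pattern reasoning.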
The `%op` extension keeps track of the to-be-tensor's label via `?ident_label`. We would need to modify:

```ocaml
| [%expr fun [%p? pat] -> [%e? body]] ->
    let vbs, body = ...
```