
Encoding strings using enif_make_sub_binary

Open joshuataylor opened this issue 2 years ago • 5 comments

Returning a large number of strings can be very slow. See the discussion here: https://github.com/rusterlium/rustler/discussions/463.

There was a great idea there to use the offsets and values to create subbinaries. Doing this dramatically decreased the time it takes to return UTF-8 binaries back to Elixir.

Here is what I do:

    // This is goofy and stupid, but drastically speeds up converting binaries.
    // What we do here is take all the values/offsets, create a binary, then subslice into it.
    // @todo fix this later to be less terrible.
    let utf8 = series.utf8().unwrap();
    let len = utf8.len();
    let mut values = Vec::with_capacity(len);
    let mut offsets: Vec<usize> = Vec::with_capacity(len);
    let mut last_offset: usize = 0;

    for array in utf8.downcast_iter() {
        let mut offset_loop: usize = 0;
        for offset in array.offsets().iter().skip(1) {
            offset_loop = *offset as usize;
            offsets.push(last_offset + offset_loop);
        }

        last_offset += offset_loop;
        values.extend(array.values().as_slice());
    }

    // Can we make a binary straight from the slice?
    // This **might** be faster, but I think that allocating it this way is okay
    // as we know the length already?
    let mut values_binary = NewBinary::new(env, values.len());
    values_binary.copy_from_slice(values.as_slice());

    let binary: Binary = values_binary.into();

    // Now we slice into the binary.
    let mut last_offset: usize = 0;
    let mut binaries: Vec<NIF_TERM> = Vec::with_capacity(offsets.len());
    for offset in offsets {
        if last_offset == offset {
            // Zero-length slice: push the pre-encoded nil term from the
            // surrounding function instead of an empty subbinary.
            binaries.push(nil);
        } else {
            let slice =
                make_subbinary(env, &binary, last_offset, offset - last_offset)
                    .unwrap()
                    .as_c_arg();
            binaries.push(slice);
            last_offset = offset;
        }
    }

    binaries
    pub fn make_subbinary<'a>(
        env: Env<'a>,
        binary: &Binary,
        offset: usize,
        length: usize,
    ) -> NifResult<Term<'a>> {
        let min_len = length.checked_add(offset);
        if min_len.ok_or(Error::BadArg)? > binary.len() {
            return Err(Error::BadArg);
        }
        let term = binary.to_term(env);

        let raw_term = unsafe {
            rustler_sys::enif_make_sub_binary(
                term.get_env().as_c_arg(),
                term.as_c_arg(),
                offset,
                length,
            )
        };

        Ok(unsafe { Term::new(env, raw_term) })
    }
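As a sanity check of the flatten-then-slice bookkeeping above, here is a standalone Rust sketch with no rustler involved. It models each Arrow `Utf8` chunk as a values buffer plus an offsets array (the chunk layout here is made up for illustration), flattens them the same way, and slices the combined buffer back into per-string views:

```rust
// Standalone model of the flatten-then-slice trick, without any NIF
// machinery. Each "chunk" mimics an Arrow Utf8 array: a values buffer plus
// offsets where offsets[0] == 0 and offsets[i] ends string i-1.
fn flatten_and_slice(chunks: &[(&[u8], &[usize])]) -> Vec<String> {
    let mut values: Vec<u8> = Vec::new();
    let mut offsets: Vec<usize> = Vec::new();
    let mut last_offset = 0usize;

    for (chunk_values, chunk_offsets) in chunks {
        let mut offset_loop = 0usize;
        // Skip the leading 0, just like the snippet above.
        for &offset in chunk_offsets.iter().skip(1) {
            offset_loop = offset;
            offsets.push(last_offset + offset_loop);
        }
        last_offset += offset_loop;
        values.extend_from_slice(chunk_values);
    }

    // Slice the single flattened buffer back into per-string pieces.
    let mut out = Vec::with_capacity(offsets.len());
    let mut start = 0usize;
    for end in offsets {
        out.push(String::from_utf8(values[start..end].to_vec()).unwrap());
        start = end;
    }
    out
}

fn main() {
    // Two chunks: ["ab", "c"] and ["def"].
    let chunks: [(&[u8], &[usize]); 2] =
        [(b"abc", &[0, 2, 3]), (b"def", &[0, 3])];
    let strings = flatten_and_slice(&chunks);
    assert_eq!(strings, vec!["ab", "c", "def"]);
    println!("{:?}", strings);
}
```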

I'd be happy to create a PR for this.

joshuataylor avatar Jul 03 '22 04:07 joshuataylor

I don't know how Rustler works, but if we were writing C code, we should be using enif_make_resource_binary. The idea is that we would create a resource in Rust (I assume Arrow already keeps the strings as binary blobs, so that would be our resource) and then directly reference parts of said blobs.

The potential issue with this approach is that, if a single Erlang string sits across two Rust blobs, it won't work. In those very specific cases, though, we can allocate a new binary. This approach would be the one with the least amount of memory copying. Your proposed approach still has to copy the whole binary once, right?
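To make the idea concrete, here is a plain-Rust model of it (no NIF calls; the `Bin` type and blob layout are invented for illustration): hand out refcounted, zero-copy views into the existing buffer, and fall back to an owned allocation only for a string that straddles two blobs.

```rust
use std::sync::Arc;

// A toy model of the enif_make_resource_binary idea: each "binary" is either
// a zero-copy view into a refcounted blob, or an owned copy for the rare
// string that straddles two blobs.
enum Bin {
    Shared { blob: Arc<Vec<u8>>, start: usize, len: usize },
    Owned(Vec<u8>),
}

impl Bin {
    fn as_bytes(&self) -> &[u8] {
        match self {
            Bin::Shared { blob, start, len } => &blob[*start..*start + *len],
            Bin::Owned(v) => v,
        }
    }
}

fn main() {
    let blob = Arc::new(b"hello world".to_vec());

    // Zero-copy: just bump the refcount and remember the range.
    let hello = Bin::Shared { blob: Arc::clone(&blob), start: 0, len: 5 };
    assert_eq!(hello.as_bytes(), b"hello");

    // A string crossing blob boundaries gets its own allocation.
    let owned = Bin::Owned(b"wor".to_vec());
    assert_eq!(owned.as_bytes(), b"wor");

    // The blob stays alive as long as any shared view references it.
    assert_eq!(Arc::strong_count(&blob), 2);
}
```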

josevalim avatar Jul 03 '22 06:07 josevalim

Just to clarify: this is all using Rust. You basically create a binary, then use make_subbinary.

joshuataylor avatar Jul 03 '22 07:07 joshuataylor

Yes, I understand. My point is that, instead of creating a large binary and slicing it, you should slice the original Arrow binary/resource, without creating a new large binary. I just don't know how to do it in Rust. :) Does that make sense?

josevalim avatar Jul 03 '22 07:07 josevalim

Yep, it does. I don't know how to do that either, I'll see if it's "free" to get out of arrow2 and do this conversion. I think it is.

joshuataylor avatar Jul 03 '22 08:07 joshuataylor

https://github.com/rusterlium/rustler/issues/452 I raised this issue to expose it from the C code wrapper but haven't had a response and don't feel 100% confident on the approach. WDYT?

cigrainger avatar Jul 03 '22 20:07 cigrainger

So using enif_make_resource_binary is not practical because we don't have a single large binary.

We could point to each individual buffer as a resource, but that is blocked by #452, so for now I am using binary references. It makes strings 4-5x faster.

Thanks @joshuataylor!

josevalim avatar Sep 08 '22 15:09 josevalim

So the way Rustler works with Elixir is great if you're returning something small (a struct, a string, an integer, whatever). Want to return "hello" or 42 or something? That's fine; you'll have a great time with Rustler, it's a fantastic library.

The problem comes when returning large Vecs, which is all a Polars series really is.

If you have a series that can be copied instead of cloned, such as integers ([1, 2, 3, 4]), this is fine.

However, if you have a series containing strings, this becomes slow, as it has to encode each string. The subslicing trick I shared works well. Another alternative I tried was to return the entire buffer to Elixir and then slice it there using pattern matching (though I bet someone can make this wildly faster with Erlang pattern matching, from what I've read), but doing it that way was a fair bit slower than just doing it in Rust.

The other really painful part is encoding structs, which are really only used for dates and datetimes in Explorer, so we can do some tricks such as encoding the atom only once; the rest is fairly easy.

I believe there are some optimisations that can be done to improve these times without blowing the memory out of the water, especially around allocations and the way Rustler works. Explorer does some tricky things to return data (the encoding work is.. not great, so I'm really glad Jose is looking into this :)), and I believe some of this could be moved upstream. Depending on how Erlang expects data, I reckon we could use something like once-cell to do this in a static manner, which would drastically improve memory usage and speed up returning series back to Elixir.
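The encode-once idea can be sketched with the standard library's `std::sync::OnceLock` (a stdlib equivalent of what once-cell provides). The `FakeAtom` type and `expensive_encode` function below are stand-ins invented for illustration; in a real NIF the cached value would be the term produced by the atom-encoding call, which needs an env:

```rust
use std::sync::OnceLock;

// A stand-in for an encoded Erlang atom; in a real NIF this would be the
// term returned by the atom-encoding API, which requires an env.
#[derive(Debug, Clone, Copy, PartialEq)]
struct FakeAtom(u64);

fn expensive_encode(name: &str) -> FakeAtom {
    // Pretend this is a costly round-trip; here it's just a toy hash.
    FakeAtom(name.bytes().fold(0u64, |h, b| h.wrapping_mul(31).wrapping_add(b as u64)))
}

// Encode the atom once, then reuse the cached value for every row.
fn date_atom() -> FakeAtom {
    static DATE: OnceLock<FakeAtom> = OnceLock::new();
    *DATE.get_or_init(|| expensive_encode("date"))
}

fn main() {
    let first = date_atom();
    // Subsequent calls return the cached value without re-encoding.
    assert_eq!(date_atom(), first);
    println!("{:?}", first);
}
```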

Sidenote -- I've been far too obsessed with Rust the last few months and have been digging fairly heavily into the way Arrow2 and Polars do things, as I'm working on a side project that is a rewrite of dbt in Rust. My cold compile times for a static project are insane: <10ms, where most of it is parsing YAML (we have ~600 models, 590 of them static, meaning they don't need to do anything funky).

joshuataylor avatar Sep 09 '22 05:09 joshuataylor

@joshuataylor I have just learned that Polars uses its own reference-counting system. This means that we could directly point to the underlying Polars/Arrow binaries by using make_resource_binary, bumping the Arc refcount when the resource binary is created, and decreasing it when the binary is GCed.
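The refcount dance can be demonstrated with `std::sync::Arc` alone (this is only a model; hooking the decrement to the BEAM's GC would happen in a resource destructor):

```rust
use std::sync::Arc;

fn main() {
    // The Polars/Arrow buffer, kept alive by its own refcount.
    let buffer: Arc<Vec<u8>> = Arc::new(b"some column data".to_vec());
    assert_eq!(Arc::strong_count(&buffer), 1);

    // Creating a resource binary would bump the refcount...
    let held_by_beam = Arc::clone(&buffer);
    assert_eq!(Arc::strong_count(&buffer), 2);

    // ...and the resource destructor (run when the binary is GCed)
    // would drop it again, freeing the buffer once nobody needs it.
    drop(held_by_beam);
    assert_eq!(Arc::strong_count(&buffer), 1);
}
```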

Something else to explore to reduce allocations is allowing Iterators to be converted to lists. I believe today we allocate a vector and then allocate a list, but I keep wondering if we can iterate in reverse and allocate the linked list in one pass.
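The reverse one-pass build can be sketched in plain Rust with a toy cons list (the `List` type here is invented; in a NIF the cell constructor would be the list-cell API and the empty list term would play the role of `Nil`):

```rust
// A toy cons list standing in for Erlang's linked list.
#[derive(Debug, PartialEq)]
enum List {
    Cons(i64, Box<List>),
    Nil,
}

// Build the list in a single pass by folding over the iterator in reverse,
// prepending each element -- no intermediate Vec needed when the iterator
// is double-ended (as slice iterators are).
fn list_from_iter<I>(iter: I) -> List
where
    I: DoubleEndedIterator<Item = i64>,
{
    iter.rev().fold(List::Nil, |tail, head| List::Cons(head, Box::new(tail)))
}

fn main() {
    let list = list_from_iter([1, 2, 3].into_iter());
    // The result preserves the original order: [1, 2, 3].
    assert_eq!(
        list,
        List::Cons(1, Box::new(List::Cons(2, Box::new(List::Cons(3, Box::new(List::Nil))))))
    );
}
```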

Unfortunately both of these are above my paygrade. :)

josevalim avatar Sep 09 '22 12:09 josevalim