
Adding derive macros to parse structs to bins and values

Open jonas32 opened this issue 3 years ago • 34 comments

This PR adds the ToBins and ToValue derive macros. Manually parsing large structs into Aerospike Bins or Values currently requires a lot of boilerplate.

The macros can simply be attached to any struct. They add a to_bins or to_value function. The to_bins function returns a Vec<Bin> that can be passed to write functions like client::put. The to_value function builds a HashMap-based Value for the struct.

ToBins is based on ToValue, so sub-structs are also parsed. The ToValue trait has implementations for the common data types, including variations with HashMaps and Vecs (if I forgot any, please tell me).

I also added a client::put_struct method to make it more convenient.
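To make the idea concrete, here is a hand-rolled sketch of roughly what the generated code amounts to. The Value and Bin types below are simplified stand-ins, and the field names are hypothetical; the real crate's types and the actual macro expansion differ.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the client's Value and Bin types (assumed shapes).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    String(String),
    HashMap(HashMap<String, Value>),
}

#[allow(dead_code)]
struct Bin {
    name: String,
    value: Value,
}

// A struct the user would annotate with the derive macros.
struct User {
    id: i64,
    name: String,
}

impl User {
    // Roughly what #[derive(ToBins)] would generate: one Bin per field.
    fn to_bins(&self) -> Vec<Bin> {
        vec![
            Bin { name: "id".into(), value: Value::Int(self.id) },
            Bin { name: "name".into(), value: Value::String(self.name.clone()) },
        ]
    }

    // Roughly what #[derive(ToValue)] would generate: a map-valued Value.
    fn to_value(&self) -> Value {
        let mut map = HashMap::new();
        map.insert("id".to_string(), Value::Int(self.id));
        map.insert("name".to_string(), Value::String(self.name.clone()));
        Value::HashMap(map)
    }
}

fn main() {
    let user = User { id: 1, name: "alice".into() };
    let bins = user.to_bins();
    assert_eq!(bins.len(), 2);
    assert_eq!(bins[0].name, "id");
    assert!(matches!(user.to_value(), Value::HashMap(_)));
}
```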

Caution, this branch is based on #108 (Async Client)

jonas32 avatar Sep 11 '22 01:09 jonas32

Wow, this is such a good PR that it makes me consider merging the whole thing before I finish my changeset, just for the heck of it. What came of your experiments with your new async changeset?

khaf avatar Sep 11 '22 01:09 khaf

We migrated everything to the async branch and haven't experienced any problems yet. That's also the reason why my latest PRs are based on the async branch. Having it based on the current release would make it unavailable for our stuff.

I never ran a general, independent benchmark since I lacked the time, but the performance I see from our services looks fine.

My use case (and I guess that counts for many other users) involves running this behind a bigger framework like actix. In that case, actix manages the event loops and pools, so the client will run on multiple loops. It might still make sense to add a feature flag where the client does that itself, but that's probably a lot of work again. Even on a single loop, the client is still at least close to the old sync implementation in terms of performance, and depending on the system probably faster. But there is potential for more.

There is also more potential to this PR. Currently, all functions of the Client expect Value-typed parameters. Value itself also implements ToValue, so it would be possible to change a lot of the functions to accept anything that implements ToValue without breaking the code base even more.
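A minimal sketch of that idea, with a stand-in Value enum and an assumed ToValue trait shape (the function name append_value is hypothetical): because Value implements the trait too, existing call sites keep compiling.

```rust
// Stand-in for the client's Value type (simplified).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    String(String),
}

// Assumed shape of the ToValue trait from this PR.
trait ToValue {
    fn to_value(&self) -> Value;
}

impl ToValue for i64 {
    fn to_value(&self) -> Value { Value::Int(*self) }
}

impl ToValue for String {
    fn to_value(&self) -> Value { Value::String(self.clone()) }
}

// Value converts to itself, so callers already passing Values still work.
impl ToValue for Value {
    fn to_value(&self) -> Value { self.clone() }
}

// A client method could then accept anything implementing ToValue:
fn append_value(v: &impl ToValue) -> Value {
    v.to_value()
}

fn main() {
    assert_eq!(append_value(&42i64), Value::Int(42));
    assert_eq!(append_value(&Value::Int(7)), Value::Int(7));
}
```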

About the failing CI tests: looks like the expressions module is broken... Has there been any server change that could cause that? I don't see that error with Aerospike v5.

jonas32 avatar Sep 11 '22 12:09 jonas32

Any chance you have a Value -> T deserializer as well?

databasedav avatar Dec 22 '22 06:12 databasedav

can we merge this into the async branch?

databasedav avatar May 14 '23 07:05 databasedav

@databasedav I'm not so sure that would be a good idea. @jonas32 has taken a practical route in this PR, converting structs to intermediate Bins and Values, which keeps things simple but allocates a lot of single-use objects on the heap. He also had to introduce a new put_struct API method, which I'm not a fan of. I'd rather have only one method for everything, especially now that we are releasing a major new version. I know it's annoying to wait for the better solution while the practical one already exists, but please give me some time to see if I can engineer a more efficient solution first.

khaf avatar May 14 '23 12:05 khaf

I'm going to improve this one to make it less memory heavy. Did you already start to implement a method for generic input on the value part? Otherwise I would add it here to reduce it to one function again.

jonas32 avatar May 14 '23 12:05 jonas32

Thanks @jonas32. I have not done anything in this regard, but I don't think this approach would yield the desired outcome, although any improvements are welcome. I think the best way to go about it is to have traits like e.g. BinWireWriter and BinWireReader, so that they would read and write directly from the buffer without needing the intermediate representation. We could even have metadata fields that way, allowing generation, expiration and key to be read and written. Of course we'll provide derive macros to automate the process. That approach would be stack-only and would not need to allocate memory at all, outside of what's necessary for the data (strings, vecs, maps, etc.). We can implement those traits for BinMap so that it is also a BinWireWriter, and then change put so that it takes that trait as input.
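A rough sketch of the writer side of that idea. The trait name BinWireWriter comes from the discussion; the byte layout below is a toy encoding invented for illustration, not the real Aerospike wire format.

```rust
// Hypothetical trait: serialize all bins straight into the command buffer,
// skipping the intermediate Vec<Bin> entirely.
trait BinWireWriter {
    fn write_bins(&self, buf: &mut Vec<u8>);
}

struct User {
    id: i64,
}

impl BinWireWriter for User {
    fn write_bins(&self, buf: &mut Vec<u8>) {
        // Toy encoding: 1-byte name length, name bytes, 8-byte big-endian value.
        let name = b"id";
        buf.push(name.len() as u8);
        buf.extend_from_slice(name);
        buf.extend_from_slice(&self.id.to_be_bytes());
    }
}

fn main() {
    let mut buf = Vec::new();
    User { id: 5 }.write_bins(&mut buf);
    assert_eq!(buf[0], 2);         // name length
    assert_eq!(&buf[1..3], b"id"); // name bytes
    assert_eq!(buf.len(), 11);     // 1 + 2 + 8
}
```

A derive macro would generate such an impl per struct, so no intermediate Bin or Value objects are allocated on the heap.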

khaf avatar May 14 '23 17:05 khaf

I'm going to improve this one to make it less memory heavy. Did you already start to implement a method for generic input on the value part? Otherwise I would add it here to reduce it to one function again.

Something that just occurred to me: using const generics to return an array instead of a Vec. That would be allocated on the stack and would theoretically be much more efficient. The downsides are longer compile times and bloated code.
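A minimal sketch of the const-generic variant, with a simplified Bin stand-in (the trait name ToBinsArray is hypothetical): the bin count becomes part of the type, so the result lives on the stack.

```rust
// Simplified stand-in Bin; the real type differs.
#[allow(dead_code)]
struct Bin {
    name: &'static str,
    value: i64,
}

// Hypothetical const-generic trait: N is the number of bins, known at compile time.
trait ToBinsArray<const N: usize> {
    fn to_bins(&self) -> [Bin; N];
}

struct Point {
    x: i64,
    y: i64,
}

impl ToBinsArray<2> for Point {
    fn to_bins(&self) -> [Bin; 2] {
        // A fixed-size array: no heap allocation, unlike Vec<Bin>.
        [
            Bin { name: "x", value: self.x },
            Bin { name: "y", value: self.y },
        ]
    }
}

fn main() {
    let bins = Point { x: 1, y: 2 }.to_bins();
    assert_eq!(bins.len(), 2);
    assert_eq!(bins[1].value, 2);
}
```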

khaf avatar May 14 '23 17:05 khaf

The BinWireWriter idea sounds good to me. So the API functions would accept the BinWireWriter and BinWireReader traits as parameters instead of the usual BinMap and Values, and we would just implement the traits for the existing Value and BinMap structs/enums?

Sounds possible, but I'm not sure if that's in scope for this PR. Maybe it's worth a second PR to first introduce the traits and afterwards patch this one. If you want, I can start working on the traits first.

jonas32 avatar May 14 '23 17:05 jonas32

The BinWireWriter idea sounds good to me. So the API functions would accept the BinWireWriter and BinWireReader traits as parameters instead of the usual BinMap and Values, and we would just implement the traits for the existing Value and BinMap structs/enums?

Exactly, although I guess we probably have to come up with better names.

Sounds possible, but I'm not sure if that's in scope for this PR. Maybe it's worth a second PR to first introduce the traits and afterwards patch this one. If you want, I can start working on the traits first.

I agree. I was just bouncing ideas off you and the rest of the community, not specifically asking for changes in this PR. With this new scheme, we could even alias struct fields to custom bin names, and add a lot more capabilities.

khaf avatar May 14 '23 22:05 khaf

Yes, I think we could use the serde crate and its derive features as a reference. I agree the names are a bit too technical. Most users probably can't do much with "Wire" if they have never worked on the client or server. I would suggest:

  • a general derive macro to mark structs as Read/Write for the client
  • general macros for record metadata (or maybe via dedicated function params?)
  • an additional macro on field level to rename fields
  • an additional macro on field level to specify the Value type (a HashMap as GeoJSON, for example)
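For the rename case, here is what the serde-style attribute surface could look like and a hand-rolled equivalent of what it might expand to. The attribute syntax, derive name, and Bin stand-in are all hypothetical.

```rust
// Hypothetical attribute surface, serde-style:
//
//     #[derive(WritableBins)]
//     struct User {
//         #[aerospike(rename = "name")]
//         user_name: String,
//     }
//
// Hand-rolled equivalent of the expansion such a macro might produce:
struct Bin {
    name: &'static str,
    value: String,
}

struct User {
    user_name: String,
}

impl User {
    fn to_bins(&self) -> Vec<Bin> {
        // The rename attribute swaps the Rust field name for the bin name.
        vec![Bin { name: "name", value: self.user_name.clone() }]
    }
}

fn main() {
    let bins = User { user_name: "alice".into() }.to_bins();
    assert_eq!(bins[0].name, "name");
    assert_eq!(bins[0].value, "alice");
}
```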

any other suggestions?

jonas32 avatar May 14 '23 22:05 jonas32

How should we handle structs that can't be parsed by the derive macro or need custom logic? I think the trait function will need the buffer as a parameter, so in theory users could just encode it manually. But I'm not sure whether that's too complicated.

jonas32 avatar May 15 '23 22:05 jonas32

Yes, it needs the buffer, and that's the best part. For writes, we could provide a write_bin(&Bin) method on the buffer which would take care of the rest. For reads, we need a read_bins(fields) method that iterates over the number of fields and calls a read_bin -> Bin method returning the next bin, so that the struct can set it. The only issue is that I converted the Bin.name type to String, which probably allocates on the heap.
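To illustrate the read side, here is a read_bin counterpart for the toy encoding used earlier in this thread's sketches (1-byte name length, name bytes, 8-byte big-endian i64). This format and the function signature are invented for illustration, not the real wire protocol.

```rust
// Decode one bin from the buffer; returns (name, value, bytes consumed)
// so a read_bins loop can advance through the buffer bin by bin.
fn read_bin(buf: &[u8]) -> (String, i64, usize) {
    let name_len = buf[0] as usize;
    // This to_vec + String allocation is exactly the heap cost mentioned
    // above for String-typed bin names.
    let name = String::from_utf8(buf[1..1 + name_len].to_vec()).unwrap();
    let start = 1 + name_len;
    let mut v = [0u8; 8];
    v.copy_from_slice(&buf[start..start + 8]);
    (name, i64::from_be_bytes(v), start + 8)
}

fn main() {
    // Encode one bin "id" = 7 in the toy format.
    let mut buf = vec![2u8];
    buf.extend_from_slice(b"id");
    buf.extend_from_slice(&7i64.to_be_bytes());

    let (name, value, consumed) = read_bin(&buf);
    assert_eq!(name, "id");
    assert_eq!(value, 7);
    assert_eq!(consumed, buf.len());
}
```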

I looked around yesterday and found this library very useful for understanding how this could be implemented:

https://github.com/not-fl3/nanoserde

khaf avatar May 15 '23 22:05 khaf

I'm thinking about adding a second abstraction layer over the buffer. That could provide "easy" functions for wire writing instead. If we just add the write_bin function to the buffer, the user is still confronted with all the parsing and writing logic in the buffer implementation. I see a big potential for mistakes if users reach for the "lower level" functions instead of the right ones and encode invalid data. But that would introduce overhead again.

jonas32 avatar May 15 '23 22:05 jonas32

I don't see how the user would have access to any of the lower level API on the buffer. Aren't they all private? The only public API would be those required to read and write Bins. What am I missing?

khaf avatar May 15 '23 22:05 khaf

No, a lot of them are public. The encoder for operations works that way. https://github.com/aerospike/aerospike-client-rust/blob/master/src/commands/buffer.rs#L1366 https://github.com/aerospike/aerospike-client-rust/blob/master/src/operations/mod.rs#LL124C30-L124C30

If we give the user the ability to directly interact with the buffer instance, they can also access them.

jonas32 avatar May 15 '23 23:05 jonas32

Wow, what a noob mistake on my part. We need to mark them pub(crate) and call it a day, I guess. I'll go through the API visibility in the library at some point and make sure these mistakes are addressed before release.

Now that I think of it, we probably do not need to pass Bins to these newly discussed APIs. Simple tuples would do.

khaf avatar May 16 '23 10:05 khaf

I experimented a bit with how this could be done. Maybe you want to give it some thought before I continue implementing it everywhere. In this version, I completely bypass the original encoders and do the encoding entirely in the trait implementations. The primary focus was to deprecate Bins and Value and work only in this style. But to keep the API more or less stable, it's easily possible to also implement the traits on Bins and Values, like I did in traits.rs.

Currently this is only implemented for WriteCommand. There is a unit test that passes for it, but most other tests are probably broken now.

Here you see the traits for Bins and Values as writers. At first I wanted to make it one trait, but that gives a lot of potential for passing wrong arguments to some functions. https://github.com/jonas32/aerospike-client-rust/blob/async-derive/aerospike-core/src/traits.rs The biggest disadvantage, in my opinion, is having two writer functions on values. But I haven't found a way to solve this without a Value step in between yet, since Values are packed differently when they are in a CDT context...

Client is modified to directly accept the Trait instances: https://github.com/jonas32/aerospike-client-rust/blob/async-derive/aerospike-core/src/client.rs#L268

This is the derive part that builds the functions. Nearly all allocations are determined at compile time now. https://github.com/jonas32/aerospike-client-rust/blob/async-derive/aerospike-macro/src/traits/writable.rs

In the end, it can be used like this: https://github.com/jonas32/aerospike-client-rust/blob/async-derive/tests/src/derive.rs

The code is not beautiful yet and not really optimized. It's just a working PoC version.

jonas32 avatar May 31 '23 01:05 jonas32

I experimented a bit with how this could be done. Maybe you want to give it some thought before I continue implementing it everywhere. In this version, I completely bypass the original encoders and do the encoding entirely in the trait implementations. The primary focus was to deprecate Bins and Value and work only in this style. But to keep the API more or less stable, it's easily possible to also implement the traits on Bins and Values, like I did in traits.rs.

Thank you @jonas32 , this is more or less what I had in mind.

The biggest disadvantage, in my opinion, is having two writer functions on values. But I haven't found a way to solve this without a Value step in between yet, since Values are packed differently when they are in a CDT context...

I don't think there's a clean way around this, but you could have one writer with a parameter that is passed to it selecting one or the other if you wanted. I prefer the current distinction myself.

Client is modified to directly accept the Trait instances: https://github.com/jonas32/aerospike-client-rust/blob/async-derive/aerospike-core/src/client.rs#L268

Great. I'd like to avoid having a different set of APIs as much as we can.

This is the derive part that builds the functions. Nearly all allocations are determined at compile time now. https://github.com/jonas32/aerospike-client-rust/blob/async-derive/aerospike-macro/src/traits/writable.rs

In the end, it can be used like this: https://github.com/jonas32/aerospike-client-rust/blob/async-derive/tests/src/derive.rs

The code is not beautiful yet and not really optimized. It's just a working PoC version.

I believe this is a step in the right direction and we are closer to the final design now. We have to come up with better names for the macros and traits though.

There are a few more ideas I have that we may want to pursue. Off the top of my head:

  • Having modifier macros on the fields themselves. Things like field aliases and type conversions (Aerospike is schemaless, so if someone changed a type from int to float, or date strings to unix timestamps, etc., this could help them a ton with backwards compatibility, assuming we have a general custom trait for the conversion as well).
  • We may want to allow easy integration with data validation libraries (or do we?)
  • Metadata field markers (TTL/Expiration, Generation)
  • Out of the box std::Option<T> support?
  • Flattening the structs: if we have an embedded struct, it would flatten the fields into the upper container (recursively) on write, and lift them back up during read. (Apparently this helps with refactoring and is pretty popular in the Go world.)
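To make the flattening idea concrete, here is a hand-rolled sketch of the flattened write path. The #[aerospike(flatten)] attribute mentioned in the comment is hypothetical, and bins are modeled as a simple string map for illustration.

```rust
use std::collections::HashMap;

struct Address {
    city: String,
    zip: String,
}

struct User {
    name: String,
    address: Address,
}

impl User {
    // With a hypothetical #[aerospike(flatten)] on `address`, its fields are
    // lifted into the parent's bin set instead of nesting as one map-valued bin.
    fn to_bins_flattened(&self) -> HashMap<String, String> {
        let mut bins = HashMap::new();
        bins.insert("name".to_string(), self.name.clone());
        bins.insert("city".to_string(), self.address.city.clone());
        bins.insert("zip".to_string(), self.address.zip.clone());
        bins
    }
}

fn main() {
    let u = User {
        name: "alice".into(),
        address: Address { city: "berlin".into(), zip: "10115".into() },
    };
    let bins = u.to_bins_flattened();
    assert_eq!(bins.len(), 3);          // flattened: three top-level bins
    assert_eq!(bins["city"], "berlin"); // no nested map in between
}
```

On read, the derive would do the inverse: collect the flattened bins back into the embedded struct.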

P.S. Just a little nit: I changed the [Bin; X] definitions into a single const generic in the async branch.

khaf avatar May 31 '23 12:05 khaf

Having modifier macros on the fields themselves

Yes, definitely. This will be needed. Edit: check out the links again, they work now.

We may want to allow easy integration with data validation libraries

I'm not sure if that's in scope for the normal lib. I feel like this is the job of a webserver etc. rather than the DB layer. But maybe we could think about adding it behind a feature flag?

Out of the box std::Option<T> support?

Yes, that will come too. I've already worked on it a little. I think I'll introduce another function to check whether a bin should be parsed at all. That would make Result or Option easy to build.
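A tiny sketch of that per-field check, under assumed names (the trait MaybeBin and method should_write are hypothetical): a blanket impl for Option lets the generated code simply skip None fields.

```rust
// Hypothetical per-field hook: should this field be written as a bin at all?
trait MaybeBin {
    fn should_write(&self) -> bool;
}

// Blanket impl: None fields are skipped, Some fields are written.
impl<T> MaybeBin for Option<T> {
    fn should_write(&self) -> bool {
        self.is_some()
    }
}

fn main() {
    let some: Option<i64> = Some(1);
    let none: Option<i64> = None;
    assert!(some.should_write());
    assert!(!none.should_write()); // generated code would skip this bin
}
```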

Flattening the structs

Maybe as an additional field modifier? I'm not a big fan of that, since I think parsing structs from and to maps is a huge use case; for me probably the primary one, since building nested structures manually on Values is not very friendly right now.

jonas32 avatar May 31 '23 16:05 jonas32

@jonas32 Can WritableBins/rename include a compile-time check that the bin name does not exceed the 15-byte maximum? I haven't had the chance to try it out yet and wasn't able to find a check like that here (although I just checked around the panic!s).

databasedav avatar Jun 01 '23 00:06 databasedav

Great idea, I didn't think of that yet. Will come!
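One way such a check could look, sketched with a const fn rather than inside the proc macro (where the real check would live): the build fails if a bin name exceeds the 15-byte limit. The helper name check_bin_name is hypothetical.

```rust
// Evaluated at compile time: a too-long name makes the const fail to compile.
const fn check_bin_name(name: &str) -> &str {
    assert!(name.len() <= 15, "bin name exceeds 15 bytes");
    name
}

const BIN_NAME: &str = check_bin_name("user_name"); // compiles: 9 bytes
// const BAD: &str = check_bin_name("a_far_too_long_bin_name"); // would fail to build

fn main() {
    assert_eq!(BIN_NAME, "user_name");
}
```

In the derive macro itself, the simpler route is to measure the (possibly renamed) field name while generating code and emit a compile error via the macro's diagnostics.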

jonas32 avatar Jun 01 '23 01:06 jonas32

I just found another problem while working on the read macro. Stream reads are processed differently from normal reads. A normal read initially reads the full buffer and then works on it. A stream read does not do that; it reads the buffer piece by piece as the info is needed, discarding all the previous data. If I want to do it without copying parts of the buffer and allocating twice, I need to keep the full buffer until the record is parsed. After that, it can be deleted. @khaf do you see any potential problems with that change? Edit: alternatively, the normal read could be changed to read piece by piece.

As far as I can see, the required full-size information should be included in this part of the buffer: https://github.com/aerospike/aerospike-client-rust/blob/async/aerospike-core/src/commands/stream_command.rs#L67

jonas32 avatar Jun 04 '23 16:06 jonas32

Just to clarify: Do you mean that you keep the buffer engaged until the record is dropped to avoid memory allocation at all cost?

khaf avatar Jun 05 '23 09:06 khaf

No, the record is not getting dropped; I just need the buffer until the Record is fully parsed. But I've already concluded that it's better the other way around: changing the normal read command to smaller buffers. Now the smaller buffers with just the value data inside get cloned, since they are allocated anyway and the original buffer discards them right after the clone. That should result in nearly the same memory usage and makes it a lot easier.

jonas32 avatar Jun 05 '23 11:06 jonas32

@jonas32 I see, you want to move to a streaming paradigm to reduce the memory usage. That's fantastic, but with caveats:

  • It does not really work for strings and blobs unless you have a way to allocate their buffers in advance. We need some kind of string builder that allows pre-allocation.
  • It increases the number of times you'll have to read from the socket, which would increase the number of syscalls. That significantly increases the latency as I learned in the Go client. I had to implement a kind of a buffered chunk reader to reduce the number of syscalls. The way I did it was to have a fixed buffer, and then when running out of data, that reader would read a big chunk ( max(required, min(buf size, message size)) ). If Rust supports increasing the buffer size of the socket, you could go that way and just let it buffer for you internally and save yourself some headache, but I have not checked the docs myself.
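The buffered chunk reader described above can be sketched as follows. This is a toy illustration of the refill policy max(required, min(buf size, message size)), not the Go client's actual implementation; names and the refill bookkeeping are assumptions, and it presumes callers never request past the end of the message.

```rust
use std::io::{Cursor, Read};

// Fixed-capacity reader: when it runs dry, it pulls one big chunk from the
// source instead of issuing one read (syscall) per decoded field.
struct ChunkReader<R: Read> {
    inner: R,
    buf: Vec<u8>,
    pos: usize,       // read cursor within buf
    unfetched: usize, // message bytes not yet pulled from the source
}

impl<R: Read> ChunkReader<R> {
    fn new(inner: R, buf_size: usize, message_size: usize) -> Self {
        ChunkReader { inner, buf: Vec::with_capacity(buf_size), pos: 0, unfetched: message_size }
    }

    // Return `required` bytes, refilling with max(required, min(capacity, remaining)).
    fn read_exact_buffered(&mut self, required: usize) -> Vec<u8> {
        if self.buf.len() - self.pos < required {
            let chunk = required.max(self.buf.capacity().min(self.unfetched));
            let mut tmp = vec![0u8; chunk];
            self.inner.read_exact(&mut tmp).unwrap();
            self.unfetched -= chunk;
            self.buf.drain(..self.pos); // drop already-consumed prefix
            self.pos = 0;
            self.buf.extend_from_slice(&tmp);
        }
        let out = self.buf[self.pos..self.pos + required].to_vec();
        self.pos += required;
        out
    }
}

fn main() {
    let data: Vec<u8> = (0u8..32).collect();
    let mut r = ChunkReader::new(Cursor::new(data), 16, 32);
    assert_eq!(r.read_exact_buffered(4), vec![0, 1, 2, 3]); // one 16-byte refill
    assert_eq!(r.read_exact_buffered(4), vec![4, 5, 6, 7]); // served from buffer
}
```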

I would love to see the streaming API implemented; that would allow us to have constant-sized, small buffers and get rid of one of the most annoying parts of client tuning. Just as great would be supporting this when sending commands as well, circumventing buffer allocation, sizing, resizing, pooling and deallocation. My initial design of the client was meant to support this (that's why the buffer struct is there), but we deviated from that idea at some point.

P.S. Calling socket.read in an async system is even worse than a plain syscall. You'll yield from the coroutine to the runtime scheduler, with all the consequences.

khaf avatar Jun 05 '23 13:06 khaf

Per this post, the first issue is easily fixable: https://stackoverflow.com/questions/19076719/how-do-i-convert-a-vector-of-bytes-u8-to-a-string
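The conversion the linked post describes: String::from_utf8 takes ownership of the byte buffer and turns it into a String without copying when the bytes are valid UTF-8, which is what allows reading straight into the string's eventual backing buffer.

```rust
fn main() {
    // Zero-copy: the Vec's allocation becomes the String's allocation.
    let bytes: Vec<u8> = vec![104, 101, 108, 108, 111];
    let s = String::from_utf8(bytes).expect("valid UTF-8");
    assert_eq!(s, "hello");

    // Lossy variant for wire data that may not be valid UTF-8:
    // invalid sequences become U+FFFD replacement characters.
    let s2 = String::from_utf8_lossy(&[0xff, b'a']);
    assert_eq!(s2, "\u{fffd}a");
}
```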

khaf avatar Jun 05 '23 15:06 khaf

I'm still fully relying on the Buffer and the encoder/decoder methods that are in place, so this should already be solved: https://github.com/aerospike/aerospike-client-rust/blob/async/aerospike-core/src/commands/buffer.rs#L1288 I already tried it that way and it seems to be fine.

I also thought about the consequences of reading so often in small pieces. I'll test some things and check whether we can improve performance there.

jonas32 avatar Jun 05 '23 20:06 jonas32

I'm still fully relying on the Buffer and the encoder/decoder methods that are in place, so this should already be solved: https://github.com/aerospike/aerospike-client-rust/blob/async/aerospike-core/src/commands/buffer.rs#L1288 I already tried it that way and it seems to be fine.

This still requires an intermediary buffer to read into. My solution would allow reading directly into the string builder's buffer, and would remove the need to have a big intermediary buffer completely. All other data types would require mere bytes of buffer to read, and would work with buffers allocated on stack. Anyway, I still do not understand what you meant, since in the streaming implementation we also read chunks into the buffer and decode them off of it.

I also thought about the consequences of reading so often in small pieces. I'll test some things and check whether we can improve performance there.

I think you should first take the naive approach. Improve after you hit performance issues.

khaf avatar Jun 06 '23 10:06 khaf

Ah OK, now I get what you mean with the buffer strings. I think it's a question of usability vs. memory again. If we pre-parse strings at that point, the trait implementations can't all work with the Buffer, but rather with enums. At that point, we basically start reintroducing Value.rs. Additionally, it would no longer work if a user wants to custom-parse string data into their own datatype.

jonas32 avatar Jun 06 '23 10:06 jonas32