
Show Map example in documentation?

Open bdklahn opened this issue 2 years ago • 0 comments

I'm so glad someone implemented Arrow for Julia. Thanks!

And I think the intro to the User Manual is the clearest I've come across to help understand the what and why of Arrow.

It looks like we can create a map array from a collection (Vector) of Dict items: https://github.com/apache/arrow-julia/blob/532b89b2c5740124cadca632a14ebb6cc9a0dca5/src/arraytypes/map.jl#L49-L65

I have been considering storing (caching, really) graph data in Arrow structures. My thought has been to create a Map of Int to Struct, where the Struct would define a node type. I saw that one can define and create an array of structs (by registering a custom type with the schema). Could someone post a quick example of creating an array of Dict, where the key is an Int and the value is a user-defined struct? My first attempt failed, but I think that might be because the Arrow schema needs to know about the Dict-of-node-struct type, not (only) the struct type.

Now, whether I should be doing this at all is another question, because I wonder:

  1. Will I lose some benefit of Arrow if (de)serialization will need to be done to convert between Arrow and Julia struct types?
  2. Does a Map type really give much benefit?

Someone here probably can easily answer the first one. Perhaps I am better off storing in more primitive types, then constructing my structs on ingress.
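To make the "primitive columns, structs on ingress" idea concrete, here is a minimal sketch in plain Python (not Julia, so that I do not have to guess at Arrow.jl calls). The `Node` type, its fields, the column names, and the `build_nodes` helper are all made up for illustration; the three lists stand in for columns that would be read from an Arrow table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical graph node type, used only for this illustration."""
    id: int
    label: str
    weight: float


# Stand-ins for three primitive columns of a node table.
ids = [10, 20, 30]
labels = ["a", "b", "c"]
weights = [0.5, 1.5, 2.5]


def build_nodes(ids, labels, weights):
    """Build an id -> Node lookup by walking the columns in lockstep."""
    return {i: Node(i, l, w) for i, l, w in zip(ids, labels, weights)}


nodes = build_nodes(ids, labels, weights)
```

The columns stay in simple homogeneous types, and the only conversion cost is the single pass that builds the lookup when the data is loaded.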

For the second, I am not sure what exactly a Map type gets you in terms of Arrow, since, as I understand it, everything is contiguous and read-only anyway. I.e., it is not like a Dict, which is hashed in and out of memory (right?). Does anyone know whether using the Arrow Map type does something like create separate, but linked, arrays to make indexing faster (implicitly?), because the keys and values can each live in their own homogeneously typed array? I think BadgerDB, for example, gets some performance benefit from separating key and value storage. I wonder if it is something like that. Maybe some implicit red-black tree logic is applied to (sorted) keys in any of the processing functions?
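For what it is worth, my reading of the Arrow columnar format specification is that a Map column is laid out like a list of key/value entries: one offsets buffer marking where each row's entries start and stop, plus one child array for all keys and one for all values. If that reading is right, the layout can be sketched in plain Python as below. The function names `to_map_layout` and `lookup` are mine, not any library's API, and the linear scan is only my assumption about how a lookup would work when nothing else is known about key order.

```python
# Three rows of a hypothetical Map column (one dict per row).
rows = [{1: "a", 2: "b"}, {}, {7: "c"}]


def to_map_layout(rows):
    """Flatten rows of dicts into offsets plus separate key and value arrays."""
    offsets = [0]
    keys, values = [], []
    for row in rows:
        for k, v in row.items():
            keys.append(k)
            values.append(v)
        # Row i owns the entries in keys/values[offsets[i]:offsets[i + 1]].
        offsets.append(len(keys))
    return offsets, keys, values


def lookup(offsets, keys, values, row, key):
    """Scan one row's slice of the key array; there is no hash table here."""
    for i in range(offsets[row], offsets[row + 1]):
        if keys[i] == key:
            return values[i]
    return None


offsets, keys, values = to_map_layout(rows)
```

With this layout, `offsets` is `[0, 2, 2, 3]`, `keys` is `[1, 2, 7]`, and `values` is `["a", "b", "c"]`. So the keys and values do end up in their own homogeneous arrays, but in this sketch nothing is hashed or tree-indexed; any faster lookup would have to come from the reader, for example by exploiting sorted keys.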

I also think Julia's visibility might benefit from having the clearest, most comprehensive set of examples (referenced) in the Arrow documentation. I think the Python examples there are currently the most complete and user-friendly of any language. I guess that must be because the Arrow folks implemented that library(?). I bet Julia folks could do even better, to make trying and using Julia for Arrow much more frictionless.

bdklahn, Aug 11 '22 16:08