unconf16 Of bits and men and women (in the SEXP header)

The SEXP header takes up 32 bits of memory, even though it only has 31 bits of information in it (the reasons it does this are not important here, but they are good). I think we should use this unclaimed bit for something. Lets brainstorm ideas about simple, true/false information that it would be cool for R objects to hang onto about themselves, or other uses for a single bit.

Some ideas so far

a sortedness bit - Have R vectors know when they have already been sorted, making extraneous sort() calls instantaneous. Merge calls, and other things, even some types of subsetting, could massively benefit from this as well, I think
Splitting the header and data of a SEXP. This one is more complicated, but given a dedicated bit, we could make it possible for vectors to be views into other vectors that share the same memory. This would make certain things very cool and fast (no data copying for continuous, read-only subsets of R vectors)

Any other ideas?

P.S. yes, this will require changing R itself, but that is not the insurmountable obstacle it sounds. @lawremi brought @mattdowle's super fast radix sort into R recently. Furthermore, @lawremi himself was only invited onto R-core after implementing shallow copying in R. If we can show a big win, our ideas will be considered.

Mar 22 '16 15:03 gmbecker

More of a question than a comment: I've heard several times over the years that the bits in SEXP structure are precious and we have to be conservative with the few bits left. Why is this? Why not just expand it to, say, 40 bits or 64 bits? I could see that a transition break things such as serialized objects, but it sounds like a problem that could be dealt with (e.g. that the remaining bit could be a backward-compatible flag indicating that it's a 64-bit SEXP structure we're dealing with).

Mar 22 '16 20:03 HenrikBengtsson

Expanding the header is definitely on the table. It will likely happen this year and when it does it will almost surely include support for tracking sortedness and missingness (anyNA). The changes to memory management are much more significant and although some on R core is think the strategy is viable, there is no endorsement.

Mar 22 '16 21:03 lawremi

@HenrikBengtsson I'm quite excited about having a larger header as well, that is simply a wider-scale change and not one that will come from the outside. I'm simply proposing that we experiment with using the one bit that is there to do something useful as preparation for that, so we can already see if a particular thing seems to give wins or not.

I think there are powerful things we can do even with just one bit.

Mar 23 '16 13:03 gmbecker

+1 for using it for sorting. Being able to do a binary rather than linear search could theoretically speed up all the comparison tests as well (although perhaps allocating the logical vectors takes up the majority of the time in those cases?)

Mar 29 '16 18:03 jimhester

To avoid allocation overhead, it should use an RLE to encode the response. Not sure base R is ready for the level of abstraction needed though.

Mar 29 '16 23:03 lawremi

Excited by the proposed ideas too!

Both sortedness and pointerizing DATAPTR sound great and I have no inkling which might be better if one has to be chosen. memcpy() is surprisingly fast for contiguous vectors (much faster than a for loop) so I'm not sure the speedup would be massive there. For lots of very small groups (say 1 or 2 row groups) just the call overhead of memcpy() itself might come into it - would be avoided by a pointer assign instead. Also it would have to be somehow known or prepared (and marked) that the data was contiguously grouped already (what setkey does in data.table).

On the other hand, using the bit for sortedness comes with the risk that somehow a state ends up that it is marked sorted when it is not, resulting in incorrect results. All writes are internal to R so that should be manageable though. Other than data.table, which does write by reference to vectors. So data.table would need to be changed to maintain the sortedness bit too - which we're happy to do of course. There is already a line which unsets the key on any := which touches a key column, so that could unset the sortedness flag in the header of any column too (whether a key column or not). We could add the code to data.table, push to CRAN and it would switch over whenever the change was made so as to not be in your way.

Pointerizing DATAPTR would allow mmap'd vectors. That'd be awesome. Package mmap already does similar and I don't know how ... I guess it uses EXTPTR. Going out-of-memory at the vector level might have more bang-for-the-buck (than sortedness bit).

Mar 30 '16 08:03 mattdowle

Avoiding copies will of course save space in addition to time. It also enables more seamless integration with external data sources. Both R and the external system could operate on the same object without any conversion.

Btw, R already supports mmap() of vectors, as long as the header is included. See allocVector3() in memory.c. One can implement a custom R_allocator_t. I think the mmap package is doing something else though.

Mar 30 '16 09:03 lawremi

On copy saving yes. I just had in mind dogroup.c in data.table which copies the groups into a static area as big as the largest group. Anything else which currently takes lots of copies would have a dramatic memory saving yes. A great thing. Not seen allocVector3() before - thanks. Will take a look.

Mar 30 '16 09:03 mattdowle

unconf16 unconf16 copied to clipboard

Of bits and men and women (in the SEXP header)

unconf16
unconf16 copied to clipboard