unconf16
unconf16 copied to clipboard
Of bits and men and women (in the SEXP header)
The SEXP header takes up 32 bits of memory, even though it only has 31 bits of information in it (the reasons it does this are not important here, but they are good). I think we should use this unclaimed bit for something. Lets brainstorm ideas about simple, true/false information that it would be cool for R objects to hang onto about themselves, or other uses for a single bit.
Some ideas so far
- a sortedness bit - Have R vectors know when they have already been sorted, making extraneous sort() calls instantaneous. Merge calls, and other things, even some types of subsetting, could massively benefit from this as well, I think
- Splitting the header and data of a SEXP. This one is more complicated, but given a dedicated bit, we could make it possible for vectors to be views into other vectors that share the same memory. This would make certain things very cool and fast (no data copying for continuous, read-only subsets of R vectors)
Any other ideas?
P.S. yes, this will require changing R itself, but that is not the insurmountable obstacle it sounds. @lawremi brought @mattdowle's super fast radix sort into R recently. Furthermore, @lawremi himself was only invited onto R-core after implementing shallow copying in R. If we can show a big win, our ideas will be considered.
More of a question than a comment: I've heard several times over the years that the bits in SEXP structure are precious and we have to be conservative with the few bits left. Why is this? Why not just expand it to, say, 40 bits or 64 bits? I could see that a transition break things such as serialized objects, but it sounds like a problem that could be dealt with (e.g. that the remaining bit could be a backward-compatible flag indicating that it's a 64-bit SEXP structure we're dealing with).
Expanding the header is definitely on the table. It will likely happen this year and when it does it will almost surely include support for tracking sortedness and missingness (anyNA). The changes to memory management are much more significant and although some on R core is think the strategy is viable, there is no endorsement.
@HenrikBengtsson I'm quite excited about having a larger header as well, that is simply a wider-scale change and not one that will come from the outside. I'm simply proposing that we experiment with using the one bit that is there to do something useful as preparation for that, so we can already see if a particular thing seems to give wins or not.
I think there are powerful things we can do even with just one bit.
+1 for using it for sorting. Being able to do a binary rather than linear search could theoretically speed up all the comparison tests as well (although perhaps allocating the logical vectors takes up the majority of the time in those cases?)
To avoid allocation overhead, it should use an RLE to encode the response. Not sure base R is ready for the level of abstraction needed though.
Excited by the proposed ideas too!
Both sortedness and pointerizing DATAPTR sound great and I have no inkling which might be better if one has to be chosen. memcpy() is surprisingly fast for contiguous vectors (much faster than a for loop) so I'm not sure the speedup would be massive there. For lots of very small groups (say 1 or 2 row groups) just the call overhead of memcpy() itself might come into it - would be avoided by a pointer assign instead. Also it would have to be somehow known or prepared (and marked) that the data was contiguously grouped already (what setkey does in data.table).
On the other hand, using the bit for sortedness comes with the risk that somehow a state ends up that it is marked sorted when it is not, resulting in incorrect results. All writes are internal to R so that should be manageable though. Other than data.table, which does write by reference to vectors. So data.table would need to be changed to maintain the sortedness bit too - which we're happy to do of course. There is already a line which unsets the key on any := which touches a key column, so that could unset the sortedness flag in the header of any column too (whether a key column or not). We could add the code to data.table, push to CRAN and it would switch over whenever the change was made so as to not be in your way.
Pointerizing DATAPTR would allow mmap'd vectors. That'd be awesome. Package mmap already does similar and I don't know how ... I guess it uses EXTPTR. Going out-of-memory at the vector level might have more bang-for-the-buck (than sortedness bit).
Avoiding copies will of course save space in addition to time. It also enables more seamless integration with external data sources. Both R and the external system could operate on the same object without any conversion.
Btw, R already supports mmap()
of vectors, as long as the header is included. See allocVector3()
in memory.c. One can implement a custom R_allocator_t
. I think the mmap package is doing something else though.
On copy saving yes. I just had in mind dogroup.c in data.table which copies the groups into a static area as big as the largest group. Anything else which currently takes lots of copies would have a dramatic memory saving yes. A great thing.
Not seen allocVector3()
before - thanks. Will take a look.