stringi icon indicating copy to clipboard operation
stringi copied to clipboard

Building in ALT-REP to stringi

Open traversc opened this issue 2 years ago • 19 comments

Have you considered implementing an ALT-REP string class? I think done properly, you'd see a large increase in performance across the board. There are many reasons why:

  • Simpler data structures compared to R's heavy CHARSXP and R's global string cache
  • Short string optimization
  • The possibility of true multithreading (you can't multithread R internals)

If there's interest, I'd be happy to develop and work on it.

To flesh it out a bit, I think you could use an ALT-REP class that's represented by standard STL structures:

std::vector<std::string>

You don't need to keep track of encoding, if you can assume UTF-8.

You'd probably want some global configuration parameter:

stri_use_alt_rep(bool)

You'd have to replace every interaction with R string memory with a conditional.

CHAR
SET_STRING_ELT
STRING_ELT
mkChar*
Rf_allocVector(STRSXP,...)

And replace any comparison of address for testing string equality (not sure if stringi does so).

There are probably things I'm forgetting and it's a lot of work, but I think clearly defined.

traversc avatar Mar 24 '22 20:03 traversc

I was actually thinking about giving stringi a major re-write for quite a long time. Now that the Windows-UCRT build of R assumes all strings are natively UTF-8, and the users are likely to transition to it in the future, it has started to make more sense.

And, yes, I am sure that ALTREP should be supported too, I could use your help in the future, thanks!

I took a look at your stringfish package – nice work! I was wondering what code did you use for generating the benchmarks summarised at https://github.com/traversc/stringfish/blob/master/vignettes/bench_v2.png ? I'd like to play with it a bit.

gagolews avatar Mar 25 '22 00:03 gagolews

Here's the code for that plot: https://github.com/traversc/stringfish/blob/master/inst/extra_tests/benchmark_test.r

(I updated it a bit to make it easier to include stringi in the plot output)

Glad you're interested in ALTREP support. I'll fork stringi and add in ALT-REP to a few functions, as a proof of concept.

traversc avatar Mar 25 '22 01:03 traversc

Okay, nice, I'll be happy to take a look at the prototype

I see that stringfish uses PCRE. To be a bit fairer, I think you should be comparing the timings for gsub and friends with perl=TRUE set. Can you generate the benchmark results featuring a base vs stringi vs stringfish comparison and post them here? Thanks

gagolews avatar Mar 25 '22 02:03 gagolews

Here's the plot setting perl = TRUE. Much less of a difference for regex substitution, but still lots of performance to be gained in various places. I believe the difference between the blue and yellow bars would be what you could expect to gain in stringi.

traversc avatar Mar 25 '22 06:03 traversc

So the most significant speed-up gain would be due to not using the R's CHARSXP cache? (for the 1-threaded version)

Another question: is your ALTREP-based stringfish framework compatible with read/saveRDS?

gagolews avatar Mar 26 '22 01:03 gagolews

Another question: is your ALTREP-based stringfish framework compatible with read/saveRDS?

Yes, all ALTREP frameworks are inherently compatible wiht read/saveRDS.

Here's the prototype: https://github.com/traversc/stringi

Benchmark code: https://gist.github.com/traversc/f357c5f1a4b0368649849dd3d1f49d14

Dataset used for benchmark:

cd ~/Downloads
wget https://data.deepai.org/enwik8.zip
unzip enwik8.zip

Running the benchmark:

Rscript bench_stringi.R
# 2.616 sec elapsed

Rscript bench_stringi_with_alt_rep.R
# 1.412 sec elapsed

# stringi CRAN version, comment out stri_use_alt_rep(FALSE)
Rscript bench_stringi.R
# 2.562 sec elapsed

PS: I think it's important to run the benchmark in separate R sessions for a true comparison. I've found that even after proper garbage collection, there seems to be a big difference running a command multiple times in a row.

But even if you run the benchmark in the same R session, ALTREP is faster.

traversc avatar Mar 27 '22 01:03 traversc

Thanks for the working prototype. I will get back to you some time later this year when I hope I will be able to come up with a prototype of a re-write (probably something a'la stringi2 - that would be a separate project) of the package. I definitely do not want to make any major changes is the upstream version of stringi, where stability is of great importance.

gagolews avatar Mar 29 '22 00:03 gagolews

Sounds good, looking forward to it!

traversc avatar Mar 29 '22 02:03 traversc

Just coming in here waving :wave: -- @traversc and I were chatting about possible fast and lightweight (enough) containers for string vectors. I am currently working a bit arrow objects that have character vectors (in their encoding of contiguous vector plus a vector of offsets, ie c("the", "black", "cat") becomes "theblackcat" and c(0,3,8,11) ) which is quick to pass around -- and it would be nice to have something quick and light and powerful to work with it. Of course once we mod the vector the contiguous blob may no longer be viable but at least for read access before that a std::vector<string_view> should be easy and could be ALTREP backed. Anyway, food for thought. Happy to help, time permitting.

eddelbuettel avatar May 17 '23 01:05 eddelbuettel

Might be a good one!

My three Aussie cents:

  • always use UTF-8
  • consider including null-bytes at the end of each substring (makes processing by other functions easier; the\0black\0cat\0 above

Possible issue worth considering: R's string cache is bypassed... (can be a good thing)

Would be nice to agree on a common representation across many packages.

gagolews avatar May 17 '23 05:05 gagolews

The c("the", "black", "cat") with c(0,3,8,11) ) is a given and cannot be changed. I do not know who started it but it seems common, it is how TileDB stores on disks / retrieves and also what arrow does. (There is one fine difference whether vector offsets is length n or n+1 as shown here. The latter is easier and preferred.

And no null termination anywhere :crying_cat_face: But that is what is out there and what would be most efficient to use.

Now, we could of course define another representation standard but that would start as an uphill battle with wind in our face :crying_cat_face:

eddelbuettel avatar May 17 '23 12:05 eddelbuettel

Both @eddelbuettel and I have run into bottlenecks dealing with large amounts of string processing. Existing ALTREP frameworks (e.g. in vrooom, arrow, etc.) don't really help because materialization happens too often, e.g. whenever you use dplyr or data.table.

So an ALTREP "common representation across many packages" is very much the goal and would be huge for the R community :)

Figuring out the very best optimal representation does not need to be a bottleneck to getting started. We can hide the implementation behind a set of access and modify API functions and test out various implementations without too much work.

Like @eddelbuettel I would also have some time to help.

PS: I believe the Rstudio folks would also be interested (and hopefully supportive)

traversc avatar May 17 '23 19:05 traversc

What about doing this at the R, not just package level? Maybe we should ping Tomas Kalibera and ask what he thinks about it... I'm a big proponent of unity..

gagolews avatar May 17 '23 22:05 gagolews

I always have Duncan Murdoch in the back of my ear: "if something can be done at the package level ..." It's just easier that way.

You raise a fair point. It may just make everything a tad harder to pull off.

eddelbuettel avatar May 17 '23 23:05 eddelbuettel

I agree.

We also need to think about how NA_character_s would be represented. NAs in the index vector (like c(0,3,NA,8,11))? But then it would slow down string extraction for "sparse" vectors (with many missing strings).

gagolews avatar May 18 '23 01:05 gagolews

We probably have to do what Arrow and others do with is an extra (bitmap) vector to signal it. nanoarrow has helpers.

(It took me some time to come around to this as I actually truly madly deeply love how R has NA/NaA in ints and chars (and bools (!!) etc). But these days interop is likely as if not more important and if we want to do this for 'medium data at scale' we probably have to go with the times and have Arrow interop anyway.)

I'd have to double check but I think in the offsets vector it then simply repeat the last position implying length zero of the NA element. But when nullabillity is set (Arrow makes that optional) then there is an additional vector flagging this.

eddelbuettel avatar May 18 '23 01:05 eddelbuettel

A zero-length string with an additional NA-marker makes sense to me too. This will not paralyse other algorithms.

gagolews avatar May 18 '23 02:05 gagolews

Hello all, I don't have any knowledge in C/C++ or other low-level languages (just dabbling a bit in Rust) but I thought you might be interested in the string type in polars, since they rewrote it very recently and had important performance gains: https://pola.rs/posts/polars-string-type/

I hope this fits in this conversation, and sorry for the spam otherwise

etiennebacher avatar Feb 06 '24 09:02 etiennebacher

Thanks!

gagolews avatar Feb 12 '24 00:02 gagolews