Marek Gagolewski comments

Results: 118 comments of Marek Gagolewski

Split string into pieces of fixed length and computing q-grams

Related idea (not yet implemented): #31. But yeah, the question is: why would anyone need it? Computing q-grams, maybe?

Split string into pieces of fixed length and computing q-grams

There's also this:

```
stringi::stri_split_boundaries(c("ab", "def", "g"), type="character")
[[1]]
[1] "a" "b"

[[2]]
[1] "d" "e" "f"

[[3]]
[1] "g"
```

which extracts [grapheme clusters](https://unicode-org.github.io/icu/userguide/boundaryanalysis/).

Split string into pieces of fixed length and computing q-grams

I might implement both here (the overlapping and non-overlapping splits), but not today :)
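For illustration only, both kinds of split can be emulated with `stri_sub()`. The helper names `split_fixed` and `qgrams` below are hypothetical and not part of stringi; the sketch assumes a single non-missing input string and counts code points, not grapheme clusters.

```r
library(stringi)

# Hypothetical helper: non-overlapping pieces of at most `q` code points.
split_fixed <- function(s, q) {
    n <- stri_length(s)
    if (n == 0L) return(character(0))
    from <- seq(1L, n, by = q)                 # start of each piece
    stri_sub(s, from, pmin(from + q - 1L, n))  # last piece may be shorter
}

# Hypothetical helper: overlapping q-grams (window of `q` code points, step 1).
qgrams <- function(s, q) {
    n <- stri_length(s)
    if (n < q) return(character(0))
    from <- seq_len(n - q + 1L)
    stri_sub(s, from, length = q)
}

split_fixed("abcdefg", 3)  # "abc" "def" "g"
qgrams("abcd", 2)          # "ab" "bc" "cd"
```

Both helpers rely on `stri_sub()` being vectorised over its `from`, `to` and `length` arguments, so no explicit loop is needed.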

Split string into pieces of fixed length and computing q-grams

Yep, good point. Plus, I guess it'd be nice to have an option for handling chunks of different lengths (e.g., first 2 code points, then 3, then 1, etc.)
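A rough sketch of what such variable-length chunking could look like, again built on `stri_sub()`. The function name `split_lengths` and its `lengths` argument are assumptions made for this example, not an existing or planned stringi interface.

```r
library(stringi)

# Hypothetical helper: cut `s` into consecutive chunks whose sizes
# (in code points) are given by `lengths`.
split_lengths <- function(s, lengths) {
    to <- cumsum(lengths)       # end position of each chunk
    from <- to - lengths + 1L   # start position of each chunk
    stri_sub(s, from, to)
}

split_lengths("abcdef", c(2, 3, 1))  # "ab" "cde" "f"
```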

match, pmatch

Current version of `stri_in_fixed` (with `boost::unordered_map`):

```
Unit: microseconds
                expr    min      lq   median       uq     max neval
   match(x100, x100) 10.080 14.2310  37.0440  45.2725 121.462   100
 match(x1000, x1000) 67.136 77.4465 106.8275 118.6175...
```

match, pmatch

R's `match()` calls `do_match5`, which uses R's internal string hash table directly, so I doubt we can get any faster than that. Should `stri_in_fixed` then be implemented as `match(stri_enc_toutf8(x),...

match, pmatch

to be done: `stri_is_coll` + `pmatch` + `%in%`?

match, pmatch

also for sorted haystacks (binary search/...)
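A minimal sketch of the binary-search idea, assuming the haystack was sorted with `stri_sort()` under the same collator that `stri_cmp()` uses by default. The name `match_sorted` is hypothetical; this is not how stringi implements anything, only an illustration.

```r
library(stringi)

# Hypothetical helper: index of `needle` in a haystack sorted with
# stri_sort(); returns NA_integer_ when the needle is absent.
match_sorted <- function(needle, haystack) {
    lo <- 1L
    hi <- length(haystack)
    while (lo <= hi) {
        mid <- (lo + hi) %/% 2L
        cmp <- stri_cmp(needle, haystack[mid])  # -1, 0 or 1
        if (cmp == 0L) return(mid)
        if (cmp < 0L) hi <- mid - 1L else lo <- mid + 1L
    }
    NA_integer_
}

haystack <- stri_sort(c("pear", "apple", "fig", "kiwi"))
match_sorted("kiwi", haystack)  # 3
match_sorted("plum", haystack)  # NA
```

Unlike `match()`, this returns the index of some matching element, not necessarily the first one, when the haystack contains duplicates.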

`stri_datetime_parse()` add base date/time argument

This is intended behaviour. I now mention it explicitly in the manual. The assumption is that if you wish to get any other defaults (say, midnight in the current...

`stri_datetime_parse()` add base date/time argument

Actually, now I see I have a bug there, as the 'base' time is not reset on every parsing activity. Thanks.

```r
> stri_datetime_parse(c('1970-01-01', '12'), c('yyyy-MM-dd', 'HH'))
[1] "1970-01-01 09:30:13...
```