posterior icon indicating copy to clipboard operation
posterior copied to clipboard

Improve speed of check_existing_variables when regex=TRUE

Open jgabry opened this issue 4 years ago • 7 comments

In #130 the speed when regex=FALSE was improved. When regex=TRUE @mjskay suggested:

You might be able to speed up the regex version by translating the input list of regexes into a single regex that matches if any one of them does.

jgabry avatar May 13 '21 15:05 jgabry

Relevant comment from @Ozan147: https://github.com/stan-dev/posterior/pull/130#issuecomment-840628346

@mjskay and @jgabry (#131), I tested the single regex pattern with | yesterday but it turns out there is a size limit. Even with perl = TRUE, it starts throwing errors after certain number of variables. One idea might be to divide that large pattern string into valid-size chunks to perform multiple matches.

jgabry avatar May 13 '21 15:05 jgabry

Moving my comment here...

Ah damn. Could also fall back to the for loop version in that case I suppose... Honestly I'd guess most use cases of the regex version have a small number of input regexes (else why use regex in the first place?)

mjskay avatar May 13 '21 15:05 mjskay

There is a potential size limitation that would cause an issue with large number of variables to match.

?regex

Long regular expression patterns may or may not be accepted: the POSIX standard only requires up to 256 bytes.

We could still speed it up by dividing any long strings into valid chunks and running multiple matches as mentioned in #130.

ozanstats avatar May 13 '21 15:05 ozanstats

Just to potentially uncomplicate our lives: is there a strong use case for 256 byte+ regex here, whether passed by the user or assembled by us internally from a list of regexes the user passes? My guess is regex = TRUE is mostly used for subsetting with a handful of expressions at most (e.g. on things like "b_.*"), not passing big lists of regexes. So we might be overcomplicating our lives to support a rare-to-nonexistent edge case...

mjskay avatar May 19 '21 03:05 mjskay

So we might be overcomplicating our lives to support a rare-to-nonexistent edge case...

Yeah I think you’re right. Not worrying about it seems fine and we can always revisit it in the future if an unexpected use case arises.

jgabry avatar May 19 '21 03:05 jgabry

I agree.

Jonah Gabry @.***> schrieb am Mi., 19. Mai 2021, 05:29:

So we might be overcomplicating our lives to support a rare-to-nonexistent edge case...

Yeah I think you’re right. Not worrying about it seems fine and we can always revisit it in the future if an unexpected use case arises.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/stan-dev/posterior/issues/131#issuecomment-843715977, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCW2ACZATM3PBHASBYAIQDTOMWATANCNFSM442XBW3Q .

paul-buerkner avatar May 19 '21 05:05 paul-buerkner

I tested the single regex pattern with | yesterday but it turns out there is a size limit. Even with perl = TRUE, it starts throwing errors after certain number of variables.

Given the consensus that we're okay with letting it fail on large numbers of input regexes, @Ozan147 do you have the code you used for testing your implementation handy?

mjskay avatar May 23 '21 22:05 mjskay