Create NON_PRINTABLE regex at runtime
This moves Reader.NON_PRINTABLE to a property, instead of a class variable.
We're testing the 5.1 release on Python 2.7 with UCS4 support, and we saw an increase of 30M due to this single regular expression. As we're using the C extension, this code path is not even triggered, so making it evaluated at runtime works around the issue.
I'm curious how the C extension does the job and if there isn't a difference in behavior.
This isn't an issue with Python3.
I submitted a similar PR last year, where I replaced the regex with proper unicode character class queries: #124
That would work for me too. I'm still concerned with the behavior of the C code, but that removes the problematic regex.
AFAICT the C module relies on the libyaml implementation, which does not use regular expressions but a simple range check
@nitzmahone Any chance we can get this merged? Or #124 for that matter. It impacts both the resource usage and load time.
We're planning on poking at stuff for the next release pretty soon- I need to go back and untangle why the current split impl was done the way it was before deciding what makes the most sense here (vs #124). Ideally we shouldn't need a regex at all, because clearly there's a significant expense to it, but I'd need to measure to see which is worse.
@nitzmahone Any chance this can be merged? The regex changed so it's conflicting now :(
@nitzmahone Are there any blockers to merge this PR (once the merge conflict is solved)? We would benefit from it, having to manually patch pyyaml is not ideal