Missing named properties
The official ruby documentation does not include a comprehensive list of all named properties supported by the language. Some examples:
/\p{Age=6.0}/
/\p{In Miscellaneous Mathematical Symbols-B}/
/\p{Transport and Map Symbols}/
/\p{Emoji}/ # <-- A valid unicode property name, but NOT(!!) supported by ruby
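A quick way to see which of these names a given interpreter accepts is to compile each one at runtime and catch the failure. This is only a sketch: `property_supported?` and `CANDIDATES` are hypothetical names, not part of this library, and the results will differ between ruby versions and implementations.

```ruby
# Candidate names taken from the examples above.
CANDIDATES = [
  'Age=6.0',
  'In Miscellaneous Mathematical Symbols-B',
  'Transport and Map Symbols',
  'Emoji'
].freeze

# Hypothetical helper: true if the running interpreter can compile \p{name}.
# Regexp.new raises RegexpError for an unknown property name; the same
# mistake in a regexp *literal* is reported as a SyntaxError at parse time.
def property_supported?(name)
  Regexp.new("\\p{#{name}}")
  true
rescue RegexpError
  false
end

CANDIDATES.each do |name|
  status = property_supported?(name) ? 'supported' : 'unsupported'
  puts format('%-45s %s', name, status)
end
```

Building the pattern with `Regexp.new` rather than a literal keeps the failure rescuable per name, so one unsupported property does not stop the whole file from loading.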
Thankfully, the onigmo docs do provide a full list (but not all of these are supported by the latest ruby!)
Possible paths to take:
- Refresh the db/*.pstore files with a more comprehensive list
- Has this problem been solved before? Research other libraries.
- Consider directly referencing RFCs or similar, rather than dynamically generating the lists? (Is this practical?)
Also worth noting:
- This library does not yet "officially" support jruby, because the test suite fails in relation to named properties. (The list supported by this implementation differs to MRI.) Maybe try wrapping the tests in a rescue SyntaxError... with caution!! (Define a list of known errors; don't just rescue blindly.)
- Arbitrary whitespace, underscores and hyphens can be included in unicode property names. This library does not yet account for this.
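Both points above could be sketched roughly as follows. Everything here is an assumption for illustration: `normalize_property_name`, `compile_property` and the `known_unsupported` list are hypothetical and not part of this library. It also assumes the pattern is built with `Regexp.new`, so the failure shows up as a `RegexpError` rather than the `SyntaxError` a literal would produce.

```ruby
# Hypothetical helper: reduce a property name to a lookup key by lower-casing
# it and dropping whitespace, underscores and hyphens, so that differently
# spelled variants of the same name compare equal.
def normalize_property_name(name)
  name.downcase.gsub(/[\s_\-]/, '')
end

# Hypothetical helper: compile \p{name}, tolerating failure only for names
# explicitly listed as known to be unsupported. Any other failure is
# re-raised, so unexpected errors are not silently swallowed.
def compile_property(name, known_unsupported: [])
  Regexp.new("\\p{#{name}}")
rescue RegexpError
  raise unless known_unsupported.include?(name)
  nil
end

puts normalize_property_name('In Miscellaneous Mathematical Symbols-B')
puts normalize_property_name('In_Miscellaneous_Mathematical_Symbols_B')
```

Keeping the allow-list explicit means a newly broken property fails loudly on every implementation until someone deliberately adds it.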
Notes to self, after a little more research --
Ruby already directly references the Unicode standard's data (it is a standard rather than an RFC). For example, for the \p{In Miscellaneous Mathematical Symbols-B} example above, we have: https://github.com/ruby/ruby/blob/3628eae2e754a7489feebc6f41371d42d2efcf3c/enc/unicode/11.0.0/name2ctype.h#L34478-L34482
Then, there is this tool in the ruby source code to parse and map property names according to the unicode docs: https://github.com/ruby/ruby/blob/7aaf5b2878210d4df03a84be8d514a553839a5ba/tool/enc-unicode.rb and a template for decomposing the properties: https://github.com/ruby/ruby/blob/4444025d16ae1a586eee6a0ac9bdd09e33833f3c/template/unicode_norm_gen.tmpl.
Lastly, note that ruby defines its unicode version here. And, due to this, we can access it at runtime via: RbConfig::CONFIG['UNICODE_VERSION'].
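A minimal sketch of the runtime lookup mentioned above. The wrapper method `unicode_version` is a hypothetical name, and the `nil` handling is an assumption for interpreters or older versions that may not set this key.

```ruby
require 'rbconfig'

# Hypothetical helper: the Unicode version this interpreter was built
# against, or nil if the build does not expose the key.
def unicode_version
  RbConfig::CONFIG['UNICODE_VERSION']
end

version = unicode_version
puts(version ? "Unicode #{version}" : 'UNICODE_VERSION is not set')
```

A value like this could be used to decide which generated property list applies to the running interpreter.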
So in conclusion, I believe it should be possible to access ruby's unicode mappings which are built at compilation; probably with a native C extension. Failing that, the above code should provide enough hints to reproduce the mapping generation as part of the gem installation process if necessary.
Either way, this solution would be far superior to the current db/ folder implementation.