regexp-examples

Missing named properties

Open tom-lord opened this issue 8 years ago • 1 comments

The official Ruby documentation does not include a comprehensive list of all named properties supported by the language. Some examples of such properties:

/\p{Age=6.0}/
/\p{In Miscellaneous Mathematical Symbols-B}/
/\p{Transport and Map Symbols}/
/\p{Emoji}/ # <-- A valid unicode property name, but NOT(!!) supported by ruby

Thankfully, the Onigmo docs do provide a full list (although not all of these are supported by the latest Ruby!).
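One way to find out which names from such a list the running Ruby actually accepts is to probe them at runtime. The following is only a minimal sketch: `property_supported?` and the `candidates` list are hypothetical names for illustration, not part of this library. It relies on `Regexp.new` raising `RegexpError` for an unknown property name.

```ruby
# Returns true if the running Ruby's regexp engine accepts \p{name}.
# Regexp.new raises RegexpError for unknown property names, so the
# check happens at runtime rather than at parse time.
def property_supported?(name)
  Regexp.new("\\p{#{name}}")
  true
rescue RegexpError
  false
end

# Hypothetical candidate list, e.g. names copied from the Onigmo docs.
candidates = ['Alpha', 'Greek', 'Age=6.0', 'Emoji', 'Definitely Not A Property']

# Split the candidates by what this particular Ruby build accepts.
supported, unsupported = candidates.partition { |name| property_supported?(name) }

puts "supported:   #{supported.inspect}"
puts "unsupported: #{unsupported.inspect}"
```

The result depends on the Ruby implementation and version, which is exactly why a static list is hard to keep accurate.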

Possible paths to take:

  • Refresh the db/*.pstore files with a more comprehensive list
  • Has this problem been solved before? Research other libraries.
  • Consider directly referencing RFCs or similar, rather than dynamically generating the lists? (Is this practical?)

Also worth noting:

  • This library does not yet "officially" support JRuby, because the test suite fails in relation to named properties. (The list supported by that implementation differs from MRI's.) Maybe try wrapping the tests in a rescue SyntaxError... with caution!! (Define a list of known errors; don't just rescue blindly.)
  • Arbitrary whitespace, underscores and hyphens can be included in unicode property names. This library does not yet account for this.
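Both notes above could be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: `normalize_property_name` and `compile_property` are hypothetical helpers, the normalization rule (ignore case, spaces, underscores and hyphens) is an assumption about how the engine compares names, and the sketch uses `Regexp.new` (which raises `RegexpError`) instead of a regexp literal (which fails with `SyntaxError` at parse time).

```ruby
# Assumed normalization: compare names case-insensitively, ignoring
# spaces, underscores and hyphens.
def normalize_property_name(name)
  name.downcase.gsub(/[ _-]/, '')
end

# Compiles \p{name}. If compilation fails, the error is swallowed only when
# the normalized name appears in the caller-supplied list of known failures;
# any other failure is re-raised, so nothing is rescued blindly.
def compile_property(name, known_unsupported:)
  Regexp.new("\\p{#{name}}")
rescue RegexpError
  raise unless known_unsupported.include?(normalize_property_name(name))
  nil
end

# Hypothetical list of known failures for some implementation (normalized form).
known = ['somepropertymissinghere']

p normalize_property_name('In Miscellaneous Mathematical Symbols-B')
p compile_property('Alpha', known_unsupported: known)
p compile_property('Some Property_Missing-Here', known_unsupported: known)
```

Storing the known failures in normalized form means the list does not need one entry per spelling variant.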

tom-lord, Oct 19 '17 09:10

Notes to self, after a little more research --

Ruby already directly references the Unicode standard's data. For instance, for the \p{In Miscellaneous Mathematical Symbols-B} example above, we have: https://github.com/ruby/ruby/blob/3628eae2e754a7489feebc6f41371d42d2efcf3c/enc/unicode/11.0.0/name2ctype.h#L34478-L34482

Then there is this tool in the Ruby source code, which parses and maps property names according to the Unicode docs: https://github.com/ruby/ruby/blob/7aaf5b2878210d4df03a84be8d514a553839a5ba/tool/enc-unicode.rb

There is also a template for decomposing the properties: https://github.com/ruby/ruby/blob/4444025d16ae1a586eee6a0ac9bdd09e33833f3c/template/unicode_norm_gen.tmpl

Lastly, note that Ruby defines its Unicode version here, so we can access it at runtime via: RbConfig::CONFIG['UNICODE_VERSION'].
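As a rough illustration of reading that value at runtime, here is a small sketch. `unicode_age_property` is a hypothetical helper, and mapping the version to an `Age=major.minor` property name is an assumption made for the example; the config key may also be absent (nil) on older or alternative Ruby implementations.

```ruby
require 'rbconfig'

# Converts a version string such as "11.0.0" into an Age property name
# such as "Age=11.0". Returns nil when no version is available.
def unicode_age_property(version)
  return nil if version.nil?

  major, minor = version.split('.')
  "Age=#{major}.#{minor}"
end

# May be nil if this Ruby build does not expose the key.
unicode_version = RbConfig::CONFIG['UNICODE_VERSION']

puts "Unicode version: #{unicode_version.inspect}"
puts "Age property:    #{unicode_age_property(unicode_version).inspect}"
```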


So in conclusion, I believe it should be possible to access Ruby's Unicode mappings, which are built at compile time, probably via a native C extension. Failing that, the code linked above should provide enough hints to reproduce the mapping generation as part of the gem installation process if necessary.

Either way, this solution would be far superior to the current db/ folder implementation.

tom-lord, Jan 02 '19 09:01