hermes icon indicating copy to clipboard operation
hermes copied to clipboard

Add Unicode property escape support to RegExp

Open jonathanj opened this issue 4 months ago • 1 comments

Summary

Extends RegExp to handle \p and \P Unicode property escapes, standalone and within character classes, for Unicode mode. The implementation extends the existing idea of tables of codepoint ranges to cover all of the properties explicitly required by ES262.

The genUnicodeTable.py script was updated to generate code related to the binary and non-binary Unicode properties explicitly mentioned in ES262 (ref); based on the Unicode 15.1.0 data files.

Closes #1027

Test Plan

  • Added regexp_unicode_properties.js to the Hermes test suite
    • Binary properties and the General_Category properties
    • Non-binary properties (Script=Latin, Script_Extensions=Thai)
    • Inverted character class escapes \p{…} and \P{…}
    • All of the above atom escapes in and out of character classes
    • Exercise the parser for the above forms, and incomplete forms
  • test262 suite via hermes/utils/testsuite/run_testsuite.py
    • I had to use --test-skiplist otherwise around 417 tests (in test262/test/built-ins/RegExp/property-escapes) are skipped with the reason "Skipping test with 'const'". This was confusing to me since most of these do pass, but I couldn't find an explainer for this skip.
    • There are some failures here, at least one of which is caused by test262 not being updated for Unicode 15.1.0, which I see as test suite issue rather than an implementation issue.

jonathanj avatar Feb 03 '24 13:02 jonathanj

(I'm including this outside of the PR description, in case that's used for automation purposes.)

Looking for Feedback

  • I have modified genUnicodeTable.py to fetch from HTTP URLs instead of FTP, because it is significantly faster and that starts to matter now that there are about a dozen fetches, just checking if there's a reason these were specifically FTP before?
  • I'm sure there is low-hanging performance fruit (I'm not a C++ dev), I mostly extended what already existed to achieve a working result, but am very keen to hear any ideas.
  • UnicodeData.inc is now several thousand lines longer than it was previously, because of all the category tables, but still far smaller than including ICU. I'm wondering if this comes as a shock, or mostly in line with what was expected to support Unicode properties?
  • The BuildingAndRunning.md docs suggest to use hermes/utils/format.sh to format all code, but that fails for me expecting clang-format version 12, which isn't in macOS Homebrew any more. I configured VS Code to use clang-format 17 (from Homebrew), but I'm not sure if the version difference will pose an issue?
  • The contribution guidelines suggest "squashing your commits", which I took to mean "squash your PR to a single commit", but I realise now that maybe it means "squash commits for related concepts". I'm happy to rebase and break this into a few commits for each piece of work, if it'll make things easier.

jonathanj avatar Feb 03 '24 13:02 jonathanj

Hi! Thank you for this contribution, it must have been a lot of work.

We would like to land this in Hermes, but since this is a lot of code, I hope you can bear through our review process, which can be very detail oriented.

At first glance I see some things that likely need to be tweaked, like the static constructors of std::unordered_map and std::string. Another concern is that at first glance it looks like this change adds 300KB of binary size, which is quite significant - a "lean" version of Hermes compiled for Arm32 is about 800KB, so this would be close to a 40% increase. (BTW, I suspect part of the increase is due to the static constructors, so it will likely go down.)

In any case, this is required ECMAScript functionality, so we are very willing to work with you to land this, and I hope you are willing to work with us.

The contribution guidelines suggest "squashing your commits", which I took to mean "squash your PR to a single commit", but I realise now that maybe it means "squash commits for related concepts". I'm happy to rebase and break this into a few commits for each piece of work, if it'll make things easier.

As a first step, I think it would be definitely very helpful if you break this into commits, to make it easier to review and iterate on each.

The contribution guideline to squash everything is generally well intentioned, because it assumes most contributions will be smaller and because Github does not provide the best UI for reviewing a stack of commits. But a sizable contribution like this really needs independent review of its parts.

tmikov avatar Feb 05 '24 23:02 tmikov

We would like to land this in Hermes, but since this is a lot of code, I hope you can bear through our review process, which can be very detail oriented.

In any case, this is required ECMAScript functionality, so we are very willing to work with you to land this, and I hope you are willing to work with us.

I'm more than happy to go through your review process, and look forward to it!

At first glance I see some things that likely need to be tweaked, like the static constructors of std::unordered_map and std::string. Another concern is that at first glance it looks like this change adds 300KB of binary size

This is exactly the kind of feedback I was hoping for, thank you. I knew writing this that this wouldn't necessarily be the best approach, but it was fairly easy to understand and iterate on. I'd be glad to see this evolve into a better solution.

As a first step, I think it would be definitely very helpful if you break this into commits, to make it easier to review and iterate on each.

Of course, I'll get to that as soon as I can.

jonathanj avatar Feb 06 '24 07:02 jonathanj

@tmikov I've restructured the original giant commit into several more digestible ones, I hope it helps make the review easier.

jonathanj avatar Feb 06 '24 19:02 jonathanj

Thanks for splitting this up, I think the first thing to do here is to try and reduce the size of the additional data so that we get a more realistic picture of the final size. The idea is to eliminate the need to emit static constructors to populate the data structures you're using, particularly unordered_map, vector, and string. Ideally, all of the data should be represented efficiently as constexpr data structures, optionally with some small runtime code to process them if necessary.

To pack all of the data into constexpr data structures, we can make the following transformations:

  1. Replace any std::vector with a separate static constexpr array, since we know their size and contents statically. Then, wherever they are needed, we can use an llvh::ArrayRef to refer to them. (we may need to pull in that llvh header if it isn't already available)
  2. Replace any std::string with a std::string_view. Since these strings are all pointing to constant data, a constexpr string_view referencing the constant string will be much more efficient than dynamically constructing the string.
  3. Convert any unordered_map into a sorted array of key,value pairs. This allows the map to be encoded efficiently, and we can perform binary search to find an element.

I've demonstrated these transformations on a small snippet in https://godbolt.org/z/Mvr7qKxnT, showing how the current data structures can be changed into this format. Notice that the generated output is dramatically smaller, while encoding the same information.

As you make these changes, you can measure the effect they have on the final size. You can configure and build Hermes as follows:

cmake -S hermes -B build_opt -G Ninja -DCMAKE_BUILD_TYPE=MinSizeRel -DHERMES_ENABLE_DEBUGGER=OFF -DHERMES_IS_MOBILE_BUILD=ON
cd build_opt
ninja libhermes

To give you a rough sense of the numbers, the generated binary (on my M1 Mac) goes from 2.4MB before this PR to 2.6MB after it. Hopefully with these changes, the size gap will shrink considerably.

neildhar avatar Feb 13 '24 23:02 neildhar

Thanks for the detailed feedback @neildhar, the godbolt output for those two snippets was really interesting! I've made the changes to genUnicodeTable.py and the utility functions to use the proposed data structures.

I ran optimized build you suggested for main, this branch after the changes, and this branch before the changes.

-rwxr-xr-x@ 1 jonathan  staff  1728920 Feb 15 18:30 hermesc    # main
-rwxr-xr-x@ 1 jonathan  staff  1957320 Feb 15 19:48 hermesc    # this branch (after changes)
-rwxr-xr-x@ 1 jonathan  staff  1978584 Feb 15 17:38 hermesc    # this branch (before changes)

The difference between main and this branch after the changes is 249664 bytes, which is about the difference you mentioned in your own builds. The difference between this branch before and after the changes is 21264 bytes, which is smaller than I expected (although I guess ~20KB of assembly just to initialize static data is a lot). Is this about the size difference you expected?

I ran some rough calculations for the size of the generated Unicode data:

# Rough number of 32-bit ints
$ cat hermes/lib/Platform/Unicode/UnicodeData.inc | grep -o '0x' | wc -l
   36742
# Bytes of data
$ echo '36742 * 8' | bc 
293936

Seems like the ~200KB increase in binary size roughly matches up with the amount of raw data that has been added.

As far as I can tell the code now matches your suggested structures, but I'd love to find out that I overlooked something obvious. What do you think the next steps are, trying to improve how the data is stored?

jonathanj avatar Feb 15 '24 18:02 jonathanj

Thanks for making these changes so quickly, and for being so receptive iterating on this! I agree that the improvement is somewhat smaller than I would have expected. The next step is probably to try and cut down dynamic relocations, it looks like they add tens of kilobytes to the size, and will also slow down startup.

This will be a slightly more involved transformation, since we're trying to have as few pointers as possible in the constant data. Right now, every string and array pointer in the data needs a dynamic relocation to set the actual pointer at runtime, based on where the library is loaded into memory.

The basic idea is to turn the array and string pointers into offsets instead of pointers. We need to merge all of the arrays of a given type into a single giant pool (same for strings). Then in each place where we currently have a pointer (like each string_view and ArrayRef), we would instead keep an offset into that large array, and a length. Depending on how large the arrays and elements are, we can use either 16 bit or 32 bit values.

This adds some marginal cost when accessing the data, but it represents the data much more efficiently (since the offsets will be smaller than pointers), and avoids the startup and size overhead of the dynamic relocations.

I repurposed the same example to demonstrate how this would look: https://godbolt.org/z/novqf4PdK

Note that the size win here will be less obvious in godbolt, but we should see an improvement in the final binary. I expect this to be on the same order as the code size reduction you observed from the previous change.

In the case of strings, it will also be worth doing some basic deduplication, since many of the strings are repeated.

This should get us fairly close to the bare minimum needed to encode this data, which will guide how to include this in the build, and then we can proceed with reviewing the actual logic.

neildhar avatar Feb 15 '24 21:02 neildhar

Thank you again for the excellent feedback! Near the beginning of this change, having a single large pool did occur to me but I didn't know about dynamic reallocations, so I decided to avoid the complexity and would probably not have had the 3-stage lookup anyway.

Sorry about the longer turnaround on this iteration, between a busy day job and wrapping my head around how to restructure the Python logic, it just took me longer to get to. I ran the optimized build again and this is the new size:

-rwxr-xr-x@ 1 jonathan  staff  1880152 Feb 24 01:26 hermesc

Looks like a 75KB improvement, I ended up going with 16-bit values since none of the offset+size values were larger than 65535:

string_offset_bits: 12
string_size_bits: 5
range_pool_offset_bits: 15
range_pool_size_bits: 10
range_array_pool_offset_bits: 9
range_array_pool_size_bits: 3

Interesting (but maybe not useful) things:

  • clang-format is no longer reformatting the ranges into columns of 3, so the file is now ±20K LOC. If I avoid inserting comments into the range pools, it goes back to creating columns of 3 again and the file is down to ±6K LOC.
  • The original UNICODE_LETTERS, etc. tables (used in the lexer and other places) do duplicate around 500 ranges, they seem to contribute ~200 bytes to the binary size

jonathanj avatar Feb 24 '24 00:02 jonathanj

@neildhar Are there any other changes you'd like me to make, prior to moving on to reviewing the logic?

jonathanj avatar Mar 05 '24 15:03 jonathanj

@jonathanj Sorry for the delay, nope, I'll take a look at the actual logic next.

Given the size regression, we'll also want to retain the option to disable this. There are two steps to doing this (which you can do in parallel while I review the rest):

  1. Add a new CMake option HERMES_ENABLE_REGEXP_UNICODE_CHARACTER_CLASS that we can use to turn the feature off.
  2. Dump whether this is available under Features: when you run hermes -version. Check for this in the test262 runner and conditionally exclude this feature when it is not available.

neildhar avatar Mar 08 '24 23:03 neildhar

Thanks @neildhar!

  1. Add a new CMake option HERMES_ENABLE_REGEXP_UNICODE_CHARACTER_CLASS that we can use to turn the feature off.

I ended up naming the option HERMES_ENABLE_UNICODE_REGEXP_PROPERTY_ESCAPES, I thought that aligned better with ES262's language and covers both atom and class escapes. Happy to change it if you feel I made a mistake. I think I defined the option with a default of ON, I'm not sure if that was the right default.

Do I need to specify this option in the gradle, podspec, Circle CI config, and other build files, like is done for HERMES_ENABLE_INTL or HERMES_ENABLE_DEBUGGER?

  1. Dump whether this is available under Features: when you run hermes -version. Check for this in the test262 runner and conditionally exclude this feature when it is not available.

Done. There are still quite a few regexp-unicode-property-escapes test262 tests skipped because of the const issue mentioned in the PR description.

jonathanj avatar Mar 13 '24 13:03 jonathanj

@jonathanj I think we will keep this option always enabled for OSS releases (we are much more sensitive to binary size internally). If we set it on to default, there is no need to worry about gradle, podspec, etc. About CircleCI: I think we should also keep it on (though I am curious what @neildhar thinks).

tmikov avatar Mar 14 '24 02:03 tmikov

I agree, it seems simplest to turn this on by default, and not override it in any of our build configs. We can specify different defaults when building internally.

neildhar avatar Mar 14 '24 17:03 neildhar

I'm working through the generation script now, and have some suggestions that would make it much easier to review:

  1. Could we update the unicode version in a separate commit from all of the changes to the script? This will be a useful sanity check that the changes to the script do not change the data emitted for existing fields like UNICODE_LETTERS.
  2. It would be very helpful to have high level comments in the script about the transformations that some of these functions do. Specifically, the format of the incoming data, and the structure that it produces.
  3. The new output should use the same 3-column style for the data as the old output does.

neildhar avatar Mar 29 '24 19:03 neildhar

  1. Could we update the unicode version in a separate commit from all of the changes to the script? This will be a useful sanity check that the changes to the script do not change the data emitted for existing fields like UNICODE_LETTERS.

Good idea! I've split out the commits and used Unicode 14.0.0 (which was the version at the time) for the first ones.

  1. It would be very helpful to have high level comments in the script about the transformations that some of these functions do. Specifically, the format of the incoming data, and the structure that it produces.

Sorry about this, I think in wrapping my head around the transformations, I forgot to actually document them along the way. I've added docstrings to most functions now, and fleshed out the documentation on the UnicodeProperties class to try give some context, as well as added some contextual comments where it felt like code seemed to "just do something" without futher insight. I hope these help, I wasn't sure exactly which transformations you had in mind with your comment.

  1. The new output should use the same 3-column style for the data as the old output does.

Yes, this bothered me a lot. I left it alone because clang-format was fighting me on this, but of course I can manually format these into a 3-column style, and it's so much better for it.

Thank you for the suggestions, I'll make my way through the C++ review comments as soon as I can.

jonathanj avatar Mar 31 '24 08:03 jonathanj

clang-format was fighting me on this

Note that you can use the directive // clang-format off to disable clang-format for a section of code. We do this in many places in the code, for things where we want to control the format.

neildhar avatar Apr 01 '24 15:04 neildhar

@neildhar Just checking in if there's anything I still need to address, or can help with for this review.

jonathanj avatar Apr 23 '24 22:04 jonathanj

@jonathanj Thanks for the ping, there isn't anything outstanding, let me import it so we can more easily do a quick review of the python changes

neildhar avatar Apr 23 '24 22:04 neildhar

@neildhar has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Apr 23 '24 22:04 facebook-github-bot

@jonathanj has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot avatar May 04 '24 17:05 facebook-github-bot

Everything looks good, apart from the minor comments, the only real ask is to add a file-level doc-comment in genUnicodeTable.py explaining what the steps are in producing the final output. Once that's done, this should be good to merge.

I added instructions to the end of the existing file-level comment, that explain invoking the script from the project working root, use with clang-format, and the redirect path. I think I sort of just figured this out, so I actually have no idea if this was the intended process! Let me know if I need to adjust it.

Thank you for your patience with this process!

Thank you for your patience with my submission, I'm very appreciative of the insightful comments and explanations along the way, I learned some interesting new things! It feels good to contribute to an exciting project like Hermes, with supportive maintainers.

jonathanj avatar May 04 '24 17:05 jonathanj

@neildhar has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar May 09 '24 21:05 facebook-github-bot