Experimental support for Unicode identifiers
Rebase of #1407. Copied from there:
I know for a fact that this requires a few changes in stan-dev/stan's JSON data handler to recognize Unicode names, which is just one of several reasons this is a draft.
The basic overview:
OCaml strings should be treated mostly like arrays of bytes, and ocamllex handles its input as bytes. We can define lexer rules that recognize UTF-8-compatible bytes, and then validate them after the fact against the Unicode Annex 31: Unicode Identifiers standard (UAX31).
For most of the compiler we then treat identifiers as plain bytes, which is fine, because we never do things like subslice variable names.
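To make the "lex bytes first, validate afterwards" idea concrete, here is a minimal sketch in Python rather than the OCaml used in this PR. The function name `validate_identifier` is hypothetical, and Python's `str.isidentifier` (which is built on the XID_Start/XID_Continue properties) is only a stand-in for the UAX31 check; it is an assumption that this approximates the rules stanc3 applies, and it also accepts a leading underscore.

```python
def validate_identifier(raw: bytes) -> str:
    """Decode bytes the lexer accepted and check them as an identifier.

    Hypothetical helper, not the stanc3 implementation.
    """
    try:
        # Strict decoding rejects malformed or truncated UTF-8 sequences.
        name = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        raise ValueError(f"identifier is not valid UTF-8: {err}") from None
    # Stand-in for the UAX31 check (XID_Start followed by XID_Continue).
    if not name.isidentifier():
        raise ValueError(f"not a valid identifier: {name!r}")
    return name
```

After this step the rest of the pipeline can carry the name around unchanged, since nothing downstream needs to look inside it.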
Finally, at output time, we already had string escaping (since #952), so most of the code-gen works fine. Recent C++ standards require that compilers support UTF-8 names based on the same UAX31 rules linked above, but older ones may not. For now it generates "universal character names" (`\uXXXX` and `\UXXXXXXXX` escapes), which seem to be the legacy form of this, so older compilers will hopefully be happy with it.
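As a rough illustration of the universal-character-name output, the sketch below rewrites every non-ASCII code point as a C++ `\uXXXX` or `\UXXXXXXXX` escape. It is written in Python for brevity, `to_ucn` is a hypothetical name, and the real code-gen in this PR is OCaml and may differ in detail.

```python
def to_ucn(name: str) -> str:
    """Rewrite non-ASCII code points as C++ universal character names.

    Hypothetical helper, not the stanc3 implementation.
    """
    out = []
    for ch in name:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)              # ASCII passes through unchanged
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")  # four hex digits
        else:
            out.append(f"\\U{cp:08X}")  # eight hex digits
    return "".join(out)
```

For example, `to_ucn("α_hat")` yields the text `\u03B1_hat`, which a C++ compiler reads as the same identifier even if it does not accept raw UTF-8 in source files.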
Submission Checklist
- [x] Run unit tests
- Documentation
- [ ] If a user-facing change was made, the documentation PR is here: <LINK>
Release notes
stanc3 can now accept a flag --allow-unicode which enables the use of non-ASCII characters in Stan files. All files are expected to be encoded in UTF-8. This is experimental and may not work with older C++ compilers.
Copyright and Licensing
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the BSD 3-clause license (https://opensource.org/licenses/BSD-3-Clause)
Codecov Report
Attention: Patch coverage is 74.39024% with 21 lines in your changes missing coverage. Please review.
Project coverage is 89.41%. Comparing base (3aa50a9) to head (68db0d8).
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/common/Unicode.ml | 55.81% | 19 Missing :warning: |
| src/frontend/Identifiers.ml | 91.66% | 2 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #1499 +/- ##
==========================================
- Coverage 89.52% 89.41% -0.12%
==========================================
Files 66 68 +2
Lines 9684 9757 +73
==========================================
+ Hits 8670 8724 +54
- Misses 1014 1033 +19
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/driver/Entry.ml | 93.75% <100.00%> (ø) | |
| src/driver/Flags.ml | 100.00% <ø> (ø) | |
| src/frontend/Errors.ml | 100.00% <100.00%> (ø) | |
| src/stan_math_backend/Cpp.ml | 88.91% <100.00%> (+0.11%) | :arrow_up: |
| src/stan_math_backend/Cpp_Json.ml | 100.00% <100.00%> (ø) | |
| src/stanc/CLI.ml | 98.11% <100.00%> (+0.01%) | :arrow_up: |
| src/frontend/Identifiers.ml | 91.66% <91.66%> (ø) | |
| src/common/Unicode.ml | 55.81% <55.81%> (ø) | |