Updating unicode versions
Unicode is not a static standard, so the tables need to be updated from time to time. However, for compatibility, you sometimes need to use a specific version of Unicode for a given function.
One thing I've been thinking about with Zig is if we could have a function that parses the official unicode tables (https://www.unicode.org/Public/11.0.0/ucd/). Most people would use it at compile time, but if it could be used at run-time too then we have a very powerful tool!
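As a rough illustration of what parsing those files involves: `UnicodeData.txt` is a plain-text file with one code point per line and semicolon-separated fields, where the first field is the code point in hex, the second its name, and the third its general category. The sketch below is not code from zunicode; `Record` and `parseLine` are hypothetical names, and it assumes a Zig release that provides `std.mem.splitScalar` (0.11 or later).

```zig
const std = @import("std");

/// One record from UnicodeData.txt (hypothetical type; only the first
/// three fields are kept in this sketch).
const Record = struct {
    code_point: u21,
    name: []const u8,
    general_category: []const u8,
};

/// Parses a single UnicodeData.txt line. The returned slices point
/// into `line`, so no allocation is needed.
fn parseLine(line: []const u8) !Record {
    var fields = std.mem.splitScalar(u8, line, ';');
    const cp_text = fields.next() orelse return error.MissingField;
    const name = fields.next() orelse return error.MissingField;
    const category = fields.next() orelse return error.MissingField;
    return Record{
        // Code points are written in hexadecimal without a prefix.
        .code_point = try std.fmt.parseInt(u21, cp_text, 16),
        .name = name,
        .general_category = category,
    };
}

test "parse one UnicodeData.txt line" {
    const rec = try parseLine("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;");
    try std.testing.expectEqual(@as(u21, 0x41), rec.code_point);
    try std.testing.expectEqualStrings("LATIN CAPITAL LETTER A", rec.name);
    try std.testing.expectEqualStrings("Lu", rec.general_category);
}
```

Because the function only slices its input, the same code could in principle be fed a run-time file or a string obtained with `@embedFile`; whether comptime evaluation of a full table is practical is a separate question that this sketch does not test.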
I've been thinking about writing such a tool/library myself, but I just found your repository and thought it would be a bad idea to duplicate work/have competing libraries.
Just saw your post at https://www.reddit.com/r/Zig/comments/9u2qnu/github_gernestzunicode_port_of_go_standard/e9mqbtf which indicates this might already be on your mind?
@daurnimator as I said in that reddit comment ^^, we dynamically generate the tables. This is also noted in the tables.zig file: https://github.com/gernest/zunicode/blob/c0764bcdc9e79f186263e2c6a9ed592033259956/src/tables.zig#L5
We can easily update to new versions of Unicode at any time. The only reason the script is not here is that it is written in Go, so I didn't want to pollute the repo, since I was hoping to port it to Zig one day.
For now we are using Unicode 10.0.0. I am still working on stabilizing the API and making sure all the tests are accounted for. Upgrading to 11.0.0 will be a matter of running the script, but it is not important to me for now.
Do you have special need for 11.0.0? I can upgrade for you.
This is the tool/script that does what you described: https://github.com/gernest/matrix/blob/master/script/make_tables.go . This library uses it to generate the tables.zig file.
> Do you have special need for 11.0.0? I can upgrade for you.
Actually I need Unicode 3.2.0 to implement XMPP's nodeprep function.
so the unicode versions are not backward compatible?
I can generate the tables.zig file for 3.2.0 (not now though, I'm on mobile) and you can replace it; the rest of the lib will just work. However, I have no plan to support older versions or more than one version, so I will be upgrading to the latest Unicode versions from time to time.
> so the unicode versions are not backward compatible?
Correct. Every upgrade will have the potential to break libraries/applications.
> I can generate the tables.zig file for 3.2.0 (not now though, I'm on mobile) and you can replace it; the rest of the lib will just work.
No hurry, I have weeks until I actually need it.
> I have no plan to support older versions or more than one version, so I will be upgrading to the latest Unicode versions from time to time.
If you implement the parsing in zig then we get support for all versions! I'm happy to wait for you, or possibly do this work myself.
> If you implement the parsing in zig then we get support for all versions! I'm happy to wait for you, or possibly do this work myself.

It will take time before I port the script (it is very low on my priority list), so I would appreciate it if you took a stab at it. I will always be around if there is anything you need to know to help with porting.
FWIW I started playing around with it at https://github.com/daurnimator/zig-unicode but ran into some issues. I asked in the zig irc channel and got this reply from @andrewrk:
> < andrewrk> | this use case of using `@embedFile` and parsing the stuff at comptime, is a good use case of zig. but I think zig is too immature to handle it right now. it'll be worth trying this again when self hosted is done
So I guess I'll put this project on hold for a while.
@daurnimator Maybe I misunderstood your concerns. Is there something else that this lib is lacking or not doing right? You will still need to generate the tables/symbols, and doing it at runtime is just not cool (expensive etc.).
I mean, I get it when you said you wanted to use older versions of Unicode, which I believe is possible (just generate the tables.zig with the old Unicode version).
I kinda worked hard on this, so any feedback that will help me improve it is highly appreciated. That way I can see if we can add/resolve the issue and I can feel much better about myself (yeah, I just don't wanna be the guy who builds stuff that no one ever uses).
> Maybe I misunderstood your concerns. Is there something else that this lib is lacking or not doing right? You will still need to generate the tables/symbols, and doing it at runtime is just not cool (expensive etc.).
I mainly wanted a place to play with writing my own table code in zig. I attempted to do it with zunicode but it got in the way more than it helped, so I started fresh.
> I kinda worked hard on this, so any feedback that will help me improve it is highly appreciated.
A few misc things:
- the codebase has lots of inconsistencies between using `u32` and `i32`, they should really be `u21`
- https://github.com/gernest/zunicode/blob/c0764bcdc9e79f186263e2c6a9ed592033259956/src/tables.zig#L57 could just be a call to `std.meta.tagName`
- I don't understand your split between Range16 and Range32
- Unicode attributes are missing e.g. `NumericValue`
Thanks for the feedback
> I mainly wanted a place to play with writing my own table code in zig. I attempted to do it with zunicode but it got in the way more than it helped, so I started fresh.

I see. Table generation is completely handled by Go, as I think I already said. The limitation isn't in zunicode but in Zig; otherwise I would have ported it to Zig already.
> the codebase has lots of inconsistencies between using `u32` and `i32`, they should really be `u21`

Remember that this is a direct port of the Go unicode std lib. I'm not a domain expert in Unicode, and I'm not a Zig expert either. In the reddit post you linked, I was calling for help to improve it. I really don't mind using `u21`, I just don't know how, so we can collaborate and I will do my best to help, as long as it works.
Note that I also had to port the test suite to ensure I was achieving correct behaviour.
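On the `u21` suggestion: the highest code point Unicode defines is U+10FFFF, which needs 21 bits, so Zig's arbitrary-width `u21` is the narrowest unsigned type that holds every code point (Zig's own `std.unicode` decoding functions return `u21`). A minimal sketch, where `max_code_point` and `isScalarValue` are hypothetical names and not part of zunicode:

```zig
const std = @import("std");

/// Highest code point defined by Unicode (U+10FFFF).
const max_code_point: u21 = 0x10FFFF;

/// Returns true when `cp` is a Unicode scalar value: within range
/// and not one of the UTF-16 surrogates U+D800..U+DFFF.
fn isScalarValue(cp: u21) bool {
    if (cp > max_code_point) return false;
    return cp < 0xD800 or cp > 0xDFFF;
}

test "u21 is just wide enough for every code point" {
    // 2^21 - 1 = 0x1FFFFF, slightly above U+10FFFF.
    try std.testing.expectEqual(@as(u21, 0x1FFFFF), std.math.maxInt(u21));
    try std.testing.expect(isScalarValue('A'));
    try std.testing.expect(!isScalarValue(0xD800));
    try std.testing.expect(!isScalarValue(0x110000));
}
```

Using an unsigned type also removes the negative values that `i32` allows, so the only invalid inputs left to reject are the surrogates and the small gap above U+10FFFF.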
> could just be a call to `std.meta.tagName`

`std.meta.tagName` returns `[]const u8`, but the parent fn wants to return a `*RangeTable` symbol. Using a switch seemed cleaner, because I would avoid multiple `std.mem.eql` calls to check which tag is which. Again, this is my first month of Zig; if you don't mind, can you show me a snippet where `std.meta.tagName` will fit better? I will update the table generator ASAP.
> I don't understand your split between Range16 and Range32
Me neither, I took it from Go, and it works. I will ditch it in a heartbeat if there is another way.
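For context on the split: Go's `unicode` package defines `Range16` and `Range32` as the same three fields (`Lo`, `Hi`, `Stride`) in 16-bit and 32-bit widths, and its documentation says the ranges are kept in two slices to save space, since most ranges lie in the Basic Multilingual Plane and fit in 16 bits. The sketch below mirrors that layout in Zig to show the size difference. The field names, the `inRange16` helper, and the sample table are illustrative assumptions, not zunicode's actual definitions; `extern struct` is used only so the layout, and therefore `@sizeOf`, is well defined.

```zig
const std = @import("std");

/// A run of code points lo, lo+stride, ..., hi that fits in 16 bits.
const Range16 = extern struct { lo: u16, hi: u16, stride: u16 };

/// The same shape for code points above U+FFFF.
const Range32 = extern struct { lo: u32, hi: u32, stride: u32 };

/// Linear membership test over ranges sorted by `lo`.
/// Assumes every stride is at least 1.
fn inRange16(ranges: []const Range16, cp: u16) bool {
    for (ranges) |r| {
        if (cp < r.lo) return false; // sorted, so no later range can match
        if (cp <= r.hi) return (cp - r.lo) % r.stride == 0;
    }
    return false;
}

/// Illustrative table: ASCII upper-case and lower-case letters.
const ascii_letters = [_]Range16{
    .{ .lo = 0x41, .hi = 0x5A, .stride = 1 },
    .{ .lo = 0x61, .hi = 0x7A, .stride = 1 },
};

test "16-bit ranges take half the space of 32-bit ranges" {
    try std.testing.expectEqual(@as(usize, 6), @sizeOf(Range16));
    try std.testing.expectEqual(@as(usize, 12), @sizeOf(Range32));
    try std.testing.expect(inRange16(&ascii_letters, 'a'));
    try std.testing.expect(!inRange16(&ascii_letters, '0'));
}
```

Whether the saving is worth keeping two code paths is a judgment call; a single table of `u21` ranges would be simpler at the cost of larger tables.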
> Unicode attributes are missing e.g. `NumericValue`

Maybe it's a naming thing? There is an isNumber fn for checking numerical values, and a test suite for it.
> `std.meta.tagName` returns `[]const u8` but the parent fn wants to return a `*RangeTable` symbol. Using a switch seemed cleaner, because I would avoid multiple `std.mem.eql` calls to check which tag is which. Again, this is my first month of Zig; if you don't mind, can you show me a snippet where `std.meta.tagName` will fit better? I will update the table generator ASAP.
I think this will work?
@field(RangeTable, std.meta.tagName(x))
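One caveat with that idea is that `@field` needs its name argument to be known at compile time, so it cannot take the tag name of a run-time enum value directly. A possible way around this, sketched under assumptions, is an `inline else` switch prong, which makes the captured tag comptime-known in each generated branch. This assumes Zig 0.10 or later and uses the builtin `@tagName`, which covers what `std.meta.tagName` did as far as I know. `Script`, `tables`, and `tableFor` are hypothetical stand-ins, and each "table" here is just a single `u21` so the example stays short.

```zig
const std = @import("std");

/// Hypothetical enum naming the available tables.
const Script = enum { Latin, Greek };

/// Stand-in for the generated tables: one declaration per enum tag.
/// Each value is only an illustrative code point, not a real table.
const tables = struct {
    pub const Latin: u21 = 0x0041;
    pub const Greek: u21 = 0x0370;
};

/// Looks up the declaration whose name matches the tag of `script`.
fn tableFor(script: Script) u21 {
    return switch (script) {
        // `inline else` expands to one prong per tag, so `tag` is
        // comptime-known and its name can be passed to @field.
        inline else => |tag| @field(tables, @tagName(tag)),
    };
}

test "look up a table by enum tag name" {
    try std.testing.expectEqual(@as(u21, 0x0041), tableFor(.Latin));
    try std.testing.expectEqual(@as(u21, 0x0370), tableFor(.Greek));
}
```

With this shape, adding a new tag without a matching declaration becomes a compile error, so the generator would no longer need to emit a hand-written switch.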