rust-unic
Initial implementation for unic-ucd-unihan
This is a partial implementation of ucd-unihan #224.
Changed areas
- gen
- New subcrate ucd/unihan
Notes
One other thing to consider: since Unihan is a CJK-centric module in the Unicode standard, maybe we could make this crate an optional subcrate of the rust-unic super crate, so that users need to opt in explicitly to use it.
Failures are unrelated; I'll fix those as soon as I get a chance this weekend.
At a glance-over, this looks good; I'll do a more detailed pass this weekend.
Thanks for building this, @eyeplum!
I'm in the process of moving source data files out of the repository, to be imported as submodules. (The download+unzip scripts will be in Python, I guess.)
With the data hosted externally, we won't have to deal with downloading/unzipping ourselves, which would be much better for this repo. It would also make it easier to add other sources and models.
If you like, we can rebase and try to land this work, and drop the data retrieving parts later. Or, we can just wait for the external sources and drop the data source work from this PR. What do you think?
Sure, waiting for the external sources sounds like a better option 👍
Ping?
In #247, I have added the complete Unicode UCD data package, under the new address: /external/unicode/ucd/data/, which also contains a Unihan/ directory with all of its txt files. (See https://github.com/open-i18n/data-unicode-ucd/tree/master/data/Unihan)
So, there's no data package anymore, and you just need to write the gen rules for the tables, and implement the algorithm in a new component.
Also, as a reminder, since the new source data files are imported as submodules, you need to do git submodule update --init to get the files, before running /etc/generate-tables.sh.
Hey @behnam, thanks for the review!
I have updated the pull request to read Unihan data from ./external/unicode/ucd/data/Unihan/ and migrated related code changes to Rust 2018.
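For anyone following along who hasn't looked at the Unihan txt files: each data line is tab-separated (code point, field name, value), and lines starting with `#` are comments. Below is a minimal sketch of parsing one such line. It is not the code in this pull request; `UnihanRecord` and `parse_unihan_line` are hypothetical names used only for illustration, and the sample line is illustrative as well.

```rust
/// One record from a Unihan data file: code point, field name, value.
/// (Hypothetical type for illustration; not part of this PR.)
#[derive(Debug, PartialEq)]
struct UnihanRecord {
    codepoint: char,
    field: String,
    value: String,
}

/// Parses a single tab-separated line such as "U+3400\tkMandarin\tqiū".
/// Returns `None` for blank lines, `#` comments, and malformed input.
fn parse_unihan_line(line: &str) -> Option<UnihanRecord> {
    let line = line.trim_end();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }

    // Split into at most three parts so the value may itself contain tabs.
    let mut parts = line.splitn(3, '\t');
    let hex = parts.next()?.strip_prefix("U+")?;
    let field = parts.next()?;
    let value = parts.next()?;

    // Convert the hexadecimal code point into a Rust `char`.
    let codepoint = char::from_u32(u32::from_str_radix(hex, 16).ok()?)?;

    Some(UnihanRecord {
        codepoint,
        field: field.to_string(),
        value: value.to_string(),
    })
}

fn main() {
    // Illustrative input: one comment line and one data line.
    let sample = "# sample comment\nU+3400\tkMandarin\tqiū\n";
    for record in sample.lines().filter_map(parse_unihan_line) {
        println!("{:?}", record);
    }
}
```

A real gen rule would presumably read every file under the Unihan/ directory and group records by field before emitting tables; that part is left out here.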
Hey @behnam, I'm planning to merge this in this weekend. Although the functionality is very limited at the moment, I imagine this might be somewhat useful for users who need Unihan. I'm planning to map most of the Unihan tables later this year; hopefully I will have enough time to make it happen.
As of now, before I start mapping more Unihan contents, I'm thinking about tackling the Unicode 11.0 upgrade for rust-unic first. Mainly because Unicode 12.0 is coming and I think it would be nice for us to at least update to Unicode 11.0 so future updates will be more manageable. I may need some help planning the work as there seems to be a lot involved. I will probably create a new issue so we can have more detailed discussions there.
What do you think?
Hey @behnam and @CAD97, are you happy if I merge this in?
Is there any news?
Is this still being worked on?
@asg0451, I'm not planning to merge this anytime soon. If you are interested in Unihan, you could try it out in my fork: https://github.com/eyeplum/rust-unic