unicodetools
Use ICU4X to run parts of util.unicode.org
Currently util.unicode.org runs on top of ICU4J. It works fine, but sometimes it is slow or hits rate limits that we've imposed to cap server costs, as it is doing as I write this message:
We should add ICU4X-backed tooling to parts of util.unicode.org via WebAssembly. This has the benefit of reducing latency (all calculations are client-side) and serving costs (the ICU4X wasm file can be cost-efficiently cached and served in a CDN).
The Unicode Tools are designed to run on the latest (even unreleased) version of the Unicode Standard, and so part of this project may involve improving some of the ICU4X tooling so that it can read raw UCD files. See https://github.com/unicode-org/icu4x/issues/4602
CC @josh-hadley @eggrobin
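For readers unfamiliar with the raw UCD files mentioned above: many of them (Scripts.txt, for example) consist of semicolon-separated fields, with a code point or a `first..last` range in the first field and `#` starting a comment. Below is a minimal sketch of a reader for that common layout only. It is not ICU4X's or the unicodetools' parser, the name `parse_ucd_lines` is made up for illustration, and it ignores conventions used by other files (such as the First/Last range pairs in UnicodeData.txt or `@missing` directives in comments).

```python
from typing import Iterable, Iterator, List, Tuple

# Two data lines in the Scripts.txt layout, preceded by a comment line.
SAMPLE = """\
# Illustrative excerpt in the Scripts.txt layout
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
"""


def parse_ucd_lines(lines: Iterable[str]) -> Iterator[Tuple[int, int, List[str]]]:
    """Yield (first, last, remaining_fields) for each data line."""
    for raw in lines:
        # Everything from '#' to the end of the line is a comment.
        data = raw.split("#", 1)[0].strip()
        if not data:
            continue  # blank line or comment-only line
        fields = [field.strip() for field in data.split(";")]
        code_points = fields[0]
        if ".." in code_points:
            first, last = code_points.split("..")
        else:
            first = last = code_points
        yield int(first, 16), int(last, 16), fields[1:]


rows = list(parse_ucd_lines(SAMPLE.splitlines()))
```

Each row carries an inclusive code point range and the remaining fields as strings; interpreting those fields depends on the specific file.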
There is much more to it than using draft data. We also expose properties that should never be part of APIs, things that are not properties, etc., and we use an extremely featureful version of UnicodeSet that should not be part of general-purpose libraries (in particular, this allows you to look at past versions of Unicode, or search property values using regular expressions).
In particular, the fact that the properties library is the same that we use to actually generate and test the standard matters when it comes to being confident that we know what we are publishing.
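To illustrate what « search property values using regular expressions » means, here is a rough sketch using Python's standard `unicodedata` module and the Name property. The function name `code_points_with_name_matching` is hypothetical, and this is not how the unicodetools implement it. Note that `unicodedata` only knows the single Unicode version bundled with the Python interpreter, which is exactly the limitation (no draft data, no past versions) discussed above.

```python
import re
import unicodedata


def code_points_with_name_matching(pattern, start=0, end=0x110000):
    """Return the code points in [start, end) whose Name matches `pattern`.

    Code points without a name are treated as having an empty name.
    """
    regex = re.compile(pattern)
    return [
        cp
        for cp in range(start, end)
        if regex.search(unicodedata.name(chr(cp), ""))
    ]


matches = code_points_with_name_matching(r"^LATIN CAPITAL LETTER [A-C]$", 0, 0x80)
```

This scans every code point in the range on each query, which is acceptable for a sketch but is the kind of linear scan that becomes expensive on a shared server.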
I suspect that the traffic we are seeing is some sort of crawling bot though:
All of these queries are « the characters that have some specific value of some property » (typically with one result), but without much rhyme or reason to what values and properties are queried. Queries of this form are linked from the character.jsp page, so I suspect this is something following the links there.
Here’s the current traffic from one specific (slightly odd) user agent:
Hmmm, maybe we could block that user, and throttle anyone with more than 1 query per 10 seconds?
On Sun, Jan 26, 2025, 06:30 Robin Leroy @.***> wrote:
Here’s the current traffic from one specific (slightly odd) user agent:
image.png (view on web) https://github.com/user-attachments/assets/b70a6309-d506-45f9-b321-959704609b7c
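The throttling idea suggested above (no more than 1 query per 10 seconds per client) could look roughly like the following. This is a sketch under assumptions, not the rate limiting that util.unicode.org actually uses: the class name `MinIntervalThrottle` is made up, and what identifies a client (user agent, IP address, or something else) is left to the caller.

```python
import time


class MinIntervalThrottle:
    """Allow at most one query per client every `min_interval` seconds."""

    def __init__(self, min_interval=10.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable so the behaviour can be tested
        self._last_allowed = {}  # client key -> time of last allowed query

    def allow(self, client_key):
        """Return True if the query may proceed, False if it is throttled."""
        now = self.clock()
        last = self._last_allowed.get(client_key)
        if last is not None and now - last < self.min_interval:
            return False  # too soon; the rejected query does not reset the timer
        self._last_allowed[client_key] = now
        return True


# Usage with a fake clock so the example does not need to sleep.
fake_now = [0.0]
throttle = MinIntervalThrottle(10.0, clock=lambda: fake_now[0])
first = throttle.allow("odd-user-agent")   # allowed
fake_now[0] = 5.0
second = throttle.allow("odd-user-agent")  # throttled, only 5 s later
fake_now[0] = 10.0
third = throttle.allow("odd-user-agent")   # allowed again after 10 s
```

The dictionary grows with the number of distinct clients, so a real deployment would need to expire old entries.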
@sffc is this ticket a dup? or have we just talked about it without an issue?
If node is Node.js, maybe someone is scraping.
This issue is to track migrating parts of util.unicode.org to ICU4X, which has been discussed in various forums, but I couldn't find a canonical issue.
Investigation of other server cost mitigation or rate limiting techniques could be discussed elsewhere. That does not, however, invalidate the motivation for making popular parts of the site run client-side.
Summarizing a discussion with @sffc in Zürich:
It would likely be useful to publish, somewhere on unicode.org, a limited ICU4X-based properties inspection and UnicodeSet query tool based on current published data, but this should be independent of the existing tools.
When it comes to « the whole UCD (including Unihan, Unikemet, etc.) and the kitchensink, for all past versions and for draft data » that the JSPs provide for UTC work, reimplementing that (in ICU4X or anywhere) would be a very difficult project. In addition, the benefits are not so clear; historical UCD data is measured in gigabytes, so sending it to the client to perform the query locally is not very practical.
In addition, for UTC work, we want the properties that are displayed by these tools to be based on the same implementation as the invariant tests and the data file generation. I have been slowly removing parts of the tools that were using properties from ICU rather than from implementations in the tools, see https://github.com/unicode-org/unicodetools/issues/502, https://github.com/unicode-org/unicodetools/pull/835, etc., as well as moving as much as possible to the modern (2011) properties implementation from the older (1996) one, see https://github.com/unicode-org/unicodetools/pull/488, etc. Adding another UCD parser and UnicodeSet parser to the mix would be unhelpful.
As for the rate limiting issues mentioned in the OP, they came from queries for niche properties that ICU4X probably shouldn’t support, and whose implementation in the unicodetools is ridiculously inefficient, see https://github.com/unicode-org/unicodetools/pull/1018. Eventually we should write some reasonable data structures in the unicodetools to properly support queries on multivalued properties with many different values.
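One plausible shape for such a data structure is an inverted index from each property value to the set of code points that carry it, so that a query for « the characters with value V » is a dictionary lookup rather than a scan over all characters. The sketch below only illustrates that idea: it is not the unicodetools design, and the class name `MultiValuedPropertyIndex` and the sample values are hypothetical.

```python
from collections import defaultdict


class MultiValuedPropertyIndex:
    """Inverted index for a property whose value is a set of strings."""

    def __init__(self):
        self._by_value = defaultdict(set)  # value -> code points having it

    def add(self, code_point, values):
        """Record that `code_point` has every value in `values`."""
        for value in values:
            self._by_value[value].add(code_point)

    def code_points_with(self, value):
        """Return the code points that have `value` (empty if none do)."""
        # .get avoids creating an entry for values that were never added.
        return frozenset(self._by_value.get(value, ()))


# Hypothetical data: two code points sharing one of their values.
index = MultiValuedPropertyIndex()
index.add(0x4E00, ["value-a", "value-b"])
index.add(0x4E01, ["value-b"])
shared = index.code_points_with("value-b")
```

The index is built once, at a cost proportional to the total number of (code point, value) pairs; after that, each query is proportional to the size of its result.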