icu
icu copied to clipboard
ICU-22080 Add plain ASCII as an explicitly detected type
Checklist
- [x] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22080
- [x] Required: The PR title must be prefixed with a JIRA Issue number.
- [x] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
- [x] Required: Each commit message must be prefixed with a JIRA Issue number.
- [ ] Issue accepted (done by Technical Committee after discussion)
- [x] Tests included, if applicable
- [ ] API docs and/or User Guide docs changed or added, if applicable
Notice: the branch changed across the force-push!
- icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java is different
- icu4j/main/tests/charset/src/com/ibm/icu/dev/test/charset/TestCharSetRecognition.java is no longer changed in the branch
- icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/TestCharsetDetector.java is now changed in the branch
~ Your Friendly Jira-GitHub PR Checker Bot
Notice: the branch changed across the force-push!
- icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java is different
- icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/CharsetDetectionTests.xml is now changed in the branch
~ Your Friendly Jira-GitHub PR Checker Bot
Notice: the branch changed across the force-push!
- icu4j/main/classes/core/src/com/ibm/icu/text/CharsetDetector.java is different
- icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java is different
- icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/CharsetDetectionTests.xml is different
- icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/TestCharsetDetector.java is different
~ Your Friendly Jira-GitHub PR Checker Bot
I notice the implementation is only in Java. To put a change like this into ICU we'll also need it ported back to C/C++.
I notice the implementation is only in Java. To put a change like this into ICU we'll also need it ported back to C/C++.
Hi @koppor I see that you gave Rich's comment a 👍 but I think @richgillam was really asking whether you would be willing to add the C/C++ port in this PR...
FYI @srl295 @aheninger
Also, should we return "US-ASCII" which is more specific than "ASCII"? Or is that too pedantic?
I think ASCII is sufficiently unambiguous.
On Fri, Sep 9, 2022 at 5:23 PM Markus Scherer @.***> wrote:
FYI @srl295 https://github.com/srl295 @aheninger https://github.com/aheninger
Also, should we return "US-ASCII" which is more specific than "ASCII"? Or is that too pedantic?
— Reply to this email directly, view it on GitHub https://github.com/unicode-org/icu/pull/2127#issuecomment-1242571959, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACJLEMER6NH6YHFWFK6WXWTV5PIHFANCNFSM52UC2RXQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>
FYI @srl295 @aheninger
Also, should we return "US-ASCII" which is more specific than "ASCII"? Or is that too pedantic?
ASCII is the same. US-ASCII is probably slightly more pedantic. The canonical id is ansi xj 1967 something.
So.. I think ASCII is fine probably more recognizable these days.
I notice the implementation is only in Java. To put a change like this into ICU we'll also need it ported back to C/C++. Hi @koppor I see that you gave Rich's comment a 👍 but I think @richgillam was really asking whether you would be willing to add the C/C++ port in this PR...
Oh, OK, I understood the "we" in a wrong way. I would suggest to get the Java code finished and then I'll look around for a C++ expert having the time to work on this.
I committed the suggestions using GitHub's features. I did some other tweaks. The tests in "TestCharsetDector" successfully run locally. Not sure why. I think, there will be some other classes failing, so I'll wait for the CI.
After "we" sorted everything out, I'll squash into one commit and add a Co-authored-by
for @markusicu
Note that I needed to replace ISO-2022-JP
in the context of com.ibm.icu.dev.test.charsetdet.TestCharsetDetector#TestBufferOverflow
by ASCII
as the byte sequence (line 283) does not contain any non-7-bit characters. Think, it is no harm, because the shift state "at the start" is "a bad one".
Java code finished. Now, I would try to port it to the C implementation.