icu icon indicating copy to clipboard operation
icu copied to clipboard

ICU-22080 Add plain ASCII as an explicitly detected type

Open koppor opened this issue 2 years ago • 13 comments

Checklist
  • [x] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22080
  • [x] Required: The PR title must be prefixed with a JIRA Issue number.
  • [x] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • [x] Required: Each commit message must be prefixed with a JIRA Issue number.
  • [ ] Issue accepted (done by Technical Committee after discussion)
  • [x] Tests included, if applicable
  • [ ] API docs and/or User Guide docs changed or added, if applicable

koppor avatar Jul 04 '22 20:07 koppor

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Jul 04 '22 20:07 CLAassistant

Notice: the branch changed across the force-push!

  • icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java is different
  • icu4j/main/tests/charset/src/com/ibm/icu/dev/test/charset/TestCharSetRecognition.java is no longer changed in the branch
  • icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/TestCharsetDetector.java is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Notice: the branch changed across the force-push!

  • icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java is different
  • icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/CharsetDetectionTests.xml is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Notice: the branch changed across the force-push!

  • icu4j/main/classes/core/src/com/ibm/icu/text/CharsetDetector.java is different
  • icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java is different
  • icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/CharsetDetectionTests.xml is different
  • icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/TestCharsetDetector.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

I notice the implementation is only in Java. To put a change like this into ICU we'll also need it ported back to C/C++.

richgillam avatar Jul 14 '22 16:07 richgillam

I notice the implementation is only in Java. To put a change like this into ICU we'll also need it ported back to C/C++.

Hi @koppor I see that you gave Rich's comment a 👍 but I think @richgillam was really asking whether you would be willing to add the C/C++ port in this PR...

markusicu avatar Sep 09 '22 23:09 markusicu

FYI @srl295 @aheninger

Also, should we return "US-ASCII" which is more specific than "ASCII"? Or is that too pedantic?

markusicu avatar Sep 10 '22 00:09 markusicu

I think ASCII is sufficiently unambiguous.

On Fri, Sep 9, 2022 at 5:23 PM Markus Scherer @.***> wrote:

FYI @srl295 https://github.com/srl295 @aheninger https://github.com/aheninger

Also, should we return "US-ASCII" which is more specific than "ASCII"? Or is that too pedantic?

— Reply to this email directly, view it on GitHub https://github.com/unicode-org/icu/pull/2127#issuecomment-1242571959, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACJLEMER6NH6YHFWFK6WXWTV5PIHFANCNFSM52UC2RXQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

macchiati avatar Sep 10 '22 01:09 macchiati

FYI @srl295 @aheninger

Also, should we return "US-ASCII" which is more specific than "ASCII"? Or is that too pedantic?

ASCII is the same. US-ASCII is probably slightly more pedantic. The canonical id is ansi xj 1967 something.

So.. I think ASCII is fine probably more recognizable these days.

srl295 avatar Sep 10 '22 02:09 srl295

I notice the implementation is only in Java. To put a change like this into ICU we'll also need it ported back to C/C++. Hi @koppor I see that you gave Rich's comment a 👍 but I think @richgillam was really asking whether you would be willing to add the C/C++ port in this PR...

Oh, OK, I understood the "we" in a wrong way. I would suggest to get the Java code finished and then I'll look around for a C++ expert having the time to work on this.

koppor avatar Sep 24 '22 07:09 koppor

I committed the suggestions using GitHub's features. I did some other tweaks. The tests in "TestCharsetDector" successfully run locally. Not sure why. I think, there will be some other classes failing, so I'll wait for the CI.

After "we" sorted everything out, I'll squash into one commit and add a Co-authored-by for @markusicu

koppor avatar Sep 24 '22 07:09 koppor

Note that I needed to replace ISO-2022-JP in the context of com.ibm.icu.dev.test.charsetdet.TestCharsetDetector#TestBufferOverflow by ASCII as the byte sequence (line 283) does not contain any non-7-bit characters. Think, it is no harm, because the shift state "at the start" is "a bad one".

koppor avatar Nov 27 '22 21:11 koppor

Java code finished. Now, I would try to port it to the C implementation.

koppor avatar Nov 27 '22 21:11 koppor