commons-imaging icon indicating copy to clipboard operation
commons-imaging copied to clipboard

IMAGING-168 installing package with Swedish characters adds junk char…

Open khemkasudeep opened this issue 9 years ago • 18 comments

…acters to dc:title property

khemkasudeep avatar Aug 03 '15 07:08 khemkasudeep

Hello @khemkasudeep,

after applying your changes I get:

Tests in error:
  ByteSourceImageTest.test:84->checkGuessFormat:139 » ImageRead Couldn't read ma...
  BmpReadTest.data:44->BmpBaseTest.getBmpImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->BmpBaseTest.access$000:30->BmpBaseTest.isBmp:34 » ImageRead
  DcxReadTest.data:40->DcxBaseTest.getDcxImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->DcxBaseTest.access$000:30->DcxBaseTest.isDcx:34 » ImageRead
  GifReadTest.data:40->GifBaseTest.getGifImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->GifBaseTest.access$000:30->GifBaseTest.isGif:34 » ImageRead
  IcnsReadTest.data:42->IcnsBaseTest.getIcnsImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->IcnsBaseTest.access$000:30->IcnsBaseTest.isIcns:34 » ImageRead
  JpegReadTest.data:46->JpegBaseTest.getJpegImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->JpegBaseTest.isJpeg:34 » ImageRead
  IptcCodedCharacterSetTest.testCodedCharacterSet:61 » ImageRead Unexpected EOF.
  PamReadTest.test:41->PamBaseTest.getPamImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->PamBaseTest.access$000:30->PamBaseTest.isPam:34 » ImageRead
  ConvertPngToGifTest.test:38->PngBaseTest.getPngImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->PngBaseTest.access$000:30->PngBaseTest.isPng:34 » ImageRead
  PngReadTest.test:40->PngBaseTest.getPngImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->PngBaseTest.access$000:30->PngBaseTest.isPng:34 » ImageRead
  PsdReadTest.test:40->PsdBaseTest.getPsdImages:43->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->PsdBaseTest.access$000:29->PsdBaseTest.isPsd:32 » ImageRead
  RgbeReadTest.test:39->RgbeBaseTest.getRgbeImages:43->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->RgbeBaseTest.access$000:29->RgbeBaseTest.isRgbe:32 » ImageRead
  TiffReadTest.test:36->TiffBaseTest.getTiffImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->TiffBaseTest.access$000:30->TiffBaseTest.isTiff:34 » ImageRead
  TiffRoundtripTest.test:41->TiffBaseTest.getTiffImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->TiffBaseTest.access$000:30->TiffBaseTest.isTiff:34 » ImageRead
  XmpDumpTest.test:45 » ImageRead Couldn't read magic numbers to guess format.
  XmpUpdateTest.test:52 » ImageRead Couldn't read magic numbers to guess format.

Tests run: 394, Failures: 0, Errors: 16, Skipped: 3

Without your changes, all tests pass. Can you have a look? Please make sure to run mvn clean verify before submitting PRs. Thank you!

britter avatar Oct 02 '15 08:10 britter

@britter Tests are being passed in windows machine. What OS are you using?

khemkasudeep avatar Oct 02 '15 13:10 khemkasudeep

@khemkasudeep I'm on

Apache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 2015-04-22T13:57:37+02:00)
Maven home: /usr/local/Cellar/maven/3.3.3/libexec
Java version: 1.7.0_79, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/Home/jre
Default locale: de_DE, platform encoding: UTF-8
OS name: "mac os x", version: "10.10.5", arch: "x86_64", family: "mac"

Maybe you can try in a Linux VM?

britter avatar Oct 02 '15 17:10 britter

Hi @britter

I tried on Apache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 2015-04-22T17:27:37+05:30) Maven home: /usr/local/Cellar/maven/3.3.3/libexec Java version: 1.7.0_67, vendor: Oracle Corporation Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_67.jdk/Contents/Home/jre Default locale: en_US, platform encoding: UTF-8 OS name: "mac os x", version: "10.10.5", arch: "x86_64", family: "mac"

But it again passed. Can it depend on locale? Could you provide complete log?

khemkasudeep avatar Oct 05 '15 06:10 khemkasudeep

Hello @khemkasudeep

here is the build log: https://gist.github.com/britter/ec0b7b15128bf8931e6c

I'm currently looking for a way to configure surefire to use english locale on my system...

britter avatar Oct 07 '15 17:10 britter

@britter I tried below. Even then the test passed. <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-surefire-plugin</artifactId> <argLine>-Duser.language=de -Duser.region=DE</argLine>

khemkasudeep avatar Oct 08 '15 16:10 khemkasudeep

@khemkasudeep I also found that one and tried to set my build to your locale. Build still failing. I assume the configuration doesn't work or this problem is unrelated to locale.

britter avatar Oct 08 '15 16:10 britter

Hi @britter , @khemkasudeep I tested this patch on my mac with below maven setup and it passed without any flaws. @britter, could you please start afresh with conning the repo again and merging this PR there .
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T22:11:47+05:30) Maven home: /usr/local/Cellar/maven/3.3.9/libexec Java version: 1.7.0_80, vendor: Oracle Corporation Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_80.jdk/Contents/Home/jre Default locale: de_DE, platform encoding: UTF-8 OS name: "mac os x", version: "10.11.3", arch: "x86_64", family: "mac"

ashokpanghal avatar Mar 02 '16 07:03 ashokpanghal

I patched this library with this merge request myself and can verify that under windows and linux all tests are passing successfully and the bugfix is solved successfully, too. Please merge this to fix this bug!

Apache Maven 3.3.9
Maven home: /usr/share/maven3
Java version: 1.8.0_171, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-8-oracle/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.19.0-25-generic", arch: "amd64", family: "unix"
Tests run: 433, Failures: 0, Errors: 0, Skipped: 15


Apache Maven 3.6.0
Maven home: C:\dev\tools\apache-maven-3.6.0
Java version: 1.8.0_191, vendor: Oracle Corporation, runtime: C:\dev\java\jdk1.8.0_191\jre
Default locale: de_DE, platform encoding: Cp1252
OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; support was removed in 8.0
Tests run: 433, Failures: 0, Errors: 0, Skipped: 15

PS: If you want to do this, too:

git clone https://github.com/apache/commons-imaging.git
cd commons-imaging
git fetch origin pull/18/head:bugfix-iptc-encoding
mvn test

janArb avatar Aug 01 '19 12:08 janArb

Hi @janArb ! Thanks for testing and reporting back here.

Unfortunately there is some feedback not addressed in the pull request. I can sort out the conflict files, but would be better if @khemkasudeep could reply/address the pending review comments.

Otherwise we would need a new clean pull request, but that would be bad as @khemkasudeep appears to have a working solution that only needs a bit of polishing. I'll see if I can have another look with more calm in the next days.

Bruno

kinow avatar Aug 01 '19 12:08 kinow

BTW I think Latin1 is one of the 8bit transparent encodings, so the old code would never be unequal. This is one of the reasons why Latin1 is used if a encoding is not well defined. I do not understand iptc enough here to say if it is good or bad to use a different encoding, but it might need to be configurable...

ecki avatar Aug 01 '19 13:08 ecki

@ecki It was not configurable before and latin is still the fallback if no encoding is defined. Why should this bugfix here solve more features? I think an encoding configuration should be treated as feature. PS: It definitely is good to support UTF8 if that is defined in the official IPTC encoding tag (https://de.wikipedia.org/wiki/IPTC-IIM-Standard -> Coded Character Set)

janArb avatar Aug 01 '19 13:08 janArb

Hi @janArb I raised this PR in 2015. It has been a long time since I touched this code base. But AFAIR, the only blocker at that time for this patch was @britter was getting build issues on his machine while patch was building fine on my machine.

khemkasudeep avatar Aug 02 '19 10:08 khemkasudeep

@ecki

BTW I think Latin1 is one of the 8bit transparent encodings, so the old code would never be unequal. This is one of the reasons why Latin1 is used if a encoding is not well defined. I do not understand iptc enough here to say if it is good or bad to use a different encoding, but it might need to be configurable...

I don't know much about IPTC, and this issue with transparent encodings. But looks to me like after this PR we would have ISO-8859-1 as default/fallback, but if within the metadata another value was defined, then the parser would now respect and use it... so in a certain way it would be configurable I think?

kinow avatar Aug 02 '19 10:08 kinow

@janArb

Why should this bugfix here solve more features?

I think @ecki was merely pointing out something to watch out while reviewing it, and as you explained, after this PR it should be configurable I think.

kinow avatar Aug 02 '19 10:08 kinow

@khemkasudeep

Hi! And thanks again for your contribution. I just fetched the code from your branch as was in the process or rebasing against master to try and solve the conflicts, when I noticed that the bytes were removed in IptcRecord :confused:

I will need to look at the commit log and understand better what changed why, and how to massage your code to get it in a state ready to be merged again (unless you have spare time to fix it?).

Once that's done, I think it's ready to be merged - and @janArb kindly tested it as well, so looks like a good issue to be included in 1.0-alpha2 :+1:

Bruno

kinow avatar Aug 02 '19 10:08 kinow

I think @ecki was merely pointing out something to watch out while reviewing it, and as you explained, after this PR it should be configurable I think.

Yes, exactly, the code I commented did not look like it is configurable. If it is, it might be fine (however configurable is not very useful in scenarios where you work on collections of files where some have different encoding, a binary transparent encoding instead of encoding exceptions is preferred in that case (at least to me).

ecki avatar Aug 02 '19 11:08 ecki

Just realized that even after the patch for reading UTF8 the writing of IPTC tags is also still limited to Latin (https://github.com/apache/commons-imaging/blob/2752dd7f613fca3fcb799d9f8fbdefa7793d4748/src/main/java/org/apache/commons/imaging/formats/jpeg/iptc/IptcParser.java#L489) in org.apache.commons.imaging.formats.jpeg.iptc.IptcParser#writeIPTCBlock

                final byte[] recordData = element.value.getBytes("ISO-8859-1");
                if (!new String(recordData, "ISO-8859-1").equals(element.value)) {
                    throw new ImageWriteException(
                            "Invalid record value, not ISO-8859-1");

janArb avatar Aug 14 '19 15:08 janArb