metadata-extractor

Misinterpreting encoding of exif copyright message

Open davidekholm opened this issue 7 years ago • 13 comments

The attached sample image contains an Exif copyright message that's UTF-8 encoded, but metadata-extractor misinterprets it as ISO-8859-1. The message is decoded correctly on Mac (where the system property file.encoding is UTF-8), but incorrectly on Windows.

Attached image: vecht sahara 140328 1005

davidekholm avatar Jun 20 '17 12:06 davidekholm

I have this problem too!

acwolff avatar Jul 25 '17 12:07 acwolff

The same problem for me, but for the caption/description IPTC field, on OSX in my case.

mateusz-fiolka avatar Nov 21 '17 14:11 mateusz-fiolka

For anyone who wants to solve this: it is caused by not specifying a character encoding when converting from bytes to a string. Many methods have overloads with and without a Charset parameter, and when the version without this parameter is used, Java uses the "default Charset" for the conversion.

This default charset can't be trusted to be what you want and should almost never be used. It's set when the JVM is launched, and can be specified by whoever starts the application. It defaults to the OS' "standard" encoding, which is a localized codepage on Windows for backwards compatibility. On macOS and Linux it's UTF-8 by default.
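
As a minimal sketch of the difference (the class and method names are made up for illustration and are not part of metadata-extractor), the same two bytes decode differently depending on which overload is used:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Explicit charset: the result is the same on every platform.
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // No charset argument: silently uses Charset.defaultCharset(),
    // so the result depends on how the JVM was started.
    public static String decodeWithDefault(byte[] bytes) {
        return new String(bytes);
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9}; // U+00E9 (e with acute) in UTF-8
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("explicit UTF-8:  " + decodeUtf8(utf8));
        System.out.println("default decode:  " + decodeWithDefault(utf8));
    }
}
```

Running this with `-Dfile.encoding=...` set to different values is one way to see the default-based result change while the explicit one stays put.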

FindBugs can be used to quickly find all places in the code where there is reliance on the default encoding.

In case FindBugs isn't installed/configured, here's the status for the current master when it comes to default encoding reliance:

Source/com/drew/metadata/mov/QuickTimeDescriptor.java:61 Found reliance on default encoding in com.drew.metadata.mov.QuickTimeDescriptor.getMajorBrandDescription(): new String(byte[]) [Of Concern(19), High confidence]
Tests/com/drew/lang/ByteTrieTest.java:41 Found reliance on default encoding in com.drew.lang.ByteTrieTest.testBasics(): String.getBytes() [Of Concern(19), High confidence]
Source/com/drew/lang/SequentialReader.java:314 Found reliance on default encoding in com.drew.lang.SequentialReader.getString(int, String): new String(byte[]) [Of Concern(19), High confidence]
Source/com/drew/lang/SequentialReader.java:304 Found reliance on default encoding in com.drew.lang.SequentialReader.getString(int): new String(byte[]) [Of Concern(19), High confidence]
Source/com/drew/metadata/adobe/AdobeJpegReader.java:55 Found reliance on default encoding in com.drew.metadata.adobe.AdobeJpegReader.readJpegSegments(Iterable, Metadata, JpegSegmentType): new String(byte[], int, int) [Of Concern(19), High confidence]
Source/com/drew/metadata/StringValue.java:74 Found reliance on default encoding in com.drew.metadata.StringValue.toString(Charset): new String(byte[]) [Of Concern(19), High confidence]
Source/com/drew/metadata/mp4/Mp4Dictionary.java:131 Found reliance on default encoding in com.drew.metadata.mp4.Mp4Dictionary.<static initializer for Mp4Dictionary>(): new String(byte[]) [Of Concern(19), High confidence]
Source/com/drew/metadata/photoshop/DuckyReader.java:57 Found reliance on default encoding in com.drew.metadata.photoshop.DuckyReader.readJpegSegments(Iterable, Metadata, JpegSegmentType): new String(byte[], int, int) [Of Concern(19), High confidence]
Source/com/drew/metadata/mov/QuickTimeDictionary.java:135 Found reliance on default encoding in com.drew.metadata.mov.QuickTimeDictionary.<static initializer for QuickTimeDictionary>(): new String(byte[]) [Of Concern(19), High confidence]
Tests/com/drew/metadata/exif/ExifSubIFDDescriptorTest.java:69 Found reliance on default encoding in com.drew.metadata.exif.ExifSubIFDDescriptorTest.testUserCommentDescription_ZeroLengthAscii1(): String.getBytes() [Of Concern(19), High confidence]
Tests/com/drew/metadata/exif/ExifSubIFDDescriptorTest.java:80 Found reliance on default encoding in com.drew.metadata.exif.ExifSubIFDDescriptorTest.testUserCommentDescription_ZeroLengthAscii2(): String.getBytes() [Of Concern(19), High confidence]
Source/com/drew/tools/ProcessAllImagesInFolderUtility.java:576 Found reliance on default encoding in com.drew.tools.ProcessAllImagesInFolderUtility$MarkdownTableOutputHandler.writeOutput(PrintStream): new java.io.OutputStreamWriter(OutputStream) [Of Concern(19), High confidence]
Source/com/drew/tools/ProcessAllImagesInFolderUtility.java:557 Found reliance on default encoding in com.drew.tools.ProcessAllImagesInFolderUtility$MarkdownTableOutputHandler.onScanCompleted(PrintStream): new java.io.PrintStream(OutputStream, boolean) [Of Concern(19), High confidence]
Source/com/drew/metadata/mov/metadata/QuickTimeDataHandler.java:76 Found reliance on default encoding in com.drew.metadata.mov.metadata.QuickTimeDataHandler.processAtom(Atom, byte[]): String.getBytes() [Of Concern(19), High confidence]
Tests/com/drew/metadata/exif/ExifSubIFDDescriptorTest.java:48 Found reliance on default encoding in com.drew.metadata.exif.ExifSubIFDDescriptorTest.testUserCommentDescription_AsciiHeaderAsciiEncoding(): String.getBytes() [Of Concern(19), High confidence]
Source/com/drew/metadata/mov/metadata/QuickTimeDataHandler.java:104 Found reliance on default encoding in com.drew.metadata.mov.metadata.QuickTimeDataHandler.processData(byte[], SequentialByteArrayReader): new String(byte[]) [Of Concern(19), High confidence]
Tests/com/drew/metadata/exif/ExifSubIFDDescriptorTest.java:58 Found reliance on default encoding in com.drew.metadata.exif.ExifSubIFDDescriptorTest.testUserCommentDescription_BlankAscii(): String.getBytes() [Of Concern(19), High confidence]
Tests/com/drew/metadata/exif/ExifSubIFDDescriptorTest.java:38 Found reliance on default encoding in com.drew.metadata.exif.ExifSubIFDDescriptorTest.testUserCommentDescription_EmptyEncoding(): String.getBytes() [Of Concern(19), High confidence]
Source/com/drew/metadata/mov/metadata/QuickTimeDataHandler.java:93 Found reliance on default encoding in com.drew.metadata.mov.metadata.QuickTimeDataHandler.processKeys(SequentialByteArrayReader): new String(byte[]) [Of Concern(19), High confidence]
Source/com/drew/metadata/mov/metadata/QuickTimeDataHandler.java:62 Found reliance on default encoding in com.drew.metadata.mov.metadata.QuickTimeDataHandler.shouldAcceptContainer(Atom): String.getBytes() [Of Concern(19), High confidence]
Source/com/drew/tools/ProcessAllImagesInFolderUtility.java:75 Found reliance on default encoding in com.drew.tools.ProcessAllImagesInFolderUtility.main(String[]): new java.io.PrintStream(OutputStream, boolean) [Of Concern(19), High confidence]
Source/com/drew/lang/RandomAccessReader.java:390 Found reliance on default encoding in com.drew.lang.RandomAccessReader.getString(int, int, String): new String(byte[]) [Of Concern(19), High confidence]
Tests/com/drew/lang/SequentialAccessTestBase.java:247 Found reliance on default encoding in com.drew.lang.SequentialAccessTestBase.testGetString(): new String(byte[]) [Of Concern(19), High confidence]
Source/com/drew/metadata/avi/AviRiffHandler.java:90 Found reliance on default encoding in com.drew.metadata.avi.AviRiffHandler.processChunk(String, byte[]): new String(byte[]) [Of Concern(19), High confidence]
Source/com/drew/metadata/jfif/JfifReader.java:58 Found reliance on default encoding in com.drew.metadata.jfif.JfifReader.readJpegSegments(Iterable, Metadata, JpegSegmentType): new String(byte[], int, int) [Of Concern(19), High confidence]
Source/com/drew/metadata/exif/makernotes/OlympusMakernoteDescriptor.java:819 Found reliance on default encoding in com.drew.metadata.exif.makernotes.OlympusMakernoteDescriptor.getCameraIdDescription(): new String(byte[]) [Of Concern(19), High confidence]
Source/com/drew/metadata/iptc/IptcReader.java:175 Found reliance on default encoding in com.drew.metadata.iptc.IptcReader.processTag(SequentialReader, Directory, int, int, int): new String(byte[]) [Of Concern(19), High confidence]
Source/com/drew/metadata/photoshop/PhotoshopDescriptor.java:318 Found reliance on default encoding in com.drew.metadata.photoshop.PhotoshopDescriptor.getSimpleString(int): new String(byte[]) [Of Concern(19), High confidence]
Source/com/drew/metadata/Directory.java:475 Found reliance on default encoding in com.drew.metadata.Directory.getInteger(int): String.getBytes() [Of Concern(19), High confidence]
Tests/com/drew/metadata/DirectoryTest.java:210 Found reliance on default encoding in com.drew.metadata.DirectoryTest.testSetStringGetInt(): new String(byte[]) [Of Concern(19), High confidence]
Source/com/drew/metadata/eps/EpsReader.java:256 Found reliance on default encoding in com.drew.metadata.eps.EpsReader.extractXmpData(Metadata, SequentialReader): String.getBytes() [Of Concern(19), High confidence]
Source/com/drew/metadata/exif/ExifDescriptorBase.java:649 Found reliance on default encoding in com.drew.metadata.exif.ExifDescriptorBase.getUserCommentDescription(): new String(byte[], int, int) [Of Concern(19), High confidence]
Source/com/drew/metadata/TagDescriptor.java:267 Found reliance on default encoding in com.drew.metadata.TagDescriptor.get7BitStringFromBytes(int): new String(byte[], int, int) [Of Concern(19), High confidence]
Source/com/drew/metadata/wav/WavRiffHandler.java:111 Found reliance on default encoding in com.drew.metadata.wav.WavRiffHandler.processChunk(String, byte[]): new String(byte[]) [Of Concern(19), High confidence]
Source/com/drew/imaging/FileTypeDetector.java:46 Found reliance on default encoding in com.drew.imaging.FileTypeDetector.<static initializer for FileTypeDetector>(): String.getBytes() [Of Concern(19), High confidence]
Source/com/drew/imaging/FileTypeDetector.java:183 Found reliance on default encoding in com.drew.imaging.FileTypeDetector.detectFileType(BufferedInputStream): new String(byte[], int, int) [Of Concern(19), High confidence]
Source/com/drew/metadata/mov/metadata/QuickTimeDirectoryHandler.java:86 Found reliance on default encoding in com.drew.metadata.mov.metadata.QuickTimeDirectoryHandler.processData(byte[], SequentialByteArrayReader): new String(byte[]) [Of Concern(19), High confidence]
Source/com/drew/metadata/photoshop/PhotoshopReader.java:65 Found reliance on default encoding in com.drew.metadata.photoshop.PhotoshopReader.readJpegSegments(Iterable, Metadata, JpegSegmentType): new String(byte[], int, int) [Of Concern(19), High confidence]
Source/com/drew/metadata/mov/metadata/QuickTimeDirectoryHandler.java:68 Found reliance on default encoding in com.drew.metadata.mov.metadata.QuickTimeDirectoryHandler.processAtom(Atom, byte[]): new String(byte[]) [Of Concern(19), High confidence]
Source/com/drew/metadata/mp4/Mp4Descriptor.java:59 Found reliance on default encoding in com.drew.metadata.mp4.Mp4Descriptor.getMajorBrandDescription(): new String(byte[]) [Of Concern(19), High confidence]
Source/com/drew/lang/StringUtil.java:84 Found reliance on default encoding in com.drew.lang.StringUtil.fromStream(InputStream): new java.io.InputStreamReader(InputStream) [Of Concern(19), High confidence]
Source/com/drew/metadata/xmp/XmpReader.java:101 Found reliance on default encoding in com.drew.metadata.xmp.XmpReader.readJpegSegments(Iterable, Metadata, JpegSegmentType): new String(byte[], int, int) [Of Concern(19), High confidence]
Source/com/drew/metadata/icc/IccReader.java:215 Found reliance on default encoding in com.drew.metadata.icc.IccReader.getStringFromInt32(int): new String(byte[]) [Of Concern(19), High confidence]
Source/com/drew/metadata/icc/IccReader.java:70 Found reliance on default encoding in com.drew.metadata.icc.IccReader.readJpegSegments(Iterable, Metadata, JpegSegmentType): new String(byte[], int, int) [Of Concern(19), High confidence]
Source/com/drew/metadata/icc/IccDescriptor.java:90 Found reliance on default encoding in com.drew.metadata.icc.IccDescriptor.getTagDataString(int): new String(byte[], int, int) [Of Concern(19), High confidence]
Source/com/drew/metadata/icc/IccDescriptor.java:339 Found reliance on default encoding in com.drew.metadata.icc.IccDescriptor.getInt32FromString(String): String.getBytes() [Of Concern(19), High confidence]
Source/com/drew/imaging/riff/RiffReader.java:81 Found reliance on default encoding in com.drew.imaging.riff.RiffReader.processChunks(SequentialReader, int, RiffHandler): new String(byte[]) [Of Concern(19), High confidence]
Source/com/drew/metadata/jfxx/JfxxReader.java:58 Found reliance on default encoding in com.drew.metadata.jfxx.JfxxReader.readJpegSegments(Iterable, Metadata, JpegSegmentType): new String(byte[], int, int) [Of Concern(19), High confidence]
Tests/com/drew/lang/CompoundExceptionTest.java:74 Found reliance on default encoding in com.drew.lang.CompoundExceptionTest.testNoInnerException(): new java.io.PrintWriter(OutputStream) [Of Concern(19), High confidence]
Tests/com/drew/lang/CompoundExceptionTest.java:72 Found reliance on default encoding in com.drew.lang.CompoundExceptionTest.testNoInnerException(): new java.io.PrintStream(OutputStream) [Of Concern(19), High confidence]
Source/com/drew/metadata/exif/ExifReader.java:62 Found reliance on default encoding in com.drew.metadata.exif.ExifReader.readJpegSegments(Iterable, Metadata, JpegSegmentType): new String(byte[], int, int) [Of Concern(19), High confidence]
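
For the four kinds of call flagged above, the charset-aware overloads look roughly like this. This is only a sketch: the class and method names are hypothetical, and the charsets chosen here (US-ASCII, UTF-8) are assumptions; the right one for each call site depends on what the file format specifies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class CharsetFixPatterns {
    // new String(byte[])  ->  new String(byte[], Charset)
    public static String decodeAscii(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // String.getBytes()  ->  String.getBytes(Charset)
    public static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // new InputStreamReader(InputStream)  ->  new InputStreamReader(InputStream, Charset)
    public static String readAllUtf8(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // new PrintStream(OutputStream, boolean)  ->  new PrintStream(OutputStream, boolean, String)
    public static byte[] printUtf8(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (PrintStream ps = new PrintStream(out, true, "UTF-8")) {
            ps.print(s);
        } catch (IOException e) { // UnsupportedEncodingException is an IOException
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```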

Nadahar avatar Nov 21 '17 14:11 Nadahar

Another year has passed without a solution.

acwolff avatar Jul 04 '18 16:07 acwolff

@acwolff you are welcome to submit a PR.

drewnoakes avatar Jul 04 '18 16:07 drewnoakes

@drewnoakes that is already done by David Ekholm, see the first message in this thread.

acwolff avatar Jul 04 '18 17:07 acwolff

André, I mean this well, because we share the same goal of getting the maintainer to address this: please avoid that negative approach. It doesn't bring out the best in people.

Cheers /David

davidekholm avatar Jul 04 '18 18:07 davidekholm

@acwolff I can't see any PR. Can you link to it?

Nadahar avatar Jul 04 '18 18:07 Nadahar

@Nadahar I think the PR is the first message in this thread. I did not write a PR, I don’t have the knowledge to do that. I'm just suffering from this problem.

acwolff avatar Jul 04 '18 20:07 acwolff

@acwolff PR is short for "Pull Request". A description for those who don't know how to make one can be found here.

Nadahar avatar Jul 04 '18 20:07 Nadahar

This seems to be fixed in the latest code. I downloaded the above image and rechecked. Debugging shows that the image has its encoding tag correctly set to "UTF-8", and that charset is used to decode. I stumbled upon this because I have a similar issue with an image's caption tag. Unfortunately I cannot upload that sample image due to legal issues.

sknull avatar Jul 09 '20 08:07 sknull

This is happening to me with the latest 2.15.0 version. Somewhere there is code that for Exif bytes is just assuming that the bytes are in whatever the system charset happens to be (which might not even have any relation with where the image was generated).

I'm attaching a simple image. If I extract Exif Copyright data on a system with UTF-8 as the default charset, then I correctly get back "Copyright © 2009 Garret Wilson". But if I run it on a system using e.g. windows-1252 as the default charset, then I get a different string in which "©" has been replaced with two characters corresponding to the bytes of the UTF-8 representation of "©" in the file.
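
The effect can be reproduced outside the library with plain JDK calls. A small sketch (the class name is hypothetical, and it does not touch metadata-extractor itself; it only decodes the same UTF-8 bytes with two different charsets):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The copyright string as it would be stored in the file: UTF-8 bytes.
        // U+00A9 (copyright sign) becomes the two bytes 0xC2 0xA9.
        byte[] stored = "Copyright \u00A9 2009 Garret Wilson".getBytes(StandardCharsets.UTF_8);

        // Decoded with the charset the bytes were written in.
        System.out.println(decode(stored, StandardCharsets.UTF_8));

        // Decoded as windows-1252: 0xC2 and 0xA9 each become a separate character.
        System.out.println(decode(stored, Charset.forName("windows-1252")));
        // prints "Copyright Â© 2009 Garret Wilson"
    }
}
```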

This is a huge blocker issue that corrupts data. Please fix it as soon as reasonably possible. Or let me know in which class you are converting Exif bytes to a string and I'll fix it and file a pull request.

Attached image: gate-turret-reduced-exif-utf-8

garretwilson avatar Feb 07 '21 22:02 garretwilson

Confirmed, see this album.

acwolff avatar Feb 08 '21 08:02 acwolff