pyexiftool
UTF-8 and local codepage
Hi,
As I remember, you added an '_encoding' arg to support non-Unicode encodings.
Today, I found that a tag whose value is valid UTF-8 but contains wide characters is unsupported:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 6326: illegal multibyte sequence
('gbk' (Chinese) is my Windows system's current code page.)
If I modify line 1018 of 'exiftool.py' as follows, everything works:
#raw_stdout = raw_stdout.decode(self._encoding)
raw_stdout = raw_stdout.decode("utf-8")
Well, the result from pyExifTool is obtained from JSON, but JSON only accepts valid UTF-8. If an invalid value (in the local code page) is passed through JSON, the value gets modified (garbled); even if you use self._encoding (the local code page) to decode it, you can no longer recover the original value. I did some tests on this: https://exiftool.org/forum/index.php?topic=13473
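A minimal pure-Python sketch of that mismatch (no exiftool process needed; 'gbk' stands in for the local code page here):

```python
value = "测试"  # Chinese characters, beyond ASCII

# exiftool's JSON stream carries UTF-8 bytes:
utf8_bytes = value.encode("utf-8")
assert utf8_bytes.decode("utf-8") == value  # decoding as UTF-8 round-trips

# Decoding the same bytes as gbk either garbles the text or raises,
# depending on the exact byte sequence:
try:
    assert utf8_bytes.decode("gbk") != value
except UnicodeDecodeError:
    pass  # some UTF-8 sequences are not valid gbk at all
```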
On the other hand, JSON returns UTF-8, and you have to use UTF-8 to decode it; otherwise, even valid Unicode (UTF-8) tag values are not supported. I think non-Unicode tag values simply cannot be supported as long as JSON is used to get the results. I don't know whether local code pages other than Chinese could be supported by the 'trick' of decoding the JSON with the local encoding, but in my experience, it just can't. The only way to get around it is to use the '-b' option and decode the value yourself:
import base64
import locale

def Base64_to_Str(base64_str: str, encoding=None) -> str:
    if base64_str is None or not base64_str.startswith('base64:'):
        return None
    b: bytes = base64.b64decode(base64_str[7:])
    if encoding is None:
        encoding = locale.getpreferredencoding(False)  # e.g. 'cp936' (GBK)
    fixed: str = b.decode(encoding)
    return fixed
That's how I deal with filenames, which are always locally encoded on the Windows command line. (Although Windows now supports setting the encoding to 'utf-8', we can't assume every user turns that option on.) I will keep tracking the ExifTool forum to see whether this problem can be fixed on the exiftool side.
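For reference, here is a self-contained round trip through that '-b' workaround (a sketch that restates the helper above so it runs standalone; the 'base64:' prefix matches what exiftool's -b output produces):

```python
import base64
import locale

def base64_to_str(b64: str, encoding=None) -> str:
    # Strip the 'base64:' prefix that exiftool's -b output uses,
    # then decode the raw bytes with the requested encoding.
    if b64 is None or not b64.startswith("base64:"):
        return None
    raw = base64.b64decode(b64[7:])
    return raw.decode(encoding or locale.getpreferredencoding(False))

# Round trip: gbk-encoded bytes survive base64 intact, unlike JSON.
payload = "base64:" + base64.b64encode("测试".encode("gbk")).decode("ascii")
assert base64_to_str(payload, encoding="gbk") == "测试"
```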
In terms of embedded tags, I think modern software tends to set values in 'utf-8', and that should be supported.
So, please decode the JSON with 'utf-8'. Non-Unicode values are just not supported by JSON (at least the JSON from exiftool); that can't be helped. Invalid UTF-8 bytes will be replaced by \x3F\x3F... in the JSON, and the original value is irreversibly lost.
@Yang-z you can choose that encoding by setting the ExifTool.encoding property to 'utf-8'.
The code was changed because not all locales use UTF-8 by default, and it uses the local codepage by default instead of forcing everyone onto UTF-8. If the codepage doesn't work for you, you can set the property to change it.
Can you post sample code with sample files so I can reproduce whatever you're seeing? I'm having a little trouble understanding the issue.
Sorry, I didn't make it clear enough. Yes, I know you want to use the local code page preferentially to maximise compatibility. It's really a good idea.
However, once you use JSON to extract metadata, local code page compatibility is broken. According to the information provided by Phil Harvey on the ExifTool forum, ExifTool's JSON output requires valid UTF-8.
Additionally, if a value in the metadata is not valid UTF-8 (i.e. it is encoded with the local code page), the value extracted via JSON is garbled, and you can't decode it with either UTF-8 or the local code page. (Invalid UTF-8 bytes will be replaced by \x3F\x3F... in the JSON, and the original value is irreversibly lost.)
PyExifTool does not force everyone onto UTF-8, but ExifTool does, as far as extracting metadata via JSON is concerned.
So, my suggestion is to use UTF-8 to decode the results returned as JSON. Otherwise, it makes no sense: if you decode the JSON results with UTF-8, UTF-8 values can be obtained correctly even when the local code page is not UTF-8, so compatibility increases a bit.
Please be careful with ExifTool's -j option: if you use it, you are already forcing everyone onto UTF-8.
(My local code page is cp936, and the conclusion above holds for me. I don't know whether bytes in other local code pages might happen to be valid UTF-8 and pass through ExifTool's JSON correctly.)
(Update 2023-07-18: The following is true for reading and writing local-code-page-encoded values and filenames. On the other hand, with the help of -charset filename=utf8, along with setting pyExifTool's encoding to 'utf-8', reading and writing UTF-8 values or filenames is much easier.)
Today, I tried to set the tag 'exif:usercomment' to some Chinese characters from the cmd command line:
exiftool -exif:usercomment=测试 ./test.jpg
("测试" are Chinese characters, clearly beyond the ASCII character set.)
(chcp is cp936)
And it failed:
Warning: Malformed UTF-8 character(s) - ./test.jpg
0 image files updated
1 image files unchanged
So, ExifTool itself does not support writing non-UTF-8 values.
(ExifTool has a -charset option for specifying the encoding, but my local code page cp936 is not on the supported list: https://exiftool.org/exiftool_pod.html#WINDOWS-UNICODE-FILE-NAMES )
However, I found a way to save these non-ASCII characters by encoding them with 'utf-8'. The method is a bit tricky:
- First, I set the exiftool.encoding property to 'utf-8' as you mentioned.
- Then I have to pre-encode the file path with the local code page before passing it to pyExifTool:
filepath_encoded: bytes = filepath.encode(encoding=locale.getpreferredencoding(False))
Otherwise, the filepath will be encoded by pyExifTool with 'utf-8' and cmd will not be able to identify the file.
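That pre-encoding step can be sketched in pure Python (the path here is hypothetical; locale.getpreferredencoding is the stdlib call used above, and no exiftool process is involved):

```python
import locale

filepath = "C:/照片/test.jpg"  # hypothetical path with non-ASCII characters

cp = locale.getpreferredencoding(False)  # e.g. 'cp936' on Chinese Windows
try:
    filepath_encoded = filepath.encode(cp)
    assert isinstance(filepath_encoded, bytes)
except UnicodeEncodeError:
    # The local code page cannot represent these characters at all; in
    # that case only switching the system code page to UTF-8 helps.
    pass
```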
- Then, call the set_tags function of ExifToolHelper, and now UTF-8 values can be set on the file's tags in a non-UTF-8 environment. An error is still triggered if the filepath contains non-ASCII characters:
'utf-8' codec can't decode byte 0xb8 in position 89: invalid start byte
but the value is updated successfully.
Another way to do this and avoid the decode error is to leave the pyExifTool encoding as the local code page, but pre-encode the parameter -exif:usercomment=测试 with 'utf-8' before passing it to pyExifTool. In this case, I can't use ExifToolHelper.set_tags, and I have to handle the parameter formatting and call ExifToolHelper.execute or ExifTool.execute myself. (Tested.)
A single property (exiftool._encoding) for encoding control is not enough, right? And I can't change the encoding while an exiftool instance is running, so I have to keep two exiftool instances in my program to deal with UTF-8 and the local code page at the same time.
(Update 2023-07-18: The following is true if the local code page is not 'utf-8', but we can change ExifTool's encoding for filenames with -charset filename=utf8 to solve the problem simply.)
When it comes to writing filename-related tags, it's another story.
For filename-related tags (file:filename, file:directory), the values have to be encoded with the local code page. If UTF-8-encoded bytes are passed, the resulting filename is garbled (when values contain characters beyond the ASCII character set).
On the other hand, reading filename-related tags via JSON always returns garbled values; users have to use the '-b' option to obtain the original value, as I mentioned before.
(Update 2023-07-18: What I wrote here is based on the misunderstanding that we can't change ExifTool's encoding for filenames to 'utf-8' in a non-UTF-8 environment; with the help of -charset filename=utf8, the things I mentioned here about filename encoding are not necessarily true.)
Overall,
Reading:
- the filepath should be encoded with the local code page
- JSON results should be decoded with 'utf-8'
- filename-related tags (sourcefile, file:filename, file:directory) returned via JSON are always garbled (if the values contain non-ASCII characters that the local code page supports); use '-b' to get them indirectly
- filename-related tags extracted without JSON (not using -j) are OK for non-ASCII values the local code page supports, but pyExifTool uses JSON
- if the filename contains characters unsupported by the local code page, ExifTool simply can't read the file at all, because cmd can't. Nothing can be done unless the user changes the system code page to UTF-8 (the current non-UTF-8 code page doesn't help). Changing the system code page to UTF-8 is possible in the newest versions of Win10 and Win11 (beta).
Writing:
- the filepath should be encoded with the local code page
- for non-filename-related tags, non-ASCII characters encoded with the local code page are not supported by ExifTool (there might be exceptions if ExifTool's -charset option supports the local code page)
- for non-filename-related tags, UTF-8 values can be passed to ExifTool via pyExifTool, so UTF-8 values can be set even in a non-UTF-8 local cmd environment (no need for the local code page to be on ExifTool's -charset list)
- filename-related tags (file:filename, file:directory) should be encoded with the local code page for writing.
For some cases, like filepaths and JSON results, pyExifTool can safely determine which encoding should be used. However, for other cases, like writing filename-related tags, pyExifTool would need extra logic to identify whether a tag being written is filename-related, because the tag passed by the user could be filename, file:filename, or file:system:filename; I don't think such extra tag-identification logic is necessary for pyExifTool.
So, my recommendation is to let users control the encoding precisely. For example, when reading, let users specify which encoding to use to decode the JSON; when writing, let users specify which encoding to use to encode the tags dict; and so on.
If so, users can make use of all the functions provided by ExifToolHelper (get_tags, set_tags). Otherwise, users have to use ExifToolHelper.execute and do the command formatting all by themselves.
(Currently I am writing an ExifTool GUI based on ExifTool via pyExifTool. Encoding on Windows is the biggest challenge I have met. All the information above was tested in my local environment. I hope it helps.)
Thanks a lot.
OK~ I found some good news today.
As the information provided by Phil Harvey here: https://exiftool.org/forum/index.php?topic=9717.0
"UTF-8 is the default character set for tag values. The default for file names depends on your system settings." That's why I was struggling so much with filename reading and writing.
The good news is we can set the charset for filenames to 'utf-8':
-charset filename=utf8
or, for passing to pyExifTool as a single param:
-charset\nfilename=utf8
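The single-param form is just the two arguments joined with a newline; a tiny sketch of the equivalence (pure Python; the -stay_open protocol feeds arguments to exiftool one per line, which is presumably why the joined string works):

```python
# A single string containing '\n' reaches exiftool as two lines of its
# argument stream, i.e. as two separate arguments.
single_param = "-charset\nfilename=utf8"

assert single_param.split("\n") == ["-charset", "filename=utf8"]
```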
So, setting the ExifTool.encoding property to 'utf-8' as you told me is not the full story; we also need to take care of ExifTool's default encoding for filenames.
Now, I can happily use ExifToolHelper's get_tags and set_tags functions with no pre-encoding tricks. All kinds of characters in filenames are supported: non-ASCII, local-code-page supported, even local-code-page unsupported, everything, whether reading or writing. It is just as if my local code page were 'utf-8', with no need to change the system code page.
UTF-8 support in a non-UTF-8 cmd environment can now be realised perfectly.
(However, in terms of local code page support, it's still a challenge. Some things I mentioned earlier remain true: the JSON requires valid 'utf-8', writing non-ASCII characters encoded with the local code page is not always supported by ExifTool (because the -charset support list is limited), and so on.)
@Yang-z this is some good information. I've read through the comments once already but I'm going to try again to understand and see if some improvements can be made to PyExifTool. I really appreciate the investigation, and I look forward to seeing your GUI (if it's public sw)
Anyways, some backstory on the filename and stream encoding. Way back when, in the original PyExifTool, I think it used the system filename encoding, and that broke things for JSON due to the UTF-8 requirement. On most systems, where the file system encoding was the same as or compatible with UTF-8, this wasn't a problem, but weird issues started coming up when it wasn't.
I'll look at the investigation above... it's pretty helpful information. The whole encoding thing is a nightmare to work with
Hi, it's my pleasure to see that it helps. And yes, my GUI is public in my repositories.
Encoding on Windows is really a tough thing. To make things a bit simpler, my current strategy is to stick with 'utf-8' for reading and writing, and to stay compatible with reading non-UTF-8 values via a function called fix_non_utf8_values.
Your work is also much appreciated. Thanks a lot.