UnicodeEncodeError with walk_to_dfxml.py

Open tw4l opened this issue 8 years ago • 24 comments

Creating DFXML with walk_to_dfxml.py (called with Python 3.5.2 in Ubuntu 16.04) on two mounted disk images resulted in the following errors:

Traceback (most recent call last):
  File "/usr/share/dfxml/python/walk_to_dfxml.py", line 182, in <module>
    main()
  File "/usr/share/dfxml/python/walk_to_dfxml.py", line 168, in main
    dobj.print_dfxml()
  File "/usr/share/dfxml/python/Objects.py", line 219, in print_dfxml
    output_fh.write(_ET_tostring(e))
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc9a' in position 30: surrogates not allowed

Traceback (most recent call last):
  File "/usr/share/dfxml/python/walk_to_dfxml.py", line 182, in <module>
    main()
  File "/usr/share/dfxml/python/walk_to_dfxml.py", line 168, in main
    dobj.print_dfxml()
  File "/usr/share/dfxml/python/Objects.py", line 219, in print_dfxml
    output_fh.write(_ET_tostring(e))
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc8a' in position 33: surrogates not allowed
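
For readers following along: '\udc9a' and '\udc8a' are lone surrogates, which is how Python 3 represents bytes (here 0x9A and 0x8A) that could not be decoded as UTF-8 when file names were read from the OS (the surrogateescape error handler). The sketch below reproduces the same error outside walk_to_dfxml.py. The sample name and variable names are illustrative assumptions, not taken from the script.

```python
# Hypothetical raw file name: ASCII plus one byte (0x8A) that is not valid UTF-8.
raw_name = b"BarcelonaBr\x8anda 1"

# On a UTF-8 locale, os.fsdecode() and os.walk() decode names this way.
name = raw_name.decode("utf-8", errors="surrogateescape")
assert name[11] == "\udc8a"  # byte 0x8A became the lone surrogate U+DC8A

# A strict UTF-8 encode, as done when writing to a UTF-8 text stream, fails...
try:
    name.encode("utf-8")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc.reason

# ...while encoding with the same handler restores the original bytes.
round_trip = name.encode("utf-8", errors="surrogateescape")
```

So the name survives inside Python; it is only the final strict encode of the XML text that rejects it.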

tw4l avatar May 11 '17 16:05 tw4l

Interesting. Thank you for providing the problem characters/bytes. I assume you have the whole file name in your data; you don't need to post it here, but it would also be helpful to know what character encoding does work with that file name.

Making a wild guess (from the options I know about), does cp863 get you what you wanted? My first guesses look doubtful.

>>> x = b"\xdc\x8a".decode("utf8")
>>> x
'܊'
>>> x = b"\xdc\x8a".decode("iso-8859-1")
>>> x
'Ü\x8a'
>>> x = b"\xdc\x8a".decode("cp1252")
>>> x
'ÜŠ'
>>> x = b"\xdc\x8a".decode("cp863")
>>> x
'܊'

(Except the first, those render in my browser like they did in my terminal. The first looked like a dotted 'T'.)

This is all a preamble to the real design problem: how to handle odd encodings while preserving the original bytes.

ajnelson-nist avatar May 11 '17 17:05 ajnelson-nist

Hi Alex. Thanks for the quick response! Python's chardet is showing about a 74% probability that the encoding is ISO-8859-2, but when I try to decode I get the same result you're getting for ISO-8859-1, which probably isn't right.

The disk in question (for the first example) is an HFS disk, and when I mount the disk the directory in question displays as "BarcelonaBr�nda 1 (invalid encoding)"
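
Since the disk is HFS, the name bytes may be in a classic Mac OS encoding rather than an ISO or DOS code page. As a hedged sketch, the snippet below undoes the surrogate escaping for one of the characters from the tracebacks and decodes the resulting single byte under a few candidate codecs. The candidate list is a guess for illustration; nothing in this thread confirms which codec is correct.

```python
# One of the escaped characters reported in the tracebacks.
escaped = "\udc8a"

# Undo surrogateescape to get the byte that was actually on disk.
raw = escaped.encode("utf-8", errors="surrogateescape")

# Candidate codecs are assumptions; none is confirmed as the right one.
candidates = ["mac_roman", "cp863", "cp1252", "iso-8859-1", "iso-8859-2"]
decoded = {codec: raw.decode(codec) for codec in candidates}

for codec, text in decoded.items():
    # repr() keeps control characters visible instead of printing them raw.
    print(f"{codec:12} {text!r}")
```

Note that the escaped character corresponds to one byte, not two, so decoding experiments should start from that single byte.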

tw4l avatar May 11 '17 19:05 tw4l

The fun thing about this is that I don't know what the character should be, haha. This is from a London-based architectural firm that worked all over the world with various partners.

tw4l avatar May 11 '17 19:05 tw4l

What's likely happening is that there is a filename that contains an invalid Unicode encoding. This is supposed to be automatically quoted, but perhaps the code isn't working on this example.

Can we figure out the specific file that is causing the problem?
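
One way to locate such names, sketched here under the assumption that the image is mounted on a POSIX system: walk the tree with bytes paths so that nothing is decoded, and report every name that is not valid UTF-8. Both function names are hypothetical and are not part of the DFXML bindings.

```python
import os


def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw file name decodes cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_bad_names(root: bytes):
    """Yield raw paths under root whose final component is not valid UTF-8."""
    # Passing bytes to os.walk makes it return bytes, so no decoding happens.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                yield os.path.join(dirpath, name)
```

Calling find_bad_names(b"/path/to/mount") and printing repr() of each result shows the offending bytes as \x.. escapes.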

simsong avatar May 11 '17 20:05 simsong

Hi Simson! I think we have the file pinned down. We're just scratching our heads at a graceful way to pick an encoding. I've been scratching my head at a graceful way to record that choice and the original bytes. The hfs2dfxml project has a similar challenge (https://github.com/cul-it/hfs2dfxml/issues/11) for file name representations because they're encoding HFS (not HFS+) names, which allow any byte but \0 in a file name.

I've been leaning towards this XML output style for any file name that doesn't translate directly to UTF-8 (using Tim's example here):

<filename original_bytes_base64="(base64 representation)" decoding="cp863">BarcelonaBr܊nda</filename>

This enforces that DFXML would only ever store UTF-8 data, but you could get back to original name bytes if you wanted. The soft link reference contents may need some other special handling in case a link names an uncommon-character file.
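
A minimal sketch of how such an element could be produced with ElementTree. The attribute names come from the proposal above; the function, its parameters, and the choice of fallback codec are hypothetical and are not part of Objects.py.

```python
import base64
import xml.etree.ElementTree as ET


def filename_element(raw_name: bytes, fallback_codec: str) -> ET.Element:
    """Build a <filename> element in the proposed style (hypothetical helper)."""
    elem = ET.Element("filename")
    try:
        # Names that are already valid UTF-8 need no extra attributes.
        elem.text = raw_name.decode("utf-8")
    except UnicodeDecodeError:
        # Assumes fallback_codec can decode every byte in the name.
        elem.text = raw_name.decode(fallback_codec)
        elem.set("original_bytes_base64",
                 base64.b64encode(raw_name).decode("ascii"))
        elem.set("decoding", fallback_codec)
    return elem


elem = filename_element(b"BarcelonaBr\x8anda 1", "cp863")
xml_text = ET.tostring(elem, encoding="unicode")
```

Because the element text is always a proper Unicode string, the serialized XML can be written as UTF-8 without hitting the surrogate error, and the Base64 attribute keeps the exact bytes recoverable.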

ajnelson-nist avatar May 11 '17 20:05 ajnelson-nist

Tim, would you mind sharing the default encoding of the environment where you hit this hiccup with walk_to_dfxml? That is, what does echo $LANG print?

ajnelson-nist avatar May 11 '17 20:05 ajnelson-nist

Hi Alex! Sure thing: result is en_US.UTF-8

tw4l avatar May 11 '17 20:05 tw4l

Thanks, Tim. I had it at about a 1% chance that you'd have something more exotic in that environment. Ah well.

I didn't know about chardet. For my curiosity, if you run this:

find "$root_of_your_mounted_image" | sort > ~/test_all_file_names.txt

What does chardet say about the encoding of ~/test_all_file_names.txt? ISO-8859-2?

ajnelson-nist avatar May 11 '17 20:05 ajnelson-nist

I thought that the system just used hex encoding for names that can't be represented?

simsong avatar May 11 '17 20:05 simsong

I don't see that dfxml_tool.py or the dfxml.py bindings do hex encoding. I know Objects.py doesn't. Maybe fiwalk does? I vaguely recall that something did hex encoding, but I also recall there wasn't a record that byte transformations happened.

Tim, does this particular disk run through fiwalk? If so, what happens with that Barcelona file's name?

ajnelson-nist avatar May 11 '17 21:05 ajnelson-nist

Interestingly, chardet says that the encoding of test_all_file_names.txt is ascii. If I open the file in vim, the directory is represented as "/mnt/diskid/BarcelonaBr<8a>nda 1".

Unfortunately, since it's an HFS disk, fiwalk isn't able to represent any of the files as FileObjects.

tw4l avatar May 11 '17 21:05 tw4l

Okay. This may be the issue. Where is the DFXML coming from? If it is not coming from fiwalk, it's coming from somewhere else, and that other program may not be doing the proper quoting.

simsong avatar May 11 '17 21:05 simsong

Yes, it is fiwalk, which does the DFXML output, but other tools may generate DFXML as well.

simsong avatar May 11 '17 21:05 simsong

Hi Simson. Thanks for looking into this. It's the walk_to_dfxml.py script included in the Python bindings that is generating the DFXML.

tw4l avatar May 11 '17 21:05 tw4l

It looks like line 66 of walk_to_dfxml.py assigns the filepath to fobj.filename. Elsewhere, it looks like the filename attribute may be emitted without being quoted. I think that Alex should be able to fix this more easily than I can.

simsong avatar May 11 '17 22:05 simsong

Bumping in case it got lost in the shuffle. Alex, do you have time/interest to look into this? Thanks!

tw4l avatar Jun 15 '17 20:06 tw4l

I will be addressing this problem. I think it's going to touch a few areas in the API and possibly the schema.

ajnelson-nist avatar Jun 15 '17 20:06 ajnelson-nist

Hi Alex, sorry to be a bother, but any chance you've been able to take a look at this? We're running into Unicode errors quite frequently with walk_to_dfxml.py and a collection of disks of Scandinavian origin. Thanks!

tw4l avatar Jan 15 '18 21:01 tw4l

Hi Tim,

How is the XML being generated: from the fiwalk executable, or from a Python program? We probably need to revise the way that the XML is being generated.

simsong avatar Jan 15 '18 21:01 simsong

Hi Simson. The XML is being generated by the walk_to_dfxml.py script included in the Python bindings. In my particular case, that script is being called from a bash terminal against files carved from an HFS disk image, for which fiwalk isn't able to gather FileObject metadata. The walk_to_dfxml.py script gives us a nice workaround for those HFS disks, with the exception of when we run into the UnicodeEncodeError.

tw4l avatar Jan 15 '18 22:01 tw4l

Okay, I didn’t write Objects.py, and it is quite complex. It’s not really clear to me how it is generating the XML. It looks like it is using ETree. However, it looks like it is trying to fake things rather than completely embracing ETree. It’s doing this to be memory efficient, according to the documentation. So I’m not sure what to do. Do you know precisely what’s broken in the XML being generated?

simsong avatar Jan 15 '18 22:01 simsong

Dianne has also raised this issue with me recently, also for HFS disks. I think the solution I described on May 11 will make the most sense to implement. I'll be sure to discuss the API with at least you and her.

ajnelson-nist avatar Jan 16 '18 15:01 ajnelson-nist

Ah. Thanks for pointing me at the message history. Yes, I think that the mechanism described of having the original name with BASE64 and then having a human-usable name makes sense. Other code will need to be modified to use original name if present. Put that in dfxml.py, I guess, in the filename() method if we have one...
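
A hedged sketch of what "use original name if present" could look like on the reading side. The helper name is hypothetical and is not an existing method in dfxml.py; the attribute name is the one proposed earlier in this thread.

```python
import base64
import xml.etree.ElementTree as ET


def original_name_bytes(elem: ET.Element) -> bytes:
    """Return the on-disk name bytes for a <filename> element.

    Hypothetical helper: prefers the proposed original_bytes_base64
    attribute and falls back to the UTF-8 encoding of the element text.
    """
    encoded = elem.get("original_bytes_base64")
    if encoded is not None:
        return base64.b64decode(encoded)
    return (elem.text or "").encode("utf-8")


# Illustrative elements: one with the proposed attributes, one without.
with_attr = ET.fromstring(
    '<filename original_bytes_base64="YYpi" decoding="cp863">aèb</filename>'
)
plain = ET.fromstring("<filename>plain.txt</filename>")
```

Code that needs to reopen the file on disk would use the returned bytes, while display code keeps using the element text.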

simsong avatar Jan 16 '18 17:01 simsong

I agree that the solution described on May 11 makes sense. Thanks!

tw4l avatar Jan 16 '18 17:01 tw4l