isic-archive

Handle Unicode characters in filenames inside ZIP archives

Open msmolens opened this issue 7 years ago • 0 comments

This issue is filed for future follow-up on a PR comment: https://github.com/ImageMarkup/isic-archive/pull/349#discussion_r106460476

When a user downloads a ZIP file of images, the images in each dataset are grouped in directories named after the datasets. Because ZIP files traditionally don't have a way to specify the encoding of filenames, the directory name will likely be incorrect once extracted. For example, a dataset whose name contains a Chinese character results in a filename encoded as UTF-8 in the ZIP file, but nothing in the archive tells the extraction tool that the filename bytes should be decoded as UTF-8.
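To illustrate the symptom, here is a minimal sketch of what an extraction tool would produce if it decoded UTF-8 filename bytes with a legacy code page. It assumes the tool falls back to IBM code page 437, the historical ZIP default described in APPNOTE Appendix D; real tools may use a different local code page. The dataset name is a made-up example.

```python
# Hypothetical dataset name containing Chinese characters.
dataset_name = "数据集"

# The bytes that end up in the ZIP entry's filename field.
stored_bytes = dataset_name.encode("utf-8")

# What a tool sees if it assumes code page 437 instead of UTF-8
# (an assumption; the actual fallback encoding varies by tool and locale).
garbled_name = stored_bytes.decode("cp437")

print(repr(dataset_name), "->", repr(garbled_name))
```

The bytes themselves are intact; only the interpretation differs, which is why an explicit encoding marker in the archive is enough to fix the problem.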

It's possible that setting a flag in the ZIP header might address this, as many tools apparently now recognize this flag. More specifically, see the "language encoding" flag (bit 11 of the general purpose bit flag) in https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT (keywords: "Bit 11", "Appendix D").
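As a rough way to check whether a generated archive carries this flag, the sketch below uses Python's standard `zipfile` module, which sets bit 11 on its own when an entry name is not pure ASCII. This is only an illustration: whatever library isic-archive uses to build its downloads may behave differently, and the helper names and entry names here are hypothetical.

```python
import io
import zipfile

# Bit 11 of the general purpose bit flag: the "language encoding" (UTF-8) flag.
UTF8_FLAG = 0x800


def build_zip(names):
    """Build an in-memory ZIP with one placeholder entry per name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"placeholder")
    return buf.getvalue()


def utf8_flags(zip_bytes):
    """Map each entry name to whether bit 11 is set in its header."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {
            info.filename: bool(info.flag_bits & UTF8_FLAG)
            for info in zf.infolist()
        }


# Hypothetical entry names: one ASCII-only, one with Chinese characters.
flags = utf8_flags(build_zip(["dataset/image.jpg", "数据集/image.jpg"]))
for name, has_flag in flags.items():
    print(name, "UTF-8 flag set" if has_flag else "UTF-8 flag not set")
```

The same `utf8_flags` check could be run against a ZIP produced by the download endpoint to see whether its entries already set the flag.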

msmolens, Mar 16 '17 19:03