File Handling for Non-English Alphabets

Open elbre opened this issue 1 year ago • 7 comments

Hello,

I would like to bring attention to an issue I encountered while working with Droid, specifically when dealing with files containing characters from alphabets other than English.

Initially, we suspected that the problem might be related to using *.zip files. However, after further investigation, we observed similar issues when generating *.7z files and attempting different export methods.

To assist in resolving this matter, I am attaching the original files, the export, and a screenshot of the application to provide a comprehensive overview.

I am curious to know if there are plans to address these issues in the near future or if there is already a known solution?

czechfilesfail

aaa.txt Jindřich Šťovíček.zip

elbre avatar Dec 04 '23 09:12 elbre

Hi @elbre, thanks for raising the issue and attaching the supporting documents. Unfortunately, there is no known solution for it at present. I'll investigate further and update.

Regards,

sparkhi avatar Dec 04 '23 10:12 sparkhi

Just to note, a workaround might be to unzip the contents here:

image

This might point to something in the archive handling process in general causing the issue, though I'm not sure. The same pattern actually shows up in Siegfried too (cc @richardlehane).

e.g.

Without extracting:

---
filename : 'Jindrich.Stovicek.zip#Jind²ich µ£ovíƒek/ⁿíƒansk∞ k²iτ£ál/ⁿí'
filesize : 0
modified : 2023-11-29T21:59:26Z
errors   : 'empty source'
matches  :
  - ns      : 'pronom'
    id      : 'UNKNOWN'
    format  : 
    version : 
    mime    : 
    class   : 
    basis   : 
    warning : 'no match'
---
filename : 'Jindrich.Stovicek.zip#Jind²ich µ£ovíƒek/µt╪σátko/µ'
filesize : 0
modified : 2023-11-29T22:00:50Z
errors   : 'empty source'
matches  :
  - ns      : 'pronom'
    id      : 'UNKNOWN'
    format  : 
    version : 
    mime    : 
    class   : 
    basis   : 
    warning : 'no match'

With extracting:

---
filename : 'Říčanský křišťál/Říčanský křišťál.txt'
filesize : 0
modified : 2023-11-29T21:59:26+01:00
errors   : 'empty source'
matches  :
  - ns      : 'pronom'
    id      : 'x-fmt/111'
    format  : 'Plain Text File'
    version : 
    mime    : 'text/plain'
    class   : 
    basis   : 'extension match txt'
    warning : 'match on extension only'
---
filename : 'Štěňátko/Štěňátko.txt'
filesize : 0
modified : 2023-11-29T22:00:50+01:00
errors   : 'empty source'
matches  :
  - ns      : 'pronom'
    id      : 'x-fmt/111'
    format  : 'Plain Text File'
    version : 
    mime    : 'text/plain'
    class   : 
    basis   : 'extension match txt'
    warning : 'match on extension only'

Was just interested to take a look at this as we had problems with earlier DROID releases with the Māori language character set, but I had thought they were resolved. I guess we didn't process a lot of zips back in the day!

ross-spencer avatar Dec 04 '23 11:12 ross-spencer

Thank you for the workaround. Unfortunately, we are working on a workflow where ZIP files should also be acceptable.

elbre avatar Dec 04 '23 11:12 elbre

I did a little testing with this today. It looks like the file names within the zip file aren't UTF-8 or IBM437 (the default in the zip spec), but rather have the character encoding IBM852. I'm not really sure how you'd go about reliably detecting this during unzipping (though tools like 7-zip and WinZip seem to manage it so perhaps it is possible?):

image
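As a rough illustration of the encoding mismatch described above, here is a minimal Python sketch. It assumes the reader (like Python's `zipfile`) decodes entry names as CP437 whenever the UTF-8 flag is not set, and that the true encoding is known to be IBM852 (`cp852` in Python). The function names `repair_name` and `list_names` are hypothetical helpers, not part of DROID or Siegfried, and the sketch does not attempt to detect the encoding automatically.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: entry name is stored as UTF-8


def repair_name(name: str, actual_encoding: str = "cp852") -> str:
    """Undo a CP437 mis-decoding and decode the raw bytes with the assumed code page."""
    # Encoding back to CP437 recovers the original bytes from the archive.
    return name.encode("cp437").decode(actual_encoding)


def list_names(archive, actual_encoding: str = "cp852") -> list[str]:
    """Return entry names, re-decoding those that are not flagged as UTF-8."""
    names = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.flag_bits & UTF8_FLAG:
                # Name was stored as UTF-8, so it is already decoded correctly.
                names.append(info.filename)
            else:
                # Assumption: the name bytes are really in actual_encoding.
                names.append(repair_name(info.filename, actual_encoding))
    return names
```

Calling `list_names("Jindrich.Stovicek.zip")` on the attached archive would be the intended use. The hard part raised above remains open: nothing in the archive says which legacy code page was used, so `actual_encoding` has to be supplied or guessed.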

richardlehane avatar Dec 04 '23 17:12 richardlehane

I can also provide material created on the command line with 7-Zip 23.01 (x64) for Windows:

7z a -tzip -scsUTF-8 archiv.zip "Jindřich Šťovíček"

archiv.zip

elbre avatar Dec 06 '23 05:12 elbre

archiv.zip still contains non-UTF-8 filenames. Try the -mcu flag instead...

7z a -tzip -mcu archiv.zip "Jindřich Šťovíček"
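One way to check whether a rebuilt archive really stores UTF-8 names is to inspect the UTF-8 flag (general-purpose bit 11) on each entry. The following is a small sketch using Python's standard `zipfile` module; `utf8_flag_report` is a hypothetical helper name, and the sketch only reports the flag without validating the name bytes themselves.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: entry name is stored as UTF-8


def utf8_flag_report(archive) -> dict[str, bool]:
    """Map each entry name to whether its UTF-8 name flag is set."""
    with zipfile.ZipFile(archive) as zf:
        return {
            info.filename: bool(info.flag_bits & UTF8_FLAG)
            for info in zf.infolist()
        }
```

Entries with ASCII-only names may legitimately have the flag unset, so a `False` value is only a concern for names containing non-ASCII characters.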

richardlehane avatar Dec 06 '23 09:12 richardlehane

Good day once again. I would like to thank you for your comments on this matter, especially regarding the -mcu flag. I was informed that this parameter is missing in the documentation. At this point, I have been told that we can proceed with our project using the information you have provided so far, and from our perspective, the issue can be considered closed.

However, as mentioned earlier, the originally provided source is the default method for creating zip files, and it is highly probable in our region to encounter these files. Therefore, I would prefer to keep the issue open.

elbre avatar Dec 11 '23 09:12 elbre