[BUG] Import of Windows-1252 encoded file loses prolog and becomes mangled UTF-8

Open ahenket opened this issue 1 year ago • 6 comments

Describe the bug When I import the attached file through oXygen's XML-RPC connection (eXide doesn't let me: different issue), eXist-db loses the prolog that declares the file as Windows-1252, but does not convert the file to UTF-8. So when you reopen it, the XML default encoding UTF-8 applies and all characters outside of ASCII are now broken.

Expected behavior Either keep the encoding of uploaded files, or do an on-the-fly conversion to UTF-8 before committing to the database.
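To make the second option concrete, here is a rough sketch of what an on-the-fly conversion could look like. This is not eXist-db code: `to_utf8` is a hypothetical helper, and it assumes an ASCII-compatible source encoding, no byte order mark, and an XML declaration at the very start of the file.

```python
import re

# Finds the declared encoding in the raw bytes of the XML declaration.
BYTES_DECL = re.compile(rb'^<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')
# Same declaration on the decoded text, split so the encoding name can be swapped.
TEXT_DECL = re.compile(r'^(<\?xml[^>]*?encoding=["\'])([A-Za-z0-9._-]+)(["\'][^>]*\?>)')


def to_utf8(data: bytes) -> bytes:
    """Hypothetical helper: transcode an XML document to UTF-8.

    Assumes an ASCII-compatible source encoding and no BOM. Without an
    encoding declaration the XML default (UTF-8) is used.
    """
    match = BYTES_DECL.match(data)
    declared = match.group(1).decode("ascii") if match else "utf-8"
    text = data.decode(declared)
    # Keep the prolog, but make it state the encoding the bytes now have.
    text = TEXT_DECL.sub(r"\1UTF-8\3", text, count=1)
    return text.encode("utf-8")


sample = b'<?xml version="1.0" encoding="Windows-1252"?><v value="pati\xebnt"/>'
converted = to_utf8(sample)
print(converted.decode("utf-8"))
```

The point of the sketch is that decoding with the declared encoding and rewriting the declaration happen together, so the stored bytes and the prolog never disagree.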

To Reproduce Extract the one file from the zip and upload that file anywhere on your server. Now reopen using oXygen or eXide and look for "pati". The first hit reads "pati�nt" instead of "patiënt" and is in this path: /XMI/XMI.content[1]/UML:Model[1]/UML:Namespace.ownedElement[1]/UML:Package[1]/UML:Namespace.ownedElement[1]/UML:Collaboration[1]/UML:Namespace.ownedElement[1]/UML:ClassifierRole[2]/UML:ModelElement.taggedValue[1]/UML:TaggedValue[1]/@value

nl.zorg.Zwangerschap-v4.1.xmi.zip

There are 27 occurrences of � that were regular Windows-1252 compatible characters before.
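For anyone who wants to see the symptom without the attachment, the following small snippet reproduces the same kind of breakage outside eXist-db. It only illustrates what happens when Windows-1252 bytes are read as UTF-8; whether eXist-db fails in exactly this way internally is an assumption on my part.

```python
# "patiënt" as stored in a Windows-1252 file: ë is the single byte 0xEB.
raw = "patiënt".encode("cp1252")

# Reading those bytes as UTF-8: 0xEB announces a multi-byte sequence, but
# the next byte ("n") is not a continuation byte, so the decoder
# substitutes U+FFFD (the replacement character).
broken = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the prolog declared keeps the text intact.
intact = raw.decode("cp1252")

print(broken)
print(intact)
```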

Environment

Key Value
eXist Version: 6.2.0
eXist Build: 2023-02-04T22:42:29Z
Operating System: Mac OS X 14.6.1 aarch64
Java Version: 21.0.2
Default Encoding: UTF-8

ahenket, Aug 28 '24 15:08