[BUG] Import of Windows-1252 encoded file loses prolog and becomes mangled UTF-8
Describe the bug When I import the attached file through oXygen's XML-RPC connection (eXide doesn't let me; that is a different issue), eXist-db loses the prolog declaring that the file is Windows-1252 encoded, but does not convert the file to UTF-8. When you reopen the file, it is read with the XML default encoding UTF-8, and all characters outside of ASCII are broken.
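The symptom is consistent with Windows-1252 bytes being read back as UTF-8. The following is a minimal Python sketch of that mechanism only; it is an illustration, not eXist-db code, and the variable names are made up for the example.

```python
# "ë" is the single byte 0xEB in Windows-1252.
original = "patiënt"
stored = original.encode("cp1252")

# Without the prolog, a reader falls back to UTF-8. 0xEB is not a valid
# UTF-8 sequence here, so it is replaced by U+FFFD (shown as "�").
misread = stored.decode("utf-8", errors="replace")

# Decoding with the encoding the prolog declared recovers the text.
correct = stored.decode("cp1252")

print(stored, misread, correct)
```

If this is indeed what happens, the original bytes would still be intact in the database and only the encoding declaration would be missing, but that is an assumption I have not verified.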
Expected behavior Either keep the encoding of uploaded files, or do an on-the-fly conversion before committing to the database.
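To make the second option concrete, here is a rough Python sketch of an on-the-fly conversion. The helper `to_utf8` is hypothetical and says nothing about how eXist-db works internally; it also assumes an ASCII-compatible source encoding (so not UTF-16) and an XML declaration at the very start of the file.

```python
import re

# Matches an XML declaration at the start of the document and captures
# the value of its encoding pseudo-attribute.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\'][^>]*\?>'
)


def to_utf8(raw: bytes) -> bytes:
    """Transcode an XML document to UTF-8 using its declared encoding.

    Hypothetical helper for illustration only.
    """
    match = _DECL.match(raw)
    # XML defaults to UTF-8 when no encoding is declared.
    declared = match.group(1).decode("ascii") if match else "utf-8"
    text = raw.decode(declared)
    if match:
        # Rewrite the declaration so it matches the new byte encoding.
        text = re.sub(
            r'(encoding\s*=\s*["\'])[^"\']+', r"\g<1>UTF-8", text, count=1
        )
    return text.encode("utf-8")
```

For example, passing in `'<?xml version="1.0" encoding="windows-1252"?><a value="patiënt"/>'.encode("cp1252")` yields UTF-8 bytes whose prolog declares `UTF-8` and whose attribute value is still `patiënt`.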
To Reproduce Extract the one file from the zip and upload that file anywhere on your server. Now reopen using oXygen or eXide and look for "pati". The first hit reads "pati�nt" instead of "patiënt" and is in this path: /XMI/XMI.content[1]/UML:Model[1]/UML:Namespace.ownedElement[1]/UML:Package[1]/UML:Namespace.ownedElement[1]/UML:Collaboration[1]/UML:Namespace.ownedElement[1]/UML:ClassifierRole[2]/UML:ModelElement.taggedValue[1]/UML:TaggedValue[1]/@value
nl.zorg.Zwangerschap-v4.1.xmi.zip
There are 27 occurrences of � that were regular Windows-1252 compatible characters before the import.
Environment
| Key | Value |
|---|---|
| eXist Version: | 6.2.0 |
| eXist Build: | 2023-02-04T22:42:29Z |
| Operating System: | Mac OS X 14.6.1 aarch64 |
| Java Version: | 21.0.2 |
| Default Encoding: | UTF-8 |