enketo-express
enketo-express copied to clipboard
Silent data loss when response contains characters invalid for XML PCDATA
Hi @MartijnR, some people are copying text from PDFs that contains old ASCII control characters and pasting it as responses to questions. I'm guessing that, depending on their platform, these characters might be invisible to them.
The problem is that these characters break submissions and draft saving/loading, but there's not an indication of an error until it's too late to retrieve the responses already entered.
Form used to test (single text question)
<?xml version="1.0" encoding="utf-8"?>
<h:html xmlns="http://www.w3.org/2002/xforms" xmlns:ev="http://www.w3.org/2001/xml-events" xmlns:h="http://www.w3.org/1999/xhtml" xmlns:jr="http://openrosa.org/javarosa" xmlns:odk="http://www.opendatakit.org/xforms" xmlns:orx="http://openrosa.org/xforms" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<h:head>
<h:title>text</h:title>
<model>
<instance>
<text id="text">
<formhub>
<uuid/>
</formhub>
<text/>
<meta>
<instanceID/>
</meta>
</text>
</instance>
<bind nodeset="/text/text" type="string"/>
<bind calculate="concat('uuid:', uuid())" nodeset="/text/meta/instanceID" readonly="true()" type="string"/>
<bind calculate="'dbfdf102b9e74a69abdcc8527a702275'" nodeset="/text/formhub/uuid" type="string"/>
</model>
</h:head>
<h:body>
<input ref="/text/text">
<label>text</label>
</input>
</h:body>
</h:html>
Reproduce via drafts
- Open the form;
- Paste something funky as the text response, e.g.
whatthehey
;- :information_source: It looks like GitHub is not preserving the characters. Perhaps generate the text response with
console.log('what' + String.fromCharCode(4) + 'the' + String.fromCharCode(28) + 'hey')
;
- :information_source: It looks like GitHub is not preserving the characters. Perhaps generate the text response with
- Save the submission as a draft;
- Notice no errors in the UI or developer console;
- Attempt to load the draft;
- See "Loading Error…Error trying to parse XML record. Invalid XML source".
Reproduce via submission
- Use Google Chrome†;
- Create funky text response as above;
- Submit;
- See
parseXML error
in the console; - See "text - 1 was successfully submitted" in the UI;
- Notice (using the browser's developer tools) that the
POST
ed form data contains a valid root node and form UUID:------WebKitFormBoundaryBGt3di99CAJQp9hF Content-Disposition: form-data; name="xml_submission_file"; filename="xml_submission_file" Content-Type: text/xml <text xmlns:jr="http://openrosa.org/javarosa" xmlns:odk="http://www.opendatakit.org/xforms" xmlns:orx="http://openrosa.org/xforms" id="text"><parsererror xmlns="http://www.w3.org/1999/xhtml" style="display: block; white-space: pre; border: 2px solid #c77; padding: 0 1em 0 1em; margin: 1em; background-color: #fdd; color: black"><h3>This page contains the following errors:</h3><div style="font-family:monospace;font-size:12px">error on line 5 at column 21: PCDATA invalid Char value 4</div><h3>Below is a rendering of the page up to the first error.</h3></parsererror> <formhub> <uuid>dbfdf102b9e74a69abdcc8527a702275</uuid> </formhub> <text/></text> ------WebKitFormBoundaryBGt3di99CAJQp9hF--
- Notice, at least in the case of KoBoCAT, that this incomplete XML is indeed accepted as a valid submission, and a 201 response is returned to Enketo;
- Observe that Enketo has deleted the submission from browser storage, as is typical after a successful upload.
† Firefox renders the XML parsererror
differently, in a way that doesn't resemble valid submission XML and isn't accepted by KoBoCAT.
Thanks. What do you think the best solution is? Some ideas:
- Should we perhaps just strip these characters before storing text (type=string binds) into Enketo's model?
- Or should we prevent pasting such text (like we do with mask in number inputs)?
I'm guessing the latter will leave users very puzzled about why it won't allow these characters with no idea on how to resolve it (even with a helpful error message).
I would vote option 1 definitely. Just strip these characters, no need to preserve them.
Makes sense. If KoBo could take this on, I'd welcome a PR with a test(s) and I'll review.
Assuming these characters don't cause any issues before then, the location for the fix would likely be in here: https://github.com/enketo/enketo-core/blob/master/src/js/form-model.js#L1444
A third option would be:
- detect paste event and remove those characters before letting paste proceed. This will prevent them in the input/textarea element itself.
I think this is less preferable than doing it in the model though, because avoiding new event handlers will lead to fewer bugs. So, it's a backup solution if these characters also cause problems in HTML (in which case mask.js probably has useful code to reuse).
Update: I tested it with a source PDF and was able to reproduce the issue without needing to generate invalid chars in the console. I copied the text into this file: invalid characters.txt
The issue stems from the fact that ligature letters (like the combined ff or fi characters) are copied as these ASCII control characters. Here is the same abstract as in the text file with the ligatures highlighted:
Stripping these characters out seems like an alright real-world solution and may very well save other people pain down the line—but it isn't really Enketo's job, and I don't get a warm, fuzzy feeling about silently modifying the inputted responses. Can we escape the characters in the same way <
, >
, and friends are handled?
Do you mean replace them with a different valid character, like � (U+FFFD, for replacements)?
I would argue that we can safely strip them because 1) these are never text that a user can enter, only unintentional characters that can't be used for data analysis, 2) EE already strips similar things like rich text formatting or images that may have been copied, 3) they are being stripped by similar user-level applications such as MS Word.
Isn't it possible that preserving them could lead to more difficulties down the road
OK, it looks like escaping these is a no-go because they're simply not valid in XML, even in their escaped forms. This may look less scary, but it's still rejected in XML 1.0: whatthehey
. As Tim Bray writes about these control characters, "there was no interoperability around this junk":
That was a long time ago, but my best recollection was that they have no graphical representation and also no agreed-upon semantics. Picking a couple at random we see U+0006 "Acknowledge" or U+0016 "Synchronous idle"... what do those mean? Unicode doesn't say. Even back when everyone claimed to support ASCII, there was no interoperability around this junk. XML is supposed to be about interoperability. (https://stackoverflow.com/a/503891/2402324)
I guess stripping out the characters is the winning approach, then :shrug: We'll need to trash anything not included in this range:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Somewhat pedantic responses (sorry) to Tino below, that I wrote before finding the references above:
Do you mean replace them with a different valid character, like � (U+FFFD, for replacements)?
Sort of. Not replace with a single character, but escape. For example, If I write the text response 6 < 7, red & green
, something in Enketo escapes the <
and &
characters so that the response sent in XML becomes 6 < 7, red & green
.
EE already strips similar things like rich text formatting or images that may have been copied
The browser does that, not EE.
they are being stripped by similar user-level applications such as MS Word
OK, that surprises me, but it's a good example :+1:
Isn't it possible that preserving them could lead to more difficulties down the road
Yes, absolutely. That's what I meant by
Stripping these characters out…may very well save other people pain down the line
EE already strips similar things like rich text formatting or images that may have been copied
The browser does that, not EE.
That's a good point. Would be great if the browser stripped the control characters out as well when pasting. I guess we've documented now why we have to strip them in EE? 🤷
سوف اذهب شكرا