enketo-express icon indicating copy to clipboard operation
enketo-express copied to clipboard

Silent data loss when response contains characters invalid for XML PCDATA

Open jnm opened this issue 4 years ago • 10 comments

Hi @MartijnR, some people are copying text from PDFs that contains old ASCII control characters and pasting it as responses to questions. I'm guessing that, depending on their platform, these characters might be invisible to them.

The problem is that these characters break submissions and draft saving/loading, but there's not an indication of an error until it's too late to retrieve the responses already entered.

Form used to test (single text question)

<?xml version="1.0" encoding="utf-8"?>
<h:html xmlns="http://www.w3.org/2002/xforms" xmlns:ev="http://www.w3.org/2001/xml-events" xmlns:h="http://www.w3.org/1999/xhtml" xmlns:jr="http://openrosa.org/javarosa" xmlns:odk="http://www.opendatakit.org/xforms" xmlns:orx="http://openrosa.org/xforms" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <h:head>
    <h:title>text</h:title>
    <model>
      <instance>
        <text id="text">
          <formhub>
            <uuid/>
          </formhub>
          <text/>
          <meta>
            <instanceID/>
          </meta>
        </text>
      </instance>
      <bind nodeset="/text/text" type="string"/>
      <bind calculate="concat('uuid:', uuid())" nodeset="/text/meta/instanceID" readonly="true()" type="string"/>
      <bind calculate="'dbfdf102b9e74a69abdcc8527a702275'" nodeset="/text/formhub/uuid" type="string"/>
    </model>
  </h:head>
  <h:body>
    <input ref="/text/text">
      <label>text</label>
    </input>
  </h:body>
</h:html>

Reproduce via drafts

  1. Open the form;
  2. Paste something funky as the text response, e.g. whatthehey;
    • :information_source: It looks like GitHub is not preserving the characters. Perhaps generate the text response with console.log('what' + String.fromCharCode(4) + 'the' + String.fromCharCode(28) + 'hey');
  3. Save the submission as a draft;
  4. Notice no errors in the UI or developer console;
  5. Attempt to load the draft;
  6. See "Loading Error…Error trying to parse XML record. Invalid XML source".

Reproduce via submission

  1. Use Google Chrome†;
  2. Create funky text response as above;
  3. Submit;
  4. See parseXML error in the console;
  5. See "text - 1 was successfully submitted" in the UI;
  6. Notice (using the browser's developer tools) that the POSTed form data contains a valid root node and form UUID:
    ------WebKitFormBoundaryBGt3di99CAJQp9hF
    Content-Disposition: form-data; name="xml_submission_file"; filename="xml_submission_file"
    Content-Type: text/xml
    
    <text xmlns:jr="http://openrosa.org/javarosa" xmlns:odk="http://www.opendatakit.org/xforms" xmlns:orx="http://openrosa.org/xforms" id="text"><parsererror xmlns="http://www.w3.org/1999/xhtml" style="display: block; white-space: pre; border: 2px solid #c77; padding: 0 1em 0 1em; margin: 1em; background-color: #fdd; color: black"><h3>This page contains the following errors:</h3><div style="font-family:monospace;font-size:12px">error on line 5 at column 21: PCDATA invalid Char value 4</div><h3>Below is a rendering of the page up to the first error.</h3></parsererror>
              <formhub>
                <uuid>dbfdf102b9e74a69abdcc8527a702275</uuid>
              </formhub>
              <text/></text>
    ------WebKitFormBoundaryBGt3di99CAJQp9hF--
    
  7. Notice, at least in the case of KoBoCAT, that this incomplete XML is indeed accepted as a valid submission, and a 201 response is returned to Enketo;
  8. Observe that Enketo has deleted the submission from browser storage, as is typical after a successful upload.

† Firefox renders the XML parsererror differently, in a way that doesn't resemble valid submission XML and isn't accepted by KoBoCAT.

jnm avatar Jun 24 '20 00:06 jnm

Thanks. What do you think the best solution is? Some ideas:

  • Should we perhaps just strip these characters before storing text (type=string binds) into Enketo's model?
  • Or should we prevent pasting such text (like we do with mask in number inputs)?

I'm guessing the latter will leave users very puzzled about why it won't allow these characters with no idea on how to resolve it (even with a helpful error message).

MartijnR avatar Jun 24 '20 15:06 MartijnR

I would vote option 1 definitely. Just strip these characters, no need to preserve them.

tinok avatar Jun 24 '20 16:06 tinok

Makes sense. If KoBo could take this on, I'd welcome a PR with a test(s) and I'll review.

Assuming these characters don't cause any issues before then, the location for the fix would likely be in here: https://github.com/enketo/enketo-core/blob/master/src/js/form-model.js#L1444

MartijnR avatar Jun 24 '20 16:06 MartijnR

A third option would be:

  • detect paste event and remove those characters before letting paste proceed. This will prevent them in the input/textarea element itself.

I think this is less preferable than doing it in the model though, because avoiding new event handlers will lead to fewer bugs. So, it's a backup solution if these characters also cause problems in HTML (in which case mask.js probably has useful code to reuse).

MartijnR avatar Jun 24 '20 16:06 MartijnR

Update: I tested it with a source PDF and was able to reproduce the issue without needing to generate invalid chars in the console. I copied the text into this file: invalid characters.txt

The issue stems from the fact that ligature letters (like the combined ff or fi characters) are copied as these ASCII control characters. Here is the same abstract as in the text file with the ligatures highlighted: image

tinok avatar Jun 24 '20 18:06 tinok

Stripping these characters out seems like an alright real-world solution and may very well save other people pain down the line—but it isn't really Enketo's job, and I don't get a warm, fuzzy feeling about silently modifying the inputted responses. Can we escape the characters in the same way <, >, and friends are handled?

jnm avatar Jun 26 '20 18:06 jnm

Do you mean replace them with a different valid character, like � (U+FFFD, for replacements)?

I would argue that we can safely strip them because 1) these are never text that a user can enter, only unintentional characters that can't be used for data analysis, 2) EE already strips similar things like rich text formatting or images that may have been copied, 3) they are being stripped by similar user-level applications such as MS Word.

Isn't it possible that preserving them could lead to more difficulties down the road

tinok avatar Jun 26 '20 18:06 tinok

OK, it looks like escaping these is a no-go because they're simply not valid in XML, even in their escaped forms. This may look less scary, but it's still rejected in XML 1.0: what&#x0004;the&#x001c;hey. As Tim Bray writes about these control characters, "there was no interoperability around this junk":

That was a long time ago, but my best recollection was that they have no graphical representation and also no agreed-upon semantics. Picking a couple at random we see U+0006 "Acknowledge" or U+0016 "Synchronous idle"... what do those mean? Unicode doesn't say. Even back when everyone claimed to support ASCII, there was no interoperability around this junk. XML is supposed to be about interoperability. (https://stackoverflow.com/a/503891/2402324)

I guess stripping out the characters is the winning approach, then :shrug: We'll need to trash anything not included in this range:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Somewhat pedantic responses (sorry) to Tino below, that I wrote before finding the references above:

Do you mean replace them with a different valid character, like � (U+FFFD, for replacements)?

Sort of. Not replace with a single character, but escape. For example, If I write the text response 6 < 7, red & green, something in Enketo escapes the < and & characters so that the response sent in XML becomes 6 &lt; 7, red &amp; green.

EE already strips similar things like rich text formatting or images that may have been copied

The browser does that, not EE.

they are being stripped by similar user-level applications such as MS Word

OK, that surprises me, but it's a good example :+1:

Isn't it possible that preserving them could lead to more difficulties down the road

Yes, absolutely. That's what I meant by

Stripping these characters out…may very well save other people pain down the line

jnm avatar Jul 01 '20 18:07 jnm

EE already strips similar things like rich text formatting or images that may have been copied

The browser does that, not EE.

That's a good point. Would be great if the browser stripped the control characters out as well when pasting. I guess we've documented now why we have to strip them in EE? 🤷

tinok avatar Jul 02 '20 19:07 tinok

سوف اذهب شكرا

gido3823 avatar Aug 26 '22 01:08 gido3823