
ReadableStreamStreamer sometimes breaks UTF-8 characters into ��

jehna opened this issue 5 years ago · 13 comments

Issue

If I pass a CSV stream to Papa.parse that contains special characters, it sometimes breaks the special characters so they show up as e.g. ��.

How to reproduce

See example at: https://repl.it/repls/ArcticTreasuredLine

Press "Run" at the top of the page

What should happen?

The output should contain only the ä character

What happened?

There are random occurrences of ��

Root cause

These two lines are responsible for this issue: https://github.com/mholt/PapaParse/blob/ae73d2a96639beec58a83326de6bd8e8ca0c02b3/papaparse.js#L863 https://github.com/mholt/PapaParse/blob/ae73d2a96639beec58a83326de6bd8e8ca0c02b3/papaparse.js#L506

So when PapaParse reads a chunk, it calls .toString() directly on that chunk.

However, a chunk consists of raw bytes, and some UTF-8 characters are more than one byte long:

  • ä consists of two bytes: 11000011 and 10100100
  • a (and other "regular" characters) is just one byte 01100001

Now if the chunk boundary falls in the middle of a multi-byte character like ä, PapaParse calls toString() on each half of the character separately and produces two replacement characters:

  • 11000011 (from the end of the first chunk) transforms to �
  • 10100100 (from the start of the second chunk) transforms to �
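
To see the problem in isolation, here is a small Node.js demonstration (not from the issue, just an illustration) of splitting a buffer in the middle of the ä and decoding each half on its own:

// Illustration only: split a UTF-8 buffer in the middle of a two-byte character.
const buf = Buffer.from('aä')          // bytes: 61 c3 a4
const firstChunk = buf.subarray(0, 2)  // 61 c3 – ends with the first half of "ä"
const secondChunk = buf.subarray(2)    // a4    – the second half of "ä"

console.log(firstChunk.toString())     // "a�"  (c3 alone is invalid UTF-8)
console.log(secondChunk.toString())    // "�"   (a4 alone is invalid UTF-8)
console.log(Buffer.concat([firstChunk, secondChunk]).toString())  // "aä"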

How to fix this issue?

If the received chunks are raw bytes, the concatenation should be done in bytes too, e.g. using Buffer.concat. PapaParse should not call toString() before it has split the stream into lines, so _partialLine remains a Buffer rather than a string.
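
As a sketch of what that could look like (not PapaParse's actual code; handleChunk and parseLines are just illustrative names), the streamer would keep the undecoded tail as a Buffer and only decode up to the last complete newline:

// Sketch only, assuming Node.js Buffers; not the actual PapaParse implementation.
let partialLine = Buffer.alloc(0)

const handleChunk = chunk => {
  // Concatenate as bytes so a character split across chunks is reassembled.
  const data = Buffer.concat([partialLine, chunk])
  const lastNewline = data.lastIndexOf(0x0a)  // '\n'; never part of a multi-byte character
  if (lastNewline === -1) {
    partialLine = data                        // no complete line yet, keep buffering bytes
    return
  }
  const completeLines = data.subarray(0, lastNewline + 1).toString('utf8')
  partialLine = data.subarray(lastNewline + 1)  // may end mid-character; stays a Buffer
  parseLines(completeLines)                     // hypothetical: hand the complete lines to the parser
}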

jehna avatar Dec 10 '19 08:12 jehna

Hi @jehna,

Thanks for reporting the issue.

What you propose makes sense to me. It would be great if you could provide a PR that adds a test case for this issue and fixes it.

pokoli avatar Dec 10 '19 08:12 pokoli

Workaround

I implemented a workaround for this issue: Use another library to pre-parse the lines as a stream.

I'm using the delimiter-stream npm package, which seems to have a correct implementation of splitting a byte stream into lines:

https://github.com/peterhaldbaek/delimiter-stream/blob/043346af778986d63a7ba0f87b94c3df0bc425d4/delimiter-stream.js#L46

Using this library, you can write a simple wrapper around your stream:

const DelimiterStream = require('delimiter-stream')

const toLineDelimitedStream = input => {
  // Two-byte UTF-8 characters (such as "ä") can break because the chunk might get
  // split in the middle of the character, and papaparse then decodes the byte stream
  // incorrectly. We can use `DelimiterStream` to fix this, as it splits the
  // chunks into lines correctly before passing the data to papaparse.
  const output = new DelimiterStream()
  input.pipe(output)
  return output
}

Using this helper function you can wrap the stream before passing it to Papa.parse:

Papa.parse(toLineDelimitedStream(stream), {
   ...
})

jehna avatar Dec 10 '19 08:12 jehna

Hey @pokoli

It would be great if you could provide a PR that adds a test case for this issue and fixes it.

I'll have to see if I can find some time to spend on this. Not making any promises yet 😄

jehna avatar Dec 10 '19 08:12 jehna

@jehna What would this workaround look like for a browser implementation? Instead of the stream, I'm actually passing the file from a file input field to the Papa.parse function.

nichgalea avatar Jun 01 '20 15:06 nichgalea

@nichgalea if you don't mind the extra memory usage, I believe you can call .text() on the file and pass it to Papa.parse as a string.

So something like:

Papa.parse(await file.text());
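
For completeness, a minimal sketch of that in context (the input element id and the parse options are placeholders, not something from this thread):

// Sketch, assuming an <input type="file" id="csv-input"> element on the page.
document.getElementById('csv-input').addEventListener('change', async event => {
  const file = event.target.files[0]
  const text = await file.text()                      // decoded as UTF-8 in one piece
  const results = Papa.parse(text, { header: true })  // options are illustrative
  console.log(results.data)
})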

jehna avatar Jun 01 '20 17:06 jehna