PapaParse
ReadableStreamStreamer sometimes breaks UTF8-characters to ��
Issue
If I pass a CSV stream to Papa.parse that contains special characters, it sometimes breaks the special characters so they show up as e.g. ��.
How to reproduce
See example at: https://repl.it/repls/ArcticTreasuredLine
Press "Run" at the top of the page
What should happen?
The output should contain only the `ä` character
What happened?
There are random occurrences of ��
Root cause
These two lines are responsible for this issue: https://github.com/mholt/PapaParse/blob/ae73d2a96639beec58a83326de6bd8e8ca0c02b3/papaparse.js#L863 https://github.com/mholt/PapaParse/blob/ae73d2a96639beec58a83326de6bd8e8ca0c02b3/papaparse.js#L506
When PapaParse reads a chunk, it directly calls `.toString()` on that chunk.
However, a chunk consists of bytes, and some UTF-8 characters are two bytes long:
- `ä` consists of two bytes: `11000011` and `10100100`
- `a` (and other "regular" characters) is just one byte: `01100001`
Now if a chunk boundary falls in the middle of a multi-byte character like `ä`, PapaParse calls `toString()` on each part of the character separately and produces two replacement characters:
- `11000011` (from the end of the first chunk) becomes �
- `10100100` (from the start of the second chunk) becomes �
How to fix this issue?
Since the chunks are received as bytes, the concatenation should be done on bytes too, e.g. using `Buffer.concat`. PapaParse should not call `toString()` before it has split the stream into lines, so `_partialLine` remains a Buffer rather than a string.
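A minimal sketch of that approach (hypothetical helper names, not PapaParse's actual internals): keep the leftover bytes after the last newline as a Buffer, prepend them to the next chunk with `Buffer.concat`, and only call `toString()` on complete lines.

```javascript
// Sketch of byte-safe chunk handling; names are illustrative only.
function makeChunkSplitter() {
  let partial = Buffer.alloc(0); // leftover bytes, kept as a Buffer

  // Feed one chunk; returns the complete lines it contains, as strings.
  return function onChunk(chunk) {
    const data = Buffer.concat([partial, chunk]);
    const lastNewline = data.lastIndexOf(0x0a); // byte value of '\n'
    if (lastNewline === -1) {
      partial = data; // no complete line yet, keep everything as bytes
      return [];
    }
    partial = data.slice(lastNewline + 1);
    // Only complete lines are decoded, so a multi-byte character can
    // never be cut in half by a chunk boundary.
    return data.slice(0, lastNewline).toString('utf8').split('\n');
  };
}
```

Even if every chunk is a single byte, each `ä` is decoded only once its full line has arrived.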
Hi @jehna,
Thanks for reporting the issue.
What you propose makes sense to me. It will be great if you can provide a PR that adds a test case for this issue and fixes it.
Workaround
I implemented a workaround for this issue: use another library to pre-split the stream into lines.
I'm using the delimiter-stream NPM package, which seems to have a correct implementation of line splitting as a byte stream:
https://github.com/peterhaldbaek/delimiter-stream/blob/043346af778986d63a7ba0f87b94c3df0bc425d4/delimiter-stream.js#L46
Using this library you can do a simple wrapper to wrap up your stream:
```javascript
const DelimiterStream = require('delimiter-stream');

const toLineDelimitedStream = input => {
  // Two-byte UTF-8 characters (such as "ä") can break because the chunk might
  // get split in the middle of the character, and PapaParse parses the byte
  // stream incorrectly. We can use `DelimiterStream` to fix this, as it splits
  // the chunks into lines correctly before passing the data to PapaParse.
  const output = new DelimiterStream();
  input.pipe(output);
  return output;
};
```
Using this helper function you can wrap the stream before passing it to Papa.parse:
```javascript
Papa.parse(toLineDelimitedStream(stream), {
  ...
})
```
Hey @pokoli
It will be great if you can provide a PR that adds a test case for this issue and fixes it.
I'll have to see if I can find some time to spend on this. Not making any promises yet 😄
@jehna What would this workaround look like for a browser implementation? In my case, instead of a stream, I'm passing the file from a file input field to the Papa.parse function.
@nichgalea if you don't mind the extra memory usage, I believe you can call `.text()` on the file and pass it to Papa.parse as a string.
So something like:
```javascript
Papa.parse(await file.text());
```