PapaParse
ReadableStreamStreamer sometimes breaks UTF8-characters to ��
Issue
If I pass a CSV stream to Papa.parse that contains special characters, it sometimes breaks the special characters so they show up as e.g. ��.
How to reproduce
See example at: https://repl.it/repls/ArcticTreasuredLine
Press "Run" at the top of the page
What should happen?
The output should contain only the `ä` character
What happened?
There are random occurrences of ��
Root cause
These two lines are responsible for this issue: https://github.com/mholt/PapaParse/blob/ae73d2a96639beec58a83326de6bd8e8ca0c02b3/papaparse.js#L863 https://github.com/mholt/PapaParse/blob/ae73d2a96639beec58a83326de6bd8e8ca0c02b3/papaparse.js#L506
When PapaParse reads a chunk, it directly calls `.toString()` on that chunk.
However, a chunk consists of bytes, and some UTF-8 characters are two bytes long:
- `ä` consists of two bytes: `11000011` and `10100100`
- `a` (and other "regular" characters) is just one byte: `01100001`
Now if a chunk boundary falls in the middle of a multi-byte character like `ä`, PapaParse calls `toString()` on each part of the character separately and produces two replacement characters:
- `11000011` (from the end of the first chunk) becomes �
- `10100100` (from the start of the second chunk) becomes �
How to fix this issue?
Since the chunks are received as bytes, the concatenation should be done on bytes too, e.g. using `Buffer.concat`. PapaParse should not call `toString()` before it has split the stream into lines, so `_partialLine` remains a Buffer rather than a string.
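A minimal sketch of that approach (hypothetical helper names, not PapaParse's actual internals): keep the leftover bytes after the last newline as a Buffer, prepend them to the next chunk with `Buffer.concat`, and only call `toString()` on complete lines.

```javascript
// Sketch of byte-safe chunk handling; names are illustrative only.
function makeChunkSplitter() {
  let partial = Buffer.alloc(0); // leftover bytes, kept as a Buffer

  // Feed one chunk; returns the complete lines it contains, as strings.
  return function onChunk(chunk) {
    const data = Buffer.concat([partial, chunk]);
    const lastNewline = data.lastIndexOf(0x0a); // byte value of '\n'
    if (lastNewline === -1) {
      partial = data; // no complete line yet, keep everything as bytes
      return [];
    }
    partial = data.slice(lastNewline + 1);
    // Only complete lines are decoded, so a multi-byte character can
    // never be cut in half by a chunk boundary.
    return data.slice(0, lastNewline).toString('utf8').split('\n');
  };
}
```

Even if every chunk is a single byte, each `ä` is decoded only once its full line has arrived.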
Hi @jehna,
Thanks for reporting the issue.
What you propose makes sense to me. It will be great if you can provide a PR that adds a test case for this issue and fixes it.
Workaround
I implemented a workaround for this issue: use another library to pre-split the stream into lines.
I'm using the delimiter-stream NPM package, which seems to have a correct implementation of line splitting as a byte stream:
https://github.com/peterhaldbaek/delimiter-stream/blob/043346af778986d63a7ba0f87b94c3df0bc425d4/delimiter-stream.js#L46
Using this library you can do a simple wrapper to wrap up your stream:
```javascript
const DelimiterStream = require('delimiter-stream');

const toLineDelimitedStream = input => {
  // Two-byte UTF-8 characters (such as "ä") can break because the chunk might
  // get split in the middle of the character, and PapaParse parses the byte
  // stream incorrectly. We can use `DelimiterStream` to fix this, as it splits
  // the chunks into lines correctly before passing the data to PapaParse.
  const output = new DelimiterStream();
  input.pipe(output);
  return output;
};
```
Using this helper function you can wrap the stream before passing it to Papa.parse:
```javascript
Papa.parse(toLineDelimitedStream(stream), {
  ...
})
```
Hey @pokoli
It will be great if you can provide a PR that adds a test case for this issue and fixes it.
I'll have to see if I can find some time to spend on this. Not making any promises yet 😄
@jehna What would this workaround look like for a browser implementation? In my case, instead of a stream, I'm passing the file from a file input field to the Papa.parse function.
@nichgalea if you don't mind the extra memory usage, I believe you can call `.text()` on the file and pass it to Papa.parse as a string.
So something like:
```javascript
Papa.parse(await file.text());
```