PapaParse
UTF-8-BOM string parsing - first header name incorrectly enclosed in quotes
When a file is encoded as UTF-8-BOM, PapaParse's CSV-to-JSON conversion returns records whose first object key is shown enclosed in single quotes. The field called name (example below) can then no longer be referenced: record.name doesn't exist. The field is effectively record.'name', which is not accessible in JavaScript using record.name, record['name'], etc. You can only see it by printing the record to the console or by using a for-in loop.
The subsequent object keys are correct without quotes.
Change the file encoding to UTF-8 and the keys are normal, without a quote.
"PapaConfig": {
"quotes": true,
"quoteChar": "\"",
"escapeChar": "\"",
"delimiter": ",",
"header": true,
"skipEmptyLines": true,
"columns": null
}
Papa.parse(csvData, PapaConfig)
csvData (subset):
name,phone
De Akker Guest House,0514442010
UTF-8-BOM encoding:
[
{
'name': 'De Akker Guest House',
phone: '0514442010',
UTF-8 encoding:
[
{
name: 'De Akker Guest House',
phone: '0514442010',
Excel exports CSV files as UTF-8-BOM, possibly because that encoding is supposedly faster and more reliable. Can PapaParse be changed to handle UTF-8-BOM correctly?
This seems to be related to the fact that I read the input file using fs.readFile(filename, 'utf-8') and this apparently doesn't strip off the BOM markers.
I found https://github.com/mholt/PapaParse/issues/407 after posting.
It would be useful if PapaParse would handle this itself instead.
Solved by removing the first character of the readFile output: if (data.charCodeAt(0) === 0xfeff) { data = data.substr(1); }
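A minimal sketch of the workaround above as a reusable helper. The name stripBom is an illustrative choice, not part of PapaParse, and the sketch assumes the file content has already been decoded to a JavaScript string, where the UTF-8 BOM shows up as the single code unit U+FEFF.

```javascript
// Remove a leading byte order mark (U+FEFF) from an already-decoded string.
// `stripBom` is a hypothetical helper name, not part of PapaParse.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// Simulated content of a UTF-8-BOM file after fs.readFile(filename, 'utf-8').
const raw = '\ufeffname,phone\nDe Akker Guest House,0514442010';
const cleaned = stripBom(raw);

console.log(cleaned.split('\n')[0] === 'name,phone'); // true
```

The cleaned string can then be passed to Papa.parse(cleaned, PapaConfig) in place of the raw file content.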
It may be useful to include this in PapaParse, to save many people encountering and struggling with this repeatedly.
Could you please submit a pull request that adds your code to PapaParse, along with a test to ensure the behaviour?
We should read the first character before setting the encoding and, if it is the BOM, remove it and force the encoding to UTF-8.
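The idea described above could look roughly like the following. This is only a sketch of the approach, not PapaParse's actual implementation: decodeCsvBytes and fallbackEncoding are hypothetical names, and the input is assumed to be the raw bytes of the file (a Uint8Array or Node.js Buffer) rather than a decoded string.

```javascript
// Sketch only: inspect the raw bytes for the UTF-8 BOM (EF BB BF) before decoding.
// `decodeCsvBytes` and `fallbackEncoding` are hypothetical names, not PapaParse API.
function decodeCsvBytes(bytes, fallbackEncoding = 'utf-8') {
  const hasUtf8Bom =
    bytes.length >= 3 &&
    bytes[0] === 0xef &&
    bytes[1] === 0xbb &&
    bytes[2] === 0xbf;

  if (hasUtf8Bom) {
    // The BOM identifies the encoding, so skip its 3 bytes and force UTF-8.
    return new TextDecoder('utf-8').decode(bytes.subarray(3));
  }
  // No BOM found: decode with the encoding the caller asked for.
  return new TextDecoder(fallbackEncoding).decode(bytes);
}

// Build a small UTF-8-BOM input by hand to try the function out.
const body = new TextEncoder().encode('name,phone');
const withBom = new Uint8Array([0xef, 0xbb, 0xbf, ...body]);

console.log(decodeCsvBytes(withBom) === 'name,phone'); // true
```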
Hi.
Is there any update regarding this issue? I believe I've also encountered it. Here is my case:
csv file content:
Id;Number;Account Type;Description
1;105-347-266;ASST;name1606195953751
2;107-397-393;ASST;name1606001642584
3;109-380-871;ASST;name1606059520118
my code:
let csvFile = fs.readFileSync('file.csv', 'utf8');
let csvFileContent = papa.parse(csvFile, {
dynamicTyping: true,
skipEmptyLines: true
});
assert.isNotEmpty(csvFileContent.data);
assert.sameMembers(csvFileContent.data[0], ['Id', 'Number', 'Account Type', 'Description']);
Output:
throw new AssertionError(msg, {
^
AssertionError: expected [ Array(4) ] to have the same members as [ Array(4) ]
at Object.<anonymous> (D:\Robocze\js-test\index.js:15:8)
at Module._compile (internal/modules/cjs/loader.js:1137:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:1157:10)
at Module.load (internal/modules/cjs/loader.js:985:32)
at Function.Module._load (internal/modules/cjs/loader.js:878:14)
at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:71:12)
at internal/main/run_main_module.js:17:47 {
showDiff: true,
actual: [ 'Id', 'Number', 'Account Type', 'Description' ],
expected: [ 'Id', 'Number', 'Account Type', 'Description' ]
}
I've run an additional check:
let expectedResult = ['Id', 'Number', 'Account Type', 'Description'];
csvFileContent.data[0].forEach((element, index) => {
console.log(`${element} ${expectedResult[index]} ${expectedResult[index] === element}`)
})
with the following output (screenshot not included here):
In the output picture, a whitespace character can be seen before 'Id', but it gets lost when I copy the output.
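Because the extra character is invisible and disappears on copy and paste, one way to confirm it is to print every non-printable character as an escaped code point. The helper below is an illustrative sketch; showHidden is a made-up name, and the sample string assumes the stray character is the BOM (U+FEFF).

```javascript
// Print each character as an escaped code point so invisible ones become visible.
// `showHidden` is a hypothetical helper name used only for this illustration.
function showHidden(text) {
  return Array.from(text)
    .map((ch) => {
      const code = ch.codePointAt(0);
      // Keep printable ASCII as-is; escape everything else as \u{...}.
      return code >= 0x20 && code <= 0x7e ? ch : `\\u{${code.toString(16)}}`;
    })
    .join('');
}

// What the first header cell is assumed to look like when the file starts with a BOM.
const parsedHeader = '\ufeffId';

console.log(parsedHeader === 'Id');    // false
console.log(showHidden(parsedHeader)); // \u{feff}Id
```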
I'm having the same issue in 2022. I was given an external CSV file, probably edited or written on Windows. Processing it on Linux with papaparse, I was unable to access the first property defined by the header. When I console.log(row.data), I see the property key quoted:
{
'CID': '164.306(a)',
Section: 'Ensure Confidentiality, Integrity and Availability',
}
I edited the original CSV and simply retyped the first character in the header, then reran:
{
CID: '164.306(a)',
Section: 'Ensure Confidentiality, Integrity and Availability',
}
I'm using const csvFile = fs.createReadStream(csvFilename);
and I tried switching to const csvFile = fs.readFileSync(csvFilename, { encoding: 'utf-8'});
without luck. I had read that readFileSync was supposed to strip the BOM, but it doesn't, at least for me: https://github.com/nodejs/node-v0.x-archive/issues/1918
I went with this approach, not the most efficient:
const stripBom = function (str) {
  // Drop a leading BOM (U+FEFF) if present
  if (str.charCodeAt(0) === 0xfeff) {
    return str.slice(1);
  }
  return str;
};
papaparse.parse(csvFile, {
  step: function (row, parser) {
    ...
    // Rebuild the row with the BOM stripped from every key
    const data = Object.fromEntries(
      Object.entries(row.data).map(([k, v]) => [stripBom(k), v])
    );
  }
});
Since csvFile is a read stream, not a pre-read file, I just tossed it in there for each step. I could do it only for the 1st step and skip it for anything but the 1st row.
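One possible alternative to remapping keys on every step is PapaParse's transformHeader option, which is applied to each header name when header is true. The sketch below only builds the config object and does not call PapaParse; csvFile and the body of step are placeholders, and whether this fits depends on the PapaParse version in use.

```javascript
// Strip a leading BOM (U+FEFF) from a single string.
const stripBom = (str) => (str.charCodeAt(0) === 0xfeff ? str.slice(1) : str);

// Config sketch: transformHeader cleans the header names themselves,
// so the keys of each row object are expected to come out BOM-free.
const config = {
  header: true,
  transformHeader: (header) => stripBom(header),
  step: (row) => {
    // Placeholder: with the header cleaned, row.data.CID is assumed to be reachable here.
  },
};

// Intended usage (not run here): papaparse.parse(csvFile, config);
console.log(config.transformHeader('\ufeffCID') === 'CID'); // true
```

Only the first header can carry the BOM, so the check is a no-op for the remaining column names.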