
UTF-8-BOM string parsing - header first name incorrectly enclosed in a double quote

Open icaptnbob opened this issue 4 years ago • 6 comments

When a file is encoded as UTF-8 with a BOM, PapaParse's CSV-to-JSON conversion returns the records with the first object key name enclosed in single quotes. You then cannot reference the field called name (example below): record.name doesn't exist. The key is shown as 'name', and it is not accessible in JavaScript using record.name, record['name'], etc. You can only see it by printing the record to the console or by using a for-in loop.

The subsequent object keys are correct without quotes.

Change the file encoding to UTF-8 and the keys are normal, without a quote.

"PapaConfig": {
    "quotes": true,
    "quoteChar": "\"",
    "escapeChar": "\"",
    "delimiter": ",",
    "header": true,
    "skipEmptyLines": true,
    "columns": null
}

Papa.parse(csvData, PapaConfig)

csvData (subset):

name,phone
De Akker Guest House,0514442010

UTF-8-BOM encoding:

[
  {
    'name': 'De Akker Guest House',
    phone: '0514442010',

UTF-8 encoding:

[
  {
    name: 'De Akker Guest House',
    phone: '0514442010',

Excel exports CSV files as UTF-8 with a BOM, possibly because that encoding is supposedly faster and more reliable. Can PapaParse be changed to handle UTF-8 with a BOM correctly?

icaptnbob avatar Oct 27 '20 23:10 icaptnbob

This seems to be related to the fact that I read the input file using fs.readFile(filename, 'utf-8') and this apparently doesn't strip off the BOM markers.
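To illustrate why the first key misbehaves, here is a minimal sketch. It assumes the file starts with the three UTF-8 BOM bytes EF BB BF. The sample bytes and the hand-built record object are made up for illustration, and the sketch does not call PapaParse.

```javascript
// Bytes of a UTF-8 file that starts with a BOM (EF BB BF), followed by "name,phone".
const bytes = Buffer.concat([
  Buffer.from([0xef, 0xbb, 0xbf]),
  Buffer.from('name,phone\n', 'utf8'),
]);

// Decoding as 'utf-8' keeps the BOM as the character U+FEFF at index 0.
const text = bytes.toString('utf-8');
const firstCharIsBom = text.charCodeAt(0) === 0xfeff;

// A header-keyed record built from that text gets '\uFEFFname' as its first key.
const firstHeader = text.split('\n')[0].split(',')[0];
const record = { [firstHeader]: 'De Akker Guest House', phone: '0514442010' };

console.log(firstCharIsBom);       // true
console.log(firstHeader.length);   // 5, not 4
console.log(record.name);          // undefined
console.log(record['\uFEFFname']); // De Akker Guest House
```

Because the key starts with an invisible character, it is not a plain identifier, which is why the console prints it in quotes.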

I found https://github.com/mholt/PapaParse/issues/407 after posting.

It would be useful if PapaParse would handle this itself instead.

icaptnbob avatar Oct 27 '20 23:10 icaptnbob

Solved by removing the first character of the readFile output when it is a BOM: if (data.charCodeAt(0) === 0xfeff) { data = data.slice(1); }
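The same fix as a small reusable helper, as a sketch only. The name stripBom and the sample string are illustrative, and the sample stands in for whatever fs.readFile(filename, 'utf-8') returned.

```javascript
// Remove a leading byte order mark (U+FEFF) from an already-decoded string.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// Hypothetical sample standing in for the output of fs.readFile(filename, 'utf-8').
const data = '\uFEFFname,phone\nDe Akker Guest House,0514442010\n';
const cleaned = stripBom(data);

console.log(cleaned.startsWith('name,phone')); // true
console.log(stripBom(cleaned) === cleaned);    // true: input without a BOM passes through unchanged

// The cleaned string is what would then be handed to the parser, e.g.
// Papa.parse(cleaned, PapaConfig)
```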

It may be useful to include this in PapaParse, to save many people encountering and struggling with this repeatedly.

icaptnbob avatar Oct 27 '20 23:10 icaptnbob

Could you please submit a pull request that adds your code to PapaParse and adds a test to ensure the behaviour?

We should read the first character before setting the encoding and, if it is the BOM, remove it and force the encoding to UTF-8.
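One way the check described above could look at the byte level, as a rough sketch. decodeCsvBuffer is a hypothetical helper written for this example. It is not PapaParse code and not necessarily how a patch to the library would be structured.

```javascript
// UTF-8 encoding of U+FEFF.
const UTF8_BOM = Buffer.from([0xef, 0xbb, 0xbf]);

// Decode a Buffer as UTF-8, dropping a leading BOM if one is present.
// Hypothetical helper for illustration only.
function decodeCsvBuffer(buffer) {
  const hasBom = buffer.length >= 3 && buffer.subarray(0, 3).equals(UTF8_BOM);
  const body = hasBom ? buffer.subarray(3) : buffer;
  return { hasBom, text: body.toString('utf8') };
}

const withBom = Buffer.concat([UTF8_BOM, Buffer.from('name,phone\n')]);
const withoutBom = Buffer.from('name,phone\n');

console.log(decodeCsvBuffer(withBom));    // { hasBom: true, text: 'name,phone\n' }
console.log(decodeCsvBuffer(withoutBom)); // { hasBom: false, text: 'name,phone\n' }
```

Checking the bytes before decoding means the rest of the code never sees U+FEFF at all, at the cost of having to read the file as a Buffer rather than as a string.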

pokoli avatar Oct 28 '20 09:10 pokoli

Hi.

Is there any update regarding this issue? I believe I've also encountered it. Here is my case:

csv file content:

Id;Number;Account Type;Description
1;105-347-266;ASST;name1606195953751
2;107-397-393;ASST;name1606001642584
3;109-380-871;ASST;name1606059520118

my code:

// readFileSync is synchronous and takes no callback.
let csvFile = fs.readFileSync('file.csv', 'utf8');

let csvFileContent = papa.parse(csvFile, {
    dynamicTyping: true,
    skipEmptyLines: true
});

assert.isNotEmpty(csvFileContent.data);
assert.sameMembers(csvFileContent.data[0], ['Id', 'Number', 'Account Type', 'Description']);

Output:

      throw new AssertionError(msg, {
      ^
AssertionError: expected [ Array(4) ] to have the same members as [ Array(4) ]
    at Object.<anonymous> (D:\Robocze\js-test\index.js:15:8)
    at Module._compile (internal/modules/cjs/loader.js:1137:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1157:10)
    at Module.load (internal/modules/cjs/loader.js:985:32)
    at Function.Module._load (internal/modules/cjs/loader.js:878:14)
    at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:71:12)
    at internal/main/run_main_module.js:17:47 {
  showDiff: true,
  actual: [ 'Id', 'Number', 'Account Type', 'Description' ],
  expected: [ 'Id', 'Number', 'Account Type', 'Description' ]
}

I've run an additional check:

let expectedResult = ['Id', 'Number', 'Account Type', 'Description'];

csvFileContent.data[0].forEach((element, index) => {
    console.log(`${element} ${expectedResult[index]} ${expectedResult[index] === element}`)
})

with the following output: (image)

In the output picture a whitespace character can be seen before 'Id', but it gets lost when I copy the output.
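A character that gets lost on copy can be made visible by escaping everything outside printable ASCII before logging. The sketch below assumes the stray character is a BOM (U+FEFF), which this thread suggests but the post itself does not confirm. showInvisible and the sample strings are made up for illustration.

```javascript
// Replace every character outside printable ASCII with a \uXXXX escape
// so that invisible characters such as U+FEFF show up in logs.
function showInvisible(str) {
  return str.replace(/[^\x20-\x7e]/g, (ch) =>
    '\\u' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

// Hypothetical first header cell as it would look if the file started with a BOM.
const actual = '\uFEFFId';
const expected = 'Id';

console.log(actual === expected);            // false
console.log(showInvisible(actual));          // \uFEFFId
console.log(showInvisible(expected));        // Id
console.log(actual.length, expected.length); // 3 2
```

Comparing the lengths is often the quickest hint: two strings that print identically but differ in length contain an invisible character.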

MikoSh95 avatar Nov 24 '20 09:11 MikoSh95

I'm having the same issue in 2022. I was given some external CSV file, probably edited/written on Windows, processing it on Linux with papaparse and I was unable to access the first row property defined by the header. When I console.log(row.data) I would see the property key quoted:

{
  'CID': '164.306(a)',
  Section: 'Ensure Confidentiality, Integrity and Availability',
}

I edited the original CSV and simply retyped the first character in the header, then reran:

{
  CID: '164.306(a)',
  Section: 'Ensure Confidentiality, Integrity and Availability',
}

I'm using const csvFile = fs.createReadStream(csvFilename); and I tried switching to const csvFile = fs.readFileSync(csvFilename, { encoding: 'utf-8'}); without luck. I read that the BOM was supposed to be stripped by readFileSync, but it doesn't work, for me at least: https://github.com/nodejs/node-v0.x-archive/issues/1918

duhmojo avatar Jan 19 '22 21:01 duhmojo

I went with this approach, not the most efficient:

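const stripBom = function (str) {
    // U+FEFF is the byte order mark; drop it if it leads the string.
    if (str.charCodeAt(0) === 0xfeff) {
        return str.slice(1)
    }
    return str
}

papaparse.parse(csvFile, {
    step: function (row, parser) {
        ...
        // Rebuild the row object with the BOM stripped from every key.
        const data = Object.fromEntries(
            Object.entries(row.data).map(([k, v]) => [stripBom(k), v])
        )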

Since csvFile is a read stream, not a pre-read file, I just tossed it in there for each step. I could do it only for the 1st step and skip it if it's anything but the 1st row.
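A possibly cheaper variant is to clean the header names once instead of rebuilding every row. This sketch assumes a PapaParse version whose config accepts transformHeader together with header: true. The config object is illustrative, and nothing is parsed here.

```javascript
// Remove a leading byte order mark (U+FEFF) from a header name.
const stripBom = (header) =>
  header.charCodeAt(0) === 0xfeff ? header.slice(1) : header;

// Config sketch: with header: true, transformHeader is applied to each
// header name, so row objects would be built with the cleaned key.
const config = {
  header: true,
  skipEmptyLines: true,
  transformHeader: stripBom,
  step: function (row) {
    // row.data.CID would be reachable directly here.
  },
};

console.log(config.transformHeader('\uFEFFCID')); // CID
console.log(config.transformHeader('Section'));   // Section
```

With header: true the BOM-prefixed name becomes a key of every row object, not only the first, so fixing the header once covers all rows.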

duhmojo avatar Jan 20 '22 19:01 duhmojo