
DKAN Datastore Simple Import produces warning messages (7.x)

Open stefan-korn opened this issue 3 years ago • 3 comments

Describe the bug

The DKAN Datastore Simple Import may produce incorrect warning messages like "Source file is not in "UTF-8" encoding". The Simple Import does a chunked import, splitting the file into 32-byte parts, and runs a UTF-8 check with mb_check_encoding() on each of these 32-byte chunks. A chunk boundary can fall in the middle of a multi-byte UTF-8 character. When that happens, the check fails for that chunk and the warning "Source file is not in "UTF-8" encoding" is shown, even though the file as a whole is valid UTF-8. The import itself does not fail, because apart from checking the encoding and emitting the warning, the data is processed unaltered.
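To illustrate the effect, here is a minimal sketch in Python (not the DKAN PHP code). It assumes the 32-byte chunk size described above and uses a hypothetical helper `is_valid_utf8()` as a stand-in for `mb_check_encoding()`. The sample data is made up so that a two-byte character straddles the chunk boundary.

```python
CHUNK_SIZE = 32  # chunk size in bytes, as described in this report


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical stand-in for mb_check_encoding($data, 'UTF-8')."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def chunks(data: bytes, size: int = CHUNK_SIZE):
    """Yield consecutive fixed-size byte slices of data."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


# 31 ASCII bytes followed by "ä" (two bytes in UTF-8), so the two bytes
# of "ä" end up in different 32-byte chunks.
sample = ("a" * 31 + "ä" + "\n").encode("utf-8")

whole_file_ok = is_valid_utf8(sample)
per_chunk_ok = [is_valid_utf8(chunk) for chunk in chunks(sample)]

print(whole_file_ok)  # the complete data is valid UTF-8
print(per_chunk_ok)   # both chunks fail: each holds only half of "ä"
```

The whole byte string validates, but each chunk taken alone does not, which matches the spurious warnings described above.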

I don't think it makes sense to perform a UTF-8 check on the individual chunks of the file during the import.

Maybe read the file line by line and do the UTF-8 check on each line? Or do the UTF-8 check on the whole file?
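A third option, not proposed in the original report, would be to keep reading in chunks but validate incrementally, so that an incomplete multi-byte sequence at the end of one chunk is carried over into the next. The following is only a Python sketch of that idea, using the standard library's incremental decoder. The function name `stream_is_valid_utf8()` and the 32-byte default are assumptions for illustration, and a PHP implementation would have to handle the carried-over bytes itself.

```python
import codecs
import io


def stream_is_valid_utf8(stream, chunk_size: int = 32) -> bool:
    """Validate a binary stream as UTF-8 while reading it in fixed-size chunks.

    The incremental decoder buffers an incomplete trailing byte sequence
    and completes it with the next chunk, so a character split across a
    chunk boundary is not reported as an error.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            decoder.decode(chunk)
        # final=True raises if the stream ends in the middle of a character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


# Valid UTF-8 where "ä" straddles the 32-byte boundary.
valid = ("a" * 31 + "ä" + "\n").encode("utf-8")
# The same text encoded as Latin-1 is not valid UTF-8.
not_utf8 = ("a" * 31 + "ä" + "\n").encode("latin-1")

print(stream_is_valid_utf8(io.BytesIO(valid)))
print(stream_is_valid_utf8(io.BytesIO(not_utf8)))
```

This approach does not depend on newline characters, and it never needs more than one chunk plus a few carried-over bytes in memory.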

This is the code line where the check happens: https://github.com/GetDKAN/dkan/blob/0342af1b8aa91bc1ec3ff347d00be9aa3cabf0f5/modules/dkan/dkan_datastore/modules/dkan_datastore_simple_import/src/SimpleImport.php#L54

stefan-korn avatar Mar 08 '21 11:03 stefan-korn

Proposal for reading the file line by line #3393

stefan-korn avatar Mar 08 '21 11:03 stefan-korn

@stefan-korn if the file does not have any newline characters, fgets will fetch the entire file.

janette avatar Mar 17 '21 15:03 janette

@janette: Okay, that's a valid objection. But the chunking implementation, with possibly dozens of (incorrect) warnings, still seems kind of buggy to me. Maybe skip the UTF-8 validation of the chunks altogether; it does not seem really reliable anyway.

stefan-korn avatar Mar 17 '21 15:03 stefan-korn