DKAN Datastore Simple Import produces warning messages (7.x)
Describe the bug
The DKAN Datastore Simple Import may produce incorrect warning messages like "Source file is not in "UTF-8" encoding". The Simple Import does a chunked import, splitting the file into 32-byte parts, and runs a UTF-8 check with mb_check_encoding() on each 32-byte chunk. A chunk boundary can fall in the middle of a multi-byte UTF-8 character. When that happens, the UTF-8 check fails for that chunk and the warning "Source file is not in "UTF-8" encoding" is shown, even though the file as a whole is valid UTF-8. The import itself does not fail, because apart from running the check and emitting the warning, the data is processed unaltered.
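To make the failure mode concrete, here is a minimal sketch in Python (not the DKAN PHP code). It assumes the 32-byte chunk size described above and uses a strict decode as a stand-in for mb_check_encoding(); the helper names are made up for illustration.

```python
CHUNK_SIZE = 32  # chunk size as described in this report (assumption for the demo)


def is_valid_utf8(chunk: bytes) -> bool:
    """Strict check of an isolated byte string (stand-in for a per-chunk UTF-8 check)."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size byte chunks, ignoring character boundaries."""
    return [data[i:i + size] for i in range(0, len(data), size)]


# 31 ASCII bytes followed by "ä" (two bytes in UTF-8: 0xC3 0xA4), so the
# 32-byte boundary falls between the two bytes of the character.
data = ("a" * 31 + "ä" + ",b\n").encode("utf-8")
chunks = split_chunks(data)

whole_ok = is_valid_utf8(data)                      # the file as a whole is valid
per_chunk_ok = [is_valid_utf8(c) for c in chunks]   # but each chunk fails on its own

print(whole_ok, per_chunk_ok)
```

The first chunk ends with a dangling lead byte and the second starts with a stray continuation byte, so both are rejected although the input is valid UTF-8.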
I think it does not make sense to perform a UTF-8 check on the chunked parts of the file during the import.
Maybe read the file line by line and do the UTF-8 check on each line? Or do the UTF-8 check on the whole file?
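One possible way to check the whole file without loading it into memory is to keep reading fixed-size chunks but carry incomplete trailing bytes over to the next chunk. The sketch below shows the idea in Python using the standard library's incremental decoder; it is only an illustration of the technique, not a patch for SimpleImport.php, and the function name and the 32-byte default are assumptions. A PHP version would need its own carry-over logic.

```python
import codecs
import io


def stream_is_valid_utf8(stream, chunk_size=32):
    """Validate a binary stream as UTF-8 while reading it in fixed-size chunks.

    The incremental decoder buffers an incomplete multi-byte sequence at the
    end of one chunk and completes it with the start of the next chunk.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            decoder.decode(chunk)
        # final=True makes a truncated sequence at end of input an error.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


# Valid input whose 32-byte boundary splits "ä" in the middle.
valid = io.BytesIO(("a" * 31 + "ä" + ",b\n").encode("utf-8"))
# Invalid input: 0xFF never occurs in UTF-8.
invalid = io.BytesIO(b"abc\xff\n")

print(stream_is_valid_utf8(valid), stream_is_valid_utf8(invalid))
```

With this approach a warning is raised only for bytes that are really invalid, and memory use stays bounded even if the file contains no newline characters.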
This is the code line where the check happens: https://github.com/GetDKAN/dkan/blob/0342af1b8aa91bc1ec3ff347d00be9aa3cabf0f5/modules/dkan/dkan_datastore/modules/dkan_datastore_simple_import/src/SimpleImport.php#L54
Proposal for reading the file line by line #3393
@stefan-korn if the file does not contain any newline characters, fgets() will fetch the entire file at once.
@janette: Okay, that's a valid objection. But the chunking implementation, with possibly dozens of incorrect warnings, still seems buggy to me. Maybe skip the UTF-8 validation of the chunks altogether; it does not seem reliable anyway.