Allow quotes in imported CSV files and make mass import more resilient
fixes https://github.com/kitodo/kitodo-production/issues/6009
This PR supersedes the closed pull requests https://github.com/kitodo/kitodo-production/pull/6011 and https://github.com/kitodo/kitodo-production/pull/6010. It does three things:
- it makes the mass import more resilient by filtering out empty lines in the imported CSV file, which currently lead to an exception (see https://github.com/kitodo/kitodo-production/pull/6010)
- it fixes the bug described in https://github.com/kitodo/kitodo-production/issues/6009 by changing how the separator for the import is switched
- it allows quoted values in the CSV file, which can be used when a value contains a potential separator ("," or ";"). On current master, values containing a separator break the import; this PR fixes that (see the sketch after this list).
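To illustrate the intended behavior, here is a minimal, self-contained sketch (not the actual code in this PR; the class name and CSV content are made up for this example):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CsvImportSketch {
    public static void main(String[] args) {
        // An upload containing an empty line and a value with an embedded separator.
        List<String> lines = Arrays.asList(
                "title;author",
                "",                                  // previously led to an exception
                "\"Title; with semicolon\";Doe");

        // Filtering empty lines before parsing keeps the import from failing.
        List<String> nonEmpty = lines.stream()
                .filter(line -> !line.trim().isEmpty())
                .collect(Collectors.toList());

        System.out.println(nonEmpty);
        // With quote support, the first cell of the data row is parsed as
        // "Title; with semicolon" instead of being split at the ";".
    }
}
```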
Marking as draft until tests can be added after https://github.com/kitodo/kitodo-production/pull/5961 is merged.
Would the OpenCSV library not be an option, instead of coding this again?
Yes, it definitely is. While fixing the above bugs I was already wondering whether we will end up manually fixing a lot of bugs that a library might already take care of. I was just not sure when the point is reached where we should bring in a new dependency, and whether that might introduce new problems. I will look into it.
I prefer using libraries instead of writing the same solution for my issue again, and if I encounter issues I would solve them in the library or in the source data. But in the end it is a decision for the developer.
Just a note: OpenCSV was not in use until this pull request (the dependency could have been removed), and the dependency is quite old: version 3.10 from July 2017. Maybe update to a more recent version like 5.9 from November 2023?
A small suggestion for how the parseLines method in the MassImportService class could look:
```java
public List<CsvRecord> parseLines(List<String> lines, String separator) throws IOException, CsvException {
    List<CsvRecord> records = new LinkedList<>();
    // Configure OpenCSV with the user-selected separator and '"' as quote character.
    CSVParser parser = new CSVParserBuilder()
            .withSeparator(separator.charAt(0))
            .withQuoteChar('"')
            .build();
    try (StringReader reader = new StringReader(String.join("\n", lines));
         CSVReader csvReader = new CSVReaderBuilder(reader)
                 .withSkipLines(0)
                 .withCSVParser(parser)
                 .build()) {
        // Wrap each parsed row in the existing CsvRecord/CsvCell model.
        for (String[] entries : csvReader.readAll()) {
            List<CsvCell> cells = new LinkedList<>();
            for (String value : entries) {
                cells.add(new CsvCell(value));
            }
            records.add(new CsvRecord(cells));
        }
    }
    return records;
}
```
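For illustration, a hypothetical call (the `service` variable and the CSV content are made up for this example):

```java
List<String> lines = Arrays.asList(
        "title;author",
        "\"Title; with semicolon\";Doe");
List<CsvRecord> records = service.parseLines(lines, ";");
// records.get(1) now holds two cells: "Title; with semicolon" and "Doe" --
// the quoted first cell keeps its embedded ";" intact.
```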
And a more compact variant:
```java
public List<CsvRecord> parseLines(List<String> lines, String separator) throws IOException, CsvException {
    List<CsvRecord> records;
    CSVParser parser = new CSVParserBuilder()
            .withSeparator(separator.charAt(0))
            .withQuoteChar('"')
            .build();
    try (CSVReader csvReader = new CSVReaderBuilder(new StringReader(String.join("\n", lines)))
            .withSkipLines(0)
            .withCSVParser(parser)
            .build()) {
        // Map each parsed row to a CsvRecord of CsvCells in one stream pipeline.
        records = csvReader.readAll().stream()
                .map(csvLines -> Arrays.stream(csvLines)
                        .map(CsvCell::new)
                        .collect(Collectors.toCollection(LinkedList::new)))
                .map(CsvRecord::new)
                .collect(Collectors.toCollection(LinkedList::new));
    }
    return records;
}
```
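Both variants should behave the same; the loop version is arguably easier to step through in a debugger, while the stream version keeps the row-to-record mapping in a single expression. Which to use is mainly a readability preference.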
Thanks a lot for the support, this looks way better. Your second suggestion probably needs some fine-tuning to work, so I used the more classical approach for now. There are still some problems when I have a very large CSV (2000 or more lines) and switch the separator multiple times: somehow the table data gets lost. I will look into that next week.
You are welcome. I only looked at the reading part, and I think there is room for further improvements (see the sketch after this list):
- maybe the `StringReader` should be wrapped in a `BufferedReader` for larger amounts of data
- maybe the conversion of the PrimeFaces `UploadedFile` to a list of strings and back to a `Reader` object can be skipped, since a `CSVReader` can be built directly on top of the upload's input stream. Extracting the header line should even be possible with the OpenCSV library, using the same delimiters and quote qualifiers.
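A minimal sketch of that idea, assuming OpenCSV 5.x and a PrimeFaces `UploadedFile` (the method name, variable names, and charset choice are assumptions for this example):

```java
// Sketch only: reads the upload's stream directly instead of materializing
// a List<String> first. "file" is a PrimeFaces UploadedFile.
public List<CsvRecord> parseUpload(UploadedFile file, String separator) throws IOException, CsvException {
    CSVParser parser = new CSVParserBuilder()
            .withSeparator(separator.charAt(0))
            .withQuoteChar('"')
            .build();
    try (Reader reader = new BufferedReader(
                 new InputStreamReader(file.getInputStream(), StandardCharsets.UTF_8));
         CSVReader csvReader = new CSVReaderBuilder(reader)
                 .withCSVParser(parser)
                 .build()) {
        List<CsvRecord> records = new LinkedList<>();
        // Header row is included here; it could be read separately with readNext()
        // using the same delimiter and quote rules.
        for (String[] row : csvReader.readAll()) {
            List<CsvCell> cells = new LinkedList<>();
            for (String value : row) {
                cells.add(new CsvCell(value));
            }
            records.add(new CsvRecord(cells));
        }
        return records;
    }
}
```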
> There are still some problems when I have a very large CSV (2000 or more lines) and switch the separator multiple times: somehow the table data gets lost. I will look into that next week.
This also happens on current master. With 2000 entries in my CSV file (500 work fine) I can only switch the separator once; after an additional switch the table loses its data.
> This also happens on current master. With 2000 entries in my CSV file (500 work fine) I can only switch the separator once; after an additional switch the table loses its data.
@BartChris If this happens on master as well, the problem is not introduced by the changes in this pull request, right? Is there another reason why it is still a draft?
@solth Sorry, I did not get to that again. I will add tests next week and then this should be ready for review.