kitodo-production icon indicating copy to clipboard operation
kitodo-production copied to clipboard

Allow quotes in imported csv files and make mass import more resilient

Open BartChris opened this issue 1 year ago • 11 comments

fixes https://github.com/kitodo/kitodo-production/issues/6009

This PR supersedes the closed Pull Requests https://github.com/kitodo/kitodo-production/pull/6011 and https://github.com/kitodo/kitodo-production/pull/6010 and does three things:

  • it makes the mass import more resilient by filtering out empty lines in the imported csv file which lead to an exception right now. (see https://github.com/kitodo/kitodo-production/pull/6010)
  • it fixes the bug described in https://github.com/kitodo/kitodo-production/issues/6009 by changing the method how the seperator for the import is switched
  • it allows for quoted values in the CSV file, which can be used when the value contains potential seperators ("," or ";"). Values which contain a seperator break the import in current master. This PR fixes that.

Marking as draft until tests can be added after https://github.com/kitodo/kitodo-production/pull/5961 is merged.

BartChris avatar Mar 22 '24 15:03 BartChris

Is OpenCSV project / library not an option to be used instead of coding this again?

henning-gerhardt avatar Mar 23 '24 09:03 henning-gerhardt

Yes, it definitely is. While fixing the above bugs i was already wondering wether we will end up fixing a lot of bugs manually, which might already be taken care of by a library. I was just not sure, when the point is reached where we should bring in a new dependency. And if this might bring in new problems. I will look into it.

BartChris avatar Mar 23 '24 09:03 BartChris

I prefer using libraries instead of writing the same solution for my issue again and if I encounter issues I would solve them in the library or in the source data. But in the end it is a decision by the developer.

henning-gerhardt avatar Mar 23 '24 10:03 henning-gerhardt

Just a note: OpenCSV was not in use (and the dependency could be removed) until this pull request and the dependency is quite old: version 3.10 from July 2017 - maybe updating the version to a more recent version like 5.9 from November 2023?

henning-gerhardt avatar Mar 23 '24 11:03 henning-gerhardt

A small suggestion how the method parseLines in the MassImportService class can may look:

   public List<CsvRecord> parseLines(List<String> lines, String separator) throws IOException, CsvException {
        List<CsvRecord> records = new LinkedList<>();
        CSVParser parser = new CSVParserBuilder()
                .withSeparator(separator.charAt(0))
                .withQuoteChar('\"')
                .build();
        try (StringReader reader = new StringReader(String.join("\n", lines));
             CSVReader csvReader = new CSVReaderBuilder(reader)
                .withSkipLines(0)
                .withCSVParser(parser)
                .build()) {
            for (String[] entries : csvReader.readAll()) {
                List<CsvCell> cells = new LinkedList<>();
                for (String value : entries) {
                    cells.add(new CsvCell(value));
                }
                records.add(new CsvRecord(cells));
            }
        }
        return records;
    }

henning-gerhardt avatar Mar 23 '24 13:03 henning-gerhardt

And a more compact variant:

    public List<CsvRecord> parseLines(List<String> lines, String separator) throws IOException, CsvException {
        List<CsvRecord> records;
        CSVParser parser = new CSVParserBuilder()
                .withSeparator(separator.charAt(0))
                .withQuoteChar('\"')
                .build();
        try (CSVReader csvReader = new CSVReaderBuilder(new StringReader(String.join("\n", lines)))
                .withSkipLines(0)
                .withCSVParser(parser)
                .build()) {
            records = csvReader.readAll().stream().map(csvLines -> Arrays.stream(csvLines)
                    .map(CsvCell::new)
                    .collect(Collectors.toCollection(LinkedList::new)))
                    .map(CsvRecord::new).collect(Collectors.toCollection(LinkedList::new));
        }
        return records;
    }

henning-gerhardt avatar Mar 23 '24 13:03 henning-gerhardt

Thanks a lot for the support, this looks way better. Your second suggestion probably needs some finetuning to work, so i used the more classical approach for now. There are still some problems when i have a very large csv (2000 and more lines) and switch the separator multiple times. Somehow the table data gets lost. I will look into that next week.

BartChris avatar Mar 23 '24 13:03 BartChris

You are welcome. I looked only the reading part and I think there is room left for more improvements:

  • maybe the StreamReader must be wrapped in a BufferedReader for more amount of data
  • maybe converting the UploadedFile from PrimeFaces to a list of strings back to a Reader object as CSVReader can directly use a InputBufferStream: CSVReaderBuilder(UploadedFile.getInputStream()).build(). Extracting the header line should even be possible with the OpenCSV library with the same delimiters and quote qualifiers.

henning-gerhardt avatar Mar 23 '24 13:03 henning-gerhardt

There are still some problems when i have a very large csv (2000 and more lines) and switch the separator multiple times. Somehow the table data gets lost. I will look into that next week.

This happens also in current master. When having 2000 (500 are working) entries in my CSV file i can only switch the separator once. After doing an additional switch the table loses its data.

BartChris avatar Mar 25 '24 11:03 BartChris

This happens also in current master. When having 2000 (500 are working) entries in my CSV file i can only switch the separator once. After doing an additional switch the table loses its data.

@BartChris If this happens in the master as well, the problem is not introduced by the changes in this pull request, right? Is there another reason why it is still a draft?

solth avatar May 17 '24 08:05 solth

@solth Sorry, i did not get to that again. I will add tests next week and then this should be ready for review.

BartChris avatar Jun 03 '24 12:06 BartChris