Optimize performance of CSVLoader

Open inejc opened this issue 5 years ago • 4 comments

The current implementation is very slow. I think a better approach would be to implement a custom solution rather than rely on a third-party library.

inejc avatar Oct 04 '19 16:10 inejc

@inejc feel free to pick routines and tricks from the jsoniter-scala-core module.

Here are benchmark results for estimating the achievable throughput and allocations.

plokhotnyuk avatar Oct 04 '19 19:10 plokhotnyuk

@plokhotnyuk thanks for the pointers! I will look at your solution. Are you perhaps aware of any existing, efficient CSV loading libraries on the JVM?

inejc avatar Oct 04 '19 21:10 inejc

There are a lot of solutions for Java: https://github.com/uniVocity/csv-parsers-comparison

But a custom codec based on jsoniter-scala-core greatly outperforms them when numbers and strings are represented as JSON values. That requires wrapping all string values in " characters, using UTF-8 encoding or hexadecimal escaping for non-ASCII characters, and not using numbers with leading zeroes.

If an implementation that is locked to the JSON representation of strings and numbers is not acceptable, you can fork it and replace that part with other rules and encoding formats, using the same approaches and hacks.
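
To make the JSON-value constraint concrete, here is a minimal single-pass sketch of a parser for one CSV line whose fields are JSON strings or JSON numbers. It uses only the Scala standard library and does not reproduce the jsoniter-scala-core API or its optimizations; `Cell`, `Str`, `Num` and `JsonCsvLine` are hypothetical names, and only a small subset of JSON escapes is handled.

```scala
// Illustrative sketch only: Cell, Str, Num and JsonCsvLine are hypothetical names,
// not part of doddle-model or jsoniter-scala-core.
sealed trait Cell
final case class Str(value: String) extends Cell
final case class Num(value: Double) extends Cell

object JsonCsvLine {
  // JSON number grammar: optional minus, no leading zeroes, optional fraction and exponent.
  private val JsonNumber = """-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?"""

  private def fail(msg: String): Nothing = throw new IllegalArgumentException(msg)

  /** Parses one line of comma-separated JSON strings and JSON numbers. */
  def parse(line: String): Vector[Cell] = {
    val cells = Vector.newBuilder[Cell]
    var pos = 0
    var more = true
    while (more) {
      if (pos < line.length && line.charAt(pos) == '"') {
        // Quoted field: scan to the closing quote, decoding escapes on the way,
        // so commas inside the quotes never split the field.
        val sb = new StringBuilder
        pos += 1
        var closed = false
        while (!closed) {
          if (pos >= line.length) fail("unterminated string")
          val c = line.charAt(pos)
          if (c == '"') { closed = true; pos += 1 }
          else if (c == '\\') {
            if (pos + 1 >= line.length) fail("dangling escape")
            line.charAt(pos + 1) match {
              case 'u' =>
                // Hexadecimal escape: backslash, 'u', then four hex digits.
                if (pos + 6 > line.length) fail("truncated hexadecimal escape")
                sb.append(Integer.parseInt(line.substring(pos + 2, pos + 6), 16).toChar)
                pos += 6
              case e @ ('"' | '\\' | '/') => sb.append(e); pos += 2
              case 'n' => sb.append('\n'); pos += 2
              case 't' => sb.append('\t'); pos += 2
              case other => fail("unsupported escape character: " + other)
            }
          } else { sb.append(c); pos += 1 }
        }
        cells += Str(sb.toString)
      } else {
        // Unquoted field: everything up to the next comma must be a JSON number.
        val next = line.indexOf(',', pos)
        val end = if (next < 0) line.length else next
        val token = line.substring(pos, end)
        if (!token.matches(JsonNumber)) fail("not a JSON number: '" + token + "'")
        cells += Num(token.toDouble)
        pos = end
      }
      // After a field, expect either the end of the line or a separating comma.
      if (pos == line.length) more = false
      else if (line.charAt(pos) == ',') pos += 1
      else fail("expected ',' at index " + pos)
    }
    cells.result()
  }
}
```

For example, `JsonCsvLine.parse("\"a,b\",1.5,-3")` returns a string cell holding `a,b` followed by numeric cells for 1.5 and -3.0, while an input such as `007` is rejected because of the leading zeroes. Because quoting and number syntax are unambiguous, the line can be consumed in one left-to-right scan without backtracking.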

plokhotnyuk avatar Oct 04 '19 23:10 plokhotnyuk

I merged https://github.com/picnicml/doddle-model/pull/106 but am keeping this issue open as we want to improve the current solution, preferably by looking into the examples given by @plokhotnyuk.

inejc avatar Oct 06 '19 11:10 inejc