
Using Arrow to further speed up raw data I/O

Open ghiggi opened this issue 2 years ago • 0 comments

Prework

  • [x] Read and agree to the code of conduct.
  • [x] If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • [ ] Post a minimal reproducible example so the maintainer can troubleshoot the problems you identify. A reproducible example is:
  • [ ] Runnable
  • [ ] Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.

Description

Evaluate the benefits of using:

  • the engine="pyarrow" option of pd.read_csv, to read the raw data using multithreading,
  • the Arrow dtype backend (dtype_backend="pyarrow") introduced in pandas 2.0, to decrease the memory usage of string columns in pd.DataFrame.

Please describe the performance issue.

Benchmarks

How poorly does DISDRODB perform?

ghiggi, Jun 06 '23 22:06