vaex
[BUG-REPORT] Slow HDF5 conversion of file with large number of columns
Description
Files with a large number of columns seem to freeze when converting to HDF5. On a test file with 5000 columns and 10 rows, conversion to arrow takes 0.19s and conversion to parquet takes 0.25s, while the hdf5 export progresses very slowly.
import vaex
df = vaex.open("test_file.csv")
df.export_arrow("test_file.arrow", progress=True)
df.export_parquet("test_file.parquet", progress=True)
df.export_hdf5("test_file.hdf5", progress=True)
export(arrow) [########################################] 100.00% elapsed time : 0.20s = 0.0m = 0.0h
export(arrow) [########################################] 100.00% elapsed time : 0.24s = 0.0m = 0.0h
export(hdf5) [###############-------------------------] 38.01% estimated time: 49.82s = 0.8m = 0.0h
It seems most of the delay is originating from this line: https://github.com/vaexio/vaex/blob/633970528cb5091ef376dbca2e4721cd42525419/packages/vaex-hdf5/vaex/hdf5/writer.py#L73
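One way to check where the time goes is to run the export under the standard-library profiler. The following is a minimal sketch; `profile_call` is a hypothetical helper name and not part of vaex, and it only reports what cProfile measures for whatever callable it is given.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buf = io.StringIO()
    # Limit the report to the `top` entries with the largest cumulative time.
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


# Possible usage with the dataframe from the snippet above:
# _, report = profile_call(df.export_hdf5, "test_file.hdf5")
# print(report)
```

If the linked line is the bottleneck, the enclosing function in writer.py should appear near the top of the report.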
Software information
- Vaex version (import vaex; vaex.__version__):
{'vaex': '4.11.1', 'vaex-core': '4.11.1', 'vaex-viz': '0.5.2', 'vaex-hdf5': '0.12.3', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.1', 'vaex-jupyter': '0.8.0', 'vaex-ml': '0.18.0'}
- Vaex was installed via: pip / conda-forge / from source: pip
- OS: Ubuntu 20.04
Additional information
I have uploaded a dataset to help reproduce this issue: test_file.csv
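If the attachment is not available, a file of the same shape (5000 columns, 10 rows) can be generated locally. This is only a sketch: the column names and the random float values are assumptions, not the contents of the original test_file.csv, and `write_wide_csv` is a hypothetical helper.

```python
import csv
import random


def write_wide_csv(path, n_cols=5000, n_rows=10, seed=0):
    """Write a CSV with n_cols float columns and n_rows data rows; return the column names."""
    rng = random.Random(seed)
    names = [f"col_{i}" for i in range(n_cols)]  # placeholder column names
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(names)  # header row
        for _ in range(n_rows):
            writer.writerow([rng.random() for _ in names])
    return names


# write_wide_csv("test_file.csv")
```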
Thanks - good catch! Let's see if we can improve it.
PRs are welcome of course!
Hi, I also have a use case with a lot of columns. I tried to reproduce this issue in my environment and observed that from vaex 4.14 onward, even arrow and parquet exports are much slower.
I used Python 3.9.18 on Ubuntu 22.04 and Windows 10. I installed vaex with conda from the conda-forge channel.
With vaex 4.13, only the HDF5 export is slow. (Sorry for pasting the 4.12.0 results. I copied from the wrong terminal.)
In [1]: import vaex
In [2]: vaex.__version__
Out[2]:
{'vaex-core': '4.12.0',
'vaex-viz': '0.5.4',
'vaex-hdf5': '0.12.3',
'vaex-server': '0.8.1',
'vaex-astro': '0.9.3',
'vaex-jupyter': '0.8.1',
'vaex-ml': '0.18.1'}
In [3]: df = vaex.open("/tmp/test_file.csv")
In [4]: df.export_arrow("test_file.arrow", progress=True)
export(arrow) [########################################] 100.00% elapsed time : 0.24s = 0.0m = 0.0h
In [5]: df.export_parquet("test_file.parquet", progress=True)
export(arrow) [########################################] 100.00% elapsed time : 0.31s = 0.0m = 0.0h
In [6]: df.export_hdf5("test_file.hdf5", progress=True)
export(hdf5) [########################################] 100.00% elapsed time : 140.48s = 2.3m = 0.0h
But with vaex 4.14, arrow and parquet exports show a significant slowdown.
In [1]: import vaex
In [2]: vaex.__version__
Out[2]:
{'vaex-core': '4.14.0',
'vaex-viz': '0.5.4',
'vaex-hdf5': '0.13.0',
'vaex-server': '0.8.1',
'vaex-astro': '0.9.3',
'vaex-jupyter': '0.8.1',
'vaex-ml': '0.18.1'}
In [3]: df = vaex.open("/tmp/test_file.csv")
In [4]: df.export_arrow("test_file.arrow", progress=True)
export(arrow) [########################################] 100.00% elapsed time : 76.80s = 1.3m = 0.0h
In [5]: df.export_parquet("test_file.parquet", progress=True)
export(arrow) [########################################] 100.00% elapsed time : 79.64s = 1.3m = 0.0h
In [6]: df.export_hdf5("test_file.hdf5", progress=True)
export(hdf5) [########################################] 100.00% elapsed time : 274.33s = 4.6m = 0.1h
This means that we can't work around the slow HDF5 export of wide dataframes by using arrow or parquet. I would love to see this resolved because vaex seems like a good option for my use case.
Thanks,
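To compare versions without reading the numbers off the progress bar, the three exports can be timed with a small helper. This is a sketch; `time_calls` is a hypothetical name, and the commented usage assumes the same file names as in the sessions above.

```python
import time


def time_calls(calls):
    """Given {label: zero-argument callable}, return {label: elapsed seconds}."""
    timings = {}
    for label, fn in calls.items():
        start = time.perf_counter()
        fn()
        timings[label] = time.perf_counter() - start
    return timings


# import vaex
# df = vaex.open("test_file.csv")
# print(time_calls({
#     "arrow": lambda: df.export_arrow("test_file.arrow"),
#     "parquet": lambda: df.export_parquet("test_file.parquet"),
#     "hdf5": lambda: df.export_hdf5("test_file.hdf5"),
# }))
```

Running the same script in each environment gives one dictionary per vaex version to compare.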