ParquetFiles.jl
Reading Parquet to DataFrame is slow
Reading a parquet file into a DataFrame is ~170× slower than reading the same data with CSV.read. I'm not sure I can help improve performance, but this is limiting my use of ParquetFiles.jl.
MWE:
(@v1.4) pkg> st
Status `~/.julia/environments/v1.4/Project.toml`
[6e4b80f9] BenchmarkTools v0.5.0
[336ed68f] CSV v0.6.2
[a93c6f00] DataFrames v0.21.2
[626c502c] Parquet v0.4.0
[46a55296] ParquetFiles v0.2.0
using ParquetFiles, BenchmarkTools, CSV, DataFrames
CSV.read("data.csv")
DataFrame(load("data.parquet"))
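As a side note, one way to probe which access pattern the loaded object advertises is via Tables.jl. This is just a sketch, and it assumes the object returned by load participates in Tables.jl at all (ParquetFiles may instead expose the TableTraits/IterableTables interface, in which case both checks simply return false):
using Tables, ParquetFiles

tbl = load("data.parquet")
# Tables.jl falls back to `false` for types that don't opt in,
# so these calls are safe on any object.
Tables.istable(typeof(tbl))       # does it participate in Tables.jl?
Tables.columnaccess(typeof(tbl))  # does it advertise whole-column access?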
Loading times for ParquetFiles:
@benchmark DataFrame(load("data.parquet"))
BenchmarkTools.Trial:
memory estimate: 45.66 MiB
allocs estimate: 961290
--------------
minimum time: 287.492 ms (0.00% GC)
median time: 290.843 ms (0.00% GC)
mean time: 296.344 ms (1.64% GC)
maximum time: 326.041 ms (8.46% GC)
--------------
samples: 17
evals/sample: 1
Loading times for CSV:
@benchmark CSV.read("data.csv")
BenchmarkTools.Trial:
memory estimate: 758.14 KiB
allocs estimate: 2299
--------------
minimum time: 1.690 ms (0.00% GC)
median time: 1.735 ms (0.00% GC)
mean time: 1.772 ms (1.43% GC)
maximum time: 14.096 ms (63.93% GC)
--------------
samples: 2817
evals/sample: 1
For comparison, pandas:
import pandas as pd
%timeit pd.read_parquet("data.parquet")
# 3.61 ms ± 25.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.read_csv("data.csv")
# 4.73 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The data are included in the zip file: data.zip
I think one of the reasons is that ParquetFiles.jl doesn't implement the Tables.columns interface, which makes DataFrame(...) fall back to row-by-row appending.
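For reference, here is a minimal sketch of what opting into column access looks like in Tables.jl, using a toy type rather than ParquetFiles internals. A source that implements these methods lets DataFrame take each column wholesale instead of appending rows one at a time:
using Tables, DataFrames

# Toy columnar source, purely for illustration.
struct ToyColumns
    names::Vector{Symbol}
    cols::Vector{AbstractVector}
end

Tables.istable(::Type{ToyColumns}) = true
Tables.columnaccess(::Type{ToyColumns}) = true
Tables.columns(t::ToyColumns) = t  # already columnar; return as-is
Tables.columnnames(t::ToyColumns) = t.names
Tables.getcolumn(t::ToyColumns, i::Int) = t.cols[i]
Tables.getcolumn(t::ToyColumns, nm::Symbol) = t.cols[findfirst(==(nm), t.names)]

# DataFrame can now take whole columns instead of building them row by row:
DataFrame(ToyColumns([:a, :b], [[1, 2, 3], [4.0, 5.0, 6.0]]))
If ParquetFiles kept decoded column chunks around, wiring them into methods like these would presumably avoid most of the per-row allocations seen in the benchmark above.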