odbc2parquet
Quick benchmark
With default options I made the following measurements using a 10M row table with a few int32, int64, double, date, and timestamp columns (so no string or other exotic types).
| tool | file type | output size | execution time |
| --- | --- | --- | --- |
| MySQL | txt | 1.5 GB | 47 seconds |
| odbc2parquet | parquet | 344 MB | 54 seconds |
| turbodbc | pandas DF (in-memory) | — | 54 seconds |
IDK whether this is expected, but thought you might find it useful.
How did you generate the text file with MySQL?
And was the database a MySQL instance in all three cases?
It was the same (old) MySQL instance. The text is simply `mysql -e query` output piped to a file.
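For reference, the export was roughly of this shape (a sketch; the host, database, table, and query below are placeholders, not the ones actually used):

```
# Run the query non-interactively and pipe the tab-separated output to a file
mysql -h myhost mydb -e "SELECT * FROM my_table" > output.txt
```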
The unixodbc, mysql client, turbodbc and odbc2parquet were installed from conda-forge (cargo coming from the rust dev channel), so everything is up to date.
I can make more scientific benchmarks (or even add them to this repo to compare).
Cool, thanks! I wonder how `odbcsv` would perform against the piped text output? That should be more comparable than with `odbc2parquet`. I also wonder how `odbc2parquet` performs against turbodbc -> pandas -> parquet.
I don't know which of these comparisons with other tech stacks makes sense to include in the documentation, but I am all for a benchmark suite (I usually use criterion).
What do you think about https://github.com/sharkdp/hyperfine for e2e benchmarks? Criterion is nice for Rust lib benchmarking, but it wouldn't allow us to track the performance compared to a CLI baseline (with odbcsv, turbodbc or the plain mysql CLI client).
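Something along these lines could be a starting point (a rough sketch; the connection string, query, and file names are placeholders, and the exact flags may need adjusting to the installed tool versions):

```
# End-to-end wall-time comparison of the mysql dump vs. odbc2parquet
hyperfine --warmup 1 --runs 5 \
  'mysql mydb -e "SELECT * FROM my_table" > out.txt' \
  'odbc2parquet query --connection-string "$CONN" out.parquet "SELECT * FROM my_table"'
```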
Have never used it. Yet it seems nice enough and the license checks out. I think criterion's primary use case is catching performance regressions. Comparing a bunch of different tools is a different usage scenario. Both have their uses. From your suggestion I gather that you are more interested in doing the latter.
How do you want to go about this? Do you want to test a bunch of stuff once and just publish the results? Or do you want to check a reproducible setup into this repository? What do we actually want to compare?
I am curious what we will learn from these benchmarks. So far the eerily similar numbers between turbodbc and `odbc2parquet` indicate that the bottleneck is transmitting the records over the network.
I'd be interested in learning what could be achieved by playing around with the `batch-size` parameter. Is there a point of diminishing returns, where the speedup does not warrant the additional memory usage? Is there a point after which it affects the total speed negatively?
I can see many use cases for a nice benchmarking setup. I am looking forward to it, should you wish to provide one.
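One way to explore that question would be a parameter sweep with hyperfine, roughly like this (a sketch; the `--batch-size` flag name, the value range, and the connection string/query are assumptions to be checked against the actual odbc2parquet CLI):

```
# Sweep the batch size; hyperfine substitutes {size} into each run
hyperfine --runs 3 -L size 1000,10000,100000,1000000 \
  'odbc2parquet query --batch-size {size} --connection-string "$CONN" out.parquet "SELECT * FROM my_table"'
```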
I played with one of my larger tables (30M rows) a little bit. The query execution itself seems to be quick, e.g. if I add `limit 1000000` it completes in 5 seconds, fetching included.
So far I collected this:
- Prints `ODBC env created`. It starts with 20-25% CPU usage.
- After 1 min it prints the column info (11 columns: int, bigint, timestamp; but I tried without the timestamp as well). Row size 220 bytes, batch size 10M.
- Switches to 100% CPU usage; one fetch seems to take 12 seconds.
- Still at 100% CPU usage, writing columns takes 10 seconds (so 1 second per column).
- The complete workload finishes in 2 minutes (repeating the fetch/write steps above 4x), user time 70 seconds, sys time 7 seconds.
This definitely looks CPU bound. If I turn off compression it's down to 1m38s (and user time down to 50s), but CPU usage is still at 100% (only one core used out of many). Setting the batch size to 1M doesn't really help: the column info is printed earlier, but the end-to-end time is the same.
How should I proceed in the investigation? Is this helpful?
`odbcsv` is similar in terms of performance, 1m50s with 1m user time. If I understand correctly it's 5000 rows per batch/fetch. Still 100% CPU usage continuously.
Hi @alippai ,
thanks for the investigation and the numbers. Increasing the fetch size has diminishing returns. If you assume a fixed overhead per batch, increasing the batch size from 1 to 100 cuts that overhead to 1/100. Going from there to a fetch size of 10,000 cuts it by 1/100 again, but the overhead you are reducing is already much smaller.
As ODBC batch fetch size == Parquet row group size, you should probably choose the value so you get the desired output row group size. As long as we don't go row by row, the IO overhead per batch should not be too bad.
The above equality is not an accident, however. Having the row group size equal to the fetch size allows passing the non-nullable, binary-identical buffers directly to parquet without copying them in memory. This optimization is currently implemented for inserting, but not for fetching data. Doing that would at least save some CPU cycles, depending on the use case.
I think you even mentioned the optimization already earlier, in another issue. ;-)
Cheers, Markus
Just another thought: as you said, the performance of `odbcsv` is similar. The part that writes the data is different in `odbc2parquet` and `odbcsv`. The work done by the ODBC driver is roughly the same (with the notable difference that `odbcsv` requests all columns as strings). My guess is that the majority of the work happens within the ODBC driver.
I'm afraid the ODBC driver is the bottleneck. Is there an easy way to profile the run? Or are there good entry points in odbc-sys/the API to collect some basic perf stats? E.g. at the end of the job it could write a debug message: "receiving data 3s, time spent in ODBC driver 1s, odbc2parquet 0.5s, writing parquet 3s".
In terms of this tool, we could always add more logging, but I think it won't be very enlightening. If you can build your ODBC driver with profiling information you can get a flamegraph, or something like that. Yet at that point we would have stretched even the most generous definition of "easy way". I am currently unaware of profiling support in the ODBC API.
Sorry, I don't have better news.
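For what it's worth, on Linux a whole-process profile can be captured from the outside, roughly like this (a sketch; it assumes `perf` and Brendan Gregg's FlameGraph scripts are installed, the connection string and query are placeholders, and driver frames only show up if the driver ships with symbols):

```
# Sample stacks for the whole run, then fold them into a flamegraph SVG
perf record -g -- odbc2parquet query --connection-string "$CONN" out.parquet "SELECT * FROM my_table"
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > odbc2parquet.svg
```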
No problem, I had to ask as you are super-experienced in this field. I'll let you know if I find anything useful.
After seeing @alippai's benchmark, it performs the same as the others? My focus was more on performance than on the size/format of the output file.
Option 1 seems better than this tool, see attached below. @pacman82 should we maybe evaluate this tool for larger tables with strings, where it might make a difference? Sorry if I have misunderstood.
Hello @vikramhn ,
feel free to evaluate and choose any tool you like for your use case. This `odbc2parquet` is only concerned with ODBC <-> Parquet. As such I might be interested in benchmarks regarding `odbc2parquet` itself, comparing it to previous versions of itself. It would also be interesting to know how it competes with other ways of fetching parquet from a database. So far I feel, though, that `odbc2parquet` is competitive, if not the best.
Personally I do not have the bandwidth to worry about the performance of other tools. If you want to do so here and share the results of your findings, you are welcome to do so, however.
Cheers, Markus
Closing this issue. I hope this does not discourage anyone from sharing benchmarks or interesting findings.