odbc2parquet
Quick benchmark
With default options I made the following measurements using a 10M row table with a few int32, int64, double, date, and timestamp columns (so no string or other exotic types).
| tool | file type | output size | execution time |
| --- | --- | --- | --- |
| MySQL | txt | 1.5 GB | 47 seconds |
| odbc2parquet | parquet | 344 MB | 54 seconds |
| turbodbc | pandas DF (in-memory) | — | 54 seconds |
IDK whether this is expected, but thought you might find it useful.
How did you generate the text file with MySQL?
And was the database a MySQL instance in all three cases?
It was the same (old) MySQL instance. The text is simply `mysql -e query` output piped to a file.
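For reference, the export was roughly of this shape (a sketch; the host, database, table, and query below are placeholders, not the ones actually used):

```
# Run the query non-interactively and pipe the tab-separated output to a file
mysql -h myhost mydb -e "SELECT * FROM my_table" > output.txt
```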
The unixodbc, mysql client, turbodbc and odbc2parquet were installed from conda-forge (cargo coming from the rust dev channel), so everything is up to date.
I can make more scientific benchmarks (or even add them to this repo to compare).
Cool, thanks! I wonder how `odbcsv` would perform against the piped text output? That should be more comparable than with `odbc2parquet`. I also wonder how `odbc2parquet` performs against turbodbc -> pandas -> parquet.
I don't know which of these comparisons with other tech stacks makes sense to include in the documentation, but I am all for a benchmark suite (I usually use criterion).
What do you think about https://github.com/sharkdp/hyperfine for e2e benchmarks? Criterion is nice for Rust lib benchmarking, but it wouldn't allow us to track the performance compared to a CLI baseline (with odbcsv, turbodbc or the plain mysql CLI client).
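Something along these lines could be a starting point (a rough sketch; the connection string, query, and file names are placeholders, and the exact flags may need adjusting to the installed tool versions):

```
# End-to-end wall-time comparison of the mysql dump vs. odbc2parquet
hyperfine --warmup 1 --runs 5 \
  'mysql mydb -e "SELECT * FROM my_table" > out.txt' \
  'odbc2parquet query --connection-string "$CONN" out.parquet "SELECT * FROM my_table"'
```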
Have never used it. Yet it seems nice enough and the license checks out. I think criterion's primary use case is catching performance regressions. Comparing a bunch of different tools is a different usage scenario. Both have their uses. From your suggestion I gather that you are more interested in doing the latter.
How do you want to go about this? Do you want to test a bunch of stuff once and just publish the results? Or do you want to check a reproducible setup into this repository? What do we actually want to compare?
I am curious what we will learn from these benchmarks. So far the eerily similar numbers between turbodbc and `odbc2parquet` indicate that the bottleneck is transmitting the records over the network.
I'd be interested in learning what could be achieved by playing around with the `batch-size` parameter. Is there a point of diminishing returns, where the speedup does not warrant the additional memory usage? Is there a point after which it affects the total speed negatively?
I can see many use cases for a nice benchmarking setup. I am looking forward to it, should you wish to provide one.
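One way to explore that question would be a parameter sweep with hyperfine, roughly like this (a sketch; the `--batch-size` flag name, the value range, and the connection string/query are assumptions to be checked against the actual odbc2parquet CLI):

```
# Sweep the batch size; hyperfine substitutes {size} into each run
hyperfine --runs 3 -L size 1000,10000,100000,1000000 \
  'odbc2parquet query --batch-size {size} --connection-string "$CONN" out.parquet "SELECT * FROM my_table"'
```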
I played with one of my larger tables (30M rows) a little bit. The query execution itself seems to be quick, e.g. if I add `limit 1000000` it completes in 5 seconds, fetching included.
So far I collected this:
- Prints `ODBC env created`. It starts with 20-25% CPU usage.
- After 1 min it prints the column info (11 columns: int, bigint, timestamp; but I tried without the timestamp as well). Row size 220 bytes, batch size 10M.
- Switches to 100% CPU usage; one fetch seems to take 12 seconds.
- Still at 100% CPU usage, writing columns takes 10 seconds (so 1 second per column).
- The complete workload finishes in 2 minutes (repeating the fetch/write steps above 4x), user time 70 seconds, sys time 7 seconds.
This definitely looks CPU bound. If I turn off compression it's down to 1m38s (and user time down to 50s), but CPU usage is still at 100% (only one core used out of many). Setting the batch size to 1M doesn't really help: the column info is printed earlier, but the end-to-end time is the same.
How should I proceed in the investigation? Is this helpful?
`odbcsv` is similar in terms of performance, 1m50s with 1m user time. If I understand correctly it's 5000 rows per batch/fetch. Still 100% CPU usage continuously.
Hi @alippai ,
thanks for the investigation and the numbers. Increasing the fetch size has diminishing returns. If you assume a fixed overhead per batch, increasing the batch size from 1 to 100 cuts that overhead to 1/100. Going from there to a fetch size of 10,000 cuts it by 1/100 again, but the overhead you are reducing is already much smaller.
As ODBC batch fetch size == Parquet row group size, you should probably choose the value so you get the desired output row group size. As long as we don't go row by row, the IO overhead per batch should not be too bad.
The above equality is not an accident, however. Having the row group size equal to the fetch size allows passing the non-nullable, binary-identical buffers directly to parquet without copying them in memory. This optimization is currently implemented for inserting, but not for fetching data. Doing that would at least save some CPU cycles, depending on the use case.
I think you even mentioned the optimization already earlier, in another issue. ;-)
Cheers, Markus
Just another thought: as you said, the performance of `odbcsv` is similar. The part that writes the data is different in `odbc2parquet` and `odbcsv`. The work done by the ODBC driver is roughly the same (with the notable difference that `odbcsv` requests all columns as strings). My guess is that the majority of the work happens within the ODBC driver.
I'm afraid the ODBC driver is the bottleneck. Is there an easy way to profile the run? Or are there good entry points in odbc-sys/the API to collect some basic perf stats? E.g. at the end of the job it could write a debug message: "receiving data 3s, time spent in ODBC driver 1s, odbc2parquet 0.5s, writing parquet 3s".
In terms of this tool, we could always add more logging, but I think it won't be very enlightening. If you can build your ODBC driver with profiling information you can get a flamegraph, or something like that. Yet at that point we would have stretched even the most generous definition of "easy way". I am currently unaware of profiling support in the ODBC API.
Sorry, I don't have better news.
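For what it's worth, on Linux a whole-process profile can be captured from the outside, roughly like this (a sketch; it assumes `perf` and Brendan Gregg's FlameGraph scripts are installed, the connection string and query are placeholders, and driver frames only show up if the driver ships with symbols):

```
# Sample stacks for the whole run, then fold them into a flamegraph SVG
perf record -g -- odbc2parquet query --connection-string "$CONN" out.parquet "SELECT * FROM my_table"
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > odbc2parquet.svg
```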
No problem, I had to ask as you are super-experienced in this field. I'll let you know if I find anything useful.
After seeing @alippai's benchmark, it performs the same as the others? My focus was more on performance than on the size/format of the output file.
Option 1 seems better than this tool, see attached below. @pacman82 should we maybe evaluate this tool for larger tables with strings, where it might make a difference? Sorry if I have misunderstood.
Hello @vikramhn ,
feel free to evaluate and choose any tool you like for your use case. This `odbc2parquet` is only concerned with ODBC <-> Parquet. As such I might be interested in benchmarks regarding `odbc2parquet` itself, comparing it to previous versions of itself. It would also be interesting to know how it competes with other ways of fetching parquet from a database. So far I feel, though, that `odbc2parquet` is competitive, if not the best.
Personally I do not have the bandwidth to worry about the performance of other tools. If you want to do so here and share the results of your findings, you are welcome to do so, however.
Cheers, Markus
Closing this issue. I hope this does not discourage anyone from sharing benchmarks or interesting findings.