
COPY TO STDOUT support

Open dmitry-sinina opened this issue 5 months ago • 5 comments

Thank you for a great tool.

After some time of use we ran into performance problems with big tables (~5 GB in our case): pg_dump takes 5 minutes to dump the table, but pg2parquet requires more than 2 hours.

Maybe it is possible to use PostgreSQL's COPY TO STDOUT mechanism when a custom query is not required? It looks like the problem is caused by slow execution of `SELECT * FROM ...;`.

dmitry-sinina avatar Jul 18 '25 15:07 dmitry-sinina

Thanks for the suggestion. I tried to implement support for COPY TO STDOUT, and yes, it's possible (and quite easy) to do. However, in my tests I see no performance benefit. Your tables obviously have a different structure, so could you please check whether the version in https://github.com/exyi/pg2parquet/pull/34 performs better? (Build instructions: `cargo build --release` in the `cli` directory; or download binaries from the GitHub Actions artifacts.)

I have a 7 GiB (compressed) table and it runs in under 4 minutes. Something related to your table schema must be going wrong; it is quite possible that some data types have a very inefficient implementation in pg2parquet. Can you please also check whether the majority of CPU time is spent in pg2parquet or in the postgres server?


If you want, you could run pg2parquet under a profiler, on Linux it's:

perf record -F 997 --call-graph=dwarf ./target/release/pg2parquet export {your-settings}
perf script | inferno-collapse-perf | inferno-flamegraph > flamegraph.svg

Just stop it after a minute or so; perf generates a ton of data. You can then send me the flamegraph for investigation (it can't contain any private info, only pg2parquet function names).

exyi avatar Aug 06 '25 23:08 exyi

Hi. I found that my initial report was wrong: it was not a performance issue but a bug in our software, which calls pg2parquet and was not reading stderr properly when a lot of data was written there. So pg2parquet was simply blocked on the stderr write.
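The blocking described above is the classic pipe-buffer deadlock: if the parent never drains a pipe, the child blocks as soon as the OS buffer (typically ~64 KiB) fills up. A minimal sketch of the fix, draining stdout and stderr on separate threads (Python and the self-spawned child process are illustrative assumptions, not the actual caller):

```python
import subprocess
import sys
import threading

def run_capture(cmd):
    """Run cmd, draining stdout and stderr concurrently so the child
    never blocks on a full pipe buffer."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out_chunks, err_chunks = [], []

    def drain(pipe, sink):
        # Read until EOF; running this in a thread keeps both pipes flowing.
        for chunk in iter(lambda: pipe.read(65536), b""):
            sink.append(chunk)

    threads = [threading.Thread(target=drain, args=(proc.stdout, out_chunks)),
               threading.Thread(target=drain, args=(proc.stderr, err_chunks))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    rc = proc.wait()
    return b"".join(out_chunks), b"".join(err_chunks), rc

# Hypothetical stand-in for pg2parquet: a child that writes 1 MiB to stderr,
# far more than the pipe buffer, before finishing its stdout output.
child = [sys.executable, "-c",
         "import sys; sys.stderr.write('x' * (1 << 20)); sys.stdout.write('done')"]
out, err, rc = run_capture(child)
```

Reading both pipes sequentially (stdout first, then stderr) would deadlock on this child; the concurrent drain completes normally.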

But anyway, COPY is a good feature to have: it should reduce the load on the PostgreSQL server. I will compare #34 with the current implementation and post my results here.

dmitry-sinina avatar Aug 08 '25 12:08 dmitry-sinina

Version with SELECT:

[0:10:13.118]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 56_344 rows/s, 63.44 MiB/s                   
[0:10:17.243]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 55_967 rows/s, 63.02 MiB/s                   
[0:10:13.058]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 56_349 rows/s, 63.45 MiB/s                   
[0:10:18.673]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 55_838 rows/s, 62.87 MiB/s                   
[0:10:19.152]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 55_795 rows/s, 62.83 MiB/s                   
[0:10:12.636]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 56_388 rows/s, 63.49 MiB/s                   
[0:10:16.206]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 56_061 rows/s, 63.13 MiB/s                   
[0:10:18.566]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 55_848 rows/s, 62.89 MiB/s                   
[0:10:11.305]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 56_511 rows/s, 63.63 MiB/s                   
[0:10:16.284]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 56_054 rows/s, 63.12 MiB/s 

Version with COPY:

[0:10:07.689]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 56_847 rows/s, 64.01 MiB/s                   
[0:10:04.343]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 57_162 rows/s, 64.36 MiB/s                   
[0:10:17.626]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 55_933 rows/s, 62.98 MiB/s                   
[0:10:20.604]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 55_664 rows/s, 62.68 MiB/s                   
[0:10:11.764]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 56_468 rows/s, 63.58 MiB/s                   
[0:10:13.498]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 56_309 rows/s, 63.40 MiB/s                   
[0:10:12.417]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 56_408 rows/s, 63.52 MiB/s                   
[0:10:09.244]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 56_702 rows/s, 63.85 MiB/s                   
[0:10:15.629]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 56_114 rows/s, 63.19 MiB/s                   
[0:10:15.065]: 34_545_425 rows, 38_898 MiB raw, 5_391 MiB parquet, 78 groups | 56_165 rows/s, 63.24 MiB/s  
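Averaging the per-run throughputs above quantifies how close the two variants are; a quick sketch (values copied from the logs):

```python
# Mean rows/s over the ten runs of each variant, taken from the logs above.
select_rps = [56344, 55967, 56349, 55838, 55795, 56388, 56061, 55848, 56511, 56054]
copy_rps   = [56847, 57162, 55933, 55664, 56468, 56309, 56408, 56702, 56114, 56165]

mean_select = sum(select_rps) / len(select_rps)
mean_copy = sum(copy_rps) / len(copy_rps)
# Relative difference of COPY vs SELECT, in percent.
diff_pct = 100 * (mean_copy - mean_select) / mean_select
print(f"SELECT: {mean_select:.1f} rows/s, COPY: {mean_copy:.1f} rows/s, "
      f"diff: {diff_pct:+.2f}%")
```

The difference comes out well under 1%, i.e. within run-to-run noise.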

dmitry-sinina avatar Aug 14 '25 08:08 dmitry-sinina

Thanks for testing it out. In my experiments, it wasn't faster either. However, the bottleneck is usually pg2parquet: postgres can stream the data faster than pg2parquet can process it. I'm still wondering whether COPY TO STDOUT lightens the postgres load, which might be quite relevant for busy DB servers, so I'll leave this open for now, but I'm hesitant to merge support for this without evidence that it's helpful.

exyi avatar Aug 20 '25 21:08 exyi

> I'm still wondering if COPY TO STDOUT lightens the postgres load

I will try to run additional tests comparing the CPU time of the PostgreSQL worker processes.

dmitry-sinina avatar Aug 24 '25 18:08 dmitry-sinina