
Add CopyOut! interface for >5x faster data queries


Hi! I often run SELECT queries that return a large dataset, and the standard method takes a few seconds:

using LibPQ, BenchmarkTools, DataFrames

conn = LibPQ.Connection("dbname=postgres user=postgres")

result = execute(conn, """
    CREATE TEMPORARY TABLE libpqjl_test (
        no_nulls    varchar(10),
        yes_nulls   varchar(10)
    );
""")

result = execute(conn, """
    INSERT INTO libpqjl_test (no_nulls, yes_nulls)
    SELECT
        left(md5(i::text), 10),
        left(md5(random()::text), 10)
    FROM generate_series(1, 1000000) s(i);
""")

@benchmark execute(conn, "SELECT * FROM libpqjl_test;") |> DataFrame

Time  (median):     2.977 s               ┊ GC (median):    13.03%

PostgreSQL supports the COPY TO API, and I found it to be faster, so I implemented it! Use ] add https://github.com/cmey/LibPQ.jl#copy-out to try it out. (LibPQ already supports the COPY FROM API for fast ingestion, which I took heavy inspiration from.)
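For reference, the existing COPY FROM ingestion path, per the LibPQ.jl docs, looks roughly like the sketch below (the row iterator is a simplified stand-in, not the actual docs example). The new COPY OUT usage and its timing follow right after.

# Sketch of the existing COPY FROM (fast ingestion) path that CopyOut! mirrors;
# the row strings here are illustrative only.
using LibPQ

row_strings = ("row$(i),val$(i)\n" for i in 1:3)  # any iterator of CSV-formatted row strings
copyin = LibPQ.CopyIn("COPY libpqjl_test FROM STDIN (FORMAT CSV);", row_strings)
execute(conn, copyin)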

using CSV

@benchmark begin
    databuf = IOBuffer()
    copyout = LibPQ.CopyOut!(databuf, "COPY (SELECT * FROM libpqjl_test) TO STDOUT (FORMAT CSV, HEADER);")
    result = execute(conn, copyout)
    databuf |> CSV.File |> DataFrame
end

Time  (median):     590.818 ms               ┊ GC (median):    9.10%

Both methods return equal outputs:

isequal(execute(conn, "SELECT * FROM libpqjl_test;") |> DataFrame,
    begin databuf = IOBuffer()
        copyout = LibPQ.CopyOut!(databuf, "COPY (SELECT * FROM libpqjl_test) TO STDOUT (FORMAT CSV, HEADER);")
        result = execute(conn, copyout)
        databuf |> CSV.File |> DataFrame
    end)

true

The new API expects the user to pass a "string builder" kind of construct, in the form of an IOBuffer that, during execution, gets filled with the text of the CSV returned by the database. I found this didn't fit well with the table-style Result, since the output here is just a text blob.
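For convenience, the whole pattern can be wrapped in a small helper. This is just a sketch: copy_query_to_dataframe is a made-up name, and the seekstart call is a defensive assumption in case execution leaves the buffer position at the end.

using LibPQ, CSV, DataFrames

# Hypothetical convenience wrapper around the proposed CopyOut! interface.
function copy_query_to_dataframe(conn, query::AbstractString)
    databuf = IOBuffer()
    copyout = LibPQ.CopyOut!(databuf, "COPY ($query) TO STDOUT (FORMAT CSV, HEADER);")
    execute(conn, copyout)
    seekstart(databuf)  # rewind before reading, in case the position was left at the end
    return databuf |> CSV.File |> DataFrame
end

df = copy_query_to_dataframe(conn, "SELECT * FROM libpqjl_test")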

Note that the CSV API changed from CSV.jl 0.8 to 0.9, and that Julia < 1.3.0-1 loads the old CSV.jl; that's why there is a VERSION check in the test code, to keep the tests running on Julia 1.0.

I hope others will find this helpful! It certainly helped me: I see a >10x speed improvement in real-life use cases.

cmey · Dec 29 '21 16:12

@iamed2 Hi! I'm not sure how to fix the failing formatting CI step; it flags just about all source files in the repo, including many I haven't touched :open_mouth:

cmey · Dec 30 '21 09:12