COPY duckdb_table FROM STDIN
Right now, if you try to copy into a duckdb table (MotherDuck or temporary) you'll get an error. An example is the following pair of psql commands (\copy is a psql meta-command that gets translated into COPY ... FROM STDIN):
CREATE TEMP TABLE t(a int) USING duckdb;
\copy t FROM 'my_local_data.csv' (FORMAT 'csv', quote '"', header 1, DELIMITER '|');
You then get this error:
ERROR: 0A000: duckdb does not implement duckdb_slot_callbacks
A workaround for this issue is to first COPY into a regular Postgres heap table and then use CREATE TABLE ... AS to move the data into a duckdb table.
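A minimal sketch of that workaround, reusing the example table above (the staging table name is only illustrative):
CREATE TEMP TABLE t_staging(a int);
\copy t_staging FROM 'my_local_data.csv' (FORMAT 'csv', quote '"', header 1, DELIMITER '|');
CREATE TEMP TABLE t USING duckdb AS SELECT * FROM t_staging;
DROP TABLE t_staging;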
Related to #30 but different. #30 is about writing the results of a duckdb-only query into a Postgres heap table, while this issue is about writing Postgres input into a duckdb table.
Hi @JelteF, I am working on COPY FROM STDIN for pg_mooncake. Could you share any technical preference or direction for supporting copy_from_stdin in pg_duckdb, particularly if a full TAM implementation isn't planned? Thanks!
I think the only way to make COPY FROM STDIN work is by implementing more of the TAM interface. We haven't done that so far because it isn't needed for the rest of the functionality, not because we don't want to.
Thanks for the insight! I am considering redirecting COPY FROM STDIN to local temp files on the server, and then triggering a DuckDB query (COPY FROM 'file://local') to load data from those files. Async loading and cleanup mechanisms might also be needed to prevent overloading the server's local storage.
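To make that idea concrete, the DuckDB query triggered once a batch has been spooled to disk could look roughly like this (the temp file path is hypothetical, and the CSV options would be forwarded from the original COPY statement):
COPY t FROM '/tmp/pg_duckdb_copy_batch_0001.csv' (FORMAT csv, QUOTE '"', DELIMITER '|', HEADER);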
For the TAM approach, we might need to implement a TupleSlot cache to reduce the overhead of the pg_row->duckdb_chunk conversion. And we would lose parallelism on the DuckDB side, right?
With the TAM approach we'd indeed be constrained by Postgres' native COPY limitations: single-threaded (unless users actively open more connections) and no support for DuckDB-only formats like JSON and Parquet.
I think using a temp file has the big downside that it loses the "streaming" benefit that COPY FROM STDIN normally has, so the Postgres machine would need twice the storage to load the data. That seems like a big enough downside that I think we shouldn't go down that road.
Another approach could be to somehow stream the bytes directly into DuckDB's COPY implementation. I'm not sure that's possible currently; it might need some DuckDB changes. But that approach would allow loading Parquet files.
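If that worked, something like the following (not supported today; the format name is only a guess at how DuckDB's COPY options might be exposed) could load a Parquet file through the same \copy path:
\copy t FROM 'my_local_data.parquet' (FORMAT 'parquet');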
Ah, supporting the JSON/Parquet formats is a great idea!
For the temp file approach, we could make DuckDB load the data in mini-batches and asynchronously. Specifically, the main thread keeps streaming data to files and switches to a new file whenever the current one reaches a size threshold (e.g., 100 MB). Meanwhile, asynchronous DuckDB threads load these files and clean them up after processing. Storage usage should be manageable this way. However, this might only work for CSV, since that's the format where we know how to split the stream into batches (files).
Streaming CSV directly to DuckDB is definitely a better approach. It saves not only storage but also CPU and IO overhead.
P.S. I am not sure it is feasible to stream the Parquet format, since the reader typically requires seeking within the file to parse (decode) it. Worse, the file footer (i.e., the end of the stream) is read first to get the file metadata.