citus_docs
Evaluate updating Bulk Loading section's Note
We have the following note as part of the bulk loading section:
"There is no notion of snapshot isolation across shards, which means that a multi-shard SELECT that runs concurrently with a COPY might see it committed on some shards, but not on others. If the user is storing events data, he may occasionally observe small gaps in recent data. It is up to applications to deal with this if it is a problem (e.g. exclude the most recent data from queries, or use some lock).
If COPY fails to open a connection for a shard placement then it behaves in the same way as INSERT, namely to mark the placement(s) as inactive unless there are no more active placements. If any other failure occurs after connecting, the transaction is rolled back and thus no metadata changes are made."
I had three questions on this section:
- When the user runs `COPY`, Citus currently uses transactions to commit or roll back batch loading of data. I ran multiple `COPY` operations and a concurrent `SELECT count(*) FROM github_events;` and I saw transactional behavior here. Are we worried about the window where parallel commits across machines take time to complete? Isn't that a small window?
- The first and second paragraphs in this note seem unrelated. Do we have two notes?
- Do we want to document `\COPY` or `COPY`? PostgreSQL's documentation generally talks about `COPY`. That said, `\COPY` is more convenient to use.
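For reference, here is a rough sketch of what the note's "exclude the most recent data from queries" workaround and the two load commands could look like. It assumes `github_events` has a `created_at` timestamp column and that events are loaded shortly after they occur; the one-minute margin and both file paths are placeholders, not values taken from the docs.

```sql
-- Workaround from the note: skip the newest rows so a COPY that is still
-- committing on some shards cannot show up as a gap in the result.
-- Assumption: created_at exists and roughly tracks load time.
-- The one-minute margin is an arbitrary placeholder.
SELECT count(*)
FROM github_events
WHERE created_at < now() - interval '1 minute';

-- psql meta-command: psql reads the file on the client machine and
-- streams it to the server. The file name is hypothetical.
\copy github_events FROM 'github_events.csv' WITH (FORMAT csv)

-- Server-side COPY: the path is resolved on the database server that
-- receives the command. The path is hypothetical.
COPY github_events FROM '/path/on/server/github_events.csv' WITH (FORMAT csv);
```

The practical difference in plain PostgreSQL is where the file lives and whose file permissions apply, which may be why `\COPY` feels more convenient from a client machine.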
(I'm scanning these issues to see which are still relevant, and can confirm that this note still exists in https://docs.citusdata.com/en/v7.3/dist_tables/dml.html#bulk-loading )
@onderkalaci do you know whether the warnings in https://docs.citusdata.com/en/v8.3/develop/reference_dml.html#copy-command-bulk-load are still accurate?