timescaledb-parallel-copy

add some details to give the reader expectations

Open SvenDowideit opened this issue 4 years ago • 2 comments

I presume there is a really good reason this is better than using COPY .... FROM stdin CSV; - but it's not instantly obvious from the README

ie - is it 20 times faster?

it is equivalent to running 20 psql COPY FROMs after splitting a single CSV.

SvenDowideit avatar Dec 02 '20 01:12 SvenDowideit

A COPY is transactional and single-threaded, so the "parallel" tool lets us parallelize the load over many workers. You could emulate this yourself, but note that your parallel tool should try to send data from similar time ranges at the same time.

That is, you don't want to split a single CSV covering 1 year into 12 months and then try to insert each month in parallel -- they'll thrash each other to disk. Rather, this tool effectively "round robins" the original CSV, so that the parallel inserts all arrive in loose time order. Memory management is then much more effective, which gives better ingest at larger scale.
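
To make the difference concrete, here is a rough Python sketch of the two ways of handing a time-ordered CSV to workers. It is only an illustration, not the tool's actual code: the function names (`batches`, `round_robin_assign`, `contiguous_assign`) are made up for this example, and the rows are plain integers standing in for time-ordered CSV lines.

```python
from itertools import islice


def batches(rows, batch_size):
    """Yield consecutive batches of at most batch_size rows, in file order."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def round_robin_assign(rows, batch_size, workers):
    """Deal consecutive batches out to the workers in turn.

    At any moment all workers hold batches from neighbouring positions
    in the file, so they insert into roughly the same time range.
    """
    queues = [[] for _ in range(workers)]
    for i, batch in enumerate(batches(rows, batch_size)):
        queues[i % workers].append(batch)
    return queues


def contiguous_assign(rows, workers):
    """Naive split: each worker gets one contiguous slice (e.g. one month each)."""
    rows = list(rows)
    size = -(-len(rows) // workers)  # ceiling division
    return [rows[i * size:(i + 1) * size] for i in range(workers)]


if __name__ == "__main__":
    rows = list(range(12))  # hypothetical stand-in for 12 time-ordered rows
    print(round_robin_assign(rows, batch_size=2, workers=3))
    print(contiguous_assign(rows, workers=3))
```

With the round-robin layout, the first batches taken by the three workers are rows 0-1, 2-3, and 4-5, which sit next to each other in time. With the contiguous layout the workers start at rows 0, 4, and 8, so each one writes to a different time range from the very first insert.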

mfreed avatar Dec 02 '20 01:12 mfreed

gosh, that too, is a gem that would be good to add to the top of the readme

SvenDowideit avatar Dec 02 '20 01:12 SvenDowideit