timescaledb-parallel-copy
add some details to give the reader expectations
I presume there is a really good reason this is better than using COPY .... FROM stdin CSV;
- but it's not immediately obvious from the readme
i.e. is it 20 times faster?
It is equivalent to running 20 psql COPY FROMs after splitting a single CSV.
A COPY is transactional and single-threaded, so the "parallel" tool lets us spread the load across many workers. You could emulate this yourself, but note that your parallel tool should try to send data from similar time ranges at the same time.
That is, you don't want to split a single CSV covering 1 year into 12 months and then try to insert each month in parallel: they'll thrash each other to disk. Rather, this tool effectively does an almost "round robin" pass over the original CSV, so that the parallel inserts all arrive in loose time order and memory management is much more effective, which gives better ingest at larger scale.
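To make the difference concrete, here is a minimal Python sketch of the two ways of dividing a time-ordered CSV among workers. It only illustrates the idea described above and is not the tool's actual implementation. The function names (`round_robin_assign`, `contiguous_assign`) and the integer "rows" standing in for timestamps are assumptions made for the example.

```python
from itertools import islice


def batches(rows, batch_size):
    """Yield consecutive fixed-size batches from a time-ordered iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def round_robin_assign(rows, workers, batch_size):
    """Hand batch i to worker i % workers, so that at any moment the
    workers are inserting batches from neighbouring time ranges."""
    queues = [[] for _ in range(workers)]
    for i, batch in enumerate(batches(rows, batch_size)):
        queues[i % workers].append(batch)
    return queues


def contiguous_assign(rows, workers):
    """Naive split: each worker gets one large contiguous slice
    (the "one month per worker" approach)."""
    rows = list(rows)
    chunk = -(-len(rows) // workers)  # ceiling division
    return [rows[i * chunk:(i + 1) * chunk] for i in range(workers)]


# Integers stand in for rows sorted by timestamp (hypothetical data).
rows = list(range(12))
rr = round_robin_assign(rows, workers=3, batch_size=2)
contig = contiguous_assign(rows, workers=3)

# First thing each worker inserts under each scheme.
rr_first = [queue[0] for queue in rr]         # adjacent time ranges
contig_first = [part[0] for part in contig]   # widely separated times
print(rr_first, contig_first)
```

With the round-robin scheme the three workers start on rows 0 to 5, which are close together in time. With the contiguous split they start on rows 0, 4 and 8, so each worker is writing to a different time region from the outset.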
gosh, that too, is a gem that would be good to add to the top of the readme