Add sizes to Totals output table
It would be nice to add the following information to the "Finals" table below:
- size of the table data transferred over the network,
- data transfer rate,
- size of the indexes created,
- index creation rate (GB/sec),
- total database size.
This is a low-priority wish, just for convenience (a sketch of the relevant Postgres size functions follows the table).
```
Step                                           Connection    Duration   Concurrency
---------------------------------------------  ----------  ----------  ------------
Dump Schema                                        source       6s045             1
Prepare Schema                                     target       2s292             1
COPY, INDEX, CONSTRAINTS, VACUUM (wall clock)        both       3h28m         4 + 4
COPY (cumulative)                                    both      12h01m             4
Large Objects (cumulative)                           both        24ms             1
CREATE INDEX, CONSTRAINTS (cumulative)             target      21h36m             4
Finalize Schema                                    target       1h51m             1
---------------------------------------------  ----------  ----------  ------------
Total Wall Clock Duration                            both       5h19m         4 + 4
---------------------------------------------  ----------  ----------  ------------
```
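For reference, the figures I have in mind map onto standard Postgres size functions. A minimal sketch of fetching them, assuming Python with psycopg2 and a hypothetical connection string (this is not pgcopydb code):

```python
import psycopg2

# Hypothetical connection string; adjust for your environment.
conn = psycopg2.connect("dbname=target")
cur = conn.cursor()

# Total database size, one of the figures requested above.
cur.execute("SELECT pg_size_pretty(pg_database_size(current_database()))")
print("database size:", cur.fetchone()[0])

# Sum of table data and of index data over ordinary user tables.
cur.execute("""
    SELECT pg_size_pretty(sum(pg_table_size(c.oid))::bigint),
           pg_size_pretty(sum(pg_indexes_size(c.oid))::bigint)
      FROM pg_class c
      JOIN pg_namespace n ON n.oid = c.relnamespace
     WHERE c.relkind = 'r'
       AND n.nspname NOT IN ('pg_catalog', 'information_schema')
""")
tables, indexes = cur.fetchone()
print("table data:", tables, "index data:", indexes)
```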
That's a good comment, thanks. Most of the data mentioned is not available in pgcopydb yet, though, and requires extra setup and computation to fetch:
- the size of the data transferred requires counting bytes in our copy-in-copy-out buffered implementation,
- the data transfer rate is not static throughout the runtime, so we would need to handle chunked statistics and then display the slowest, fastest, average, median, .90, and .99 percentiles, I suppose (see the sketch after this list),
- the index size needs to be computed on the target database after creation, which is another way to slow down operations,
- the index creation rate is also not going to be constant throughout the runtime, and it might be interesting to group it by index access method (btree, gin, gist, hash, brin, etc.), as in the query sketch at the end of this comment,
- the total database size requires yet another call to a Postgres function, which consumes some resources.
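To make the first two bullets concrete, here is a minimal sketch of counting bytes per chunk in a COPY stream and deriving rate percentiles. It is written in Python with psycopg2 for brevity (pgcopydb's actual implementation is different), and the ChunkCounter name, DSN, and table name are illustrative assumptions:

```python
import statistics
import time

import psycopg2

class ChunkCounter:
    """File-like sink: tallies total bytes and per-chunk throughput."""
    def __init__(self):
        self.total = 0
        self.rates = []                  # bytes/sec, one sample per chunk
        self.last = time.monotonic()

    def write(self, data):
        now = time.monotonic()
        elapsed = now - self.last        # first sample includes query startup
        self.last = now
        self.total += len(data)
        if elapsed > 0:
            self.rates.append(len(data) / elapsed)
        return len(data)

conn = psycopg2.connect("dbname=source")           # hypothetical DSN
cur = conn.cursor()
sink = ChunkCounter()
cur.copy_expert("COPY some_table TO STDOUT", sink)  # hypothetical table

# statistics.quantiles needs at least two samples; n=100 gives percentiles.
qs = statistics.quantiles(sink.rates, n=100)
print(f"total bytes: {sink.total}")
print(f"slowest: {min(sink.rates):.0f} B/s, fastest: {max(sink.rates):.0f} B/s")
print(f"average: {statistics.fmean(sink.rates):.0f} B/s, "
      f"median: {statistics.median(sink.rates):.0f} B/s")
print(f"p90: {qs[89]:.0f} B/s, p99: {qs[98]:.0f} B/s")
```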
It's not trivial to guess the exact impact of computing all of that. As timings are already captured, I could also add a command to get a summary of the average times to build indexes (and rates), and I suspect adding the buffer size capture and computation to our COPY code would not slow it down too much.
Contributions are welcome, with an eye to doing the easy bits (in terms of perf impact) first!
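As an illustration of the grouping by access method mentioned above, a query along these lines could be run against the target catalogs after the migration; the psycopg2 wrapper and DSN are assumptions, while the catalog joins are standard:

```python
import psycopg2

conn = psycopg2.connect("dbname=target")   # hypothetical DSN
cur = conn.cursor()

# Index count and total on-disk size per index access method.
cur.execute("""
    SELECT am.amname,
           count(*) AS indexes,
           pg_size_pretty(sum(pg_relation_size(i.indexrelid))::bigint) AS size
      FROM pg_index i
      JOIN pg_class c ON c.oid = i.indexrelid
      JOIN pg_am am   ON am.oid = c.relam
     GROUP BY am.amname
     ORDER BY sum(pg_relation_size(i.indexrelid)) DESC
""")
for amname, count, size in cur.fetchall():
    print(f"{amname:8} {count:6} {size}")
```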
Hmm, your proposal of counting bytes during copy-in-copy-out is precise, but it looks too complicated. I meant a much simpler way: accumulate pg_relation_size() for each copied table, divide by the time (the cumulative COPY metric), and get an average data transfer rate (sketched below, after the P.S.). One figure for the whole copying session would be enough here. The same logic applies to indexes.
P.S.
The practical question behind this: if we've got, for example, a 1 Gbit network, what percentage of the bandwidth have we utilized?
Do we need to increase or decrease the number of copy jobs to utilize the bandwidth while avoiding contention?
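For illustration, a minimal sketch of that approximation, assuming psycopg2, a hypothetical DSN, and the cumulative COPY duration taken from the summary table above:

```python
import psycopg2

COPY_CUMULATIVE_SECS = 12 * 3600 + 60      # 12h01m from the summary table

conn = psycopg2.connect("dbname=source")   # hypothetical DSN
cur = conn.cursor()

# Sum the on-disk size of all ordinary user tables on the source.
cur.execute("""
    SELECT sum(pg_relation_size(c.oid))
      FROM pg_class c
      JOIN pg_namespace n ON n.oid = c.relnamespace
     WHERE c.relkind = 'r'
       AND n.nspname NOT IN ('pg_catalog', 'information_schema')
""")
total_bytes = cur.fetchone()[0]

# One average figure for the whole copying session.
rate = total_bytes / COPY_CUMULATIVE_SECS
print(f"average 'transfer' rate: {rate / 1024**2:.1f} MB/s")
```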
But it does not work that way. The number of bytes sent over the network is not the same as pg_relation_size() on the source, because of bloat (tuples on-disk that are not visible to our transaction). The size sent over the network also includes network protocol metadata, whereas pg_relation_size() on the target includes storage metadata of the target Postgres version. So the transfer rate can NOT be computed from the on-disk size of the relation, which is going to be different on the source and the target even with the exact same data set.
Then, on top of that, because of the concurrency aspects of pgcopydb, if your question is one of bandwidth capacity, it becomes very hard to answer with numbers computed by individual processes. You'd be better off with OS-level tooling here, such as dstat for instance.
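For the bandwidth question specifically, a crude OS-level sampler in the spirit of what dstat reports could look like this; it is Linux-only, and the interface name and link speed are assumptions:

```python
import time

def rx_tx_bytes(iface="eth0"):
    """Read cumulative RX/TX byte counters from /proc/net/dev (Linux)."""
    with open("/proc/net/dev") as f:
        for line in f:
            name, _, rest = line.partition(":")
            if name.strip() == iface:
                fields = rest.split()
                return int(fields[0]), int(fields[8])  # rx_bytes, tx_bytes
    raise ValueError(f"interface {iface!r} not found")

LINK_BITS_PER_SEC = 1_000_000_000    # 1 Gbit link, per the question above

rx0, tx0 = rx_tx_bytes()
time.sleep(5)                        # sample over a 5-second window
rx1, tx1 = rx_tx_bytes()

bits_per_sec = (rx1 - rx0 + tx1 - tx0) * 8 / 5
print(f"link utilization: {100 * bits_per_sec / LINK_BITS_PER_SEC:.1f}%")
```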
#503 contains a small portion of the changes proposed here.