timescaledb icon indicating copy to clipboard operation
timescaledb copied to clipboard

testing some stuff around distributed inserts

Open akuzm opened this issue 2 years ago • 1 comments

Changes

  1. Use nonblocking mode on COPY connections.
  2. Group the rows into batches on access node before sending them to data nodes.

Benchmarks

I'm using the UK Price Paid data set because that's what I had at hand: https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads

I'm benchmarking inserts of 10k, 100k and 1M rows, ordered either randomly or by time & postcode2 which is used for space partitioning. The '-1' and '-2' versions of files are just different random subsets of data, to test inserting into new chunks or existing chunks. I'm tesing distributed hypertable ordered by time, either with no space partitioning or partitioned by postcode2:

# \d uk_price_paid
             Table "public.uk_price_paid"
  Column   │   Type   │ Collation │ Nullable │ Default 
───────────┼──────────┼───────────┼──────────┼─────────
 price     │ integer  │           │          │ 
 date      │ date     │           │ not null │ 
 postcode1 │ text     │           │          │ 
 postcode2 │ text     │           │          │ 
 type      │ smallint │           │          │ 
 is_new    │ boolean  │           │          │ 
 duration  │ smallint │           │          │ 
 addr1     │ text     │           │          │ 
 addr2     │ text     │           │          │ 
 street    │ text     │           │          │ 
 locality  │ text     │           │          │ 
 town      │ text     │           │          │ 
 district  │ text     │           │          │ 
 country   │ text     │           │          │ 
 category  │ smallint │           │          │ 
Indexes:
    "uk_price_paid_date_idx" btree (date DESC)
    "uk_price_paid_postcode2_date_idx" btree (postcode2, date DESC)
Triggers:
    ts_insert_blocker BEFORE INSERT ON uk_price_paid FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()

# select create_hypertable('uk_price_paid_l', 'date');
# select create_distributed_hypertable('uk_price_paid_ns', 'date');
# select create_distributed_hypertable('uk_price_paid', 'date', 'postcode2');

I'm using the following, pretty generic, queries:

==> insert-binary.sql <==
begin;
\copy rides1 from '/home/akuzm/nyc_data_rides.pgbinary' with (format binary);
rollback;

==> insert-prices-100k-ordered.sql <==
begin;
copy uk_price_paid from '/home/akuzm/uk-price-paid/prices-100k-ordered-1.tsv';
rollback;

==> insert-prices-100k-random.sql <==
begin;
copy uk_price_paid from '/home/akuzm/uk-price-paid/prices-100k-random-1.tsv';
rollback;

==> insert-prices-10k-ordered-2.sql <==
begin;
copy uk_price_paid from '/home/akuzm/uk-price-paid/prices-10k-ordered-2.tsv';
rollback;

==> insert-prices-10k-ordered.sql <==
begin;
copy uk_price_paid from '/home/akuzm/uk-price-paid/prices-10k-ordered-1.tsv';
rollback;

==> insert-prices-10k-random.sql <==
begin;
copy uk_price_paid from '/home/akuzm/uk-price-paid/prices-10k-random-1.tsv';
rollback;

==> insert-prices-1M-ordered.sql <==
begin;
copy uk_price_paid from '/home/akuzm/uk-price-paid/prices-1M-ordered-1.tsv';
rollback;

==> insert-prices-l-10k-ordered-2.sql <==
begin;
copy uk_price_paid_l from '/home/akuzm/uk-price-paid/prices-10k-ordered-2.tsv';
rollback;

==> insert-prices-l-10k-ordered.sql <==
begin;
copy uk_price_paid_l from '/home/akuzm/uk-price-paid/prices-10k-ordered-1.tsv';
rollback;

==> insert-prices-l-10k-random.sql <==
begin;
copy uk_price_paid_l from '/home/akuzm/uk-price-paid/prices-10k-random-1.tsv';
rollback;

==> insert-prices-ns-10k-ordered-2.sql <==
begin;
copy uk_price_paid_ns from '/home/akuzm/uk-price-paid/prices-10k-ordered-2.tsv';
rollback;

==> insert-prices-ns-10k-ordered.sql <==
begin;
copy uk_price_paid_ns from '/home/akuzm/uk-price-paid/prices-10k-ordered-1.tsv';
rollback;

==> insert-prices-ns-10k-random.sql <==
begin;
copy uk_price_paid_ns from '/home/akuzm/uk-price-paid/prices-10k-random-1.tsv';
rollback;

==> insert.sql <==
begin;
\copy rides1 from '/home/akuzm/nyc_data_rides.tsv';
rollback;

Parts of this PR:

  • https://github.com/timescale/timescaledb/pull/4205
  • https://github.com/timescale/timescaledb/pull/4281
  • https://github.com/timescale/timescaledb/pull/4390
  • https://github.com/timescale/timescaledb/pull/4476
  • https://github.com/timescale/timescaledb/pull/4478
  • https://github.com/timescale/timescaledb/pull/4483
  • https://github.com/timescale/timescaledb/pull/4484
  • https://github.com/timescale/timescaledb/pull/4488

akuzm avatar Apr 29 '22 08:04 akuzm

Codecov Report

Merging #4285 (435c31c) into main (8bf6c88) will increase coverage by 0.01%. The diff coverage is 85.33%.

:exclamation: Current head 435c31c differs from pull request most recent head b1e2687. Consider uploading reports for the commit b1e2687 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4285      +/-   ##
==========================================
+ Coverage   90.79%   90.81%   +0.01%     
==========================================
  Files         221      221              
  Lines       41006    40937      -69     
==========================================
- Hits        37233    37178      -55     
+ Misses       3773     3759      -14     
Impacted Files Coverage Δ
tsl/src/remote/connection.c 88.37% <54.16%> (-0.95%) :arrow_down:
tsl/src/remote/dist_copy.c 90.99% <88.94%> (-2.39%) :arrow_down:
src/copy.c 94.49% <100.00%> (+0.02%) :arrow_up:
src/bgw/scheduler.c 82.89% <0.00%> (-3.54%) :arrow_down:
tsl/src/chunk_copy.c 92.90% <0.00%> (-0.51%) :arrow_down:
src/bgw/job.c 93.04% <0.00%> (-0.39%) :arrow_down:
src/cross_module_fn.c 69.28% <0.00%> (-0.20%) :arrow_down:
src/chunk.c 94.30% <0.00%> (-0.14%) :arrow_down:
src/process_utility.c 94.53% <0.00%> (-0.06%) :arrow_down:
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 8bf6c88...b1e2687. Read the comment docs.

codecov[bot] avatar May 03 '22 09:05 codecov[bot]

Everything integrated in other PRs.

akuzm avatar Oct 06 '22 13:10 akuzm