timescaledb-parallel-copy [Bug]: missing values on large inserts

What type of bug is this?

Data corruption

What subsystems and features are affected?

Data ingestion

What happened?

Upon inserting a large CSV of cryptocurrency data, certain rows are missing. timescaleIssue.zip

TimescaleDB version affected

2.6.1

PostgreSQL version used

12

What operating system did you use?

Ubuntu 20.04 LTS x86_64

What installation method did you use?

Deb/Apt

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

psql:testDB.sql:1: NOTICE:  extension "timescaledb" already exists, skipping
     create_hypertable
 -------------------------
  (8,public,test_table,t)
 (1 row)

                add_dimension
 --------------------------------------------
  (30,public,test_table,collection_window,t)
 (1 row)

             add_dimension
 --------------------------------------
  (31,public,test_table,roll_window,t)
 (1 row)

          add_dimension
 -------------------------------
  (32,public,test_table,pair,t)
 (1 row)

 Performing vanilla postgres COPY command...
 COPY 3748508
 Results:
  count
 -------
      1
 (1 row)

 Resetting DB
 psql:testDB.sql:1: NOTICE:  extension "timescaledb" already exists, skipping
     create_hypertable
 -------------------------
  (9,public,test_table,t)
 (1 row)

                add_dimension
 --------------------------------------------
  (34,public,test_table,collection_window,t)
 (1 row)

             add_dimension
 --------------------------------------
  (35,public,test_table,roll_window,t)
 (1 row)

          add_dimension
 -------------------------------
  (36,public,test_table,pair,t)
 (1 row)

 Performing timescaledb-parallel-copy command...
  ttaCOPY 3748509
 Results:
  count
 -------
      0
 (1 row)

How can we reproduce the bug?

Verify by unzipping the attached archive and running the "run" shell script. Out of caution, I included an obfuscated version of the crypto data in question. It has the same datatypes and the same dimensions as the original, unobfuscated csv. The script performs a vanilla postgres \COPY command on the csv and selects the row that I found to be missing when I discovered the bug (proving that COPY does, in fact, include the row). Then, the script clears all data out of the DB and repeats the copy, this time using timescaledb-parallel-copy. It then performs a selection to show the row in question is no longer there, despite supposedly being copied from an identical csv

May 03 '22 19:05 hjfeldy

@hjfeldy Thanks for the bug report. Since this seems to be an issue with timescaledb-parallel-copy I move it over there.

May 04 '22 08:05 mkindahl

@hjfeldy Thanks for the report. It looks like the utility doesn't play well with the HEADER option you've specified -- Postgres will ignore the first line from every chunk, which is where your missing rows are going.

I suggest using -skip-header instead. If I switch your script from using

timescaledb-parallel-copy -file obf.csv -workers 1 -db-name $DB_NAME -connection "${CON_STRING}" -table test_table -copy-options "CSV HEADER"

to

timescaledb-parallel-copy -file obf.csv -workers 1 -db-name $DB_NAME -connection "${CON_STRING}" -table test_table -skip-header

then your test passes, and I can see the expected number of rows in the table. Does that work for you?

Jun 09 '22 18:06 jchampio

timescaledb-parallel-copy timescaledb-parallel-copy copied to clipboard

[Bug]: missing values on large inserts

What type of bug is this?

What subsystems and features are affected?

What happened?

TimescaleDB version affected

PostgreSQL version used

What operating system did you use?

What installation method did you use?

What platform did you run on?

Relevant log output and stack trace

How can we reproduce the bug?

timescaledb-parallel-copy
timescaledb-parallel-copy copied to clipboard