timescaledb-parallel-copy icon indicating copy to clipboard operation
timescaledb-parallel-copy copied to clipboard

[Bug]: missing values on large inserts

Open hjfeldy opened this issue 2 years ago • 2 comments

What type of bug is this?

Data corruption

What subsystems and features are affected?

Data ingestion

What happened?

Upon inserting a large CSV of cryptocurrency data, certain rows are missing. timescaleIssue.zip

TimescaleDB version affected

2.6.1

PostgreSQL version used

12

What operating system did you use?

Ubuntu 20.04 LTS x86_64

What installation method did you use?

Deb/Apt

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

psql:testDB.sql:1: NOTICE:  extension "timescaledb" already exists, skipping
     create_hypertable
 -------------------------
  (8,public,test_table,t)
 (1 row)

                add_dimension
 --------------------------------------------
  (30,public,test_table,collection_window,t)
 (1 row)

             add_dimension
 --------------------------------------
  (31,public,test_table,roll_window,t)
 (1 row)

          add_dimension
 -------------------------------
  (32,public,test_table,pair,t)
 (1 row)

 Performing vanilla postgres COPY command...
 COPY 3748508
 Results:
  count
 -------
      1
 (1 row)

 Resetting DB
 psql:testDB.sql:1: NOTICE:  extension "timescaledb" already exists, skipping
     create_hypertable
 -------------------------
  (9,public,test_table,t)
 (1 row)

                add_dimension
 --------------------------------------------
  (34,public,test_table,collection_window,t)
 (1 row)

             add_dimension
 --------------------------------------
  (35,public,test_table,roll_window,t)
 (1 row)

          add_dimension
 -------------------------------
  (36,public,test_table,pair,t)
 (1 row)

 Performing timescaledb-parallel-copy command...
  ttaCOPY 3748509
 Results:
  count
 -------
      0
 (1 row)

How can we reproduce the bug?

Verify by unzipping the attached archive and running the "run" shell script. Out of caution, I included an obfuscated version of the crypto data in question. It has the same datatypes and the same dimensions as the original, unobfuscated csv. The script performs a vanilla postgres \COPY command on the csv and selects the row that I found to be missing when I discovered the bug (proving that COPY does, in fact, include the row). Then, the script clears all data out of the DB and repeats the copy, this time using timescaledb-parallel-copy. It then performs a selection to show the row in question is no longer there, despite supposedly being copied from an identical csv

hjfeldy avatar May 03 '22 19:05 hjfeldy

@hjfeldy Thanks for the bug report. Since this seems to be an issue with timescaledb-parallel-copy I move it over there.

mkindahl avatar May 04 '22 08:05 mkindahl

@hjfeldy Thanks for the report. It looks like the utility doesn't play well with the HEADER option you've specified -- Postgres will ignore the first line from every chunk, which is where your missing rows are going.

I suggest using -skip-header instead. If I switch your script from using

timescaledb-parallel-copy -file obf.csv -workers 1 -db-name $DB_NAME -connection "${CON_STRING}" -table test_table -copy-options "CSV HEADER"

to

timescaledb-parallel-copy -file obf.csv -workers 1 -db-name $DB_NAME -connection "${CON_STRING}" -table test_table -skip-header

then your test passes, and I can see the expected number of rows in the table. Does that work for you?

jchampio avatar Jun 09 '22 18:06 jchampio