timescaledb-parallel-copy
timescaledb-parallel-copy copied to clipboard
[Bug]: missing values on large inserts
What type of bug is this?
Data corruption
What subsystems and features are affected?
Data ingestion
What happened?
Upon inserting a large CSV of cryptocurrency data, certain rows are missing. timescaleIssue.zip
TimescaleDB version affected
2.6.1
PostgreSQL version used
12
What operating system did you use?
Ubuntu 20.04 LTS x86_64
What installation method did you use?
Deb/Apt
What platform did you run on?
On prem/Self-hosted
Relevant log output and stack trace
psql:testDB.sql:1: NOTICE: extension "timescaledb" already exists, skipping
create_hypertable
-------------------------
(8,public,test_table,t)
(1 row)
add_dimension
--------------------------------------------
(30,public,test_table,collection_window,t)
(1 row)
add_dimension
--------------------------------------
(31,public,test_table,roll_window,t)
(1 row)
add_dimension
-------------------------------
(32,public,test_table,pair,t)
(1 row)
Performing vanilla postgres COPY command...
COPY 3748508
Results:
count
-------
1
(1 row)
Resetting DB
psql:testDB.sql:1: NOTICE: extension "timescaledb" already exists, skipping
create_hypertable
-------------------------
(9,public,test_table,t)
(1 row)
add_dimension
--------------------------------------------
(34,public,test_table,collection_window,t)
(1 row)
add_dimension
--------------------------------------
(35,public,test_table,roll_window,t)
(1 row)
add_dimension
-------------------------------
(36,public,test_table,pair,t)
(1 row)
Performing timescaledb-parallel-copy command...
ttaCOPY 3748509
Results:
count
-------
0
(1 row)
How can we reproduce the bug?
Verify by unzipping the attached archive and running the "run" shell script. Out of caution, I included an obfuscated version of the crypto data in question. It has the same datatypes and the same dimensions as the original, unobfuscated csv. The script performs a vanilla postgres \COPY command on the csv and selects the row that I found to be missing when I discovered the bug (proving that COPY does, in fact, include the row). Then, the script clears all data out of the DB and repeats the copy, this time using timescaledb-parallel-copy. It then performs a selection to show the row in question is no longer there, despite supposedly being copied from an identical csv
@hjfeldy Thanks for the bug report. Since this seems to be an issue with timescaledb-parallel-copy
I move it over there.
@hjfeldy Thanks for the report. It looks like the utility doesn't play well with the HEADER
option you've specified -- Postgres will ignore the first line from every chunk, which is where your missing rows are going.
I suggest using -skip-header
instead. If I switch your script from using
timescaledb-parallel-copy -file obf.csv -workers 1 -db-name $DB_NAME -connection "${CON_STRING}" -table test_table -copy-options "CSV HEADER"
to
timescaledb-parallel-copy -file obf.csv -workers 1 -db-name $DB_NAME -connection "${CON_STRING}" -table test_table -skip-header
then your test passes, and I can see the expected number of rows in the table. Does that work for you?