
checkpoint file is generated regardless of whether the job succeeds

Open · weideng1 opened this issue 2 years ago · 1 comment

Using the latest dsbulk-1.10.0, I observed the following:

ubuntu@ip-172-31-94-145:~$ ./dsbulk-1.10.0/bin/dsbulk load -url file:///mnt/data/toload/foo_small.csv -k testks -t dsbulk_test -b ~/secureBundle_prod.zip -u xxx -p xxx
Username and password provided but auth provider not specified, inferring PlainTextAuthProvider
A cloud secure connect bundle was provided: ignoring all explicit contact points.
A cloud secure connect bundle was provided and selected operation performs writes: changing default consistency level to LOCAL_QUORUM.
Operation directory: /home/ubuntu/logs/LOAD_20220814-221547-708451
Setting executor.maxPerSecond not set when connecting to DataStax Astra: applying a limit of 9,000 ops/second based on the number of coordinators (3).
If your Astra database has higher limits, please define executor.maxPerSecond explicitly.
Operation LOAD_20220814-221547-708451 failed: java.io.IOException: Error creating CSV parser for file:/mnt/data/toload/foo_small.csv.
   Caused by: Error creating CSV parser for file:/mnt/data/toload/foo_small.csv.
     Caused by: File not found: /mnt/data/toload/foo_small.csv (No such file or directory).
total | failed | rows/s | p50ms | p99ms | p999ms | batches
    0 |      0 |      0 |  0.00 |  0.00 |   0.00 |    0.00
Checkpoints for the current operation were written to checkpoint.csv.
To resume the current operation, re-run it with the same settings, and add the following command line flag:
--dsbulk.log.checkpoint.file=/home/ubuntu/logs/LOAD_20220814-221547-708451/checkpoint.csv
ubuntu@ip-172-31-94-145:~$ ./dsbulk-1.10.0/bin/dsbulk load -url file:///mnt/data/foo_small.csv -k testks -t dsbulk_test -b ~/secureBundle_prod.zip -u xxx -p xxx
Username and password provided but auth provider not specified, inferring PlainTextAuthProvider
A cloud secure connect bundle was provided: ignoring all explicit contact points.
A cloud secure connect bundle was provided and selected operation performs writes: changing default consistency level to LOCAL_QUORUM.
Operation directory: /home/ubuntu/logs/LOAD_20220814-221616-336561
Setting executor.maxPerSecond not set when connecting to DataStax Astra: applying a limit of 9,000 ops/second based on the number of coordinators (3).
If your Astra database has higher limits, please define executor.maxPerSecond explicitly.
total | failed | rows/s | p50ms | p99ms | p999ms | batches
4,999 |      0 |  8,237 |  2.64 | 15.40 |  20.32 |    1.00
Operation LOAD_20220814-221616-336561 completed successfully in less than one second.
Checkpoints for the current operation were written to checkpoint.csv.
To resume the current operation, re-run it with the same settings, and add the following command line flag:
--dsbulk.log.checkpoint.file=/home/ubuntu/logs/LOAD_20220814-221616-336561/checkpoint.csv
ubuntu@ip-172-31-94-145:~$ ls -l /home/ubuntu/logs/LOAD_20220814-221616-336561/checkpoint.csv
-rw-rw-r-- 1 ubuntu ubuntu 44 Aug 14 22:16 /home/ubuntu/logs/LOAD_20220814-221616-336561/checkpoint.csv
ubuntu@ip-172-31-94-145:~$ cat /home/ubuntu/logs/LOAD_20220814-221616-336561/checkpoint.csv
file:/mnt/data/foo_small.csv;1;4999;1:4999;
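For reference, each line of checkpoint.csv appears to be a semicolon-separated record: the resource URI, what looks like a finished flag, the number of records consumed, and the position range(s) that succeeded. This is inferred from the single sample line above, not from documented dsbulk format; a minimal Python sketch under that assumption:

```python
# Hedged sketch: parse one line of a dsbulk checkpoint.csv.
# Field meanings are inferred from the sample line
# "file:/mnt/data/foo_small.csv;1;4999;1:4999;" and are assumptions,
# not the documented dsbulk checkpoint format.

def parse_checkpoint_line(line):
    """Split a checkpoint line into its semicolon-separated fields."""
    resource, finished, consumed, ranges, _trailer = line.split(";")
    return {
        "resource": resource,                # source URI
        "finished": finished == "1",         # presumed completion flag
        "consumed": int(consumed),           # records read from the resource
        # "1:4999"-style spans: position ranges that loaded successfully
        "ok_ranges": [tuple(map(int, r.split(":")))
                      for r in ranges.split(",") if r],
    }

if __name__ == "__main__":
    sample = "file:/mnt/data/foo_small.csv;1;4999;1:4999;"
    print(parse_checkpoint_line(sample))
```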

Basically, whether or not the job succeeds, checkpoint.csv is always generated and its path printed at the end, along with the instructions to resume. This can cause confusion for end users.
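The behavior the reporter expects can be sketched as a simple guard at the end of the operation: only persist the checkpoint and print the resume hint when there is something to resume. This is a hypothetical illustration, not dsbulk's actual code; all names here (`finalize_operation`, `checkpoint_path`) are made up for the sketch.

```python
# Hypothetical sketch of the expected behavior: surface the checkpoint
# (and the --dsbulk.log.checkpoint.file resume hint) only when the
# operation failed. None of these names come from the dsbulk codebase.

def finalize_operation(operation_failed, checkpoint_path):
    if operation_failed:
        # write the checkpoint file here (stub), then tell the user
        # how to resume with --dsbulk.log.checkpoint.file=<path>
        return checkpoint_path
    # clean run: no checkpoint file written, no resume hint printed
    return None

if __name__ == "__main__":
    print(finalize_operation(True, "/home/ubuntu/logs/LOAD_x/checkpoint.csv"))
    print(finalize_operation(False, "/home/ubuntu/logs/LOAD_x/checkpoint.csv"))
```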


weideng1 · Aug 15 '22 05:08

@adutra I think this is where the check got accidentally removed/reverted: https://github.com/datastax/dsbulk/pull/432/commits/795d96e0fbb9dddef0814dd0d6e5de9752a8648b#diff-f1cbdde11c8882f0716f29733453909f71bc2515b77b0e26711aac96367e294a

weideng1 · Aug 15 '22 16:08