Checkpoint file is generated regardless of whether the job succeeds
Using the latest dsbulk-1.10.0, I observed the following:
ubuntu@ip-172-31-94-145:~$ ./dsbulk-1.10.0/bin/dsbulk load -url file:///mnt/data/toload/foo_small.csv -k testks -t dsbulk_test -b ~/secureBundle_prod.zip -u xxx -p xxx
Username and password provided but auth provider not specified, inferring PlainTextAuthProvider
A cloud secure connect bundle was provided: ignoring all explicit contact points.
A cloud secure connect bundle was provided and selected operation performs writes: changing default consistency level to LOCAL_QUORUM.
Operation directory: /home/ubuntu/logs/LOAD_20220814-221547-708451
Setting executor.maxPerSecond not set when connecting to DataStax Astra: applying a limit of 9,000 ops/second based on the number of coordinators (3).
If your Astra database has higher limits, please define executor.maxPerSecond explicitly.
Operation LOAD_20220814-221547-708451 failed: java.io.IOException: Error creating CSV parser for file:/mnt/data/toload/foo_small.csv.
Caused by: Error creating CSV parser for file:/mnt/data/toload/foo_small.csv.
Caused by: File not found: /mnt/data/toload/foo_small.csv (No such file or directory).
total | failed | rows/s | p50ms | p99ms | p999ms | batches
0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00
Checkpoints for the current operation were written to checkpoint.csv.
To resume the current operation, re-run it with the same settings, and add the following command line flag:
--dsbulk.log.checkpoint.file=/home/ubuntu/logs/LOAD_20220814-221547-708451/checkpoint.csv
ubuntu@ip-172-31-94-145:~$ ./dsbulk-1.10.0/bin/dsbulk load -url file:///mnt/data/foo_small.csv -k testks -t dsbulk_test -b ~/secureBundle_prod.zip -u xxx -p xxx
Username and password provided but auth provider not specified, inferring PlainTextAuthProvider
A cloud secure connect bundle was provided: ignoring all explicit contact points.
A cloud secure connect bundle was provided and selected operation performs writes: changing default consistency level to LOCAL_QUORUM.
Operation directory: /home/ubuntu/logs/LOAD_20220814-221616-336561
Setting executor.maxPerSecond not set when connecting to DataStax Astra: applying a limit of 9,000 ops/second based on the number of coordinators (3).
If your Astra database has higher limits, please define executor.maxPerSecond explicitly.
total | failed | rows/s | p50ms | p99ms | p999ms | batches
4,999 | 0 | 8,237 | 2.64 | 15.40 | 20.32 | 1.00
Operation LOAD_20220814-221616-336561 completed successfully in less than one second.
Checkpoints for the current operation were written to checkpoint.csv.
To resume the current operation, re-run it with the same settings, and add the following command line flag:
--dsbulk.log.checkpoint.file=/home/ubuntu/logs/LOAD_20220814-221616-336561/checkpoint.csv
ubuntu@ip-172-31-94-145:~$ ls -l /home/ubuntu/logs/LOAD_20220814-221616-336561/checkpoint.csv
-rw-rw-r-- 1 ubuntu ubuntu 44 Aug 14 22:16 /home/ubuntu/logs/LOAD_20220814-221616-336561/checkpoint.csv
ubuntu@ip-172-31-94-145:~$ cat /home/ubuntu/logs/LOAD_20220814-221616-336561/checkpoint.csv
file:/mnt/data/foo_small.csv;1;4999;1:4999;
In short, whether or not the job succeeds, checkpoint.csv is always generated and its path printed at the end, along with a suggestion to resume the operation. This can confuse end users, since there is nothing to resume after a successful load.
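For reference, the checkpoint line shown above is semicolon-delimited. A minimal sketch of splitting it into fields follows; the field meanings in the comments are my assumptions inferred from the observed output, not taken from DSBulk documentation:

```python
# Split the DSBulk checkpoint.csv line observed above into its fields.
# Assumed meanings (not documented here): resource URI; a status flag;
# a record count; the consumed record range(s). The trailing ';' yields
# an empty final field, discarded below.
line = "file:/mnt/data/foo_small.csv;1;4999;1:4999;"
resource, status, count, ranges, _ = line.split(";")
print(resource, status, count, ranges)
```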
@adutra I think this is where the check was accidentally removed/reverted: https://github.com/datastax/dsbulk/pull/432/commits/795d96e0fbb9dddef0814dd0d6e5de9752a8648b#diff-f1cbdde11c8882f0716f29733453909f71bc2515b77b0e26711aac96367e294a