ArchiveBot
Keep and upload database of finished jobs
When a job crashes or is aborted, its DB and therefore its queue is simply deleted while the other things (WARC so far, JSON, and since #396 the log file) are kept. I think we should also retain the database. It may even be worth considering keeping it for all jobs.
Besides preserving the remaining queue for crashed and aborted jobs, it also allows for easier access to the crawl information. For example, it's much easier to extract all URLs that failed three times or that resulted in a particular status code from the DB than by painfully processing the log file. It could also allow for running 'update crawls' (outside of ArchiveBot) at a later time by reusing the DB of a job to skip (some) URLs that were already retrieved, without having to construct such a DB from the log file.
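As a rough illustration of that kind of query, here is a minimal Python sketch against a toy SQLite table. The table name `urls` and its columns are assumptions made up for this example, not wpull's actual schema, which would need to be inspected before running anything like this on a real job DB.

```python
import sqlite3

# Toy stand-in for a job database. The table and column names are
# hypothetical; wpull's real schema differs and should be checked first.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urls (url TEXT, status TEXT, try_count INTEGER, status_code INTEGER)"
)
conn.executemany(
    "INSERT INTO urls VALUES (?, ?, ?, ?)",
    [
        ("https://example.org/", "done", 1, 200),
        ("https://example.org/missing", "done", 1, 404),
        ("https://example.org/flaky", "error", 3, None),
        ("https://example.org/todo", "todo", 0, None),
    ],
)

def urls_failed_n_times(conn, tries=3):
    """URLs that were attempted at least `tries` times and still ended in an error."""
    rows = conn.execute(
        "SELECT url FROM urls WHERE status = 'error' AND try_count >= ? ORDER BY url",
        (tries,),
    )
    return [url for (url,) in rows]

def urls_with_status_code(conn, code):
    """URLs whose recorded HTTP status code equals `code`."""
    rows = conn.execute(
        "SELECT url FROM urls WHERE status_code = ? ORDER BY url", (code,)
    )
    return [url for (url,) in rows]

print(urls_failed_n_times(conn))
print(urls_with_status_code(conn, 404))
```

Getting the same answers out of a multi-gigabyte log file would mean parsing every line and tracking retries per URL by hand.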
The obvious downside is the data/storage size. However, in the grand scheme of things, this doesn't make a big difference. As a point of reference, job 6recrrotn072khaaje73k60kh, one of the largest jobs currently running at 65 million URLs, has a DB file of 15.8 GiB. This is pretty much insignificant compared to the job's data size of 4.8 TiB, especially as compression decreases the size further by a factor of 4-5 (zstd without tuning: 3.57 GiB or 22.6 %). So this is an increase in data per job on the order of 1 ‰ (except in the rare extreme cases where the vast majority of URLs are ignored).
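For anyone who wants to check the arithmetic, a small Python sketch using only the three sizes quoted above; the helper name `per_mille` is just for this example.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

db_raw = 15.8 * GIB    # uncompressed DB size quoted above
db_zst = 3.57 * GIB    # zstd-compressed DB size quoted above
job_data = 4.8 * TIB   # total job data size quoted above

def per_mille(part, whole):
    """Return part/whole expressed in per mille."""
    return 1000 * part / whole

print(f"compression ratio:      {100 * db_zst / db_raw:.1f} %")
print(f"raw DB overhead:        {per_mille(db_raw, job_data):.1f} per mille")
print(f"compressed DB overhead: {per_mille(db_zst, job_data):.1f} per mille")
```

Both the raw and the compressed DB land within an order of magnitude of 1 ‰ of the job's data.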
This is a great idea, I think size is not a problem here.
I'm not sure what is exactly stored in the DB, any sensitive information? If not, we should definitely preserve the DB (also on finished jobs).
Nothing sensitive at all. It contains only information on the crawl itself that could in theory be regenerated using the source code and the WARCs (but you might turn suicidal trying to do that): URLs, their relations (parent, root), recursion info (level, inline level), crawl info (status, try count, priority [currently unused]), and some info on the content ('link type', status code). POST data and local filenames would also end up there but are not used by AB. Sometime in the future, cookies will also be there, but again, nothing that couldn't be reconstructed from the WARCs anyway (and the IRC commands if we add manual cookie control).
It's pretty difficult (and would have to process a ton of data) to reconstruct this. Size is not a problem (relative to total WARC size). Since there's nothing sensitive in the DB, let's do it.
We could gzip it and upload together with the JSON and WARCs.
Yep. It's possible in theory but completely unfeasible in practice.
I'll play around with gzip vs zstd a bit. It'll be a .db.gz or .db.zst file with the same filename structure as everything else.
I ran a few tests on large-ish databases on a busy pipeline in a terminal:
| Job | Original size | gzip -6 size | gzip -6 time | gzip -9 size | gzip -9 time | zstd size | zstd time |
|---|---|---|---|---|---|---|---|
| 1m71j820n4qka3ob7w6dlja3y | 23.7 GiB | 3.93 GiB | 17 min | | | 3.58 GiB | 2.5 min |
| 9hdfwijhzx86os1k3tm1wgq3i | 1034 MiB | 241 MiB | 33 s | 239 MiB | 50 s | 221 MiB | 4.3 s |
| 2g2xqrj2na5od7mk5mql0q3bn | 326 MiB | 67.3 MiB | 10 s | 66.7 MiB | 23 s | 64.7 MiB | 2.3 s |
| 73pjjo1i8uyububkhbpaf6ndr | 5.00 GiB | 0.915 GiB | 2 min 20 s | | | 0.890 GiB | 30 s |
The implications are pretty obvious: zstd produces smaller files than either gzip level, in a fraction of the time.
I'll probably switch the log compression (on crashed/aborted jobs) to zstd as well. Although zstd with the default settings actually produced a larger file than even gzip -6 in a test, it only takes a slight increase of the compression level to fix that. On my partial test log from job 9hdfwijhzx86os1k3tm1wgq3i (1010 MiB), zstd -10 takes about the same time as gzip -9 (33 s) but produces a file of 85 MiB compared to gzip's 99 MiB. I'll do some more testing to find the sweet spot there.
I looked into this a bit again. I took the DB from 5nbpflkse0rs1tlgch8n4efud (2.94 GB, 13 million URLs, runtime before crashing about a week) and the partial log file from 3pwf0useacbmua9uwp4idpale (3.64 GB, 12 million URLs, runtime about a month so far) and compressed them at most levels of zstd and gzip. I ran this on a fairly busy AB pipeline (jap-kakapo), so it should be representative of what the runtime might look like in reality. The jobs are obviously among the larger ones running through AB. My analysis consisted of staring at shitty graphs of user time vs compression ratio in LibreOffice Calc.
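For reference, the same kind of level sweep can be sketched in Python. This is only an illustration of the method, not what was run here: it uses the standard library's `zlib` (the same DEFLATE algorithm as gzip, minus the gzip file header) on a synthetic log-like sample, and `time.process_time` as a stand-in for user + sys time. Measuring zstd this way would need a third-party binding, which is left out.

```python
import time
import zlib

def benchmark(data, levels, compress):
    """Compress `data` at each level; return (level, size, ratio, cpu_seconds) rows."""
    rows = []
    for level in levels:
        start = time.process_time()  # CPU time (user + sys) of this process
        compressed = compress(data, level)
        cpu = time.process_time() - start
        rows.append((level, len(compressed), len(compressed) / len(data), cpu))
    return rows

# Synthetic, log-like sample; a real run would read a job's .db or .log file instead.
sample = b"".join(
    b"2021-02-21 03:42:%02d GET https://example.org/page/%d 200\n" % (i % 60, i)
    for i in range(50_000)
)

# Print rows in the same layout as the tables below (minus real/sys time).
for level, size, ratio, cpu in benchmark(sample, range(1, 10), zlib.compress):
    print(f"| {level} | {len(sample)} | {size} | {ratio:.2%} | {cpu:.3f} |")
```

Numbers from a synthetic sample like this say nothing about real job data; the point is only the shape of the measurement loop.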
Test results
Database
zstd
| Compression level | Original size (bytes) | Compressed size (bytes) | Compression ratio | Real time (s) | User time (s) | Sys time (s) |
|---|---|---|---|---|---|---|
| 1 | 2944745472 | 721835831 | 24.51% | 13.616 | 11.565 | 1.751 |
| 2 | 2944745472 | 700663107 | 23.79% | 13.192 | 13.300 | 1.097 |
| 3 | 2944745472 | 682199494 | 23.17% | 16.743 | 16.700 | 1.231 |
| 4 | 2944745472 | 677610833 | 23.01% | 21.281 | 21.426 | 1.187 |
| 5 | 2944745472 | 661601839 | 22.47% | 50.899 | 50.691 | 1.414 |
| 6 | 2944745472 | 657653273 | 22.33% | 56.692 | 56.470 | 1.247 |
| 7 | 2944745472 | 630368182 | 21.41% | 68.079 | 67.727 | 1.542 |
| 8 | 2944745472 | 625318048 | 21.24% | 79.158 | 79.157 | 1.252 |
| 9 | 2944745472 | 622723235 | 21.15% | 93.913 | 93.947 | 1.144 |
| 10 | 2944745472 | 613131472 | 20.82% | 117.855 | 117.652 | 1.381 |
| 11 | 2944745472 | 610937389 | 20.75% | 131.157 | 130.767 | 1.344 |
| 12 | 2944745472 | 609634199 | 20.70% | 176.475 | 176.017 | 1.516 |
| 13 | 2944745472 | 609777705 | 20.71% | 201.050 | 196.477 | 2.412 |
| 14 | 2944745472 | 607311093 | 20.62% | 218.251 | 215.267 | 2.175 |
| 15 | 2944745472 | 605756166 | 20.57% | 265.313 | 262.540 | 2.719 |
| 16 | 2944745472 | 588934765 | 20.00% | 572.187 | 561.204 | 3.635 |
| 17 | 2944745472 | 562606051 | 19.11% | 697.623 | 690.251 | 4.968 |
| 18 | 2944745472 | 538896215 | 18.30% | 1085.334 | 1077.788 | 5.231 |
| 19 | 2944745472 | 530637003 | 18.02% | 1519.945 | 1512.603 | 4.898 |
gzip
| Compression level | Original size (bytes) | Compressed size (bytes) | Compression ratio | Real time (s) | User time (s) | Sys time (s) |
|---|---|---|---|---|---|---|
| 1 | 2944745472 | 806176600 | 27.38% | 47.730 | 42.625 | 1.891 |
| 2 | 2944745472 | 800534883 | 27.19% | 47.969 | 44.378 | 1.551 |
| 3 | 2944745472 | 770088717 | 26.15% | 56.418 | 54.249 | 1.833 |
| 4 | 2944745472 | 736143418 | 25.00% | 65.347 | 62.334 | 1.711 |
| 5 | 2944745472 | 723571018 | 24.57% | 71.107 | 68.941 | 1.759 |
| 6 | 2944745472 | 717027291 | 24.35% | 89.407 | 87.594 | 1.560 |
| 7 | 2944745472 | 713746787 | 24.24% | 103.271 | 100.502 | 1.680 |
| 8 | 2944745472 | 711333243 | 24.16% | 126.486 | 124.023 | 1.536 |
| 9 | 2944745472 | 711214985 | 24.15% | 138.508 | 134.626 | 1.927 |
Log
zstd
(Only ran it up to level 15 because it was getting ridiculous...)
| Compression level | Original size (bytes) | Compressed size (bytes) | Compression ratio | Real time (s) | User time (s) | Sys time (s) |
|---|---|---|---|---|---|---|
| 1 | 3641670876 | 440404842 | 12.09% | 11.606 | 11.189 | 1.098 |
| 2 | 3641670876 | 435763309 | 11.97% | 12.000 | 11.859 | 1.232 |
| 3 | 3641670876 | 432647510 | 11.88% | 15.586 | 15.240 | 1.290 |
| 4 | 3641670876 | 433149771 | 11.89% | 18.272 | 17.941 | 1.072 |
| 5 | 3641670876 | 402242867 | 11.05% | 39.903 | 39.730 | 1.240 |
| 6 | 3641670876 | 395880291 | 10.87% | 43.403 | 43.543 | 1.198 |
| 7 | 3641670876 | 379345921 | 10.42% | 58.751 | 58.411 | 1.505 |
| 8 | 3641670876 | 369124857 | 10.14% | 72.449 | 71.646 | 1.644 |
| 9 | 3641670876 | 367090066 | 10.08% | 87.926 | 87.384 | 1.644 |
| 10 | 3641670876 | 365891660 | 10.05% | 103.167 | 103.085 | 1.317 |
| 11 | 3641670876 | 365068174 | 10.02% | 124.915 | 124.942 | 1.265 |
| 12 | 3641670876 | 363906198 | 9.99% | 164.777 | 163.296 | 1.347 |
| 13 | 3641670876 | 359998040 | 9.89% | 228.925 | 228.116 | 2.024 |
| 14 | 3641670876 | 358985335 | 9.86% | 267.016 | 265.797 | 2.248 |
| 15 | 3641670876 | 358212227 | 9.84% | 334.854 | 333.494 | 2.032 |
gzip
| Compression level | Original size (bytes) | Compressed size (bytes) | Compression ratio | Real time (s) | User time (s) | Sys time (s) |
|---|---|---|---|---|---|---|
| 1 | 3641670876 | 506536391 | 13.91% | 36.188 | 33.271 | 1.340 |
| 2 | 3641670876 | 493880878 | 13.56% | 32.974 | 31.712 | 1.168 |
| 3 | 3641670876 | 483203714 | 13.27% | 36.171 | 33.592 | 1.383 |
| 4 | 3641670876 | 452770760 | 12.43% | 48.991 | 45.913 | 1.296 |
| 5 | 3641670876 | 436844810 | 12.00% | 47.157 | 45.902 | 1.175 |
| 6 | 3641670876 | 418611901 | 11.50% | 63.652 | 60.297 | 1.332 |
| 7 | 3641670876 | 416090448 | 11.43% | 70.472 | 68.945 | 1.268 |
| 8 | 3641670876 | 400631037 | 11.00% | 88.818 | 87.520 | 1.128 |
| 9 | 3641670876 | 400425421 | 11.00% | 114.291 | 112.666 | 1.244 |
(Raw terminal output in case I screwed up the tabulation somewhere)
> for lvl in {1..22}; do echo $lvl; time zstd -$lvl patriots.win-inf-20210123-012541-5nbpf-wpull.db -o patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst$lvl; echo; echo; done
1
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 24.51% (2944745472 => 721835831 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst1)
real 0m13.616s
user 0m11.565s
sys 0m1.751s
2
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 23.79% (2944745472 => 700663107 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst2)
real 0m13.192s
user 0m13.300s
sys 0m1.097s
3
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 23.17% (2944745472 => 682199494 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst3)
real 0m16.743s
user 0m16.700s
sys 0m1.231s
4
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 23.01% (2944745472 => 677610833 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst4)
real 0m21.281s
user 0m21.426s
sys 0m1.187s
5
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 22.47% (2944745472 => 661601839 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst5)
real 0m50.899s
user 0m50.691s
sys 0m1.414s
6
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 22.33% (2944745472 => 657653273 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst6)
real 0m56.692s
user 0m56.470s
sys 0m1.247s
7
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 21.41% (2944745472 => 630368182 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst7)
real 1m8.079s
user 1m7.727s
sys 0m1.542s
8
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 21.24% (2944745472 => 625318048 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst8)
real 1m19.158s
user 1m19.157s
sys 0m1.252s
9
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 21.15% (2944745472 => 622723235 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst9)
real 1m33.913s
user 1m33.947s
sys 0m1.144s
10
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.82% (2944745472 => 613131472 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst10)
real 1m57.855s
user 1m57.652s
sys 0m1.381s
11
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.75% (2944745472 => 610937389 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst11)
real 2m11.157s
user 2m10.767s
sys 0m1.344s
12
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.70% (2944745472 => 609634199 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst12)
real 2m56.475s
user 2m56.017s
sys 0m1.516s
13
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.71% (2944745472 => 609777705 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst13)
real 3m21.050s
user 3m16.477s
sys 0m2.412s
14
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.62% (2944745472 => 607311093 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst14)
real 3m38.251s
user 3m35.267s
sys 0m2.175s
15
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.57% (2944745472 => 605756166 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst15)
real 4m25.313s
user 4m22.540s
sys 0m2.719s
16
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.00% (2944745472 => 588934765 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst16)
real 9m32.187s
user 9m21.204s
sys 0m3.635s
17
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 19.11% (2944745472 => 562606051 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst17)
real 11m37.623s
user 11m30.251s
sys 0m4.968s
18
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 18.30% (2944745472 => 538896215 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst18)
real 18m5.334s
user 17m57.788s
sys 0m5.231s
19
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 18.02% (2944745472 => 530637003 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst19)
real 25m19.945s
user 25m12.603s
sys 0m4.898s
> for lvl in {1..9}; do echo $lvl; time gzip -$lvl <patriots.win-inf-20210123-012541-5nbpf-wpull.db >patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz$lvl; echo; echo; done
1
real 0m47.730s
user 0m42.625s
sys 0m1.891s
2
real 0m47.969s
user 0m44.378s
sys 0m1.551s
3
real 0m56.418s
user 0m54.249s
sys 0m1.833s
4
real 1m5.347s
user 1m2.334s
sys 0m1.711s
5
real 1m11.107s
user 1m8.941s
sys 0m1.759s
6
real 1m29.407s
user 1m27.594s
sys 0m1.560s
7
real 1m43.271s
user 1m40.502s
sys 0m1.680s
8
real 2m6.486s
user 2m4.023s
sys 0m1.536s
9
real 2m18.508s
user 2m14.626s
sys 0m1.927s
> for lvl in {1..19}; do echo $lvl; time zstd -$lvl 3pwf0useacbmua9uwp4idpale.log -o 3pwf0useacbmua9uwp4idpale.log.zst$lvl; echo; echo; done
1
3pwf0useacbmua9uwp4idpale.log : 12.09% (3641670876 => 440404842 bytes, 3pwf0useacbmua9uwp4idpale.log.zst1)
real 0m11.606s
user 0m11.189s
sys 0m1.098s
2
3pwf0useacbmua9uwp4idpale.log : 11.97% (3641670876 => 435763309 bytes, 3pwf0useacbmua9uwp4idpale.log.zst2)
real 0m12.000s
user 0m11.859s
sys 0m1.232s
3
3pwf0useacbmua9uwp4idpale.log : 11.88% (3641670876 => 432647510 bytes, 3pwf0useacbmua9uwp4idpale.log.zst3)
real 0m15.586s
user 0m15.240s
sys 0m1.290s
4
3pwf0useacbmua9uwp4idpale.log : 11.89% (3641670876 => 433149771 bytes, 3pwf0useacbmua9uwp4idpale.log.zst4)
real 0m18.272s
user 0m17.941s
sys 0m1.072s
5
3pwf0useacbmua9uwp4idpale.log : 11.05% (3641670876 => 402242867 bytes, 3pwf0useacbmua9uwp4idpale.log.zst5)
real 0m39.903s
user 0m39.730s
sys 0m1.240s
6
3pwf0useacbmua9uwp4idpale.log : 10.87% (3641670876 => 395880291 bytes, 3pwf0useacbmua9uwp4idpale.log.zst6)
real 0m43.403s
user 0m43.543s
sys 0m1.198s
7
3pwf0useacbmua9uwp4idpale.log : 10.42% (3641670876 => 379345921 bytes, 3pwf0useacbmua9uwp4idpale.log.zst7)
real 0m58.751s
user 0m58.411s
sys 0m1.505s
8
3pwf0useacbmua9uwp4idpale.log : 10.14% (3641670876 => 369124857 bytes, 3pwf0useacbmua9uwp4idpale.log.zst8)
real 1m12.449s
user 1m11.646s
sys 0m1.644s
9
3pwf0useacbmua9uwp4idpale.log : 10.08% (3641670876 => 367090066 bytes, 3pwf0useacbmua9uwp4idpale.log.zst9)
real 1m27.926s
user 1m27.384s
sys 0m1.644s
10
3pwf0useacbmua9uwp4idpale.log : 10.05% (3641670876 => 365891660 bytes, 3pwf0useacbmua9uwp4idpale.log.zst10)
real 1m43.167s
user 1m43.085s
sys 0m1.317s
11
3pwf0useacbmua9uwp4idpale.log : 10.02% (3641670876 => 365068174 bytes, 3pwf0useacbmua9uwp4idpale.log.zst11)
real 2m4.915s
user 2m4.942s
sys 0m1.265s
12
3pwf0useacbmua9uwp4idpale.log : 9.99% (3641670876 => 363906198 bytes, 3pwf0useacbmua9uwp4idpale.log.zst12)
real 2m44.777s
user 2m43.296s
sys 0m1.347s
13
3pwf0useacbmua9uwp4idpale.log : 9.89% (3641670876 => 359998040 bytes, 3pwf0useacbmua9uwp4idpale.log.zst13)
real 3m48.925s
user 3m48.116s
sys 0m2.024s
14
3pwf0useacbmua9uwp4idpale.log : 9.86% (3641670876 => 358985335 bytes, 3pwf0useacbmua9uwp4idpale.log.zst14)
real 4m27.016s
user 4m25.797s
sys 0m2.248s
15
3pwf0useacbmua9uwp4idpale.log : 9.84% (3641670876 => 358212227 bytes, 3pwf0useacbmua9uwp4idpale.log.zst15)
real 5m34.854s
user 5m33.494s
sys 0m2.032s
> for lvl in {1..9}; do echo $lvl; time gzip -$lvl <3pwf0useacbmua9uwp4idpale.log >3pwf0useacbmua9uwp4idpale.log.gz$lvl; echo; echo; done
1
real 0m36.188s
user 0m33.271s
sys 0m1.340s
2
real 0m32.974s
user 0m31.712s
sys 0m1.168s
3
real 0m36.171s
user 0m33.592s
sys 0m1.383s
4
real 0m48.991s
user 0m45.913s
sys 0m1.296s
5
real 0m47.157s
user 0m45.902s
sys 0m1.175s
6
real 1m3.652s
user 1m0.297s
sys 0m1.332s
7
real 1m10.472s
user 1m8.945s
sys 0m1.268s
8
real 1m28.818s
user 1m27.520s
sys 0m1.128s
9
real 1m54.291s
user 1m52.666s
sys 0m1.244s
> ll
total 34151264
drwxr-xr-x 2 archivebot archivebot 4096 Feb 21 03:42 .
drwxr-xr-x 20 archivebot archivebot 4096 Feb 21 02:55 ..
-rw-r--r-- 1 archivebot archivebot 3641670876 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log
-rw-r--r-- 1 archivebot archivebot 506536391 Feb 21 03:33 3pwf0useacbmua9uwp4idpale.log.gz1
-rw-r--r-- 1 archivebot archivebot 493880878 Feb 21 03:33 3pwf0useacbmua9uwp4idpale.log.gz2
-rw-r--r-- 1 archivebot archivebot 483203714 Feb 21 03:34 3pwf0useacbmua9uwp4idpale.log.gz3
-rw-r--r-- 1 archivebot archivebot 452770760 Feb 21 03:35 3pwf0useacbmua9uwp4idpale.log.gz4
-rw-r--r-- 1 archivebot archivebot 436844810 Feb 21 03:35 3pwf0useacbmua9uwp4idpale.log.gz5
-rw-r--r-- 1 archivebot archivebot 418611901 Feb 21 03:36 3pwf0useacbmua9uwp4idpale.log.gz6
-rw-r--r-- 1 archivebot archivebot 416090448 Feb 21 03:38 3pwf0useacbmua9uwp4idpale.log.gz7
-rw-r--r-- 1 archivebot archivebot 400631037 Feb 21 03:39 3pwf0useacbmua9uwp4idpale.log.gz8
-rw-r--r-- 1 archivebot archivebot 400425421 Feb 21 03:41 3pwf0useacbmua9uwp4idpale.log.gz9
-rw-r--r-- 1 archivebot archivebot 440404842 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst1
-rw-r--r-- 1 archivebot archivebot 365891660 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst10
-rw-r--r-- 1 archivebot archivebot 365068174 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst11
-rw-r--r-- 1 archivebot archivebot 363906198 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst12
-rw-r--r-- 1 archivebot archivebot 359998040 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst13
-rw-r--r-- 1 archivebot archivebot 358985335 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst14
-rw-r--r-- 1 archivebot archivebot 358212227 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst15
-rw-r--r-- 1 archivebot archivebot 435763309 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst2
-rw-r--r-- 1 archivebot archivebot 432647510 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst3
-rw-r--r-- 1 archivebot archivebot 433149771 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst4
-rw-r--r-- 1 archivebot archivebot 402242867 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst5
-rw-r--r-- 1 archivebot archivebot 395880291 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst6
-rw-r--r-- 1 archivebot archivebot 379345921 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst7
-rw-r--r-- 1 archivebot archivebot 369124857 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst8
-rw-r--r-- 1 archivebot archivebot 367090066 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst9
-rw-r--r-- 2 archivebot archivebot 2944745472 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db
-rw-r--r-- 1 archivebot archivebot 806176600 Feb 21 02:17 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz1
-rw-r--r-- 1 archivebot archivebot 800534883 Feb 21 02:18 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz2
-rw-r--r-- 1 archivebot archivebot 770088717 Feb 21 02:19 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz3
-rw-r--r-- 1 archivebot archivebot 736143418 Feb 21 02:20 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz4
-rw-r--r-- 1 archivebot archivebot 723571018 Feb 21 02:21 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz5
-rw-r--r-- 1 archivebot archivebot 717027291 Feb 21 02:22 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz6
-rw-r--r-- 1 archivebot archivebot 713746787 Feb 21 02:24 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz7
-rw-r--r-- 1 archivebot archivebot 711333243 Feb 21 02:26 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz8
-rw-r--r-- 1 archivebot archivebot 711214985 Feb 21 02:28 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz9
-rw-r--r-- 1 archivebot archivebot 721835831 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst1
-rw-r--r-- 1 archivebot archivebot 613131472 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst10
-rw-r--r-- 1 archivebot archivebot 610937389 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst11
-rw-r--r-- 1 archivebot archivebot 609634199 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst12
-rw-r--r-- 1 archivebot archivebot 609777705 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst13
-rw-r--r-- 1 archivebot archivebot 607311093 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst14
-rw-r--r-- 1 archivebot archivebot 605756166 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst15
-rw-r--r-- 1 archivebot archivebot 588934765 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst16
-rw-r--r-- 1 archivebot archivebot 562606051 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst17
-rw-r--r-- 1 archivebot archivebot 538896215 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst18
-rw-r--r-- 1 archivebot archivebot 530637003 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst19
-rw-r--r-- 1 archivebot archivebot 700663107 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst2
-rw-r--r-- 1 archivebot archivebot 682199494 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst3
-rw-r--r-- 1 archivebot archivebot 677610833 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst4
-rw-r--r-- 1 archivebot archivebot 661601839 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst5
-rw-r--r-- 1 archivebot archivebot 657653273 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst6
-rw-r--r-- 1 archivebot archivebot 630368182 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst7
-rw-r--r-- 1 archivebot archivebot 625318048 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst8
-rw-r--r-- 1 archivebot archivebot 622723235 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst9
My conclusion: the sweet spot with zstd seems to be 10 for databases and 8 for logs. Up to that, there is an acceptable increase in runtime with significant space savings. Beyond that, the large increase in compression time outweighs the relatively small size reduction. Unless someone yells at me, that's what I'll implement soon™.
Fun side note: on the database, even zstd -2 compresses better than gzip -9, and in a tenth of the runtime! (On the log, it takes zstd -6 to beat gzip -9, still in well under half the runtime.)
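One possible way to put numbers on the sweet-spot judgment, as an alternative to eyeballing graphs, is to look at how many megabytes each step to the next level saves per extra second of user time. The sketch below does that for the database, with the sizes and user times copied from the zstd table (levels 1 to 15); the function name and the MB-per-second metric are choices made for this example only.

```python
# (level, compressed bytes, user seconds) for the database, from the zstd table.
DB_ZSTD = [
    (1, 721835831, 11.565), (2, 700663107, 13.300), (3, 682199494, 16.700),
    (4, 677610833, 21.426), (5, 661601839, 50.691), (6, 657653273, 56.470),
    (7, 630368182, 67.727), (8, 625318048, 79.157), (9, 622723235, 93.947),
    (10, 613131472, 117.652), (11, 610937389, 130.767), (12, 609634199, 176.017),
    (13, 609777705, 196.477), (14, 607311093, 215.267), (15, 605756166, 262.540),
]

def marginal_savings(rows):
    """For each step to the next level, MB saved per extra second of user time."""
    out = []
    for (_, size_a, t_a), (lvl_b, size_b, t_b) in zip(rows, rows[1:]):
        out.append((lvl_b, (size_a - size_b) / 1e6 / (t_b - t_a)))
    return out

for level, mb_per_s in marginal_savings(DB_ZSTD):
    print(f"-> level {level:2d}: {mb_per_s:6.2f} MB saved per extra second")
```

The step to level 13 comes out negative because level 13 produced a slightly larger file than level 12 in this run. The per-step values are noisy, so this is a supplement to the judgment call rather than a replacement for it.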
A complication is SQLite's Write-Ahead Log (which records changes to the DB that haven't been merged into the main database file yet). When the DB is closed, the WAL gets merged, and only wpull.db remains (but is this guaranteed behaviour?). This is what happens on aborting, for example. But when wpull crashes, wpull.db-wal and wpull.db-shm remain. Merging explicitly is possible using sqlite3 wpull.db 'PRAGMA wal_checkpoint' (see the SQLite docs; passing an argument might be better), but I'm not sure whether that always works. Perhaps there'd need to be a fallback that preserves all three files in case wal_checkpoint fails to merge them.
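A minimal sketch of what such a fallback could look like in Python, assuming the upload step just needs a list of files to keep. The function name `files_to_preserve` is hypothetical and nothing here is taken from ArchiveBot's code. `PRAGMA wal_checkpoint(TRUNCATE)` returns one row of (busy, frames in the WAL, frames checkpointed); the sketch treats the merge as successful only if the checkpoint wasn't blocked and no non-empty WAL file is left behind.

```python
import os
import sqlite3

def files_to_preserve(db_path):
    """Try to merge the WAL into the main DB file; return the files that must be kept."""
    wal_path, shm_path = db_path + "-wal", db_path + "-shm"
    merged = False
    try:
        conn = sqlite3.connect(db_path)
        try:
            # Result row is (busy, log_frames, checkpointed_frames).
            busy, log_frames, checkpointed = conn.execute(
                "PRAGMA wal_checkpoint(TRUNCATE)"
            ).fetchone()
            merged = busy == 0 and log_frames == checkpointed
        finally:
            conn.close()
    except sqlite3.Error:
        # Unreadable or locked DB: fall through to keeping everything.
        merged = False
    # A leftover non-empty WAL means "not merged", whatever the pragma reported.
    if merged and os.path.exists(wal_path) and os.path.getsize(wal_path) > 0:
        merged = False
    if merged:
        return [db_path]
    # Fallback: keep the DB plus whichever of the -wal/-shm files exist.
    return [p for p in (db_path, wal_path, shm_path) if os.path.exists(p)]
```

Opening a crashed job's DB makes SQLite run its own WAL recovery on the files, so working on a copy would be the more cautious variant.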