ArchiveBot
Evaluate whether we want to keep --large
As mentioned in #330, --large hasn't been used for a job submission in over a year, LARGE more often than not isn't specified on pipelines, and most current pipelines are sufficiently big to handle even large jobs with many millions of URLs and large files.
I think of --large as a workaround for when some jobs were larger than what some pipelines could handle. But computing resources only get cheaper, and this no longer seems to be an issue. So I think the option has outlived its usefulness and should be removed to reduce clutter in the code base.
I'd be happy to close #330 in favor of another PR (which someone else would make) that would remove the --large option altogether, of course. :)
In case we decide to remove --large:
I just looked at the code a bit to see whether there's anything special we'd need to do. I think it should work if we simply drop it entirely.
The only potential issue I could think of is if the pending-large queue no longer exists and a pipeline using the old, pre-removal code tries to get a job from it. Control.dequeue_item uses Redis's RPOPLPUSH command, which returns nil if the source does not exist. I'm not entirely sure how that translates to the Python world, but I'd expect it to result in a None, which then gets handled appropriately (in Control.reserve_job). So that should probably work, but we should test it first to ensure we don't crash all running pipelines with this change. (Testing it by modifying the pipeline code to try to retrieve a job from a made-up queue name should suffice.)
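For illustration, here's a minimal redis-py sketch of that behaviour (the client setup and queue names are assumptions for the example, not the actual ArchiveBot pipeline code): an RPOPLPUSH whose source list doesn't exist simply comes back as None on the Python side.

```python
# Minimal sketch, not ArchiveBot code: what a pipeline would see when it
# tries to dequeue from a queue that no longer exists.
import redis

r = redis.Redis(host='localhost', port=6379)

# RPOPLPUSH on a missing source key returns nil, which redis-py maps to None.
item = r.rpoplpush('pending-large', 'working')  # queue names made up here

if item is None:
    # This is the case Control.reserve_job would have to treat as
    # "no job available", so the pipeline keeps polling instead of crashing.
    print('queue empty or missing - nothing to do')
```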
The purpose of this feature was to allow us to use servers like the Scaleway baremetal C1, which is only 3 euros a month and can handle the load perfectly except that it has only a 50GB SSD. But in principle, there's no actual reason a web crawl should need anywhere near that amount of disk, and so this flag might be the wrong solution to the problem.
On the other hand, actually allowing the use of these servers the right way would be a massive engineering effort, one I plan to get back to staring at when life calms down a little...
Nice to hear from you. :-)
in principle, there's no actual reason a web crawl should need anywhere near that amount of disk
It depends. We've had a number of jobs with a queue of a few dozen million URLs. The SQLite DB and the log file alone use a decent amount of space for such jobs. And if deduplication weren't broken (#311), the dedupe DB would presumably use a significant amount of storage as well.
We can get rid of the log issue by splitting the meta WARC up and truncating the wpull.log file. This would also fix the issue that we're missing the log files of crashed/aborted jobs if the pipeline operators don't upload them. I've actually implemented the meta WARC splitting part in wpull already (but the code isn't available currently; I need to resume working on that).
As for decreasing the DB size, we could in theory remove ignored URLs from the DB. This would require removing the ignore statements from the log file because we'd otherwise log the same URL potentially millions of times. I don't think this is a good idea though since it's good to have a log of what URLs we intentionally excluded from the crawl. We could remove some unnecessary columns from the DB, but that would not have a large effect in the big scheme of things.
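As a rough illustration of the DB-pruning idea (purely a sketch: the table, column, and status names here are guesses, not verified against wpull's actual schema), the point is that deleting rows alone isn't enough; SQLite only returns the space to the filesystem after a VACUUM:

```python
# Hypothetical sketch only - 'urls', 'status', and 'skipped' are assumed
# names, not wpull's confirmed schema.
import sqlite3

con = sqlite3.connect('wpull.db')
con.execute("DELETE FROM urls WHERE status = 'skipped'")  # drop ignored URLs
con.commit()
con.execute('VACUUM')  # SQLite only shrinks the file on an explicit VACUUM
con.close()
```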
The other issue, and really the major reason why small disks don't work well, is large file downloads. One huge problem there is that each download is written to disk twice, which is obviously a massive waste of disk space. However, while we could definitely eliminate one of these copies, I don't think it's possible to get rid of the other.

If we had only one download running at a time, we could simply stream the data to the document processor (which would simply ignore the data in most cases of large files since those tend not to be HTML/CSS/JS documents) and to the WARC writer, which could then write to WARC directly. It could even split the record up across WARC files using continuation records and move WARCs to the uploader directory while still downloading the same file. In theory, that would even allow downloading files larger than the disk (assuming the upload is fast enough). Unfortunately, concurrent downloads throw a wrench into this. And furthermore, the WARC spec on continuation records is a bit flawed: you can't begin to write the records until you have the entire file, since the first record already needs to contain the payload digest of the entire logical record (cf. iipc/warc-specifications#28).
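To make that last point concrete, here's a small sketch (function names and header handling are purely illustrative, not wpull or any real WARC library): because the first segment must already carry the payload digest of the whole logical record, the file has to be read in full once before the first segment can be written out, which is exactly what prevents true streaming.

```python
# Sketch of why WARC continuation records force a full pass over the payload
# before any segment can be written (cf. iipc/warc-specifications#28).
# Digest string format is simplified; a real WARC writer would differ.
import hashlib

def payload_digest(path, chunk_size=1 << 20):
    # First pass: the complete payload must be hashed up front, because the
    # digest goes into the headers of segment number 1.
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return 'sha1:' + h.hexdigest()

def write_segments(path, segment_size=1 << 30):
    digest = payload_digest(path)  # requires the entire file on disk already
    with open(path, 'rb') as f:
        segment_number = 1
        while True:
            chunk = f.read(segment_size)
            if not chunk:
                break
            headers = {'WARC-Segment-Number': segment_number}
            if segment_number == 1:
                # Only possible because the whole file was read above.
                headers['WARC-Payload-Digest'] = digest
            # A real implementation would hand `headers` and `chunk` to a
            # WARC writer here; this sketch just discards them.
            segment_number += 1
```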
Anyway, this probably belongs in a "make wpull more space-efficient" issue. And as you mentioned, it requires a decent amount of work.
To get back to --large, I completely understand what the idea behind it was, but I think in the end it simply doesn't work out. In addition to pipelines generally having sufficient disk space now, people also forget to use this option when they queue jobs (or don't even know about it because they haven't read the docs; if it were in the docs in the first place, that is), or maybe they simply don't realise that there are huge files hidden somewhere in a directory. From that perspective, it would make more sense to have a --small option for sites which verifiably aren't large and can run even on very small pipelines. But in general, jobs end up being larger than originally thought, and a single bad link (e.g. a webcam stream) is enough to cause a crash, so I'm not really convinced of that either.
While I do agree on removing the functionality of --large, I think the option itself should remain for the sake of compatibility (in a functionless, placebo state).
For necroposting context, ArchiveBot job 36sdsz38kvp70rjl4ziye6l8u was started using the offending option and, as a result, required manual intervention to rectify. It was added to the "pending-large" queue — no longer checked by any pipelines — and therefore would have remained, stuck, in said queue.
Retaining semi-functionality of the option would keep this (admittedly highly uncommon) situation from reoccurring. A message like "nick: --large option ignored" might work, or simply no message at all.
Just my two cents.
There's probably no reason to keep it, as it was used only for a brief time.
@systwi-again I don't see any advantage to keeping a meaningless option. Backwards compatibility is not a concern here as commands are issued by humans, not scripts. And as you said, usage of it is indeed very rare – there have only been two cases in the past three years (including tonight's). Apart from that, a couple of people have included it when requesting jobs, but they undoubtedly took that from the docs, from which it would be removed anyway.
I do see a disadvantage of keeping an option that just gets silently ignored though: it might lead people to think that it has a meaning and including it is important. In my eyes, that's worse than an error about an unknown option.
I agree. My original thought for keeping the backwards compatibility was that it would protect against those aforementioned situations, assuming the documentation would remain unchanged.
Furthermore, given --large's somewhat benign functionality, I felt that a silent failure in this specific situation would be relatively harmless. Giving this more thought, however, I agree that errors in even these situations should be verbose.
It would be safest, most logical and most efficient to instead remove --large from both ArchiveBot and its documentation entirely, in my opinion (again, assuming this occurs at another point in time).
The only potential issue I could think of is if the pending-large queue no longer exists and a pipeline using the old, pre-removal code tries to get a job from it. Control.dequeue_item uses Redis's RPOPLPUSH command, which returns nil if the source does not exist. I'm not entirely sure how that translates to the Python world, but I'd expect it to result in a None, which then gets handled appropriately (in Control.reserve_job). So that should probably work, but we should test it first to ensure we don't crash all running pipelines with this change. (Testing it by modifying the pipeline code to try to retrieve a job from a made-up queue name should suffice.)
I tested this the other day, and it does work. Only after testing did I realise that of course it must work: the pipelines always attempt to dequeue from pending and pending-ao anyway, and those queues are often empty. So there's no technical blocker preventing this from going ahead.