
not respecting showDupeCount=true; retry without --uniques-only

Open reagle opened this issue 1 year ago • 4 comments

Hi, I'm new to the tool, and don't want to download empty files or files which haven't changed. I tried and got the following. I'm not sure what this means and why it doesn't work...?

❯ waybackpack http://reddit.com/r/self -d ~/Downloads/wayback-reddit --from-date 2008 --to-date 2009  --no-clobber --progress --uniques-only
Traceback (most recent call last):
  File "/Users/reagle/.pyenv/versions/3.12.5/bin/waybackpack", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/reagle/.pyenv/versions/3.12.5/lib/python3.12/site-packages/waybackpack/cli.py", line 142, in main
    snapshots = search(
                ^^^^^^^
  File "/Users/reagle/.pyenv/versions/3.12.5/lib/python3.12/site-packages/waybackpack/cdx.py", line 47, in search
    raise WaybackpackException(
waybackpack.cdx.WaybackpackException: Wayback Machine CDX API not respecting showDupeCount=true; retry without --uniques-only.

reagle avatar Oct 03 '24 17:10 reagle

Thanks for your interest in waybackpack, @reagle. Here's what's happening:

  • The Wayback Machine's CDX API theoretically provides a way to check for (and thus skip over) duplicate content. If you pass --uniques-only, then waybackpack attempts to skip those dupes.
  • ... but the API hasn't always respected the relevant parameter, making it impossible for waybackpack to respect --uniques-only.
  • Because we don't want people who are expecting --uniques-only to get unexpected results when the feature doesn't work, we throw that error.
  • You can remove --uniques-only from your invocation, although that of course won't resolve the underlying issue (which is that you will end up downloading files that haven't changed).

jsvine avatar Oct 18 '24 02:10 jsvine

Okay, thank you. I'm not sure how often --uniques-only fails, but a nice feature for waybackpack would be to check for redundant files itself. That is, if the API returns a digest that matches an earlier page, don't write it to disk. If you didn't want to do that and that info is available, perhaps you could include it in the metadata of the HTML, so a wrapper could do it. I found myself wanting single-file results (and wanting to tweak default argument values), so I used this wrapper.

#!/usr/bin/env python3

"""Wrap waybackpack to copy files to a single directory."""

import argparse
import os
import shutil
import subprocess


def run_waybackpack(args):
    """Run waybackpack with the given arguments."""
    command = ["waybackpack", "--dir", args.dir, "--delay-retry", str(args.delay_retry)]
    if args.no_clobber:
        command.append("--no-clobber")
    if args.progress:
        command.append("--progress")
    command.extend(args.unknown)

    try:
        subprocess.run(command, check=True)
        print("Waybackpack command executed successfully.")
    except subprocess.CalledProcessError as e:
        print(f"Error executing waybackpack: {e}")
        return False
    return True


def process_files(base_dir):
    """Copy each nested .html file into base_dir under a flattened name."""
    for root, _, files in os.walk(base_dir):
        for file in files:
            if file.endswith(".html"):
                original = os.path.join(root, file)
                relative_path = os.path.relpath(original, base_dir)
                new_filename = relative_path.replace(os.sep, "_")
                new_file_path = os.path.join(base_dir, new_filename)
                # Files already at the top level (e.g. from an earlier run)
                # map onto themselves; copying would raise SameFileError.
                if new_file_path == original:
                    continue
                shutil.copy(original, new_file_path)
                print(f"Copied {original} to {new_file_path}")


def main():
    """Process arguments and call waybackpack and file processing."""
    parser = argparse.ArgumentParser(description="Waybackpack Wrapper")
    parser.add_argument(
        "--dir", type=str, default="wb", help="Directory for storing results"
    )
    parser.add_argument(
        "--delay-retry", type=int, default=15, help="Delay between retries"
    )
    parser.add_argument(
        "--no-clobber",
        action="store_true",
        default=True,
        help="Do not overwrite existing files",
    )
    parser.add_argument(
        "--progress", action="store_true", default=True, help="Show progress"
    )
    args, unknown = parser.parse_known_args()
    args.unknown = unknown

    if run_waybackpack(args):
        process_files(args.dir)


if __name__ == "__main__":
    main()
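
The digest check suggested above could be sketched roughly as follows. This is only an illustration, not part of waybackpack: the function name drop_repeated_digests is hypothetical, the snapshots are assumed to already be available as dicts with "timestamp" and "digest" keys (field names the CDX API uses), and the sample values are made up.

```python
def drop_repeated_digests(snapshots):
    """Return the snapshots whose digest has not appeared earlier in the list.

    Hypothetical helper: each snapshot is assumed to be a dict with at
    least a "digest" key, in chronological order.
    """
    seen = set()
    unique = []
    for snap in snapshots:
        digest = snap["digest"]
        if digest in seen:
            # Same content as an earlier capture, so skip it.
            continue
        seen.add(digest)
        unique.append(snap)
    return unique


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    rows = [
        {"timestamp": "20080101000000", "digest": "AAA"},
        {"timestamp": "20080201000000", "digest": "AAA"},
        {"timestamp": "20080301000000", "digest": "BBB"},
    ]
    for snap in drop_repeated_digests(rows):
        print(snap["timestamp"], snap["digest"])
```

A wrapper would still need to obtain the digests from somewhere (for example, by querying the CDX API itself), which this sketch leaves out.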

reagle avatar Oct 18 '24 12:10 reagle

I got the same error. I vaguely understand the explanation that this is a problem with the Wayback Machine's own API, and waybackpack is doing a good thing by throwing an error instead of falling back to something the user (me) might not want to do after all. But I don't understand at all how this explanation maps onto the actual output of the waybackpack executable! What I see on my screen for this failure mode is:

$ waybackpack --raw --to-date 202401 --uniques-only --dir archive/ http://fq.math.ca/Scanned/28-3/andre-jeannin.pdf
Traceback (most recent call last):
  File "/Users/aodwyer/env/bin/waybackpack", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/aodwyer/env/lib/python3.12/site-packages/waybackpack/cli.py", line 142, in main
    snapshots = search(
                ^^^^^^^
  File "/Users/aodwyer/env/lib/python3.12/site-packages/waybackpack/cdx.py", line 47, in search
    raise WaybackpackException(
waybackpack.cdx.WaybackpackException: Wayback Machine CDX API not respecting showDupeCount=true; retry without --uniques-only.

...Ah, I get it, I had been parsing that message as "hocuspocus [is] not respecting showDupeCount=true; retry without --uniques-only", which confused me. I had in fact passed --uniques-only. So it was confusing to see an error message that claimed to apply only without --uniques-only. But you had meant me to parse it as "hocuspocus [does] not [respect] showDupeCount=true; [please] retry without --uniques-only"! That is, the last part was a command to the user (me), not a description of the failure mode.

I suggest improving the error message in three ways:

  • Actually catch the Python exception and output a proper message to the command-line user; don't just dump a stacktrace.
  • Rephrase the first part in active voice: "--uniques-only requires the Wayback Machine CDX API to respect showDupeCount=true, but in this case it doesn't."
  • Rephrase the second part as a new sentence, imperative, on a second line of text: "Please try again without --uniques-only."

So the final fixed behavior would look like this mockup:

$ waybackpack --raw --to-date 202401 --uniques-only --dir archive/ http://fq.math.ca/Scanned/28-3/andre-jeannin.pdf
Error: --uniques-only requires the Wayback Machine CDX API to respect `showDupeCount=true`, but in this case it doesn't.
Please try again without --uniques-only.

(The phrase "in this case" is super vague, of course, but I don't have the knowledge to improve its specificity.)
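
For what it's worth, the first suggestion (catch the exception and print a plain message) might look something like the sketch below. It is not waybackpack's actual code: run_cli and failing_main are hypothetical names, and the exception class is a local stand-in for waybackpack.cdx.WaybackpackException so the example runs on its own.

```python
import sys


class WaybackpackException(Exception):
    """Local stand-in for waybackpack.cdx.WaybackpackException."""


def run_cli(main_func, stream=sys.stderr):
    """Call main_func; on WaybackpackException, print the message and return 1.

    Hypothetical wrapper: returns 0 if main_func finishes normally.
    """
    try:
        main_func()
    except WaybackpackException as exc:
        # Show only the message to the user, not the stack trace.
        print(f"Error: {exc}", file=stream)
        return 1
    return 0


def failing_main():
    """Simulate the failure using the wording from the mockup above."""
    raise WaybackpackException(
        "--uniques-only requires the Wayback Machine CDX API to respect "
        "`showDupeCount=true`, but in this case it doesn't.\n"
        "Please try again without --uniques-only."
    )


if __name__ == "__main__":
    sys.exit(run_cli(failing_main))
```

Returning a non-zero exit code keeps the failure visible to scripts that call the tool, even though the traceback is gone.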

Quuxplusone avatar Jan 04 '25 18:01 Quuxplusone

Thank you, @Quuxplusone, for describing the confusion you encountered and for proposing improvements! They sound reasonable to me, and I will attempt something like that the next time I'm working on the library.

jsvine avatar Jan 10 '25 03:01 jsvine