Cannot mirror some sites that can be mirrored using wget

Open zorro opened this issue 2 years ago • 3 comments

I tried to mirror https://html.spec.whatwg.org using the command wget2 -mkpnp https://html.spec.whatwg.org. Only index.html and robots.txt get downloaded. There is no problem when mirroring with wget.

zorro, Mar 26 '22 05:03

Hi,

You might be using a very old version of wget2. With your command line, the current one (2.0.0 from git master) creates:

html.spec.whatwg.org/
├── demos
│   ├── canvas
│   │   └── blue-robot
│   │       └── index-idle.html
│   └── workers
│       ├── crypto
│       │   └── page.html
│       ├── modules
│       │   └── page.html
│       ├── multicore
│       │   └── page.html
│       ├── multiviewer
│       │   └── page.html
│       ├── primes
│       │   └── page.html
│       └── shared
│           ├── 001
│           │   └── test.html
│           ├── 002
│           │   └── test.html
│           └── 003
│               ├── inner.html
│               └── test.html
├── dev
│   ├── acknowledgements.html
│   ├── browsers.html
│   ├── browsing-the-web.html
│   ├── canvas.html
│   ├── common-dom-interfaces.html
│   ├── common-microsyntaxes.html
│   ├── comms.html
│   ├── custom-elements.html
│   ├── dnd.html
│   ├── dom.html
│   ├── droidserif-bolditalic.woff2
│   ├── droidserif-bold.woff2
│   ├── droidserif-italic.woff2
│   ├── droidserif.woff2
│   ├── dynamic-markup-insertion.html
│   ├── edits.html
│   ├── embedded-content.html
│   ├── embedded-content-other.html
│   ├── form-control-infrastructure.html
│   ├── form-elements.html
│   ├── forms.html
│   ├── grouping-content.html
│   ├── history.html
│   ├── iframe-embed-object.html
│   ├── imagebitmap-and-animations.html
│   ├── image-maps.html
│   ├── images.html
│   ├── index.html
│   ├── indices.html
│   ├── infrastructure.html
│   ├── input.html
│   ├── interaction.html
│   ├── interactive-elements.html
│   ├── introduction.html
│   ├── links.html
│   ├── media.html
│   ├── microdata.html
│   ├── mplus-2p-heavy.woff
│   ├── named-characters.html
│   ├── obsolete.html
│   ├── origin.html
│   ├── references.html
│   ├── scripting.html
│   ├── search.js
│   ├── sections.html
│   ├── semantics.html
│   ├── semantics-other.html
│   ├── server-sent-events.html
│   ├── structured-data.html
│   ├── styles.css
│   ├── syntax.html
│   ├── system-state.html
│   ├── tables.html
│   ├── text-level-semantics.html
│   ├── timers-and-user-prompts.html
│   ├── urls-and-fetching.html
│   ├── webappapis.html
│   ├── web-messaging.html
│   ├── webstorage.html
│   ├── window-object.html
│   ├── workers.html
│   ├── worklets.html
│   └── xhtml.html
├── entities.json
├── fonts
│   ├── Essays1743-BoldItalic.ttf
│   ├── Essays1743-Bold.ttf
│   ├── Essays1743-Italic.ttf
│   └── Essays1743.ttf
├── html-dfn.js
├── images
│   ├── arc1.png
│   ├── arcTo1.png
│   ├── arcTo2.png
│   ├── arcTo3.png
│   ├── asyncdefer.svg
│   ├── baselines.png
│   ├── bidiselect.png
│   ├── content-venn.svg
│   ├── custom-element-reactions.svg
│   ├── drawImage.png
│   ├── focus-tree.png
│   ├── im.png
│   ├── ircfog-modules.svg
│   ├── outline.svg
│   ├── parsing-model-overview.svg
│   ├── premultiplied-example-1.png
│   ├── premultiplied-example-2.png
│   ├── premultiplied-example-3.png
│   ├── premultiplied-example-4.png
│   ├── premultiplied-example-5.png
│   ├── sample-bdi.png
│   ├── sample-datalist.svg
│   ├── sample-details-1.png
│   ├── sample-details-2.png
│   ├── sample-email-1.svg
│   ├── sample-email-2.svg
│   ├── sample-meter.png
│   ├── sample-not-bdi.png
│   ├── sample-progress.png
│   ├── sample-range-2a.png
│   ├── sample-range-2b.png
│   ├── sample-range-labels.png
│   ├── sample-range.png
│   ├── sample-ruby-bopomofo.png
│   ├── sample-ruby-ja.png
│   ├── sample-ruby-pinyin.png
│   ├── sample-url.svg
│   ├── sample-usemap.png
│   ├── select-country-1.png
│   ├── select-country-2.png
│   └── table-scope-diagram.png
├── index.html
├── link-fixup.js
├── multipage
│   ├── acknowledgements.html
│   ├── browsers.html
│   ├── browsing-the-web.html
│   ├── canvas.html
│   ├── common-dom-interfaces.html
│   ├── common-microsyntaxes.html
│   ├── comms.html
│   ├── custom-elements.html
│   ├── dnd.html
│   ├── dom.html
│   ├── dynamic-markup-insertion.html
│   ├── edits.html
│   ├── embedded-content.html
│   ├── embedded-content-other.html
│   ├── form-control-infrastructure.html
│   ├── form-elements.html
│   ├── forms.html
│   ├── grouping-content.html
│   ├── history.html
│   ├── iana.html
│   ├── iframe-embed-object.html
│   ├── imagebitmap-and-animations.html
│   ├── image-maps.html
│   ├── images.html
│   ├── index.html
│   ├── indices.html
│   ├── infrastructure.html
│   ├── input.html
│   ├── interaction.html
│   ├── interactive-elements.html
│   ├── introduction.html
│   ├── links.html
│   ├── media.html
│   ├── microdata.html
│   ├── named-characters.html
│   ├── obsolete.html
│   ├── origin.html
│   ├── parsing.html
│   ├── references.html
│   ├── rendering.html
│   ├── scripting.html
│   ├── sections.html
│   ├── semantics.html
│   ├── semantics-other.html
│   ├── server-sent-events.html
│   ├── structured-data.html
│   ├── syntax.html
│   ├── system-state.html
│   ├── tables.html
│   ├── text-level-semantics.html
│   ├── timers-and-user-prompts.html
│   ├── urls-and-fetching.html
│   ├── webappapis.html
│   ├── web-messaging.html
│   ├── webstorage.html
│   ├── window-object.html
│   ├── workers.html
│   ├── worklets.html
│   └── xhtml.html
├── print.pdf
├── robots.txt
└── styles.css

rockdaboot, Mar 26 '22 18:03

@rockdaboot OK, it's working after updating. When I press Ctrl-C, the file being downloaded gets removed. How can I quit wget2 without removing the file being downloaded so that I can resume download later?

zorro, Mar 27 '22 20:03

> How can I quit wget2 without removing the file being downloaded so that I can resume download later?

Currently, wget2 uses a 10 MB buffer (per file/thread) that is lost on Ctrl-C. I'm not sure whether this is a real "problem"; how much it matters depends on your network speed. To flush the buffers on Ctrl-C, they would need to be accessible from outside the thread. That requires architectural changes that I'd like to avoid. Another option is to reduce the buffer size, either with a CLI option or automatically depending on the network (or better: data flow) speed.
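To make the buffer-size trade-off concrete, here is a minimal Python simulation. It is not wget2 code: the `BufferedSink` class, the `bytes_lost_on_interrupt` helper, the 16 KiB chunk size and the 64 KiB alternative buffer are all assumptions chosen for illustration. It only models a writer that holds data in memory until the buffer is full, and counts what would be discarded if the process stopped without flushing.

```python
import io

KIB = 1024
MIB = 1024 * KIB


class BufferedSink:
    """Hypothetical writer: keeps chunks in memory, writes only when the buffer is full."""

    def __init__(self, out, buffer_size):
        self.out = out
        self.buffer_size = buffer_size
        self.pending = bytearray()

    def write(self, chunk):
        self.pending += chunk
        if len(self.pending) >= self.buffer_size:
            # Buffer is full: hand everything to the underlying file.
            self.out.write(bytes(self.pending))
            self.pending.clear()


def bytes_lost_on_interrupt(buffer_size, received, chunk_size=16 * KIB):
    """Feed `received` bytes in chunks, then stop without flushing (simulated Ctrl-C).

    Returns the number of bytes that were received but never written out.
    """
    out = io.BytesIO()
    sink = BufferedSink(out, buffer_size)
    remaining = received
    while remaining > 0:
        n = min(chunk_size, remaining)
        sink.write(b"\0" * n)
        remaining -= n
    # Whatever is still in sink.pending is discarded here.
    return received - len(out.getvalue())


if __name__ == "__main__":
    received = 25 * MIB + 40 * KIB
    for size in (10 * MIB, 64 * KIB):
        lost = bytes_lost_on_interrupt(size, received)
        print(f"buffer={size} bytes, lost on interrupt={lost} bytes")
```

In this model the loss is always smaller than the buffer size, so a smaller buffer bounds how much has to be downloaded again, at the cost of more frequent writes.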

I'll leave this issue open as a reminder.

rockdaboot, May 15 '22 18:05