[Help needed] file size mismatch

Open lx30011 opened this issue 1 year ago • 4 comments

I'm running into this issue where gallery-dl occasionally chokes on partial content (code 206). It outputs "file size mismatch" after each of the five tries and then gives up. I read the HttpDownloader source but I'm not getting any wiser. Would appreciate your help with debugging.

[urllib3.connectionpool][debug] https://8chan.moe:443 "GET /.media/759a3119a0a06e717c704de7f9dcc8916dc3f8a8039531a399a5982e847e7770.png HTTP/1.1" 200 11982943
./8chan/v/640607 Meta Thread/678236-2 fug.mp4_20220827_142327.941.png
[downloader.http][warning] file size mismatch (1825585 < 11982943) (1/5)
[urllib3.connectionpool][debug] Resetting dropped connection: 8chan.moe
[urllib3.connectionpool][debug] https://8chan.moe:443 "GET /.media/759a3119a0a06e717c704de7f9dcc8916dc3f8a8039531a399a5982e847e7770.png HTTP/1.1" 206 10157358
[downloader.http][debug] Resuming download at byte 1825585
./8chan/v/640607 Meta Thread/678236-2 fug.mp4_20220827_142327.941.png
[downloader.http][warning] file size mismatch (1966080 < 11982943) (2/5)
[urllib3.connectionpool][debug] Resetting dropped connection: 8chan.moe
[urllib3.connectionpool][debug] https://8chan.moe:443 "GET /.media/759a3119a0a06e717c704de7f9dcc8916dc3f8a8039531a399a5982e847e7770.png HTTP/1.1" 206 10016863
[downloader.http][debug] Resuming download at byte 1966080
./8chan/v/640607 Meta Thread/678236-2 fug.mp4_20220827_142327.941.png
[downloader.http][warning] file size mismatch (2097152 < 11982943) (3/5)
[urllib3.connectionpool][debug] Resetting dropped connection: 8chan.moe
[urllib3.connectionpool][debug] https://8chan.moe:443 "GET /.media/759a3119a0a06e717c704de7f9dcc8916dc3f8a8039531a399a5982e847e7770.png HTTP/1.1" 206 9885791
[downloader.http][debug] Resuming download at byte 2097152
./8chan/v/640607 Meta Thread/678236-2 fug.mp4_20220827_142327.941.png
[downloader.http][warning] file size mismatch (2228224 < 11982943) (4/5)
^C
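
As a quick sanity check on the numbers in this log, here is a small sketch. It only uses values copied from the output above; the helper names `range_header` and `remaining` are made up for illustration and are not gallery-dl functions. If resuming works as expected, the length reported for each 206 response should equal the total size minus the resume offset.

```python
# Numbers copied from the log above.
TOTAL_SIZE = 11982943  # length reported for the initial 200 response

# (resume offset, length reported for the following 206 response)
RESUMES = [
    (1825585, 10157358),
    (1966080, 10016863),
    (2097152, 9885791),
]


def range_header(offset):
    """Standard HTTP Range value asking for everything from `offset` onward."""
    return "bytes={}-".format(offset)


def remaining(total, offset):
    """Bytes the server still has to send when resuming at `offset`."""
    return total - offset


for offset, content_length in RESUMES:
    # Prints the Range value and whether the 206 length matches the offset.
    print(range_header(offset), remaining(TOTAL_SIZE, offset) == content_length)

# Bytes gained per retry, from the sizes in the "file size mismatch" warnings.
sizes = [1825585, 1966080, 2097152, 2228224]
gained = [after - before for before, after in zip(sizes, sizes[1:])]
print(gained)
```

All three resumed responses line up with their offsets, so the 206 handling itself looks consistent. Each retry just adds roughly 128 KiB before the connection drops again.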

The site is https://8chan.moe/. I was testing with this thread (mostly SFW): https://8chan.moe/v/res/640607.html

I think the site serves all somewhat large files in chunks; at least, all the files it choked on were larger ones. This thread has a lot of large files (mostly SFW): https://8chan.moe/v/res/673938.html

If you run the extractor, I recommend setting sleep to 1 to make sure you don't get rate limited. You can let the extractor run and it should choke on some file. Here is my extractor:

# -*- coding: utf-8 -*-

# Copyright 2017-2021 Mike Fährmann
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.

from .common import Extractor, Message
from .. import text
import itertools

class _8chanThreadExtractor(Extractor):
  """Extractor for 8chan threads"""
  category = "8chan"
  subcategory = "thread"
  directory_fmt = ("{category}", "{boardUri}",
                   "{threadId} {subject[:50]}")
  filename_fmt = "{postId}{num:?-//} {filename[:80]}.{extension}"
  archive_fmt = "{boardUri}_{postId}_{num}"
  pattern = r"(?:https?://)?8chan\.moe/([^/]+)/res/(\d+)"

  def __init__(self, match):
    Extractor.__init__(self, match)
    self.board, self.thread = match.groups()

  def items(self):
    self.request("https://8chan.moe/")
    dust = "https://8chan.moe/{}/".format(self.board)
    self.request(dust, headers={"Referer": "https://8chan.moe/"})
    url = "https://8chan.moe/{}/res/{}.json".format(self.board, self.thread)
    thread = self.request(url, headers={"Referer": dust}).json()
    thread["postId"] = thread["threadId"]
    posts = thread.pop("posts")

    yield Message.Directory, thread

    for post in itertools.chain((thread,), posts):
      files = post.pop("files", ())
      if files:
        thread.update(post)
        for num, file in enumerate(files):
          file.update(thread)
          file["num"] = num
          file["_http_headers"] = {
            "Referer": "https://8chan.moe/{}/res/{}.html".format(self.board, self.thread)
          }
          url = "https://8chan.moe" + file["path"]
          text.nameext_from_url(file["originalName"], file)
          yield Message.Url, url, file

lx30011 avatar Sep 19 '22 19:09 lx30011

Upon further investigation, this doesn't seem to be an issue with gallery-dl, but rather with the site. Using wget, I'd get "Read error at byte 13893632/22135554 (The TLS connection was non-properly terminated.). Retrying." with the same frequency as with gallery-dl. The error message led me to this issue, and 8chan uses Varnish, so it might be relevant: https://github.com/varnish/hitch/issues/127

Once I let gallery-dl retry, it eventually succeeds in downloading the files. They look fine and videos play back fine; I'm just wondering whether there might be some corruption. I can't tell.
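
One way to check for corruption might look like the sketch below. The size comparison only needs the length the server reported for the full file (11982943 in the log from the first post). The hash comparison rests on an unverified ASSUMPTION: that the 64-hex-digit name in the `/.media/` URL is the SHA-256 of the file contents. If that assumption is wrong, `hash_ok` will be `False` even for intact files. The function names are made up for illustration.

```python
import hashlib
import os
import re


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large videos are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_from_url(url):
    """Return the 64-hex-digit file name from a /.media/ URL, or None."""
    match = re.search(r"/\.media/([0-9a-f]{64})", url)
    return match.group(1) if match else None


def check_download(path, expected_size, url=None):
    """Compare a downloaded file against the expected size and, optionally, hash."""
    result = {"size_ok": os.path.getsize(path) == expected_size}
    expected_hash = hash_from_url(url) if url else None
    if expected_hash:
        # ASSUMPTION (not confirmed anywhere in this thread): the name in the
        # URL is the SHA-256 of the file contents.
        result["hash_ok"] = sha256_of(path) == expected_hash
    return result
```

For the file from the first post, that would be something like `check_download(local_path, 11982943, url)`, where `local_path` is wherever gallery-dl saved it.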

lx30011 avatar Sep 19 '22 21:09 lx30011

I tried several things (--sleep 5, -o browser=firefox, exactly matching browser headers except cookies) and none of them worked. Sooner or later it would always end up with a [downloader.http][warning] file size mismatch.

What does work is sending captchaid and captchaexpiration cookies exported from a browser. captchaexpiration is easy enough to replicate, but it might be impossible to automatically generate a captchaid.

mikf avatar Sep 20 '22 16:09 mikf

Thank you very much for testing. If you look at my extractor, I'm requesting pages in a similar manner as to how a user would (root of the site, then board, then thread) with the goal of having the cookies generate. I noticed gallery-dl uses requests.Session underneath, which stores cookies obtained during a session, though I presume that doesn't work, going by your response.

Just to make sure I understand, do you think captchaid and captchaexpiration have anything to do with the file size mismatch error? If the cookies are set, there's no file size mismatch? The cookie has a short expiry date of three minutes so it would need enough runs to be sure that it eliminates the issue.

lx30011 avatar Sep 20 '22 19:09 lx30011

I think I figured it out. We can get both cookies by sending a request to https://8chan.moe/captcha.js with Accept: image/avif,image/webp,*/* and all the other usual headers (use your browser dev tools). This redirects to https://8chan.moe/.global/captchas/123456789e8c6a652e68e225 and sets both captchaid and captchaexpiration.

Afterwards those cookies would either need to be refreshed every few minutes, or maybe it would also work to unset or extend the expires value of captchaid.

> Just to make sure I understand, do you think captchaid and captchaexpiration have anything to do with the file size mismatch error?

Yeah, they do. Not sending them causes 8chan to periodically drop the connection during downloads and we get file size mismatch errors, which does not happen when they are included in a request. Just try it out.

> I'm requesting pages in a similar manner as to how a user would (root of the site, then board, then thread) with the goal of having the cookies generate

I think that's unnecessary. Just remove those two requests.


edit: A captchaid cookie that is several hours old still works, so just removing the expires value from it is the best option, I think.
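
Removing the expires value could look roughly like the sketch below. It works on a standard-library `http.cookiejar.CookieJar`; a `requests.Session` keeps its cookies in a subclass of that class, so the same loop should apply to `session.cookies`. The names `drop_expiry` and `make_cookie` are hypothetical, `make_cookie` only exists to build a demo cookie, and this is not necessarily how an extractor should end up doing it.

```python
import time
from http.cookiejar import Cookie, CookieJar


def make_cookie(name, value, expires, domain="8chan.moe"):
    """Hypothetical helper: build a cookie by hand for the demo below."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=True, expires=expires, discard=False,
        comment=None, comment_url=None, rest={},
    )


def drop_expiry(jar, name="captchaid"):
    """Turn every cookie called `name` into a session cookie without expiry."""
    changed = 0
    for cookie in jar:
        if cookie.name == name:
            cookie.expires = None  # no expiry timestamp, so it never counts as expired
            cookie.discard = True  # mark it as a session cookie
            changed += 1
    return changed


# Demo with a placeholder value and an expiry three minutes in the past.
jar = CookieJar()
jar.set_cookie(make_cookie("captchaid", "placeholder", time.time() - 180))
expired_before = [c.is_expired() for c in jar]
drop_expiry(jar)
expired_after = [c.is_expired() for c in jar]
print(expired_before, expired_after)
```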

edit2: For reference, I'm using these as _http_headers:

    headers = {
        "Accept": "video/webm,video/ogg,video/*;q=0.9,application/ogg;q=0.7,audio/*;q=0.6,*/*;q=0.5",
        "Accept-Language": "en-US,en;q=0.5",
        "Range": "bytes=0-",
        "DNT": "1",
        "Connection": "keep-alive",
        "Referer": "https://8chan.moe/v/res/673938.html",
        "Cookie": "captchaexpiration=Tue, 20 Sep 2022 16:29:07 GMT; captchaid=6329e99ff54ecb27cafcdbda3ZS8RplhsSLtWoxq8dLyrTvswKEgK2jE2w+u4LirJVr3qnpfsPVwpetZVMTkKHhR6BlaL/Ox9E3QB+voGG7T0A==",
        "Sec-Fetch-Dest": "video",
        "Sec-Fetch-Mode": "no-cors",
        "Sec-Fetch-Site": "same-origin",
        "TE": "trailers",
    }

mikf avatar Sep 20 '22 20:09 mikf

I took your code from https://github.com/mikf/gallery-dl/issues/2938#issue-1378452573, modified it quite a bit, and eventually ended up with a working extractor that does not get interrupted while downloading: https://github.com/mikf/gallery-dl/commit/1696f68a68f0438c270e48ba763911a230124e90.

mikf avatar Oct 11 '22 09:10 mikf