fails to follow redirect to webseed file

Open milahu opened this issue 1 year ago • 8 comments

redirects should be handled in libtorrent per https://github.com/arvidn/libtorrent/issues/7325 but libtorrent still fails to download some files

example: https://archive.org/details/AldousHuxley-BNW

torrent: https://archive.org/download/AldousHuxley-BNW/AldousHuxley-BNW_archive.torrent

$ python -c 'import torf; t = torf.Torrent.read("AldousHuxley-BNW_archive.torrent"); print("\n".join(t.webseeds))'
https://archive.org/download/
http://ia600402.us.archive.org/32/items/
http://ia800402.us.archive.org/32/items/

webseed url: http://ia600402.us.archive.org/32/items/

$ curl -s -I http://ia600402.us.archive.org/32/items/ | grep -i -e ^HTTP -e ^location
HTTP/1.1 200 OK

missing file: Aldous+Huxley_thumb.jpg

$ curl -s -I http://ia600402.us.archive.org/32/items/AldousHuxley-BNW/Aldous+Huxley_thumb.jpg | grep -i -e ^HTTP -e ^location
HTTP/1.1 301 Moved Permanently
Location: http://ia800304.us.archive.org/18/items/AldousHuxley-BNW/Aldous%2BHuxley_thumb.jpg

so the redirect happens not on the webseed url but on the file url, and qbittorrent fails to download the file

qBittorrent: 4.6.4 Libtorrent: 2.0.10.0

this may be a bug in qBittorrent but i assume that libtorrent is responsible for all the network stuff

downstream issues

  • https://github.com/qbittorrent/qBittorrent/issues/15193
  • https://github.com/qbittorrent/qBittorrent/issues/19747
  • https://archive.org/post/1116372/solution-for-unfinished-torrents

milahu avatar Jun 03 '24 15:06 milahu

I'm leaning towards it being an issue with archive.org itself.

stalkerok avatar Nov 11 '24 21:11 stalkerok

The issue with archive.org is they don't regenerate the entry metadata when the servers get moved around. Normally adding content/people leaving a comment triggers a metadata update which then recreates the torrent file.

Doing a mass update on their side I think is infeasible with how much data they host.

Both of these are impossible to do in their current state, with all the attacks that happened recently.

You as the content owner can't even manually regenerate the entry.

Given the circumstances, I believe libtorrent should follow the redirects as that's the path of least resistance.

parkerlreed avatar Nov 11 '24 21:11 parkerlreed

Maybe I don't understand the redirects you're talking about, but libtorrent already follows redirects.

stalkerok avatar Nov 12 '24 18:11 stalkerok

Ok, I'm probably misunderstanding the issue then.

parkerlreed avatar Nov 13 '24 00:11 parkerlreed

I haven't sat down to determine if the problem is libtorrent or the client qbittorrent. But the problem manifests thusly (same experience as another above, but in my words):

Grab a torrent, say: https://archive.org/download/kancolle-movie-720/kancolle-movie-720_archive.torrent

which produces some http sources like: http://ia600109.us.archive.org/27/items/ except that source isn't valid any more. (partial content download)

If you navigate your browser to http://ia600109.us.archive.org/27/items/kancolle-movie-720/, it'll 301 (as of this writing) to https://ia801501.us.archive.org/34/items/kancolle-movie-720/, of which the part "https://ia801501.us.archive.org/34/items/" is a valid source. Update this HTTP source in qBittorrent manually, and your download magically completes.

Yes, archive.org failed to update the torrent file with new webseeds. But why did qbittorrent not find the new source anyway? It had all the information to do it.

That's about as far as I can investigate it at the moment, but that should be reproducible by anyone.

SheepReaper avatar Jan 28 '25 22:01 SheepReaper

any updates on this issue by anyone?

jaycc3000 avatar Mar 12 '25 17:03 jaycc3000

If you navigate your browser to http://ia600109.us.archive.org/27/items/kancolle-movie-720/, it'll 301 (as of this writing) to https://ia801501.us.archive.org/34/items/kancolle-movie-720/, of which the part "https://ia801501.us.archive.org/34/items/" is a valid source. Update this HTTP source in qBittorrent manually, and your download magically completes.

That's about as far as I can investigate it at the moment, but that should be reproducible by anyone.

I am unable to view the contents of any /items/ directory. I am sent to a page reading "Directory browsing not supported at this level. Please go through the Internet Archive web site to access the content." Consequently I'm unable to find an updated HTTP source. Perhaps they've changed some security setting in response to the recent attacks?

peternschmit avatar Jun 08 '25 02:06 peternschmit

i found multiple issues in libtorrent/src/web_peer_connection.cpp

test torrent files

minimal test torrent file: one content file single_file_request == true

mkdir AldousHuxley-BNW
cd AldousHuxley-BNW
wget http://ia600402.us.archive.org/32/items/AldousHuxley-BNW/Aldous+Huxley_thumb.jpg
cd ..
python -c 'import torf; t = torf.Torrent(path="AldousHuxley-BNW", webseeds=["http://ia600402.us.archive.org/32/items/"]); t.generate(); t.write("AldousHuxley-BNW.1.torrent")'
rm -rf AldousHuxley-BNW

minimal test torrent file: two content files single_file_request == false

mkdir AldousHuxley-BNW
cd AldousHuxley-BNW
wget http://ia600402.us.archive.org/32/items/AldousHuxley-BNW/Aldous+Huxley_thumb.jpg
wget http://ia600402.us.archive.org/32/items/AldousHuxley-BNW/AldousHuxley-BNW_meta.xml
cd ..
python -c 'import torf; t = torf.Torrent(path="AldousHuxley-BNW", webseeds=["http://ia600402.us.archive.org/32/items/"]); t.generate(); t.write("AldousHuxley-BNW.2.torrent")'
rm -rf AldousHuxley-BNW

minimal test torrent file: three content files (version 1 with broken redirect on __ia_thumb.jpg) single_file_request == false

mkdir AldousHuxley-BNW
cd AldousHuxley-BNW
wget http://ia600402.us.archive.org/32/items/AldousHuxley-BNW/Aldous+Huxley_thumb.jpg
wget http://ia600402.us.archive.org/32/items/AldousHuxley-BNW/AldousHuxley-BNW_meta.xml
wget http://ia600402.us.archive.org/32/items/AldousHuxley-BNW/__ia_thumb.jpg
cd ..
python -c 'import torf; t = torf.Torrent(path="AldousHuxley-BNW", webseeds=["http://ia600402.us.archive.org/32/items/"]); t.generate(); t.write("AldousHuxley-BNW.3.torrent")'
rm -rf AldousHuxley-BNW

minimal test torrent file: three content files (version 2) single_file_request == false

mkdir AldousHuxley-BNW
cd AldousHuxley-BNW
wget http://ia600402.us.archive.org/32/items/AldousHuxley-BNW/Aldous+Huxley_thumb.jpg
wget http://ia600402.us.archive.org/32/items/AldousHuxley-BNW/AldousHuxley-BNW_meta.xml
wget http://ia600402.us.archive.org/32/items/AldousHuxley-BNW/AldousHuxley-BNW_files.xml
cd ..
python -c 'import torf; t = torf.Torrent(path="AldousHuxley-BNW", webseeds=["http://ia600402.us.archive.org/32/items/"]); t.generate(); t.write("AldousHuxley-BNW.3b.torrent")'
rm -rf AldousHuxley-BNW

problem: in src/web_peer_connection.cpp, m_path was appended with the torrent name (for example AldousHuxley-BNW/), while m_url was appended with the file path (for example AldousHuxley-BNW/Aldous+Huxley_thumb.jpg). the torrent name was not removed from the redirect location, and because it had a trailing slash, the file path was appended again on the next request. so the torrent name appeared twice, as in AldousHuxley-BNW/AldousHuxley-BNW/Aldous+Huxley_thumb.jpg, which results in an "http 404 not found" response.

solution: 42b304f97068772ae8c46cd661d738c5b4dfcaa8 and following commits in my magnet-uri-webseeding branch (see also #7960). both m_path and m_url should be appended with the file path, and on redirect, the file path should be removed from the redirect location.

trivial case: the redirect location does end with the file path, so we can create a new webseed and avoid more redirects.

complex case (not implemented; see also: different files can have different redirects): the redirect location does NOT end with the file path, so we have to keep the old webseed and follow the redirect for each file.
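To make the two cases concrete, here is a minimal Python sketch of the idea. It is not the code in src/web_peer_connection.cpp. The function names buggy_base and base_from_redirect are hypothetical, and the sketch assumes the redirect location carries no query string. Path segments are percent-decoded before comparison, because the server answers with Aldous%2BHuxley_thumb.jpg for a request for Aldous+Huxley_thumb.jpg.

```python
from urllib.parse import unquote


def buggy_base(location):
    # Illustrates the bug: only the last path segment is dropped,
    # so the torrent name stays in the base and gets appended twice later.
    return location.rsplit("/", 1)[0] + "/"


def base_from_redirect(location, file_path):
    """Return the new webseed base URL (trivial case), or None when the
    location does not end with file_path (complex case)."""
    want = file_path.split("/")
    got = location.split("/")
    # "scheme://host/..." splits into at least 3 leading segments
    # ("http:", "", host) before the path segments start.
    if len(got) < len(want) + 3:
        return None
    # Compare decoded segments, but keep the raw base untouched.
    tail = [unquote(segment) for segment in got[-len(want):]]
    if tail != want:
        return None
    return "/".join(got[:-len(want)]) + "/"
```

With the location from the first post and the file path AldousHuxley-BNW/Aldous+Huxley_thumb.jpg, base_from_redirect returns http://ia800304.us.archive.org/18/items/, while buggy_base keeps the AldousHuxley-BNW/ directory in the base, so appending the file path doubles the torrent name.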


problem: webseeds are re-used after redirect

solution: 4051e666577d890527e5dc7c9c1b19123c4bf8a1: use "disable" instead of "disconnect". (TODO: re-enable the original webseed when files are missing on the new webseed. see also: different files can have different redirects)


problem: when files are smaller than pieces, file 0 can mark piece 0 as "have" and file 1 can mark piece 0 as "don't have", so the webseed peer can get banned. (TODO verify: what exactly happens?)

solution: round up. if we have part of a piece, mark the entire piece as "have".
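A minimal Python sketch of the "round up" rule, not libtorrent's actual code. The function names and the 16384-byte piece length are assumptions for illustration. The file sizes 6501 and 3777 are the ones reported elsewhere in this comment.

```python
def pieces_touched_by_file(file_offset, file_size, piece_length):
    """Indices of all pieces that contain at least one byte of the file."""
    if file_size == 0:
        return range(0)
    first = file_offset // piece_length
    last = (file_offset + file_size - 1) // piece_length
    return range(first, last + 1)


def pieces_available(files_present, file_sizes, piece_length):
    """Round up: a piece counts as "have" if any present file overlaps it.
    A missing file never clears a piece that another file already set."""
    have = set()
    offset = 0
    for present, size in zip(files_present, file_sizes):
        if present:
            have.update(pieces_touched_by_file(offset, size, piece_length))
        offset += size
    return have
```

With two small files sharing piece 0, pieces_available([True, False], [6501, 3777], 16384) still contains piece 0, so the second file cannot flip the piece back to "don't have".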

"fixed" in 1fe0754a307f8d67737ed13572a14ee42e13447b (assume webseed to have all pieces). i don't understand why the old code was needed. optimization? avoid requests? what if a webseed has only the last 1% of files? we still should try to fetch all files (in random order). currently, we penalize a webseed if it does not have the first N files (or we mark the webseed as "not interesting").


problem: different files can have different redirects

example torrent: AldousHuxley-BNW.3.torrent

$ curl -sIL http://ia600402.us.archive.org/32/items/AldousHuxley-BNW/Aldous+Huxley_thumb.jpg | grep -i -e ^location -e ^content-length
Location: http://ia600304.us.archive.org/18/items/AldousHuxley-BNW/Aldous%2BHuxley_thumb.jpg
Content-Length: 6501

$ curl -sIL http://ia600402.us.archive.org/18/items/AldousHuxley-BNW/__ia_thumb.jpg | grep -i -e ^location -e ^content-length
Location: https://archive.org/images/notfound2x.png
content-length: 3777

if we re-use the redirect of another file, we can get a different file

$ curl -sIL http://ia600304.us.archive.org/18/items/AldousHuxley-BNW/__ia_thumb.jpg | grep -i -e ^location -e ^content-length
Content-Length: 14272

solution: also check file sizes (not implemented)

in this case, the expected file size is 3777 bytes

$ python -c 'import torf; print(torf.Torrent.read("AldousHuxley-BNW.3.torrent").files[2].size)'
3777

so when the server offers a file with 14272 bytes we already know it will be a wrong file
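A minimal Python sketch of such a size check (the check itself is not implemented in libtorrent, per the note above). The function name content_length_matches is hypothetical. The sketch assumes the headers come from a response for the whole file (for example a HEAD request), not from a ranged response, where Content-Length is the length of the range.

```python
def content_length_matches(expected_size, headers):
    """Return False only when the server clearly offers a different size.

    headers is a plain dict of response headers. Header names are
    compared case-insensitively, since the responses shown above
    use both "Content-Length" and "content-length".
    """
    value = None
    for name, header_value in headers.items():
        if name.lower() == "content-length":
            value = header_value
    if value is None:
        # No size information, so the file cannot be rejected up front.
        return True
    try:
        return int(value) == expected_size
    except ValueError:
        # A malformed Content-Length is treated as a mismatch.
        return False
```

For the example above, content_length_matches(3777, {"Content-Length": "14272"}) is False, so the file could be skipped before any piece is downloaded and hashed.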

this is a synthetic test case, but this can happen in real life, so we need a way to go back to the original webseed url and resolve the redirect again for this file

currently such a webseed peer is banned for "too many corrupt pieces"
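One possible shape for "going back to the original webseed url", as a hedged Python sketch. This is a design illustration only and not something from libtorrent. The class name WebseedUrls and its methods are hypothetical, and URL escaping of the file path is left out.

```python
class WebseedUrls:
    """Remembers the original base and falls back to it per file."""

    def __init__(self, original_base):
        self.original_base = original_base
        self.redirected_base = None
        # File paths for which the redirected base served wrong data.
        self.rejected = set()

    def url_for(self, file_path):
        # Prefer the redirected base unless it was rejected for this file.
        if self.redirected_base is not None and file_path not in self.rejected:
            return self.redirected_base + file_path
        return self.original_base + file_path

    def reject_redirect_for(self, file_path):
        # Called after a size mismatch or a failed hash check, so the next
        # request resolves the redirect again from the original base.
        self.rejected.add(file_path)
```

The point of the sketch is that a bad result for one file only changes the url for that file, instead of banning the whole webseed.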


other problems

problem: non-ascii filenames are broken: non-ascii characters are replaced with dots

+ AldousHuxley-BNW/01. Schöne neue Welt, 1. Kapitel (Aldous Huxley).afpk
- AldousHuxley-BNW/01. Sch.ne neue Welt, 1. Kapitel (Aldous Huxley).afpk

todo: is this a problem only with examples/exact_source_client?

https://github.com/qbittorrent/qBittorrent/issues/16127

When libtorrent detects an illegal/invalid unicode point in the string it substitutes it with a dot.

https://blog.libtorrent.org/2014/12/filenames/

unicode

If any illegal sequence is encountered, it’s replaced by a replacement character.
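A small Python illustration of the substitution described in the two quotes. This is not libtorrent's code, and it is only a guess at how the "Sch.ne" output above could arise: it assumes the name reaches the decoder as Latin-1 bytes, where "ö" is the single byte 0xF6, which is not valid UTF-8. The error handler name "dot" and the function decode_name are hypothetical.

```python
import codecs


def _dot_on_error(err):
    # Replace each undecodable byte sequence with a single dot and
    # continue decoding right after it.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    return (".", err.end)


codecs.register_error("dot", _dot_on_error)


def decode_name(raw):
    """Decode a filename as UTF-8, substituting dots for invalid bytes."""
    return raw.decode("utf-8", errors="dot")
```

Under that assumption, decode_name("Schöne".encode("latin-1")) gives "Sch.ne", while the same name encoded as UTF-8 decodes unchanged. Whether this is the cause in examples/exact_source_client is still the open todo above.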


my devloop is

mkdir build
cd build
cmake -Dbuild_examples=on ..
# TODO create AldousHuxley-BNW.3b.torrent - see above
LANG=C make -j$(nproc); ( mkdir -p tmp; cd tmp/; rm -rf AldousHuxley-BNW/; ../examples/exact_source_client ../AldousHuxley-BNW.3b.torrent; stty sane )

the output should look like

...
[2227] AldousHuxley-BNW torrent finished downloading
...
[2501] AldousHuxley-BNW: torrent.cpp 10434: maybe_connect_web_seeds
[2501] AldousHuxley-BNW: torrent.cpp 10440: maybe_connect_web_seeds is_finished()
[2501] AldousHuxley-BNW: torrent.cpp 10446: maybe_connect_web_seeds return
[3502] AldousHuxley-BNW: torrent.cpp 10434: maybe_connect_web_seeds
[3502] AldousHuxley-BNW: torrent.cpp 10440: maybe_connect_web_seeds is_finished()
[3502] AldousHuxley-BNW: torrent.cpp 10446: maybe_connect_web_seeds return
[4502] AldousHuxley-BNW: torrent.cpp 10434: maybe_connect_web_seeds
[4502] AldousHuxley-BNW: torrent.cpp 10440: maybe_connect_web_seeds is_finished()
[4502] AldousHuxley-BNW: torrent.cpp 10446: maybe_connect_web_seeds return

milahu avatar Jun 25 '25 14:06 milahu