
Unhelpful "Failed to read 102400 bytes" and "Failed to transcode" errors

Open catharsis71 opened this issue 2 years ago • 17 comments

Trying to run a mirror download, I'm getting a lot of vague errors with no indication of which URLs trigger them (so they can't be investigated further) and no indication of what the errors actually mean

The quantity of "Failed to read 102400 bytes" errors varies from run to run, as do the numbers in parentheses, whatever they mean

Running a re-download (files from the previous download still on disk), the "Failed to read" messages disappear but the transcode errors remain; in fact, sometimes there are more of them

Despite all the errors printed, it usually says "Errors: 0" at the end; sometimes it says "Errors: 1" but no additional errors are printed in that case

The ordering of the errors also varies from run to run

Sample run using GnuTLS:


XXXXXXXXX:/mnt/s/wget-temp/temp$ rm -rf *
XXXXXXXXX:/mnt/s/wget-temp/temp$ wget2_gnutls --version
GNU Wget2 2.0.1 - multithreaded metalink/file/website downloader

+digest +https +ssl/gnutls +ipv6 +iri +large-file +nls -ntlm -opie +psl -hsts
+iconv +idn2 +zlib +lzma +brotlidec +zstd +bzip2 +lzip +http2 +gpgme

Copyright (C) 2012-2015 Tim Ruehsen
Copyright (C) 2015-2021 Free Software Foundation, Inc.


Please send bug reports and questions to <[email protected]>.
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$ time wget2_gnutls -v -o log.txt -m -np https://skyqueen.cc/archive/71master/cracky/kareha.pl/
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (32)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (32)
Failed to read 102400 bytes (2)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
4477 files           129% [==========================================================================================================================================================================================================>]   24.98M    --.-KB/s
4424 files           114% [==========================================================================================================================================================================================================>]   21.61M    --.-KB/s
3778 files           108% [==========================================================================================================================================================================================================>]   19.03M    --.-KB/s
4743 files           100% [==========================================================================================================================================================================================================>]   20.13M    --.-KB/s
3996 files           100% [==========================================================================================================================================================================================================>]   18.13M    --.-KB/s
                          [Files: 21418  Bytes: 103.91M [601.42KB/s] Redirects: 41  Todo: 0  Errors: 0                                                                                                                                ]

real    2m56.952s
user    0m3.625s
sys     0m12.922s
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$ find -type f | wc -l
21418
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$ time wget2_gnutls -v -o log.txt -m -np https://skyqueen.cc/archive/71master/cracky/kareha.pl/
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
35 files             106916% [==========================================================================================================================================================================================================>]    8.34M    6.68MB/s files             56830% [==========================================================================================================================================================================================================>]    1.45M    --.-KB/18 files             2133% [==========================================================================================================================================================================================================>]   46.35K    --.-KB/s9 files              1075% [==========================================================================================================================================================================================================>]   23.37K    --.-KB/s18 files             1617% [==========================================================================================================================================================================================================>]   49.19K    --.-KB/s                          [Files: 113  Bytes: 9.91M [133.76KB/s] Redirects: 41  Todo: 0  Errors: 0                                                                                                                                    ]

real    1m15.954s
user    0m1.406s
sys     0m8.047s
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$ find -type f | wc -l
21418
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$ time wget2_gnutls -v -o log.txt -m -np https://skyqueen.cc/archive/71master/cracky/kareha.pl/
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
21 files             63287% [==========================================================================================================================================================================================================>]    2.80M    --.-KB/20 files             55991% [==========================================================================================================================================================================================================>]    1.42M  219.68KB/s8 files             31539% [==========================================================================================================================================================================================================>]    1.47M    --.-KB/20 files             245995% [==========================================================================================================================================================================================================>]    4.17M    4.81MB/s files             949% [==========================================================================================================================================================================================================>]   41.26K    --.-KB/ss                          [Files: 113  Bytes: 9.91M [131.80KB/s] Redirects: 41  Todo: 0  Errors: 0                                                                                                                                    ]

real    1m17.085s
user    0m1.359s
sys     0m8.625s
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$ find -type f | wc -l
21418
XXXXXXXXX:/mnt/s/wget-temp/temp$

Another run using WolfSSL:


XXXXXXXXX:/mnt/s/wget-temp/temp$ rm -rf *
XXXXXXXXX:/mnt/s/wget-temp/temp$ wget2_wolfssl --version
GNU Wget2 2.0.1 - multithreaded metalink/file/website downloader

+digest +https +ssl/wolfssl +ipv6 +iri +large-file +nls -ntlm -opie +psl -hsts
+iconv +idn2 +zlib +lzma +brotlidec +zstd +bzip2 +lzip +http2 +gpgme

Copyright (C) 2012-2015 Tim Ruehsen
Copyright (C) 2015-2021 Free Software Foundation, Inc.

License GPLv3+: GNU GPL version 3 or later
<http://www.gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
XXXXXXXXX:/mnt/s/wget-temp/temp$ time wget2_wolfssl -v -o log.txt -m -np https://skyqueen.cc/archive/71master/cracky/kareha.pl/
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (32)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to read 102400 bytes (11)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
4146 files           122% [==========================================================================================================================================================================================================>]   22.68M    --.-KB/s
4199 files           107% [==========================================================================================================================================================================================================>]   20.10M    --.-KB/s
4119 files           107% [==========================================================================================================================================================================================================>]   19.43M    --.-KB/s
4484 files           100% [==========================================================================================================================================================================================================>]   19.60M    --.-KB/s
4456 files           114% [==========================================================================================================================================================================================================>]   22.00M    --.-KB/s
                          [Files: 21403  Bytes: 103.77M [586.54KB/s] Redirects: 41  Todo: 0  Errors: 0                                                                                                                                ]

real    3m1.207s
user    0m4.625s
sys     0m13.547s
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$ find -type f | wc -l
21404
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$ time wget2_wolfssl -v -o log.txt -m -np https://skyqueen.cc/archive/71master/cracky/kareha.pl/
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
21 files             180854% [==========================================================================================================================================================================================================>]    4.17M    --.-KB26 files             47471% [==========================================================================================================================================================================================================>]    2.82M    1.79MB/s5 files             1842% [==========================================================================================================================================================================================================>]   64.04K  796.22KB/s27 files             56809% [==========================================================================================================================================================================================================>]    1.44M    1.66MB/s4 files             41622% [==========================================================================================================================================================================================================>]    1.41M    1.63MB/s                         [Files: 113  Bytes: 9.91M [130.32KB/s] Redirects: 41  Todo: 0  Errors: 0                                                                                                                                    ]

real    1m17.960s
user    0m1.859s
sys     0m8.828s
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$ find -type f | wc -l
21404
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$ time wget2_wolfssl -v -o log.txt -m -np https://skyqueen.cc/archive/71master/cracky/kareha.pl/
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
34 files             97901% [==========================================================================================================================================================================================================>]    5.58M    --.-KB/21 files             110073% [==========================================================================================================================================================================================================>]    2.80M    1.74MB/s files             28198% [==========================================================================================================================================================================================================>]    1.43M    1.16MB/19 files             2135% [==========================================================================================================================================================================================================>]   46.40K  955.46KB/s17 files             2353% [==========================================================================================================================================================================================================>]   51.14K    1.72MB/s                          [Files: 113  Bytes: 9.91M [133.64KB/s] Redirects: 41  Todo: 0  Errors: 0                                                                                                                                    ]

real    1m16.025s
user    0m1.484s
sys     0m9.031s
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$ find -type f | wc -l
21404

and finally OpenSSL:


XXXXXXXXX:/mnt/s/wget-temp/temp$ rm -rf *
XXXXXXXXX:/mnt/s/wget-temp/temp$ wget2_openssl --version
GNU Wget2 2.0.1 - multithreaded metalink/file/website downloader

+digest +https +ssl/openssl +ipv6 +iri +large-file +nls -ntlm -opie +psl -hsts
+iconv +idn2 +zlib +lzma +brotlidec +zstd +bzip2 +lzip +http2 +gpgme

Copyright (C) 2012-2015 Tim Ruehsen
Copyright (C) 2015-2021 Free Software Foundation, Inc.

License GPLv3+: GNU GPL version 3 or later
<http://www.gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Please send bug reports and questions to <[email protected]>.
XXXXXXXXX:/mnt/s/wget-temp/temp$
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
Failed to read 102400 bytes (2)
4431 files           107% [==========================================================================================================================================================================================================>]   20.83M    --.-KB/s
3944 files           116% [==========================================================================================================================================================================================================>]   20.86M    --.-KB/s
4191 files           115% [==========================================================================================================================================================================================================>]   21.02M    --.-KB/s
4550 files           107% [==========================================================================================================================================================================================================>]   20.98M    --.-KB/s
4302 files           107% [==========================================================================================================================================================================================================>]   20.21M    --.-KB/s
                          [Files: 21418  Bytes: 103.91M [564.22KB/s] Redirects: 41  Todo: 0  Errors: 0                                                                                                                                ]

real    3m8.619s
user    0m4.453s
sys     0m14.938s
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$ find -type f | wc -l
21418
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$ time wget2_openssl -v -o log.txt -m -np https://skyqueen.cc/archive/71master/cracky/kareha.pl/
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
21 files             58071% [==========================================================================================================================================================================================================>]    2.80M    1.16MB/17 files             41770% [==========================================================================================================================================================================================================>]    1.41M    --.-KB/51 files             94444% [==========================================================================================================================================================================================================>]    5.63M    4.79MB/s1 files               0% [ <=>                                                                                                                                                                                                       ]   26.55K    --.-KB/s
13 files             1026% [==========================================================================================================================================================================================================>]   35.67K    --.-KB/s                          [Files: 113  Bytes: 9.91M [128.19KB/s] Redirects: 41  Todo: 0  Errors: 0                                                                                                                                    ]

real    1m19.251s
user    0m1.422s
sys     0m7.688s
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$ find -type f | wc -l
21418
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$ time wget2_openssl -v -o log.txt -m -np https://skyqueen.cc/archive/71master/cracky/kareha.pl/
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
30 files             139911% [==========================================================================================================================================================================================================>]    5.57M    1.68MB/s files             54858% [==========================================================================================================================================================================================================>]    1.40M  119.43KB/21 files             56225% [==========================================================================================================================================================================================================>]    1.43M    1.47MB/s8 files             1420% [==========================================================================================================================================================================================================>]   49.39K    2.33MB/s31 files             28599% [==========================================================================================================================================================================================================>]    1.45M    1.53MB/s                         [Files: 113  Bytes: 9.91M [131.03KB/s] Redirects: 41  Todo: 0  Errors: 0                                                                                                                                    ]

real    1m17.538s
user    0m1.453s
sys     0m8.203s
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$
XXXXXXXXX:/mnt/s/wget-temp/temp$ find -type f | wc -l
21418
XXXXXXXXX:/mnt/s/wget-temp/temp$

Another thing of note is that, on these particular runs, GnuTLS and OpenSSL reported 21418 files downloaded and actually downloaded that same number of files. WolfSSL reported 21403 files downloaded but actually downloaded 21404 files (14 fewer than the other two runs). However, this seems to vary randomly from run to run and is not actually tied to the TLS library used. I haven't yet dug into the missing-file issue to see which files are getting skipped; I might open another bug after I do.
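One way to find out which files a run skipped is to compare the file listing of a complete mirror against an incomplete one. The following is a minimal sketch, not part of wget2; the function names and the directory names in the usage note are hypothetical.

```python
import os


def list_files(root: str) -> set:
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def missing_from(complete_root: str, partial_root: str) -> list:
    """Files present in complete_root but absent from partial_root, sorted."""
    return sorted(list_files(complete_root) - list_files(partial_root))
```

Assuming two mirror runs were kept in separate directories, `missing_from("run_complete", "run_partial")` would list the relative paths that exist only in the first one.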

catharsis71 avatar Oct 09 '22 03:10 catharsis71

Correction, all of the above tests were actually using GnuTLS. Using OpenSSL, I get some of the same errors but there are also differences & new errors:

$ time wget2-openssl --max-threads=1 -o log.txt -m -np https://skyqueen.cc/archive/71master/cracky/kareha.pl/
TLS write error: (null)
TLS write error: (null)
TLS write error: (null)
TLS read error: unexpected eof while reading
Failed to read 102400 bytes (0)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
TLS write error: (null)
TLS read error: unexpected eof while reading
Failed to read 102400 bytes (0)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
TLS read error: unexpected eof while reading
Failed to read 102400 bytes (0)
TLS read error: unexpected eof while reading
Failed to read 102400 bytes (0)
TLS write error: (null)
TLS write error: (null)
TLS write error: (null)
TLS write error: (null)
TLS write error: (null)
TLS write error: (null)
TLS read error: unexpected eof while reading
Failed to read 102400 bytes (0)
TLS read error: unexpected eof while reading
Failed to read 102400 bytes (0)
TLS write error: (null)
TLS write error: (null)
TLS write error: (null)
TLS write error: (null)
TLS read error: unexpected eof while reading
Failed to read 102400 bytes (0)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
TLS write error: (null)
TLS write error: (null)
TLS write error: (null)
TLS write error: (null)
TLS write error: (null)
TLS write error: (null)
TLS write error: (null)
TLS write error: (null)
TLS write error: (null)
TLS write error: (null)
TLS write error: (null)
TLS read error: unexpected eof while reading
Failed to read 102400 bytes (0)
21402 files          110% [==========================================================================================================================================================================================================>]  103.80M    --.-KB/s
                          [Files: 21401  Bytes: 103.79M [563.36KB/s] Redirects: 41  Todo: 0  Errors: 0                                                                                                                                ]

real    3m8.684s
user    0m4.438s
sys     0m16.109s

catharsis71 avatar Oct 10 '22 22:10 catharsis71

At first glance, the write/read errors seem to indicate that the server randomly closes the connection. So "Failed to read 102400 bytes" is a very low-level and generic message. The errno value in parentheses is a bit random and misleading here, likely because it sometimes got changed while travelling through several layers of the TLS library.

The current master of wget2 changes these messages to

Failed to read 102400 bytes (hostname='skyqueen.cc', ip=185.141.27.108, errno=104)
Failed to read 102400 bytes (hostname='skyqueen.cc', ip=185.141.27.108, errno=17)
Failed to read 102400 bytes (hostname='skyqueen.cc', ip=185.141.27.108, errno=104)
Failed to read 102400 bytes (hostname='skyqueen.cc', ip=185.141.27.108, errno=2)
...

There is a CLI command, errno, that translates the numbers into strings, e.g.

$ LC_ALL=C errno 104
ECONNRESET 104 Connection reset by peer
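If the errno command is not installed, the same lookup can be approximated with Python's standard library. This is only a sketch: the helper name describe_errno is made up for illustration, and both the symbolic names and the message texts depend on the platform it runs on.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return e.g. 'EPIPE 32 Broken pipe' for a numeric errno value."""
    # errno.errorcode maps numbers to symbolic names known on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} {code} {os.strerror(code)}"


# Codes that appear in the logs in this thread.
for code in (2, 11, 17, 32, 84, 104):
    print(describe_errno(code))
```

The numbers only match the ones in the logs when the script runs on the same kind of system that produced them, since errno values are not identical across operating systems.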

These errors are temporary: Wget2 tries to download the affected files again. The number of errors at the end is the number of downloads that ultimately failed, so the errors above are not counted.

We could improve the error messages further if we could rely on the errno from the underlying layers (mostly TLS API).

I still have to test with OpenSSL / WolfSSL.

rockdaboot avatar Oct 31 '22 09:10 rockdaboot

The multibyte translation errors, e.g. errno 84 (Invalid or incomplete multibyte or wide character), often happen when HTML pages have a different encoding than what the server or document says. See this as a spurious error that you can't do much about (unless you have access to the web pages / web server yourself and want to fix the issues).

rockdaboot avatar Oct 31 '22 09:10 rockdaboot

I do have access to the web server & I've verified in multiple ways that everything is UTF8. I routinely check via iconv -f utf-8 -t utf-8, LC_ALL=C.UTF-8 egrep -laxv '.*' and other methods. But if I knew exactly what page was triggering the error I could take a closer look at it specifically.
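A rough Python equivalent of the iconv round-trip check, for anyone who wants to repeat it over a whole directory tree. This is a sketch under the assumption that every file of interest is meant to be UTF-8; find_invalid_utf8 is a made-up helper name.

```python
from pathlib import Path
from typing import List


def find_invalid_utf8(root: str) -> List[Path]:
    """Return files under root whose bytes do not decode as strict UTF-8."""
    bad = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # Strict decoding raises on any invalid byte sequence.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path)
    return bad
```

Binary files such as images will usually be reported too, so the result is only meaningful when restricted to text and HTML files.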

catharsis71 avatar Oct 31 '22 15:10 catharsis71

Interestingly, I didn't see this error when testing with your site. So I can only give you some hints from remote.

Basically, run wget2 -d -o log.txt --max-threads=1 --no-http2 ... until you think the error has occurred (e.g. watch with tail -f log.txt | grep 'Failed to transcode' from a second console). No HTTP/2 and only one thread, because that makes log.txt easier to read.

You then should see in log.txt which file caused this issue. Come back with that file and the relevant part of log.txt if something is unclear, and I'll try to help.
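Searching the log by hand can be tedious, so here is a small sketch that pairs every transcode error with the last URL logged before it. It assumes the log prints an "Adding URL: ..." line ahead of each error, as the verbose output quoted later in this thread does; the exact debug-log layout may differ, and the function name is hypothetical.

```python
from typing import Iterable, Iterator, Optional, Tuple

ERROR_MARKER = "Failed to transcode"
URL_MARKER = "Adding URL: "  # assumed prefix, taken from the verbose log excerpt


def urls_before_errors(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], str]]:
    """Yield (last_seen_url, error_line) for every transcode error in the log."""
    last_url = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(URL_MARKER):
            last_url = line[len(URL_MARKER):]
        elif ERROR_MARKER in line:
            yield last_url, line


sample = [
    "Adding URL: http://tanasinn.info/wiki/⊂二二二( ^ω^)二二二⊃",
    "Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)",
]

for url, error in urls_before_errors(sample):
    # !a escapes non-ASCII characters so the demo prints on an ASCII-only console.
    print(f"{error}\n  last URL seen: {url!a}")
```

For a real run, pass an open file instead of the sample list, e.g. `open("log.txt", encoding="utf-8", errors="replace")`. The pairing is only trustworthy with --max-threads=1, because log lines from several threads would interleave.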

rockdaboot avatar Oct 31 '22 17:10 rockdaboot

Tried doing this:

wget2-master -d -o log.txt --max-threads=1 --no-http2 -m -np https://skyqueen.cc/archive/71master/cracky/kareha.pl/

However, when debug is turned on, it downloads the first page and then takes forever to write the analysis of that page into the log... in the past I let it go for hours and it was still working on logging the first file. However, with the new master build I think it's going faster than I remember it doing before (maybe not), so I'll leave it running and see if it ever actually moves past the first file.

catharsis71 avatar Oct 31 '22 17:10 catharsis71

Hm, the first file takes 800ms to analyse/parse here, then you'll see the GET for the second file ... etc.

My machine is at least 3 years old (AMD Ryzen).

rockdaboot avatar Oct 31 '22 17:10 rockdaboot

I'm downloading on Ubuntu WSL so that could be complicating things; I'll try it from pure Ubuntu later because WSL sometimes makes weird stuff happen

is it even supposed to be logging stuff like this?

31.125119.729 tr/@class=odd
31.125119.798 td/@class=indexcolicon
31.125119.870 a/@href=1301806368/
31.125119.938 img/@src=/icons/folder.gif
31.125120.010 img/@alt=[DIR]
31.125120.082 td/@class=indexcolname
31.125120.153 a/@href=1301806368/
31.125120.225 ='1301806368/'
31.125120.297 td/@class=indexcollastmod
31.125120.365 ='2022-08-24 17:16  '
31.125120.437 td/@class=indexcolsize
31.125120.507 ='  - '
31.125120.578 ='

it's been about half an hour and the log is only up to 1MB. I don't know why it's going so slow... it's SSD and disk utilization is extremely low so I don't think that's the bottleneck, also it's only using about 0.4% CPU

catharsis71 avatar Oct 31 '22 17:10 catharsis71

Uh weird. I am pretty sure this is a WSL I/O issue then. And yes, the output looks a bit weird, it's basically the tokens from HTML parsing.

rockdaboot avatar Oct 31 '22 18:10 rockdaboot

You could try to output to the console and redirect into a file. Maybe it is faster.

rockdaboot avatar Oct 31 '22 18:10 rockdaboot

if I run with -v instead of -d it runs & logs at normal speed... if it's a file I/O issue I'm not sure why it would only happen with -d and not -v

will try the console thing

catharsis71 avatar Oct 31 '22 18:10 catharsis71

Just from the verbose output, I think I might see what's going on with the "Failed to transcode" errors

I think it's seeing external links containing non-ASCII characters and then outputting errors for those links

Adding URL: http://karlomongaya.wordpress.com/2009/10/22/for-the-love-of-zizek-a-fan’s-confession/
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
URL 'http://karlomongaya.wordpress.com/2009/10/22/for-the-love-of-zizek-a-fan’s-confession/' not followed (no host-spanning requested)

Adding URL: http://tanasinn.info/wiki/⊂二二二( ^ω^)二二二⊃
Failed to transcode 'utf-8' string into 'ANSI_X3.4-1968' (84)
URL 'http://tanasinn.info/wiki/⊂二二二( ^ω^)二二二⊃' not followed (no host-spanning requested)

as I understand it ANSI_X3.4-1968 is basically ASCII... so it's trying to convert from UTF8 to ASCII for some reason, and generating an error because the URLs contain characters that can't be represented in ASCII?
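That reading can be checked outside wget2. The sketch below uses Python's codec machinery as a stand-in for the iconv conversion (an assumption: wget2's own code path is not shown in this thread) and tests whether a URL fits into a given target encoding. Python accepts ANSI_X3.4-1968 as an alias for ASCII; fits_encoding is a made-up helper name.

```python
def fits_encoding(text: str, encoding: str) -> bool:
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


LOCAL_ENCODING = "ANSI_X3.4-1968"  # the local encoding named in the error message

for url in (
    "http://tanasinn.info/wiki/⊂二二二( ^ω^)二二二⊃",
    "https://skyqueen.cc/archive/71master/cracky/kareha.pl/",
):
    # ascii() keeps the output printable even when stdout itself is ASCII-only.
    print(fits_encoding(url, LOCAL_ENCODING), ascii(url))
```

The first URL cannot be represented and the second can, which matches the pattern of the errors appearing only next to links with non-ASCII characters.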

but these are external links and as you can see, host-spanning is turned off.... so why is it even outputting errors about external links when host-spanning is turned off?

catharsis71 avatar Oct 31 '22 18:10 catharsis71

Good finding and good question :-)

We should likely optimize the order of the checks (e.g. the host-spanning check). You are right that there is no need to convert the URL here. Btw, the parsed URL is converted into a local filename, and wget2 detected 'ANSI_X3.4-1968' (the official name of ASCII) as your local encoding.

rockdaboot avatar Oct 31 '22 18:10 rockdaboot

Pushed a commit where we do some checks earlier to avoid the above situation (and error). This should also reduce memory consumption for recursive downloads (depends on the number of external links, though).
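The idea of running the cheap checks first can be illustrated with a short sketch. This is not the wget2 commit, just a hypothetical Python model of the ordering: links to other hosts are dropped before any attempt to convert them into a local filename, so the conversion can no longer fail for them. All function names are invented for the example.

```python
from urllib.parse import urlsplit


def should_follow(url: str, start_host: str, span_hosts: bool = False) -> bool:
    """Cheap filter: decide whether a discovered link is worth queueing at all."""
    host = urlsplit(url).hostname or ""
    return span_hosts or host.lower() == start_host.lower()


def to_local_filename(url: str, local_encoding: str) -> bytes:
    """Fallible step: raises UnicodeEncodeError if the path does not fit."""
    return urlsplit(url).path.encode(local_encoding)


def process_links(urls, start_host, local_encoding="ascii"):
    """Queue local filenames only for links that pass the host check."""
    queued = []
    for url in urls:
        if not should_follow(url, start_host):
            continue  # skipped before any transcoding, so no error is raised
        queued.append(to_local_filename(url, local_encoding))
    return queued
```

With this ordering, an external link containing non-ASCII characters is discarded by should_follow and never reaches the encoding step.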

rockdaboot avatar Oct 31 '22 19:10 rockdaboot

> if it's a file I/O issue I'm not sure why it would only happen with -d and not -v

-d outputs 1000x more lines (take that number with a grain of salt ;-)). Maybe there is an fsync() with every line written - I currently have no other idea.
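To test the fsync hypothesis on a given system, one could time plain buffered writes against writes that are flushed and synced after every line. This is a generic sketch, not a statement about what wget2 does internally; write_lines and the sample log line are illustrative only.

```python
import os
import tempfile
import time


def write_lines(path: str, count: int, sync_each_line: bool) -> float:
    """Write count short lines and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    with open(path, "w", encoding="utf-8") as log:
        for i in range(count):
            log.write(f"{i:06d} td/@class=indexcolname\n")
            if sync_each_line:
                log.flush()             # push Python's buffer to the OS
                os.fsync(log.fileno())  # ask the OS to commit it to storage
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    for sync in (False, True):
        elapsed = write_lines(os.path.join(tmp, "log.txt"), 200, sync)
        print(f"fsync per line={sync}: {elapsed:.4f}s for 200 lines")
```

A large gap between the two timings on WSL, and a small one on native Linux, would support the idea that per-line syncing is what slows down the debug log.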

rockdaboot avatar Oct 31 '22 19:10 rockdaboot

I started a -d 98 minutes ago and it's only logged 70676 lines (721 lines per minute, still processing the first file) and started an otherwise identical -v 73 minutes ago which has logged 304395 lines (4170 lines per minute), and would have been much higher if I were using HTTP2, probably would have finished in about 10 minutes. So something strange is happening specifically with -d

I'll try it from some other systems later to try to narrow down if it's a WSL quirk or not

catharsis71 avatar Oct 31 '22 19:10 catharsis71

Setting aside the debug thing for now and going back to the error messages specifically... I now have a much better understanding of why the error messages are happening and what they mean although I still think they could be a lot friendlier

for the "failed to transcode" errors, is there any possibility that the URL could be displayed as part of the error? otherwise it's very difficult to know what's going on without employing verbose or debug

for the "failed to read" messages, again, the URL would be helpful, but also, the error numbers don't mean much on their own -- I had no idea about using the "errno" command to look up the meaning of the error and I doubt most people do either. Is there a reason that the expanded text like "EPIPE 32 Broken pipe" couldn't be included as part of the error rather than just "errno=32"?

catharsis71 avatar Nov 01 '22 03:11 catharsis71