Fails on large channels

Open trifle opened this issue 2 years ago • 7 comments

Hi, thanks for this very nice project. It's really polished and takes a lot of complexity out of yt-dlp, which is great.

I tried running yark on a couple of large-ish channels (tens of thousands of videos), and it seems to hit some issues that yt-dlp also exhibits (if I recall correctly): the initial metadata download takes several hours and requires just short of 10 GB of RAM, and the subsequent downloads then fail after only a handful of videos.

I haven't looked into the details, but this might be due to some download tokens expiring, or perhaps it's just insufficient retries or so. In any case, it would benefit yark enormously to keep some sort of record regarding the videos that were already downloaded and to then continue archival in chunks, instead of trying to do all in one. yt-dlp has some of this functionality with --download-archive, but that doesn't have any "comfort features", i.e. no checking, pruning, displaying, or automatic management of that resume file.
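
To make the "record of already-downloaded videos" idea concrete, here is a minimal sketch, not taken from Yark or yt-dlp. The `DownloadRecord` class, the one-ID-per-line file format, and the `download` callable are all hypothetical names chosen for illustration; a real version would also need the checking and pruning mentioned above.

```python
from pathlib import Path


class DownloadRecord:
    """Plain-text list of finished video IDs, one per line (hypothetical format)."""

    def __init__(self, path):
        self.path = Path(path)
        self.done = set()
        if self.path.exists():
            lines = self.path.read_text().splitlines()
            self.done = {line.strip() for line in lines if line.strip()}

    def pending(self, video_ids):
        """Return the IDs that still need downloading, in their original order."""
        return [vid for vid in video_ids if vid not in self.done]

    def mark_done(self, video_id):
        """Append the ID right away so an aborted run loses at most one video."""
        if video_id not in self.done:
            with self.path.open("a") as fh:
                fh.write(video_id + "\n")
            self.done.add(video_id)


def resume_archive(video_ids, record, download):
    """Download everything not yet recorded; `download` is supplied by the caller."""
    for video_id in record.pending(video_ids):
        download(video_id)  # an exception stops the run; finished IDs stay recorded
        record.mark_done(video_id)
```

Re-running `resume_archive` with the same record file after a failure would skip whatever was already finished and continue from the first missing video.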

trifle avatar Jan 09 '23 07:01 trifle

I've been looking for a way to make the metadata step smaller because it includes a lot of extra information which Yark's archive format doesn't use; I'll look into --download-archive. At least there's --skip-metadata, so you can do downloads in chunks if you get the metadata now :)

Can you send the error from the failed download? That might be a separate bug.

Owez avatar Jan 09 '23 09:01 Owez

Thanks, that sounds great! There was a yt-dlp update today that might have helped, as I'm not seeing anything since tonight. The last instance was:

  • Downloading jVhTmcfQgx4, at 0.7%..
  • Unknown error whilst downloading videos, details below:
[download] Got error: Downloaded 70656 bytes, expected 10820198 bytes, retrying in a few seconds..
  • Unknown error whilst downloading videos, details below:
ERROR: Did not get any data blocks, retrying in a few seconds..
  • Unknown error whilst downloading videos, details below:
ERROR: Did not get any data blocks, retrying in a few seconds..
  • Unknown error whilst downloading videos, details below:
ERROR: Did not get any data blocks, retrying in a few seconds..
  • Unknown error whilst downloading videos, details below:
ERROR: Did not get any data blocks
  • Sorry, failed to download {name}
Please file a bug report if you think this is a problem with Yark!

PS: {name} probably needs an f in front of the string to make it an f-string :)
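
A small illustration of why the literal {name} shows up in the output. The message text is copied from the log above; the variable `name` and its value are placeholders here, not Yark's actual code.

```python
name = "jVhTmcfQgx4"  # placeholder value for illustration

# Without the f prefix the braces are ordinary characters and are printed as-is.
plain = "Sorry, failed to download {name}"

# With the f prefix Python substitutes the variable into the string.
formatted = f"Sorry, failed to download {name}"

print(plain)
print(formatted)
```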

trifle avatar Jan 09 '23 12:01 trifle

Yup, this is really not worth a PR, so here's the line:

https://github.com/Owez/yark/blob/d616ae994092d01a8588c067efe27ea9f6cc843c/yark/channel.py#L637

Needs an f.

trifle avatar Jan 09 '23 12:01 trifle

Are you on yark v1.2.3? This should be fixed as of last night

Whoops yep, will add

Owez avatar Jan 09 '23 12:01 Owez

Yes I updated yesterday but thought the error persisted - sorry if that was wrong! Regardless, some sort of chunked metadata + download stage would definitely be a nice addition to reduce memory consumption and make everything smoother.

BTW, I guess YouTube doesn't like parallel downloads. I don't know your stance on lots of external dependencies, but my experience with the fasteners library was quite positive. It might be worth adding a file lock like

lock = fasteners.InterProcessLock('yark.lockfile')

before performing yt-dlp operations to prevent multiple instances.
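
For anyone who wants to try the idea without a new dependency, here is a rough standard-library stand-in for the fasteners lock above. It uses `fcntl.flock`, so it is POSIX-only, and the `single_instance` helper is a hypothetical name; it only sketches the single-instance guard and says nothing about how Yark would wire it in.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def single_instance(path="yark.lockfile"):
    """Hold an exclusive, non-blocking lock on `path` for the duration of the block."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise RuntimeError(f"another instance already holds {path}") from None
        yield
    finally:
        os.close(fd)  # closing the descriptor also releases the lock
```

Wrapping the download step in `with single_instance():` would make a second instance fail fast with a clear error rather than compete for the same downloads.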

trifle avatar Jan 09 '23 12:01 trifle

Yep, definitely. I've purposefully let yt-dlp download using default values so far to reduce complexity in these early versions, but chunking + parallelism (if YouTube can do it) is needed.

I don't mind having extra dependencies as long as they're worth it compared to downloading and the vuln risk. When downloads are being processed, Yark generates a full list of the videos to download and pipes it into yt-dlp, so hopefully it'll be easy to parallelise using yt-dlp's options or otherwise.

Downloading videos is safe to stop at any time, so I think metadata is the main concern when tackling this issue because it's all-or-nothing and has that issue with RAM.
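
As a rough sketch of what chunking that list could look like (not Yark's implementation): the list of IDs is split into fixed-size chunks and each chunk is handed to a caller-supplied `download_chunk` function, with a small worker count since YouTube may not tolerate much parallelism. All names and default values here are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def download_in_chunks(video_ids, download_chunk, chunk_size=25, workers=2):
    """Run `download_chunk` on each chunk; results come back in chunk order."""
    chunks = list(chunked(video_ids, chunk_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(download_chunk, chunks))
```

Setting `workers=1` keeps the chunking but drops the parallelism, which may be the safer default if parallel requests get throttled.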
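
One possible direction for the RAM side, sketched under the assumption that per-video metadata can be consumed as a lazy iterable of dicts (whether yt-dlp can be driven that way here is not established in this thread). Each entry is trimmed to a few fields and written out immediately as one JSON line, so only one entry is held in memory at a time. The field list in `KEEP` is illustrative, not Yark's actual archive format.

```python
import json

# Illustrative subset of fields to keep; the real archive format may differ.
KEEP = ("id", "title", "upload_date")


def slim(entry):
    """Drop everything except the fields listed in KEEP."""
    return {key: entry.get(key) for key in KEEP}


def write_metadata(entries, path):
    """Stream entries to a JSON Lines file and return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for entry in entries:
            fh.write(json.dumps(slim(entry)) + "\n")
            count += 1
    return count
```

Because entries are written as they arrive, an interrupted run would still leave a usable partial file instead of losing everything.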

Owez avatar Jan 09 '23 12:01 Owez

If I have time, this should hopefully be in v1.3 in a month's time :)

Owez avatar Jan 09 '23 12:01 Owez