[Feedback/Request] Scraping is way too slow: 25K items need more than 3 days just to scrape.

Open 0n1cOn3 opened this issue 1 year ago • 15 comments

Hey :)

Scraping takes an extremely long time. I have a sync running here with over 25k items to process, but after 3 days only the metadata has been loaded into the DB. This is simply too slow and not usable. I have set this up on an HPE ProLiant server, as I initially thought this was a problem with my NAS, where I first ran this project.

But apparently this is a general problem that needs to be addressed. I have increased the number of workers to 4, yet the speed remains the same. I understand the limitation for downloading, but not for scraping the information and cover images.

A clarification or idea for improvement would be cool :)

0n1cOn3 avatar Nov 14 '24 16:11 0n1cOn3

  • Would it be possible to display an estimate of how long the indexing would take?
  • The same with the download?

Of course, this would not have to be an exact figure; a rough estimate based on real-world data (surveys, personal experience as a reference, etc.) would do.

0n1cOn3 avatar Nov 14 '24 19:11 0n1cOn3

This is actually entirely expected and nothing really to do with tubesync. I would strongly suggest you drop the worker count back to 1. YouTube will aggressively rate limit and throttle your IP if you attempt to run it any faster. Yes, if you add a channel or channels with 25k items it'll take some time, potentially a couple of days, to get up to date and index the metadata. Once the initial sync is done, though, it will only sync new content, which is fast, so it's a one-off issue. You should have no issues with large channels once the initial sync and index are done.

There is no way to improve this performance; the throttling is at YouTube's end. From anecdotal reports, the throttling also seems to be triggered by scraping metadata too quickly. If you have any magical solutions feel free to suggest them.

Potentially there could be an estimate for task completion, but it would be quite complicated to implement and so rough as to likely not be that helpful.
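
To illustrate why it would be rough: a naive estimate would just extrapolate from the average rate so far, which throttling can invalidate at any moment. A minimal Python sketch, purely illustrative and not tubesync code:

from datetime import timedelta

def estimate_remaining(items_done: int, items_total: int, elapsed: timedelta) -> timedelta:
    """Naive ETA: assume the remaining items proceed at the average rate so far.
    This ignores throttling, which is exactly why it would be very rough in practice."""
    if items_done == 0:
        raise ValueError("cannot estimate before any items have completed")
    rate = elapsed / items_done          # average time per item
    return rate * (items_total - items_done)

# Example: 5,000 of 25,000 items indexed in 12 hours -> roughly 48 hours remaining.
print(estimate_remaining(5_000, 25_000, timedelta(hours=12)))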

meeb avatar Nov 15 '24 04:11 meeb

I would be more willing to wait for complete metadata as long as downloads take priority. I'd rather see downloads start for the most recent videos than have all downloads wait for the complete metadata to be stored in the database.

Perhaps an easy fix is to index until it finds a video that it can download, then reschedule the index scan after that download has finished. The next round would stop after two downloads are found, then four, etc.
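
Roughly what I have in mind, as a rough Python sketch (index_next_batch and download are hypothetical placeholders, not tubesync functions):

def sync_with_escalating_batches(source, index_next_batch, download):
    """Alternate indexing and downloading, doubling the batch size each round.
    index_next_batch(source, limit) is assumed to return up to `limit` newly
    discovered downloadable items; download(item) fetches one. Both are
    stand-ins for whatever tubesync actually does."""
    batch = 1
    while True:
        items = index_next_batch(source, limit=batch)
        if not items:
            break                 # nothing new left to index
        for item in items:
            download(item)        # downloads get priority over further indexing
        batch *= 2                # 1, 2, 4, 8, ... items per round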

Is there a compelling reason the metadata has to take priority over downloads that are ready?

tcely avatar Nov 22 '24 16:11 tcely

The metadata is required to determine whether a media item should be, and can be, downloaded, so I just scheduled it first; that's about it. Significantly fancier methods of scheduling are of course possible. There's no guarantee that the metadata is going to be indexed in a logical time order, so this may not function as you expect even if implemented. Also, with the current tasks system this would probably be a bit rough, relying on some arbitrary escalating priority number to schedule tasks in a specific order rather than group priorities.

The main reason, though, is that no one has previously asked for it or looked into implementing it, I would suspect.

meeb avatar Nov 22 '24 16:11 meeb

Also, with the current tasks system this would probably be a bit rough, relying on some arbitrary escalating priority number to schedule tasks in a specific order rather than group priorities.

I was thinking of something a lot simpler. How about changing the indexing task to stop itself after a certain number of successful additions?

tcely avatar Nov 22 '24 16:11 tcely

The indexing task just asks yt-dlp to return a massive dict of all media on a channel or playlist; the control isn't that fine-grained. If you have a playlist as a source rather than a channel, only indexing part of the playlist would be very confusing. Channels are generally indexable by time with newest first, but it's not guaranteed.
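
For context, the index is essentially one flat extraction call to yt-dlp, something like this simplified sketch (not tubesync's exact code; the channel URL is just an example):

import yt_dlp

# extract_flat avoids fetching full metadata per video, but the result is
# still one large dict covering the whole channel or playlist.
opts = {"extract_flat": True, "skip_download": True}
with yt_dlp.YoutubeDL(opts) as ydl:
    info = ydl.extract_info("https://www.youtube.com/@SomeChannel/videos", download=False)

entries = list(info.get("entries") or [])
print(len(entries), "media items listed in one call")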

meeb avatar Nov 22 '24 16:11 meeb

I just want to provide some feedback after some changes I made to the source code.

  • Workers are set to 8, not 1.
  • Parallel downloads increased to up to 3 at once, plus a few other things I can't recall off the top of my head.

For almost two weeks now: no throttling, nothing! I do have a YouTube Premium account, but no cookies are supplied. Yes, it may throttle me at any time, but it got a lot faster than with only one worker. Within 4 days it worked through over 15k tasks and started downloading content on day 5.

Edit: I also switched from SQLite to PostgreSQL. It seems faster too with this much content.

0n1cOn3 avatar Dec 07 '24 16:12 0n1cOn3

Thanks for the update. Workers were originally set to 4 or 8 or somewhere around there; however, many people (including me) experienced issues that were difficult to resolve (like having to change public IPs). 8 may work for you, just make sure you're using a VPN or some other public endpoint you can trivially cycle, because you'll probably get throttled at some point. You'll likely notice downloads still work, they just become extremely slow.

And yes, with a large media library, using Postgres will provide a significant boost.

meeb avatar Dec 09 '24 05:12 meeb

A few minor cleanups to the metadata JSON will allow savings of about 90% and make db.sqlite3 much nicer to work with. I will probably create a trigger that uses json_remove during INSERT at some point, rather than the cleanup.sql file I'm running periodically now.

I started with this one UPDATE, for anyone who wants to try it:

UPDATE OR ROLLBACK "sync_media" SET metadata = json_remove(metadata, '$."automatic_captions"');

Ideally, the application code would parse the JSON and create a new structure that keeps only the information that it will use later.
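
As a sketch of that application-side approach (the kept keys and column lookup below are just examples of what one might retain, not an authoritative list or tubesync's actual schema handling):

import json
import sqlite3

# Which keys are worth keeping is an assumption here; bulky fields like
# "automatic_captions" are dropped simply by not being listed.
KEEP = {"id", "title", "upload_date", "duration", "uploader", "thumbnail"}

def slim(metadata_json: str) -> str:
    data = json.loads(metadata_json)
    return json.dumps({k: v for k, v in data.items() if k in KEEP})

con = sqlite3.connect("db.sqlite3")
for rowid, metadata in con.execute('SELECT rowid, metadata FROM "sync_media"').fetchall():
    if metadata:
        con.execute('UPDATE "sync_media" SET metadata = ? WHERE rowid = ?', (slim(metadata), rowid))
con.commit()
con.close()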

Switching to PostgreSQL will help with any concurrency issues you may have encountered, even more than SQLite's write-ahead log (WAL) mode would.
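
For anyone staying on SQLite, enabling WAL is a one-line pragma; a quick sketch (I haven't checked whether tubesync already sets this):

import sqlite3

con = sqlite3.connect("db.sqlite3")
# Write-ahead logging lets readers proceed while a writer is active, which
# helps with worker/webserver contention, but it is still no substitute for
# a client/server database like PostgreSQL.
print(con.execute("PRAGMA journal_mode=WAL").fetchone())  # -> ('wal',)
con.close()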

tcely avatar Dec 09 '24 16:12 tcely

Thanks for the update. Workers were originally set to 4 or 8 or somewhere around there; however, many people (including me) experienced issues that were difficult to resolve (like having to change public IPs). 8 may work for you, just make sure you're using a VPN or some other public endpoint you can trivially cycle, because you'll probably get throttled at some point. You'll likely notice downloads still work, they just become extremely slow.

And yes, with a large media library, using Postgres will provide a significant boost.

Interesting!

I can't change my IP since I pay for a static IP address.

And I don't see any throttling in the logs yet, but I'll keep you updated if that changes.

The workers are set to 4 in the source code :-)

Edit:

There's only one issue left:

For a channel with 6k videos, a 7-day rescan interval is not enough. It stopped downloading at approximately 25% complete and restarted indexing the channel. That's a bummer. Could it be arranged so that it only restarts indexing once all tasks for that particular channel have finished?

0n1cOn3 avatar Dec 09 '24 18:12 0n1cOn3

Indexing the channel isn't that time-consuming, or shouldn't be; even with 6k videos it shouldn't take that long. All the "media indexing" really does is list all the media item IDs (YouTube video IDs in this case) and check whether there are any new, previously unseen IDs. This won't take more than a few minutes, maybe up to an hour for the largest of channels. Unless you're being throttled, that is; one of the symptoms of being throttled is that indexing and metadata collection become extremely slow.
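
Conceptually the check is just a set difference against the IDs already in the database; a tiny sketch with made-up example values:

# `entries` as returned by a flat yt-dlp listing, `known_ids` loaded from the DB.
known_ids = {"dQw4w9WgXcQ", "9bZkp7q19f0"}           # example values only
entries = [{"id": "dQw4w9WgXcQ"}, {"id": "abc123xyz_-"}]

new_ids = {e["id"] for e in entries} - known_ids
print(new_ids)  # only these need metadata/download tasks scheduled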

meeb avatar Dec 10 '24 05:12 meeb

Indexing speed was increased in #824 recently.

tcely avatar Mar 15 '25 00:03 tcely

Indexing speed was increased in #824 recently.

Awesome! Is it already included in the latest Docker container, or do I have:

  • to wait till @meeb has updated the container, or
  • to compile it from the source code?

:D

0n1cOn3 avatar Mar 18 '25 03:03 0n1cOn3

This is already in :latest, just pulling an updated container should be sufficient.

meeb avatar Mar 18 '25 03:03 meeb

This is already in :latest, just pulling an updated container should be sufficient.

Aight! I pulled it 20 mins ago 🫶🏻

0n1cOn3 avatar Mar 18 '25 03:03 0n1cOn3