Feature Request: More control through a Higher Order Function.

Open smolattack opened this issue 1 year ago • 4 comments

I see that there are a lot of feature requests with various needs for filtering out unwanted videos. It would be a difficult and never-ending task to fulfil all such requests. Also, each channel/playlist has its own quirks, naming conventions and other patterns that might benefit from a bespoke filtering strategy. I have a proposal that might help alleviate this challenge.

Phase 1

This is (hopefully) a quick and dirty way to get things started. Deep breath... How about eval()? OK, I know: lots of stigma, and rightly so, for production code. But this solution is for homelabs, so it would allow for tinkering, and you can put a big disclaimer in the form.

Why?

The user gets more control without needing code changes. It also allows the community to help others by hand-writing custom scripts for those who don't know Python (like myself, though I'm willing to learn if this feature is implemented).

The idea is that you would add a choice between a video-title regex search pattern and a higher-order function that gets called with instance as a parameter. If there is a way to also serialize and somehow output instance, that would be great (e.g., an ENV debug flag that prints a serialized instance to the docker logs).

Something like this. Before: https://github.com/meeb/tubesync/blob/e3b5d63501a0a8e2fffac64ee1a48b9a69421d3e/tubesync/sync/signals.py#L129

After:

if instance.source.is_match(instance):

where is_match would ascertain whether to eval() and call the user-defined function with instance, or fall back to calling is_regex_match with the video title.
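As a hypothetical illustration of that dispatch (the filter_code field, the ENABLE_EVAL flag and the exact is_regex_match signature are all assumptions, not tubesync's actual API):

```python
# Sketch only: filter_code / filter_text are invented field names.
import os
import re


class Source:
    filter_code = ""  # user-supplied Python expression (the eval() idea)
    filter_text = ""  # the existing title regex pattern

    def is_regex_match(self, media_item_title):
        return bool(re.search(self.filter_text, media_item_title))

    def is_match(self, media):
        # Only honour user code when an explicit opt-in flag is set.
        eval_enabled = os.environ.get("ENABLE_EVAL", "") == "True"
        if self.filter_code and eval_enabled:
            # Evaluate the user-defined expression with the media item in scope.
            return bool(eval(self.filter_code, {"instance": media}))
        # Otherwise fall back to the existing title-regex behaviour.
        return self.is_regex_match(media.title)
```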

Phase 2

To transition away from eval, you could ship https://nodered.org/ as a service and give users a more visual way of providing bespoke filtering logic. Once integrated, this service could be used as an event emitter (e.g., start a workflow once a media file is downloaded, or when an error occurs, etc.) and allow even more control for the average homelabber's automation. For example, I could connect Node-RED to a Jellyfin HA integration via MQTT and automatically trigger library re-scans.
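Purely as an illustration of the event-emitter idea (the topic, broker hostname and payload shape are invented, and tubesync has no such hook today), a downloaded-media event pushed over MQTT for Node-RED to consume might look like:

```python
# Sketch: publish a "media downloaded" event for Node-RED / HA to pick up.
import json

import paho.mqtt.publish as publish


def emit_downloaded_event(title, path):
    publish.single(
        topic="tubesync/events/downloaded",  # invented topic name
        payload=json.dumps({"title": title, "path": path}),
        hostname="mqtt.local",  # assumed LAN broker address
    )
```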

I think the features outlined above would have a very good value/effort ratio and would have compounding effects within the community, such that members could more effectively help each other.

smolattack avatar Jun 24 '24 13:06 smolattack

I can see the logic behind the suggestion, but I'd be very wary about adding a freeform text field that people can put Python code into which is eval()'d in-process. By default tubesync has no authentication, so the interface would become a one-click RCE vector. If this ever existed it would be behind an ENABLE_EVAL_YES_I_REALLY_KNOW_WHAT_IM_DOING=True environment variable. If you know what you're doing you can always just use Docker volumes to spam in changes to signals anyway.

As for nodered, that might be quite complicated to bundle. Plus, I assume it would require at minimum an HTTP request to an internal API every time a "should I download this?" check needs to run, which could get pretty expensive.

I'll put integration with external nodered (or comparable) services on the wishlist. While I probably wouldn't ever bundle it directly, I'm not opposed to adding support for integrating with other services, providing the performance overhead isn't prohibitive.

meeb avatar Jun 24 '24 16:06 meeb

RE the environment variable: yes, this would be an advanced feature, so putting in extra hoops to prevent the average user from shooting themselves in the foot would be perfectly fine.

As for Node-RED: definitely a wishlist feature that will take time to implement. I'm not sure how much expensive flows and performance overheads really matter in a home environment where servers sit mostly idle. Sure, if you need to pull down a whole channel in the shortest amount of time so you don't miss your flight, faffing around with workflows wouldn't be a good idea. But just like the regex search filter, it could be opt-in. For everyone else, performance wouldn't be a huge priority, as even a slow workflow will download a video faster than you can watch it :).

smolattack avatar Jun 24 '24 20:06 smolattack

The performance overhead of using an outside decision tree for logic choices comes when things are updated. For example, if you change the required codec on a large channel source, then on saving the source every item known on that channel would need to be spammed at the external API one at a time to recheck its "downloadable" status. This could generate thousands of requests per click in the control panel.

Also, this wouldn't ever replace the current built-in configuration options (download cut-offs, formats, etc.), so it could get very confusing as to why something is or is not downloading.

If you updated your external nodered logic, you'd also need some way to force a re-evaluation of every item against the updated nodered flow, which would result in, at best, one API call per media item known to tubesync.

I can see why offloading more advanced logic to already-developed decision-logic software would be desirable, but integrating it in a way that wouldn't cause mass confusion would be pretty complicated.

meeb avatar Jun 26 '24 09:06 meeb

Offloading decision-making to an external (low-code or otherwise) rules engine might necessitate a completely different user flow. Maybe a special type of source that cannot be edited, only deleted. It's a pie-in-the-sky idea for sure.

Another possible way of giving control to the end user would be to adopt a SyncThing-style .ignore file capability. This is how it could work:

  1. Add an option for index-only sources (or a pause tick box after indexing completes).
  2. Add functionality to export the index as CSV/JSON.
  3. Add a text input for explicitly allowed titles (obtained from the exported report).
  4. Add a filter/disallowed text box to either block specific files, or exclude all files unless explicitly allowed (in step 3).
  5. Add functionality to resume the job.
  6. The user can check skipped files and see the reason (e.g., skipped "due to exclusion list rule").

OK, that's not exactly how SyncThing does it, but that's because their syntax forces you to put a bang in front of allowed file names, so you can't just copy-paste a list of titles/names in one step.
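For illustration, the allow/deny evaluation in steps 3-4 might boil down to something like this (function and parameter names are invented):

```python
# Toy sketch of the allow/deny text-box logic described above.
def should_download(title, allowed, disallowed, exclusive=False):
    """Decide whether a title passes the allow/deny lists."""
    if title in disallowed:
        return False             # explicitly blocked (step 4)
    if exclusive:
        return title in allowed  # "exclude all unless explicitly allowed" mode
    return True                  # default: anything not blocked is kept
```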

I'll log an issue that I'm having later that would demonstrate where some of these ideas are coming from.

smolattack avatar Jun 26 '24 10:06 smolattack

@smolattack can you explain some of the ways you want to filter things? I'm working on solving https://github.com/meeb/tubesync/issues/266 currently, and I feel like with regex matching on titles, length limits, date limits and keyword exclusion, you start to cover all the bases. Having a list of the ways you want to filter that still only require knowledge of the current media item will help guide the work I'm doing in #266.

I think the duplicates-in-playlists issue (#510) will always be hard to solve, because it requires knowledge of other videos (and whether those videos have downloaded).

I think it would also be trivial to have a "hook" style where you could include a custom file (as @meeb mentioned, via docker you can easily override or add files), and a function in that custom file would be called if present. This would allow custom Python code, which you can do whatever you want with. And since it's only possible to override that via docker, things stay secure.

timwhite avatar Jul 11 '24 08:07 timwhite

I'm not sure how date limits and length limits would work in regex, but those are definitely helpful. Are we able to filter on description as well? What kind of shenanigans we're trying to filter out really varies on a channel-by-channel basis.

Regarding the duplicates: yes, we'd need a list of already-downloaded files and their metadata. This would work with a higher-order function, where you could just pass another parameter, but I agree that it wouldn't work with regex. Alternatively, we could expose an API that lets us query the indexed media, and an external process could then send instructions on what should and shouldn't be downloaded.

Regarding the custom-function docker config: I thought we could just add a flag to enable injected-code support, which would enable a textbox for each channel/playlist. I'm not sure there's any benefit to editing docker files every time you want to update a rule.

smolattack avatar Jul 11 '24 11:07 smolattack

Just to mention again, any evaluation of downloadability based on extended metadata could well be extremely slow. It's possible, but adding a filter on description, for example, would require every single media item's metadata to be expanded from JSON and checked. Length is possible, but it's probably best to extract the runtime as seconds and store it as an indexed integer field by itself so comparisons are fast.

Obviously I'm not going to consider any PRs that tank performance to solve a (likely) relatively seldom-used feature. I'm happy to review new features that don't adversely impact existing installations.

Realistically, evaluating every media item and its metadata against a remote API on every source update is likely to be a non-starter as an easy-to-use feature. Some people have over a hundred thousand media items in their library and already rely on CLI commands to manage their tubesync instances. Issuing 100k+ remote API requests on updating a few source defaults isn't going to be sensible. This feature is possible, but it's very much going to be a niche, opt-in toggle if it gets added.

meeb avatar Jul 11 '24 11:07 meeb

> Just to mention again, any evaluation of downloadability based on extended metadata could well be extremely slow. It's possible, but adding a filter on description, for example, would require every single media item's metadata to be expanded from JSON and checked. Length is possible, but it's probably best to extract the runtime as seconds and store it as an indexed integer field by itself so comparisons are fast.

Thanks for the heads up. I hadn't actually gotten as far as checking where seconds are stored. I already have a slow enough Tubesync instance, so I'll certainly have a look at the extra time required to fetch this from metadata vs storing it in a new field. If I do store it in a new field, I may have it fall back to metadata if the duration was never stored, and make sure all new items store the duration at metadata-download time.

As for the higher-order function: I think the only way it should be implemented is with a "dummy" function in its own file that users can then override via docker (or, if I work out how, a dummy function that's only called if the file is present). This way the user has to have some understanding of what they are doing before they slow down their system, and it's not exposed via the web interface for someone to exploit.
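A minimal sketch of that pattern, assuming users bind-mount a file such as /config/custom_filter.py into the container (the path and function name are invented):

```python
# Only call the user's hook when the file is actually present.
import importlib.util
import os


def load_custom_filter(path="/config/custom_filter.py"):
    """Return the user's filter_media() function if the file exists, else None."""
    if not os.path.exists(path):
        return None
    spec = importlib.util.spec_from_file_location("custom_filter", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, "filter_media", None)


def passes_user_filter(media_item):
    custom_filter = load_custom_filter()
    if custom_filter is None:
        return True  # no hook file mounted: keep the default behaviour
    return bool(custom_filter(media_item))
```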

timwhite avatar Jul 11 '24 11:07 timwhite

@meeb not sure the best place to continue the discussion, so here it is.

My testing shows that fetching extended metadata takes significant time: when no title filtering is active (the title being a metadata attribute), fetching the duration metadata can be 70% of the total filtering time, approximately 0.003-0.005s per metadata field we access. So for large libraries it's certainly worth storing the metadata we filter on in its own field in the database.

My thinking is to store the duration in its own database field at the point we download metadata for an item. This does mean old items won't have it, and we'll have to fall back to the metadata field. Running a migration to populate all those fields isn't going to be quick, so it's best to leave it with the fallback method.

This then raises the question of the regex filter on the title. Currently it fetches the title from metadata; do we also want to move the title to its own field in the database? Or is this duplication of data not worth it, given the size (100 chars) of the text field that would be needed? Again, I think this field would need to be populated when we first fetch the metadata, with a fallback for items that predate this branch.

Happy for me to add a title column as well, to avoid needing to fetch the metadata? If done correctly, this will improve performance in a few other places that reference the title too.

timwhite avatar Jul 12 '24 03:07 timwhite

Architecturally, adding a positive int column for duration in seconds and an indexed text field for the title would make the most sense. Having them updated on the metadata-save signal should mean they're always in sync (or 0 / NULL when no metadata has been downloaded). The only real argument against this is duplication; however, with the metadata stored as a JSON blob the table gets massive anyway. It's also worth keeping an eye on the model so the number of columns doesn't get out of hand. In this case, though, the benefits likely outweigh the duplication mess.
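A rough sketch of those two denormalised columns (field names are guesses, not tubesync's actual schema):

```python
# Django model sketch: duration and title duplicated out of the JSON blob.
import json

from django.db import models


class Media(models.Model):
    metadata = models.TextField(blank=True)  # the existing JSON metadata blob

    # Written on the metadata-save signal; NULL/blank until metadata exists.
    duration = models.PositiveIntegerField(null=True, blank=True, db_index=True,
                                           help_text="Runtime in seconds")
    title = models.CharField(max_length=100, blank=True, db_index=True)

    def get_duration(self):
        # Fall back to the JSON blob for items that predate the new column.
        if self.duration is not None:
            return self.duration
        return json.loads(self.metadata or "{}").get("duration")
```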

meeb avatar Jul 12 '24 05:07 meeb