Series Opencast Call can fire too frequently
If [series] in the Galicaster configuration is left unconfigured (it is not configured by default), Galicaster will attempt to retrieve all series from the Opencast it is pointed at. It does this at Galicaster init and again on every long heartbeat (default 60 seconds). Normally this would be OK, but if you have 1000+ series in Opencast and many Galicaster capture agents it can become a problem, for two reasons:
- it impacts Opencast performance, because the series REST endpoint is called so frequently
- the calls from Galicaster may be seen by firewalls as malicious, as many HTTP calls are made in a single second
I would suggest a few changes to the default behaviour. One idea would be to increase the hard-coded results-per-page variable, or to make it configurable in [series] as well: https://github.com/teltek/Galicaster/blob/c066b5abd3b32ed038a633cd2a9069c37bdafb5a/galicaster/opencast/series.py#L23
Maybe also add the ability to make series polling less frequent? Say, once at initialisation and then just nightly?
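To put rough numbers on the two ideas above, here is a small back-of-envelope sketch. It assumes one HTTP request per page of results and that every agent polls independently; the page size of 100 is a hypothetical value for illustration (the actual hard-coded value is not stated in this thread), and the function names are made up for this example.

```python
import math


def requests_per_poll(total_series: int, per_page: int) -> int:
    """Paged HTTP requests one agent needs to list every series once."""
    return max(1, math.ceil(total_series / per_page))


def requests_per_day(agents: int, total_series: int,
                     per_page: int, interval_s: int) -> int:
    """Total series requests per day across all agents.

    Assumes each agent polls once every interval_s seconds and
    issues one request per page (an assumption, not measured).
    """
    polls_per_day = 86400 // interval_s
    return agents * polls_per_day * requests_per_poll(total_series, per_page)


# Hypothetical setup: 100 agents, 1000 series, 100 results per page.
every_minute = requests_per_day(100, 1000, 100, 60)     # 60 s long heartbeat
nightly = requests_per_day(100, 1000, 100, 86400)       # once per day
print(every_minute, nightly)
```

Under these assumptions, polling nightly instead of every 60 seconds cuts the request count by a factor of 1440, while a larger page size only reduces the number of requests per poll.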
I also noticed this was causing a huge load on our admin node. We do not use this data on our Galicasters at all, so an option to disable the calls completely would fix it fast for us.
Maybe it would be possible to update only very infrequently (nightly?) to cover the case where the machine is offline, but do a live query after the first few letters are typed when entering a series if the machine is online? This would dramatically reduce the load, and having something to filter on would mean fewer results when you do call the series endpoint.
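The lookup described above could be sketched roughly like this. Everything here is hypothetical: `lookup_series`, `live_search`, `cached_titles` and `min_chars` are names invented for the example, not Galicaster or Opencast APIs, and `live_search` stands in for whatever filtered call to the series endpoint would be used.

```python
from typing import Callable, Iterable, List


def lookup_series(typed: str,
                  live_search: Callable[[str], Iterable[str]],
                  cached_titles: Iterable[str],
                  online: bool,
                  min_chars: int = 3) -> List[str]:
    """Return series titles matching what the user has typed so far."""
    text = typed.strip().lower()
    if len(text) < min_chars:
        # Too few letters: do not query anything yet.
        return []
    if online:
        try:
            # Live, filtered query (hypothetical callable).
            return list(live_search(text))
        except OSError:
            # Network problem: fall through to the cached copy.
            pass
    # Offline or failed: filter the infrequently refreshed local cache.
    return [title for title in cached_titles if text in title.lower()]
```

With this shape, the server is only contacted once the user has typed at least `min_chars` letters, and the nightly cache is used purely as a fallback.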
Hmmm, I can see how this can be an issue. I guess you don't really need to change the metadata for scheduled recordings? So it may make sense to allow configuring the frequency of these calls (or make them stop altogether).
There is #547 to use the more efficient json endpoint.
The use-case for selecting a series for us is only for ad-hoc recordings, or when you want to ingest a recorded event into a different series than it was scheduled for (not very common, but sometimes helpful).
I think on startup, once an hour or once a day would probably be fine.
@Alfro yes, we have no need for the series data at all, as everything is scheduled and nobody can even get to the UI to use it.
I just stopped the series calls completely on our machines. This is the result on our admin node:

This is a 6x2GHz server, so a significant reduction in CPU usage! We have ~100 Galicasters.
I guess we really need to have a way of running jobs at arbitrary intervals rather than just on the long/short timers in order to be able to do "once an hour" etc. and/or a way of guaranteeing that a job can get run once at startup.
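One way this could look, as a minimal sketch: keep the existing fixed timer, but let each job track when it last ran and decide for itself whether it is due. `IntervalJob` and its parameters are hypothetical names for this example, not existing Galicaster code; the idea is that whatever handler is bound to the long heartbeat would simply call `tick()` each time.

```python
import time
from typing import Callable, Optional


class IntervalJob:
    """Runs func at most once per interval_s, driven by an external timer."""

    def __init__(self, func: Callable[[], None], interval_s: float,
                 run_at_start: bool = True, enabled: bool = True,
                 now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.func = func
        self.interval_s = interval_s
        self.enabled = enabled
        # None means "never run", so the job is due on the very first tick.
        self.last_run: Optional[float] = None if run_at_start else now

    def tick(self, now: Optional[float] = None) -> bool:
        """Call on every heartbeat; runs func only if the interval elapsed."""
        if not self.enabled:
            return False
        now = time.monotonic() if now is None else now
        if self.last_run is None or now - self.last_run >= self.interval_s:
            self.last_run = now
            self.func()
            return True
        return False


# Example: refresh series hourly even though the heartbeat fires every 60 s.
job = IntervalJob(lambda: print("refreshing series"), interval_s=3600)
job.tick()  # first tick runs immediately because run_at_start=True
```

Setting `enabled=False` covers the "disable the calls completely" case, and `run_at_start=True` guarantees one run on the first heartbeat after startup.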
I didn't try the code in #547, so I'm not sure what difference that makes.
Thanks for the info @ppettit! That looks like a serious improvement. I'll look into adding #547 to Galicaster and adding a configuration value to the series endpoint to disable it/change the frequency.