StreaMonitor

Script/FFMPEG does not close task

Open ThEnGI opened this issue 3 years ago • 51 comments

I noticed that FFMPEG doesn't close its tasks even after many days. Also, when I stop the script (Ctrl-C or quit), it remains pending as if waiting for all streams (ffmpeg tasks) to finish, forcing me to kill everything. After this some files remain unreadable, for example the 12-day-old stream (which I assume is finished). Is this an FFMPEG problem or a script problem? Should we add "-t 01:00:00" to limit each file to 1 hour, or something to force the connection down and wait for the bot to restart the download? (I don't mind losing 1 minute of stream.) How do I recover the "damaged" files? Do I have to re-encode?

I should add that there have been some internet connection drops, roughly one every 2-3 days, but it's hard to tell whether they are to blame.

Thanks in advance ThEnGi

ThEnGI avatar Dec 11 '22 15:12 ThEnGI

That's indirectly a script issue (see #42).

ffmpeg can hang when network issues or similar problems occur. Currently nothing is implemented that communicates with ffmpeg and checks whether it has crashed or is stuck. Only if it is merely stuck might it be possible to bring it down cleanly.

You can try untrunc. Give it a working video of the same model as the reference, plus the corrupted video. With luck it restores the file.

DerBunteBall avatar Dec 11 '22 16:12 DerBunteBall

OK, fast and clear as always. Untrunc works well. Not the best solution, but it works. If the FFMPEG task is active, the bot doesn't open another one, right? So any streams in between are lost.

As I said before, is there any way to force-close FFMPEG after x seconds (as an ffmpeg option)? In theory, at that point the bot no longer sees a recording in progress and starts a new one.

I would like to avoid a script that restarts the server every time it loses the connection :-)

ThEnGI avatar Dec 11 '22 17:12 ThEnGI

I don't know what's planned here for the future.

The bot will not recognize that ffmpeg is stuck or hanging.

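def execute():
    # `cmd` and `self` are taken from the enclosing scope.
    try:
        # stdout and stderr are discarded, so the script never sees ffmpeg's output.
        process = subprocess.Popen(args=cmd, stdin=subprocess.PIPE, stderr=subprocess.DEVNULL, stdout=subprocess.DEVNULL)
    except OSError as e:
        if e.errno == errno.ENOENT:
            self.logger.error('FFMpeg executable not found!')
            return
        else:
            raise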

As you can see, the process is spawned, but its stdout and stderr are redirected to /dev/null.

The whole thing would need to be refactored so that the output of the process is parsed. As far as I know, ffmpeg can also open things like UNIX sockets for communication.

However, the code needs a mechanism to monitor the subprocess. It would be possible to implement something like a stop after 1 hour, but I don't think that would help much. As described in the other issue, stable and clean connectivity is really needed. I think one would have to define the conditions under which ffmpeg is considered stuck or hanging, then try to stop it gracefully and, if that fails, kill it. You will never fully avoid the need to untrunc as long as you don't download as MPEG-TS. MPEG-TS is always playable, no matter how ffmpeg is killed. An architecture with a Downloader and a Post Processor component would be better, I think.
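One possible definition of "stuck" is that the output file has stopped growing. The sketch below illustrates that idea only; the class name `StallDetector`, the 120-second default and the injectable clock are assumptions made for illustration and are not part of StreaMonitor.

```python
import os
import time


class StallDetector:
    """Flags a recording as stalled when its output file stops growing."""

    def __init__(self, path, timeout=120.0, clock=time.monotonic):
        self.path = path
        self.timeout = timeout      # seconds without growth before we call it stalled
        self.clock = clock          # injectable so the logic can be tested without waiting
        self.last_size = -1
        self.last_change = clock()

    def is_stalled(self):
        try:
            size = os.path.getsize(self.path)
        except OSError:
            size = -1               # a missing file counts as "no progress"
        now = self.clock()
        if size != self.last_size:
            # The file changed since the last check: remember when that happened.
            self.last_size = size
            self.last_change = now
            return False
        return now - self.last_change >= self.timeout
```

A monitoring loop could call `is_stalled()` every few seconds and only then attempt a graceful stop followed by a kill.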

DerBunteBall avatar Dec 11 '22 17:12 DerBunteBall

So, if I understand correctly: at each disconnection I "lose" all the streams, FFMPEG does not close, and I have to kill it (ending up with truncated videos). Suppose I write a script that runs "killall ffmpeg" whenever it detects that the connection is down. The streams started before the disconnection are already "lost", but this way I don't lose the ones after reconnection. Is that right, or do I have to restart the script as well? It's not the best solution, but it limits the damage (and it's the only one within my abilities).

ThEnGI avatar Dec 11 '22 18:12 ThEnGI

The bot wouldn't recognize that the ffmpeg process was killed, and so doesn't check the state any further. You could test this in a separate installation: simply kill ffmpeg and check whether the status of the job changes (I don't think it will).

I think the big issue here is that ffmpeg gets into problematic states when, e.g., the DSL line reconnects or the stream closes uncleanly. ffmpeg is really sensitive when ripping a stream.

Switching to MPEG-TS wouldn't change this. The code would need to be refactored so that status checking, downloading and post-processing are separate tasks. Then you could do things like: I see that model x is online, so I start a download job; when I recognize that ffmpeg is stuck or hanging, I bring it down (while in parallel continuing to check the model state and whether the download job still runs); and finally I do post-processing to make sure there is a working file. So the code needs to be more modular and cooperative.

I think you would need to bring down the whole Downloader, search for all ffmpeg processes, bring the hanging ones down and restart. The only other way would be an external box with a good IP connection, so in fact a question of money. If the problem still occurs there, it would need further investigation.

DerBunteBall avatar Dec 11 '22 18:12 DerBunteBall

Quoting DerBunteBall: "I think you would need to bring down the whole Downloader, search for all ffmpeg processes, bring the hanging ones down and restart. The only other way would be an external box with a good IP connection, so in fact a question of money."

That's what I'll do: Downloader.py stop -> kill all ffmpeg -> kill Downloader -> restart Downloader. Since the connection drops "just" once a week, it's not a big deal.

If I had to spend money I would go for a better script, since my internet connection can't be improved. Or simply pay a service that records the cams and makes them available for streaming.

I'm in a rich country with a third world connection.

Thanks again

ThEnGI avatar Dec 11 '22 19:12 ThEnGI

I would change to MPEG-TS.

Then you can stop the Downloader and kill all ffmpeg processes. Afterwards you can post-process by converting the files from MPEG-TS to MP4 and fixing corrupted parts if necessary. This should be possible with a bit of Python magic. MPEG-TS doesn't corrupt like MP4, so you can stop, move the files and restart, and do the conversion after that.
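The conversion step could look roughly like the following sketch. It only builds the ffmpeg command for a stream-copy remux (no re-encoding); the function name `remux_command` and the choice to write the MP4 next to the source file are assumptions for illustration.

```python
from pathlib import Path


def remux_command(ts_path, ffmpeg="ffmpeg"):
    """Build an ffmpeg command that rewraps an MPEG-TS file as MP4 without re-encoding."""
    src = Path(ts_path)
    dst = src.with_suffix(".mp4")   # same folder and name, new container
    return [ffmpeg, "-i", str(src), "-c", "copy", str(dst)]
```

The resulting list can be passed to `subprocess.run(..., check=True)` once the recordings have been moved out of the download folder.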

I know about the connection issues in some parts of the world. You should be able to find something with a Xeon CPU, 32 GB RAM and 8 TB of disk space on a 100 Mbit sync line with a static IP for around 40 €. It's a question of money and of how much you want to pay for recording women in front of a camera.

DerBunteBall avatar Dec 11 '22 19:12 DerBunteBall

Forgot: Wireless LAN and Powerline are also crap for something like StreaMonitor. Everything needs to be wired well.

DerBunteBall avatar Dec 11 '22 19:12 DerBunteBall

Wi-Fi is only for smartphones. The server and PCs are on Gigabit LAN. The problem is ISP disconnections: they happen rarely, but they do happen. There is a stream recording service for about $10, so I don't need a $40 server, also because I wouldn't know what else to do with it. The goal is not to pay.

ThEnGI avatar Dec 11 '22 19:12 ThEnGI

Then you should switch to MPEG-TS, take the smallest possible resolutions and build a good wrapper around everything.

Then this should be good. I would use a separate UNIX box. Windows really isn't the platform for StreaMonitor.

Edit: I see you have Debian. That's fine.

DerBunteBall avatar Dec 11 '22 20:12 DerBunteBall

You treat me like a person who knows what you're talking about, but I know very little about Python (C much better). Now I just have to choose whether to do the procedure by hand or via a script (open/close is the best I know how to program XD). Which do you think is better: first "Downloader.py stop" and then kill FFMPEG, or vice versa? In the first case it should stop the streams still in progress, right?

ThEnGI avatar Dec 12 '22 16:12 ThEnGI

I think you should do the following:

  1. You should have a setup in which Controller.py works reliably. Because you can't easily inject anything into the Downloader.py CLI directly, you should use Controller.py for a bulk stop.
  2. First, trigger a bulk stop with Controller.py stop * (that should be the syntax).
  3. Then kill Downloader.py (check in the code whether the threads are daemon threads). If they are not daemon threads, all children should go away when Downloader.py is killed. If they are daemon threads, they run detached and the children should stay. Keep in mind: the ffmpeg processes are spawned from threads, and a thread is part of the running Downloader.py process.
  4. If the children stay, kill the ffmpeg processes: search for ffmpeg processes and kill them, or just fire a killall. Keep in mind that two steps may be needed: SIGTERM, then SIGKILL.
  5. Then move the files away, restart Downloader.py and take care of untruncing/conversion. There are nice Python libraries for controlling ffmpeg.
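The two-step kill from step 4 can be sketched as follows. This is a POSIX-only illustration, not part of StreaMonitor: the function names `pid_alive`, `send_signal` and `stop_pids` are made up here, and the PID list is assumed to come from somewhere else (for example from the output of `pgrep -x ffmpeg`).

```python
import os
import signal
import time


def pid_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)             # signal 0 only checks existence and permission
    except ProcessLookupError:
        return False
    except PermissionError:
        return True                 # it exists, but belongs to another user
    return True


def send_signal(pid, sig):
    """Send a signal, ignoring processes that vanished in the meantime."""
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        pass


def stop_pids(pids, grace=10.0, poll=0.5):
    """SIGTERM every PID, wait up to `grace` seconds, then SIGKILL the survivors.

    Returns the list of PIDs that needed SIGKILL.
    """
    for pid in pids:
        send_signal(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline and any(pid_alive(p) for p in pids):
        time.sleep(poll)
    survivors = [p for p in pids if pid_alive(p)]
    for pid in survivors:
        send_signal(pid, signal.SIGKILL)
    return survivors
```

The returned list tells you which ffmpeg processes ignored SIGTERM, which is a rough hint about which ones were really hanging.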

You will not be able to stop the Downloader in an easy way that safely keeps running streams alive. You would have to make sure that killing Downloader.py doesn't kill the spawned children, and then inspect whether each ffmpeg is hanging or crashed and kill only those. I think that's a complex task.

MPEG-TS should only need conversion. Even a hard kill should leave a working, convertible MPEG-TS file. You can verify truncation with mediainfo. There is a library binding for mediainfo which uses the library directly, so it should be possible to write a test for a truncated file.

The above could also be done in any other language, such as Bash, Ruby, Go or whichever you like. For someone who can read or write C, Python is really simple, and in general it is helpful for fast software development and system automation.

DerBunteBall avatar Dec 12 '22 17:12 DerBunteBall

It happened again. SIGTERM doesn't work; I have to use kill -9 PID. I didn't read your post carefully, so I used "quit" in the CLI, and all but one FFMPEG task stopped. The script waited; I killed the last FFMPEG and then killed the script (it was "stuck"). Then I restarted everything. I didn't check which stream was blocked, but the others seem to have stopped correctly. Next time I'll try to do as you say!

I use tmux to run the script, so I could send CLI commands through tmux. I can't see the result, but I can wait x seconds and then proceed with the script. To your knowledge, does "Controller.py stop *" work, or do I have to close one stream at a time?

ThEnGI avatar Dec 15 '22 17:12 ThEnGI

That seems plausible.

thread = Thread(target=execute)
thread.start()

The downloader thread (everything in the script is threaded) isn't a daemon thread, so the main process will be stuck until the thread has stopped (this is also called an attached thread). That ffmpeg isn't reacting to SIGTERM indicates a crashed ffmpeg process, so only SIGKILL will bring it down. The fact that you could stop the other streams indicates that the script doesn't have a big problem with a dead thread. There is no handling for dead threads in the script. It would also need investigation to check whether the thread died or just the subprocess. If the thread is still alive, it would be possible to handle this situation in the thread and bring ffmpeg down with SIGKILL. A monitoring thread may be needed.
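To illustrate handling this inside the thread, here is a minimal sketch. It is not StreaMonitor code: `run_with_deadline` is a hypothetical helper, a fixed deadline stands in for whatever "stuck" condition is chosen, and a sleeping Python child stands in for a hanging ffmpeg.

```python
import subprocess
import sys
from threading import Thread


def run_with_deadline(cmd, deadline, grace=5.0):
    """Run cmd; if it outlives `deadline` seconds, terminate it, then kill it after `grace`."""
    process = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    try:
        return process.wait(timeout=deadline)
    except subprocess.TimeoutExpired:
        process.terminate()                 # SIGTERM on POSIX: polite request
        try:
            return process.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            process.kill()                  # SIGKILL on POSIX: cannot be ignored
            return process.wait()


# Stand-in for a hanging ffmpeg: a child process that sleeps far too long.
hanging = [sys.executable, "-c", "import time; time.sleep(60)"]
worker = Thread(target=run_with_deadline, args=(hanging, 1.0))
# Threads are non-daemon by default, so the interpreter waits for them at exit.
print("daemon thread:", worker.daemon)
worker.start()
worker.join(timeout=15)
print("still running:", worker.is_alive())
```

Because the thread itself brings the child down, it finishes and no longer blocks the main process from exiting.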

The general problem is that threads aren't the right tool for I/O work; today the asynchronous approach is much better for that. Threads were designed for CPU-bound work. StreaMonitor is heavily I/O-bound: it is always waiting for the network or slow hard disks to do something.
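As a rough idea of what the asynchronous approach could look like, the sketch below waits on two child processes concurrently in a single thread and kills the one that exceeds its deadline. It is an illustration only: the `record` coroutine and the stand-in commands are assumptions, and nothing here reflects how StreaMonitor is actually structured.

```python
import asyncio
import sys


async def record(cmd, deadline):
    """Run cmd without blocking a thread; kill it if it outlives `deadline` seconds."""
    process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.DEVNULL,
    )
    try:
        return await asyncio.wait_for(process.wait(), timeout=deadline)
    except asyncio.TimeoutError:
        process.kill()
        return await process.wait()


async def main():
    quick = [sys.executable, "-c", "pass"]                          # ends on its own
    slow = [sys.executable, "-c", "import time; time.sleep(60)"]    # "hangs"
    # Both stand-in recordings are awaited concurrently in one thread.
    return await asyncio.gather(record(quick, 5.0), record(slow, 1.0))


results = asyncio.run(main())
print(results)
```

The first exit code is 0; the second is non-zero because that child had to be killed.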

With tmux or screen there should be ways to inject keyboard input into the session/window. You could use this and check whether you can also get at the output. You may be partially blind in your cleanup script if you choose this way.

If you are on the latest code, the stop command supports the asterisk. So in the CLI as well as in Controller.py you should be able to run stop *, which should stop all threads (bringing them to the state "Not running"). A dead thread wouldn't react to this, I think. You could build something that runs list before and after the stop command and checks the differences.

There is no well-defined protocol, so the ZeroMQ communication requires real parsing work to interpret the output of the list command. Sadly you can't simply write something that checks things by sending and receiving a bit of JSON or the like.

DerBunteBall avatar Dec 15 '22 18:12 DerBunteBall

Check this: py2tmux, a wrapper for the tmux command.

It also seems to be possible to get output back directly by capturing and reading the buffers of panes; look here.
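Without any wrapper library, the same can be sketched by calling tmux directly through `subprocess`. The helper names and the `settle` delay are assumptions for illustration; `send-keys` and `capture-pane -p` are standard tmux commands, and the target (for example a session name) is whatever your own tmux setup uses.

```python
import subprocess
import time


def send_keys_command(target, text):
    """Build a tmux command that types `text` into a pane and presses Enter."""
    return ["tmux", "send-keys", "-t", target, text, "Enter"]


def capture_pane_command(target):
    """Build a tmux command that prints the visible contents of a pane to stdout."""
    return ["tmux", "capture-pane", "-p", "-t", target]


def run_in_pane(target, text, settle=2.0):
    """Type a command into the pane, wait briefly, then return what the pane shows."""
    subprocess.run(send_keys_command(target, text), check=True)
    time.sleep(settle)              # give the program in the pane time to respond
    result = subprocess.run(
        capture_pane_command(target), check=True, capture_output=True, text=True
    )
    return result.stdout
```

A cleanup script could call `run_in_pane` with the quit command and inspect the returned text before deciding whether to kill anything.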

DerBunteBall avatar Dec 15 '22 18:12 DerBunteBall

Killing Downloader.py directly leaves the ffmpeg subprocesses alive, so I will do as above (kill FFmpeg and then Downloader.py). Every time the connection is interrupted, all the streams (5-6) crash. Doing the analysis you suggested is too expensive. Since it's all for personal use and I don't have to make money from it, losing a few streams doesn't matter. If the interest is high I can recover the part before the crash with "untrunc".

If you want to give me a perfectly functional script (third-world-internet proof) for Christmas, I'll gladly accept it XD. Before I do anything concrete it will already be next year.

ThEnGI avatar Dec 17 '22 16:12 ThEnGI

The reason is now clear.

If the still-running ffmpeg processes stay alive after Downloader.py is closed, the spawned subprocesses don't seem to be influenced by the thread type. So a subprocess doesn't seem to depend on the thread that spawned it.

I think it's better to kill Downloader.py, send SIGTERM to ffmpeg, wait a moment and then send SIGKILL, so that processes that are still alive have the chance to shut down cleanly. It would need verification whether the reverse order could bring Downloader.py into a state where it crashes subprocesses that are still alive. So stopping the ffmpeg processes first could produce more problems.

You could also try sending SIGTERM and check whether the script treats it like a normal stream end. If so, you could send SIGTERM, wait, and then kill Downloader and the dead ffmpeg processes in one step. Because Downloader.py reacts to SIGTERM, you can also SIGTERM Downloader.py in a further step.

I think the situation is so special that it's too much work to build a tool to handle it. As said, the software needs a really well-connected environment. That is something you have to factor in on home lines.

DerBunteBall avatar Dec 17 '22 17:12 DerBunteBall

Small update: Downloader.py quit -> script stopped -> killall -9 ffmpeg -> script closed correctly (corrupted files, obviously). If it happens again (it definitely will), I will try SIGTERM first. Is "quit" not enough to bring down the "half-healthy" tasks? Do you think it's better to try a SIGTERM anyway?

ThEnGI avatar Dec 17 '22 17:12 ThEnGI

If the quit command leads to a halt (Downloader.py doesn't return to the prompt), that indicates it hangs.

That's because of the fact described above that threads have two modes. We have something like this:

Downloader.py (threaded main process) -> Manager Thread --> Model Thread ---> Downloader Thread ----> ffmpeg Subprocess

Formally the above is wrong, because threads have no children in that sense: the Manager, Model and Downloader threads all run within the main process (threads are like "thin" processes within a process). Only the ffmpeg subprocess is a new process (threads as well as processes can spawn processes).

As long as threads are attached, the main process only exits when all threads have completed. So it depends on the behaviour of the thread that spawned the subprocess. That thread should stop only when the subprocess ends. Because ffmpeg crashes, the thread could be alive but stuck, or the thread could have crashed too (dead). The fact that sending SIGKILL to all ffmpeg processes ends the Downloader.py process indicates to me that killing the subprocess gives the Downloader thread something that lets it continue. So I think the Downloader thread itself is still alive.

-9 for kill/killall means SIGKILL. A really crashed ffmpeg reacts to nothing, so only SIGKILL will end it from the kernel side. One that is still alive will react to SIGTERM. Dead ones shouldn't be harmed by SIGTERM, so a SIGTERM shouldn't cost you anything. You could simply count the ffmpeg processes to see which ones were alive.

Because Downloader.py only quits after killall -9, it's clear that it hangs in the quit command until the subprocesses are gone.

DerBunteBall avatar Dec 17 '22 17:12 DerBunteBall

A quick idea that might help:

The code spawns ffmpeg in a really naive way. The CLI has something like -retries, -retries-fragment and -reconnect, plus further -reconnect_* options.

You could experiment a bit with those. Perhaps this triggers code paths in ffmpeg that stop the hanging. It's possible that the CLI (the ffmpeg command-line tool) uses a really simple technique by default. It's just the wrapper tool for the libav* stack, so the libraries can obviously do more. Check the ffmpeg website for that. The man page and some other resources are full of holes, so some options are documented only in the full documentation on the website.

DerBunteBall avatar Dec 17 '22 17:12 DerBunteBall

I investigated a bit.

  1. Some sources mention that ffmpeg has something like a stalled state. So I think it's possible that the ffmpeg processes are alive but stalled.
  2. The reconnect options really could help you. I think the problem is the reconnect of your line.

I would suggest something like this:

ffmpeg -user_agent agent -reconnect 1 -reconnect_on_network_error 1 -reconnect_streamed 1 -reconnect_delay_max 60 -i url -c:a copy -c:v copy filename

-reconnect means enable reconnection; -reconnect_on_network_error means reconnect if the network fails; -reconnect_streamed means also reconnect on non-seekable streams (which should especially mean live streams); -reconnect_delay_max means give up after x seconds.

This should lead to the following: the stream gets written; if the connection is lost for, say, 20 seconds and the reconnect is successful, recording proceeds from wherever the stream is after the reconnect. The 20 seconds will be missing from the final file. Reconnection attempts stop after 1 minute and ffmpeg should quit.

For StreaMonitor this could look like a normal stream end, which triggers status checking again.

The cmd list would look like this (in ffmpeg.py in downloaders folder):

cmd = [
    'ffmpeg',
    '-user_agent', self.headers['User-Agent'],
    '-reconnect', '1',
    '-reconnect_on_network_error', '1',
    '-reconnect_streamed', '1',
    '-reconnect_delay_max', '60',
    '-i', url,
    '-c:a', 'copy',
    '-c:v', 'copy',
    filename
]

See this in the ffmpeg documentation.

Use at your own risk.

DerBunteBall avatar Dec 17 '22 19:12 DerBunteBall

This returns the following error:

Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/home/XXXXX/StreaMonitor/streamonitor/downloaders/ffmpeg.py", line 49, in execute
    raise
RuntimeError: No active exception to reraise

Line 49 (after the addition) is: if process.returncode and process.returncode != 0 and process.returncode != 255: raise. Am I wrong, or does it give me an error on the line where it checks that the recording has started? I'm not sure, but the script seems to be expecting some return value.

Don't worry, before making any changes I make a backup of the script.

Best regards

ThEnGI avatar Dec 19 '22 16:12 ThEnGI

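This line raises an exception when ffmpeg exits with an error, for example when it fails to start.

The concrete error comes from the bare raise: there is no exception type and no active exception to re-raise, so Python raises a RuntimeError instead and the real cause is lost.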
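One way to keep the real cause visible is to raise an explicit exception that carries the exit code. The sketch below is a suggestion, not the project's code: `check_exit` is a hypothetical helper, and it treats exit code 255 as acceptable only because the check quoted earlier in the thread does so.

```python
import subprocess
import sys


def check_exit(process):
    """Raise a descriptive error when the child exited with an unexpected code."""
    code = process.returncode
    # 0 means success; 255 is tolerated to mirror the check quoted in the thread.
    if code and code != 255:
        raise subprocess.CalledProcessError(code, process.args)


# Stand-in for an ffmpeg that rejects its options: a child that exits with status 1.
failed = subprocess.run([sys.executable, "-c", "raise SystemExit(1)"])
try:
    check_exit(failed)
except subprocess.CalledProcessError as error:
    message = str(error)
    print(message)
```

The message now names the command and its exit status instead of "No active exception to reraise".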

Check your ffmpeg version. If you are on Debian Stable (Bullseye), it could be incompatible with the mentioned options (Bullseye has 4.3.5). You need at least 4.4.1.
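The version check can be automated with a small parser for the first line of `ffmpeg -version`. This is a sketch under assumptions: the function names are made up, the 4.4.1 minimum is simply the figure given in this thread, and builds whose version is not a plain release number (for example git snapshots) are reported as unknown.

```python
import re


def parse_ffmpeg_version(version_output):
    """Extract (major, minor, patch) from the output of `ffmpeg -version`.

    Returns None when the version is not a plain release number.
    """
    match = re.match(r"ffmpeg version n?(\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def meets_minimum(version_output, minimum=(4, 4, 1)):
    """Return True if the reported version is at least `minimum`."""
    version = parse_ffmpeg_version(version_output)
    return version is not None and version >= minimum
```

The input string would typically come from `subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True).stdout`.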

So I think this occurs because ffmpeg reports an option error.

Check here for Debian packages and here for static builds.

DerBunteBall avatar Dec 19 '22 17:12 DerBunteBall

I messed something up with ffmpeg; now even the original script doesn't work. I'll look into it tomorrow.

Good night

ThEnGI avatar Dec 19 '22 19:12 ThEnGI

Installed the newest FFMPEG; ffmpeg -version returns: ffmpeg version 5.1.2

But now I get this error even without the change:

Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/home/xxxx/StreaMonitor/streamonitor/downloaders/ffmpeg.py", line 45, in execute
    raise
RuntimeError: No active exception to reraise

Thinking it was my coding error, I also tried the backup, but no luck.

WTF is happening?

ThEnGI avatar Dec 20 '22 07:12 ThEnGI

Use this to try to download a Bongacams stream.

bc.py modelname

This should show what ffmpeg is doing.

DerBunteBall avatar Dec 20 '22 11:12 DerBunteBall

It was a build problem: "https protocol not found, recompile FFmpeg with openssl, gnutls or securetransport enabled." I recompiled, and now the script works again. The "reconnect" options still cause the raise error.

Sorry, but I'm too used to using "apt install".

ThEnGI avatar Dec 20 '22 12:12 ThEnGI

Modify bc.py to use the command with the reconnect options to see what fails. The http protocol may need a further build option to support this feature.
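If you prefer to see the failure from Python instead, a small helper can run the command and show the end of its error output, which is where an option error would appear. This is not bc.py, just a sketch with a hypothetical function name; note that it waits for the command to finish, so it is meant for commands that fail quickly.

```python
import subprocess
import sys


def last_error_lines(cmd, count=5):
    """Run cmd to completion and return its exit code plus the last stderr lines."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,     # keep the error output instead of discarding it
        text=True,
        errors="replace",
    )
    lines = result.stderr.strip().splitlines()
    return result.returncode, lines[-count:]
```

Called with the same argument list the script passes to ffmpeg, it returns the exit code together with the lines that explain it.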

DerBunteBall avatar Dec 20 '22 12:12 DerBunteBall

Unrecognized option 'reconnect_at_network_error'. Without this option it works (bc.py)

ThEnGI avatar Dec 20 '22 13:12 ThEnGI

-reconnect_on_network_error 1

DerBunteBall avatar Dec 20 '22 13:12 DerBunteBall