"podcast-transcribe-episode" doesn't manage to transcode files with non-video "video" streams, e.g. mjpeg

Open pypt opened this issue 4 years ago • 0 comments

Podcast transcoding fails for some episodes, as the worker logs show:

$ docker service logs $(docker service ls | grep podcast-transcribe-episode-temporal-worker | awk '{ print $1 }')
<...>
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | INFO podcast_transcribe_episode.workflow: Fetching, transcoding, storing episode for story 2017569382...
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | INFO podcast_transcribe_episode.transcode: Found a supported audio stream
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | INFO podcast_transcribe_episode.transcode: Transcoding '/tmp/fetch_transcode_store_episodec6iy_g28/raw_enclosure' to '/tmp/fetch_transcode_store_episodec6iy_g28/transcoded_episode'...
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | [mp3 @ 0xaaaaf46417d0] Skipping 1 bytes of junk at 62145.
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | [mp3 @ 0xaaaaf46417d0] Estimating duration from bitrate, this may be inaccurate
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | Input #0, mp3, from '/tmp/fetch_transcode_store_episodec6iy_g28/raw_enclosure':
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |   Metadata:
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     title           : EVERYTHING YOU EVER WANTED TO KNOW ABOUT COVID THAT THE GOVERNMENT WON'T TELL YOU
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     lyrics-ENG      : <p>INTRODUCTION; WHY OBESITY IS A BIG RISK FACTOR; ZINC AND ACTIVATORS; NUTRACEUTICALS AND BOTANICALS; GARLIC, A SUPERFOOD</p>
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     album           : The Michael Savage Show
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     genre           : Podcast
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     date            : 2021
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |   Duration: 00:59:06.64, start: 0.000000, bitrate: 192 kb/s
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     Stream #0:0: Audio: mp3, 44100 Hz, mono, fltp, 192 kb/s
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     Stream #0:1: Video: mjpeg (Progressive), yuvj420p(pc, bt470bg/unknown/unknown), 500x500 [SAR 72:72 DAR 1:1], 90k tbr, 90k tbn, 90k tbc (attached pic)
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     Metadata:
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |       title           : image
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |       comment         : Other
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | Stream map '0:v' matches no streams.
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | To ignore this, add a trailing '?' to the map.
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | Activity PodcastTranscribeActivities::fetch_transcode_store_episode failed: CalledProcessError(Command '['ffmpeg', '-nostdin', '-hide_banner', '-i', '/tmp/fetch_transcode_store_episodec6iy_g28/raw_enclosure', '-map', '-0:v', '/tmp/fetch_transcode_store_episodec6iy_g28/transcoded_episode']' returned non-zero exit status 1.)
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | Traceback (most recent call last):
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |   File "/usr/local/lib/python3.8/dist-packages/temporal/activity_loop.py", line 69, in activity_task_loop_func
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     return_value = await fn(*args)
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |   File "/opt/mediacloud/src/podcast-transcribe-episode/python/podcast_transcribe_episode/workflow.py", line 124, in fetch_transcode_store_episode
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     raw_enclosure_transcoded = transcode_file_if_needed(
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |   File "/opt/mediacloud/src/podcast-transcribe-episode/python/podcast_transcribe_episode/transcode.py", line 88, in transcode_file_if_needed
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     subprocess.check_call(ffmpeg_command)
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |   File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    |     raise CalledProcessError(retcode, cmd)
mediacloud_podcast-transcribe-episode-temporal-worker.1.bi957ibrx176@bd-misc    | subprocess.CalledProcessError: Command '['ffmpeg', '-nostdin', '-hide_banner', '-i', '/tmp/fetch_transcode_store_episodec6iy_g28/raw_enclosure', '-map', '-0:v', '/tmp/fetch_transcode_store_episodec6iy_g28/transcoded_episode']' returned non-zero exit status 1.

(Sample episode that fails: https://traffic.megaphone.fm/ADV5935473959.mp3?updated=1628579716)

To make transcriptions work, we remove video streams from incoming episodes if we find any:

https://github.com/mediacloud/backend/blob/f32b21bb80778de9a152bf0d1675274a451236b2/apps/podcast-transcribe-episode/src/python/podcast_transcribe_episode/transcode.py#L74-L77

Whether or not the episode has video streams is determined here:

https://github.com/mediacloud/backend/blob/f32b21bb80778de9a152bf0d1675274a451236b2/apps/podcast-transcribe-episode/src/python/podcast_transcribe_episode/media_info.py#L184-L185

But it turns out that quite a few episodes have a static thumbnail attached as a "video" stream, e.g.:

$ ffmpeg -i ADV5935473959.mp3
<...>
[mp3 @ 0x55f3de42a2c0] Skipping 1 bytes of junk at 62145.
[mp3 @ 0x55f3de42a2c0] Estimating duration from bitrate, this may be inaccurate
Input #0, mp3, from 'ADV5935473959.mp3':
  Metadata:
    title           : EVERYTHING YOU EVER WANTED TO KNOW ABOUT COVID THAT THE GOVERNMENT WON'T TELL YOU
    lyrics-ENG      : <p>INTRODUCTION; WHY OBESITY IS A BIG RISK FACTOR; ZINC AND ACTIVATORS; NUTRACEUTICALS AND BOTANICALS; GARLIC, A SUPERFOOD</p>
    album           : The Michael Savage Show
    genre           : Podcast
    date            : 2021
  Duration: 00:59:06.10, start: 0.000000, bitrate: 192 kb/s
    Stream #0:0: Audio: mp3, 44100 Hz, mono, fltp, 192 kb/s
    Stream #0:1: Video: mjpeg (Progressive), yuvj420p(pc, bt470bg/unknown/unknown), 500x500 [SAR 72:72 DAR 1:1], 90k tbr, 90k tbn, 90k tbc (attached pic)
    Metadata:
      title           : image
      comment         : Other
At least one output file must be specified

(That's Stream #0:1 here.)
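To make the distinction concrete, here is a minimal sketch of how such a stream could be told apart from real video. It assumes the file is probed with `ffprobe -show_streams -of json` and relies on the `codec_type` and `disposition.attached_pic` fields of that output. The function name `split_streams()` and the hand-written `SAMPLE_PROBE` (trimmed to mirror the two streams above) are hypothetical and not part of media_info.py.

```python
import json
from typing import Any, Dict, List


def split_streams(probe: Dict[str, Any]) -> Dict[str, List[int]]:
    """Group stream indexes from ffprobe JSON output by kind.

    Hypothetical helper: a "video" stream flagged as an attached picture
    (cover art) is reported separately from real video.
    """
    result: Dict[str, List[int]] = {
        "audio": [], "attached_pic": [], "video": [], "other": [],
    }
    for stream in probe.get("streams", []):
        index = stream["index"]
        codec_type = stream.get("codec_type")
        # ffprobe reports cover art as codec_type "video" with this flag set
        is_cover = stream.get("disposition", {}).get("attached_pic", 0) == 1
        if codec_type == "audio":
            result["audio"].append(index)
        elif codec_type == "video" and is_cover:
            result["attached_pic"].append(index)
        elif codec_type == "video":
            result["video"].append(index)
        else:
            result["other"].append(index)
    return result


# Hand-written stand-in for ffprobe output, mirroring the sample episode
SAMPLE_PROBE = json.loads("""
{
  "streams": [
    {"index": 0, "codec_type": "audio", "codec_name": "mp3",
     "disposition": {"attached_pic": 0}},
    {"index": 1, "codec_type": "video", "codec_name": "mjpeg",
     "disposition": {"attached_pic": 1}}
  ]
}
""")

print(split_streams(SAMPLE_PROBE))
```

For the sample episode this puts stream 0 under "audio" and stream 1 under "attached_pic", leaving "video" empty.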

FFmpeg advises us to "add a trailing '?' to the map", but that probably won't work with the speech-to-text engine, so let's rework transcode_file_if_needed() to remove all non-audio streams, e.g. video, attached JPEGs, text files, etc. (one can attach quite a few things to media files: https://ffmpeg.org/doxygen/trunk/group__lavu__misc.html#ga9a84bba4713dfced21a1a56163be1f48).
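A rough sketch of what the audio-only approach could look like, assuming we keep building the command as a list for subprocess.check_call() as in the traceback above. The helper name `audio_only_ffmpeg_command()` is hypothetical, and any codec or format options that the real transcode_file_if_needed() passes are left out here. The idea is to select audio streams positively with `-map 0:a` rather than subtract video with `-map -0:v`; the ffmpeg documentation describes a negative map as disabling streams from mappings that were already created, which may also be relevant to the error above.

```python
from typing import List


def audio_only_ffmpeg_command(input_path: str, output_path: str) -> List[str]:
    """Build an ffmpeg command that keeps only the input's audio streams.

    Hypothetical helper; codec and format options are intentionally omitted.
    """
    return [
        "ffmpeg",
        "-nostdin",
        "-hide_banner",
        "-i", input_path,
        # Explicit positive mapping: only audio streams of input 0 are
        # written, so video, attached pictures, subtitles and data streams
        # are all left out.
        "-map", "0:a",
        output_path,
    ]


command = audio_only_ffmpeg_command("raw_enclosure", "transcoded_episode")
print(command)
```

The resulting list could then be passed to subprocess.check_call() in place of the current command.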

@jtotoole, could you:

1. Make transcode_file_if_needed() remove all non-audio streams instead of just video streams; and
2. Add a test file to media-samples (which we use as a submodule: https://github.com/mediacloud/backend/tree/master/apps/podcast-transcribe-episode/tests/data) with a structure similar to the failing sample file, i.e. a single audio stream and a "video" stream of type mjpeg, to confirm that we are in fact able to transcode such files?
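One possible way to produce such a test file, sketched under assumptions: it follows the "attach a picture to an mp3 file" recipe from the ffmpeg mp3 muxer documentation, with a JPEG cover so that the picture ends up as an mjpeg stream. The helper name `cover_art_sample_command()` and the file names are hypothetical, and the output path is assumed to end in ".mp3" so that ffmpeg picks the mp3 muxer.

```python
from typing import List


def cover_art_sample_command(
    audio_path: str, cover_jpeg_path: str, output_path: str
) -> List[str]:
    """Build an ffmpeg command that attaches a JPEG cover to an MP3 file.

    Hypothetical helper for generating a test sample with one audio stream
    and one mjpeg "video" stream.
    """
    return [
        "ffmpeg",
        "-nostdin",
        "-hide_banner",
        "-i", audio_path,       # input 0: the audio
        "-i", cover_jpeg_path,  # input 1: the cover image
        "-c", "copy",           # copy both streams without re-encoding
        "-map", "0",
        "-map", "1",
        # Stream-level metadata similar to what the failing sample carries
        "-metadata:s:v", "title=image",
        "-metadata:s:v", "comment=Other",
        output_path,
    ]


command = cover_art_sample_command("audio.mp3", "cover.jpg", "with_cover.mp3")
print(command)
```

Running `ffmpeg -i` on the result should then list an mjpeg "video" stream next to the audio stream, which is worth checking before committing the file to media-samples.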

Aug 17 '21 12:08 pypt