Add Webm medium quality + fix MP4 high quality presets
Currently we have a low quality option and the default high quality option, but something in the middle would suit backup use better: for example, the same options as high quality, but limited to 720p or 1080p instead of the highest quality available.
Even just an option to set the maximum download resolution would resolve this.
Presets are defined here:
https://github.com/openzim/python-scraperlib/blob/main/src/zimscraperlib/video/presets.py
We are definitely open to having a medium quality video preset. The only constraints are that we need to use VP9 for webm and H264 for mp4, and that we need someone to test the preset thoroughly to ensure it works correctly on multiple kinds of videos.
Exactly like we did in https://github.com/openzim/python-scraperlib/issues/79
Unfortunately we've lost the 3 test videos we used in this issue, but we can definitely find a new set of videos. The original set was meant to illustrate the variety of content we usually face. There was a Khan Academy video where someone was writing with white chalk on a blackboard (illustrating the issue of tiny letters on a very homogeneous background), a short TED video with fast movements, and a short news spot. But we are open to another set of test videos if it sufficiently represents the diversity we can face.
I'd be more than happy to do extensive testing with different types of content. I'll start gathering a collection of short videos to test with; feel free to link any recommendations.
Also, I noticed in that linked issue that the users intended the low quality option to be 480p, but with the current settings it's more like 270p for a standard 16:9 video, because they set a required width instead of a height (480p is 480 pixels tall, not wide: 854x480 instead of 480x270). Resolving this mistake(?) could give a decent boost to the quality of the low quality option, though it would increase the output file size.
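To make the width-versus-height arithmetic concrete, here is a small sketch. It assumes a 16:9 source and rounds the derived dimension to the nearest even number; the function names are made up for this illustration and are not part of the presets.

```bash
#!/bin/bash
set -eu

# even_round NUM DEN: round NUM/DEN to the nearest even integer.
# Positive integers only; this helper is hypothetical, for illustration.
even_round() {
  local num=$1 den=$2
  echo $(( (num + den) / (2 * den) * 2 ))
}

# Fix the WIDTH and derive the height (what the current low preset does).
size_for_width() {
  local width=$1 aspect_w=$2 aspect_h=$3
  echo "${width}x$(even_round $(( width * aspect_h )) "$aspect_w")"
}

# Fix the HEIGHT and derive the width (what "480p" normally means).
size_for_height() {
  local height=$1 aspect_w=$2 aspect_h=$3
  echo "$(even_round $(( height * aspect_w )) "$aspect_h")x${height}"
}

size_for_width 480 16 9    # prints 480x270
size_for_height 480 16 9   # prints 854x480
```

The second result has roughly three times as many pixels as the first, which is why switching from a fixed width to a fixed height would raise both quality and file size.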
The 480 width comes from ... very long ago. I'm not sure at all it was intended to be 480p. Someone seems to have been confused in the linked issue and believed it was going to be 480p, but that's all.
Ah, fair enough
So, I gathered some YouTube videos covering a range of things, like animations, music, TED talks and educational content including some chalkboard stuff, and played around with some of the encoding options to get what I think would be a good medium setup.
To start with, I'd recommend adding -row-mt 1 to all webm encoding options. This improved performance by a decent bit in my testing, with some videos going from 0.6x to 1.1x encode speed.
ffmpeg docs; https://trac.ffmpeg.org/wiki/Encode/VP9#rowmt
With this flag the script below took just over 2 hours to encode the webm files, and without the flag just over 3 hours, both producing files of the same size and quality.
I'd also recommend adding -movflags "+faststart" to all mp4 options, instead of just low.
For the medium settings I settled on the following for webm, basically right in between low and high and locked to 720p. The scale value I use in my testing only scales videos down to 720p, instead of scaling everything up or down to 720p; pick whichever method of locking to 720p you prefer.
-b:v "240k" -qmin "32" -qmax "58" -g "240" -quality "good" -speed "3" -vf "scale='-2:min(720,ih)':force_original_aspect_ratio=increase" -row-mt 1 -codec:a "libvorbis" -b:a "48k" -ar "44100" -ac "2"
key differences being bitrates, qmin, qmax, scale and speed
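As a rough model of what the scale expression `-2:min(720,ih)` is meant to do, the sketch below caps the height at 720 without upscaling and derives an even width that keeps the aspect ratio. This is an approximation for illustration only: `target_size` is a made-up helper, and ffmpeg's own rounding (and the effect of `force_original_aspect_ratio`) may differ in edge cases.

```bash
#!/bin/bash
set -eu

# target_size WIDTH HEIGHT: approximate output size for "-2:min(720,ih)".
# Hypothetical helper; positive integer inputs only.
target_size() {
  local in_w=$1 in_h=$2 out_h out_w
  # min(720, ih): never go above 720 and never upscale a smaller source.
  if [ "$in_h" -gt 720 ]; then out_h=720; else out_h=$in_h; fi
  # "-2": keep the aspect ratio and round the width to an even number.
  out_w=$(( (in_w * out_h + in_h) / (2 * in_h) * 2 ))
  echo "${out_w}x${out_h}"
}

target_size 1920 1080   # prints 1280x720
target_size 3840 2160   # prints 1280x720
target_size 640 360     # prints 640x360 (left at its original size)
```

Scaling everything to exactly 720p instead would replace the `min(720,ih)` term with a plain `720`, at the cost of upscaling low-resolution sources.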
And for mp4 match the high settings, but with a crf of 32 and audio bitrate of 64k, locked to 720p again
Webm input: 556MB. Outputs: low 113MB (-79%), med 144MB (-74%), high 216MB (-61%)
Mp4 input: 522MB. Outputs: low 135MB (-74%), med 187MB (-64%), high 984MB (+88%); we may want to consider changing the high mp4 settings
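For anyone re-running the comparison with their own videos, the percentages above can be reproduced with plain shell integer arithmetic. This is a minimal sketch; `pct_change` is a name invented here, and it truncates toward zero, which matches the figures quoted above.

```bash
#!/bin/bash
set -eu

# pct_change BEFORE AFTER: size change in percent, truncated toward zero.
# Both sizes must use the same unit (MB here); hypothetical helper.
pct_change() {
  local before=$1 after=$2
  echo $(( (after - before) * 100 / before ))
}

echo "webm low:  $(pct_change 556 113)%"   # -79%
echo "webm med:  $(pct_change 556 144)%"   # -74%
echo "webm high: $(pct_change 556 216)%"   # -61%
echo "mp4 low:   $(pct_change 522 135)%"   # -74%
echo "mp4 med:   $(pct_change 522 187)%"   # -64%
echo "mp4 high:  $(pct_change 522 984)%"   # 88%
```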
Comparison images
Let me know if you'd like more
And last but not least, here is the admittedly overcomplicated script I used for testing. It downloads all the videos I used and runs an encode with the low, medium and high settings, so you can easily compare all of them yourself and add some of your own test videos. I'd recommend running it in a dedicated folder, as it creates its own files and folders to work with. You might need to change the concurrency depending on your specs. It can take hours to run (2+ with a concurrency of 3 for me).
#!/bin/bash
# ⋉ Unfortunately, Dunk was here
##############################
# Requires ffmpeg and yt-dlp #
##############################
set -uo pipefail
trap 'echo "Killing any background jobs..."; kill $(jobs -p) 2>/dev/null; wait $(jobs -p)' EXIT
# Set the max number of parallel ffmpeg jobs
# Don't set this the same as you would for the youtube scraper,
# "-row-mt 1" makes the webm processes use way more CPU, so it very quickly becomes diminishing returns
# (95% usage on a 14c/28t machine with a concurrency of 3, they spicy)
CONCURRENCY=3
# Separate value for downloads
DOWN_CONCURRENCY=4
wait_for_slot() {
while [[ $(jobs -r -p | wc -l) -ge "${1:-${CONCURRENCY:-2}}" ]]; do
wait -n || true
done
}
mkdir -p ./input/{webm,mp4};
mkdir -p ./output/{webm,mp4}/{LOW,MED,HIGH};
# All videos are sourced from youtube
# You can also add your own non-youtube videos to input/(mp4|webm)/file.ext
declare -a IDS=("CHTMZiXeB6A" "tUfvOTYBXQQ" "1GrOLainIiA" "YNvIp9xx3P8" "l789l6np-qA" "--lPz7VFnKI" "D-_qS_3KXBA" "eIho2S0ZahI" "H0-WkpmTPrM" "BMrAIzCcNLk" "CySwZBgM_j4" "ybqHUv8cLVk" "uDV7y2yY7pw" "JCDDh7mwx30" "aWhWjd3o78A")
for ID in "${IDS[@]}"; do
if [ ! -f "./input/webm/${ID}.webm" ] ; then
wait_for_slot "${DOWN_CONCURRENCY:-2}"
echo "Downloading ${ID}.webm"
(yt-dlp -q -f "best[ext=webm]/bestvideo[ext=webm]+bestaudio[ext=webm]/best" --merge-output-format webm --output "./input/webm/${ID}" -- "${ID}") &
fi
if [ ! -f "./input/mp4/${ID}.mp4" ] ; then
# y'know, I have no idea why webm gets an extension but mp4 doesn't, also had to switch the formats around to get audio
# swap the lines if stuff behaves different for you for some damn reason
wait_for_slot "${DOWN_CONCURRENCY:-2}"
echo "Downloading ${ID}.mp4"
(yt-dlp -q -f "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best" --merge-output-format mp4 --output "./input/mp4/${ID}.%(ext)s" -- "${ID}") &
# (yt-dlp -q -f "best[ext=mp4]/bestvideo[ext=mp4]+bestaudio[ext=m4a]/best" --merge-output-format mp4 --output "./input/mp4/${ID}" -- "${ID}") &
fi
done
wait
# Run through each file applying all encoding combinations for webm and mp4 files
for FILE in ./input/webm/*.webm; do
BASE=$(basename "${FILE}")
# My way of skipping files in the input without moving them (comment out the ID in a file named "ignore")
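# -s keeps grep quiet when the optional "ignore" file does not exist, -F matches the ID literally
if grep -qsF "# ${BASE%.*}" ./ignore; then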
continue
fi
# Current low webm options
wait_for_slot
echo "Encoding LOW ${BASE}"
(ffmpeg -hide_banner -loglevel error -i "${FILE}" -codec:v "libvpx-vp9" -b:v "140k" -qmin "30" -qmax "40" -g "240" -quality "good" -speed "4" -vf "scale='480:trunc(ow/a/2)*2'" -row-mt 1 -codec:a "libvorbis" -b:a "48k" -ar "44100" -ac "2" -y "./output/webm/LOW/${BASE}") &
# Proposed medium webm options
wait_for_slot
echo "Encoding MED ${BASE}"
(ffmpeg -hide_banner -loglevel error -i "${FILE}" -codec:v "libvpx-vp9" -b:v "240k" -qmin "32" -qmax "58" -g "240" -quality "good" -speed "3" -vf "scale='-2:min(720,ih)':force_original_aspect_ratio=increase" -row-mt 1 -codec:a "libvorbis" -b:a "48k" -ar "44100" -ac "2" -y "./output/webm/MED/${BASE}") &
# Current high webm options
wait_for_slot
echo "Encoding HIGH ${BASE}"
(ffmpeg -hide_banner -loglevel error -i "${FILE}" -codec:v "libvpx-vp9" -b:v "340k" -qmin "26" -qmax "54" -g "240" -quality "good" -speed "1" -row-mt 1 -codec:a "libvorbis" -b:a "48k" -ar "44100" -ac "2" -y "./output/webm/HIGH/${BASE}") &
done
wait
for FILE in ./input/mp4/*.mp4; do
BASE=$(basename "${FILE}")
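# -s keeps grep quiet when the optional "ignore" file does not exist, -F matches the ID literally
if grep -qsF "# ${BASE%.*}" ./ignore; then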
continue
fi
# Current low mp4 options
wait_for_slot
echo "Encoding LOW ${BASE}"
(ffmpeg -hide_banner -loglevel error -i "${FILE}" -codec:v "libx264" -b:v "300k" -maxrate "300k" -minrate "300k" -qmin "30" -qmax "42" -vf "scale='480:trunc(ow/a/2)*2'" -codec:a "aac" -ar "44100" -b:a "48k" -movflags "+faststart" -ac "2" -y "./output/mp4/LOW/${BASE}") &
# Proposed medium mp4 options
wait_for_slot
echo "Encoding MED ${BASE}"
(ffmpeg -hide_banner -loglevel error -i "${FILE}" -codec:v "libx264" -crf 32 -vf "scale='-2:min(720,ih)':force_original_aspect_ratio=increase" -codec:a "aac" -b:a "64k" -ar "44100" -movflags "+faststart" -ac "2" -y "./output/mp4/MED/${BASE}") &
# Current high mp4 options
wait_for_slot
echo "Encoding HIGH ${BASE}"
(ffmpeg -hide_banner -loglevel error -i "${FILE}" -codec:v "libx264" -crf 20 -codec:a "aac" -y "./output/mp4/HIGH/${BASE}") &
done
wait
Thank you very much! I will need some time to analyze all these inputs, but this looks (very) promising.
Could you propose something as well to fix the high MP4? Increasing the size by 88% does not seem like a good decision at all 🤣
Got a bit distracted. For the high MP4 option I'd propose changing the CRF to 28: this preserves a level of quality almost identical to the input, for both high and low quality source files, while making the output a bit smaller than the input (~9% smaller with the above script).
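For reference, the proposed high MP4 command could look roughly like the sketch below. It takes the current high settings from the test script and changes only the CRF to 28; `-movflags +faststart` is included because it was suggested earlier in this thread for all mp4 options, so drop it if that suggestion is not adopted. The function name, the `DRY_RUN` switch and the file names are placeholders invented for this example.

```bash
#!/bin/bash
set -eu

# encode_high_mp4 INPUT OUTPUT: proposed high-quality MP4 encode (CRF 28).
# With DRY_RUN=1 (the default here) the command is only printed, not run.
encode_high_mp4() {
  local input=$1 output=$2
  set -- -hide_banner -loglevel error -i "$input" \
    -codec:v libx264 -crf 28 -codec:a aac \
    -movflags +faststart -y "$output"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "ffmpeg $*"
  else
    ffmpeg "$@"
  fi
}

# Placeholder file names; prints the command that would be executed.
encode_high_mp4 "input.mp4" "output.mp4"
```

Running it with `DRY_RUN=0` performs the actual encode, which requires ffmpeg built with libx264.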
Transferring to proper repo.
@kevinmcmurtrie since you've helped fine-tune the low and high webm settings, I imagine you would be interested in this issue. Any feedback would be greatly appreciated, this is not an area where I excel at all.
I'll need some ramp-up time because I've forgotten all the quirks/bugs. The problem is that FFmpeg tries to keep a stable set of common parameters then translate them for each codec. It's buggy, and there's a lot of bad advice online from people who did little testing.
I was testing using a chalkboard lecture for low motion, a TED talk for panning, and a goth rock music video for high random motion. A broken compression subsystem would show up as a bad quality:bitrate ratio on one of those. If somebody has specific videos they want to see working, share them.
Thank you! No worries about the ramp-up time, mine would be way worse. Much appreciated as always.