python-scraperlib

Add Webm medium quality + fix MP4 high quality presets

Open DasDunkel opened this issue 1 month ago • 9 comments

Currently we have a low quality option and the default high quality one, but something in the middle feels like it'd be better for backup uses: the same options as high quality, but limited to 720p/1080p instead of the highest quality available

Even just an option to set the max download resolution could work to resolve this

DasDunkel, Nov 07 '25 21:11

Presets are defined here:

https://github.com/openzim/python-scraperlib/blob/main/src/zimscraperlib/video/presets.py

We are definitely open to having a medium quality video preset. The only constraint is that we need to use VP9 for webm and H264 for mp4, and we need someone to test the preset thoroughly to ensure it works correctly on multiple kinds of videos.

Exactly like we did in https://github.com/openzim/python-scraperlib/issues/79

Unfortunately we've lost the 3 test videos we used in this issue, but we can definitely find a new set of videos. The original set was meant to illustrate the variety of content we usually face. There was a Khan Academy video where someone was writing with white chalk on a blackboard (illustrating the issue of tiny letters on a very homogeneous background), a short TED video with fast movements, and a short news spot. But we are open to another set of test videos if it sufficiently represents the diversity we can face.

benoit74, Nov 08 '25 10:11

I'd be more than happy to do extensive testing with different types of content. I'll start gathering a collection of short videos to test with; feel free to link any recommendations

Also, I noticed in that linked issue that the users intended the low quality option to be 480p, but with the current settings it's more like 270p for a standard 16:9 video, because they set a required width instead of a height (480p is 480 pixels tall, not wide: 854x480 instead of 480x270). Resolving this mistake(?) could give a decent boost to the quality of the low-quality option, though it would increase the output file size
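As a rough check of that arithmetic, here is a minimal bash sketch. The function names are made up for illustration, and the even-width rounding in the second function is an assumption about how a height-locked filter would behave, not something taken from the presets file.

```shell
#!/bin/bash
# Hypothetical helpers, not part of scraperlib.

# Output size of the current low filter scale='480:trunc(ow/a/2)*2'
# for a W x H input: width fixed at 480, height truncated to an even number.
width_locked_dims() {
    local w=$1 h=$2
    echo "480x$(( (480 * h) / (2 * w) * 2 ))"
}

# Output size for a true 480p target: height fixed at 480,
# width rounded to the nearest even number (assumed rounding rule).
height_locked_dims() {
    local w=$1 h=$2
    echo "$(( (480 * w + h) / (2 * h) * 2 ))x480"
}

width_locked_dims 1920 1080    # prints 480x270
height_locked_dims 1920 1080   # prints 854x480
```

For a 16:9 source the height-locked variant has a bit over three times as many pixels per frame, which is where the quality boost and the larger file size would both come from.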

DasDunkel, Nov 08 '25 14:11

The 480 width comes from ... very long ago. I'm not sure at all it was intended to be 480p. Someone seems to have been confused in the linked issue and believed it was going to be 480p, but that's all.

benoit74, Nov 09 '25 18:11

Ah, fair enough

So, I gathered some YouTube videos covering a range of content, like animations, music, TED talks and educational content including some chalkboard stuff, and played around with some of the encoding options to get what I think would be a good medium setup

To start with, I'd recommend adding -row-mt 1 to all webm encoding options. This improved performance by a decent bit in my testing, with some videos going from 0.6x to 1.1x encode speed (ffmpeg docs: https://trac.ffmpeg.org/wiki/Encode/VP9#rowmt). With this flag the script below took just over 2 hours to encode the webm files, and without the flag just over 3 hours, both producing files of the same size and quality
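To reproduce that kind of comparison on a single file, something like the sketch below could work. It assumes ffmpeg with libvpx-vp9 and libvorbis is on the PATH. The helper names are hypothetical, the sample file is whatever you pass as the first argument, and the options are the proposed medium webm ones from this comment minus the scale filter.

```shell
#!/bin/bash
# Hypothetical helpers for timing one encode with and without -row-mt.

# Print the VP9/Vorbis options; $1 is the -row-mt value (0 or 1).
vp9_args() {
    echo "-codec:v libvpx-vp9 -b:v 240k -qmin 32 -qmax 58 -g 240 -quality good -speed 3 -row-mt ${1} -codec:a libvorbis -b:a 48k -ar 44100 -ac 2"
}

# Encode file $1 with -row-mt $2, discard the result via the null muxer,
# and print the elapsed wall-clock time in whole seconds.
time_encode() {
    local start=${SECONDS}
    local -a args
    read -r -a args <<< "$(vp9_args "$2")"
    ffmpeg -nostdin -hide_banner -loglevel error -i "$1" "${args[@]}" -f null -
    echo $(( SECONDS - start ))
}

# Only runs the (slow) encodes when an existing file is given as argument.
SAMPLE="${1:-}"
if [[ -n "${SAMPLE}" && -f "${SAMPLE}" ]]; then
    echo "row-mt 0: $(time_encode "${SAMPLE}" 0)s"
    echo "row-mt 1: $(time_encode "${SAMPLE}" 1)s"
fi
```

Run one encode at a time when timing this way, otherwise parallel jobs competing for the CPU will skew the numbers.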

I'd also recommend adding -movflags "+faststart" to all mp4 options, instead of just low

For medium settings I settled on the following for webm, basically right in between low and high, locked to 720p. I use this scale value in my testing to only scale videos down to 720p, instead of scaling everything up or down to 720p; pick whichever method of locking to 720p you prefer

-b:v "240k" -qmin "32" -qmax "58" -g "240" -quality "good" -speed "3" -vf "scale='-2:min(720,ih)':force_original_aspect_ratio=increase" -row-mt 1 -codec:a "libvorbis" -b:a "48k" -ar "44100" -ac "2"

key differences being bitrates, qmin, qmax, scale and speed

And for mp4 match the high settings, but with a crf of 32 and audio bitrate of 64k, locked to 720p again

Webm (input: 556MB)
low: 113MB (-79%)
med: 144MB (-74%)
high: 216MB (-61%)

Mp4 (input: 522MB)
low: 135MB (-74%)
med: 187MB (-64%)
high: 984MB (+88%); may want to consider changing these settings
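The percentages are the change in total output size relative to the total input size. A small bash sketch of that calculation, using a hypothetical helper name and integer arithmetic that truncates toward zero:

```shell
#!/bin/bash
# Hypothetical helper: percentage change from input size $1 to output size $2
# (same unit for both), truncated toward zero by bash integer division.
pct_change() {
    local input_mb=$1 output_mb=$2
    printf '%+d%%\n' $(( (output_mb - input_mb) * 100 / input_mb ))
}

pct_change 556 113   # webm low,  prints -79%
pct_change 556 144   # webm med,  prints -74%
pct_change 522 984   # mp4 high,  prints +88%
```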

Comparison images

Let me know if you'd like more

[6 comparison images]

And last but not least, the admittedly overcomplicated script I used for testing. It downloads all the videos I used and runs an encoding with low, medium and high settings, so you can easily compare all of them yourself and add some of your own test videos. I'd recommend running it in a dedicated folder, as it will make its own files and folders to work with. You might need to change the concurrency depending on your specs. This can take hours to run (2+ with a concurrency of 3 for me).

#!/bin/bash
# ⋉ Unfortunately, Dunk was here 
##############################
# Requires ffmpeg and yt-dlp #
##############################
set -uo pipefail
trap 'echo "Killing any background jobs..."; kill $(jobs -p) 2>/dev/null; wait $(jobs -p)' EXIT

# Set the max number of parallel ffmpeg jobs
# Don't set this the same as you would for the youtube scraper,
# "-row-mt 1" makes the webm processes use way more CPU, so it very quickly becomes diminishing returns
# (95% usage on a 14c/28t machine with a concurrency of 3, they spicy)
CONCURRENCY=3
# Separate value for downloads
DOWN_CONCURRENCY=4

wait_for_slot() {
    while [[ $(jobs -r -p | wc -l) -ge "${1:-${CONCURRENCY:-2}}" ]]; do
        wait -n || true
    done
}

mkdir -p ./input/{webm,mp4};
mkdir -p ./output/{webm,mp4}/{LOW,MED,HIGH};

# All videos are sourced from youtube
# You can also add your own non-youtube videos to input/(mp4|webm)/file.ext
declare -a IDS=("CHTMZiXeB6A" "tUfvOTYBXQQ" "1GrOLainIiA" "YNvIp9xx3P8" "l789l6np-qA" "--lPz7VFnKI" "D-_qS_3KXBA" "eIho2S0ZahI" "H0-WkpmTPrM" "BMrAIzCcNLk" "CySwZBgM_j4" "ybqHUv8cLVk" "uDV7y2yY7pw" "JCDDh7mwx30" "aWhWjd3o78A")
for ID in "${IDS[@]}"; do
    if [ ! -f "./input/webm/${ID}.webm" ] ; then
        wait_for_slot "${DOWN_CONCURRENCY:-2}"
        echo "Downloading ${ID}.webm"
        (yt-dlp -q -f "best[ext=webm]/bestvideo[ext=webm]+bestaudio[ext=webm]/best" --merge-output-format webm --output "./input/webm/${ID}" -- "${ID}") &
    fi
    if [ ! -f "./input/mp4/${ID}.mp4" ] ; then
        # y'know, I have no idea why webm gets an extension but mp4 doesn't, also had to switch the formats around to get audio
        # swap the lines if stuff behaves different for you for some damn reason
        wait_for_slot "${DOWN_CONCURRENCY:-2}"
        echo "Downloading ${ID}.mp4"
        (yt-dlp -q -f "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best" --merge-output-format mp4 --output "./input/mp4/${ID}.%(ext)s" -- "${ID}") &
        # (yt-dlp -q -f "best[ext=mp4]/bestvideo[ext=mp4]+bestaudio[ext=m4a]/best" --merge-output-format mp4 --output "./input/mp4/${ID}" -- "${ID}") &
    fi
done

wait

# Run through each file applying all encoding combinations for webm and mp4 files
for FILE in ./input/webm/*.webm; do
    BASE=$(basename "${FILE}")
    # My way of skipping files in the input without moving them (comment out the ID in a file named "ignore")
    if grep -qsF -- "# ${BASE%.*}" ./ignore; then  # -s: stay quiet if ./ignore does not exist, -F: match the ID literally
        continue
    fi
    # Current low webm options
    wait_for_slot
    echo "Encoding LOW  ${BASE}"
    (ffmpeg -hide_banner -loglevel error -i "${FILE}" -codec:v "libvpx-vp9" -b:v "140k" -qmin "30" -qmax "40" -g "240" -quality "good" -speed "4" -vf "scale='480:trunc(ow/a/2)*2'" -row-mt 1 -codec:a "libvorbis" -b:a "48k" -ar "44100" -ac "2" -y "./output/webm/LOW/${BASE}") &
    # Proposed medium webm options
    wait_for_slot
    echo "Encoding MED  ${BASE}"
    (ffmpeg -hide_banner -loglevel error -i "${FILE}" -codec:v "libvpx-vp9" -b:v "240k" -qmin "32" -qmax "58" -g "240" -quality "good" -speed "3" -vf "scale='-2:min(720,ih)':force_original_aspect_ratio=increase" -row-mt 1 -codec:a "libvorbis" -b:a "48k" -ar "44100" -ac "2" -y "./output/webm/MED/${BASE}") &
    # Current high webm options
    wait_for_slot
    echo "Encoding HIGH ${BASE}"
    (ffmpeg -hide_banner -loglevel error -i "${FILE}" -codec:v "libvpx-vp9" -b:v "340k" -qmin "26" -qmax "54" -g "240" -quality "good" -speed "1" -row-mt 1 -codec:a "libvorbis" -b:a "48k" -ar "44100" -ac "2" -y "./output/webm/HIGH/${BASE}") &
done

wait

for FILE in ./input/mp4/*.mp4; do
    BASE=$(basename "${FILE}")
    if grep -qsF -- "# ${BASE%.*}" ./ignore; then  # -s: stay quiet if ./ignore does not exist, -F: match the ID literally
        continue
    fi
    # Current low mp4 options
    wait_for_slot
    echo "Encoding LOW  ${BASE}"
    (ffmpeg -hide_banner -loglevel error -i "${FILE}" -codec:v "libx264" -b:v "300k" -maxrate "300k" -minrate "300k" -qmin "30" -qmax "42" -vf "scale='480:trunc(ow/a/2)*2'" -codec:a "aac" -ar "44100" -b:a "48k" -movflags "+faststart" -ac "2" -y "./output/mp4/LOW/${BASE}") &
    # Proposed medium mp4 options
    wait_for_slot
    echo "Encoding MED  ${BASE}"
    (ffmpeg -hide_banner -loglevel error -i "${FILE}" -codec:v "libx264" -crf 32 -vf "scale='-2:min(720,ih)':force_original_aspect_ratio=increase" -codec:a "aac" -b:a "64k" -ar "44100" -movflags "+faststart" -ac "2" -y "./output/mp4/MED/${BASE}") &
    # Current high mp4 options
    wait_for_slot
    echo "Encoding HIGH ${BASE}"
    (ffmpeg -hide_banner -loglevel error -i "${FILE}" -codec:v "libx264" -crf 20 -codec:a "aac" -y "./output/mp4/HIGH/${BASE}") &
done

wait

DasDunkel, Nov 12 '25 20:11

Thank you very much! I will need some time to analyze all these inputs, but this looks (very) promising.

Could you propose something as well to fix the high MP4? Increasing the size by 88% does not seem like a good decision at all 🤣

benoit74, Nov 13 '25 09:11

Got a bit distracted. For the high MP4 option I'd propose changing the CRF to 28: this preserves a level of quality almost identical to the input, with high and low quality files, while making the output a bit smaller than the input (~9% smaller with the above script)
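For anyone wanting to try it, a sketch of what the adjusted high MP4 command might look like. The helper only prints the command rather than running it. Its name and the default file names are hypothetical, and including -movflags "+faststart" here is an assumption based on the earlier suggestion to add it to all mp4 options.

```shell
#!/bin/bash
# Hypothetical helper: print (not run) the high MP4 ffmpeg command
# with a configurable CRF. $1 = CRF, $2 = input file, $3 = output file.
high_mp4_cmd() {
    local crf="${1:-28}" src="${2:-input.mp4}" dst="${3:-output.mp4}"
    printf 'ffmpeg -i %q -codec:v libx264 -crf %s -codec:a aac -movflags +faststart -y %q\n' \
        "${src}" "${crf}" "${dst}"
}

high_mp4_cmd 28   # proposed value
high_mp4_cmd 20   # current value, for comparison
```

With libx264 a higher CRF means stronger compression, so going from 20 to 28 trades some quality headroom for a smaller file.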

DasDunkel, Nov 25 '25 19:11

Transferring to proper repo.

@kevinmcmurtrie since you've helped fine-tune the low and high webm settings, I imagine you would be interested in this issue. Any feedback would be greatly appreciated, this is not an area where I excel at all.

benoit74, Nov 27 '25 07:11

I'll need some ramp-up time because I've forgotten all the quirks/bugs. The problem is that FFmpeg tries to keep a stable set of common parameters then translate them for each codec. It's buggy, and there's a lot of bad advice online from people who did little testing.

I was testing using a chalkboard lecture for low motion, a TED talk for panning, and a goth rock music video for high random motion. A broken compression subsystem would show up as a bad quality:bitrate ratio on one of those. If somebody has specific videos they want to see working, share them.
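One way to put a number on the bitrate side of that ratio is the average bitrate of each output file. The sketch below is only an illustration: the helper name is hypothetical, the size and duration are made-up example values, and judging the quality side still means looking at the video.

```shell
#!/bin/bash
# Hypothetical helper: average bitrate in kbit/s from a file size in bytes ($1)
# and a duration in whole seconds ($2), using integer arithmetic.
avg_kbps() {
    local bytes=$1 seconds=$2
    echo $(( bytes * 8 / seconds / 1000 ))
}

# Made-up example: a 9,000,000 byte file lasting 300 seconds.
avg_kbps 9000000 300   # prints 240
```

An encode whose average lands far above the requested bitrate on one type of content, with no visible quality gain, would be the kind of outlier worth investigating.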

kevinmcmurtrie, Nov 27 '25 08:11

Thank you! No worries about the ramp-up time, mine would be way worse. Much appreciated as always.

benoit74, Nov 27 '25 08:11