mpv Scaletempo2, the new default for adjusting audio playback speed, sounds noticeably worse in some situations

Important Information

Provide following Information:

mpv version: freshly compiled latest from master (mpv 0.33.0-109-gd0c530919d)
macOS

This is relevant because scaletempo2 was changed to the default from scaletempo in #8376

Reproduction steps

Try listening to some 5.1 audio at 0.95x speed using the now-default scaletempo2 filter.
$ mpv --speed=0.95 --af=scaletempo2 some_audio
Listen for the poor quality in some situations.

Now add the old default, scaletempo, to the af filter chain and listen for the better quality.
$ mpv --speed=0.95 --af=scaletempo some_audio

Also listen to the recorded sample files below. You can use the original_source.mkv file included to reproduce the samples I recorded.

Expected behavior

The default should not make things worse.

Actual behavior

scaletempo2 is worse for minor speed changes in the 0.85x - 1.2x range. It's MUCH better for the big speed changes though, and I really appreciate it for that.

While I do appreciate scaletempo2 for big adjustments, I usually only make minor speed changes, so for me it's not a good default. I suspect that playback speed adjustments in the 0.9-1.2x range are much more common amongst users. This comment is where another user seems to have been caught up in this default change.

Sample files

scaletempo2_0.95x_speed.wav (Bad)
scaletempo_0.95x_speed.wav (Better)
1x_speed.wav (1x speed, recorded in the same way)
original_source_audio.mkv

Apr 07 '21 22:04 varenc

I suggest voting with +1 and -1 on the original post to vote on changing the default. +1 to vote for changing the default back to scaletempo (unless there is some fix) -1 to vote for keeping the current default.

Edit: yes, can reproduce

Apr 07 '21 22:04 Hrxn

Suggestion: scaletempo3, which uses scaletempo for speeds between 0.80 and 1.3 and scaletempo2 for speeds outside of that range.
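
A minimal sketch of that dispatch, assuming a hypothetical helper named pick_tempo_filter (mpv has no such function) and assuming both bounds are inclusive, which the suggestion does not specify:

```c
#include <assert.h>

/* Hypothetical helper, not part of mpv: choose a time-stretch filter
 * by playback speed. The 0.80..1.3 band comes from the suggestion;
 * treating both ends as inclusive is an assumption. */
static const char *pick_tempo_filter(double speed)
{
    assert(speed > 0.0);  /* a playback speed must be positive */
    if (speed >= 0.80 && speed <= 1.3)
        return "scaletempo";
    return "scaletempo2";
}
```

With this sketch, 0.95x would select scaletempo and 2x would select scaletempo2.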

Apr 07 '21 23:04 CounterPillow

Or maybe have it configurable separately as we have it with scale algorithms.

Apr 08 '21 07:04 TiGR

The best solution would just be to make scaletempo2 work better even at minor speed changes! Chrome's own audio scaling, which scaletempo2 is a port of, seems to work fine with minor adjustments.

@DorianRudolph, perhaps you might have an idea of why scaletempo2 performs worse than Chrome does at 0.95x speed? Is there any hope for just tweaking it to handle this use case? That would of course be the ideal solution!

My thinking is that if scaletempo is going to be restored as the default, that should happen soon to avoid further confusion for people. Also, I'm basing this on the assumption that minor speed changes in the 0.85x - 1.2x range are far more common amongst mpv users, like they are for me, though I'm not sure if that's true. No matter the outcome, I'll also submit a PR for a docs update which adds a section explaining to users how to easily change the default to another audio scaler.

(@TiGR I do think how mpv lets you choose your "audio scaling" filter is a bit idiosyncratic and hard to discover, but I think that's for a different discussion!)

Apr 10 '21 23:04 varenc

If it works fine in Chrome, it may be because some versions of their code switched to resampling for speeds between 0.95 and 1.06. Personally I prefer to use resampling when close to normal speed, like when playing a 25 Hz movie on a 24 Hz display. mpv can sync to vsync with resampling, and can be configured so that 25 Hz movies are automatically resampled to 24 Hz. My only need for preserving pitch is when playing at a fast speed like > 1.5, and that is what I would have expected most users need scaletempo2/scaletempo for. Apparently that may not be true, but it cannot be determined without asking a lot of users.
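
As a rough worked example of the 25 Hz on 24 Hz case, the sketch below computes the speed factor and checks it against the 0.95 - 1.06 band quoted above. Both function names are hypothetical, and inclusive bounds are an assumption:

```c
#include <assert.h>
#include <stdbool.h>

/* Speed factor needed to show a video of video_fps on a display
 * refreshing at display_hz, one video frame per refresh. */
static double sync_speed(double video_fps, double display_hz)
{
    assert(video_fps > 0.0 && display_hz > 0.0);
    return display_hz / video_fps;
}

/* Band quoted above for switching to plain resampling; whether the
 * ends are inclusive is an assumption. */
static bool in_resample_band(double speed)
{
    return speed >= 0.95 && speed <= 1.06;
}
```

For 25 Hz content on a 24 Hz display the factor is 24/25 = 0.96, which falls inside that band. Plain resampling at 0.96 also lowers the pitch by the same factor, roughly 0.7 of a semitone.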

As I have started working on some fixes to scaletempo2 (not related to speeds near 1), it would be good to quickly decide which scaletempo version to use (there is one more, atempo, in ffmpeg), as maintaining several WSOLA implementations will just be confusing for users and additional work for maintainers. But it may be needed if one cannot satisfy all users' needs.

Apr 11 '21 09:04 DanOscarsson

When I built mpv from git, the first thing I noticed were rather severe audio glitches when listening to audiobooks at speeds 0.9 and 1.1 (depending on whether the narrator is too fast or too slow.) It sounds like a scratched CD where the CD player is skipping.

scaletempo produces perfect results at these playback speeds. You can't even tell the sound is slowed down or sped up. It really sounds like the narrator is just reading slower or faster.

There doesn't seem to be an option to tell mpv which filter to use, so I had to put af-add=scaletempo in my config. Unfortunately, this disabled mpv's automatic filter removal when the filter is not needed. The filter is always active and shows up in the OSD all the time.

Something like an --audio-speed-filter option would be very nice to have instead of hardcoding scaletempo2 in the mpv source code.

May 29 '21 12:05 realnc

[	 no-osd af add "@tempo:scaletempo" ; no-osd add speed "-0.1"
]	 no-osd af add "@tempo:scaletempo" ; no-osd add speed "+0.1"
BS	 no-osd af remove @tempo ; no-osd set speed 1.0

May 29 '21 13:05 garoto

[	 no-osd af add "@tempo:scaletempo" ; no-osd add speed "-0.1"
]	 no-osd af add "@tempo:scaletempo" ; no-osd add speed "+0.1"
BS	 no-osd af remove @tempo ; no-osd set speed 1.0

I can't see what speed I'm setting.

May 29 '21 15:05 realnc

@realnc please file a new issue, with logs and everything else which the template requests

If you can bisect it to find the exact first commit where the issue happens, it would be great info to add.

May 29 '21 16:05 avih

It sounds like a scratched CD where the CD player is skipping

@realnc could you please open a new issue for this? All the reports we have so far are about subjective quality, but what you're describing is new, and could very well be an actual bug - which none of us is able to reproduce.

So please file a new issue, with logs, preferably sample files, bisect if you can, etc. It would help us identify a yet-unknown bug.

May 31 '21 10:05 avih

I won't be too helpful commenting here, but I just want to confirm this report.

I upgraded to 0.34 and wondered why voices sound robotic at speed 1.1 until I figured out that apparently the default was changed to scaletempo2. I added af=scaletempo as an option in mpv 0.34 and apparently things went back to normal.

Unfortunately, I can't offer any samples and it may be subjective, but to me it was clear as day that something had changed and voices sounded very robotic with a lot of videos (but not all!). There seem to be some exceptions, but for me, scaletempo2 is way worse.

At least please don't remove scaletempo, for me scaletempo2 is very hard to bear for many files. I can try to see if I notice some kind of regularity such as audio codecs, but for me, there is something very wrong with scaletempo2.

Nov 14 '21 14:11 kevin-stuart

Use atempo instead.

Nov 14 '21 16:11 richardpl

I tried atempo. It sounds similar to scaletempo2 to me (i.e. robotic). It is also not documented in the mpv manual, so I did not get the idea to use this ffmpeg filter. For me scaletempo sounds best. Is it possible that there is some kind of bug in mpv that makes scaletempo2 or atempo sound much worse for only some people?

Nov 17 '21 09:11 kevin-stuart

@kevin-stuart I don't think there's any reason why the exact same media played with the exact same version of MPV would result in any difference in sound between people. That said, I opened this issue because I observed that 6 channel audio with scaletempo2 seemed to give worse results than scaletempo when there's a very minor speed adjustment. But the issue went away with most stereo audio. I suspect you're experiencing the same issue. If you can post a small sample that'll help people confirm.

Also I agree that atempo also performs well, but atempo isn't fully supported by mpv and it will eventually lead to an out of sync audio and video. But if you're just playing audio you might not care. I described the atempo issue and some very janky workarounds here: https://github.com/mpv-player/mpv/issues/4418#issuecomment-643099263 For me, scaletempo2 removes my need for atempo.

Given how long scaletempo2 has been the default at this point, unless a lot more people find this issue and concur, I think leaving it the default will be the least disruptive for the most folks. In the meantime just making it easy to switch back to scaletempo is an easy solution. Maybe adding that to the default input.conf could help. (though tough to decide on the key)

(I use $ af toggle scaletempo in my input.conf to make the $ key toggle it)

Nov 27 '21 22:11 varenc

You are right, I observed my problems with scaletempo2 with 6 channel audio. I mainly use 1.1 as speedup and scaletempo2 and atempo sound bad for me with this setup. I have set scaletempo in my config. I just hope that scaletempo2 is improved in the future and that scaletempo is not removed until then. For me, scaletempo2 became the new default only very recently when I upgraded to 0.34

Nov 27 '21 23:11 kevin-stuart

I also noticed occasionally very bad sound with scaletempo2. Here's an example from a movie with 2 channel audio, comparing scaletempo and scaletempo2 at 1.1x and 1.21x speeds: scaletempo mpv test.zip

Jan 09 '22 00:01 dardoor

You might want to try out --af=scaletempo2=search-interval=50:window-size=40. I've tried the example from @dardoor (original (1x).opus) and it sounds great at various speeds (>1).

Aug 05 '22 01:08 christoph-heinrich

You might want to try out --af=scaletempo2=search-interval=50:window-size=40. I've tried the example from @dardoor (original (1x).opus) and it sounds great at various speeds.

It sounds horrible to me with speech at a speed of 0.94. Some words sound robotic, metallic and choppy. As a quick test, I was listening to this podcast:

https://www.youtube.com/watch?v=cnFubyqJ3Ro

Prime example is at the very beginning (0:00:45) where he says "that the community left for us". If you set the speed to 0.94, scaletempo2 is atrocious. scaletempo is perfect.

Whether I use your parameters or not doesn't change anything for me in this regard.

Aug 05 '22 07:08 realnc

mpv --no-config --start=44 --speed=0.94 --af=<filter> 'https://www.youtube.com/watch?v=cnFubyqJ3Ro'
I don't hear a problem with scaletempo2, but maybe I'm so used to it that I don't even notice it anymore. test.zip

Admittedly I never actually listen to anything at <1 speed, so maybe I would have noticed something at some point if I did. (videos are always >=1.25 speed for me, but I also tested with smaller values >=1)

Aug 05 '22 15:08 christoph-heinrich

scaletempo2=search-interval=50:window-size=40 does sound good on the sample I posted, at 1.1 and 1.2 speeds, even a bit better than scaletempo, I think.

But it sounds bad on that last sample at 0.94, at least the "basically we" part. scaletempo2 with no parameters sounds better, and scaletempo even better.

(I also mostly play media at faster speeds and I would guess that's true for most people too.)

Oct 09 '22 20:10 dardoor

Interestingly: after I changed scaletempo2 to scaletempo in f_auto_filters.c:
p->sub.filter = mp_create_user_filter(f, MP_OUTPUT_CHAIN_AUDIO, "scaletempo", NULL);

With --af=scaletempo=speed=, the values none, both and tempo all sound about the same, like I expect tempo to sound. af=scaletempo=speed=pitch works as expected. But when I commented out that line, sound was played at 1x speed regardless of video speed. It seems the none and both values of the speed option do not work as described in the man page:

    both
        Scale both tempo and pitch.
    none
        Ignore speed changes.

Jul 24 '23 08:07 mars4science

Is this issue still valid on builds from current master? Also, please try rubberband using the build from #12479.

Sep 25 '23 17:09 llyyr

Still relevant for current builds. af=scaletempo considerably improves audio at 1.2x.

Oct 30 '23 04:10 StrangePeanut

Still relevant for current builds. af=scaletempo considerably improves audio at 1.2x.

Do you have an example?

Oct 30 '23 04:10 christoph-heinrich

Do you have an example?

@christoph-heinrich, I think I have a sample where the audio is distorted at 1.1x.

sample.zip

At least on my computer, I can hear noticeable distortions with the audio where the voices sound robotic, particularly at 00:22 with the line "... if you have to die to get it..." as well as at 00:40 with the line "...are going to attack..."

Increasing the speed to 1.2x makes the distortion less severe, but I can still hear it. When the speed is > 1.3x, the distortion seems to no longer be present.

When using af=scaletempo, there is no issue at any speed above 1.

Mar 15 '24 23:03 raziel711

@raziel711 You're right, scaletempo sounds much better than scaletempo2 at 1.1x speed on that sample. However, scaletempo2 sounds better than scaletempo at 2x speed (both aren't perfect though).

Because they don't use the same metric for finding a suitable overlap position, there will always be edge cases where one works better than the other. However, I've done a lot of testing comparing both with the same parameters (for https://github.com/mpv-player/mpv/pull/12487) and scaletempo2 is generally significantly better than scaletempo.

You can try playing with the parameters of each to see if you find ones that better suit your needs. Keep in mind that what scaletempo2 calls window-size is stride * overlap for scaletempo, in case you want to compare them. The reasoning behind the current defaults of scaletempo2 can be found in https://github.com/mpv-player/mpv/pull/12580
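
The window-size relation can be written out as a small sketch. The function name is hypothetical; it assumes stride is given in milliseconds and overlap as a fraction of the stride:

```c
#include <assert.h>

/* Hypothetical helper: the scaletempo2 window-size (ms) that matches a
 * given scaletempo configuration, using the stride * overlap relation
 * stated above. Assumes stride in ms and overlap as a fraction. */
static double scaletempo_window_ms(double stride_ms, double overlap)
{
    assert(stride_ms > 0.0 && overlap >= 0.0 && overlap <= 1.0);
    return stride_ms * overlap;
}
```

For illustration only, stride=80 with overlap=0.5 would correspond to window-size=40 under this relation.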

Mar 16 '24 01:03 christoph-heinrich

I ran into a file that had bad results with scaletempo2 (worse than the sample above) and noticed it had 6 audio channels, the same as the sample from @raziel711. Then I used channelmap=map=2-FL|2-FR to get the voices only and that sounded great, both on that sample and on my file. Finally I tried using scaletempo=overlap=0.5:search=40:stride=24 with my changes from #12487 and that also sounded good.

Looks to me like scaletempo2 has a problem with 6 channels for some reason (or probably anything >2), which sounds like a bug to me. I won't be able to have a look at the code this weekend, but maybe @ferreum has some ideas about what might be the cause?

Edit: I had a few minutes and didn't notice anything obvious in the code, but I reverted all changes to af_scaletempo2_internals.c that were made since it was introduced, and the problem still exists there.

Mar 16 '24 13:03 christoph-heinrich

Replacing the similarity measure with what I'm using in #12487 sounds a lot better, which suggests that the channels aren't handled correctly somewhere in that calculation, but I couldn't find the mistake so far.

Replacement diff

diff --git a/audio/filter/af_scaletempo2_internals.c b/audio/filter/af_scaletempo2_internals.c
index 534f4f672a..a41a71828f 100644
--- a/audio/filter/af_scaletempo2_internals.c
+++ b/audio/filter/af_scaletempo2_internals.c
@@ -93,17 +93,19 @@ static void multi_channel_moving_block_energies(
 }
 
 static float multi_channel_similarity_measure(
-    const float* dot_prod_a_b,
-    const float* energy_a, const float* energy_b,
-    int channels)
-{
-    const float epsilon = 1e-12f;
-    float similarity_measure = 0.0f;
-    for (int n = 0; n < channels; ++n) {
-        similarity_measure += dot_prod_a_b[n]
-            / sqrtf(energy_a[n] * energy_b[n] + epsilon);
+    float **a, int frame_offset_a,
+    float **b, int frame_offset_b,
+    int channels,
+    int num_frames)
+{
+    float distance = 0;
+    for (int c = 0; c < channels ; c++) {
+        float *source = b[c];
+        float *target = a[c];
+        for (int i = 0; i < num_frames; i++)
+            distance += fabs(target[i + frame_offset_a] - source[frame_offset_b + i]);
     }
-    return similarity_measure;
+    return -distance;
 }
 
 #if HAVE_VECTOR
@@ -229,18 +231,14 @@ static int decimated_search(
     const float *energy_target_block, const float *energy_candidate_blocks)
 {
     int num_candidate_blocks = search_segment_frames - (target_block_frames - 1);
-    float dot_prod [MP_NUM_CHANNELS];
     float similarity[3];  // Three elements for cubic interpolation.
 
     int n = 0;
-    multi_channel_dot_product(
+    similarity[0] = multi_channel_similarity_measure(
         target_block, 0,
         search_segment, n,
         channels,
-        target_block_frames, dot_prod);
-    similarity[0] = multi_channel_similarity_measure(
-        dot_prod, energy_target_block,
-        &energy_candidate_blocks[n * channels], channels);
+        target_block_frames);
 
     // Set the starting point as optimal point.
     float best_similarity = similarity[0];
@@ -251,14 +249,11 @@ static int decimated_search(
         return 0;
     }
 
-    multi_channel_dot_product(
+    similarity[1] = multi_channel_similarity_measure(
         target_block, 0,
         search_segment, n,
         channels,
-        target_block_frames, dot_prod);
-    similarity[1] = multi_channel_similarity_measure(
-        dot_prod, energy_target_block,
-        &energy_candidate_blocks[n * channels], channels);
+        target_block_frames);
 
     n += decimation;
     if (n >= num_candidate_blocks) {
@@ -268,15 +263,11 @@ static int decimated_search(
     }
 
     for (; n < num_candidate_blocks; n += decimation) {
-        multi_channel_dot_product(
+        similarity[2] = multi_channel_similarity_measure(
             target_block, 0,
             search_segment, n,
             channels,
-            target_block_frames, dot_prod);
-
-        similarity[2] = multi_channel_similarity_measure(
-            dot_prod, energy_target_block,
-            &energy_candidate_blocks[n * channels], channels);
+            target_block_frames);
 
         if ((similarity[1] > similarity[0] && similarity[1] >= similarity[2]) ||
             (similarity[1] >= similarity[0] && similarity[1] > similarity[2]))
@@ -323,7 +314,6 @@ static int full_search(
     const float* energy_candidate_blocks)
 {
     // int block_size = target_block->frames;
-    float dot_prod [sizeof(float) * MP_NUM_CHANNELS];
 
     float best_similarity = -FLT_MAX;//FLT_MIN;
     int optimal_index = 0;
@@ -332,12 +322,10 @@ static int full_search(
         if (in_interval(n, exclude_interval)) {
             continue;
         }
-        multi_channel_dot_product(target_block, 0, search_block, n, channels,
-            target_block_frames, dot_prod);
 
         float similarity = multi_channel_similarity_measure(
-            dot_prod, energy_target_block,
-            &energy_candidate_blocks[n * channels], channels);
+            target_block, 0, search_block, n, channels,
+            target_block_frames);
 
         if (similarity > best_similarity) {
             best_similarity = similarity;

Mar 18 '24 16:03 christoph-heinrich

I think the problem is the stuff with energies. I don't know why they screw things up, and I wasn't able to find any mistakes in their calculation, but simply removing them makes things sound way better.

diff --git a/audio/filter/af_scaletempo2_internals.c b/audio/filter/af_scaletempo2_internals.c
index 534f4f672a..ee78940ba1 100644
--- a/audio/filter/af_scaletempo2_internals.c
+++ b/audio/filter/af_scaletempo2_internals.c
@@ -100,8 +100,7 @@ static float multi_channel_similarity_measure(
     const float epsilon = 1e-12f;
     float similarity_measure = 0.0f;
     for (int n = 0; n < channels; ++n) {
-        similarity_measure += dot_prod_a_b[n]
-            / sqrtf(energy_a[n] * energy_b[n] + epsilon);
+        similarity_measure += dot_prod_a_b[n];
     }
     return similarity_measure;
 }

However there is no way the chromium devs went through the effort of doing that energy stuff if it didn't create better results, but I've been looking for hours and can't find any mistakes.

Test it and if enough people agree it's better without energy, then we can remove that.

Mar 18 '24 21:03 christoph-heinrich

However there is no way the chromium devs went through the effort of doing that energy stuff if it didn't create better results, but I've been looking for hours and can't find any mistakes.

Removing the energy calculation seemingly bypasses a large part of the algorithm (see decimated_search), namely energy_target_block and &energy_candidate_blocks[n * channels] passed to the similarity measure become pointless. At that point I'd be more inclined to believe something is doing a whoopsie with those two values than that Chromium devs wrote a whole lot of complicated code for negative benefit.

I don't think it's a numerical precision issue, when I looked at the values in that term (and what the standard says about fsqrt's precision) it seemed fine.

I've printf'd some values along the way and they're not obscenely huge, but plotting the difference between the similarity_measure result with energy and the one without yields something I guess:

This seems not super out of whack and not biased to specifically one side so if this is a bug in the implementation rather than a bad design of the algorithm then it'll be a pain to find. Maybe I should repeat this with each channel isolated (though adding an --audio-channels=mono doesn't seem to affect the robotic-ness at all).

EDIT: And here's the absolute difference for the entire sample. It'd probably be more meaningful as like a fraction of either of the values, though by having seen quite a few numbers in my time I can tell it is actually fairly big:
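
The fraction mentioned above could be computed along these lines. This is only a sketch: the function names are hypothetical, and dividing by the larger of the two magnitudes is one possible choice among several:

```c
#include <assert.h>

/* Absolute value without pulling in the math library. */
static double abs_value(double x)
{
    return x < 0.0 ? -x : x;
}

/* Difference between two similarity scores as a fraction of the larger
 * magnitude; defined as 0 when both scores are zero. */
static double relative_difference(double with_energy, double without_energy)
{
    double a = abs_value(with_energy);
    double b = abs_value(without_energy);
    double scale = a > b ? a : b;
    if (scale == 0.0)
        return 0.0;
    return abs_value(with_energy - without_energy) / scale;
}
```

Scores of 2.0 and 1.0 give 0.5, meaning the two measures differ by half of the larger value.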

Mar 18 '24 23:03 CounterPillow
