
Consider multiple candidates when searching for audio substream

Open tp7 opened this issue 11 years ago • 3 comments

Right now, when Sushi searches for an audio substream in the destination audio, it only considers the best match, even though OpenCV calculates the diff value for every possible candidate. There is no reason not to use this information for more accurate postprocessing.

The idea is to remember several of the best candidates so that during postprocessing we can check whether replacing the selected shift with one of the other candidates would make the value more similar to its surroundings.

Working implementation below.

import numpy as np

# result[0] is the one-dimensional array of diff values that OpenCV
# computed for every possible position of the substream.
splits = np.array_split(result[0], 50)
len_so_far = 0
candidates = []
for split in splits:
    # Best (lowest-diff) position within this range.
    min_index = np.argmin(split)
    candidates.append((min_index + len_so_far, split[min_index]))
    len_so_far += len(split)
# Keep the 10 positions with the lowest diff values overall;
# after sorting, candidates[0] is the global best match.
candidates.sort(key=lambda x: x[1])
candidates = candidates[:10]

We split the entire diff array into 50 ranges, find the best match in each of them, and then select the 10 best matches from those 50. Since the list is sorted by diff value, the best match in the entire array ends up in candidates[0].
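
A minimal sketch of the postprocessing this enables, assuming each event keeps its (shift, diff) candidate list from the code above and neighbor_shift is the shift chosen for the surrounding events (these names are hypothetical, not Sushi's actual API):

def pick_candidate(candidates, neighbor_shift, max_diff_ratio=1.2):
    # candidates: list of (shift, diff) pairs sorted by diff, as produced above.
    # Prefer the candidate closest to the neighbors' shift, as long as its
    # diff value is not much worse than the best one (max_diff_ratio is an
    # assumed threshold that would need tuning).
    best_diff = candidates[0][1]
    plausible = [c for c in candidates if c[1] <= best_diff * max_diff_ratio]
    return min(plausible, key=lambda c: abs(c[0] - neighbor_shift))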

While this does find the correct shift for some of the tests, it still fails on many problematic cases with a lot of silence in the audio stream. Better ways of improving search accuracy might be preferable.

tp7 avatar Nov 02 '14 14:11 tp7

Regarding the problem with silent segments, I think in most cases they are typesetting lines. If I'm not wrong, typesetting lines are grouped as one search group.

To handle these silent segments, how about shifting the search group and snapping it to the nearest end keyframe, since most typesetting lines end at keyframes? Or, to be safer, calculate the interval of frames between the end of the search group and the nearest end keyframe, and use that for shifting (a rough sketch follows).
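
A minimal sketch of that snapping idea, assuming keyframe_times is a sorted list of keyframe timestamps and max_kf_distance is a hypothetical tolerance (none of these names come from Sushi itself):

import bisect

def snap_to_nearest_keyframe(group_end, keyframe_times, max_kf_distance):
    # Find the keyframes immediately before and after the group end.
    idx = bisect.bisect_left(keyframe_times, group_end)
    neighbors = keyframe_times[max(0, idx - 1):idx + 1]
    if not neighbors:
        return group_end
    nearest = min(neighbors, key=lambda kf: abs(kf - group_end))
    # Only snap when the keyframe is close enough to be plausible.
    if abs(nearest - group_end) <= max_kf_distance:
        return nearest
    return group_end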

shinchiro avatar Apr 06 '15 12:04 shinchiro

Sushi already does something like that in the keyframe correction section.

Basing initial search on keyframes is not feasible for two main reasons:

  1. We might not have keyframes at all
  2. Scene length might change (e.g. different IVTC, redrawing)

Plus, it'd require significant changes to our search algorithm, and it's not clear how to pick the appropriate scene (say, if you have two candidate scenes, each 50 frames long, near each other).

I think some audio preprocessing or merging lines into even larger search groups might yield much better results for typesetting, although right now I don't know of any specific way to do so (other than maybe merging all overlapping/adjacent lines, roughly as sketched below).
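
A minimal sketch of that merging idea, assuming events are (start, end) time pairs and max_gap is a hypothetical threshold for how close two lines must be to count as adjacent:

def merge_adjacent(events, max_gap=0.0):
    # events: list of (start, end) times; merge any that overlap
    # or sit within max_gap seconds of each other.
    merged = []
    for start, end in sorted(events):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged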

tp7 avatar Apr 06 '15 13:04 tp7

Since silent segments mostly indicate slow-motion scenes, I think the nearest keyframe could be used as a last resort when the audio search fails. But I think this doesn't apply in some rare cases.

Linking search groups together is not a bad idea either. Every search group would hold references to the search groups before and after it, so if its own audio search fails, it can refer to the preceding group and apply its shift (roughly as sketched below).
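
A minimal sketch of that linking idea (SearchGroup and its fields are hypothetical, not Sushi's actual data structures):

class SearchGroup:
    def __init__(self, events, prev=None):
        self.events = events
        self.prev = prev           # preceding search group, if any
        self.shift = None          # set when the audio search succeeds
        self.search_failed = False

    def effective_shift(self):
        # Fall back to the preceding group's shift when this group's
        # own audio search failed; walks back through consecutive failures.
        if self.search_failed and self.prev is not None:
            return self.prev.effective_shift()
        return self.shift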

shinchiro avatar Apr 06 '15 17:04 shinchiro