Bright DPX images make MP3 identification regex extremely slow
Summary
Identification gets stuck for over a minute on bright production DPX images while evaluating this pattern:
trying b'(?s)\\xff[\\xfa\\xfb\\xf2\\xf3][\\x10-\\xeb].{46,1439}\\xff[\\xfa\\xfb\\xf2\\xf3][\\x10-\\xeb].{46,1439}\\Z'
The fmt/134 (MP3) signatures are defined at: https://raw.githubusercontent.com/openpreserve/fido/refs/heads/main/fido/conf/formats-v116.xml
$ curl -s https://raw.githubusercontent.com/openpreserve/fido/refs/heads/main/fido/conf/formats-v116.xml | grep -A 16 '<puid>fmt/134</puid>'
<puid>fmt/134</puid>
<mime>audio/mpeg</mime>
<name>MPEG 1/2 Audio Layer 3</name>
<version />
<alias>MP3</alias>
<pronom_id>687</pronom_id>
<extension>mp3</extension>
<apple_uti>public.mp3</apple_uti>
<signature>
<name>MPEG-1 Audio Layer 3 with ID3v2 Tag</name>
<note>Regularly-spaced frame headers should always be discoverable near EOF. An ID3v1 tag of up to 355 bytes may be present at EOF.</note>
<pattern>
<position>EOF</position>
<pronom_pattern>FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{46-1439}FFFB[10:EB]{47-1795}</pronom_pattern>
<regex>(?s)\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{46,1439}\xff\xfb[\x10-\xeb].{47,1795}\Z</regex>
</pattern>
<pattern>
Brightness sounds like a red herring, but bright pixel values are what cause this regex to partially match, and those partial matches make it slow.
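A minimal sketch of why bright pixels line up with the signature. It assumes 8-bit RGB pixel data stored as packed R, G, B bytes, which this issue does not confirm, and `count_header_candidates` is a hypothetical helper, not part of FIDO. It only counts how often the frame-header prefix from the signature above matches; it does not measure the backtracking itself.

```python
import re

# Frame-header prefix copied from the fmt/134 signature quoted above.
HEADER = re.compile(rb"\xff\xfb[\x10-\xeb]")


def count_header_candidates(data: bytes) -> int:
    """Count non-overlapping positions where the header prefix matches."""
    return sum(1 for _ in HEADER.finditer(data))


# Assumption: each pixel is stored as three packed bytes (R, G, B).
bright_pixels = bytes.fromhex("fffbeb") * 1000  # the color used for the repro file
dark_pixels = bytes.fromhex("101010") * 1000

print(count_header_candidates(bright_pixels))  # every pixel looks like a frame header
print(count_header_candidates(dark_pixels))    # no candidates at all
```

With a candidate header at every pixel, each `.{46,1439}` gap can end at many different positions, so the regex engine likely has a very large number of combinations to try before it can rule out a match at a given start offset.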
I'm looking at improving this but any and all feedback would be much appreciated.
Steps to reproduce
I have the production images and can reproduce this locally. I'm currently trying to generate similar images with random data that makes this regex slow, but haven't gotten it to work yet. I'll attach one such image and rough instructions to create it below:
# Create 4k all white png in GIMP
# Convert it to dpx:
$ ffmpeg -i white.png white.dpx
# Create 0x17bb00 * 0x10 bytes of random xxd formatted data:
$ cat rand.py
import os

def rand_line(offset):
    """
    <hex-offset>: (f[0-f]<rand> <rand><rand> ){4}
    """
    line = f"{offset:08x}: "
    rand = os.urandom(4)
    byte1 = f"{rand[0] | 0xf0:02x}"
    byte2 = f"{rand[1]:02x}"
    byte3 = f"{rand[2]:02x}"
    byte4 = f"{rand[3]:02x}"
    for i in range(4):
        line += f"{byte1}{byte2} {byte3}{byte4} "
    return f"{line[:-1]}\n"

with open("rand.xxd", "w") as outfile:
    for i in range(0x17bb00):
        offset = i * 0x10
        if offset % 1024**2 == 0:
            print(f"{offset:08x}/{0x17bb00*0x10:08x}\r", end="")
        outfile.write(rand_line(offset))
# Create the binary
$ xxd -r rand.xxd > rand.bin
# Remove all image data from white.dpx
# i.e., all ff bytes after offset 0x680
# 00000680: ffff ffff ffff ffff ffff ffff ffff ffff ................
$ vim -b white.dpx
# Concatenate new random image data
$ cat white.dpx rand.bin > rand.dpx
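The xxd round trip could probably be skipped by writing the bytes directly. The sketch below produces data with the same shape as rand.py (one random 4-byte group per 16-byte row, repeated four times, with the high nibble of the first byte forced to f). `rand_row` and `rand_blob` are hypothetical names, and I have not verified that the resulting file behaves any differently from the xxd version. For reference, 0x17bb00 * 0x10 is 24,883,200 bytes, which equals 3840 * 2160 * 3, i.e. one 4k frame at 3 bytes per pixel.

```python
import os


def rand_row() -> bytes:
    """Return 16 bytes: one random 4-byte group repeated four times.

    The first byte of the group has its high nibble forced to f,
    matching the f[0-f] prefix that rand.py writes.
    """
    rand = os.urandom(4)
    group = bytes([rand[0] | 0xF0]) + rand[1:]
    return group * 4


def rand_blob(rows: int) -> bytes:
    """Concatenate `rows` independent 16-byte rows."""
    return b"".join(rand_row() for _ in range(rows))
```

Writing `rand_blob(0x17bb00)` to rand.bin in binary mode should then stand in for the rand.xxd and `xxd -r` steps.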
This patch fixes the performance, but doesn't actually address the slow regex:
Hardly ideal, but we'll use this kind of hot patch for now to fix our pipelines. I'll try to find some time and generate a test file to reproduce this issue. Maybe some more random data after the matching fff[0-f] bytes.
EDIT: I was able to create an image that reproduces this with some gimp, color picker, brush, ffmpeg && hex editor action. I'll attach one such image here soon.
Script to reproduce:
from fido.fido import Fido
fido = Fido()
fido.identify_file("repro.dpx")
$ time python3 test-slow-fido.py > /dev/null 2>&1
________________________________________________________
Executed in  136.69 secs    fish           external
   usr time  135.99 secs  507.00 micros  135.99 secs
   sys time    0.07 secs  229.00 micros    0.07 secs
With this file:
This file was generated with GIMP: create a 4k canvas, use the color picker to choose color fffbeb, fill the whole canvas with that color, and export it as a PNG with compression level 0. Finally, convert the PNG into a DPX with ffmpeg: $ ffmpeg -i repro.png repro.dpx
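For anyone without GIMP at hand, the solid-color PNG can probably be produced with the Python standard library instead. This is a sketch, not the method used to create the attached file: `png_chunk` and `solid_png` are hypothetical helpers, and the 3840x2160 size mentioned below is an assumption about what "4k" means here (it is consistent with the 0x17bb00 * 0x10 byte count used earlier in this issue).

```python
import struct
import zlib


def png_chunk(kind: bytes, payload: bytes) -> bytes:
    """Build one PNG chunk: length, type, payload, CRC over type + payload."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def solid_png(width: int, height: int, rgb: bytes) -> bytes:
    """Return an 8-bit RGB PNG filled with a single color."""
    # IHDR: width, height, bit depth 8, color type 2 (RGB),
    # compression method 0, filter method 0, no interlacing.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Every scanline starts with filter type 0 (None), then the raw pixels.
    row = b"\x00" + rgb * width
    # Level 0 stores the data uncompressed, like "0 compression" in GIMP.
    idat = zlib.compress(row * height, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + png_chunk(b"IHDR", ihdr)
        + png_chunk(b"IDAT", idat)
        + png_chunk(b"IEND", b"")
    )
```

Writing `solid_png(3840, 2160, bytes.fromhex("fffbeb"))` to repro.png in binary mode and then running the ffmpeg command above should give a comparable DPX, though I have only described the GIMP route in this issue.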
Hi @jukuisma, thanks very much for reporting this and for the work behind the scenes. We're a little behind on FIDO and it's due some attention soon. We will make sure that this gets picked up in the first quarter of next year and will incorporate your changes if all is well then.