
same (mp3) file, different name ... different output: mp3 versus koz

Open sanderjo opened this issue 3 years ago • 5 comments


Make a copy:

sander@brixit:~/git/puremagic$ cp test/resources/audio/test.mp3 test/resources/audio/testblabla.bla

Verify it's there with same size:

sander@brixit:~/git/puremagic$ ll test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
-rw-rw-r-- 1 sander sander 26989 jun 11 10:36 test/resources/audio/testblabla.bla
-rw-rw-r-- 1 sander sander 26989 jun 11 10:35 test/resources/audio/test.mp3

... and same contents:

sander@brixit:~/git/puremagic$ md5sum test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
3de8d656af21a836f2ba4f2949feb77c  test/resources/audio/test.mp3
3de8d656af21a836f2ba4f2949feb77c  test/resources/audio/testblabla.bla

... but puremagic says the first one is mp3 and the second is ... koz?

sander@brixit:~/git/puremagic$ python3 -m puremagic test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
'test/resources/audio/test.mp3' : .mp3
'test/resources/audio/testblabla.bla' : .koz

Is this wanted behaviour, or a bug?

PS: the Linux file command reports both correctly as mp3:

sander@brixit:~/git/puremagic$ file  test/resources/audio/test.mp3 test/resources/audio/testblabla.bla
test/resources/audio/test.mp3:       Audio file with ID3 version 2.3.0, contains:MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo
test/resources/audio/testblabla.bla: Audio file with ID3 version 2.3.0, contains:MPEG ADTS, layer III, v1, 128 kbps, 44.1 kHz, JntStereo

sanderjo avatar Jun 11 '21 08:06 sanderjo

Ah, thanks to @safihre

>>> import puremagic
>>> filename = "test/resources/audio/testblabla.bla"
>>> bla = puremagic.magic_file(filename)

>>> for i in bla:
...     print(i)
...
PureMagicWithConfidence(byte_match=b'ID3\x03\x00\x00\x00', offset=0, extension='.koz', mime_type='', name='Sprint Music Store audio', confidence=0.7)
PureMagicWithConfidence(byte_match=b'ID3', offset=0, extension='.mp3', mime_type='audio/mpeg', name='MPEG-1 Audio Layer 3 (MP3) audio file', confidence=0.3)

>>> bla[0].extension
'.koz'
>>> bla[1].extension
'.mp3'

So ... puremagic gives .koz a confidence of 0.7 (because of the longer matching byte string?) and .mp3 a confidence of 0.3.

In the real world, I would say mp3 is much more likely than koz. So each extension would have a Real World Probability. Wild guess:

.mp3: 99%
.koz: 1%

So based on that, mp3 would be more likely for this case. So, I would need to interpret / combine the pure puremagic indication with Real World Probabilities.
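One way to combine the two signals is to multiply each confidence by a prior for the extension and re-rank. This is only a minimal sketch: the `Match` tuple is a stand-in for the objects returned by `puremagic.magic_file()`, and the prior table and `DEFAULT_PRIOR` are hypothetical values based on the wild guess above, not measured data.

```python
from collections import namedtuple

# Stand-in for puremagic's result objects; only the two fields used here.
Match = namedtuple("Match", ["extension", "confidence"])

# Hypothetical real-world priors (the wild guess above), not measured data.
REAL_WORLD_PRIOR = {".mp3": 0.99, ".koz": 0.01}
DEFAULT_PRIOR = 0.01  # assumed prior for extensions missing from the table


def rank_by_weighted_confidence(matches, priors=REAL_WORLD_PRIOR):
    """Sort matches by confidence multiplied by the real-world prior."""
    def score(match):
        return match.confidence * priors.get(match.extension, DEFAULT_PRIOR)
    return sorted(matches, key=score, reverse=True)


# The two matches reported for testblabla.bla
matches = [Match(".koz", 0.7), Match(".mp3", 0.3)]
best = rank_by_weighted_confidence(matches)[0]
print(best.extension)
```

With these numbers .mp3 scores 0.3 * 0.99 and .koz scores 0.7 * 0.01, so .mp3 ends up first.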

sanderjo avatar Jun 11 '21 09:06 sanderjo

Real World Probability: common extensions on https://www.computerhope.com/issues/ch001789.htm

>>> mylikelyextlist = [ '3g2','3gp','7z','ai','aif','apk','arj','asp','aspx','avi','bak','bat','bin','bin','bmp','c','cab','cda','cer','cfg','cfm','cgi','cgi','cgi','class','com','cpl','cpp','cs','css','csv','cur','dat','db','dbf','deb','dll','dmg','dmp','doc','docx','drv','email','eml','emlx','exe','flv','fnt','fon','gadget','gif','h','h264','htm','html','icns','ico','ico','ini','iso','jar','java','jpeg','jpg','js','jsp','key','lnk','log','m4v','mdb','mid','midi','mkv','mov','mp3','mp4','mpa','mpeg','mpg','msg','msi','msi','odp','ods','odt','oft','ogg','ost','otf','part','pdf','php','php','pkg','pl','pl','pl','png','pps','ppt','pptx','ps','psd','pst','py','py','py','rar','rm','rpm','rss','rtf','sav','sh','sql','svg','swf','swift','sys','tar','tar','gz','tex','tif','tiff','tmp','toast','ttf','txt','vb','vcd','vcf','vob','wav','wma','wmv','wpd','wpl','wsf','xhtml','xls','xlsm','xlsx','xml','z','zip' ]

>>> 'mp3' in mylikelyextlist 
True
>>> 'koz' in mylikelyextlist 
False

List generated like this:

sander@brixit:~$ lynx --dump 'https://www.computerhope.com/issues/ch001789.htm'  | grep "\* \." | awk -F\- '{ print $1 }' | tr -d "*" | sed -e 's/and/\n/g' | sed -e 's/or/\n/g'  | tr -d " " | sort | sed -e "s/\./','/g"  | tr -d '\n'


','3g2','3gp','7z','ai','aif','apk','arj','asp','aspx','avi','bak','bat','bin','bin','bmp','c','cab','cda','cer','cfg','cfm','cgi','cgi','cgi','class','com','cpl','cpp','cs','css','csv','cur','dat','db','dbf','deb','dll','dmg','dmp','doc','docx','drv','email','eml','emlx','exe','flv','fnt','fon','gadget','gif','h','h264','htm','html','icns','ico','ico','ini','iso','jar','java','jpeg','jpg','js','jsp','key','lnk','log','m4v','mdb','mid','midi','mkv','mov','mp3','mp4','mpa','mpeg','mpg','msg','msi','msi','odp','ods','odt','oft','ogg','ost','otf','part','pdf','php','php','pkg','pl','pl','pl','png','pps','ppt','pptx','ps','psd','pst','py','py','py','rar','rm','rpm','rss','rtf','sav','sh','sql','svg','swf','swift','sys','tar','tar','gz','tex','tif','tiff','tmp','toast','ttf','txt','vb','vcd','vcf','vob','wav','wma','wmv','wpd','wpl','wsf','xhtml','xls','xlsm','xlsx','xml','z','zip

sanderjo avatar Jun 11 '21 11:06 sanderjo

Sorry for the late reply; it seems I'm not getting notifications for this repo even though I'm watching it.

To explain the behavior you were seeing at first: you are right, koz had the higher confidence, so it won when the file extension didn't match. However, if a signature matches both the file extension and the content, it is given the highest confidence.

Definitely something to consider for real world scenarios. I may have to check how the file command handles cases like that.
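As a rough illustration of that rule, here is a toy scoring function. It is not puremagic's actual code: the function name, the 0.9 value for an extension match and the cap at 8 bytes are assumptions, chosen only so the results agree with the 0.7 and 0.3 shown earlier in this thread.

```python
def toy_confidence(byte_match, match_extension, file_extension):
    """Toy rule: longer byte matches score higher, an extension match wins."""
    if match_extension == file_extension:
        return 0.9  # assumed value for "highest confidence"
    # Assumed: one tenth per matched byte, capped below the extension bonus.
    return min(len(byte_match), 8) / 10


# Renamed copy: the extension .bla matches neither signature, so length decides.
print(toy_confidence(b"ID3\x03\x00\x00\x00", ".koz", ".bla"))
print(toy_confidence(b"ID3", ".mp3", ".bla"))
# Original name: the .mp3 signature also matches the extension and wins.
print(toy_confidence(b"ID3", ".mp3", ".mp3"))
```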

cdgriffith avatar Sep 14 '21 14:09 cdgriffith

Thanks for replying.

I've implemented it in SABnzbd like this:

  • define a list of extensions that are likely in the real world. For example: mp3 is, koz isn't
  • if the current file extension is already likely: don't check, don't change
  • if the extension is not likely: let puremagic return all possible extensions ... and choose the extension that is in the list of likely extensions.

See https://github.com/sabnzbd/sabnzbd/blob/9b870e64d252ef9b7521269844fb6250a0d5728c/sabnzbd/utils/file_extension.py#L257-L263f
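The three steps above can be sketched roughly as follows. This is not the SABnzbd code behind the link: `choose_extension`, the shortened `LIKELY_EXTENSIONS` set and the fallback of keeping the current extension when nothing likely is found are all assumptions for illustration. In real use the candidates could come from something like `[m.extension for m in puremagic.magic_file(filename)]`.

```python
import os

# Shortened, illustrative list; a real one would be much longer.
LIKELY_EXTENSIONS = frozenset({".mp3", ".mp4", ".mkv", ".zip", ".pdf"})


def choose_extension(filename, candidate_extensions, likely=LIKELY_EXTENSIONS):
    """Pick an extension for filename from puremagic-style candidates."""
    current = os.path.splitext(filename)[1].lower()
    # Step 2: an already-likely extension is left alone.
    if current in likely:
        return current
    # Step 3: take the first candidate that is a likely extension.
    for ext in candidate_extensions:
        if ext.lower() in likely:
            return ext.lower()
    # Assumed fallback: nothing likely was found, keep what we have.
    return current


# Candidates in the order puremagic reported them for the renamed file.
print(choose_extension("testblabla.bla", [".koz", ".mp3"]))
```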

sanderjo avatar Sep 14 '21 15:09 sanderjo

This affects ID3v2.3.0 files, which (sometimes) share the same header as .koz. Basically, with ID3v2 you have:

  • ID3 for first three bytes
  • A version: 0x0300 in the case of the example above
  • Some single bit flags that represent various file settings (experimental, extended header, unsync)
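To make the layout concrete, here is a small sketch that decodes the 10-byte ID3v2 header: the three fields listed above followed by a 4-byte "syncsafe" tag size (7 useful bits per byte). The function name and the sample bytes are made up for illustration, and the flag bit positions follow the ID3v2.3 layout as I understand it.

```python
def parse_id3v2_header(data):
    """Decode a 10-byte ID3v2 header; return None if there isn't one."""
    if len(data) < 10 or data[:3] != b"ID3":
        return None
    major, revision, flags = data[3], data[4], data[5]
    # Tag size is a syncsafe integer: only the low 7 bits of each byte count.
    size = 0
    for byte in data[6:10]:
        size = (size << 7) | (byte & 0x7F)
    return {
        "version": (major, revision),
        "unsynchronisation": bool(flags & 0x80),
        "extended_header": bool(flags & 0x40),
        "experimental": bool(flags & 0x20),
        "tag_size": size,
    }


# Made-up header: ID3v2.3.0, no flags set, tag size bytes 00 00 02 01.
header = parse_id3v2_header(b"ID3\x03\x00\x00\x00\x00\x02\x01")
print(header)
```

Read this way, the 7-byte .koz signature b'ID3\x03\x00\x00\x00' is "ID3", version 3.0, a zero flags byte and a zero first size byte, which would explain why an ordinary ID3v2.3 MP3 with no flags and a small tag matches it.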

To improve confidence, we can do a couple of things:

  • Make longer confidence matches, much like I did for .pcx files in #50. This would obviously increase the number of entries, as you would have to account for all variants of the flags
  • Add some multi match magic, for example if a file still has a TAG header in the last 128 bytes for old v1.1 tags
  • Try and pin down an additional byte or two that's fixed in place elsewhere, for example there almost always seems to be a P at bytes 10 or 11
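The multi-match idea from the second bullet can be sketched like this: check for ID3 at the start and for TAG at the start of the last 128 bytes. Both function names are hypothetical, and this is a standalone illustration rather than how puremagic implements multi-part matches.

```python
import io


def has_id3v1_trailer(stream):
    """True if the last 128 bytes of a seekable binary stream start with TAG."""
    stream.seek(0, io.SEEK_END)
    if stream.tell() < 128:
        return False  # too short to hold a 128-byte trailer
    stream.seek(-128, io.SEEK_END)
    return stream.read(3) == b"TAG"


def looks_like_tagged_mp3(stream):
    """True if the stream has ID3 at offset 0 and TAG at offset -128."""
    stream.seek(0)
    return stream.read(3) == b"ID3" and has_id3v1_trailer(stream)


# Made-up data: an ID3v2.3 header, filler, then a 128-byte TAG trailer.
fake = b"ID3\x03\x00\x00\x00" + b"\x00" * 200 + b"TAG" + b"\x00" * 125
print(looks_like_tagged_mp3(io.BytesIO(fake)))
```

For a real file the same functions would take a handle from `open(path, "rb")`.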

I'm reading/playing around to see what would give the best consistent results.

NebularNerd avatar Apr 28 '24 08:04 NebularNerd

Adding a longer versioned match for .mp3 and adding TAG at -128 gives us 80% confidence, beating .koz by 10%. This only works if the file has tags; untagged files would still match .koz.

This is from my own Python script purely for confidence testing.

(01) Adamski - Killer.mp3
Most likely match:
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.3.0 audio file
Confidence:    80.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        -128
Bytes Matched: b'ID3\x03\x00TAG'
Hex:           4944 3303 0054 4147
String:        ID3TAG

Alternate match #1
Format:        Sprint Music Store audio
Confidence:    70.0%
Extension:     .koz
MIME:
Offset:        0
Bytes Matched: b'ID3\x03\x00\x00\x00'
Hex:           4944 3303 0000 00
String:        ID3

Alternate match #2
Format:        MPEG-1 Audio Layer 3 (MP3) audio file
Confidence:    60.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        -128
Bytes Matched: b'ID3TAG'
Hex:           4944 3354 4147
String:        ID3TAG

Alternate match #3
Format:        MPEG-1 Audio Layer 3 (MP3) ID3v2.3.0 audio file
Confidence:    50.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3\x03\x00'
Hex:           4944 3303 00
String:        ID3

Alternate match #4
Format:        MPEG-1 Audio Layer 3 (MP3) audio file
Confidence:    30.0%
Extension:     .mp3
MIME:          audio/mpeg
Offset:        0
Bytes Matched: b'ID3'
Hex:           4944 33
String:        ID3

Let's see what else we can match against in case TAG is not present. 🤔

NebularNerd avatar May 01 '24 16:05 NebularNerd

Updated in 1.23, thanks @NebularNerd !

cdgriffith avatar May 03 '24 15:05 cdgriffith