filetype.py
Use a file signature table to speed up file type recognition
I think that pre-building a dict with all the magic signatures for the file header lookup is more time-efficient than calling each type object in turn to find a matching file header.
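The idea above can be sketched as follows. This is a hypothetical illustration, not the library's actual implementation: signatures are indexed by their leading byte, so a lookup only tests the few candidates sharing that byte instead of trying every matcher. The signature set shown is a small illustrative subset.

```python
from collections import defaultdict

# Illustrative subset of magic signatures (real tables are much larger).
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"%PDF-": "pdf",
}

# Pre-built index: first signature byte -> list of (signature, type) candidates.
INDEX = defaultdict(list)
for sig, kind in SIGNATURES.items():
    INDEX[sig[0]].append((sig, kind))

def guess(buf: bytes):
    """Return the matching type name for a header buffer, or None."""
    if not buf:
        return None
    for sig, kind in INDEX.get(buf[0], ()):
        if buf.startswith(sig):
            return kind
    return None
```

Whether this beats a plain loop over matchers in practice is exactly what the benchmark discussed below the thread would have to measure.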
That might be true, but without metrics we don't know. Also, IMO the library is fast enough for 99% of use cases. Why do you care about top performance here? What's currently impacting you?
I tried with a cluster of several thousand files and performance wasn't so great, but, I admit, mine was very much an edge case. :p
Interesting... my impression is that this is a CPython limitation more than an implementation performance issue, but we can try improving things. If you can lead this by preparing some performance test suite scenarios that I can easily reproduce, that would be great.
Hi, preparing a general performance test suite is a bit difficult here because of the nature of the physical medium the test runs on. If we try to process that many files in parallel from a single HDD, its I/O limit is reached very quickly, but if the files were split across several SSDs, the result would still be more or less limited by drive performance.
I would suggest that a performance test for this scenario should not involve any I/O at all. Involving I/O would make the measurement inaccurate, and therefore irrelevant.
Instead, the performance suite should only cover the boundaries of the actual code logic being measured. In this context, that means passing a binary buffer representing the file signature, up to 256 bytes. That's all you need; no disk I/O impact here.
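An I/O-free micro-benchmark along those lines might look like this sketch: a synthetic 256-byte header is kept in memory and fed to the matcher under test via `timeit`, so disk speed never enters the measurement. The `run` helper and the header buffer are my own illustrative names; any matcher that accepts a bytes buffer (e.g. `filetype.guess`, which does) can be dropped in.

```python
import timeit

# Synthetic 256-byte header: a real PNG signature padded with zeros.
PNG_HEADER = b"\x89PNG\r\n\x1a\n" + b"\x00" * 248

def run(matcher, repeat=5, number=10_000):
    """Best-of-N seconds per call for `matcher` on the in-memory header."""
    best = min(timeit.repeat(lambda: matcher(PNG_HEADER),
                             repeat=repeat, number=number))
    return best / number

# Usage (assumes the filetype package is installed):
#   import filetype
#   print(f"{run(filetype.guess) * 1e6:.2f} µs per call")
```

Comparing `run(...)` for the current matcher loop against a signature-table variant would give the metrics the thread is asking for, without any drive in the loop.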
Ok, I'll try to prepare a draft of the new code and make a PR ;)
Magic bytes don't work for complex container types like ISO-BMFF (MP4, MOV, HEIF/HEIC) and Matroska (MKV, WEBM). The headers need to be parsed to determine the format.
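To illustrate the last point with ISO-BMFF: the format is identified by the major brand inside the `ftyp` box (bytes 8-12 of the file), not by a fixed magic prefix, so a pure signature table can't distinguish MP4 from MOV or HEIC. A minimal hedged sketch, with only a small illustrative subset of brands:

```python
# Illustrative subset of ISO-BMFF major brands -> type names.
BRANDS = {
    b"isom": "mp4",
    b"mp42": "mp4",
    b"qt  ": "mov",
    b"heic": "heic",
    b"mif1": "heif",
}

def guess_bmff(buf: bytes):
    """Return a type for an ISO-BMFF header buffer, or None.

    Layout: bytes 0-4 are the box size, bytes 4-8 must be b"ftyp",
    and bytes 8-12 hold the major brand that names the format.
    """
    if len(buf) < 12 or buf[4:8] != b"ftyp":
        return None
    return BRANDS.get(buf[8:12])
```

A full implementation would also consult the compatible-brands list that follows the major brand; Matroska/WEBM similarly requires walking EBML elements rather than matching a prefix.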