filetype.py
Use a file signature table to speed up file type recognition
I think that pre-building a dict with all the magic signatures for the file header lookup is more time-efficient than calling each type object in turn to find a matching file header.
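The idea above can be sketched as follows. This is a hypothetical illustration, not the library's actual implementation: signatures are indexed by their leading byte, so a lookup only tests the few candidates sharing that byte instead of trying every matcher. The signature set shown is a small illustrative subset.

```python
from collections import defaultdict

# Illustrative subset of magic signatures (real tables are much larger).
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"%PDF-": "pdf",
}

# Pre-built index: first signature byte -> list of (signature, type) candidates.
INDEX = defaultdict(list)
for sig, kind in SIGNATURES.items():
    INDEX[sig[0]].append((sig, kind))

def guess(buf: bytes):
    """Return the matching type name for a header buffer, or None."""
    if not buf:
        return None
    for sig, kind in INDEX.get(buf[0], ()):
        if buf.startswith(sig):
            return kind
    return None
```

Whether this beats a plain loop over matchers in practice is exactly what the benchmark discussed below the thread would have to measure.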
That might be true, but without metrics we don't know. Also, IMO the library is fast enough for 99% of use cases. Why do you care about top performance here? What's currently impacting you?
I tried with a cluster of several thousand files and performance wasn't so great, but, I admit, mine was very much an edge case. :p
Interesting... my impression is that this is a CPython limitation more than an implementation performance issue, but we can try improving things. If you can lead this by preparing some performance test suite scenarios that I can easily reproduce, that would be great.
Hi, preparing a general performance test suite is a bit difficult here because of the nature of the physical medium the test runs on. If we try to process that many files in parallel from a single HDD, its I/O limit is reached very quickly, but if the files were split across several SSDs, the result would still be more or less limited by drive performance.
I would suggest that a performance test for this scenario should not involve any I/O at all. Involving I/O would make the measurement inaccurate, and therefore irrelevant.
Instead, the performance suite should only cover the boundaries of the actual code logic being measured. In this context, that means passing a binary buffer representing the file signature, up to 256 bytes. That's all you need; no disk I/O impact here.
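An I/O-free micro-benchmark along those lines might look like this sketch: a synthetic 256-byte header is kept in memory and fed to the matcher under test via `timeit`, so disk speed never enters the measurement. The `run` helper and the header buffer are my own illustrative names; any matcher that accepts a bytes buffer (e.g. `filetype.guess`, which does) can be dropped in.

```python
import timeit

# Synthetic 256-byte header: a real PNG signature padded with zeros.
PNG_HEADER = b"\x89PNG\r\n\x1a\n" + b"\x00" * 248

def run(matcher, repeat=5, number=10_000):
    """Best-of-N seconds per call for `matcher` on the in-memory header."""
    best = min(timeit.repeat(lambda: matcher(PNG_HEADER),
                             repeat=repeat, number=number))
    return best / number

# Usage (assumes the filetype package is installed):
#   import filetype
#   print(f"{run(filetype.guess) * 1e6:.2f} µs per call")
```

Comparing `run(...)` for the current matcher loop against a signature-table variant would give the metrics the thread is asking for, without any drive in the loop.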
Ok, I'll try to prepare a draft of the new code and make a PR ;)
Magic bytes don't work for complex container types like ISO-BMFF (MP4, MOV, HEIF/HEIC) and Matroska (MKV, WEBM). The headers need to be parsed to determine the format.
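To illustrate the last point with ISO-BMFF: the format is identified by the major brand inside the `ftyp` box (bytes 8-12 of the file), not by a fixed magic prefix, so a pure signature table can't distinguish MP4 from MOV or HEIC. A minimal hedged sketch, with only a small illustrative subset of brands:

```python
# Illustrative subset of ISO-BMFF major brands -> type names.
BRANDS = {
    b"isom": "mp4",
    b"mp42": "mp4",
    b"qt  ": "mov",
    b"heic": "heic",
    b"mif1": "heif",
}

def guess_bmff(buf: bytes):
    """Return a type for an ISO-BMFF header buffer, or None.

    Layout: bytes 0-4 are the box size, bytes 4-8 must be b"ftyp",
    and bytes 8-12 hold the major brand that names the format.
    """
    if len(buf) < 12 or buf[4:8] != b"ftyp":
        return None
    return BRANDS.get(buf[8:12])
```

A full implementation would also consult the compatible-brands list that follows the major brand; Matroska/WEBM similarly requires walking EBML elements rather than matching a prefix.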