I'm back! Need help with scanner basics.
Hi @cdgriffith
I've not abandoned you, but Real Life™ has been absorbing much of my time. I'm looking to get to work on porting my .mp3 lists to the scanner format. Taking a rough look at the pdf_scanner, I can see how it mostly behaves, and it will be a good starting point. However, I'm a little confused as to how the head and footer matching works.
- `head` is easy enough in that it starts at byte 0 with `%PDF-`
- `foot` needs to count backwards to pick up `startxref`
How do we tell the scanners to move forward and back in this new method? Or can I create my own methods using standard included imports as I think works best?
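For reference, the head/foot idea above can be sketched with nothing but the standard library. This is only an illustration of seeking forward and backward in a file, not how `pdf_scanner` is actually written; the function name `looks_like_pdf` and the `tail_size` parameter are made up for this example.

```python
import os


def looks_like_pdf(file_path, tail_size=1024):
    """Rough check: '%PDF-' at byte 0 and 'startxref' somewhere near the end.

    Hypothetical helper for illustration only; tail_size is an arbitrary guess
    at how far back from the end of the file it is worth looking.
    """
    with open(file_path, "rb") as f:
        # Head: read forward from byte 0.
        head = f.read(5)
        # Foot: find the file size, then seek back from the end (clamped at 0
        # so short files do not raise).
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - tail_size))
        tail = f.read()
    return head == b"%PDF-" and b"startxref" in tail
```

Searching a window at the end of the file, rather than a fixed offset, avoids having to know exactly how many bytes follow `startxref`.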
Looking forward to stretching the grey matter again, been having a few ideas and want to get this basic(ish) one out the way to build my understanding.
Hey @NebularNerd all good, I have been away from github for a while myself!
Check out the stuff in develop for the new 2.0 scanning type! https://github.com/cdgriffith/puremagic/tree/develop/puremagic/scanners You can create a new scanner file for mp3, get the full file as input, and do whatever is needed to detect stuff in it.
Reformatted this as it was getting a bit confusing to read, some follow up questions:
- Picking on the `pdf_scanner` again, are you searching the whole document rather than byte ranging? *
- In `json_scanner` you return a match with `confidence=1.0`, would this be preferred for mp3's?
- Do these scanners work for streams or is it for files only?
- Are we deleting entries from `magic_data.json` once we have a scanner? I guess this in part depends on the above.
- One other thing with the current `.match_bytes`: are we just reading from the start of the file until we hit a match? I guess for most byte 0 stuff that's fine, but what if the match is deeper? e.g. `["6d6174726f736b61", 24, ".mkv", "video/x-matroska", "Matroska stream file"]`. Would it be worth looking to update that function to explicitly scan the correct part of the file?
- I was using black and flake8 for proofing the code, I see you have switched away to uv and ruff. I'm using VS Code, what changes/plugins/settings would you recommend to avoid issues with PR formatting?
* Looking at sndhdr_scanner.py in #114 I can see there you are byte ranging the matches, I think I can use that as a rough template.
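To make the offset question concrete, here is a minimal sketch of what "explicitly scan the correct part of the file" could look like for the `.mkv` entry quoted above. `matches_at` is a hypothetical helper written for this post; it is not the existing `.match_bytes` code and makes no claim about how that currently behaves.

```python
def matches_at(data: bytes, signature_hex: str, offset: int) -> bool:
    """Compare a hex signature against data at one exact offset (hypothetical helper)."""
    signature = bytes.fromhex(signature_hex)
    return data[offset:offset + len(signature)] == signature


# The entry from the question: [hex signature, offset, extension, mime type, name]
entry = ["6d6174726f736b61", 24, ".mkv", "video/x-matroska", "Matroska stream file"]

# Synthetic bytes with the signature placed at offset 24, not a real .mkv file.
sample = b"\x00" * 24 + b"matroska" + b"\x00" * 8

found = matches_at(sample, entry[0], entry[1])
```

Slicing at the stated offset means a stray copy of the signature elsewhere in the data does not count as a match.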
What's passed into the main of each function is the file_path, header, and footer. So if you need to look outside the standard header or footer range, you can open the file itself and process as you want.
For things that are for sure mp3s, yeah 1.0 confidence can be set :)
Files only as of now
Keeping the magic_data. It is still faster and the first thing checked (and these deeper scans can be disabled for speed)
The whole start of the file is passed in the header, if it's further in than that, would have to open the file_path
I don't use VS Code (prefer Jetbrains stuff so PyCharm). Just make sure the pre-commit is installed and should run those for you!
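As a rough illustration of the "open the file_path if it's further in" point: a scanner could use the bytes it was already given when they cover the range it wants, and only touch the disk otherwise. `read_range` is a hypothetical helper name, and the assumption here is simply that the header argument holds the first bytes of the file.

```python
def read_range(file_path, head: bytes, offset: int, length: int) -> bytes:
    """Return `length` bytes starting at `offset`.

    Hypothetical helper: uses the already-loaded `head` when it covers the
    requested range, otherwise opens the file and seeks to the offset.
    """
    end = offset + length
    if end <= len(head):
        return head[offset:end]
    with open(file_path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

If the requested range runs past the end of the file, `read` just returns fewer bytes, so callers should compare lengths before trusting the result.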
I'll look at adding uv and ruff into my VS Code flow for this repo, don't want to be creating a dozen commits for spacing errors or similar like the old days.
I'll start playing when I have a free weekend, looking at your scanners I can see where I would like to go and working with files makes things far easier, as you say once I get the file passed to the scanner I can chew it over as I see fit. Something like an .mp3 is riddled with fingerprints so securing a 100% confidence should be achievable unless it's a real fringe case (likely a malformed or incomplete file).
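A first rough idea of what I mean by fingerprints, as a sketch only. The `Match` class is a stand-in I made up, the `main(file_path, head, foot)` shape is my reading of the description above, and the confidence numbers are arbitrary. The three checks themselves are standard: an ID3v2 tag starts with `ID3`, an ID3v1 tag starts with `TAG` 128 bytes before the end of the file, and an MPEG audio frame header starts with 11 set bits. This assumes `foot` holds at least the last 128 bytes of the file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Match:
    """Hypothetical result type, not the project's real one."""
    extension: str
    name: str
    mime_type: str
    confidence: float


def has_id3v2(head: bytes) -> bool:
    return head[:3] == b"ID3"


def has_id3v1(foot: bytes) -> bool:
    # ID3v1 occupies the final 128 bytes and begins with b"TAG".
    return len(foot) >= 128 and foot[-128:-125] == b"TAG"


def has_frame_sync(head: bytes) -> bool:
    # MPEG audio frame sync: 0xFF followed by a byte with its top 3 bits set.
    return len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0


def main(file_path, head: bytes, foot: bytes) -> Optional[Match]:
    # file_path is unused in this sketch; a fuller scanner could open it.
    hits = sum([has_id3v2(head), has_id3v1(foot), has_frame_sync(head)])
    if hits == 0:
        return None
    # Arbitrary illustrative scoring: two or more fingerprints means certain.
    confidence = 1.0 if hits >= 2 else 0.8
    return Match(".mp3", "MPEG-1 Audio Layer III", "audio/mpeg", confidence)
```

ID3 tags and frame sync bytes can also turn up in other audio formats, which is why a single hit is scored lower here; a real scanner would want to walk a few frames before claiming 1.0.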
I'm also looking forward to digging out stuff from some of the formats where confidences were not as high, BMP's for example have that weird data lump that needs to be decoded.
@cdgriffith take a look at #120