
I'm back! Need help with scanner basics.

Open NebularNerd opened this issue 2 months ago • 5 comments

Hi @cdgriffith

I've not abandoned you, but Real Life™ has been absorbing much of my time. I'm looking to get to work on porting my .mp3 lists to the scanner format. Taking a rough look at the pdf_scanner, I can see how it mostly behaves, and it will be a good starting point. However, I'm a little confused as to how the header and footer matching work.

  • head is easy enough in that it starts at byte 0 with %PDF-
  • foot needs to count backwards to pick up startxref

How do we tell the scanners to move forward and back in this new method? Or can I create my own methods using standard included imports as I think works best?

Looking forward to stretching the grey matter again, been having a few ideas and want to get this basic(ish) one out the way to build my understanding. 😎

NebularNerd avatar Oct 22 '25 17:10 NebularNerd

Hey @NebularNerd all good, I have been away from github for a while myself!

Check out the stuff in develop for the new 2.0 scanning type! https://github.com/cdgriffith/puremagic/tree/develop/puremagic/scanners You can create a new scanner file for mp3, get the full file as input, and do whatever is needed to detect stuff in it 😄

cdgriffith avatar Oct 23 '25 21:10 cdgriffith

Reformatted this as it was getting a bit confusing to read, some follow up questions:

  • Picking on the pdf_scanner again, are you searching the whole document rather than byte ranging? *
  • In json_scanner you return a match with confidence=1.0 would this be preferred for mp3's?
  • Do these scanners work for streams or is it for files only?
  • Are we deleting entries from magic_data.json once we have a scanner? I guess this in part depends on the above.
  • One other thing: with the current .match_bytes, are we just reading from the start of the file until we hit a match? I guess for most byte 0 stuff that's fine, but what if the match is deeper? e.g. ["6d6174726f736b61", 24, ".mkv", "video/x-matroska", "Matroska stream file"]. Would it be worth updating that function to explicitly scan the correct part of the file?
  • I was using black and flake8 for proofing the code; I see you have switched to uv and ruff. I'm using VS Code, so what changes/plugins/settings would you recommend to avoid issues with PR formatting?

* Looking at sndhdr_scanner.py in #114 I can see there you are byte ranging the matches, I think I can use that as a rough template.

NebularNerd avatar Oct 24 '25 08:10 NebularNerd

What's passed into the main function of each scanner is the file_path, header, and footer. So if you need to look outside the standard header or footer range, you can open the file itself and process it as you want.
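
A rough sketch of that idea, not taken from the puremagic repo: assuming a scanner is handed file_path, head, and foot as described above, a PDF-style check could test the start of the header, look for startxref in the footer, and only open the file when the footer window is too small. The names `Match`, `scan_pdf`, and `tail_size`, and the confidence value, are hypothetical and chosen for illustration only.

```python
from typing import NamedTuple, Optional


class Match(NamedTuple):
    # Hypothetical result type for this sketch, not puremagic's own class.
    extension: str
    name: str
    mime_type: str
    confidence: float


def scan_pdf(file_path: str, head: bytes, foot: bytes,
             tail_size: int = 1024) -> Optional[Match]:
    """Return a Match if both PDF markers are found, else None."""
    # Head check: the signature sits at byte 0, so no seeking is needed.
    if not head.startswith(b"%PDF-"):
        return None

    # Foot check: rfind searches backwards from the end of the buffer.
    if foot.rfind(b"startxref") == -1:
        # The supplied footer did not contain the marker, so open the
        # file and read a larger tail (tail_size is an assumed default).
        with open(file_path, "rb") as handle:
            handle.seek(0, 2)  # jump to end of file
            size = handle.tell()
            handle.seek(max(0, size - tail_size))
            tail = handle.read()
        if tail.rfind(b"startxref") == -1:
            return None

    return Match(".pdf", "PDF document", "application/pdf", 1.0)
```

Calling `scan_pdf(path, head, foot)` returns a `Match` only when both markers are present, so "moving forward and back" is just slicing the supplied bytes or seeking in the opened file.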

For things that are for sure mp3s, yeah 1.0 confidence can be set :)

Files only as of now

Keeping the magic_data. It is still faster and the first thing checked (and these deeper scans can be disabled for speed)

The whole start of the file is passed in the header; if the match is further in than that, you would have to open the file_path.
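
For illustration only, here is a minimal sketch of checking a signature at a fixed offset, using the Matroska entry quoted earlier in the thread (hex 6d6174726f736b61 at offset 24). It compares against the header when the header is long enough and otherwise seeks in the file. `matches_at_offset` is a hypothetical helper, not part of puremagic.

```python
def matches_at_offset(file_path: str, head: bytes,
                      signature_hex: str, offset: int) -> bool:
    """Check whether the signature appears at exactly the given offset."""
    signature = bytes.fromhex(signature_hex)  # "6d6174726f736b61" -> b"matroska"
    end = offset + len(signature)

    # If the supplied header already covers the range, slice it directly.
    if len(head) >= end:
        return head[offset:end] == signature

    # Otherwise open the file and seek to the offset ourselves.
    with open(file_path, "rb") as handle:
        handle.seek(offset)
        return handle.read(len(signature)) == signature
```

A call such as `matches_at_offset(path, head, "6d6174726f736b61", 24)` only ever looks at the eight bytes starting at offset 24, instead of scanning from byte 0 until something matches.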

I don't use VS Code (prefer Jetbrains stuff so PyCharm). Just make sure the pre-commit is installed and it should run those for you!

cdgriffith avatar Nov 04 '25 18:11 cdgriffith

> I don't use VS Code (prefer Jetbrains stuff so PyCharm). Just make sure the pre-commit is installed and should run those for you!

I'll look at adding uv and ruff into my VS Code flow for this repo, don't want to be creating a dozen commits for spacing errors or similar like the old days 🤣

I'll start playing when I have a free weekend, looking at your scanners I can see where I would like to go and working with files makes things far easier, as you say once I get the file passed to the scanner I can chew it over as I see fit. Something like an .mp3 is riddled with fingerprints so securing a 100% confidence should be achievable unless it's a real fringe case (likely a malformed or incomplete file).
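
To make the "fingerprints" idea concrete, here is a small sketch that checks three well-known MP3 markers in a bytes buffer: an ID3v2 tag at byte 0, an MPEG frame sync (eleven set bits) where the audio is expected to start, and an ID3v1 "TAG" block 128 bytes from the end. The function names are hypothetical, and the sketch assumes the ID3v2 tag has no footer and that audio starts straight after the tag, which will not hold for every file.

```python
def synchsafe_to_int(raw: bytes) -> int:
    """Decode an ID3v2 synchsafe integer (7 usable bits per byte)."""
    value = 0
    for byte in raw:
        value = (value << 7) | (byte & 0x7F)
    return value


def mp3_fingerprints(data: bytes) -> dict:
    """Report which common MP3 markers are present in the buffer."""
    found = {"id3v2": False, "frame_sync": False, "id3v1": False}
    audio_start = 0

    # ID3v2: 10-byte header starting with "ID3"; bytes 6-9 hold the tag size.
    if len(data) >= 10 and data[:3] == b"ID3":
        found["id3v2"] = True
        audio_start = 10 + synchsafe_to_int(data[6:10])

    # Frame sync: 0xFF followed by a byte whose top three bits are set.
    if len(data) >= audio_start + 2:
        first, second = data[audio_start], data[audio_start + 1]
        found["frame_sync"] = first == 0xFF and (second & 0xE0) == 0xE0

    # ID3v1: fixed 128-byte block at the end of the file, starting with "TAG".
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        found["id3v1"] = True

    return found
```

How many of these markers should be required before reporting full confidence is a design choice for the scanner; a malformed or truncated file might show only one of them.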

I'm also looking forward to digging out stuff from some of the formats where confidences were not as high, BMP's for example have that weird data lump that needs to be decoded.

NebularNerd avatar Nov 04 '25 19:11 NebularNerd

@cdgriffith take a look at #120 😎

NebularNerd avatar Nov 15 '25 17:11 NebularNerd