Small bug in the Python feature extraction code

Open reyammer opened this issue 2 years ago • 0 comments

It seems we have a small bug in the feature extraction code. (caught by @ia0)

https://github.com/google/magika/blob/main/python/magika/magika.py#L310

The line

    mid_idx = (file_size - beg_trimmed_size - end_trimmed_size) // 2

should likely be something like

    mid_idx = beg_trimmed_size + (file_size - beg_trimmed_size - end_trimmed_size) // 2
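To illustrate the difference between the two expressions, here is a minimal standalone sketch. It is not Magika's actual code: the helper names `mid_idx_current` and `mid_idx_proposed` are made up for this example, and it assumes that `beg_trimmed_size` and `end_trimmed_size` are the numbers of leading and trailing whitespace bytes that get stripped, and that `mid_idx` is meant to be an offset into the original, untrimmed file.

```python
def mid_idx_current(file_size: int, beg_trimmed_size: int, end_trimmed_size: int) -> int:
    # Expression as it appears in the issue: the midpoint of the trimmed
    # content, measured from the start of the *trimmed* content.
    return (file_size - beg_trimmed_size - end_trimmed_size) // 2


def mid_idx_proposed(file_size: int, beg_trimmed_size: int, end_trimmed_size: int) -> int:
    # Proposed expression: the same midpoint, shifted by the leading trimmed
    # bytes so that it is an offset into the *original* content.
    return beg_trimmed_size + (file_size - beg_trimmed_size - end_trimmed_size) // 2


# Toy input: 10 leading spaces, 8 bytes of payload, 6 trailing spaces.
content = b" " * 10 + b"ABCDEFGH" + b" " * 6
file_size = len(content)
beg_trimmed_size = len(content) - len(content.lstrip())
end_trimmed_size = len(content) - len(content.rstrip())

current = mid_idx_current(file_size, beg_trimmed_size, end_trimmed_size)
proposed = mid_idx_proposed(file_size, beg_trimmed_size, end_trimmed_size)

# With the current expression the index falls inside the leading padding;
# with the proposed one it falls in the middle of the payload.
print(current, content[current:current + 1])
print(proposed, content[proposed:proposed + 1])
```

Under these assumptions, the two expressions only agree when nothing is trimmed from the beginning of the file. Whether the proposed form is the right one still depends on what is done at training time.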

The fix is trivial, but I'm thinking about it a bit more to make sure this is what we should do / actually matches what we do at training time.

We should also add more in-depth test cases for this, to make sure that extract_features_from_bytes and extract_features_from_path produce the same features for the same content.
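One possible shape for such a test, as a rough sketch only: the helper `assert_bytes_path_parity` and the toy extractors below are hypothetical and stand in for the real extraction functions, whose exact signatures are not shown in this issue. The idea is to write each sample to a temporary file and check that the bytes-based and path-based extractors return equal results, using samples with various amounts of leading and trailing whitespace.

```python
import tempfile
from pathlib import Path
from typing import Any, Callable, Iterable


def assert_bytes_path_parity(
    from_bytes: Callable[[bytes], Any],
    from_path: Callable[[Path], Any],
    samples: Iterable[bytes],
) -> None:
    """Check that both extractors return equal results for every sample."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        for i, content in enumerate(samples):
            path = Path(tmp_dir) / f"sample_{i}.bin"
            path.write_bytes(content)
            assert from_bytes(content) == from_path(path), f"mismatch for sample {i}"


# Toy stand-ins (NOT Magika's extractors): keep the first 4 bytes after
# stripping whitespace. Replace these with the real functions under test.
def toy_from_bytes(content: bytes) -> bytes:
    return content.strip()[:4]


def toy_from_path(path: Path) -> bytes:
    return path.read_bytes().strip()[:4]


# Samples chosen to exercise trimming: empty, whitespace-only, untrimmed,
# padded on both sides, and padding much longer than a small window.
SAMPLES = [
    b"",
    b"   ",
    b"abc",
    b"  \n abc \t ",
    b"\x00" * 16,
    b" " * 100 + b"x" * 100 + b" " * 100,
]

assert_bytes_path_parity(toy_from_bytes, toy_from_path, SAMPLES)
```

Samples with asymmetric padding are the interesting ones here, since a midpoint computed without the leading offset only goes wrong when bytes are trimmed from the beginning.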

reyammer, Feb 20 '24 13:02