magika
Small bug in the Python feature extraction code
It seems we have a small bug in the feature extraction code. (caught by @ia0)
https://github.com/google/magika/blob/main/python/magika/magika.py#L310
`mid_idx = (file_size - beg_trimmed_size - end_trimmed_size) // 2` should likely be something like `mid_idx = beg_trimmed_size + (file_size - beg_trimmed_size - end_trimmed_size) // 2`
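To illustrate the difference between the two expressions, here is a minimal sketch with made-up numbers. It assumes `beg_trimmed_size` and `end_trimmed_size` are the number of bytes trimmed from the start and end of the content; the helper names `mid_index_current` and `mid_index_proposed` are hypothetical and are not part of Magika.

```python
def mid_index_current(file_size: int, beg_trimmed_size: int, end_trimmed_size: int) -> int:
    # Expression as it appears in the issue: half of the trimmed length,
    # measured from offset 0 rather than from the start of the trimmed content.
    return (file_size - beg_trimmed_size - end_trimmed_size) // 2


def mid_index_proposed(file_size: int, beg_trimmed_size: int, end_trimmed_size: int) -> int:
    # Proposed expression: same half-length, shifted by the bytes trimmed at the start.
    return beg_trimmed_size + (file_size - beg_trimmed_size - end_trimmed_size) // 2


# Made-up example (assumption): 20 leading and 10 trailing whitespace bytes
# around 70 bytes of content, so the trimmed content spans offsets [20, 90).
content = b" " * 20 + b"x" * 70 + b" " * 10
file_size = len(content)

current = mid_index_current(file_size, 20, 10)    # 35
proposed = mid_index_proposed(file_size, 20, 10)  # 55, the center of [20, 90)

# With nothing trimmed at the start, the two expressions agree.
same_when_untrimmed = mid_index_current(100, 0, 10) == mid_index_proposed(100, 0, 10)
```

Under these assumptions the two expressions only differ when something is trimmed from the beginning, which may be why the discrepancy is easy to miss.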
The fix is trivial, but I want to think about it a bit more to make sure it is the right change and that it actually matches what we do at training time.
We should also add more in-depth test cases for this, to make sure that extract_features_from_bytes returns the same features as extract_features_from_path for the same content.
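A rough sketch of what such a consistency check could look like. Everything here is an assumption for illustration: `find_mismatches` is a hypothetical helper, and the `toy_*` extractors are stand-ins that do not reflect Magika's actual functions or signatures. In a real test, the two stand-ins would be replaced by the bytes-based and path-based extractors.

```python
import tempfile
from pathlib import Path


def find_mismatches(from_bytes, from_path, samples):
    """Return the samples for which the two extractors disagree."""
    mismatches = []
    with tempfile.TemporaryDirectory() as tmp_dir:
        for i, content in enumerate(samples):
            # Write each sample to disk so the path-based extractor sees the same bytes.
            path = Path(tmp_dir) / f"sample_{i}.bin"
            path.write_bytes(content)
            if from_bytes(content) != from_path(path):
                mismatches.append(content)
    return mismatches


# Samples chosen to exercise trimming: empty, whitespace only, no padding,
# leading padding, trailing padding, and padding on both sides.
SAMPLES = [
    b"",
    b"   ",
    b"abc",
    b"  \n\tabcdefgh",
    b"abcdefgh \n ",
    b" " * 20 + b"x" * 70 + b" " * 10,
]


# Stand-in extractors (hypothetical, NOT Magika's API).
def toy_from_bytes(content: bytes) -> bytes:
    return content.strip()[:4]


def toy_from_path(path: Path) -> bytes:
    return Path(path).read_bytes().strip()[:4]


mismatches = find_mismatches(toy_from_bytes, toy_from_path, SAMPLES)
```

Returning the mismatching samples, rather than a single boolean, should make it easier to see which trimming case breaks the equivalence.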