TIKA-1180: Add MatroskaDetector for improved MKV/WEBM detection
This PR introduces a MatroskaDetector to more accurately identify MKV and WEBM files based on EBML headers and DocType strings. It improves detection in cases where:
- File extensions are missing
- DocType is compressed, shifted, or partially missing
- Signatures in tika-mimetypes.xml alone are insufficient
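To illustrate the idea for reviewers, here is a minimal, dependency-free sketch of header-based detection. It is not the code in this PR: `MatroskaSniffer` and `sniff` are hypothetical names, and the sketch assumes the DocType element uses a one-byte size field and sits within the bytes passed in. It checks the EBML magic (`0x1A45DFA3`), then scans for the DocType element ID (`0x4282`) instead of expecting it at a fixed offset, which is why a shifted DocType is still found. Distinguishing audio-only from video content is out of scope here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not the MatroskaDetector from this PR.
final class MatroskaSniffer {
    // Every EBML document begins with the EBML header element ID 0x1A45DFA3.
    private static final byte[] EBML_MAGIC = {0x1A, 0x45, (byte) 0xDF, (byte) 0xA3};

    /** Returns "video/x-matroska", "video/webm", or null when the type cannot be determined. */
    static String sniff(byte[] head) {
        if (head == null || head.length < EBML_MAGIC.length) {
            return null;
        }
        for (int i = 0; i < EBML_MAGIC.length; i++) {
            if (head[i] != EBML_MAGIC[i]) {
                return null; // not an EBML container
            }
        }
        // Scan for the DocType element ID (0x42 0x82) rather than assuming a fixed offset.
        for (int i = EBML_MAGIC.length; i + 2 < head.length; i++) {
            if ((head[i] & 0xFF) != 0x42 || (head[i + 1] & 0xFF) != 0x82) {
                continue;
            }
            int sizeByte = head[i + 2] & 0xFF;
            // Assumption: only one-byte sizes are handled (marker bit 0x80 set, low 7 bits = length).
            if ((sizeByte & 0x80) == 0) {
                continue;
            }
            int len = sizeByte & 0x7F;
            int start = i + 3;
            if (start + len > head.length) {
                continue; // DocType value is truncated in this buffer
            }
            // trim() also drops trailing NUL padding, which EBML strings may carry.
            String docType = new String(head, start, len, StandardCharsets.US_ASCII).trim();
            if (docType.equals("matroska")) {
                return "video/x-matroska";
            }
            if (docType.equals("webm")) {
                return "video/webm";
            }
        }
        return null; // EBML container, but no recognizable DocType found
    }
}
```

A caller would read the first few dozen bytes of the stream and pass them to `sniff`; a `null` result means the decision should fall through to the existing magic-based detection.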
Includes:
- MatroskaDetector.java
- Test file: MatroskaDetectorTest.java
- META-INF/services/org.apache.tika.detect.Detector entry
Benchmarked against:
- Default Tika
- Wladimir Leite's mime-type additions
- This detector
Results indicate that this detector improves accuracy in edge cases that MIME magic matching alone does not handle.