TIKA-1180: Add MatroskaDetector for improved MKV/WEBM detection

Open sirajahmadzai opened this issue 6 months ago • 0 comments

This PR introduces a MatroskaDetector to more accurately identify MKV and WEBM files based on EBML headers and DocType strings. It improves detection in cases where:

  • File extensions are missing
  • DocType is compressed, shifted, or partially missing
  • Signatures in tika-mimetypes.xml alone are insufficient
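
As a rough illustration of header-based detection (not the code in this PR), the sketch below checks for the EBML magic bytes `1A 45 DF A3` and then scans a short window for the DocType element (ID `42 82`) to tell `webm` from `matroska`. The class name `MatroskaSniffer`, the 64-byte scan window, and the fallback to `application/x-matroska` when no DocType is found are assumptions made for this example. It only handles a one-byte size field and does not cover the shifted or partially missing DocType cases listed above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not the MatroskaDetector from this PR.
public class MatroskaSniffer {
    // EBML header element ID that starts every Matroska/WebM file.
    private static final int[] EBML_MAGIC = {0x1A, 0x45, 0xDF, 0xA3};
    // Assumed scan window; the DocType normally sits near the start of the header.
    private static final int SCAN_LIMIT = 64;

    public static String detect(byte[] head) {
        if (head == null || head.length < EBML_MAGIC.length) {
            return "application/octet-stream";
        }
        for (int i = 0; i < EBML_MAGIC.length; i++) {
            if ((head[i] & 0xFF) != EBML_MAGIC[i]) {
                return "application/octet-stream";
            }
        }
        int limit = Math.min(head.length, SCAN_LIMIT);
        for (int i = EBML_MAGIC.length; i + 2 < limit; i++) {
            // DocType element ID is 0x42 0x82.
            if ((head[i] & 0xFF) != 0x42 || (head[i + 1] & 0xFF) != 0x82) {
                continue;
            }
            int size = head[i + 2] & 0xFF;
            // Only one-byte sizes are handled: the high bit is the length marker.
            if ((size & 0x80) == 0) {
                continue;
            }
            int len = size & 0x7F;
            int start = i + 3;
            if (start + len > head.length) {
                continue;
            }
            String docType = new String(head, start, len, StandardCharsets.US_ASCII);
            if (docType.equals("webm")) {
                return "video/webm";
            }
            if (docType.equals("matroska")) {
                return "video/x-matroska";
            }
        }
        // EBML container with no recognizable DocType (assumed fallback type).
        return "application/x-matroska";
    }
}
```

Calling `MatroskaSniffer.detect` on the first bytes of a file returns a media type string without looking at the file extension. A real Tika detector would implement `org.apache.tika.detect.Detector` and read those bytes from a mark-supported stream.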

Includes:

  • MatroskaDetector.java
  • Test file: MatroskaDetectorTest.java
  • META-INF/services/org.apache.tika.detect.Detector entry

Benchmarked against:

  • Default Tika
  • Wladimir Leite's mime-type additions
  • This detector

Results indicate that this approach improves detection accuracy in edge cases that MIME magic alone does not handle.

sirajahmadzai, Jun 14 '25 21:06