Unable to process idx file without palette definition

Open waterkip opened this issue 2 years ago • 0 comments

I have an idx file that I extracted with mkvextract from an mkv I ripped from a DVD. The process is something like this:

mplayer -dvd-device $dvd dvd://$nav -dumpstream -dumpfile $vob
ffmpeg \
    -fflags +genpts \
    -analyzeduration 1000000k \
    -probesize 1000000k \
    -i $vob \
    -c copy \
    $mapping \
    $metadata_audio \
    $metadata_subs \
    -y \
    $mkv

mkvextract $mkv tracks id:idxfile

This generates .idx files which look like this:

# VobSub index file, v7 (do not modify this line!)
langidx: 0

id: en, index: 0
timestamp: 00:00:04:200, filepos: 000000000
# etc etc

For vobsub2srt I added the line custom colors: ON, tridx: 1000, colors: 000000, ffffff, 000000, 000000 to improve the OCR results, but that has been hit and miss.

So my idx files look like this:

# VobSub index file, v7 (do not modify this line!)
langidx: 0

custom colors: ON, tridx: 1000, colors: 000000, ffffff, 000000, 000000

id: en, index: 0
timestamp: 00:00:04:200, filepos: 000000000
# etc etc

However, vobsubocr croaks with an error on these idx files:

An error occured: Could not parse VOB subtitles from 13-dut.idx: Could not parse 13-dut.idx

It seems the tool needs a palette to be present in order to work (I snipped this piece from an idx I saw in one of the bug reports here):

# The palette of the generated file
palette: 000000, f0f0f0, cccccc, 999999, 3333fa, 1111bb, fa3333, bb1111, 33fa33, 11bb11, fafa33, bbbb11, fa33fa, bb11bb, 33fafa, 11bbbb

This is what the tool does seem to be able to process:

# VobSub index file, v7 (do not modify this line!)
langidx: 0

palette: 000000, f0f0f0, cccccc, 999999, 3333fa, 1111bb, fa3333, bb1111, 33fa33, 11bb11, fafa33, bbbb11, fa33fa, bb11bb, 33fafa, 11bbbb

id: en, index: 0
timestamp: 00:00:04:200, filepos: 000000000
# etc etc
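
As a stopgap until the palette is optional, something like the following could add the palette line to idx files that lack one. This is only a sketch: add_palette is a hypothetical helper name, the palette values are simply the ones quoted above (not colors taken from the DVD), and placing the line right after langidx is an assumption based on the idx layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: insert a palette line into a VobSub .idx file that has none.
# The palette is the one quoted in this issue, not one derived from the disc.
add_palette() {
    idx=$1
    palette='palette: 000000, f0f0f0, cccccc, 999999, 3333fa, 1111bb, fa3333, bb1111, 33fa33, 11bb11, fafa33, bbbb11, fa33fa, bb11bb, 33fafa, 11bbbb'

    # Leave files that already define a palette untouched.
    if grep -q '^palette:' "$idx"; then
        return 0
    fi

    tmp=$(mktemp) || return 1
    # Print every line; after the first langidx line, add a blank line and the palette.
    # If the file has no langidx line, the content is written back unchanged.
    awk -v p="$palette" '
        { print }
        /^langidx:/ && !done { print ""; print p; done = 1 }
    ' "$idx" > "$tmp" || { rm -f "$tmp"; return 1; }

    # Write back in place so the original file keeps its permissions.
    cat "$tmp" > "$idx"
    rm -f "$tmp"
}
```

Running add_palette 13-dut.idx twice adds the line only once, since the second run sees the existing palette and returns early.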

I don't think we need to have the palette (or custom colors, for that matter) present. It would be best to generate an image in the form that tesseract likes best:

  • black on white
  • add a border around the image/text (https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html#borders)

waterkip, Jan 04 '24 19:01