Image Data Creeping into Plain Text

Open ianmilligan1 opened this issue 10 years ago • 8 comments

Some images have been sneaking into the extracted plain text, perhaps because (as per @anjackson) we are trusting the server-supplied Content-Type. The binary data breaks text analysis workflows. See the figure below:

[Figure: error message]

Current workaround is:

strings [input-file] > [output-file]

(this can also be baked into workflows)

ianmilligan1 avatar Jul 15 '15 19:07 ianmilligan1
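
For workflows that cannot shell out to `strings`, the same idea can be sketched in Python: keep only runs of printable characters and drop everything else. This is a rough approximation, not a reimplementation of the tool; the function name `printable_runs` and the ASCII-only character class are assumptions made for illustration, and the minimum run length of 4 mirrors the usual `strings` default.

```python
import re
from typing import List


def printable_runs(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of at least `min_len` printable ASCII bytes (plus tab)."""
    # Space through tilde, plus tab; every other byte ends the current run.
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]


# A payload mixing binary bytes with readable text.
sample = b"\x89PNG\r\n\x1a\n\x00\x00Hello world\x00\xff\xfeabc\x00plain text here"
runs = printable_runs(sample)
```

Note that this discards non-ASCII text as well, so it is only a stopgap for collections that are mostly English.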

Just re-ran the plain text extractor and this is still an issue. These are images where I think the MIME type is erroneously set to HTML. Related to #163: is there any easy way to refine keepValidPages further?

ianmilligan1 avatar Nov 24 '15 03:11 ianmilligan1

I suppose we could run Tika MIME type detection, but this would slow processing down considerably...

lintool avatar Nov 24 '15 13:11 lintool

Given some of the other stuff you're planning, I'd be surprised if Tika slows you down much. If you do decide to try it, your best bet is to just include tika-core (and not tika-parsers) on your classpath. In that case, the MIME-type detection will not open up and parse container/complex formats. It will just do binary signature detection, which is enough to spot common image formats. The only 'trick' is to use a buffered stream wrapper so the Tika code can parse the first few KB and you can then reset the stream pointer to the start of the payload.

anjackson avatar Nov 24 '15 19:11 anjackson
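
The pattern described above (inspect the first bytes of the payload, then continue reading from the start) can be illustrated without Tika. The sketch below is a hand-rolled stand-in in Python, not Tika's detector: the signature table covers only PNG, JPEG, and GIF, and the name `sniff_image_type` is hypothetical. It uses `peek` on a buffered reader so the read position never moves, which plays the role of the mark/reset step.

```python
import io
from typing import Optional

# A few well-known image signatures (magic numbers). Deliberately incomplete.
IMAGE_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def sniff_image_type(stream: io.BufferedReader) -> Optional[str]:
    """Return an image MIME type if the payload starts with a known signature."""
    # peek() returns buffered bytes without advancing the read position,
    # so the caller can still read the full payload afterwards.
    head = stream.peek(16)[:16]
    for magic, mime in IMAGE_SIGNATURES.items():
        if head.startswith(magic):
            return mime
    return None


png_payload = b"\x89PNG\r\n\x1a\n" + b"\x00" * 32
png_stream = io.BufferedReader(io.BytesIO(png_payload))
detected = sniff_image_type(png_stream)
remaining = png_stream.read()  # still the whole payload
```

A record whose declared Content-Type is text/html but whose sniffed type is an image could then be skipped before text extraction.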

Ah, I see you've tried it before. Your current implementation instantiates Tika on every request:

https://github.com/lintool/warcbase/blob/26d2ef518fb33632dcfd6a0bcfbd00ff7731d232/src/main/scala/org/warcbase/spark/matchbox/DetectMimeTypeTika.scala#L35

Tika isn't intended to be used this way, and will be very slow (as it's re-parsing the signature files etc. every time). You could try re-using a singleton Tika instance instead. I believe it's supposed to be thread-safe, but even if it isn't you could wrap it as a ThreadLocal singleton.

anjackson avatar Nov 25 '15 07:11 anjackson
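
The reuse pattern suggested above can be sketched in Python with `threading.local`, which is the closest analogue to a Java ThreadLocal. Everything here is illustrative: `SlowDetector` is a hypothetical stand-in for an object that is expensive to construct (it is not Tika), and `get_detector` is an assumed helper name. The point is only that each thread builds the detector once and reuses it on every later call.

```python
import threading


class SlowDetector:
    """Hypothetical stand-in for a detector that is costly to construct."""

    instances_created = 0
    _count_lock = threading.Lock()

    def __init__(self) -> None:
        # Count constructions so the reuse is observable.
        with SlowDetector._count_lock:
            SlowDetector.instances_created += 1

    def detect(self, head: bytes) -> str:
        return "image/png" if head.startswith(b"\x89PNG") else "application/octet-stream"


_local = threading.local()


def get_detector() -> SlowDetector:
    """Return this thread's detector, creating it on first use only."""
    detector = getattr(_local, "detector", None)
    if detector is None:
        detector = SlowDetector()
        _local.detector = detector
    return detector


# Many calls on one thread share a single instance.
for _ in range(100):
    get_detector().detect(b"\x89PNG\r\n\x1a\n")
```

If the detector is known to be thread-safe, a single module-level instance would be simpler; the per-thread version is the conservative choice when that is uncertain.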

Just keeping this alive: I was playing around with an ExtractEntities call on another test collection and it crashed on:

Unparseable header line: [?slÑ???r???]QQoGâ?XyÚ  6?YÛ¤¶i·J­Ö¤Rö ?âØÌ6M·_¿Ï?©Òd ÙçóÝ}Çábw?Ó
                                                                              Á?vxÿg? ¹©¥ju?\iå
                                                                                               r3Ρ5wR«Øsp+Eߨùí;à$
  fPLÓ=É?ÍYJ~Î
              ¨µiðí¡Ï£Ql eÃÖ´â<?+}Ê?å<Áë*K1 ©Òª&Yöu?¸½Pü?.  ½WIhIÓÈǺ¿3ñ²×õ3ÞÈNXÇ8Øxgä#¢S­µV`÷¡£s?ãJVj<Ej£ÖÄþÙô)o ÉzIHù"                                                   Ðt
iÙC§Ö5·.?6 ®´³ Õ=·VÖapP<Av¸X`Ì£uX¼Â8^fáÁw??ñ{Væ
                                               A6pr?¦y~C?¢?¥éd¸jô ¹|;Ïÿ¼?_þï ®çá¾ë ´éb` å°1]-?Èwé×? 1br??? ????????] (Offset 11).

My sense is this is binary data sneaking into things?

ianmilligan1 avatar Dec 16 '15 04:12 ianmilligan1

Yes. As a workaround for now, could you add a .filter(...) and exclude that page by hand?

lintool avatar Dec 16 '15 14:12 lintool
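
A by-hand exclusion of this kind amounts to a predicate over records. The Python sketch below shows the shape of it on a plain list; the record layout (a dict with `url` and `content` keys), the URLs, and the name `keep_record` are all assumptions for illustration, not warcbase's API.

```python
# URLs identified by hand as carrying binary payloads (hypothetical values).
EXCLUDED_URLS = {"http://example.com/logo.png"}


def keep_record(record: dict) -> bool:
    """Keep every record except the ones excluded by hand."""
    return record["url"] not in EXCLUDED_URLS


records = [
    {"url": "http://example.com/index.html", "content": "Welcome"},
    {"url": "http://example.com/logo.png", "content": "\x89PNG..."},
    {"url": "http://example.com/about.html", "content": "About us"},
]

kept = list(filter(keep_record, records))
```

This only helps once the offending page is known, which is why identifying the record comes first.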

I'm not quite sure how to grab the record name, as the errors thrown give limited detail. I'll put the gist here and maybe we can quickly chat about it today when I'm up in DC.

https://gist.github.com/ianmilligan1/8822295cf487b98d083e

ianmilligan1 avatar Dec 16 '15 14:12 ianmilligan1

What's the script that you're running? Can you isolate which WARC file the error is coming from? That would be a start...

lintool avatar Dec 17 '15 22:12 lintool
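
One way to narrow down the failing input is to run the job on each file separately and record which ones raise. The Python sketch below shows that loop under stated assumptions: `process` stands for whatever per-file job is being run, the file names are made up, and `find_failing_inputs` is a hypothetical helper, not part of warcbase.

```python
from typing import Callable, Dict, Iterable


def find_failing_inputs(
    paths: Iterable[str], process: Callable[[str], object]
) -> Dict[str, str]:
    """Run `process` on each path and map failing paths to their error text."""
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            process(path)
        except Exception as exc:  # broad on purpose: any failure is of interest
            failures[path] = f"{type(exc).__name__}: {exc}"
    return failures


def fake_job(path: str) -> None:
    """Stand-in job that fails on one specific (made-up) file."""
    if path == "part-0002.warc.gz":
        raise ValueError("Unparseable header line")


failures = find_failing_inputs(
    ["part-0001.warc.gz", "part-0002.warc.gz", "part-0003.warc.gz"], fake_job
)
```

Once the file is known, the same loop can be repeated over its records to find the specific page to exclude.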