
Extract terms through OCR for non-text source documents

Open clementbiron opened this issue 3 years ago • 3 comments

With the following declaration (the dedicated branch is here: https://github.com/OpenTermsArchive/declarations-france/commit/5d1c1c3bb1e43ad68e86b5ed346090d3c29e78f3):

{
  "name": "Desigual",
  "documents": {
    "Commercial Terms": {
      "fetch": "https://www.desigual.com/on/demandware.static/-/Library-Sites-DsglSharedLibrary/default/dw98507c8d/docs/legal/Footer_legal_documents/Francia/FRANCIA-Condiciones_Generales_Venta_Vfinal_FR_230321.pdf"
    },
    "Privacy Policy": {
      "fetch": "https://www.desigual.com/on/demandware.static/-/Library-Sites-DsglSharedLibrary/default/dw77e5bf6a/docs/legal/Footer_legal_documents/Francia/FRANCIA-POLITICA_DE_PRIVACIDAD_Vfinal_FR_230321.pdf"
    }
  }
}

I get an empty version for Commercial Terms and the following wrong version for Privacy Policy:


2  

 

 

 

 

   - 

- 

 

 

 

 

  

 

3

The snapshots are good.

clementbiron avatar Feb 22 '22 13:02 clementbiron

Unfortunately these documents are protected: if I access the PDF and try to copy their contents, I also only get spaces. I don't think this is an issue with Open Terms Archive (or rather, with the dependency @Accordproject). However, it is worth reflecting on whether we can detect this automatically and how we should handle such cases, as it is pretty much the PDF equivalent to an HTTP 403.
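One possible way to detect this case automatically is to count the meaningful characters in the extracted text. The sketch below is only an illustration: the function name `looks_like_empty_extraction` and the threshold of 50 characters are assumptions, not part of Open Terms Archive or of the Accord Project dependency.

```python
def looks_like_empty_extraction(text: str, min_chars: int = 50) -> bool:
    """Return True when the extracted text is mostly whitespace or punctuation.

    Hypothetical helper: `min_chars` is an arbitrary threshold chosen for
    illustration and would need tuning against real documents.
    """
    # Count only letters and digits, so stray dashes and spaces are ignored.
    meaningful = sum(1 for ch in text if ch.isalnum())
    return meaningful < min_chars


# An output resembling the wrong Privacy Policy version reported above.
broken_version = "2 \n\n \n\n   - \n\n- \n\n \n\n3"
print(looks_like_empty_extraction(broken_version))
```

A check like this could let the engine report such documents as inaccessible instead of recording a blank version.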

MattiSG avatar Mar 04 '22 18:03 MattiSG

And for the record, it is NOT fixed by https://github.com/ambanum/OpenTermsArchive/issues/836

Considering how fast the answer from accordproject was on the whitespace matter, I suggest we create an issue in their repo to see if they can do something about it (even though I doubt they can).

martinratinaud avatar May 11 '22 10:05 martinratinaud

The source file has been vectorised. There is indeed no text in the PDF. The only way to obtain the content would be to use OCR. This could be useful. I'll rename this issue accordingly. Please add other example cases where this would enable extraction!
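As a rough illustration of what an OCR fallback could look like, here is a minimal sketch. It assumes the third-party packages `pdf2image` (which needs Poppler) and `pytesseract` (which needs the Tesseract binary) are installed, and that the French language data (`fra`) is available. The function names, the 300 dpi setting and the 50-character threshold are all hypothetical and are not how Open Terms Archive currently works.

```python
from typing import Callable


def ocr_pdf(path: str, lang: str = "fra") -> str:
    """OCR every page of a PDF and return the concatenated text."""
    # Imported lazily so the rest of the module loads without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    # Render each page to an image; 300 dpi is an assumed setting.
    pages = convert_from_path(path, dpi=300)
    return "\n\n".join(pytesseract.image_to_string(page, lang=lang) for page in pages)


def extract_terms(
    path: str,
    extract_text: Callable[[str], str],
    ocr: Callable[[str], str] = ocr_pdf,
    min_chars: int = 50,  # assumed threshold, for illustration only
) -> str:
    """Use the PDF text layer when it has content, otherwise fall back to OCR."""
    text = extract_text(path)
    # Count letters and digits only, so whitespace-only output triggers OCR.
    if sum(1 for ch in text if ch.isalnum()) >= min_chars:
        return text
    return ocr(path)
```

Here `extract_text` stands for whatever text-layer extractor is already in use. Passing it in as a parameter keeps the sketch independent of any specific PDF library.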

MattiSG avatar Apr 24 '23 09:04 MattiSG