iscc-specs

Change wording for text extraction scope.

Open: titusz opened this issue 5 years ago • 1 comment

Currently: "While text-extraction is out of scope for this specification ..."

Proposed Change: "While detailed procedures for text-extraction from various document formats are out of scope for this specification ..."

For reproducible Content-ID-Text components, the definition of the extraction tool and version is part of the normative specification. It might be updated in a future version of the ISCC (ideally only after compatibility tests). Because of the comprehensive text normalization (especially with the upcoming ISCC v1.1), the impact of different text-extraction tools or versions should be minimal. Even if two implementations of the ISCC generate slightly different Content-IDs, this is not regarded as a failure to produce a valid ISCC code. The similarity-preserving nature of the component would still produce a match or near-duplicate match when comparing ISCC codes.
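To illustrate the point about normalization and similarity preservation, here is a minimal sketch. It is not the ISCC Content-ID-Text algorithm: the `normalize`, `simhash64`, and `hamming` helpers are hypothetical names, the normalization steps (NFKC, lowercasing, whitespace collapsing) and the generic SimHash over character n-grams are assumptions chosen only to show the idea that extraction differences which normalization absorbs lead to identical codes, and that codes are compared by Hamming distance rather than by equality.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Simplified normalization (an assumption, not the ISCC procedure)."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Collapse all runs of whitespace (spaces, newlines, tabs) to one space.
    return " ".join(text.split())


def simhash64(text: str, width: int = 4) -> int:
    """Generic 64-bit SimHash over character n-grams of the normalized text."""
    norm = normalize(text)
    grams = {norm[i:i + width] for i in range(max(1, len(norm) - width + 1))}
    counts = [0] * 64
    for gram in grams:
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        # Each n-gram votes +1 or -1 on every bit position.
        for bit in range(64):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit codes."""
    return bin(a ^ b).count("1")


# Hypothetical outputs of two extraction tools for the same page:
# tool A keeps a ligature, a line break, and double spaces; tool B does not.
tool_a = "The \ufb01rst  page\nof the   document"
tool_b = "The first page of the document"

# Both normalize to the same string, so the codes are identical.
distance = hamming(simhash64(tool_a), simhash64(tool_b))
```

Differences that survive normalization (for example, one tool dropping a word) would change some n-grams and typically flip only a few bits, which is why comparison by Hamming distance with a threshold, rather than exact equality, still finds the match.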

titusz, Apr 30 '19 18:04

"the definition of the extraction tool/version is part of the normative specification"

Mandating a specific tool works only if you also tie it to a specific version of that tool (as you may be implying). But software is known to have vulnerabilities that require systems to update, so this approach is unreasonable and unacceptable.

Additionally, it would prevent innovation in this area, especially for complex formats such as PDF.

lrosenthol, May 13 '20 15:05