iscc-specs
Change wording for text extraction scope.
Currently: "While text-extraction is out of scope for this specification ..."
Proposed Change: "While detailed procedures for text-extraction from various document formats are out of scope for this specification ..."
For reproducible Content-ID-Text components, the definition of the extraction tool and its version is part of the normative specification. It might be updated in a future version of the ISCC (ideally only after compatibility tests). Because of the comprehensive text normalization (especially with the upcoming ISCC v1.1), the impact of different text extraction tools or versions should be minimal. Even if two implementations of the ISCC generate slightly different Content-IDs, this is not regarded as a failure to produce a valid ISCC code. The similarity-preserving nature of the component would still produce a match or near-duplicate match when comparing ISCC codes.
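To make the similarity-preserving argument concrete, here is a minimal sketch of how a normalization step plus a similarity hash can absorb small extraction differences. It is an illustration only and not the normative ISCC algorithm: the function names `normalize`, `simhash64`, and `hamming_distance`, the toy normalization rules, the 3-word shingles, and the use of BLAKE2b are all assumptions made for this example.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Toy normalization (assumption): NFKC, lowercase, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def simhash64(text: str, width: int = 3) -> int:
    """Generic 64-bit simhash over sliding word n-grams of the normalized text."""
    words = normalize(text).split()
    # Build overlapping word shingles; very short texts yield a single shingle.
    grams = [
        " ".join(words[i:i + width])
        for i in range(max(1, len(words) - width + 1))
    ]
    counts = [0] * 64
    for gram in grams:
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each shingle votes +1 or -1 on every bit position.
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    # A bit is set in the result when the majority of shingles set it.
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    tool_a = "The quick  brown fox\njumps over the lazy dog near the river bank."
    tool_b = "the quick brown fox jumps over the lazy dog near the river bank."
    print(hamming_distance(simhash64(tool_a), simhash64(tool_b)))
```

In this sketch the two inputs differ only in case and whitespace, which the normalization removes, so both produce the same hash. Extraction differences that survive normalization would change only some shingles, and a comparison would then typically use a Hamming-distance threshold rather than exact equality.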
"the definition of the extraction tool/version is part of the normative specification"
Mandating a specific tool only works if you also tie it to a specific version of that tool (as you may be implying). But since software is known to have vulnerabilities that require systems to update, this approach is unreasonable and unacceptable.
Additionally, it would prevent innovation in this area, especially for complex formats such as PDF.