OpenSearch
OpenSearch copied to clipboard
Improve the validation on TikaDocTests
Describe the issue
https://github.com/opensearch-project/OpenSearch/blob/main/plugins/ingest-attachment/src/test/java/org/opensearch/ingest/attachment/TikaDocTests.java#L79
This test tries to parse a bunch of different documents and asserts whether the resulting parsed content in not null/empty - that's it.
String parsedContent = TikaImpl.parse(bytes, new Metadata(), -1);
assertNotNull(parsedContent);
assertFalse(parsedContent.isEmpty());
logger.debug("extracted content: {}", parsedContent);
Any dependency change in tika libraries or else which could possibly impact the content of parsed content will be untracked and can introduce unknown bugs.
Recommending storing of parsed content at present version for future comparison by introduction of test which verifies the same. Also, comparing it with the parsed content from an older version to ensure that the content has remain unchanged from older OpenSearch versions.
Related component
Indexing
Additional Details
Plugins
ingest-attachment
plugin in core.
@peternied I'de like to work on this issue.
thanks for picking this up @finnegancarroll. I've assigned to you.
Hi @sandeshkr419! Can I pick your brain a little more regarding this issue? Is there functionality which opensearch adds to apache tika that we hope to capture with these tests? That is to say, how would you recommend I avoid redundancy with tika's own unit tests? Thanks!
Hi @finnegancarroll!
We certainly don't want things to be redundant for sure. I have not been able to look into the tika package to check what testing has been available there.
Consider the scenario - where tika changes the end result of parsed content somehow in a later release and OpenSearch depends on the exact state of the resultant parsed content - is there a way to capture and identify it other than checking non-null value. In the present scenario, we may end up just upgrading the library because we will see no failures.
I understand this may be a futile exercise if we have other mechanisms that can capture it - somehow I am not aware of them at the moment.
Please take my comments with a pinch of salt that I don't have the expertise on this plugin. Probably we should tag in some experts of the area as well to fill in.