OpenSearch icon indicating copy to clipboard operation
OpenSearch copied to clipboard

Improve the validation on TikaDocTests

Open sandeshkr419 opened this issue 11 months ago • 5 comments

Describe the issue

https://github.com/opensearch-project/OpenSearch/blob/main/plugins/ingest-attachment/src/test/java/org/opensearch/ingest/attachment/TikaDocTests.java#L79

This test tries to parse a bunch of different documents and asserts whether the resulting parsed content in not null/empty - that's it.

            String parsedContent = TikaImpl.parse(bytes, new Metadata(), -1);
            assertNotNull(parsedContent);
            assertFalse(parsedContent.isEmpty());
            logger.debug("extracted content: {}", parsedContent);

Any dependency change in tika libraries or else which could possibly impact the content of parsed content will be untracked and can introduce unknown bugs.

Recommending storing of parsed content at present version for future comparison by introduction of test which verifies the same. Also, comparing it with the parsed content from an older version to ensure that the content has remain unchanged from older OpenSearch versions.

Related component

Indexing

Additional Details

Plugins ingest-attachment plugin in core.

sandeshkr419 avatar Mar 25 '24 02:03 sandeshkr419

[Triage - attendees 1 2 3 4 5 6 7 8] @sandeshkr419 Thanks for creating this issue

peternied avatar May 01 '24 15:05 peternied

@peternied I'de like to work on this issue.

finnegancarroll avatar May 01 '24 23:05 finnegancarroll

thanks for picking this up @finnegancarroll. I've assigned to you.

mch2 avatar May 01 '24 23:05 mch2

Hi @sandeshkr419! Can I pick your brain a little more regarding this issue? Is there functionality which opensearch adds to apache tika that we hope to capture with these tests? That is to say, how would you recommend I avoid redundancy with tika's own unit tests? Thanks!

finnegancarroll avatar May 08 '24 06:05 finnegancarroll

Hi @finnegancarroll!

We certainly don't want things to be redundant for sure. I have not been able to look into the tika package to check what testing has been available there.

Consider the scenario - where tika changes the end result of parsed content somehow in a later release and OpenSearch depends on the exact state of the resultant parsed content - is there a way to capture and identify it other than checking non-null value. In the present scenario, we may end up just upgrading the library because we will see no failures.

I understand this may be a futile exercise if we have other mechanisms that can capture it - somehow I am not aware of them at the moment.

Please take my comments with a pinch of salt that I don't have the expertise on this plugin. Probably we should tag in some experts of the area as well to fill in.

sandeshkr419 avatar May 08 '24 07:05 sandeshkr419