tika icon indicating copy to clipboard operation
tika copied to clipboard

TIKA-4254 - Fix non-idempotent unit test `TestMimeTypes#testJavaRegex`

Open kaiyaok2 opened this issue 9 months ago • 5 comments

Fixes https://issues.apache.org/jira/projects/TIKA/issues/TIKA-4254

Brief Description of the Bug

The test TestMimeTypes#testJavaRegex is non-idempotent, as it passes in the first run but fails in the second run in the same environment. The source of the problem is that each test execution initializes a new media type (MimeType) instance testType (same problem for testType2), and all media types across different test executions attempt to use the same name pattern "rtg_sst_grb_0\\.5\\.\\d{8}". Therefore, in the second execution of the test, the line this.repo.addPattern(testType, pattern, true); will throw an error, since the name pattern is already used by the testType instance initiated from the first test execution. Specifically, in the second run, the addGlob() method of the Pattern class will assert conflict patterns and throw aMimeTypeException(line 123 in Patterns.java).

Failure Message in the 2nd Test Run:

org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: rtg_sst_grb_0\.5\.\d{8}
	at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
	at org.apache.tika.mime.Patterns.add(Patterns.java:71)
	at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
	at org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)

Reproduce

Use the NIOInspector plugin that supports rerunning individual tests in the same environment:

cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package
mvn edu.illinois:NIOInspector:rerun -Dtest=org.apache.tika.mime.TestMimeTypes#testJavaRegex

Proposed Fix

Declare testType and testType2 as static variables and initialize them at class loading time. Therefore, repeated runs of testJavaRegex() will not conflict each other. All tests pass and are idempotent after the fix.

Necessity of Fix

A fix is recommended as unit tests shall be idempotent, and state pollution shall be mitigated so that newly introduced tests do not fail in the future due to polluted shared states.

kaiyaok2 avatar May 11 '24 08:05 kaiyaok2

The repo is refreshed with each unit test in the @BeforeEach call, though. Is NIODetector respecting that?

tballison avatar May 11 '24 10:05 tballison

The repo is refreshed with each unit test in the @BeforeEach call, though. Is NIODetector respecting that?

@tballison Yes, NIOInspector uses the JUnit Jupiter engine and takes into account of all setup and teardown methods.

It seems that the patterns matcher of repo is not reset by the following code in @BeforeEach:

TikaConfig config = TikaConfig.getDefaultConfig();
repo = config.getMimeRepository();

kaiyaok2 avatar May 11 '24 10:05 kaiyaok2

Maybe I get it: repo = config.getMimeRepository(); isn't creating anything new, it's retrieving something that is changed later by the test? If my understanding is correct then it's a deeper problem.

THausherr avatar May 11 '24 11:05 THausherr

getMimeRepository

I think it might the case. I wrote this dummy test, and it fails under surefire:

@Test
public void testResetRepo() throws Exception {
    TikaConfig config0 = TikaConfig.getDefaultConfig();
    MimeTypes repo0 = config0.getMimeRepository();
    MimeType testType0 = new MimeType(MediaType.parse("baz/bar"));
    String pattern = "rtg_sst_grb_0\\.5\\.\\d{9}";
    repo0.addPattern(testType0, pattern, true);

    TikaConfig config1 = TikaConfig.getDefaultConfig();
    MimeTypes repo1 = config1.getMimeRepository();
    MimeType testType1 = new MimeType(MediaType.parse("baz/bar"));
    repo1.addPattern(testType1, pattern, true);
}

Error Message:

[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.857 s <<< FAILURE! -- in org.apache.tika.mime.TestMimeTypes
[ERROR] org.apache.tika.mime.TestMimeTypes.testResetRepo -- Time elapsed: 0.846 s <<< ERROR!
org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: rtg_sst_grb_0\.5\.\d{9}
	at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
	at org.apache.tika.mime.Patterns.add(Patterns.java:71)
	at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
	at org.apache.tika.mime.TestMimeTypes.testResetRepo(TestMimeTypes.java:873)
        ...

kaiyaok2 avatar May 11 '24 11:05 kaiyaok2

@THausherr @tballison I confirmed that the two lines in @BeforeEach does not create a new repo if one exists from a previous test run:

TikaConfig config = TikaConfig.getDefaultConfig();
repo = config.getMimeRepository();

TikaConfig.getDefaultConfig() simply calls the default TikaConfig() constructor (https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java#L390).

When the system property 'tika.config' and the environment variable 'TIKA_CONFIG' are both not set, the mimeTypes field (accessible by getMimeRepository() - which is repo in our context) of the constructed config will be getDefaultMimeTypes(getContextClassLoader())(https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java#L246).

Now take a look at getDefaultMimeTypes() - when a classloader is given (getContextClassLoader() in our context), it first tries to retrieve from a HashMap via CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader); (https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L150). Notice that CLASSLOADER_SPECIFIC_DEFAULT_TYPES is not an instance variable, but a static HashMap.

So in the first test execution, the CLASSLOADER_SPECIFIC_DEFAULT_TYPES is empty, so types after the line types = CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader); will be null, and is later initialized by MimeTypesFactory.create() as desired. After this, the initialized types is put to the static CLASSLOADER_SPECIFIC_DEFAULT_TYPES map (https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L166).

Now in the second test execution, the CLASSLOADER_SPECIFIC_DEFAULT_TYPES already has the key of the context class loader, with corresponding types initialized from the previous run. So CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader) will return such initialized object directly. In other words, repo would be the same object across repeated test runs.

I think the essential idea of CLASSLOADER_SPECIFIC_DEFAULT_TYPES is 1-to-1 map between classloaders and default types, so this implementation does not seem buggy for me, but please confirm.

kaiyaok2 avatar May 11 '24 21:05 kaiyaok2