tika
tika copied to clipboard
TIKA-4254 - Fix non-idempotent unit test `TestMimeTypes#testJavaRegex`
Fixes https://issues.apache.org/jira/projects/TIKA/issues/TIKA-4254
Brief Description of the Bug
The test TestMimeTypes#testJavaRegex
is non-idempotent, as it passes in the first run but fails in the second run in the same environment. The source of the problem is that each test execution initializes a new media type (MimeType
) instance testType
(same problem for testType2
), and all media types across different test executions attempt to use the same name pattern "rtg_sst_grb_0\\.5\\.\\d{8}"
. Therefore, in the second execution of the test, the line this.repo.addPattern(testType, pattern, true);
will throw an error, since the name pattern is already used by the testType
instance initiated from the first test execution. Specifically, in the second run, the addGlob()
method of the Pattern
class will assert conflict patterns and throw aMimeTypeException
(line 123 in Patterns.java
).
Failure Message in the 2nd Test Run:
org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: rtg_sst_grb_0\.5\.\d{8}
at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
at org.apache.tika.mime.Patterns.add(Patterns.java:71)
at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
at org.apache.tika.mime.TestMimeTypes.testJavaRegex(TestMimeTypes.java:851)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
Reproduce
Use the NIOInspector
plugin that supports rerunning individual tests in the same environment:
cd tika-parsers/tika-parsers-standard/tika-parsers-standard-package
mvn edu.illinois:NIOInspector:rerun -Dtest=org.apache.tika.mime.TestMimeTypes#testJavaRegex
Proposed Fix
Declare testType
and testType2
as static variables and initialize them at class loading time. Therefore, repeated runs of testJavaRegex()
will not conflict each other. All tests pass and are idempotent after the fix.
Necessity of Fix
A fix is recommended as unit tests shall be idempotent, and state pollution shall be mitigated so that newly introduced tests do not fail in the future due to polluted shared states.
The repo
is refreshed with each unit test in the @BeforeEach
call, though. Is NIODetector respecting that?
The
repo
is refreshed with each unit test in the@BeforeEach
call, though. Is NIODetector respecting that?
@tballison Yes, NIOInspector uses the JUnit Jupiter engine and takes into account of all setup and teardown methods.
It seems that the patterns matcher of repo
is not reset by the following code in @BeforeEach
:
TikaConfig config = TikaConfig.getDefaultConfig();
repo = config.getMimeRepository();
Maybe I get it: repo = config.getMimeRepository();
isn't creating anything new, it's retrieving something that is changed later by the test? If my understanding is correct then it's a deeper problem.
getMimeRepository
I think it might the case. I wrote this dummy test, and it fails under surefire:
@Test
public void testResetRepo() throws Exception {
TikaConfig config0 = TikaConfig.getDefaultConfig();
MimeTypes repo0 = config0.getMimeRepository();
MimeType testType0 = new MimeType(MediaType.parse("baz/bar"));
String pattern = "rtg_sst_grb_0\\.5\\.\\d{9}";
repo0.addPattern(testType0, pattern, true);
TikaConfig config1 = TikaConfig.getDefaultConfig();
MimeTypes repo1 = config1.getMimeRepository();
MimeType testType1 = new MimeType(MediaType.parse("baz/bar"));
repo1.addPattern(testType1, pattern, true);
}
Error Message:
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.857 s <<< FAILURE! -- in org.apache.tika.mime.TestMimeTypes
[ERROR] org.apache.tika.mime.TestMimeTypes.testResetRepo -- Time elapsed: 0.846 s <<< ERROR!
org.apache.tika.mime.MimeTypeException: Conflicting glob pattern: rtg_sst_grb_0\.5\.\d{9}
at org.apache.tika.mime.Patterns.addGlob(Patterns.java:123)
at org.apache.tika.mime.Patterns.add(Patterns.java:71)
at org.apache.tika.mime.MimeTypes.addPattern(MimeTypes.java:450)
at org.apache.tika.mime.TestMimeTypes.testResetRepo(TestMimeTypes.java:873)
...
@THausherr @tballison I confirmed that the two lines in @BeforeEach
does not create a new repo if one exists from a previous test run:
TikaConfig config = TikaConfig.getDefaultConfig();
repo = config.getMimeRepository();
TikaConfig.getDefaultConfig()
simply calls the default TikaConfig()
constructor (https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java#L390).
When the system property 'tika.config'
and the environment variable 'TIKA_CONFIG'
are both not set, the mimeTypes
field (accessible by getMimeRepository()
- which is repo
in our context) of the constructed config will be getDefaultMimeTypes(getContextClassLoader())
(https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java#L246).
Now take a look at getDefaultMimeTypes()
- when a classloader is given (getContextClassLoader()
in our context), it first tries to retrieve from a HashMap via CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader);
(https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L150). Notice that CLASSLOADER_SPECIFIC_DEFAULT_TYPES
is not an instance variable, but a static HashMap
.
So in the first test execution, the CLASSLOADER_SPECIFIC_DEFAULT_TYPES
is empty, so types
after the line types = CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader);
will be null
, and is later initialized by MimeTypesFactory.create()
as desired. After this, the initialized types
is put to the static CLASSLOADER_SPECIFIC_DEFAULT_TYPES
map (https://github.com/apache/tika/blob/b068e4290ad311b1e5f1ddaa6afa40be9e7bd797/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L166).
Now in the second test execution, the CLASSLOADER_SPECIFIC_DEFAULT_TYPES
already has the key of the context class loader, with corresponding types
initialized from the previous run. So CLASSLOADER_SPECIFIC_DEFAULT_TYPES.get(classLoader)
will return such initialized object directly. In other words, repo
would be the same object across repeated test runs.
I think the essential idea of CLASSLOADER_SPECIFIC_DEFAULT_TYPES
is 1-to-1 map between classloaders and default types, so this implementation does not seem buggy for me, but please confirm.