dkpro-core icon indicating copy to clipboard operation
dkpro-core copied to clipboard

Improvement/make brat reader more forgiving take2

Open alaindesilets opened this issue 5 years ago • 53 comments

This is Take2 of my attempt to improve BratReader. Please ignore the previous pull request.

I am not done yet but I am struggling with the method Mapping.merge(). Right now, there is a line at the start of the method:

boolean justCustomMapping = true;

If you leave it at true, all the tests pass, but only because the merge() method basically just returns a mapping that is equivalent to the first of the two arguments.

If you set the first line to false, a bunch of tests start failing and I have no idea why.

If someone can resolve that mystery, I should be OK for doing the rest of the work which is:

  • Making sure that BratReader assigns a catch-all type (ex: NamedEntity) to any unexpected brat label it encounters.

alaindesilets avatar Dec 22 '19 14:12 alaindesilets

Can one of the admins verify this patch?

ukp-svc-jenkins avatar Dec 22 '19 14:12 ukp-svc-jenkins

There is something wrong with this PR. I'm sure there shouldn't be over 3000 changed files in it.

reckart avatar Dec 22 '19 14:12 reckart

Changed target branch from 1.2.x to 1.12.x ;)

reckart avatar Dec 22 '19 14:12 reckart

Oups, sorry about that. I am such a cluts with git.

On Sun, Dec 22, 2019 at 9:33 AM Richard Eckart de Castilho < [email protected]> wrote:

Changed target branch from 1.2.x to 1.12.x ;)

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/pull/1448?email_source=notifications&email_token=AAIMA4ARIFYYAAKJSRLKLALQZ53EBA5CNFSM4J6L7542YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEHPRNGY#issuecomment-568268443, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIMA4BIM7VFY7MAZDIGWLDQZ53EBANCNFSM4J6L754Q .

alaindesilets avatar Dec 22 '19 14:12 alaindesilets

I figured out the problem with Mapping.merge(). As it turns out, I wasn't generating the default mappings correctly (I had the wrong order for the arguments when constructing TypeMapping instances).

I will continue with the rest of the work and let you know when the branch is ready for pulling.

alaindesilets avatar Dec 23 '19 15:12 alaindesilets

I have a question about the part regarding the whole globbing stuff and also your claim that IOTestRunner would add a .ann extension and thus work around the issues you are facing.

AFAIK, IOTestRunner does not add anything. The unit tests in BratReaderWriterTest call it with a specific .ann file - but it can be used with wildcards as well.

The normal use-case if you wanted to read multiple files would be to either use PARAM_SOURCE _LOCATION with a wildcard, e.g. annotations_folder/*.ann or to use PARAM_SOURCE _LOCATION and PARAM_PATTERNS together, e.g.

        CollectionReader reader = createReader(BratReader.class,
                BratReader.PARAM_SOURCE _LOCATION, "annotations_folder/*.ann");

        CollectionReader reader = createReader(BratReader.class,
                BratReader.PARAM_SOURCE _LOCATION, "annotations_folder",
                BratReader.PARAM_PATTERNS, "*.ann");

My understanding is that you want to avoid having to specify *.ann all the time - is that correct?

reckart avatar Dec 23 '19 15:12 reckart

On Mon, Dec 23, 2019 at 10:16 AM Richard Eckart de Castilho < [email protected]> wrote:

I have a question about the part regarding the whole globbing stuff and also your claim that IOTestRunner would add a .ann extension and thus work around the issues you are facing.

AFAIK, IOTestRunner does not add anything. The unit tests in BratReaderWriterTest call it with a specific .ann file.

I'll have a second look. But I am pretty sure that when I tried to write a test where I invoked testOneWay() with a reader that received a directory path in PARAM_SOURCE_LOCATION (without a wild card), the test passed. Yet, when I did the same test using my testReadWrite(), the test failed. I didn't completely follow execution to figure out exactly what testOneWay() was doing but I will do and report to you.

The normal use-case if you wanted to read multiple files would be to either use PARAM_SOURCE _LOCATION with a wildcard, e.g. annotations_folder/*.ann or to use PARAM_SOURCE _LOCATION and PARAM_PATTERNS together, e.g.

    CollectionReader reader = createReader(
            BratReader,
            BratReader, "annotations_folder",
            BratReader, "*.ann");

My understanding is that you want to avoid having to specify *.ann all the time - is that correct?

That is correct.

Alain

alaindesilets avatar Dec 23 '19 19:12 alaindesilets

AFAIK, IOTestRunner does not add anything. The unit tests in BratReaderWriterTest call it with a specific .ann file.

Hum... not sure how I had come to that conclusion but I confirm that if you pass a directory path without a *.ann pattern, the old BratReader raises an exception.

So I CAN write a failing test with testOneWay(). However, I then run into the problem that testOneWay() expects the aExpectedFile and aFile to be individual files, not directories. I tried to modify IOTestRunner so it checks whether or not those are individual files or directories, but I run into all sorts of difficulties.

Is it OK if I just write my own tests using my testReadWrite method()?

alaindesilets avatar Dec 23 '19 20:12 alaindesilets

I decided to bite the bullet and try to test my improvements using testOneWay(). I am pretty successful so far, but I have just discovered something that I think is a major design flaw in that method.

Basically, testOneWay() uses a location in the project's target folder for the PARAM_TARGET_LOCATION. This location will be exactly the same for every execution of a test with a given name. That means that successive execution of a same test ARE NOT INDENPENDANT of each other. I encountered that problem just now and it took me a while to find the problem.

Essentially, there was a bug in my code that caused an extra file to be written to PARAM_TARGET_LOCATION. My test failed because it checked the content of the PARAM_TARGET_LOCATION directory and found that it had this one unexpected file. I then fixed the bug, but the test still failed, because that file was left there from the previous run.

As a general rule, if I write a unit tests that needs to write to a directory, I make it write to a new temp directory everytime to avoid this sort of thing.

Is OK if I modify testOneWay accordingly?

alaindesilets avatar Dec 24 '19 11:12 alaindesilets

Basically, testOneWay() uses a location in the project's target folder for the PARAM_TARGET_LOCATION. This location will be exactly the same for every execution of a test with a given name. That means that successive execution of a same test ARE NOT INDENPENDANT of each other. I encountered that problem just now and it took me a while to find the problem.

This is intentional because often I need to look at the output after a test has run to verify by eye or to pick the output up and drop it into src/test/resources as reference data for the test.

Normally, it should be the DkproTestContext.getTestOutputFolder() which is invoked to generate the target location and that method removes any potentially previously existing data. If for some reason that method is not used, then I think that would be a bug in the test code. Best open a separate issue/PR for it if this is the case.

reckart avatar Dec 24 '19 17:12 reckart

Hi Richard,

I am literally tearing my hair out trying to test my changes using the existing IOTestRunner infrastructure. I have wasted the better part of the last two days trying to achieve that but no matter what changes I do to testOneWay or tesOneWay2 (and to a third testOneWay3() I created), I end up with some tests that are failing.

I give up for now and won't resume until January.

If you are OK with that, I would like to just test my changes using the simple testReadWrite runner I had written, and let you figure out how to merge that with the existing testOneWay(), testOneWay2() gizmos?

Have a pleasant holiday season.

On Tue, Dec 24, 2019 at 12:10 PM Richard Eckart de Castilho < [email protected]> wrote:

Basically, testOneWay() uses a location in the project's target folder for the PARAM_TARGET_LOCATION. This location will be exactly the same for every execution of a test with a given name. That means that successive execution of a same test ARE NOT INDENPENDANT of each other. I encountered that problem just now and it took me a while to find the problem.

This is intentional because often I need to look at the output after a test has run to verify by eye or to pick the output up and drop it into src/test/resources as reference data for the test.

Normally, it should be the DkproTestContext.getTestOutputFolder() which is invoked to generate the target location and that method removes any potentially previously existing data. If for some reason that method is not used, then I think that would be a bug in the test code. Best open a separate issue/PR for it if this is the case.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/pull/1448?email_source=notifications&email_token=AAIMA4BCUKCXX5VRKRSFF6LQ2I7BJA5CNFSM4J6L7542YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEHTOKTQ#issuecomment-568780110, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIMA4CEDA2ECPC5T537R5DQ2I7BJANCNFSM4J6L754Q .

alaindesilets avatar Dec 24 '19 18:12 alaindesilets

BratReader would already know mappings for the Annotations defined in the standard dkpro-core type system (ex: Person) If you provide a PARAM_MAPPING, those mappings would be ADDED to the default ones

Having a default mapping and a method to merge mappings is a good idea. It should be possible for the user to turn off the default mapping and instead just use a custom mapping. -> Use default mapping if the user does not provide a custom mapping. Allow user to obtain default mapping and to merge it with a custom mapping in case both should be used.

Making sure to pass the .ann file as opposed to the .txt file (the Reader would automatically converts it to .ann path) Making sure that the single file, or all the the .txt files in the directory have a corresponding .ann file (the Reader would automatically creates empty .ann files for orphan .txt files)

So my understanding of this is that you have a use-case where you have various texts in a folder, some of which have not (yet) been annotated using brat, so there are no .ann files. This is why you want to point the reader to the .txt files and in case there is no .ann file, then no annotations are loaded. I can related to that.

Adding *.ann at the end of the directory path (the Reader would add it automaticaly)

Considering your case above (if I got this correctly), then it would probably make more sense to add *.txt.

I guess the main reason why we didn't add a default PARAM_PATTERNS to all of the readers is, that this parameter is defined in ResourceCollectionReaderBase and uimaFIT so far does not support overriding parameter default values in subclasses.

By virtue of inheriting from ResourceCollectionReaderBase, the readers support:

  1. reading from the file system
  2. reading from the classpath
  3. reading from ZIP files
  4. reading using a custom resource loader via KEY_RESOURCE_RESOLVER, e.g from HDFS

Adding "*.ann" to a "directory path" would mean that we need some heuristic to figure out of a path is a directory - not all of these support a ".isDirectory()" call. To address this, the ResourceCollectionReaderBase could be changed to provide an overridable getDefaultPatterns() method which may be used if the user didn't explicitly set PARAM_PATTERNS. But then there'd still be the "problem" how to detect whether the location passed by the user to PARAM_SOURCE_LOCATION is already sufficient or not. E.g. if the user passes /some/path/file.txt or /some/path/**/*.txt, then the default patterns should automatically not be used - it would be annoying to explicitly have to set PARAM_PATTERNS to an empty list in such cases, wouldn't it? One option could be to try and read the source location as a file and if that fails consider the location just as a base location and append the pattern - unless of course the user has set PARAM_PATTERNS explicitly (i.e. getDefaultPatterns() would not be used) and then no auto-detection would occur and the patterns would always be used.

reckart avatar Dec 26 '19 21:12 reckart

On Thu, Dec 26, 2019 at 4:42 PM Richard Eckart de Castilho < [email protected]> wrote:

BratReader would already know mappings for the Annotations defined in the standard dkpro-core type system (ex: Person) If you provide a PARAM_MAPPING, those mappings would be ADDED to the default ones

Having a default mapping and a method to merge mappings is a good idea. It should be possible for the user to turn off the default mapping and instead just use a custom mapping. -> Use default mapping if the user does not provide a custom mapping. Allow user to obtain default mapping and to merge it with a custom mapping in case both should be used.

I was thinking more along the lines of doing the merging in such a way that the individual mapping provided in the custom PARAM_MAPPING take precedence over the ones defined in the default Mapping object.

I THINK I managed to implement that correctly. Basically, for the classes that store the mapping entries as a list (I think that's only the TypeMappings class), I put the entries for the PARAM_MAPPING first, ensuring that those will be encountered first when sequentially searching the list. The fact that the list can contain duplicate entrries for the same brat label does not seem to mater

For those classes that store the entries info in a Map<String,String>, I do the opposite, i.e., I store the entries for the default mappings first, and then for the PARAM_MAPPING. That way if there are common entries in both mappings, the PARAM_MAPPING entries will over-ride the default ones.

Making sure to pass the .ann file as opposed to the .txt file (the Reader would automatically converts it to .ann path) Making sure that the single file, or all the the .txt files in the directory have a corresponding .ann file (the Reader would automatically creates empty .ann files for orphan .txt files)

So my understanding of this is that you have a use-case where you have various texts in a folder, some of which have not (yet) been annotated using brat, so there are no .ann files. This is why you want to point the reader to the .txt files and in case there is no .ann file, then no annotations are loaded. I can related to that.

That's correct. All my corpora start out as un-annotated txt files. Even for corpora that have already been annotated, I often end up adding new un-annotated txt files to it afterwards.

Adding *.ann at the end of the directory path (the Reader would add it automaticaly)

Considering your case above (if I got this correctly), then it would probably make more sense to add *.txt.

Not sure exactly how the scan method works, but in the case of BratReader, it seems it never scans for .txt. It only ever scans for .ann files.

Also, as it turns out, adding *.ann at the end of a directory sourceLocation did not work. Instead, I had to add *.ann to the 'pattern' attribute of the BratReader.

I guess the main reason why we didn't add a default PARAM_PATTERNS to all of the readers is, that this parameter is defined in ResourceCollectionReaderBase and uimaFIT so far does not support overriding parameter default values in subclasses.

That's a drag.

By virtue of inheriting from ResourceCollectionReaderBase, the readers support:

  1. reading from the file system
  2. reading from the classpath
  3. reading from ZIP files
  4. reading using a custom resource loader via KEY_RESOURCE_RESOLVER, e.g from HDFS

Adding "*.ann" to a "directory path" would mean that we need some heuristic to figure out of a path is a directory - not all of these support a ".isDirectory()" call.

Yes, I found I had to implement a static method getSingleSourceLocationType(). This method returns an enum that can take 3 possible values: SINGLE_FILE, DIR, PATTERN (where pattern means a path with some * wildcards)

I put this method in BratReader for my convenience, but I suspect it belongs in a superclass.

To address this, the ResourceCollectionReaderBase could be changed to provide an overridable getDefaultPatterns() method which may be used if the user didn't explicitly set PARAM_PATTERNS. But then there'd still be the "problem" how to detect whether the location passed by the user to PARAM_SOURCE_LOCATION is already sufficient or not. E.g. if the user passes /some/path/file.txt or /some/path/**/*.txt, then the default patterns should automatically not be used - it would be annoying to explicitly have to set PARAM_PATTERNS to an empty list in such cases, wouldn't it? One option could be to try and read the source location as a file and if that fails consider the location just as a base location and append the pattern - unless of course the user has set PARAM_PATTERNS explicitly (i.e. getDefaultPatterns() would not be used) and then no auto-detection would occur and the patterns would always be used.

The heuristic I used is this:

  • If the sourceLocation matches "^.*\.[a-zA-Z0-9]$" (i.e. if it ends with an extension), then it is a SINGLE_FILE
  • Otherwise, if the sourceLocation contains an asterisk, then it is a PATTERN
  • Otherwise, it is a DIR

I will push what I have on Jan 3. I can't do it now because I messed up with git (again) and lost soe of my work. In frustration I decided it was time for me to take some time off for Xmas and won't work on it til after the new year.

alaindesilets avatar Dec 27 '19 13:12 alaindesilets

I have started looking into providing a more convenient and flexible assertj-style approach of writing IO unit tests which could replace the IOTestRunner. Currently a way to process and verify multiple files with this approach could look like this:

        CollectionReaderAssert.assertThat(
                        ImsCwbReader.class,
                        ImsCwbReader.PARAM_SOURCE_LOCATION, 
                                "src/test/resources/multiple/*.vrt")
                .usingWriter(
                        ImsCwbWriter.class)
                .asFiles()
                .allSatisfy(file -> assertThat(contentOf(file, UTF_8)).isEqualToNormalizingNewlines(
                        contentOf(new File("src/test/resources/multiple", file.getName()), UTF_8)));

reckart avatar Dec 29 '19 00:12 reckart

Why does BratWriter.PARAM_ENABLE_TYPE_MAPPINGS default to false? This ends up writing the fully qualified class name as the Brat label, which makes those annotation completely unreadable in the brat viewer.

Given that the BratWriter will fall back on using that fully qualified name if it can't find a mapping, it seems to me that it makes little sense to set this parameter to false. For sure, false should not be the default value.

alaindesilets avatar Jan 06 '20 20:01 alaindesilets

Why does BratWriter.PARAM_ENABLE_TYPE_MAPPINGS default to false? This ends up writing the fully qualified class name as the Brat label, which makes those annotation completely unreadable in the brat viewer.

Given that the BratWriter will fall back on using that fully qualified name if it can't find a mapping, it seems to me that it makes little sense to set this parameter to false. For sure, false should not be the default value.

Another argument for enabling mapping by default is that this is how the BratReader works.

Another thought. It seems to me that the default mappings for the reader and writer should be the same. But it seems that they are set by different bits of code in BratReader and BratWriter respectively.

alaindesilets avatar Jan 06 '20 20:01 alaindesilets

Why does BratWriter.PARAM_ENABLE_TYPE_MAPPINGS default to false? This ends up writing the fully qualified class name as the Brat label, which makes those annotation completely unreadable in the brat viewer.

The reason is that by mapping types, information may be lost - e.g. if the same class/tag is defined in two different packages. With the default settings, I aimed for doing a data round-trip as good as possible, not for displaying data in brat itself.

reckart avatar Jan 06 '20 22:01 reckart

@alaindesilets Did you have a look at the improved testing API that I have added? Check the conflicts in BratReaderWriterTest.

reckart avatar Jan 06 '20 22:01 reckart

On Mon, Jan 6, 2020 at 5:40 PM Richard Eckart de Castilho < [email protected]> wrote:

Why does BratWriter.PARAM_ENABLE_TYPE_MAPPINGS default to false? This ends up writing the fully qualified class name as the Brat label, which makes those annotation completely unreadable in the brat viewer.

The reason is that by mapping types, information may be lost - e.g. if the same class/tag is defined in two different packages. With the default settings, I aimed for doing a data round-trip as good as possible, not for displaying data in brat itself.

Making PARAM_ENABLE_MAPPING be false by default does not address the issue at all. It just sweeps it under the carpet by "discouraging" its use.

A better solution would be for the type checking to raise an exception when a tag can match more than one classes. This could be done as follows.

  • When trying to map a class to a label, don't stop when you have found a match. Keep going until the end and raise an exception if more than one labels apply.
  • Note that this means that finding the label for a class will be slower the first time, but afterwards, it will be cached (I think... if not, we should implement a cache).

Alain

alaindesilets avatar Jan 06 '20 22:01 alaindesilets

I looked at it briefly, but it actually looked more complicated than what was available before.

Maybe you could rewrite one of the new tests I pushed using this new approach and show me how it would make my life easier?

On Mon, Jan 6, 2020 at 5:41 PM Richard Eckart de Castilho < [email protected]> wrote:

@alaindesilets https://github.com/alaindesilets Did you have a look at the improved testing API that I have added? Check the conflicts in BratReaderWriterTest.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/pull/1448?email_source=notifications&email_token=AAIMA4BZHJCWALDOQMTZU7LQ4OXRDA5CNFSM4J6L7542YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEIHBGFI#issuecomment-571347733, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIMA4BKDOK7W5APPQ456ATQ4OXRDANCNFSM4J6L754Q .

alaindesilets avatar Jan 06 '20 22:01 alaindesilets

On Mon, Jan 6, 2020 at 5:54 PM Alain Désilets [email protected] wrote:

On Mon, Jan 6, 2020 at 5:40 PM Richard Eckart de Castilho < [email protected]> wrote:

Why does BratWriter.PARAM_ENABLE_TYPE_MAPPINGS default to false? This ends up writing the fully qualified class name as the Brat label, which makes those annotation completely unreadable in the brat viewer.

The reason is that by mapping types, information may be lost - e.g. if the same class/tag is defined in two different packages. With the default settings, I aimed for doing a data round-trip as good as possible, not for displaying data in brat itself.

Making PARAM_ENABLE_MAPPING be false by default does not address the issue at all. It just sweeps it under the carpet by "discouraging" its use.

Another argument for making it default to true is that people who just use BratReader and BratWriter to save JCases to file won't see any difference. Both BratReader and Writer will by default expect the .ann files to contain mapped labels.

But people who use the brat files for what they are actually intended for (i.e. loading into the Brat viewer) will see a positive difference in that the labels will be readable on the screen.

alaindesilets avatar Jan 06 '20 23:01 alaindesilets

I have converted test__SingleDirWithoutAnnFiles__AssumesEmptyAnnFiles to use ReaderAssert.

reckart avatar Jan 06 '20 23:01 reckart

Thx Richard. I will take a look at it and try to emulate it for the other new tests I wrote.

alaindesilets avatar Jan 07 '20 11:01 alaindesilets

I have a small homegrown class called FileGlob which I use to list or manipulate files using glob patterns. I found myself craving that class when implementing my improvements AND when writing tests for them. I have put it under dkpro-core-io-brat-asl, but as I start using your testing approach, I find I need to use it in dkpro-core-testing-asl as well. The problem is that:

  • I cannot make 'io-brat' be a dependency of 'testing' because that would introduce a circular reference
  • I cannot move FileGlob to 'testing', because 'io-brat' needs it and it does not (and probably shoud not) have 'testing' as a dependency.

What I need is to put it under a maven module that both 'testing' and 'io' can load as a dependency. What module would you suggest?

alaindesilets avatar Jan 07 '20 12:01 alaindesilets

I am looking at the ReaderAssert and WriterAssert methods and your example, and I like them.

One question though. Why does WriterAssert.run() not use SimplePipeline? From all the DKPro examples I see out there, this seems to be the most common way of invoking CollectionReaders and AnalysisEngines. As it stands now, I fear we may end up with discrepencies between the way that people typically invoke the Reader and Writers and the way that they are actually run in tests.

alaindesilets avatar Jan 07 '20 15:01 alaindesilets

One question though. Why does WriterAssert.run() not use SimplePipeline?

I wanted to be able to use WriterAssert with either a (partial) pipeline or manually constructed CASes as input without having to duplicate code. Using JCasIterator addresses that.

The behavior should be largely the same. There could be issues if a CAS-multiplier is being used or if the reader and writer need access to a shared resource manager. But these situations should be very very rare, in particular in unit tests. So I think it's pretty safe the way it is now. And if somebody hits these potential pitfalls further down the road, the implementation can be updated.

However, while working on the new testing code, I have thought about improving support for use-cases which require connecting multiple pipelines to each other directly within uimaFIT. Maybe I'll get back to that at a later stage.

reckart avatar Jan 07 '20 15:01 reckart

What I need is to put it under a maven module that both 'testing' and 'io' can load as a dependency. What module would you suggest?

For globbing, we usually use a Spring ResourcePatternResolver, in particular the PathMatchingResourcePatternResolver which is able to search on the file system, in ZIP/JAR files as well as on the classpath. The ResourceCollectionReaderBase largely handles the ResourcePatternResolver and subclasses then shouldn't need to bother anymore. Why do we need additional globbing code?

That said: the correct place to put such shared code should probably be the dkpro-core-api-io-asl module.

reckart avatar Jan 07 '20 15:01 reckart

On Tue, Jan 7, 2020 at 10:53 AM Richard Eckart de Castilho < [email protected]> wrote:

One question though. Why does WriterAssert.run() not use SimplePipeline?

I wanted to be able to use WriterAssert with either a (partial) pipeline or manually constructed CASes as input without having to duplicate code. Using JCasIterator addresses that.

Doesn't SimplePipeline allow you to provide a manually created JCas? If so, maybe you could have a method

.usingJCas(jcas)

Then, you invoke SimplePipeline.run(jcas) or SimplePipeline(), depending on whether or not a usingJCas() call was made. This ends up creating a very minor duplication (the two ways of invoking SImplePipeline), but it makes the test code closer to the the way that people run pipelines.

alaindesilets avatar Jan 07 '20 16:01 alaindesilets

On Tue, Jan 7, 2020 at 10:56 AM Richard Eckart de Castilho < [email protected]> wrote:

What I need is to put it under a maven module that both 'testing' and 'io' can load as a dependency. What module would you suggest?

For globbing, we usually use a Spring ResourcePatternResolver, in particular the PathMatchingResourcePatternResolver which is able to search on the file system, in ZIP/JAR files as well as on the classpath. The ResourceCollectionReaderBase largely handles the ResourcePatternResolver and subclasses then shouldn't need to bother anymore. Why do we need additional globbing code?

I have never used Spring before, so I don't know how to do this sort of thing.

Maybe you can send me some sample code that shows how to do the following:

  • List all the files that match a glob pattern
  • Delete all the files that match a glob pattern

Alain

alaindesilets avatar Jan 07 '20 16:01 alaindesilets

I just pushed some changes that impact DKProTestContext as well as ReaderAssert and WriterAssert. Can you take a look at them and make sure you are OK with that?

Basically I needed to decouple the input directory from the reference directory. The reason for this is that if I try to test a reader on a directory that contains a *-ref.ann file, it complains that this ann file does not have a .txt file. So I needed to be able to erase the *-ref.ann files before invoking the reader. But erasing that file would have erased it from src/test/resources.

So I modified BratReader.readingFrom() so that it

  • Copies the the input files to a location under the test context directory
  • Erases the *-ref.ann files from that test context directory
  • Sets that test context directory location as the SOURCE_LOCATION

The original src/test/resources location is still used as the reference location against which the outputs are compared.

In order to facilitate that, I changed DKProContext so that it allows you distinguish between:

  • The root of the test context directory
  • The location in the test context root where inputs are to be put
  • The location in the test context root where outputs are to be put

If you don't care about distinguishing between inputs and outputs, you can put files in the root.

alaindesilets avatar Jan 08 '20 12:01 alaindesilets