dkpro-core

Feature request: Rule-based Annotator for Cardinal/Ordinal

Open alaindesilets opened this issue 5 years ago • 15 comments

While DKPro has UIMA types for Cardinal and Ordinal, it seems there are no annotators that can produce them.

So I implemented my own CardOrdAnnotator for English based on the Stanford NLP QuantifiableEntityNormalizer class.

If you are interested, I could roll that into dkpro-core-api-ner-asl, or whatever module you think is appropriate.

I attach the classes and tests that I wrote for that. Note that you won't be able to run them as they use some utilities that I wrote for myself, but it should give you an idea of how they work.

Basically, the annotator uses a class CardOrdParser, which I wrote based on QuantifiableEntityNormalizer. This means that the annotator would have to be GPLed.

Note that at the moment, the parser is only available for English, but it would probably be relatively easy to implement it for other languages. To do that, however, we would have to rewrite (or extend) QuantifiableEntityNormalizer, because its current implementation uses static variables to store the words for cardinals and ordinals (e.g. "first", "one", etc.). As a result, you cannot have different instances of QuantifiableEntityNormalizer for different languages. I guess we could rewrite QuantifiableEntityNormalizer altogether (using its code as "inspiration"). I am not sure whether that would be sufficient to remove the GPL constraint on CardOrdParser.
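To illustrate the static-variable problem described above, here is a minimal sketch of the alternative: the number words are held in instance fields, so one parser instance can exist per language. This is not the attached CardOrdParser and not CoreNLP code; the class name `NumberWordParser`, its methods, and the tiny word lists are hypothetical and only cover single-word numbers.

```java
import java.util.Locale;
import java.util.Map;
import java.util.OptionalLong;

/** Hypothetical sketch: per-language number words stored in instance fields, not statics. */
final class NumberWordParser {
    private final Map<String, Long> cardinals;
    private final Map<String, Long> ordinals;

    NumberWordParser(Map<String, Long> cardinals, Map<String, Long> ordinals) {
        // Defensive copies keep each instance independent of its caller.
        this.cardinals = Map.copyOf(cardinals);
        this.ordinals = Map.copyOf(ordinals);
    }

    /** Returns the value of a single cardinal word, or empty if the word is unknown. */
    OptionalLong parseCardinal(String token) {
        return lookup(cardinals, token);
    }

    /** Returns the rank of a single ordinal word, or empty if the word is unknown. */
    OptionalLong parseOrdinal(String token) {
        return lookup(ordinals, token);
    }

    private static OptionalLong lookup(Map<String, Long> lexicon, String token) {
        Long value = lexicon.get(token.toLowerCase(Locale.ROOT));
        return value == null ? OptionalLong.empty() : OptionalLong.of(value);
    }

    /** Illustrative English lexicon (deliberately incomplete). */
    static NumberWordParser english() {
        return new NumberWordParser(
                Map.of("one", 1L, "two", 2L, "three", 3L),
                Map.of("first", 1L, "second", 2L, "third", 3L));
    }

    /** Illustrative French lexicon (deliberately incomplete). */
    static NumberWordParser french() {
        return new NumberWordParser(
                Map.of("un", 1L, "deux", 2L, "trois", 3L),
                Map.of("premier", 1L, "deuxième", 2L, "troisième", 3L));
    }
}
```

With this layout, `NumberWordParser.english()` and `NumberWordParser.french()` can be used side by side in the same JVM, which is what the static word lists prevent.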

Let me know if you are interested.

CardOrdAnnotator_files.zip

alaindesilets avatar Nov 21 '19 13:11 alaindesilets

Is GPL vs ASL an issue for you?

If you run a compatible POS tagger before the CoreNlpNamedEntityRecognizer (i.e. the CoreNlpPosTagger), then you can also get e.g. ORDINAL tags:

    @Test
    public void thatOrdinalNumbersAreRecognized() throws Exception
    {
        JCas jcas = runTest("en", "John made the second place in the run .");
        
        String[] ne = {
                "[  0,  4]Person(PERSON) (John)",
                "[ 14, 20]NamedEntity(ORDINAL) (second)" };

        AssertAnnotations.assertNamedEntity(ne, select(jcas, NamedEntity.class));
    }

    private JCas runTest(String language, String testDocument)
        throws Exception
    {
        AnalysisEngineDescription engine = createEngineDescription(
                createEngineDescription(CoreNlpPosTagger.class),
                createEngineDescription(CoreNlpNamedEntityRecognizer.class));

        return TestRunner.runTest(engine, language, testDocument);
    }

reckart avatar Nov 21 '19 14:11 reckart

On Thu, Nov 21, 2019 at 9:58 AM Richard Eckart de Castilho < [email protected]> wrote:

> Is GPL vs ASL an issue for you?

Not for me. But it could be an issue for other users of this annotator.

> If you run a compatible POS tagger before the CoreNlpNamedEntityRecognizer (i.e. the CoreNlpPosTagger), then you can also get e.g. ORDINAL tags:

What about Cardinal?

alaindesilets avatar Nov 21 '19 16:11 alaindesilets

        JCas jcas = runTest("en", "John bought one hundred laptops .");
        
        String[] ne = {
                "[  0,  4]Person(PERSON) (John)",
                "[ 12, 15]NamedEntity(NUMBER) (one)",
                "[ 16, 23]NamedEntity(NUMBER) (hundred)" };

Looks like they are simply tagged as NUMBER. I'm not sure if CARDINAL is even produced by CoreNLP - references to it never seem to be assignments.

reckart avatar Nov 21 '19 18:11 reckart

On Thu, Nov 21, 2019 at 1:03 PM Richard Eckart de Castilho < [email protected]> wrote:

>     JCas jcas = runTest("en", "John bought one hundred laptops .");
>
>     String[] ne = {
>             "[  0,  4]Person(PERSON) (John)",
>             "[ 12, 15]NamedEntity(NUMBER) (one)",
>             "[ 16, 23]NamedEntity(NUMBER) (hundred)" };
>
> Looks like they are simply tagged as NUMBER. I'm not sure if CARDINAL is even produced by CoreNLP - references to it (https://github.com/stanfordnlp/CoreNLP/search?p=1&q=CARDINAL&type=&utf8=%E2%9C%93) never seem to be assignments.

Would float numbers like "2.1" in "I bought 2.1 kg of meat" also be tagged as NUMBER? I am looking for something that would specifically tag integers.

Alain
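One way to narrow NUMBER annotations down to integers is to post-filter on the covered text. The sketch below is plain Java and does not call CoreNLP or DKPro; the class `IntegerTokenFilter` and its method are hypothetical, and it assumes digit-based tokens with English-style "," thousands separators. Number words such as "one" would need a word lexicon instead.

```java
import java.util.OptionalLong;
import java.util.regex.Pattern;

/** Hypothetical post-filter: keeps only number tokens that denote integers. */
final class IntegerTokenFilter {
    // Optional sign, then either comma-grouped digits ("1,000") or plain digits; no decimal part.
    private static final Pattern INTEGER =
            Pattern.compile("[+-]?(\\d{1,3}(,\\d{3})+|\\d+)");

    /** Returns the integer value of the token text, or empty for floats and non-numbers. */
    static OptionalLong asInteger(String coveredText) {
        String text = coveredText.trim();
        if (!INTEGER.matcher(text).matches()) {
            return OptionalLong.empty();
        }
        try {
            return OptionalLong.of(Long.parseLong(text.replace(",", "")));
        } catch (NumberFormatException e) {
            // Digits only, but too large to fit into a long.
            return OptionalLong.empty();
        }
    }
}
```

Under these assumptions, "2.1" is rejected while "100" and "1,000" are accepted, so an annotator could re-tag only the accepted tokens as Cardinal.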


alaindesilets avatar Nov 21 '19 21:11 alaindesilets

For this, you could look into RUTA patterns: https://uima.apache.org/ruta.html

-Torsten


zesch avatar Nov 22 '19 08:11 zesch

I have committed the extended test which combines the Pos Tagger and the NER from CoreNLP here:

https://github.com/dkpro/dkpro-core/blob/4f8d74fdb003c90fdef8ccff7039a799ab471699/dkpro-core-corenlp-gpl/src/test/java/org/dkpro/core/corenlp/CoreNlpPosTaggerAndNamedEntityRecognizerTest.java

Feel free to play around with it and test additional number types (durations, percentages, etc.).

If I saw it correctly, the components you implemented work without requiring POS tags. If you would like to contribute them, it would be best if you create a PR. Since the classes depend on CoreNLP, the CoreNLP (GPL) module would be the best place for them.

I saw in the CoreNLP code, that there is also some support for normalizing quantities. If normalization is also something you are after, we might consider extending the DKPro Core type system with a way of storing such normalizations and to transfer them out of components such as CoreNLP which produce them.

reckart avatar Nov 22 '19 09:11 reckart

On Fri, Nov 22, 2019 at 04:14, Richard Eckart de Castilho < [email protected]> wrote:

> I have committed the extended test which combines the Pos Tagger and the NER from CoreNLP here:
>
> https://github.com/dkpro/dkpro-core/blob/4f8d74fdb003c90fdef8ccff7039a799ab471699/dkpro-core-corenlp-gpl/src/test/java/org/dkpro/core/corenlp/CoreNlpPosTaggerAndNamedEntityRecognizerTest.java
>
> Feel free to play around with it and test additional number types (durations, percentages, etc.).

Thx, I definitely will!

> If I saw it correctly, the components you implemented work without requiring POS tags.

That is correct. It might be an advantage, as my understanding is that POS tagging is relatively slow and requires a trained model. I will compare the performance of the two to confirm.

> If you would like to contribute them, it would be best if you create a PR.

I am interested. What is a PR?

> I saw in the CoreNLP code, that there is also some support for normalizing quantities. If normalization is also something you are after, we might consider extending the DKPro Core type system with a way of storing such normalizations and to transfer them out of components such as CoreNLP which produce them.

Yes! Note that I am interested in having a normalization attribute for all NamedEntity types, not just quantities. I don't know if the UIMA type system can support that, because the type of this normalization attribute will depend on the type of NamedEntity. For Cardinal and Ordinal it should be a Long, while for other quantities it should be a Double. For Location, Person, Date and Time, probably a String.

The problem with the UIMA type system is that, afaik, it would not allow you to define an Object attribute in NamedEntity which could be overridden to more specific types in subclasses. This is one of my pet peeves about the type system. I don't know what possessed the good UIMA people to cook up their own very limited type system instead of just allowing features to be arbitrary Java classes.

alaindesilets avatar Nov 22 '19 11:11 alaindesilets

I'd probably simply use a string feature even for numeric/boolean values...

IMHO having an "Object" attribute also isn't a great solution because it would also require type-casting.

The equivalent to an "Object" attribute in UIMA would be a "Feature Structure"-type attribute which could then point to e.g. a to-be-defined "DoubleValue" Feature structure which simply has a feature "value" of the type "double".
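As a rough illustration of the first option above (a single string feature, parsed on demand), here is a plain-Java sketch. It does not use the UIMA API; `NormalizedValue` is a hypothetical wrapper around whatever string the feature would hold, and the typed accessors show where the conversion (instead of a type cast) would happen.

```java
import java.util.Objects;
import java.util.Optional;

/** Hypothetical wrapper around a string-valued normalization feature. */
final class NormalizedValue {
    private final String raw;

    NormalizedValue(String raw) {
        this.raw = Objects.requireNonNull(raw, "raw");
    }

    /** The value exactly as stored, e.g. for Location or Person. */
    String asString() {
        return raw;
    }

    /** Integer view, e.g. for Cardinal or Ordinal; empty if the string is not an integer. */
    Optional<Long> asLong() {
        try {
            return Optional.of(Long.valueOf(raw));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    /** Floating-point view, e.g. for other quantities; empty if the string is not numeric. */
    Optional<Double> asDouble() {
        try {
            return Optional.of(Double.valueOf(raw));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }
}
```

The trade-off is that type errors surface at read time as an empty result instead of at compile time, which is similar in spirit to the casting concern raised for an "Object" attribute.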

UIMAv3 also has new features to store custom objects in the CAS - but I have never tried this out so far: https://uima.apache.org/d/uimaj-current/version_3_users_guide.html#uv3.custom_java_objects - might be worth a look.

As for contributing via a PR - see here: https://dkpro.github.io/contributing/

reckart avatar Nov 22 '19 12:11 reckart

I wonder why time expressions, monetary expressions and so on are even considered as named entities / handled by the CoreNLP NER tools. They are not really entities... in particular not named ones.

reckart avatar Nov 22 '19 12:11 reckart

They are annotated in Ontonotes at least.

jcklie avatar Nov 22 '19 12:11 jcklie

On Fri, Nov 22, 2019 at 7:10 AM Richard Eckart de Castilho < [email protected]> wrote:

> I'd probably simply use a string feature even for numeric/boolean values...

That would do.

While you are at it, you might want to define another attribute called, say, altNormalizations, which would be a list of Strings. I build lots of NLP apps that have a human in the loop. In those kinds of apps, it's often useful to be able to provide the user with alternative plausible interpretations of a piece of text.

For example, for a location "Victoria", there are literally dozens of likely places with that name. Usually you can tell from the application context or the content of the document which one is referred to. But it is also useful to be able to provide alternative normalizations. Actually, even without a human in the loop, alternative location normalizations can be useful. For example, if I process a collection of documents that all have to do with the Ebola virus, I might conclude that a reference to "Victoria" refers to a location in Africa, even though that specific document by itself does not have sufficient information to conclude that (but the document collection as a whole does).
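A minimal sketch of how such alternatives might be ranked with collection-level context: each candidate carries a prior score, and a per-region boost derived from the document collection re-orders them. Everything here is hypothetical (the `Candidate` class, the scores, the boost map); the first entry of the result would be the main normalization and the rest would go into altNormalizations.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical ranking of alternative normalizations using collection-level context. */
final class AlternativeNormalizations {

    /** One candidate reading of a mention; the fields are illustrative only. */
    static final class Candidate {
        final String normalization;
        final String region;
        final double priorScore;

        Candidate(String normalization, String region, double priorScore) {
            this.normalization = normalization;
            this.region = region;
            this.priorScore = priorScore;
        }
    }

    /**
     * Orders candidates by prior score multiplied by the boost of their region.
     * Regions missing from the boost map keep their prior score (boost 1.0).
     */
    static List<String> rank(List<Candidate> candidates, Map<String, Double> regionBoost) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(
                (Candidate c) -> c.priorScore * regionBoost.getOrDefault(c.region, 1.0))
                .reversed());
        List<String> result = new ArrayList<>();
        for (Candidate c : sorted) {
            result.add(c.normalization);
        }
        return result;
    }
}
```

For instance, with made-up priors that favour "Victoria, British Columbia", passing a boost for "Africa" (as an Ebola-related collection might suggest) would move "Victoria, Seychelles" to the front while keeping the other readings available as alternatives.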

Note that alternative normalizations can be useful not only for "proper" named entities (Location, Person, Org, etc.) but also for quantities. For example, a relative Date like "next Tuesday" cannot be normalized without having first established a "reference date" (i.e. the date that you would get if you replaced "next Tuesday" by "today"). But figuring out the reference date can be tricky if it was not provided as part of the document's metadata. Even if a reference date IS provided in the document's metadata, there are scenarios where the reference date may change in the course of the document. For example in this case:

"A bunch of things happened last week. On Monday, etc.."

In this case, the reference date for that excerpt is last week, not "this week" (which would be the reference date that would have been provided in the document's metadata).
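The dependence on a reference date can be made concrete with `java.time`. This is a sketch under assumptions: "next Tuesday" is read as the first Tuesday strictly after the reference date, weeks start on Monday, and the class `RelativeDateResolver` is hypothetical.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;

/** Hypothetical resolver for relative date expressions against a reference date. */
final class RelativeDateResolver {

    /** "next <day>": the first such weekday strictly after the reference date. */
    static LocalDate nextWeekday(LocalDate referenceDate, DayOfWeek day) {
        return referenceDate.with(TemporalAdjusters.next(day));
    }

    /** "On <day>" within the week of the reference date (weeks assumed to start on Monday). */
    static LocalDate weekdayInWeekOf(LocalDate referenceDate, DayOfWeek day) {
        LocalDate monday = referenceDate.with(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY));
        return monday.plusDays(day.getValue() - 1L);
    }
}
```

With a document date of 2019-11-22, `nextWeekday(date, DayOfWeek.TUESDAY)` gives 2019-11-26. For the "last week ... On Monday" excerpt, the reference date would first be shifted with `date.minusWeeks(1)`, so `weekdayInWeekOf` resolves "Monday" to 2019-11-11 instead of 2019-11-18.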

> IMHO having an "Object" attribute also isn't a great solution because it would also require type-casting.
>
> The equivalent to an "Object" attribute in UIMA would be a "Feature Structure"-type attribute which could then point to e.g. a to-be-defined "DoubleValue" Feature structure which simply has a feature "value" of the type "double".

Very awkward in my opinion. I am sure the UIMA people had a good reason for inventing their own type system instead of just going with Java's, but I have never seen an explanation of the rationale.

> UIMAv3 also has new features to store custom objects in the CAS - but I have never tried this out so far: https://uima.apache.org/d/uimaj-current/version_3_users_guide.html#uv3.custom_java_objects - might be worth a look.

Cool! I have been craving this ever since I started working with UIMA 4 years ago.

> As for contributing via a PR - see here: https://dkpro.github.io/contributing/

Ah, PR = Pull Request. Yes, I am already familiar with that process.

I probably will get going on that some time in January.

Alain

alaindesilets avatar Nov 22 '19 13:11 alaindesilets

On Fri, Nov 22, 2019 at 7:27 AM Richard Eckart de Castilho < [email protected]> wrote:

> I wonder why time expressions, monetary expressions and so on are even considered as named entities / handled by the CoreNLP NER tools. They are not really entities... in particular not named ones.

Yeah, I have the same issue. It seems to be a common misunderstanding in the NLP community, not just the Stanford folks.

It seems the meaning of the term "Named Entity" has evolved to encompass all "things and concepts from the physical world".

Alain

alaindesilets avatar Nov 22 '19 13:11 alaindesilets

> Very awkward in my opinion. I am sure the UIMA people had a good reason for inventing their own type system instead of just going with Java's, but I have never seen an explanation of the rationale.

UIMA is supposed to be cross-platform. There is a C++ implementation provided by the Apache UIMA project. There are also some outdated Python bindings and the more recent DKPro Cassis library which implements the CAS in Python. So just using the full Java type system wouldn't really do.

reckart avatar Nov 22 '19 13:11 reckart

On Fri, Nov 22, 2019 at 08:29, Richard Eckart de Castilho < [email protected]> wrote:

> Very awkward in my opinion. I am sure the UIMA people had a good reason for inventing their own type system instead of just going with Java's, but I have never seen an explanation of the rationale.
>
> UIMA is supposed to be cross-platform. There is a C++ implementation provided by the Apache UIMA project. There are also some outdated Python bindings and the more recent DKPro Cassis library which implements the CAS in Python. So just using the full Java type system wouldn't really do.

I thought that might be the reason, but I wasn't sure since I had never heard of a non-Java implementation.

Too bad the Python bindings are outdated. There are lots of excellent Python NLP frameworks out there (Spacy in particular).

The next thing I wonder is why they didn't just make all attributes JSON serialization strings instead of forcing devs to learn a different, UIMA-specific (and in my view awkward) object serialization framework. I understand that serialization and deserialization incur an overhead, but you could cache the deserialized version so that this overhead is incurred only once per attribute.

I have written many annotation wrappers that use this approach and it was a lot easier to use than the UIMA feature structure system.
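The caching idea can be sketched as follows. This is not one of the wrappers mentioned above; `CachedAttribute` is a hypothetical class, and the `deserializer` function stands in for whatever JSON library call would turn the stored string into an object. It is left abstract here so the example stays self-contained.

```java
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical attribute that stores a serialized string and deserializes it at most once. */
final class CachedAttribute<T> {
    private final String serialized;
    private final Function<String, T> deserializer;
    private T cached;
    private boolean loaded;

    CachedAttribute(String serialized, Function<String, T> deserializer) {
        this.serialized = Objects.requireNonNull(serialized, "serialized");
        this.deserializer = Objects.requireNonNull(deserializer, "deserializer");
    }

    /** The string form, i.e. what would actually be stored in the annotation. */
    String serialized() {
        return serialized;
    }

    /** Deserializes on first access and returns the cached object on later calls. */
    synchronized T get() {
        if (!loaded) {
            cached = deserializer.apply(serialized);
            loaded = true;
        }
        return cached;
    }
}
```

A caller would construct it with the stored string and a parsing function, for example `new CachedAttribute<>("42", Long::valueOf)`, and repeated `get()` calls would then pay the parsing cost only once.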


alaindesilets avatar Nov 22 '19 14:11 alaindesilets

> Too bad the Python bindings are outdated. There are lots of excellent Python NLP frameworks out there (Spacy in particular).

That's why we have built DKPro Cassis :) We use it amongst other things to connect tools such as spacy to the UIMA-based INCEpTION annotation editor.

Wrt. object serialization - the best place to discuss this would be the UIMA user's mailing list.

reckart avatar Nov 22 '19 14:11 reckart