
Enhanced Dependencies Support

Open mahdiman opened this issue 4 years ago • 29 comments

  • Currently, stanza produces universal dependencies for many languages. It would be great if it could be extended to augment the resulting universal dependencies with enhancements (just like CoreNLP's Enhanced/Enhanced++/CCprocessed annotations).
  • More details can be found here and here. Will there be any limitation on adding such a new feature?

mahdiman avatar Jun 20 '20 21:06 mahdiman
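For context on what is being requested: enhanced UD adds extra edges on top of the basic tree (for example, propagating subjects across conjunctions and subtyping `conj` with the conjunction's lemma), recorded in the DEPS column of CoNLL-U. A hand-built illustration (not actual Stanza output):

```
# text = Sue and Paul run
# ID  FORM  HEAD  DEPREL  DEPS (enhanced)
1     Sue   4     nsubj   4:nsubj
2     and   3     cc      3:cc
3     Paul  1     conj    1:conj:and|4:nsubj
4     run   0     root    0:root
```

Note that in the enhanced graph, "Paul" has two incoming edges, which is why multi-head support comes up below.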

This would certainly be a useful feature to have. @AngledLuffa I am not familiar with the enhanced dependency implementation in CoreNLP - how difficult do you think this is?

yuhaozhang avatar Jun 23 '20 17:06 yuhaozhang

I'm not very familiar with the dependency parser implementation in stanza. Does it allow multiple connections for a dependent? If not, we would need to write a dependency converter or reuse the CoreNLP converter. Reusing sounds like the better option.

AngledLuffa avatar Jun 23 '20 17:06 AngledLuffa

@AngledLuffa It could be adapted to allow multiple connections per dependent, but it doesn't support that out of the box.

qipeng avatar Jun 23 '20 17:06 qipeng

It's a question of whether we would rather use the Java server for accessing the converter or allow multiple connections per dependent. I think reimplementing it in Python would be the worst possible solution.

AngledLuffa avatar Jun 23 '20 19:06 AngledLuffa

Why is that? Is it due to the complexity of the task itself? Complexity aside, I feel that a native Python implementation would be better integrated with the neural pipeline, since users would never need to leave the Stanza Python environment. Ideally we could have it as a processor that takes the depparse output and adds new annotations to the document.

yuhaozhang avatar Jun 23 '20 19:06 yuhaozhang

I was thinking in terms of having to repeat all of the logic involved in the java version

If the formalism changes or we come up with an improvement, we'd need to remember to redo it on both sides

AngledLuffa avatar Jun 23 '20 20:06 AngledLuffa

A more generalizable way of doing it would be to write the conversion as a sequence of rules which can be applied in both Java & Python, I suppose.

AngledLuffa avatar Jun 23 '20 20:06 AngledLuffa
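The rule-sequence idea might look something like the following minimal Python sketch. All names here are hypothetical (this is not Stanza or CoreNLP API), and only two of the UD enhancements are shown: each rule reads the basic dependency table and emits extra enhanced edges, so the same rule list could in principle be mirrored on the Java side.

```python
# Minimal sketch of a declarative rule pipeline for enhanced dependencies.
# Tokens are dicts with "id", "lemma", "head", "deprel"; rules return extra
# (head, dependent, relation) edges to add on top of the basic tree.

def subtype_conj(tokens):
    """Relabel conj edges as conj:<lemma of the coordinating conjunction>."""
    edges = []
    for tok in tokens:
        if tok["deprel"] == "conj":
            cc = next((t for t in tokens
                       if t["head"] == tok["id"] and t["deprel"] == "cc"), None)
            if cc:
                edges.append((tok["head"], tok["id"], "conj:" + cc["lemma"]))
    return edges

def propagate_conj_relations(tokens):
    """Give each conjunct a copy of its governor's core relation."""
    edges = []
    by_id = {t["id"]: t for t in tokens}
    for tok in tokens:
        if tok["deprel"] == "conj":
            gov = by_id[tok["head"]]
            if gov["deprel"] in ("nsubj", "obj"):
                edges.append((gov["head"], tok["id"], gov["deprel"]))
    return edges

RULES = [subtype_conj, propagate_conj_relations]

def enhance(tokens):
    extra = []
    for rule in RULES:
        extra.extend(rule(tokens))
    return extra

# "Sue and Paul run"
sent = [
    {"id": 1, "lemma": "Sue",  "head": 4, "deprel": "nsubj"},
    {"id": 2, "lemma": "and",  "head": 3, "deprel": "cc"},
    {"id": 3, "lemma": "Paul", "head": 1, "deprel": "conj"},
    {"id": 4, "lemma": "run",  "head": 0, "deprel": "root"},
]
print(enhance(sent))  # [(1, 3, 'conj:and'), (4, 3, 'nsubj')]
```

The real conversion is considerably more involved, but a flat list of rules like this is the kind of thing that could be kept in sync across two implementations.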

This is a great point. Maintenance could be an issue going forward. Does the CoreNLP converter support external dependency annotations? If so, in what format?

yuhaozhang avatar Jun 23 '20 20:06 yuhaozhang

Unfortunately, no, which is part of why the easiest solution by far would be to leverage the existing converter

AngledLuffa avatar Jun 23 '20 21:06 AngledLuffa

Looks like this is what we want? https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/trees/ud/UniversalEnhancer.java

Is there any way we can build a server interface for this UniversalEnhancer within CoreNLP, such that the server may be able to take in a CoNLL-U or json representation, and return an enhanced serialization of the graph?

yuhaozhang avatar Jun 26 '20 20:06 yuhaozhang

It's a little more complicated than that. In English, there's a more specialized version in trees.GrammaticalStructure.java:

```java
public List<TypedDependency> typedDependenciesEnhancedPlusPlus() {
  List<TypedDependency> tdl = typedDependencies(Extras.MAXIMAL);
  addEnhancements(tdl, UniversalEnglishGrammaticalStructure.ENHANCED_PLUS_PLUS_OPTIONS);
  return tdl;
}
```

This winds up calling some English-specific code in UniversalEnglishGrammaticalStructure.java:

```java
@Override
protected void addEnhancements(List<TypedDependency> list, EnhancementOptions options)
```

I believe there was an attempt at doing the same thing in Chinese, although I have no idea how good it is. I don't believe any other language has the specialized conversions.

Contacting Chris or Sebastian would get more information - I'll drop Sebastian a note and maybe ask Chris at my next meeting if we don't figure it out by then. At any rate, adding a way of doing this via the server is certainly doable.

AngledLuffa avatar Jun 26 '20 21:06 AngledLuffa

Yes sounds like a good plan to me. The UD enhanced dependency page does suggest a handful of language-independent rules for conversion, so it makes sense to have a language-independent conversion module in CoreNLP server going forward.

yuhaozhang avatar Jun 26 '20 21:06 yuhaozhang

I investigated this some, and it sounds like UniversalEnhancer could indeed be used to add enhancements to any language. However, it needs a language-specific list of relativizing pronouns. For example, in English, the list looks like:

```java
public static final String RELATIVIZING_WORD_REGEX = "(?i:that|what|which|who|whom|whose)";
```

Without that, the initial step would be impossible, and once that is done incorrectly several of the later steps would be negatively affected as well. In other words, you would still get some sort of result, but it wouldn't be nearly as useful. Is it possible to provide such a list for other languages?

There's also the question of whether the English and Chinese specific versions are better than the generic one. I'm sure it would be for English, but not so sure about Chinese.

AngledLuffa avatar Jul 27 '20 18:07 AngledLuffa
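For reference, the English pattern translates directly to Python. A small sketch of what the relativizer check amounts to, using `re.IGNORECASE` in place of Java's inline `(?i:...)` group:

```python
import re

# English relativizer list from UniversalEnglishGrammaticalStructure.java,
# rewritten as a Python regex with the case-insensitive flag.
RELATIVIZING_WORD_REGEX = re.compile(
    r"that|what|which|who|whom|whose", re.IGNORECASE)

def is_relativizer(word):
    """True if the word can introduce a relative clause in English."""
    return RELATIVIZING_WORD_REGEX.fullmatch(word) is not None

print(is_relativizer("Which"))  # True
print(is_relativizer("book"))   # False
```

Supplying an equivalent list for another language is the "is it possible to provide such a list" question above.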

Some treebanks, like the upcoming Chukchi treebank, also have enhanced dependencies in their annotation. It would be great to be able to train on those too.

ftyers avatar Oct 13 '20 15:10 ftyers

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 29 '20 18:12 stale[bot]

Alright, this is a really old issue, but I figured it would be nice to add this as a feature before the next release. What I did was work on a Python interface to the Java UniversalEnhancer code. However, there's a limitation: some languages, such as Chinese, don't have relative pronouns, so relative clauses can't be built with the mechanism used in this code:

https://en.wikipedia.org/wiki/Relative_clause#Chinese https://en.wikipedia.org/wiki/Relative_pronoun#Absence

Any suggestions on how to handle ref dependencies or add relative clauses there would be appreciated - this is not my strength.

AngledLuffa avatar Apr 06 '21 20:04 AngledLuffa

The interface is here; I will work on the Python side of it as well:

https://github.com/stanfordnlp/CoreNLP/pull/1148/commits/c548e6c6a7c20fa2fc82d35fe399ccc887c78ec9

AngledLuffa avatar Apr 06 '21 20:04 AngledLuffa

This is still a work in progress, with some more testing etc necessary, but it should be usable now:

Java side (needs to be recompiled):

https://github.com/stanfordnlp/corenlp/tree/ud_enhancer https://github.com/stanfordnlp/CoreNLP/blob/ud_enhancer/src/edu/stanford/nlp/trees/ud/ProcessUniversalEnhancerRequest.java

Python interface:

https://github.com/stanfordnlp/stanza/tree/ud_enhancer_v2 https://github.com/stanfordnlp/stanza/blob/ud_enhancer_v2/stanza/server/ud_enhancer.py

If any of those branches stop existing in the future, it's because they've been merged into dev or possibly even main.

AngledLuffa avatar Apr 07 '21 07:04 AngledLuffa

Currently we do not support enhancing relative clauses in Chinese, FWIW. For English, it will use that/which/etc.; for Chinese, it will skip relative clauses; and for other languages, it will complain and ask you to provide a regex that matches relative-clause pronouns.

AngledLuffa avatar Apr 07 '21 07:04 AngledLuffa
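For the "other languages" case, the caller supplies the relativizer regex themselves. A sketch of assembling one (the word list here is an illustrative placeholder, not a vetted linguistic resource), in the same style as CoreNLP's English RELATIVIZING_WORD_REGEX:

```python
import re

# Hypothetical relativizer list for some language; replace it with a
# linguistically vetted set before real use.
relativizers = ["que", "qui", "dont", "lequel"]

# Build a case-insensitive alternation, mirroring the shape of the
# English pattern "(?i:that|what|which|...)".
pronouns_pattern = "(?i:" + "|".join(re.escape(w) for w in relativizers) + ")"
print(pronouns_pattern)  # (?i:que|qui|dont|lequel)

# It would then be handed to the enhancer, roughly along the lines of
# (requires CoreNLP on the classpath, so not run here):
# ud_enhancer.process_doc(doc, pronouns_pattern=pronouns_pattern)
```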

The above links show a 404 Not Found error. Can you please give the exact URLs, or the changes that need to be made in order to get enhanced dependencies?

CJPJ007 avatar May 28 '21 07:05 CJPJ007

The CoreNLP changes are now included in the most recent release:

https://stanfordnlp.github.io/CoreNLP/

As of this comment, the stanza changes are in the dev branch:

https://github.com/stanfordnlp/stanza/tree/dev

I expect to release a new version of stanza (including these changes) in the next week or so.

AngledLuffa avatar May 28 '21 12:05 AngledLuffa

Ok Thanks

CJPJ007 avatar May 30 '21 16:05 CJPJ007

Hi @AngledLuffa, when trying to use the enhancer:

```python
import stanza.server.ud_enhancer as ud_enhancer
ud_enhancer.process_doc(doc, language="en")
```

The following error is reported:

```
/usr/local/lib/python3.7/dist-packages/stanza/server/ud_enhancer.py in process_doc(doc, language, pronouns_pattern)
     49 def process_doc(doc, language=None, pronouns_pattern=None):
     50     request = build_enhancer_request(doc, language, pronouns_pattern)
---> 51     return send_request(request, Document, ENHANCER_JAVA, "$CLASSPATH")
     52 
     53 class UniversalEnhancer(JavaProtobufContext):

/usr/local/lib/python3.7/dist-packages/stanza/server/java_protobuf_requests.py in send_request(request, response_type, java_main, classpath)
     12                           input=request.SerializeToString(),
     13                           stdout=subprocess.PIPE,
---> 14                           check=True)
     15     response = response_type()
     16     response.ParseFromString(pipe.stdout)

/usr/lib/python3.7/subprocess.py in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    486         kwargs['stderr'] = PIPE
    487 
--> 488     with Popen(*popenargs, **kwargs) as process:
    489         try:
    490             stdout, stderr = process.communicate(input, timeout=timeout)

/usr/lib/python3.7/subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors, text)
    798                                 c2pread, c2pwrite,
    799                                 errread, errwrite,
--> 800                                 restore_signals, start_new_session)
    801         except:
    802             # Cleanup if the child failed starting.

/usr/lib/python3.7/subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, start_new_session)
   1480                             errread, errwrite,
   1481                             errpipe_read, errpipe_write,
-> 1482                             restore_signals, start_new_session, preexec_fn)
   1483                     self._child_created = True
   1484                 finally:

TypeError: expected str, bytes or os.PathLike object, not NoneType
```

Any hints on how to solve these? Thanks!

victoryhb avatar Sep 20 '21 03:09 victoryhb

What version of CoreNLP do you have? We're missing the other end of the UD Enhancer in the most recent release of CoreNLP, but it's in the dev branch and you could install that instead. Alternatively, it is going to be in the next release of CoreNLP, which should be available within a week anyway.

AngledLuffa avatar Sep 20 '21 04:09 AngledLuffa

I am using CoreNLP 4.2.2, which I thought had already incorporated the features. Will wait for the next release then. Thank you!

victoryhb avatar Sep 20 '21 07:09 victoryhb

Getting the same error as reported by victoryhb above while using ud_enhancer. Do we need the CoreNLP server running locally to get enhanced dependencies using Stanza?

swatiagarwal-s avatar Feb 01 '22 12:02 swatiagarwal-s

Which version of CoreNLP are you using? You do not need the server running at all, but you do need a recent version of CoreNLP in your classpath.

AngledLuffa avatar Feb 01 '22 17:02 AngledLuffa
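The `TypeError ... not NoneType` in the traceback above is consistent with the classpath resolving to `None` because the `CLASSPATH` environment variable is unset. A quick sanity check before calling the enhancer (the install path is an example; adjust it to wherever your CoreNLP jars live):

```python
import os

# The enhancer shells out to Java and resolves "$CLASSPATH" from the
# environment, so the CoreNLP jars must be findable there.
corenlp_home = os.path.expanduser("~/stanford-corenlp-4.4.0")  # example path
os.environ["CLASSPATH"] = os.path.join(corenlp_home, "*")

assert os.environ.get("CLASSPATH"), "CLASSPATH is not set"
print(os.environ["CLASSPATH"])
```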

I have CoreNLP 4.4.0 and was using it in Colab. process_doc gives the error while running the subprocess; however, I was able to use it the way it's shown in ud_enhancer.py:

```python
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse')
with ud_enhancer.UniversalEnhancer(language="en") as enhancer:
    depparseFromStanza = nlp("This is a test")
    depparseEnhanced = enhancer.process(depparseFromStanza)
```

swatiagarwal-s avatar Feb 02 '22 05:02 swatiagarwal-s

I don't really use colab for anything, but hopefully you can figure out how to make it work!

AngledLuffa avatar Feb 02 '22 05:02 AngledLuffa