spaCy splits the sentence when I try to change the head of a root token

Open zsozso21 opened this issue 2 years ago • 2 comments

The first thing I found is that the experimental biaffine parser can produce parses with multiple roots within a sentence. (I considered reporting this as an issue, but dependency trees can have multiple tokens attached to the root. Unlike Universal Dependencies, some treebanks place no restriction on the number of tokens whose head is the root.) I then wanted to write a pipeline component that uses linguistic rules to select one of the root tokens and attaches the others to it. To do this, I used the setter of the token “head” attribute. My first unexpected observation was that in some cases this modification split the sentence in two. My second was that when I serialized and deserialized a Doc with pickle, the deserialized Doc contained two sentences instead of one.
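The re-rooting step described above can be sketched on a plain list of head indices, without spaCy. This is only an illustration, assuming a token counts as a root when its head is its own index. The helper names and the "keep the first root" default are hypothetical stand-ins for the actual linguistic rules, which are not given here.

```python
from typing import Callable, List


def find_roots(heads: List[int]) -> List[int]:
    """Return the indices of tokens that are their own head (roots)."""
    return [i for i, h in enumerate(heads) if h == i]


def attach_extra_roots(
    heads: List[int],
    choose: Callable[[List[int]], int] = lambda roots: roots[0],
) -> List[int]:
    """Keep one root (picked by `choose`) and attach all other roots to it.

    `choose` is a placeholder for the linguistic selection rule.
    """
    roots = find_roots(heads)
    if len(roots) <= 1:
        return list(heads)
    keep = choose(roots)
    # Only self-headed tokens other than `keep` get a new head.
    return [keep if (h == i and i != keep) else h for i, h in enumerate(heads)]


# Five tokens with two roots (indices 1 and 3).
heads = [1, 1, 3, 3, 3]
merged = attach_extra_roots(heads)
print(merged)  # [1, 1, 3, 1, 3]
```

After the merge, only token 1 is still its own head, so the list describes a single tree.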

How to reproduce the behaviour

There is a gist that reproduces the behaviors described above on a Hungarian example.

Your Environment

  • Operating System: Ubuntu 20.04.3 LTS
  • Python Version Used: 3.8.5
  • spaCy Version Used: 3.2.3
  • Environment Information: spacy-experimental: git+https://github.com/explosion/spacy-experimental.git@3eca6174f3d7a48e959c77a2560428b4412d3e91

zsozso21 avatar Mar 19 '22 15:03 zsozso21

We really appreciate the feedback about the experimental components! I don't think it's intentional to support sentences with multiple roots.

The spaCy Doc object supports only one head per token and one root per sentence. If you set the token attributes in Cython (as the component does), there are no automatic checks or adjustments, but if you set a token head in Python, spaCy automatically adjusts the parse in the background to keep the sentence boundaries and some related attributes in sync with it. The same adjustments are applied when you deserialize or unpickle a Doc.
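To see why "one root per sentence" turns a second root into a second sentence once boundaries are re-derived from the heads, here is a simplified model in plain Python. It is not spaCy's actual implementation. It assumes well-formed heads (no cycles) and that a root is a token whose head is its own index, and the function names are made up for this sketch.

```python
from typing import Dict, List


def root_of(heads: List[int], i: int) -> int:
    """Follow head links from token i up to its root (a self-headed token).

    Assumes the head list contains no cycles other than root self-loops.
    """
    while heads[i] != i:
        i = heads[i]
    return i


def sentence_starts(heads: List[int]) -> List[int]:
    """Treat the leftmost token under each root as a sentence start."""
    leftmost: Dict[int, int] = {}
    for i in range(len(heads)):
        # Tokens are visited left to right, so the first hit is the leftmost.
        leftmost.setdefault(root_of(heads, i), i)
    return sorted(leftmost.values())


two_roots = [1, 1, 3, 3, 3]   # roots at indices 1 and 3
one_root = [1, 1, 3, 1, 3]    # token 3 now attached to token 1

print(sentence_starts(two_roots))  # [0, 2]
print(sentence_starts(one_root))   # [0]
```

In this model, a parse that was written with two roots yields two sentence starts as soon as the boundaries are recomputed from the heads, which matches the behavior reported when setting a head from Python or unpickling.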

I see that there's a todo in the code related to this:

https://github.com/explosion/spacy-experimental/blob/b2d3813b511524e8bf4160f51e545d71d88bb316/spacy_experimental/biaffine_parser/arc_predicter.pyx#L192-L193

Daniël will probably have more to say about what's going on with the sentence boundaries and the parser algorithm.

adrianeboyd avatar Mar 21 '22 07:03 adrianeboyd

The minimum spanning tree decoder in the biaffine parser supports multiple heads, but the biaffine parser itself still needs to be constrained further to follow spaCy's rules and conventions.

Sentence boundaries are currently not set by the biaffine parser, because the parser itself relies on senter for sentence boundaries to keep the O(n^2) complexity of the pairwise bilinear layer from getting out of hand. I am currently working on a newer version of the parser that uses a lazy splitting strategy. Once that is in place, we could use the parser output for sentence boundaries.
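As a rough illustration of why splitting first matters, the number of token pairs scored grows with the square of the span length. The sketch below only counts pairs. The lengths are invented for the example, and `pairwise_scores` is a hypothetical helper, not part of the parser.

```python
from typing import List


def pairwise_scores(span_lengths: List[int]) -> int:
    """Count head/dependent pairs scored when each span is parsed separately."""
    return sum(n * n for n in span_lengths)


# Hypothetical document: 200 tokens, either as one span or as ten 20-token sentences.
whole_doc = pairwise_scores([200])
per_sentence = pairwise_scores([20] * 10)

print(whole_doc)     # 40000
print(per_sentence)  # 4000
```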

danieldk avatar Apr 05 '22 12:04 danieldk