Does stanza implement SpaceAfter?

Open BramVanroy opened this issue 3 years ago • 6 comments

I have a use case where I need to know, for every token, whether it is preceded or followed by a space. I cannot find this information in the documentation, and from glancing over the source code there does not seem to be such an attribute.

Am I missing something? If it is not present, I would request it to be added as a feature.

BramVanroy avatar Apr 26 '21 14:04 BramVanroy

The following works reasonably well. It may not be perfect if your sentence has irregular spacing or has spaces at the front/back, so it works best on strip()'d text. It would be nice to have a space attribute on the tokens directly in Stanza, though.

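from typing import List

from stanza.models.common.doc import Sentence


def find_space_after(sentence: Sentence) -> List[bool]:
    """Return, for each token, whether it is followed by whitespace."""
    spaces = []
    prev_end = sentence.tokens[0].end_char
    for token in sentence.tokens[1:]:
        # A gap between the previous token's end and this token's start means whitespace.
        spaces.append(prev_end != token.start_char)
        prev_end = token.end_char

    # The last token has no successor, so check the final character of the sentence text.
    return spaces + [sentence.text[-1].isspace()]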

BramVanroy avatar Apr 26 '21 15:04 BramVanroy

I was just about to reply with an explanation of how to get it from the original text or the following tokens. I could definitely see adding it to the CoNLL output. Not sure about putting it on the tokens themselves, but it's something to consider.
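
For readers who want the "from the original text" route spelled out, here is a minimal sketch. It assumes only that each token carries `start_char` and `end_char` offsets into the original text, as used in the snippet above. The `Tok` class and the `trailing_whitespace` function are hypothetical stand-ins for illustration, not part of Stanza.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tok:
    # Hypothetical stand-in for a token; only the character offsets matter here.
    text: str
    start_char: int
    end_char: int


def trailing_whitespace(text: str, tokens: List[Tok]) -> List[str]:
    """Return the whitespace that follows each token in the original text."""
    result = []
    for i, tok in enumerate(tokens):
        # The gap runs to the next token's start, or to the end of the text
        # for the last token.
        next_start = tokens[i + 1].start_char if i + 1 < len(tokens) else len(text)
        gap = text[tok.end_char:next_start]
        # Treat the gap as trailing whitespace only if it is entirely whitespace.
        result.append(gap if gap.isspace() else "")
    return result
```

Unlike the boolean version, this keeps the actual whitespace characters, so tabs, newlines, and runs of several spaces can be told apart.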

AngledLuffa avatar Apr 26 '21 15:04 AngledLuffa

For reference, in spaCy every token has a .whitespace_ attribute that contains the trailing whitespace character(s), if present, as well as .text_with_ws, which is the token text including the trailing whitespace.

For CoNLL this seems particularly useful indeed, where it could/should then be added to the misc field. This has already been implemented in my spacy_conll library through spaCy's wrappers (spacy_stanza and the like). Having such properties natively in stanza would be useful, too, I think.
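
As a rough illustration of what adding this to the misc field could look like, the sketch below follows the CoNLL-U convention that MISC attributes are joined with `|`, an empty field is written as `_`, and a token not followed by whitespace is marked `SpaceAfter=No`. The function name `add_space_after` is hypothetical and is not taken from Stanza or spacy_conll.

```python
def add_space_after(misc: str, has_space_after: bool) -> str:
    """Return a CoNLL-U MISC value, adding SpaceAfter=No when no space follows."""
    # An empty MISC column is written as "_" in CoNLL-U.
    attrs = [] if misc in ("", "_") else misc.split("|")
    if not has_space_after and "SpaceAfter=No" not in attrs:
        attrs.append("SpaceAfter=No")
    # Fall back to "_" when there are no attributes at all.
    return "|".join(attrs) or "_"
```

Tokens that are followed by whitespace are left unmarked, which matches the convention that a missing `SpaceAfter` attribute means a space follows.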

BramVanroy avatar Apr 26 '21 16:04 BramVanroy

And CoreNLP has this with the (perhaps badly named) BeforeAnnotation and AfterAnnotation 🙂

manning avatar May 20 '21 00:05 manning

@AngledLuffa If of interest, I can do a first minor addition to doc2conll_text, adding SpaceAfter attributes in the misc field where applicable?

https://github.com/stanfordnlp/stanza/blob/c457a9309ad15c522e94230f919c25d1e7aebf64/stanza/utils/conll.py#L202

Or, if preferred, I can also add a token-level whitespace attribute but that would require a lot more work.

BramVanroy avatar Sep 29 '21 12:09 BramVanroy

Yes, that sounds great! (The minor work one, if that's what you're up for doing right now.) Thanks.

AngledLuffa avatar Sep 29 '21 16:09 AngledLuffa

Just looking through some old issues: the CoNLL-U output format of document objects now marks tokens that are not followed by whitespace with SpaceAfter=No (if that field is missing from the misc column, the token is followed by whitespace), so I believe this is finished.
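
To show how that convention can be consumed, here is a minimal sketch that rebuilds a sentence's surface text from its CoNLL-U lines. It assumes standard ten-column, tab-separated CoNLL-U input. The function name `detokenize` is hypothetical, and this is not how Stanza itself reads the format.

```python
def detokenize(conllu_sentence: str) -> str:
    """Rebuild surface text from one CoNLL-U sentence using SpaceAfter=No."""
    pieces = []
    skip_until = 0  # last word ID covered by the current multiword token
    for line in conllu_sentence.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        tok_id, form, misc = cols[0], cols[1], cols[9]
        if "." in tok_id:
            continue  # empty nodes have no surface form
        if "-" in tok_id:
            # Multiword token range such as "3-4": use its form and
            # skip the syntactic words it covers.
            skip_until = int(tok_id.split("-")[1])
        elif int(tok_id) <= skip_until:
            continue
        pieces.append(form)
        # A missing SpaceAfter=No means the token is followed by whitespace.
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    return "".join(pieces).rstrip()
```

This only restores a single space between tokens. The original whitespace characters (tabs, newlines, repeated spaces) cannot be recovered from `SpaceAfter=No` alone.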

AngledLuffa avatar Mar 15 '23 20:03 AngledLuffa