stanza
Does stanza implement SpaceAfter?
I have a use case where I need to know, for every token, whether it has a space before/after it. I cannot find such information in the documentation and, from glancing over the source code, there does not seem to be such an attribute.
Am I missing something? If it is not present, I would request it to be added as a feature.
The following works reasonably well. It may not be perfect if your sentence is malformed in terms of the number of spaces or if it has spaces at the front/back; it works best on strip()'d text. It would be nice to have a space attribute on the tokens directly in Stanza, though.
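from typing import List
from stanza.models.common.doc import Sentence

def find_space_after(sentence: Sentence) -> List[bool]:
    spaces = []
    prev_end = sentence.tokens[0].end_char
    for token in sentence.tokens[1:]:
        # A gap between consecutive character offsets means whitespace in between.
        spaces.append(prev_end != token.start_char)
        prev_end = token.end_char
    # Last token: fall back to checking whether the sentence text ends in whitespace.
    return spaces + [sentence.text[-1].isspace()]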
I was just about to reply with an explanation of how to get it from the original text or the following tokens. I could definitely see adding it to the conll output. Not sure about putting it on the tokens themselves, but it's something to consider.
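As a rough illustration of the "get it from the original text" approach, here is a minimal sketch. It assumes each token's start_char/end_char offsets index into the same string you pass in, and that anything between two tokens is whitespace. The function name `trailing_whitespace` and the plain (start, end) tuples are stand-ins for illustration, not part of Stanza.

```python
from typing import List, Tuple


def trailing_whitespace(text: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Return the characters that follow each token in `text`.

    `spans` holds one (start_char, end_char) pair per token, in order.
    Hypothetical helper: not a Stanza API.
    """
    gaps = []
    for i, (_, end) in enumerate(spans):
        # The gap runs up to the next token's start, or to the end of the text.
        next_start = spans[i + 1][0] if i + 1 < len(spans) else len(text)
        gaps.append(text[end:next_start])
    return gaps


text = "Hello, world!  Bye."
spans = [(0, 5), (5, 6), (7, 12), (12, 13), (15, 18), (18, 19)]
print(trailing_whitespace(text, spans))
```

Because it slices the original text, this keeps runs of several spaces intact and also covers the last token, which the offset-comparison version above has to guess at.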
On Mon, Apr 26, 2021 at 8:40 AM Bram Vanroy @.***> wrote:
[quoted copy of the find_space_after comment above]
For reference, in spaCy every token has a .whitespace_ attribute that contains the trailing whitespace character(s) if present, as well as .text_with_ws, which is the token text including the trailing whitespace.
For CoNLL this seems particularly useful indeed, where it could/should then be added to the misc field. This has already been implemented in my spacy_conll library through spaCy's wrappers (spacy_stanza and the like). Having such properties natively in stanza would be useful, too, I think.
And CoreNLP has this with the (perhaps badly named) BeforeAnnotation and AfterAnnotation 🙂
@AngledLuffa If of interest, I can do a first minor addition to doc2conll_text, adding SpaceAfter attributes in the misc field where applicable?
https://github.com/stanfordnlp/stanza/blob/c457a9309ad15c522e94230f919c25d1e7aebf64/stanza/utils/conll.py#L202
Or, if preferred, I can also add a token-level whitespace attribute but that would require a lot more work.
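To make the "minor" option concrete, here is a small sketch of how a SpaceAfter=no entry could be merged into a CoNLL-U misc value. It is not the actual doc2conll_text code; `add_space_after` is a hypothetical helper, and it only assumes the usual CoNLL-U conventions ("_" for an empty field, "|" between key=value pairs).

```python
def add_space_after(misc: str, has_space_after: bool) -> str:
    """Return `misc` with SpaceAfter=no appended when no whitespace follows.

    Hypothetical helper for illustration; not part of Stanza.
    """
    if has_space_after:
        # By convention, a following space is the default and is left unmarked.
        return misc
    if misc in ("", "_"):
        return "SpaceAfter=no"
    if "SpaceAfter=no" in misc.split("|"):
        # Already marked: avoid writing the attribute twice.
        return misc
    return misc + "|SpaceAfter=no"


print(add_space_after("_", False))
print(add_space_after("start_char=0|end_char=5", False))
print(add_space_after("_", True))
```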
Yes, that sounds great! (The minor work one, if that's what you're up for doing right now.) Thanks.
Just looking through some old issues: the CoNLL-U output format of document objects now marks tokens with SpaceAfter=no (when that field is missing from the misc column, it implies that yes, there is whitespace), so I believe this is finished.
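For anyone consuming that output, the convention can be read back with plain string handling. The sketch below is only illustrative: the sample sentence is made up, `detokenize` is a hypothetical helper, and it assumes ordinary integer token IDs only (no multiword token ranges or empty nodes).

```python
def detokenize(conllu_sentence: str) -> str:
    """Rebuild surface text from CoNLL-U lines using SpaceAfter=no in MISC.

    Hypothetical helper; assumes every token line has a plain integer ID.
    """
    pieces = []
    for line in conllu_sentence.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        form, misc = cols[1], cols[9]  # FORM is column 2, MISC is column 10
        pieces.append(form)
        # A missing SpaceAfter=no means whitespace follows the token.
        if "SpaceAfter=no" not in misc.split("|"):
            pieces.append(" ")
    # Drop the space implied after the final token.
    return "".join(pieces).rstrip()


SAMPLE = "\n".join([
    "# text = Hello, world!",
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\tSpaceAfter=no",
    "2\t,\t,\tPUNCT\t_\t_\t1\tpunct\t_\t_",
    "3\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\tSpaceAfter=no",
    "4\t!\t!\tPUNCT\t_\t_\t1\tpunct\t_\t_",
])
print(detokenize(SAMPLE))
```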