ud-annotatrix
CG character ": " not parsed correctly
Annotatrix eats input:
"<Синь>"
"синь" Pron Pers Pl3 Gen
"синь" Pron Pers Pl3 Nom
:
"<эсост>"
"эса" Pron Ine PxPl3
"эсост" Adv Ine PxPl3
:
"<нинге>"
"ни" Adv Temp Foc
"нинге" Adv
:
"<точкат>"
"точка" N Pl Nom Indef
"<->"
"-" PUNCT
"<кутт>"
"куд" N Pl Nom Indef
"<,>"
"," CLB
:
"<Сай>"
"самс" V IV Ind Prs ScSg3
"самс" V IV V Act PrsPrc
"самс" V IV V Der/NomAg N Sg Nom Indef
:
"<пинге>"
"пинге" N Sg Nom Indef
:
"<—>"
"—" PUNCT
:
"<касыхть>"
"касомс" V IV Ind Prs ScPl3
"касы" N Pl Nom Indef
:
"<ошсон>"
"ош" N SP Ine PxSg1
"<.>"
"." CLB
@keggsmurph21
<spectie> https://github.com/jonorthwash/ud-annotatrix/issues/321
<spectie> why are there
<spectie> :
<spectie> in the GT CG output ?
<TinoDidriksen> Those are literal spaces from hfst-tokenise --giella-cg mode.
<TinoDidriksen> It's ": " not just :
<Unhammer> «superblanks»
<Unhammer> or just blanks
<Unhammer> I don't understand the issue
<Unhammer> what did it eat?
<spectie> hmm
<spectie> well, whatever annotatrix is using can't cope with it
<Unhammer> grep -v ^: ? or even sed 's/^:.*//'
<Flammie[m]> isn't basically any line not containing " a comment in CG
<Unhammer> weeel
<Unhammer> it should start with " or whitespace
<Unhammer> does it need a lemma?
<Unhammer> "<>"
<Unhammer> tag
<Flammie[m]> might work
<Unhammer> also there's stuff like <setvariable> or something, but I'm guessing that's also not handled by annotatrix
<TinoDidriksen> Any CG parser should treat non-matching stuff as text.
<TinoDidriksen> Those : are perfectly valid.
<spectie> ok!
<spectie> thanks
<spectie> i'll pass it on
Relevant points:
- ": " (a colon followed by a literal space, as output by hfst-tokenise --giella-cg) needs to be handled somehow, probably just by treating it as a space?
- Any non-matching stuff in CG should be treated as text.
This issue might be best filed against notatrix. @keggsmurph21, what do you think?
The parser is in parser.js somewhere. This functionality should be fairly straightforward to add?
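To make the first point concrete, here is a minimal sketch of reading ": " lines as literal blanks when rebuilding the surface text. The function name `surfaceText` is hypothetical and is not part of notatrix or CG-3. It assumes that whatever follows the colon is the literal blank between tokens, and that readings are indented as in ordinary CG output; any escaping rules the format may have are not handled.

```javascript
// Hypothetical helper, not part of notatrix or CG-3.
// Assumption: a line starting with ":" carries the literal blank
// that stood between two tokens (here, a single space).
function surfaceText(cgText) {
  let out = "";
  for (const line of cgText.split("\n")) {
    const wordform = line.match(/^"<(.*)>"/);
    if (wordform) {
      // Cohort line: append the wordform between "< and >".
      out += wordform[1];
    } else if (line.startsWith(":")) {
      // Blank line from the tokeniser: append what follows the colon.
      out += line.slice(1);
    }
    // Readings and any other lines do not contribute to the surface text.
  }
  return out;
}

const sample = [
  '"<нинге>"',
  '\t"нинге" Adv',
  ': ',
  '"<точкат>"',
  '\t"точка" N Pl Nom Indef',
  '"<->"',
  '\t"-" PUNCT',
  '"<кутт>"',
  '\t"куд" N Pl Nom Indef',
  '"<,>"',
  '\t"," CLB',
].join("\n");

console.log(surfaceText(sample)); // нинге точкат-кутт,
```

Note that no ": " line appears between "точкат", "-", "кутт" and ",", so those tokens are joined without spaces.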
What does "treated as text" mean in this context?
@TinoDidriksen, could you clarify your statement on this some?
<TinoDidriksen> Any CG parser should treat non-matching stuff as text.
The way CG-3 does it is that any non-CG input gets bundled up in a buffer attached to the immediately previous cohort, and then spat out again completely untouched once processing is finished.
If the cohort is moved, the attached text goes with it. If the cohort is deleted, the text is still output where the cohort would have been.
This lets CG-3 transparently pass along all sorts of markup.
So in this case, each occurrence of "\n: " would be part of the cohort on the previous line, right?
Yes.
With the important caveat that I really mean cohort, not reading. Cohorts own the non-CG parts. Readings do not, because readings have messy lives. So non-CG interspersed between readings will get bundled up as one lump sum, output after the owning cohort.
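The ownership rule described above could be sketched roughly as follows. This is not the CG-3 or notatrix implementation; `parseCG` and `serializeCG` are hypothetical names, and the sketch assumes that a cohort line starts with `"<`, that a reading line starts with whitespace followed by `"`, and that everything else counts as non-CG text.

```javascript
// Hypothetical sketch, not the CG-3 or notatrix code.
function parseCG(text) {
  const leading = []; // non-CG lines seen before the first cohort
  const cohorts = [];
  for (const line of text.split("\n")) {
    const wordform = line.match(/^"<(.*)>"/);
    const current = cohorts[cohorts.length - 1];
    if (wordform) {
      // New cohort: it will own any non-CG lines that follow it.
      cohorts.push({ form: wordform[1], line, readings: [], text: [] });
    } else if (current && /^\s+"/.test(line)) {
      current.readings.push(line);
    } else if (current) {
      // Non-CG line: buffered on the cohort, never on a reading.
      current.text.push(line);
    } else {
      leading.push(line);
    }
  }
  return { leading, cohorts };
}

function serializeCG({ leading, cohorts }) {
  const out = [...leading];
  for (const cohort of cohorts) {
    // The buffered text is emitted as one lump after the cohort's readings.
    out.push(cohort.line, ...cohort.readings, ...cohort.text);
  }
  return out.join("\n");
}

const input = [
  '"<Сай>"',
  '\t"самс" V IV Ind Prs ScSg3',
  ': ',
  '\t"самс" V IV V Act PrsPrc',
  '"<пинге>"',
  '\t"пинге" N Sg Nom Indef',
].join("\n");

const parsed = parseCG(input);
console.log(parsed.cohorts[0].text);
console.log(serializeCG(parsed));
```

In this input the ": " line sits between two readings of the first cohort. The sketch stores it on the cohort and writes it out after both readings, so the line survives untouched even though its position relative to the readings changes.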
Ah, right. I think I get how the parsing is supposed to work then. @keggsmurph21, does this make sense to you?
When parsing CG format, a newline followed by ": " should be interpreted as part of the previous cohort.