ud-annotatrix CG character ": " not parsed correctly

Annotatrix eats input:

"<Синь>"
        "синь" Pron Pers Pl3 Gen
        "синь" Pron Pers Pl3 Nom
: 
"<эсост>"
        "эса" Pron Ine PxPl3
        "эсост" Adv Ine PxPl3
: 
"<нинге>"
        "ни" Adv Temp Foc
        "нинге" Adv
: 
"<точкат>"
        "точка" N Pl Nom Indef
"<->"
        "-" PUNCT
"<кутт>"
        "куд" N Pl Nom Indef
"<,>"
        "," CLB
: 
"<Сай>"
        "самс" V IV Ind Prs ScSg3
        "самс" V IV V Act PrsPrc
        "самс" V IV V Der/NomAg N Sg Nom Indef
: 
"<пинге>"
        "пинге" N Sg Nom Indef
: 
"<—>"
        "—" PUNCT
: 
"<касыхть>"
        "касомс" V IV Ind Prs ScPl3
        "касы" N Pl Nom Indef
: 
"<ошсон>"
        "ош" N SP Ine PxSg1
"<.>"
        "." CLB

Sep 18 '18 09:09 rueter

@keggsmurph21

<spectie> https://github.com/jonorthwash/ud-annotatrix/issues/321
<spectie> why are there 
<spectie> :
<spectie> in the GT CG output ?
<TinoDidriksen> Those are literal spaces from hfst-tokenise --giella-cg mode.
<TinoDidriksen> It's ": " not just :
<Unhammer> «superblanks»
<Unhammer> or just blanks
<Unhammer> I don't understand the issue
<Unhammer> what did it eat?
<spectie> hmm
<spectie> well, whatever annotatrix is using can't cope with it 
<Unhammer> grep -v ^: ? or even sed 's/^:.*//'
<Flammie[m]> isn't basically any line not containing " a comment in CG
<Unhammer> weeel
<Unhammer> it should start with " or whitespace
<Unhammer> does it need a lemma?
<Unhammer> "<>"
<Unhammer>  tag
<Flammie[m]> might work
<Unhammer> also there's stuff like <setvariable> or something, but I'm guessing that's also not handled by annotatrix
<TinoDidriksen> Any CG parser should treat non-matching stuff as text.
<TinoDidriksen> Those : are perfectly valid.
<spectie> ok!
<spectie> thanks
<spectie> i'll pass it on

Sep 18 '18 10:09 ftyers

Relevant points:

": " needs to be dealt with somehow, probably just as a space?
Any non-matching stuff in CG should be treated as text.
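
A minimal sketch of what these two points could look like in code. The function names `classifyCgLine` and `textLineToBlank` are hypothetical (they are not taken from notatrix or CG-3), and the sketch assumes that a leading `:` on a non-CG line marks a literal blank, as described for `hfst-tokenise --giella-cg` in the chat above.

```javascript
// Hypothetical helpers for illustration only; not the notatrix or CG-3 API.

// Classify a single line of CG stream output.
function classifyCgLine(line) {
  // A cohort starts at column 0 with a quoted wordform: "<...>"
  if (/^"<.*>"/.test(line)) return 'wordform';
  // A reading is indented and starts with a quoted lemma.
  if (/^\s+"/.test(line)) return 'reading';
  // Anything else (including ": ") is non-matching and treated as text.
  return 'text';
}

// Assumption: a leading ':' marks a literal blank, so ": " stands for " ".
// Lines without the ':' prefix are returned unchanged.
function textLineToBlank(line) {
  return line.startsWith(':') ? line.slice(1) : line;
}
```

With these, `classifyCgLine(': ')` is classified as text rather than rejected, and `textLineToBlank(': ')` gives back the single space between two tokens.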

Sep 24 '18 15:09 jonorthwash

This issue might be best filed against notatrix. @keggsmurph21, what do you think?

Sep 24 '18 15:09 jonorthwash

The parser is in parser.js somewhere. This functionality should be fairly straightforward to add?

Sep 24 '18 15:09 jonorthwash

What does "treated as text" mean in this context?

Oct 01 '18 03:10 keggsmurph21

What does "treated as text" mean in this context?

@TinoDidriksen, could you clarify your statement on this some?

<TinoDidriksen> Any CG parser should treat non-matching stuff as text.

Oct 19 '18 15:10 jonorthwash

The way CG-3 does it: any non-CG input gets bundled up in a buffer attached to the immediately preceding cohort, and is then spit out again completely untouched once processing is finished.

If the cohort is moved, the attached text goes with it. If the cohort is deleted, the text is still output where the cohort would have been.

This lets CG-3 transparently pass along all sorts of markup.

Oct 19 '18 15:10 TinoDidriksen

So in this case, each occurrence of "\n: " would be part of the cohort on the previous line, right?

Oct 19 '18 15:10 jonorthwash

Yes.

With the important caveat that I really mean cohort, not reading. Cohorts own the non-CG parts. Readings do not, because readings have messy lives. So non-CG interspersed between readings will get bundled up as one lump sum, output after the owning cohort.
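
The ownership rule described here can be sketched roughly as follows. This is only an illustration under stated assumptions: `parseCg` and `serializeCg` are hypothetical names, the object shape is made up for the example, and none of it is the actual notatrix or CG-3 code. Non-CG lines are stored on the cohort (not on a reading), so anything found between readings ends up in one lump that is written out after the owning cohort.

```javascript
// Hypothetical sketch; names and data shapes are illustrative assumptions.

function parseCg(input) {
  const leading = []; // lines seen before the first cohort
  const cohorts = [];
  for (const line of input.split('\n')) {
    if (/^"<.*>"/.test(line)) {
      // New cohort: wordform line such as "<эсост>"
      cohorts.push({ wordform: line, readings: [], text: [] });
    } else if (cohorts.length === 0) {
      // Nothing to attach to yet, so keep the line as leading text.
      leading.push(line);
    } else if (/^\s+"/.test(line)) {
      // Indented line with a quoted lemma: a reading of the current cohort.
      cohorts[cohorts.length - 1].readings.push(line);
    } else {
      // Non-CG line (e.g. ": "): owned by the cohort, never by a reading.
      cohorts[cohorts.length - 1].text.push(line);
    }
  }
  return { leading, cohorts };
}

function serializeCg({ leading, cohorts }) {
  const lines = [...leading];
  for (const c of cohorts) {
    // Attached text is emitted untouched, after the cohort's readings.
    lines.push(c.wordform, ...c.readings, ...c.text);
  }
  return lines.join('\n');
}

// Usage: the ": " line ends up stored on the first cohort.
const sample = [
  '"<Синь>"',
  '\t"синь" Pron Pers Pl3 Gen',
  '\t"синь" Pron Pers Pl3 Nom',
  ': ',
  '"<эсост>"',
  '\t"эса" Pron Ine PxPl3',
].join('\n');
const parsed = parseCg(sample);
console.log(parsed.cohorts[0].text);
```

Because the text travels with the cohort object, moving a cohort in the `cohorts` array moves its attached text too. Keeping the text in place when a cohort is deleted would need extra handling that this sketch does not attempt.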

Oct 19 '18 16:10 TinoDidriksen

Ah, right. I think I get how the parsing is supposed to work then. @keggsmurph21, does this make sense to you?

Oct 20 '18 03:10 jonorthwash

A newline followed by ": " should be interpreted as part of the previous cohort when parsing CG format.

Jun 25 '19 17:06 jonorthwash
