CG character ": " not parsed correctly

Open rueter opened this issue 6 years ago • 11 comments

Annotatrix eats input:

"<Синь>"
        "синь" Pron Pers Pl3 Gen
        "синь" Pron Pers Pl3 Nom
: 
"<эсост>"
        "эса" Pron Ine PxPl3
        "эсост" Adv Ine PxPl3
: 
"<нинге>"
        "ни" Adv Temp Foc
        "нинге" Adv
: 
"<точкат>"
        "точка" N Pl Nom Indef
"<->"
        "-" PUNCT
"<кутт>"
        "куд" N Pl Nom Indef
"<,>"
        "," CLB
: 
"<Сай>"
        "самс" V IV Ind Prs ScSg3
        "самс" V IV V Act PrsPrc
        "самс" V IV V Der/NomAg N Sg Nom Indef
: 
"<пинге>"
        "пинге" N Sg Nom Indef
: 
"<—>"
        "—" PUNCT
: 
"<касыхть>"
        "касомс" V IV Ind Prs ScPl3
        "касы" N Pl Nom Indef
: 
"<ошсон>"
        "ош" N SP Ine PxSg1
"<.>"
        "." CLB

Sep 18 '18 09:09 rueter

@keggsmurph21

<spectie> https://github.com/jonorthwash/ud-annotatrix/issues/321
<spectie> why are there 
<spectie> :
<spectie> in the GT CG output ?
<TinoDidriksen> Those are literal spaces from hfst-tokenise --giella-cg mode.
<TinoDidriksen> It's ": " not just :
<Unhammer> «superblanks»
<Unhammer> or just blanks
<Unhammer> I don't understand the issue
<Unhammer> what did it eat?
<spectie> hmm
<spectie> well, whatever annotatrix is using can't cope with it 
<Unhammer> grep -v ^: ? or even sed 's/^:.*//'
<Flammie[m]> isn't basically any line not containing " a comment in CG
<Unhammer> weeel
<Unhammer> it should start with " or whitespace
<Unhammer> does it need a lemma?
<Unhammer> "<>"
<Unhammer>  tag
<Flammie[m]> might work
<Unhammer> also there's stuff like <setvariable> or something, but I'm guessing that's also not handled by annotatrix
<TinoDidriksen> Any CG parser should treat non-matching stuff as text.
<TinoDidriksen> Those : are perfectly valid.
<spectie> ok!
<spectie> thanks
<spectie> i'll pass it on

Sep 18 '18 10:09 ftyers

Relevant points:

  • ": " needs to be dealt with somehow, probably just as a space?
  • Any non-matching stuff in CG should be treated as text.

Sep 24 '18 15:09 jonorthwash

This issue might be best filed against notatrix. @keggsmurph21, what do you think?

Sep 24 '18 15:09 jonorthwash

The parser is in parser.js somewhere. This functionality should be fairly straightforward to add?

Sep 24 '18 15:09 jonorthwash

what does "treated as text" mean in this context?

Oct 01 '18 03:10 keggsmurph21

what does "treated as text" mean in this context?

@TinoDidriksen, could you clarify your statement on this some?

<TinoDidriksen> Any CG parser should treat non-matching stuff as text.

Oct 19 '18 15:10 jonorthwash

The way CG-3 does it is that any non-CG input gets bundled up in a buffer attached to the immediately preceding cohort, and is then spat out again, completely untouched, once processing is finished.

If the cohort is moved, the attached text goes with it. If the cohort is deleted, the text is still output where the cohort would have been.

This lets CG-3 transparently pass along all sorts of markup.
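
A minimal sketch of the buffering described above, in JavaScript since the annotatrix parser lives in parser.js. The names `parseCG`, `leading`, and `trailing` are hypothetical and are not taken from notatrix or CG-3. The line patterns are assumptions too: a cohort line is taken to start with `"<`, and a reading line with whitespace followed by `"`.

```javascript
// Hypothetical sketch: none of these names come from notatrix or CG-3.
function parseCG(text) {
  const result = { leading: [], cohorts: [] };
  let current = null; // the most recently opened cohort, if any

  for (const line of text.split("\n")) {
    if (/^"<.*>"/.test(line)) {
      // A cohort line such as "<нинге>" opens a new cohort.
      current = { form: line, readings: [], trailing: [] };
      result.cohorts.push(current);
    } else if (current !== null && /^\s+"/.test(line)) {
      // An indented line starting with a quoted lemma is a reading.
      current.readings.push(line);
    } else if (current !== null) {
      // Anything else (for example ": ") is kept verbatim on the previous cohort.
      current.trailing.push(line);
    } else {
      // Non-CG text before the first cohort has no owner yet.
      result.leading.push(line);
    }
  }
  return result;
}

const sample = [
  '"<нинге>"',
  '\t"ни" Adv Temp Foc',
  '\t"нинге" Adv',
  ': ',
  '"<точкат>"',
  '\t"точка" N Pl Nom Indef',
].join("\n");

const parsed = parseCG(sample);
console.log(parsed.cohorts.length); // 2
console.log(parsed.cohorts[0].trailing); // holds the single ": " line
```

With this shape the ": " line is not dropped and does not break parsing; it simply travels with the "<нинге>" cohort.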

Oct 19 '18 15:10 TinoDidriksen

So in this case, each occurrence of "\n: " would be part of the cohort on the previous line, right?

Oct 19 '18 15:10 jonorthwash

Yes.

With the important caveat that I really mean cohort, not reading. Cohorts own the non-CG parts. Readings do not, because readings have messy lives. So non-CG interspersed between readings will get bundled up as one lump sum, output after the owning cohort.
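
To illustrate the caveat, here is a small sketch of writing a cohort back out. It assumes a hypothetical cohort shape `{ form, readings, trailing }` (not a notatrix or CG-3 API), where `trailing` holds every non-CG line the cohort owns, in input order. The `<markup/>` line is made up for the example and stands for text that appeared between the two readings in the input.

```javascript
// Hypothetical cohort shape: { form, readings, trailing }. Not a notatrix or CG-3 API.
function serializeCohort(cohort) {
  // Readings are written first; all attached non-CG text follows as one lump.
  return [cohort.form, ...cohort.readings, ...cohort.trailing].join("\n");
}

// Suppose the input had "<markup/>" between the two readings of this cohort.
const cohort = {
  form: '"<Синь>"',
  readings: ['\t"синь" Pron Pers Pl3 Gen', '\t"синь" Pron Pers Pl3 Nom'],
  trailing: ["<markup/>", ": "],
};

console.log(serializeCohort(cohort));
```

The interspersed line does not return to its original position between the readings; it comes out after the cohort, together with the ": " line.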

Oct 19 '18 16:10 TinoDidriksen

Ah, right. I think I get how the parsing is supposed to work then. @keggsmurph21, does this make sense to you?

Oct 20 '18 03:10 jonorthwash

A newline followed by ": " should be interpreted as part of the previous cohort when parsing CG format.

Jun 25 '19 17:06 jonorthwash