
Write a utility to assign weights to a compiled transducer based on a corpus

Open ftyers opened this issue 6 years ago • 9 comments

I imagine it will be called lt-reweight

It should have two arguments:

  1. a binary lttoolbox file, e.g. grn.automorf.bin
  2. a tagged corpus, e.g. grn.tagged

$ lt-reweight grn.automorf.bin grn.tagged

Where grn.tagged looks like:

^Avañeʼẽ/avañeʼẽ<n>$
^ha/ha<cnjcoo>$
^Guarani/guarani<n>$
^ñeʼẽ/ñeʼẽ<n>$
^ombohéra/o<prn><p3><sg>+mbohéra<v><tv><pres>$
^hikuái/hikuái<aux><impf><p3><pl>$
^umi/umi<adj><dem><pl>$
^Guaranikuéra/guarani<n>+kuéra<det><pl>$
^pe/pe<post>$
^ñeʼẽ/ñeʼẽ<n>$
^teépe/tee<n>+pe<post>$
^./.<sent>$

^Guarani/guarani<n>$
^haʼe/haʼe<vbser><iv><pres>$
^peteĩva/peteĩ<num>+va<subs><dem>$
^umi/umi<adj><dem><pl>$
^teʼyikuéra/teʼyi<n>+kuéra<det><pl>$
^Amérika-gua/Amérika<np><top>+gua<post>$
^ñeʼẽnguéra/ñeʼẽ<n>+kuéra<det><pl>$
^apytépe/apytépe<post>$
^hetave/heta<adv>+ve<comp>$
^iñeʼẽhárava/iñeʼẽhárava<adj>$
^,/,<cm>$
^oñemohendáva/o<prn><p3><sg>+je<pass>+mohenda<v><tv><pres>+va<subs><dem>$
^irundy/irundy<num>$
^tetãnguéra/tetã<n>+kuéra<det><pl>$
^iñambuévape/iñambuéva<adj>+pe<post>$
^(/(<lpar>$
^Paraguái/Paraguái<np><top>$
^,/,<cm>$
^Argentina/Argentina<np><top>$
^,/,<cm>$
^Volívia/Volívia<np><top>$
^ha/ha<cnjcoo>$
^Brasil/Brasil<np><top>$
^)/)<rpar>$
^./.<sent>$

^Avei/avei<adv>$
^,/,<cm>$
^haʼe/haʼe<vbser><iv><pres>$
^ñoite/ñoite<adv>$
^ojehechakuaáva/o<prn><p3><sg>+je<pass>+hechakuaa<v><tv><pres>+va<subs><dem>$
^ñeʼẽ/ñeʼẽ<n>$
^teéramo/tee<n>+ramo<post>$
^peteĩ/peteĩ<num>$
^tetã/tetã<n>$
^Ñembyamérika-guápe/Ñembyamérika<np><top>+gua<post>+pe<post>$
^./.<sent>$

^Tupi/Tupi<n>$
^ha/ha<cnjcoo>$
^guarani/guarani<n>$
^ñeʼẽ/ñeʼẽ<n>$
^aty/aty<n>$
^guasu/guasu<adj>$
^rehegua/rehegua<post>$

^,/,<cm>$
^oguereko/o<prn><p3><sg>+guereko<v><tv><pres>$
^hetáichagua/hetáichagua<adj>$

^ñeʼẽnunga/ñeʼẽnunga<n>$
^,/,<cm>$
^upéicharõ/upéicha<adv>+rõ<post>$
^jepe/jepe<adv>$
^oĩ/oĩ<v><iv><pres>$
^jekupyty/jekupyty<v><tv><pres>$
^ijapytepekuéra/i<prn><p3><sg>+japyte<n>+pe<post>+kuéra<det><pl>$
^ha/ha<cnjcoo>$
^heta/heta<adv>$
^mbaʼépe/mbaʼe<n>+pe<post>$
^ojojogua/ojojogua<n>$
^koʼã/koʼã<adj><dem><pl>$
^ñeʼẽnungakuéra/ñeʼẽnunga<n>+kuéra<det><pl>$
^./.<sent>$

^Avañeʼẽ/avañeʼẽ<n>$
^ha/ha<cnjcoo>$
^karaiñeʼẽ/karaiñeʼẽ<n>$
^haʼe/haʼe<vbser><iv><pres>$
^Paraguái/Paraguái<np><top>$
^retãme/tetã<n>+pe<post>$
^ñeʼẽ/ñeʼẽ<n>$
^tee/tee<adj>$
^ary/ary<n>$
^1992/1992<num>$
^guive/guive<post>$
^./.<sent>$

^Japypateĩ/Japypateĩ<num>$
^2006/2006<num>$
^guive/guive<post>$
^haʼe/haʼe<vbser><iv><pres>$
^avei/avei<adv>$
^ñeʼẽ/ñeʼẽ<n>$
^tee/tee<adj>$
^Mercosur-pe/Mercosur<np><org>+pe<case>$
^,/,<cm>$
^karaiñeʼẽ/karaiñeʼẽ<n>$
^ha/ha<cnjcoo>$
^poytugañeʼẽ/poytugañeʼẽ<n>$
^ykére/ykére<post>$
^./.<sent>$

And the output of the analyser for e.g. poytugañeʼẽ is:

^poytugañeʼẽ/poytugañeʼẽ<n>/a<prn><p1><sg>+poytugañeʼẽ<n>/re<prn><p2><sg>+poytugañeʼẽ<n>$^./.<sent>$

So the analyses should be weighted as follows:

poytugañeʼẽ : poytugañeʼẽ<n> = 1.0
poytugañeʼẽ : a<prn><p1><sg>+poytugañeʼẽ<n> = 0.0
poytugañeʼẽ : re<prn><p2><sg>+poytugañeʼẽ<n> = 0.0
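
A minimal sketch of how those numbers could be derived, assuming the weight of an analysis is simply its relative frequency for that surface form in the tagged corpus. The function names (`parse_tagged_line`, `unigram_weights`, `weight_analyses`) are hypothetical and not part of lttoolbox, and the parser ignores escaped characters in the stream format:

```python
from collections import Counter, defaultdict


def parse_tagged_line(line):
    """Split one '^surface/analysis$' unit into (surface, analysis), else None."""
    line = line.strip()
    if not (line.startswith("^") and line.endswith("$")):
        return None  # blank sentence separators and malformed lines are skipped
    surface, sep, analysis = line[1:-1].partition("/")
    return (surface, analysis) if sep else None


def unigram_weights(lines):
    """Map each surface form to {analysis: relative frequency in the corpus}."""
    counts = defaultdict(Counter)
    for line in lines:
        pair = parse_tagged_line(line)
        if pair:
            counts[pair[0]][pair[1]] += 1
    return {
        surface: {a: n / sum(c.values()) for a, n in c.items()}
        for surface, c in counts.items()
    }


def weight_analyses(surface, analyses, weights):
    """Weight the analyser's output; analyses never seen in the corpus get 0.0."""
    seen = weights.get(surface, {})
    return {a: seen.get(a, 0.0) for a in analyses}
```

With the corpus above, `weight_analyses` applied to the three analyses of poytugañeʼẽ gives 1.0 for the `<n>` reading and 0.0 for the two prefixed readings.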

ftyers avatar Jul 01 '18 16:07 ftyers

Is it similar to supervised tagger training? @flammie

Techievena avatar Jul 05 '18 23:07 Techievena

Pretty much, I'd say. A unigram tagger should work exactly the same way, if I haven't missed anything.

flammie avatar Jul 06 '18 01:07 flammie

So one way this could work is:

  • Load original transducer, A
  • Read tagged corpus into a weighted FST, B
  • Intersect B and A, making C
  • Priority union C and A.
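
The four steps can be illustrated on a toy model in which a "transducer" is just a dict from (surface, analysis) pairs to weights. This is only a sketch of the intended semantics: `intersect` and `priority_union` here are hypothetical stand-ins, not lttoolbox functions, and it assumes priority union is meant at the level of whole surface:analysis pairs.

```python
def intersect(corpus_b, analyser_a):
    """C: corpus pairs (with corpus weights) that the analyser A also accepts."""
    return {pair: w for pair, w in corpus_b.items() if pair in analyser_a}


def priority_union(preferred_c, fallback_a):
    """All pairs of C, plus those pairs of A that C does not contain."""
    merged = dict(fallback_a)   # start from the unweighted analyser
    merged.update(preferred_c)  # corpus-derived weights take priority
    return merged


def reweight(analyser_a, corpus_b):
    """Steps 3 and 4: intersect B with A, then priority-union the result with A."""
    return priority_union(intersect(corpus_b, analyser_a), analyser_a)
```

Because the merge replaces a pair rather than adding a second copy, each analysis appears once in the result, carrying the corpus weight where one exists.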

Questions:

  • Does intersection in lttoolbox do the right thing with weights?
  • We don't have priority union, or subtract, it seems a bit difficult to do without either of those.

@flammie @unhammer thoughts ?

ftyers avatar Jun 27 '20 08:06 ftyers

I don't think even OpenFst has a defined intersection of weighted or two-tape automata; they just do the encoded intersection, where a:b::W is treated as a special symbol in an automaton intersection. It might be possible to add weights by way of an intersection algorithm, at least when the automata are mostly synchronised; otherwise I'd just go with composing.

For the experiments I published on weighting automata we did compose(A, B), or at most compose(minus'(A', B'), B), which does something similar to priority union, I guess. It required some trickery though. One could even just do union(A, B), since B is a gold corpus with good tags, right? With the compose method you mainly lose out if there is a non-1:1 relation in the direction you compose, I think, e.g. if you have foo+X:bar and foo+X:baz.

The part of A that doesn't get weighted by corpus should usually receive the penalty weight of unseen tokens.
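
One possible way to get such a penalty, sketched under the assumption that weights are negative log probabilities (lower is better) and that additive smoothing is acceptable. The smoothing scheme and the name `tropical_weights` are illustrative choices, not necessarily what the published experiments used:

```python
import math


def tropical_weights(counts, all_analyses, alpha=1.0):
    """Return -log of the smoothed probability for every analysis in A.

    counts       -- corpus frequency of each analysis that was seen
    all_analyses -- every analysis the transducer A can produce
    alpha        -- pseudo-count added to each analysis (hypothetical default)
    """
    total = sum(counts.values()) + alpha * len(all_analyses)
    return {
        a: -math.log((counts.get(a, 0) + alpha) / total)
        for a in all_analyses
    }
```

An analysis with zero corpus count still gets a finite weight, -log(alpha / total), which is larger than the weight of any seen analysis and so acts as the unseen-token penalty.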

flammie avatar Jun 27 '20 13:06 flammie

The reason for not just doing union is that then we would have multiple identical analyses with different weights, right?

I was thinking compose would work, but we also don't have an implementation of compose in lttoolbox at the moment.

ftyers avatar Jun 27 '20 15:06 ftyers

https://github.com/apertium/lttoolbox/pull/161 adds a compose (optional on matching sub-paths), though not very extensively tested :) also I have no idea what the expected value of composed weights would be

unhammer avatar Sep 24 '22 11:09 unhammer

I think weights are just added together in our WFSAs? Or theoretically using the weight structure's semiring's collect operation but we've always used the tropical semiring which is just +.

flammie avatar Sep 24 '22 12:09 flammie

So whatever operation you use on weights when following arcs should be used when composing? And if you want to compose g . f without changing the weights that are in f, then all arcs of g need to be the identity (ie. 0 if operation is +)?
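
A toy check of that reasoning, assuming the tropical semiring (weights along a path are added, alternatives are resolved with min, and 0 is the identity for +). Relations are plain dicts from (input, output) to weight, and `compose` is a hypothetical helper, not the code from the pull request:

```python
def compose(f, g):
    """Relation that applies f first, then g (i.e. g . f), over the tropical semiring."""
    out = {}
    for (a, b), wf in f.items():
        for (b2, c), wg in g.items():
            if b == b2:
                w = wf + wg  # following arcs: weights are added
                # alternative paths to the same pair: keep the cheapest
                out[(a, c)] = min(w, out.get((a, c), float("inf")))
    return out
```

If every arc of g has weight 0, the composed relation carries exactly the weights of f, which matches the expectation that 0 is the identity when the operation is +.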

unhammer avatar Sep 24 '22 14:09 unhammer

@flammie newest uses +

unhammer avatar Sep 25 '22 21:09 unhammer