
Weights are ignored in monolingual dictionary entries

Open MarcRiera opened this issue 5 years ago • 11 comments

Given the following paradigms and entries:

<pardef n="liv/e__vblex">
  <e>       <p><l>e</l>         <r>e<s n="vblex"/><s n="inf"/></r></p></e>
  <e>       <p><l>e</l>         <r>e<s n="vblex"/><s n="imp"/></r></p></e>
  <e>       <p><l>ed</l>        <r>e<s n="vblex"/><s n="pp"/></r></p></e>
  <e w="1"> <p><l>ing</l>       <r>e<s n="vblex"/><s n="pprs"/></r></p></e>
  <e w="3"> <p><l>ing</l>       <r>e<s n="vblex"/><s n="ger"/></r></p></e>
  <e w="2"> <p><l>ing</l>       <r>e<s n="vblex"/><s n="subs"/></r></p></e>
  <e>       <p><l>e</l>         <r>e<s n="vblex"/><s n="pres"/></r></p></e>
  <e>       <p><l>es</l>        <r>e<s n="vblex"/><s n="pres"/><s n="p3"/><s n="sg"/></r></p></e>
  <e>       <p><l>ed</l>        <r>e<s n="vblex"/><s n="past"/></r></p></e>
</pardef>
<pardef n="house__n">
  <e>       <p><l></l>          <r><s n="n"/><s n="sg"/></r></p></e>
  <e r="RL"><p><l>'s</l>        <r><s n="n"/><s n="sg"/><j/>'s<s n="gen"/></r></p></e>
  <e>       <p><l>s</l>         <r><s n="n"/><s n="pl"/></r></p></e>
  <e r="RL"><p><l>s'</l>        <r><s n="n"/><s n="pl"/><j/>'s<s n="gen"/></r></p></e>
</pardef>
<e lm="house" w="1">     <i>house</i><par n="house__n"/></e>
<e lm="house" w="2">     <i>hous</i><par n="liv/e__vblex"/></e>

lt-proc seems to ignore the weights for the entries:

$ echo "house" | lt-proc -wW eng-cat.automorf.bin
^house/house<n><sg><W:0.000000>/house<vblex><inf><W:0.000000>/house<vblex><pres><W:0.000000>/house<vblex><imp><W:0.000000>$

The expected result would be:

$ echo "house" | lt-proc -wW eng-cat.automorf.bin
^house/house<n><sg><W:1.000000>/house<vblex><inf><W:2.000000>/house<vblex><pres><W:2.000000>/house<vblex><imp><W:2.000000>$

However, the weights work fine when they are used inside a paradigm:

$ echo "housing" | lt-proc -wW eng-cat.automorf.bin
^housing/housing<n><sg><W:0.000000>/house<vblex><pprs><W:1.000000>/house<vblex><subs><W:2.000000>/house<vblex><ger><W:3.000000>$
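For reference, the expected weights follow if an entry's weight and the weight of the paradigm branch it goes through are simply added along the path. That additive rule is an assumption here (it matches the expected output above and the usual tropical-semiring reading of weights). The tables are a hand-copied subset of the paradigms, and `expected_analyses` is a hypothetical helper, not part of lttoolbox.

```python
# Illustrative model of the expected behaviour; not lttoolbox internals.
# Each paradigm lists (surface suffix, analysis suffix, weight); w defaults to 0.
PARADIGMS = {
    "liv/e__vblex": [
        ("e", "e<vblex><inf>", 0.0),
        ("e", "e<vblex><imp>", 0.0),
        ("ing", "e<vblex><pprs>", 1.0),
        ("ing", "e<vblex><ger>", 3.0),
        ("ing", "e<vblex><subs>", 2.0),
        ("e", "e<vblex><pres>", 0.0),
    ],
    "house__n": [
        ("", "<n><sg>", 0.0),
        ("s", "<n><pl>", 0.0),
    ],
}

# (stem, paradigm name, entry weight), as in the two <e lm="house"> entries.
ENTRIES = [("house", "house__n", 1.0), ("hous", "liv/e__vblex", 2.0)]


def expected_analyses(surface):
    """Return (analysis, weight) pairs, assuming weights add along a path."""
    out = []
    for stem, par, entry_w in ENTRIES:
        for suffix, tags, par_w in PARADIGMS[par]:
            if stem + suffix == surface:
                out.append((stem + tags, entry_w + par_w))
    # Stable sort: ties keep dictionary order.
    return sorted(out, key=lambda a: a[1])


for form in ("house", "housing"):
    print(form, expected_analyses(form))
```

Under this rule, `house` gets 1.0 for the noun reading and 2.0 for each verb reading, as in the expected output, while the three verb readings of `housing` would get 3.0, 4.0 and 5.0 rather than the 1.0, 2.0 and 3.0 that lt-proc prints.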

MarcRiera avatar Mar 14 '19 12:03 MarcRiera

@Techievena

unhammer avatar Mar 14 '19 12:03 unhammer

@unhammer I will definitely look into it.

Techievena avatar Mar 15 '19 05:03 Techievena

I might be facing the same problem. I am using an input written in .att format to generate a weighted transducer.

0       1       c       c       0.000000
1       2       a       a       0.000000
2       3       t       t       0.000000
3       4       @0@     <n>     0.000000
3       5       s       <n>     0.000000
4       2.000000
5       6       @0@     <pl>    0.000000
6       1.000000

I generate the transducer using lt-comp lr in.att apert_model. The output of lt-print apert_model is:

0       1       c       c       0.000000
1       2       a       a       0.000000
2       3       t       t       0.000000
3       4       ε       <n>     0.000000
3       5       s       <n>     0.000000
4       7       ε       ε       2.000000
5       6       ε       <pl>    0.000000
6       7       ε       ε       1.000000
7       0.000000

which seems to be correct.

However, the output of echo 'cat' | lt-proc apert_model -W seems to ignore the weights:

^cat/cat<n><W:0.000000>$
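To make the expectation concrete, here is a minimal sketch that reads the .att text above and sums arc and final weights along every accepting path. It is not how lt-proc works internally; `parse_att` and `analyses` are hypothetical helpers, input symbols are assumed to be single characters, and the transducer is assumed to have no epsilon cycles.

```python
from collections import defaultdict

EPS = "@0@"  # epsilon symbol in AT&T text format

ATT = """
0 1 c c 0.000000
1 2 a a 0.000000
2 3 t t 0.000000
3 4 @0@ <n> 0.000000
3 5 s <n> 0.000000
4 2.000000
5 6 @0@ <pl> 0.000000
6 1.000000
"""


def parse_att(text):
    """Parse AT&T lines into arcs (src -> list) and final-state weights."""
    arcs = defaultdict(list)  # src -> [(dst, in_sym, out_sym, weight)]
    finals = {}               # state -> final weight
    for line in text.strip().splitlines():
        f = line.split()
        if len(f) == 5:
            arcs[int(f[0])].append((int(f[1]), f[2], f[3], float(f[4])))
        elif len(f) == 2:
            finals[int(f[0])] = float(f[1])
        elif len(f) == 1:     # final state without an explicit weight
            finals[int(f[0])] = 0.0
    return arcs, finals


def analyses(arcs, finals, word, start=0):
    """Return sorted (output, total weight) pairs for every accepting path."""
    results = []
    stack = [(start, 0, "", 0.0)]  # (state, input position, output, weight)
    while stack:
        state, pos, out, w = stack.pop()
        if pos == len(word) and state in finals:
            results.append((out, w + finals[state]))
        for dst, i_sym, o_sym, arc_w in arcs.get(state, []):
            o_str = "" if o_sym == EPS else o_sym
            if i_sym == EPS:
                stack.append((dst, pos, out + o_str, w + arc_w))
            elif pos < len(word) and word[pos] == i_sym:
                stack.append((dst, pos + 1, out + o_str, w + arc_w))
    return sorted(results)


arcs, finals = parse_att(ATT)
print(analyses(arcs, finals, "cat"))
print(analyses(arcs, finals, "cats"))
```

With additive weights, the only analysis of `cat` carries the final weight 2.0 of state 4, and `cats` carries the final weight 1.0 of state 6, so a weight of 0 for `cat` means the final weight was dropped somewhere.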

AMR-KELEG avatar Mar 25 '19 01:03 AMR-KELEG

I think the bug might be related to this line and its following lines: https://github.com/apertium/lttoolbox/blob/f73c54162cc8ca1d9f70486b051165af1a7bf7cb/lttoolbox/state.cc#L607

AMR-KELEG avatar Mar 25 '19 02:03 AMR-KELEG

I guess editing the comment on #49 to remove "Fix #44" was not enough to make GitHub understand it was not a closing merge.

TinoDidriksen avatar Apr 03 '19 12:04 TinoDidriksen

@MarcRiera I think the bug is in the lt-comp command. Is lt-comp used in apertium-eng to compile the dictionary?

I have prepared a sample dictionary:

<dictionary>
  <alphabet>ÀÁÂÄÆÇÈÉÊËÌÍÎÏÑÒÓÔÖÙÚÛÜàáâäçèéêëìíîïñòóôöùúûüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"   c="Noun"/>  
    <sdef n="vblex"   c="Verb"/> 
    <sdef n="p1"  c="First person"/> 
    <sdef n="p3"  c="Third person"/> 
    <sdef n="sg"  c="Singular"/> 
    <sdef n="pl"  c="Plural"/> 
    <sdef n="pres"  c="Present (tense)"/> 
    <sdef n="past"  c="Past"/> 
    <sdef n="imp"   c="Imperative"/> 
    <sdef n="inf"   c="Infinitive"/> 
    <sdef n="pp"  c="Past participle"/> 
    <sdef n="subs"  c="Verbal noun"/> 
    <sdef n="pprs"  c="Present participle"/> 
    <sdef n="ger"   c="Gerund"/> 
  </sdefs>
  <pardefs>
    <pardef n="liv/e__vblex">
      <e>       <p><l>e</l>         <r>e<s n="vblex"/><s n="inf"/></r></p></e>
      <e>       <p><l>e</l>         <r>e<s n="vblex"/><s n="imp"/></r></p></e>
      <e>       <p><l>ed</l>        <r>e<s n="vblex"/><s n="pp"/></r></p></e>
      <e w="1"> <p><l>ing</l>       <r>e<s n="vblex"/><s n="pprs"/></r></p></e>
      <e w="3"> <p><l>ing</l>       <r>e<s n="vblex"/><s n="ger"/></r></p></e>
      <e w="2"> <p><l>ing</l>       <r>e<s n="vblex"/><s n="subs"/></r></p></e>
      <e>       <p><l>e</l>         <r>e<s n="vblex"/><s n="pres"/></r></p></e>
      <e>       <p><l>es</l>        <r>e<s n="vblex"/><s n="pres"/><s n="p3"/><s n="sg"/></r></p></e>
      <e>       <p><l>ed</l>        <r>e<s n="vblex"/><s n="past"/></r></p></e>
    </pardef>
    <pardef n="house__n">
      <e>       <p><l></l>          <r><s n="n"/><s n="sg"/></r></p></e>
      <e r="RL"><p><l>'s</l>        <r><s n="n"/><s n="sg"/><j/>'s<s n="gen"/></r></p></e>
      <e>       <p><l>s</l>         <r><s n="n"/><s n="pl"/></r></p></e>
      <e r="RL"><p><l>s'</l>        <r><s n="n"/><s n="pl"/><j/>'s<s n="gen"/></r></p></e>
    </pardef>
  </pardefs>
<section id="main" type="standard">
  <e lm="house" w="1">     <i>house</i><par n="house__n"/></e>
  <e lm="house" w="2">     <i>hous</i><par n="liv/e__vblex"/></e>
</section>
</dictionary>

And the output transducer isn't correct:

0	1	h	h	0.000000	
1	2	o	o	0.000000	
2	3	u	u	0.000000	
3	4	s	s	0.000000	
4	5	e	e	0.000000	 # THIS EDGE SHOULD HAVE WEIGHT=2
4	6	e	e	1.000000	 # THIS EDGE HAS THE CORRECT WEIGHT
4	7	i	e	0.000000	
5	8	ε	<vblex>	0.000000	
5	9	d	<vblex>	0.000000	
5	10	s	<vblex>	0.000000	
6	11	ε	<n>	0.000000	
6	12	s	<n>	0.000000	
7	13	n	<vblex>	0.000000	
8	14	ε	<inf>	0.000000	
8	14	ε	<imp>	0.000000	
8	14	ε	<pres>	0.000000	
9	14	ε	<pp>	0.000000	
9	14	ε	<past>	0.000000	
10	15	ε	<pres>	0.000000	
11	14	ε	<sg>	0.000000	
12	14	ε	<pl>	0.000000	
13	14	g	<pprs>	1.000000	
13	14	g	<ger>	3.000000	
13	14	g	<subs>	2.000000	
15	11	ε	<p3>	0.000000	
14	0.000000

When I use the command echo "house" | lt-proc house.bin -W, I get correct weights only for the noun analysis:

^house/house<vblex><inf><W:0.000000>/house<vblex><imp><W:0.000000>/house<vblex><pres><W:0.000000>/house<n><sg><W:1.000000>$

AMR-KELEG avatar Apr 03 '19 13:04 AMR-KELEG

The correct weighting here is not trivial (so there seems to be something wrong in the compilation part too). Keep in mind that the prefix "hous" is shared by both the verb and the noun, and the verb needs its weight of 2 for "housing" as well, which does not go through the "4 5 e e" arc.

Here's the hfst + lexc equivalent for reference:

$ cat house.lexc
Multichar_Symbols
%<n%>
%<vblex%>
%<p1%>
%<p3%>
%<sg%>
%<pl%>
%<pres%>
%<past%>
%<imp%>
%<inf%>
%<pp%>
%<subs%>
%<pprs%>
%<ger%>
%<gen%>

LEXICON Root

house:house house__n "weight: 1" ;
hous:hous liv/e__vblex "weight: 2" ;

LEXICON liv/e__vblex

e%<vblex%>%<inf%>:e # ;
e%<vblex%>%<imp%>:e # ;
e%<vblex%>%<pp%>:ed # ;
e%<vblex%>%<pprs%>:ing # "weight: 1" ;
e%<vblex%>%<ger%>:ing  # "weight: 2" ;
e%<vblex%>%<subs%>:ing  # "weight: 3" ;
e%<vblex%>%<pres%>:e # ;
e%<vblex%>%<pres%>%<p3%>%<sg%>:es # ;
e%<vblex%>%<past%>:ed # ;

LEXICON house__n

%<n%>%<sg%>:0  # ;
%<n%>%<sg%>+'s%<gen%>:'s  # ;
%<n%>%<pl%>:s  # ;
%<n%>%<pl%>+'s%<gen%>:s'  # ;

$ hfst-lexc house.lexc | hfst-fst2txt
hfst-lexc: warning: Defaulting to OpenFst tropical type
Root...2 liv/e__vblex...9 house__n...
0	1	h	h	1.000000
1	2	o	o	0.000000
2	3	u	u	0.000000
3	4	s	s	0.000000
4	5	e	i	2.000000
4	6	e	e	0.000000
5	7	<vblex>	n	0.000000
6	8	<n>	@0@	0.000000
6	9	<n>	s	0.000000
6	10	<n>	'	0.000000
6	11	<vblex>	@0@	1.000000
6	12	<vblex>	s	1.000000
6	13	<vblex>	d	1.000000
7	14	<subs>	g	2.000000
7	14	<ger>	g	1.000000
7	14	<pprs>	g	0.000000
8	14	<sg>	@0@	0.000000
9	14	<pl>	@0@	0.000000
9	15	<pl>	'	0.000000
10	15	<sg>	s	0.000000
11	14	<pres>	@0@	0.000000
11	14	<imp>	@0@	0.000000
11	14	<inf>	@0@	0.000000
12	16	<pres>	@0@	0.000000
13	14	<past>	@0@	0.000000
13	14	<pp>	@0@	0.000000
14	0.000000
15	17	+	@0@	0.000000
16	8	<p3>	@0@	0.000000
17	18	'	@0@	0.000000
18	19	s	@0@	0.000000
19	14	<gen>	@0@	0.000000

$ hfst-lexc house.lexc | hfst-fst2strings -w
hfst-lexc: warning: Defaulting to OpenFst tropical type
Root...2 liv/e__vblex...9 house__n...
house<vblex><subs>:housing	5
house<vblex><ger>:housing	4
house<vblex><pprs>:housing	3
house<n><sg>:house	1
house<n><pl>:houses	1
house<n><pl>+'s<gen>:houses'	1
house<n><sg>+'s<gen>:house's	1
house<vblex><pres>:house	2
house<vblex><imp>:house	2
house<vblex><inf>:house	2
house<vblex><pres><p3><sg>:houses	2
house<vblex><past>:housed	2

Nonetheless, for the lt-proc part, at least a bit more of the weight should have been accumulated :-/
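As a quick check of the hfst output, the per-string weights printed by hfst-fst2strings can be recovered by adding up the arc weights from hfst-fst2txt along each path. The sketch below copies only the arcs needed for five of the strings, keyed by (source, target, upper symbol); `path_weight` is a hypothetical helper written for this check.

```python
# Arc weights copied by hand from the hfst-fst2txt output (subset only).
ARCS = {
    (0, 1, "h"): 1.0,
    (1, 2, "o"): 0.0,
    (2, 3, "u"): 0.0,
    (3, 4, "s"): 0.0,
    (4, 5, "e"): 2.0,         # e:i, start of "-ing"
    (4, 6, "e"): 0.0,         # e:e
    (5, 7, "<vblex>"): 0.0,
    (6, 8, "<n>"): 0.0,
    (6, 11, "<vblex>"): 1.0,
    (7, 14, "<subs>"): 2.0,
    (7, 14, "<ger>"): 1.0,
    (7, 14, "<pprs>"): 0.0,
    (8, 14, "<sg>"): 0.0,
    (11, 14, "<inf>"): 0.0,
}
FINAL = {14: 0.0}

PREFIX = [(0, 1, "h"), (1, 2, "o"), (2, 3, "u"), (3, 4, "s")]


def path_weight(path):
    """Sum arc weights along a path and add the final weight of its last state."""
    return sum(ARCS[arc] for arc in path) + FINAL[path[-1][1]]


PATHS = {
    "house<vblex><subs>:housing": PREFIX + [(4, 5, "e"), (5, 7, "<vblex>"), (7, 14, "<subs>")],
    "house<vblex><ger>:housing": PREFIX + [(4, 5, "e"), (5, 7, "<vblex>"), (7, 14, "<ger>")],
    "house<vblex><pprs>:housing": PREFIX + [(4, 5, "e"), (5, 7, "<vblex>"), (7, 14, "<pprs>")],
    "house<n><sg>:house": PREFIX + [(4, 6, "e"), (6, 8, "<n>"), (8, 14, "<sg>")],
    "house<vblex><inf>:house": PREFIX + [(4, 6, "e"), (6, 11, "<vblex>"), (11, 14, "<inf>")],
}

for pair, path in PATHS.items():
    print(pair, path_weight(path))
```

The totals come out as 5, 4, 3, 1 and 2, matching the hfst-fst2strings listing. Note how the weights are spread over several arcs: the shared `h` arc carries 1, and the verb readings pick up their remaining weight later on the path, so no single arc holds "the" entry weight.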

flammie avatar Apr 03 '19 14:04 flammie

> Is lt-comp used in the apertium-eng to compile the dictionary?

It is.

unhammer avatar Apr 03 '19 15:04 unhammer

I believe the issue here is that Transducer::closure() disregards weight and as a result determinize() and minimize() lose any weights which are on epsilon transitions.
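To illustrate what a weight-aware closure would have to keep track of, here is a minimal sketch of an epsilon closure in the tropical (min, +) semiring. It is not the lttoolbox code and makes no claim about how Transducer::closure() is written; `weighted_epsilon_closure` is a hypothetical function, weights are assumed to be non-negative, and the example arcs are the two ε:ε arcs from the lt-print output earlier in the thread.

```python
import heapq


def weighted_epsilon_closure(eps_arcs, start):
    """Return {state: cheapest epsilon-path weight from start}.

    eps_arcs maps a state to a list of (target, weight) epsilon arcs.
    Uses Dijkstra-style relaxation, so weights must be non-negative.
    """
    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        w, state = heapq.heappop(heap)
        if w > best.get(state, float("inf")):
            continue  # stale heap entry
        for target, arc_w in eps_arcs.get(state, []):
            new_w = w + arc_w
            if new_w < best.get(target, float("inf")):
                best[target] = new_w
                heapq.heappush(heap, (new_w, target))
    return best


# The two epsilon:epsilon arcs into the final state 7 from the lt-print output.
EPS_ARCS = {4: [(7, 2.0)], 6: [(7, 1.0)]}

print(weighted_epsilon_closure(EPS_ARCS, 4))
print(weighted_epsilon_closure(EPS_ARCS, 6))
```

An unweighted closure would report only the state sets {4, 7} and {6, 7}. If later steps merge states based on those sets alone, the 2.0 and 1.0 on the epsilon arcs have nowhere to go, which would be consistent with the zero weights seen in the lt-proc output.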

mr-martian avatar Jul 06 '22 18:07 mr-martian

@mr-martian, it seems at some point you attempted to fix it, but then had to revert. Any idea on what needs to be done?

xavivars avatar Nov 01 '23 19:11 xavivars

The issue is that FST minimization was written for unweighted automata, and when weight support was added, closure() and/or minimize() were updated incorrectly. My first attempt at fixing it failed, so someone who knows FST algorithms better than I do needs to go through that code.

mr-martian avatar Nov 01 '23 19:11 mr-martian