openlexicon

potential bug report

Open alephpi opened this issue 1 year ago • 4 comments

[image] Hi, I'm just curious: does the first "aurai" really exist in French?

alephpi avatar May 21 '23 18:05 alephpi

Hmm, this looks like a "bug" in the parser, which classified it as inf (infinitive). One hypothesis: since its frequency in books is 0, the entry probably comes from the subtitle corpora, which may contain some deviant sentences that the parser choked on.

-- Christophe Pallier (http://www.pallier.org) INSERM Cognitive Neuroimaging Lab (http://www.unicog.org)
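A quick way to probe this hypothesis would be to list the entries whose book frequency is 0 but whose subtitle frequency is not. The following is only a rough sketch on made-up rows: `freqlivres` is the book-frequency column, `freqfilms2` is assumed here to be the name of the subtitle-frequency column, and the words and numbers are placeholders, not real database values.

```python
import pandas as pd

# Placeholder rows (invented words and frequencies, for illustration only).
df = pd.DataFrame(
    {
        "ortho": ["mot_a", "mot_b", "mot_c"],
        "freqlivres": [0.0, 12.5, 0.0],   # frequency in books
        "freqfilms2": [0.3, 40.1, 0.0],   # assumed name of the subtitle-frequency column
    }
)

# Entries never seen in books but attested in subtitles: candidates for
# parser errors that originate in the subtitle corpora.
subtitle_only = df[(df["freqlivres"] == 0) & (df["freqfilms2"] > 0)]
print(subtitle_only)
```

On the real table, the resulting rows could then be inspected by hand, starting with those that also have an unexpected `infover` value.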

chrplr avatar May 21 '23 18:05 chrplr

I'll keep reporting potential bugs I find in this issue, since I'm doing some data processing for my project; it's just a side task.

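import re

invari = re.compile('ADV|CON|PRE')
# keep the invariable categories; na=False avoids a NaN mask if cgram is missing
df_invari = df.loc[df.cgram.str.contains(invari, na=False)]
df_invari[df_invari['ortho'] != df_invari['lemme']]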

gives me

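| ortho | phon | lemme | cgram | genre | nombre | freqlemlivres | freqlivres | infover |
|---|---|---|---|---|---|---|---|---|
| aujourd'hui | oZuRd8i | aujourd'huie | ADV | | | 0.14 | 0.14 | |
| bons-cadeaux | b§kado | bon-cadeaux | ADV | | | 0.00 | 0.00 | |
| c'est-à-dire | sEtadiR | c'est-à-diree | ADV | | | 0.07 | 0.07 | |
| d'emblée | d@ble | d'embléee | ADV | | | 0.07 | 0.07 | |
| n | n | ne | ADV | | | 13841.89 | 5.68 | |
| n' | n | ne | ADV | | | 13841.89 | 6084.12 | |
| re | R2 | r | ADV | | | 7.50 | 7.50 | |
| y | i | yu | ADV | | | 0.27 | 0.27 | |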

These lemmas do not seem correct. (I assume the lemma of an invariable word should be the word itself.)
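For anyone who wants to try the same check without loading the full database, here is a minimal self-contained sketch on a toy table. The first two rows are copied from the output above; "hier" and "maisons" are made-up rows added for contrast, and the helper name `invariable_lemma_mismatches` is hypothetical.

```python
import pandas as pd

# Toy table: two rows taken from the output above, plus two made-up rows.
df = pd.DataFrame(
    {
        "ortho": ["aujourd'hui", "y", "hier", "maisons"],
        "lemme": ["aujourd'huie", "yu", "hier", "maison"],
        "cgram": ["ADV", "ADV", "ADV", "NOM"],
    }
)

def invariable_lemma_mismatches(table):
    """Return rows tagged ADV, CON or PRE whose lemma differs from the word form."""
    # na=False treats a missing cgram as "not invariable" instead of producing NaN
    is_invariable = table["cgram"].str.contains("ADV|CON|PRE", na=False)
    return table[is_invariable & (table["ortho"] != table["lemme"])]

suspects = invariable_lemma_mismatches(df)
print(suspects)
```

The noun "maisons" is not reported even though its lemma differs, because only the invariable categories are expected to be their own lemma.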

alephpi avatar May 21 '23 19:05 alephpi

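| ortho | phon | lemme | cgram | genre | nombre | freqlemlivres | freqlivres | infover |
|---|---|---|---|---|---|---|---|---|
| e | 2 | 2e | ADJ | | | 0.00 | 0.00 | |
| e | 2 | 58e | ADJ | | | 0.00 | 0.00 | |
| e | 2 | 7e | ADJ | | | 0.07 | 0.07 | |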

alephpi avatar May 21 '23 20:05 alephpi

bug.csv: Here is a table of words whose lemma's cgram differs from the word's own cgram. (I think lemmatization should be a closed operation, i.e. a lemma should have the same cgram as its word forms, right?)
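One way such a table could be produced (a sketch only, not necessarily how bug.csv was built) is to collect every (ortho, cgram) pair in the database and flag the rows whose (lemme, cgram) pair is not among them. The toy rows and the helper name `lemma_category_mismatches` are made up for illustration.

```python
import pandas as pd

# Made-up rows: "rouge" exists only as NOM, so the ADJ form "rouges"
# points to a lemma that has no ADJ entry of its own.
df = pd.DataFrame(
    {
        "ortho": ["chante", "chanter", "rouges", "rouge"],
        "lemme": ["chanter", "chanter", "rouge", "rouge"],
        "cgram": ["VER", "VER", "ADJ", "NOM"],
    }
)

def lemma_category_mismatches(table):
    """Return rows whose lemma has no entry with the same cgram as the row."""
    known = set(zip(table["ortho"], table["cgram"]))
    mask = [pair not in known for pair in zip(table["lemme"], table["cgram"])]
    return table.loc[mask]

mismatches = lemma_category_mismatches(df)
print(mismatches)
```

Note that this version also flags rows whose lemma is missing from the table altogether, which may or may not be what is wanted.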

alephpi avatar May 21 '23 22:05 alephpi