openlexicon
potential bug report
Hi, I'm just curious: does the first "aurai" really exist in French?
Hmm, this looks like a "bug" from the parser, which classified it as inf (infinitive). One hypothesis: because its frequency in books is 0, the entry likely comes from the subtitle corpora, which may contain some deviant sentences on which the parser choked.
-- Christophe Pallier (http://www.pallier.org) INSERM Cognitive Neuroimaging Lab (http://www.unicog.org)
On Sun, May 21, 2023 at 8:04 PM 润心 @.***> wrote:
[image: image] https://user-images.githubusercontent.com/61275421/239759667-71577d97-2910-46b7-b704-3c51671b650f.png Hi, I'm just curious that the first aurai exists in French?
I'll keep reporting potential bugs I find in this issue. Since I'm doing some data processing for my project, it's just a side task.
import re
invari = re.compile('ADV|CON|PRE')
df_invari = df.loc[df.cgram.str.contains(invari)]
df_invari[df_invari['ortho'] != df_invari['lemme']]
gives me
ortho | phon | lemme | cgram | genre | nombre | freqlemlivres | freqlivres | infover |
---|---|---|---|---|---|---|---|---|
aujourd'hui | oZuRd8i | aujourd'huie | ADV | | | 0.14 | 0.14 | |
bons-cadeaux | b§kado | bon-cadeaux | ADV | | | 0.00 | 0.00 | |
c'est-à-dire | sEtadiR | c'est-à-diree | ADV | | | 0.07 | 0.07 | |
d'emblée | d@ble | d'embléee | ADV | | | 0.07 | 0.07 | |
n | n | ne | ADV | | | 13841.89 | 5.68 | |
n' | n | ne | ADV | | | 13841.89 | 6084.12 | |
re | R2 | r | ADV | | | 7.50 | 7.50 | |
y | i | yu | ADV | | | 0.27 | 0.27 | |
These lemmas do not seem correct. (I assume the lemma of an invariable word is the word itself.)
ortho | phon | lemme | cgram | genre | nombre | freqlemlivres | freqlivres | infover |
---|---|---|---|---|---|---|---|---|
e | 2 | 2e | ADJ | | | 0.00 | 0.00 | |
e | 2 | 58e | ADJ | | | 0.00 | 0.00 | |
e | 2 | 7e | ADJ | | | 0.07 | 0.07 | |
bug.csv Here is a table of words whose lemma's cgram differs from the word's own cgram. (I think lemmatization should be a closed operation, right?)
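For anyone who wants to reproduce this kind of check, here is a minimal sketch of one way to do it. It is not the code that produced bug.csv. It assumes "closed" means that every lemma is itself an entry (an `ortho`) carrying the same `cgram` as the word that points to it. The rows are invented toy data, and the function name `lemma_cgram_mismatches` is hypothetical; only the column names `ortho`, `lemme` and `cgram` come from the tables above.

```python
from collections import defaultdict

# Toy rows standing in for lexicon entries. They are invented for this sketch
# and are NOT taken from the actual database or from bug.csv.
TOY_ROWS = [
    {"ortho": "chante", "lemme": "chanter", "cgram": "VER"},
    {"ortho": "chanter", "lemme": "chanter", "cgram": "VER"},
    {"ortho": "rapide", "lemme": "rapide", "cgram": "ADJ"},
    {"ortho": "rapidement", "lemme": "rapide", "cgram": "ADV"},
    {"ortho": "chats", "lemme": "chat", "cgram": "NOM"},
]


def lemma_cgram_mismatches(rows):
    """Return (ortho, lemme, cgram, reason) for every row whose lemma is not
    itself an entry with the same cgram (assumed meaning of "closed")."""
    # Map each written form to the set of cgram values it appears with.
    cgrams_by_ortho = defaultdict(set)
    for row in rows:
        cgrams_by_ortho[row["ortho"]].add(row["cgram"])

    problems = []
    for row in rows:
        # .get() avoids creating empty entries in the defaultdict.
        lemma_cgrams = cgrams_by_ortho.get(row["lemme"])
        if lemma_cgrams is None:
            reason = "lemma missing"   # the lemma never occurs as an ortho
        elif row["cgram"] not in lemma_cgrams:
            reason = "cgram differs"   # the lemma exists, but with another cgram
        else:
            continue                   # lemma found with the same cgram
        problems.append((row["ortho"], row["lemme"], row["cgram"], reason))
    return problems


for problem in lemma_cgram_mismatches(TOY_ROWS):
    print(problem)
```

Separating "lemma missing" from "cgram differs" may help triage: the first points at a lemma that has no entry of its own, the second at a lemma that exists only under another category.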