proustr
Tools for Natural Language Processing in French and texts from Marcel Proust's collection "A La Recherche Du Temps Perdu"
Hi Colin, I'm using tidytext for tokenization, but I'm having some problems with texts in French. For instance, "L'achat" and "j'ai" are not split as they should be. In [an issue...
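One possible workaround, sketched here in base R, is to insert a space after elided clitics before the text reaches the tokenizer. `split_elisions` is a hypothetical helper, not a function from tidytext or proustr, and the list of clitics is an assumption that may need extending. It only handles the ASCII apostrophe `'`.

```r
# Hypothetical helper (not part of tidytext or proustr):
# inserts a space after an elided clitic such as l', j', d', qu'.
# Assumes apostrophes are the ASCII "'" character.
split_elisions <- function(x) {
  gsub(
    "\\b(qu|jusqu|lorsqu|puisqu|[cdjlmnst])'",  # clitic at a word boundary
    "\\1' ",                                     # keep the clitic, add a space
    x,
    ignore.case = TRUE,
    perl = TRUE
  )
}

split_elisions(c("L'achat", "j'ai"))
strsplit(split_elisions("L'achat"), " +")[[1]]
```

Because the pattern requires a word boundary before the clitic, a word such as "aujourd'hui" is left untouched.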
Sentiment analysis might work better on stemmed text. Might be an option in `proust_sentiments(stem=TRUE)`
Check for this punctuation "«»““”„‟≪≫《》〝〞〟" and 'ʻʼʽ٬‘’‚‛
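A minimal sketch of how these characters could be normalised, assuming the goal is to map every double-quote variant to `"` and every apostrophe variant to `'`. `normalise_quotes` is a hypothetical name, and the characters are written as Unicode escapes so the code does not depend on the file encoding.

```r
# Hypothetical helper: maps the quote and apostrophe variants listed above
# to their ASCII equivalents.
normalise_quotes <- function(x) {
  # « » “ ” „ ‟ ≪ ≫ 《 》 〝 〞 〟
  double_variants <- paste0(
    "[\u00AB\u00BB\u201C\u201D\u201E\u201F",
    "\u226A\u226B\u300A\u300B\u301D\u301E\u301F]"
  )
  # ʻ ʼ ʽ ٬ ‘ ’ ‚ ‛
  single_variants <- "[\u02BB\u02BC\u02BD\u066C\u2018\u2019\u201A\u201B]"
  x <- gsub(double_variants, "\"", x)
  gsub(single_variants, "'", x)
}

normalise_quotes("\u00ABl\u2019achat\u00BB")
```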
`pr_stem` should let the user choose between several stemming methods (SnowballC, hunspell)
Calculation of word rarity, based on: http://www.lexique.org/telLexique.php
Both stemmers should be implemented, with an argument specifying which one to use.
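A rough sketch of what such an argument could look like. `stem_words` and its `method` argument are hypothetical names, not the actual `pr_stem` interface. The hunspell branch assumes a French dictionary is available under the name `"fr_FR"`, which depends on the local installation.

```r
# Hypothetical dispatcher, not the actual pr_stem() implementation.
stem_words <- function(words, method = c("snowball", "hunspell")) {
  method <- match.arg(method)  # errors on any value other than the two above

  if (method == "snowball") {
    if (!requireNamespace("SnowballC", quietly = TRUE)) {
      stop("Package 'SnowballC' is required for method = 'snowball'.")
    }
    return(SnowballC::wordStem(words, language = "french"))
  }

  if (!requireNamespace("hunspell", quietly = TRUE)) {
    stop("Package 'hunspell' is required for method = 'hunspell'.")
  }
  # Assumption: a French dictionary named "fr_FR" is installed locally.
  dict <- hunspell::dictionary("fr_FR")
  stems <- hunspell::hunspell_stem(words, dict = dict)
  # hunspell_stem() returns a list with zero or more candidates per word:
  # keep the first candidate, or the word itself when none is found.
  vapply(seq_along(words), function(i) {
    if (length(stems[[i]]) > 0) stems[[i]][1] else words[i]
  }, character(1))
}
```

Using `match.arg()` makes the first option the default and rejects unknown methods early, before either package is loaded.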
https://www.rocq.inria.fr/alpage-wiki/tiki-index.php?page=CorpusSequoia
http://www.llf.cnrs.fr/Gens/Abeille/French-Treebank-fr.php
http://lia.univ-avignon.fr/chercheurs/bechet/download_fred.html
http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
Some terms might be misspelled, appear only once or twice in the dataset, and should be put back in the right spot in the table. `pr_spell_*` or so would take...
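One way this could work, sketched under assumptions of my own rather than the issue's actual design: treat terms seen at most `max_rare` times as candidate misspellings and fold each into the closest frequent term when the edit distance is at most `max_dist`. `merge_rare_terms` and both thresholds are hypothetical. Only base R's `utils::adist()` is used.

```r
# Hypothetical sketch: fold rare terms into the nearest frequent term.
# counts: named integer vector mapping term -> frequency.
merge_rare_terms <- function(counts, max_rare = 2L, max_dist = 1L) {
  rare   <- names(counts)[counts <= max_rare]
  common <- names(counts)[counts > max_rare]
  if (length(rare) == 0 || length(common) == 0) return(counts)

  # Edit distances: one row per rare term, one column per frequent term
  d <- utils::adist(rare, common)

  for (i in seq_along(rare)) {
    j <- which.min(d[i, ])
    if (d[i, j] <= max_dist) {
      # Move the rare term's count to its closest frequent neighbour
      counts[common[j]] <- counts[common[j]] + counts[rare[i]]
      counts[rare[i]] <- 0L
    }
  }
  counts[counts > 0]
}

counts <- c(temps = 10L, tempz = 1L, madeleine = 5L)
merge_rare_terms(counts)
```

A pure edit-distance rule will also merge rare but valid words that happen to sit close to a frequent one, so a dictionary check would likely be needed in practice.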