Kohei Watanabe issues

Results: 81 issues of Kohei Watanabe

Consider parallelizing tokenization

There is a package called **future.apply** which provides parallelized apply-type functions. It seems that we can parallelize tokenization with `future_lapply()`. ```r require(quanteda) require(future.apply) plan(multiprocess) > corp #corp txt length(txt) [1]...

performance

tokens
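A minimal sketch of the chunk-and-combine idea behind the parallel tokenization issue above, assuming **quanteda** and **future.apply** are installed. The sample texts, the two-worker setup, and the `chunk_id` helper are illustrative assumptions, and `plan(multisession)` is swapped in for `plan(multiprocess)`, which newer releases of **future** deprecate. This is not the code from the issue.

```r
library(quanteda)
library(future.apply)

# Two background R sessions (an assumption for this sketch).
plan(multisession, workers = 2)

# Hypothetical sample texts; names become docnames.
txt <- c(d1 = "This is the first document.",
         d2 = "Here is another one.",
         d3 = "A third text to tokenize.",
         d4 = "And the last one.")

# Split the texts into two contiguous chunks, one per worker.
chunk_id <- cut(seq_along(txt), breaks = 2, labels = FALSE)
chunks <- split(txt, chunk_id)

# Tokenize each chunk in parallel, then combine the tokens objects.
toks_list <- future_lapply(chunks, function(x) quanteda::tokens(x))
toks <- do.call(c, unname(toks_list))

# Return to sequential processing.
plan(sequential)

ndoc(toks)
```

Whether this pays off depends on corpus size, since each chunk has to be serialized to and from the workers.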

Reconsider handling of padding

Following bug #1960, we have to reconsider how we handle padding. ``` dfmt[,""] # error dfm_select(dfmt, "") # works ``` R treats empty names as a special case according...

question

robustness

dfm
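The special treatment of empty names mentioned in the padding issue above can be reproduced in base R. The sketch below uses a plain matrix as a stand-in for a dfm, which is an assumption made for illustration; indexing on an actual **quanteda** dfm may behave differently.

```r
# A small matrix whose first column has an empty name, standing in for a
# padding feature (plain base R, not a quanteda dfm).
m <- matrix(1:4, nrow = 2, dimnames = list(c("d1", "d2"), c("", "a")))

# Character indexing never matches an empty name, so this raises an error.
res <- tryCatch(m[, ""], error = function(e) e)
inherits(res, "error")

# Logical indexing on the column names selects the column instead.
pad <- m[, colnames(m) == "", drop = FALSE]
dim(pad)
```

Comparing names with `==` avoids name matching entirely, which is one way a selection function could support padding without special-casing `""`.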

Return docnames as row.names or a column

Since we no longer use rownames in the data.frame for docvars, `docvars(dfmt)` returns no information about docnames. We can return docnames as rownames. ```r # Current > rownames(docvars(dfm(c("a", "b")))) [1]...

dev-corpus2

Reconsider character string transformation

If `corpus` is the object for the original texts, there shouldn't be `corpus_reshape()`. Even if texts are segmented into sentences or paragraphs, we can apply all preprocessing on the tokens...

design

Show docvars in print()

I was thinking of adding functions that enhance user experience in v2.0, because the internal structural changes stay unnoticed unless there are some "good" (or "bad") things for users. `print_with_docvar()`...

enhancement

tokens

feature request

corpus

Support the TEI format

I recently learned that the TEI XML format is becoming popular in the linguistics community. In this format, texts are saved in small chunks with associated meta information (e.g. speaker),...

Add encoding inference function

The EU manifesto example is incorrect, because Hungarian text, for example, is not in ISO-8859-1. https://readtext.quanteda.io/articles/readtext_vignette.html#reading-one-or-more-text-files However, it is tedious to specify the encoding manually. Why not do it like this? `stri_enc_detect()`...
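One way the encoding inference idea above could look, as a hedged sketch that assumes **stringi** is installed. The Hungarian sample string and the choice to take the top-ranked guess are illustrative assumptions, not the implementation proposed in the issue. Detection on short inputs is unreliable, so the guess should be checked against its confidence.

```r
library(stringi)

# Hungarian sample text, written with \u escapes so the script itself is
# encoding-independent, then converted to ISO-8859-2 bytes.
txt <- "\u00e1rv\u00edzt\u0171r\u0151 t\u00fck\u00f6rf\u00far\u00f3g\u00e9p"
bytes <- iconv(txt, from = "UTF-8", to = "ISO-8859-2", toRaw = TRUE)[[1]]

# stri_enc_detect() returns one data frame of candidates per input,
# with columns Encoding, Language and Confidence, best guess first.
guess <- stri_enc_detect(bytes)[[1]]
head(guess, 3)

# Convert with the top-ranked candidate, if there is one.
best <- guess$Encoding[1]
if (!is.na(best)) {
  converted <- stri_encode(bytes, from = best, to = "UTF-8")
}
```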

Ambiguity in Chinese seed words for CF and MN

Hello, I found some issues in the Chinese simplified dictionary. I list them here. 1. 'CF': [中非共和国, 中非*, 班吉]. 中非 is a term used in a general context...

Add more seed dictionaries

More languages need to be covered: - [x] English (master) - [x] Russian - [x] German - [x] Spanish - [x] Portuguese - [x] Italian - [x] French...

help wanted

dictionary

Allow ambiguous place names to be included

**quanteda** v1.5 added `nested_scope = "dictionary"` to `tokens_lookup()`. If this option is used, a new priority rule applies in dictionary lookup. ``` 'DM': [Commonwealth of Dominica, Commonwealth Dominican*, Roseau]...

dictionary