Kohei Watanabe

Results 81 issues of Kohei Watanabe

There is a package called **future.apply** which provides parallelized apply-type functions. It seems that we can parallelize tokenization with `future_lapply()`. ```r require(quanteda) require(future.apply) plan(multiprocess) > corp #corp txt length(txt) [1]...

performance
tokens

Following bug #1960, we have to reconsider how we handle padding. ``` dfmt[,""] # error dfm_select(dfmt, "") # works ``` R treats empty names as a special case according...

question
robustness
dfm

Since we no longer use rownames in the data.frame for docvars, `docvars(dfmt)` returns no information about docnames. We can return docnames as rownames. ```r # Current > rownames(docvars(dfm(c("a", "b")))) [1]...

dev-corpus2

If `corpus` is the object for the original texts, there shouldn't be `corpus_reshape()`. Even if texts are segmented into sentences or paragraphs, we can apply all preprocessing on the tokens...

design

I was thinking of adding functions that enhance user experience in v2.0, because internal structural changes stay unnoticed unless there are some "good" (or "bad") things for users. `print_with_docvar()`...

enhancement
tokens
feature request
corpus

I recently learned that the TEI XML format is becoming popular in the linguistics community. In this format, texts are saved in small chunks with associated meta information (e.g. speaker),...

The EU manifesto example is incorrect, because Hungarian text, for example, is not in ISO-8859-1. https://readtext.quanteda.io/articles/readtext_vignette.html#reading-one-or-more-text-files However, it is tedious to specify the encoding manually. Why not do it like this? `stri_enc_detect()`...

Hello, I found some issues in the Chinese simplified dictionary. I list them here. 1. 'CF': [中非共和国, 中非*, 班吉]. 中非 is a term used in a general context...

More languages need to be covered: - [x] English (master) - [x] Russian - [x] German - [x] Spanish - [x] Portuguese - [x] Italian - [x] French...

help wanted
dictionary

**quanteda** v1.5 added `nested_scope = "dictionary"` to `tokens_lookup()`. If this argument is used, a new priority rule applies in dictionary lookup. ``` 'DM': [Commonwealth of Dominica, Commonwealth Dominican*, Roseau]...

dictionary