Kohei Watanabe
Something like this? When a corpus is reshaped, `docid_` shows the number of sentences in the original documents. ```r require(quanteda) corp
I think the output for `docid_`, `segid_` and `length_` is not bad, although it is redundant with `document_`.
I don't have a strong opinion on this issue, but we need to close it for modularization. Only for the sake of discussion: ``` > summary(corpus_reshape(data_corpus_inaugural)) Corpus consisting of 5018 documents document segment...
How about this?

```r
> docvars(dfm(c("a", "b")))
  docname_ docid_ segid_
1    text1  text1      1
2    text2  text2      1
```

This is the same as ```r > quanteda:::get_docvars.dfm(dfm(c("a", "b")), user =...
I like the mask idea, but we need to generalize it a bit more to allow selection of ngrams and collocations of different lengths. I will also think about it.
I wrote a small function to compute PMI using FCM a while ago. Do you want to add something like this? ```r > toks fcmt > fcm_pmi
Why don't you start a branch to add a new function called `fcm_weight()` with additional measures? I am happy to assist.
I wrote `fcm_pmi()` as a pre-processing step for SVD, so I thought it should be in the main package. If it is for network analysis, textstats would be a better place. @eisioriginal how...
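
For readers following the PMI discussion above, here is a minimal base-R sketch of pointwise mutual information computed from a co-occurrence count matrix. It is not the `fcm_pmi()` mentioned in the comments, whose code is not shown here. The name `pmi_from_counts` and the toy matrix are hypothetical, and the sketch assumes a full symmetric count matrix rather than any particular quanteda object.

```r
# Hypothetical helper (not quanteda API): PMI from a co-occurrence count matrix.
# Assumes `counts` is a full symmetric matrix of non-negative counts.
pmi_from_counts <- function(counts) {
  counts <- as.matrix(counts)
  total <- sum(counts)
  p_xy <- counts / total              # joint probability of each pair
  p_x <- rowSums(counts) / total      # marginal probability of row features
  p_y <- colSums(counts) / total      # marginal probability of column features
  pmi <- log(p_xy / outer(p_x, p_y))  # log of observed over expected
  pmi[!is.finite(pmi)] <- 0           # zero counts give -Inf or NaN; store 0
  pmi
}

# Toy co-occurrence counts for three features (illustrative data only).
feats <- c("a", "b", "c")
counts <- matrix(c(0, 2, 0,
                   2, 0, 1,
                   0, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(feats, feats))

pmi <- pmi_from_counts(counts)
print(round(pmi, 3))
```

Replacing non-finite values with zero keeps the result usable as input to a decomposition such as SVD, at the cost of treating unseen pairs as independent rather than negatively associated.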
Good eye. `docid` is a factor. We very much welcome users' participation via pull requests!
It would be more useful, and easier, to implement `skip` for `window`, which would work like ``` tokens_compound(toks, "not", window = 2, skip = 1) #> [1] "London_not" "not_bad" "a"...