CountVectorizer split argument doesn't do anything

Open nshahpazov opened this issue 3 years ago • 1 comments

I have the following example

library(superml)

# should be a vector of texts
sents <-  c('i, am, going, home, and, home',
          'where, are, you , going.? //// ',
          'how, does, it, work')

cfv <- CountVectorizer$new(max_features = 10, remove_stopwords = FALSE, split = ",")

# generate the matrix
cf_mat <- cfv$fit_transform(sents)

head(cf_mat, 3)

After running it, you can see that the text is not split on the comma; it is still split on whitespace.
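For comparison, here is a minimal base R sketch of the comma-based tokenization this report expects. It does not use superml at all; `strsplit`, `trimws` and `table` are base R, and the names `tokens`, `vocab` and `mat` are only illustrative. How CountVectorizer actually treats `split` internally is not shown here.

```r
# Same input as in the example above
sents <- c('i, am, going, home, and, home',
           'where, are, you , going.? //// ',
           'how, does, it, work')

# Split each text on a literal comma, then trim whitespace around each token
tokens <- lapply(strsplit(sents, ",", fixed = TRUE), trimws)

# Vocabulary: every distinct token across all texts
vocab <- sort(unique(unlist(tokens)))

# Document-term count matrix: one row per text, one column per token
mat <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))

print(mat)
```

With a comma split, punctuation that is not a comma stays inside the token, so the second text yields "going.? ////" as a single token rather than "going".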

Is this a bug? Would a Pull Request be welcome? Thanks in advance!

nshahpazov, Dec 20 '21 13:12

@nshahpazov it works correctly for me, what is the issue here?

head(cf_mat, 3)
     home going you work where it i how does are
[1,]    2     1   0    0     0  0 1   0    0   0
[2,]    0     1   1    0     1  0 0   0    0   1
[3,]    0     0   0    1     0  1 0   1    1   0

saraswatmks, May 06 '22 20:05