hummingbird CountVectorizer implementation

The existing CountVectorizer code uses TorchScript (jit) constructs, such as this line in the forward function:

doc_ids = torch.jit.annotate(List[Tensor], [])  # noqa: F821

which we need to work around so that it doesn't fail at

  File "/root/hummingbird/hummingbird/ml/_container.py", line 63, in forward
    raise RuntimeError("Inputer tensor {} of not supported type {}".format(input_name, type(inputs[i])))

because it's not a tensor

See this branch
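For anyone picking this up, here is a minimal sketch of the two pieces involved. It assumes eager-mode PyTorch; `collect_doc_ids` and `check_input` are hypothetical names made up for illustration and are not hummingbird's actual code. The first function shows what `torch.jit.annotate` does with an empty list, and the second is a simplified stand-in for the kind of type check that raises in `_container.py`.

```python
from typing import List

import torch
from torch import Tensor


def collect_doc_ids(docs: List[Tensor]) -> List[Tensor]:
    # torch.jit.annotate gives TorchScript the element type of an empty list;
    # in eager mode it simply returns the value it was given.
    doc_ids = torch.jit.annotate(List[Tensor], [])
    for i, doc in enumerate(docs):
        # One document index per token of document i.
        doc_ids.append(torch.full_like(doc, i))
    return doc_ids


def check_input(input_name: str, value: object) -> Tensor:
    # Hypothetical stand-in for the container's input check (an assumption,
    # not the real implementation): anything that is not a Tensor is rejected.
    if not isinstance(value, Tensor):
        raise RuntimeError(
            "Input {} of unsupported type {}".format(input_name, type(value))
        )
    return value
```

With this sketch, passing a tensor to `check_input` returns it unchanged, while passing raw documents such as a list of strings raises the `RuntimeError`, which is the situation the workaround has to handle.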

Jul 20 '20 18:07 ksaur

@ksaur I'm solving the same in issue #293 as discussed in issue #164

Sep 13 '20 21:09 Hemantr05

Hi @hemantr05,

For issue #164, there are two parts:

In #293, you are working on the first half, tf-idf. This issue is challenging and non-trivial for sure!!
For the second part there is CountVectorizer (this issue). As mentioned in #164, we have some internal code already for CountVectorizer that was a bit more time-consuming to integrate, which I can definitely post in the future!

We really appreciate your enthusiasm!! If you finish your current two issues (#293 and #273) you can get started on this third one! :) Let me know if you have questions or would like to change which issue you focus on! Thanks again!

Sep 14 '20 02:09 ksaur

Actually you can take a look at count vectorizer code at this old branch.

Sep 14 '20 03:09 interesaaat

@interesaaat - I see that I also had the old CV code already posted in the original post above (See "this branch"). :-D I can delete mine if you made changes in yours? (else they appear to be dups)

Sep 14 '20 04:09 ksaur

I made no changes, let me delete mine then since it is not used.

Sep 14 '20 04:09 interesaaat

@ksaur Sure. Will finish the previously assigned issue first and get back to this

Sep 14 '20 16:09 Hemantr05
