hummingbird
CountVectorizer implementation
The existing CountVectorizer code uses TorchScript (JIT) constructs, such as this line in the forward function:
doc_ids = torch.jit.annotate(List[Tensor], []) # noqa: F821
which we need to work around so that it doesn't fail at
File "/root/hummingbird/hummingbird/ml/_container.py", line 63, in forward
raise RuntimeError("Inputer tensor {} of not supported type {}".format(input_name, type(inputs[i])))
because it's not a tensor
See this branch
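For anyone reading along, here is a minimal sketch of the two pieces described above, assuming only stock PyTorch. `CollectRows` and `check_inputs` are hypothetical names made up for illustration; they are not the code from the branch or from `_container.py`. The module shows why the list needs `torch.jit.annotate`, and the helper mimics the kind of tensor-only input check that raises when something other than a tensor (for example a list of strings) is passed in.

```python
from typing import List

import torch
from torch import Tensor


class CollectRows(torch.nn.Module):
    """Hypothetical module that builds a list of tensors inside forward."""

    def forward(self, x: Tensor) -> Tensor:
        # TorchScript cannot infer the element type of an empty list,
        # so the list is annotated explicitly as List[Tensor].
        doc_ids = torch.jit.annotate(List[Tensor], [])
        for i in range(x.size(0)):
            doc_ids.append(x[i])
        return torch.stack(doc_ids)


def check_inputs(*inputs):
    """Hypothetical stand-in for a container-level input check (an assumption)."""
    for i, value in enumerate(inputs):
        if not isinstance(value, torch.Tensor):
            raise RuntimeError(
                "Input {} has unsupported type {}".format(i, type(value))
            )


scripted = torch.jit.script(CollectRows())

# Tensor input passes the check and runs through the scripted module.
x = torch.zeros(3, 2)
check_inputs(x)
out = scripted(x)

# Raw documents are not tensors, so the check rejects them.
try:
    check_inputs(["a raw text document"])
    rejected = False
except RuntimeError:
    rejected = True
```

If the real container check behaves like this stand-in, the workaround has to either convert the text input to tensors before it reaches `forward` or relax the check for this operator; which of the two the branch does is not something this sketch tries to settle.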
@ksaur I'm solving the same in issue #293 as discussed in issue #164
Hi @hemantr05,
For issue #164, there are two parts:
- In #293, you are working on the first half, tf-idf. This issue is challenging and non-trivial for sure!!
- For the second part there is CountVectorizer (this issue). As mentioned in #164, we have some internal code already for CountVectorizer that was a bit more time-consuming to integrate, which I can definitely post in the future!
We really appreciate your enthusiasm!! If you finish your current two issues (#293 and #273) you can get started on this third one! :) Let me know if you have questions or would like to change which issue you focus on! Thanks again!
Actually you can take a look at count vectorizer code at this old branch.
@interesaaat - I see that I also had the old CV code already posted in the original post above (See "this branch"). :-D I can delete mine if you made changes in yours? (else they appear to be dups)
I made no changes, let me delete mine then since it is not used.
@ksaur Sure. Will finish the previously assigned issue first and get back to this