Can WordCountVectorizer support custom dictionaries?
Hello, thank you very much for developing Rubix ML; it is great work. I am currently using it and have found an important problem to report to you.
When I used Rubix ML to develop a feature involving word segmentation, I found that WordCountVectorizer can only split articles into words by spaces. However, in many articles, and especially across large collections of documents, splitting on spaces wrongly breaks apart many terms that are made up of two or three words. For example, the Latin names of many plants consist of two words, and the name completely loses its meaning once split. This makes the applicability of WordCountVectorizer very limited. I wonder if you could support custom dictionaries, so that users can create their own dictionary and WordCountVectorizer can recognize the two- or three-word terms defined in the user's dictionary as single tokens. This would greatly expand the applicability of WordCountVectorizer.
Best wishes
This is a use case I have encountered as well, and it would be very useful.
I think we can probably make Word Count Vectorizer more flexible, if not directly, then through the Tokenizer abstraction. Even with a custom vocabulary, text blobs still need to be tokenized in order to be counted. Take a look at what we have here and let us know if a modification to one of the current Tokenizers or a new tokenizer will provide the functionality you need.
https://github.com/RubixML/ML/tree/2.0/src/Tokenizers
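For example, one of the existing tokenizers, NGram, already emits multi-word tokens. A minimal sketch, assuming the 2.0 constructor order of Word Count Vectorizer (max vocabulary size, min document count, max document ratio, tokenizer):

```php
<?php

use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Tokenizers\NGram;

// NGram(1, 2) emits unigrams and bigrams, so a two-word name such as
// "Ferula sinkiangensis" also survives as a single token alongside the
// individual words. The argument order below is an assumption based on
// the 2.0 docs: max vocabulary size, min document count, max document
// ratio, tokenizer.
$vectorizer = new WordCountVectorizer(10000, 1, 0.8, new NGram(1, 2));
```

The downside is that every adjacent word pair becomes a token, not just the terms you care about, which inflates the vocabulary.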
Hello, I have carefully read the files at the link you gave, but I still don't see how to use a user-defined dictionary or anything similar.
I'm sure there's a way we can make it work. A custom vocabulary is the easy part. Tokenizing is the harder part because you still need to recognize words that are in the vocabulary from arbitrary blobs of text. Perhaps a "RegexTokenizer" is something we should spend some time thinking about: a tokenizer that would allow the user to define how words are tokenized through regular expressions.
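A rough sketch of the idea, purely hypothetical and not part of the library, assuming the Tokenizer interface only requires a tokenize() method plus __toString():

```php
<?php

use Rubix\ML\Tokenizers\Tokenizer;

/**
 * Hypothetical "RegexTokenizer" sketch - not part of the library.
 * The user supplies the pattern that decides what counts as a token.
 */
class RegexTokenizer implements Tokenizer
{
    /** @var string */
    protected $pattern;

    public function __construct(string $pattern)
    {
        $this->pattern = $pattern;
    }

    /**
     * Return every substring of the text matched by the pattern.
     */
    public function tokenize(string $text) : array
    {
        preg_match_all($this->pattern, $text, $matches);

        return $matches[0];
    }

    public function __toString() : string
    {
        return "Regex Tokenizer (pattern: {$this->pattern})";
    }
}
```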
@kornatzky Hello, I didn't understand andrewdalpino's reply. I can't see how regular expressions could be used to implement a custom dictionary. Do you understand what he means?
You need to recognize these multi-word phrases, and there are two possibilities. One is to define a regular expression for each such phrase; the problem is that with more than a few phrases this does not scale. The other is to program a tokenizer that takes a list of phrases and recognizes each one as a single token. Currently, I do not think the available tokenizers can do this, that is, accept a list of phrases that are each to be recognized as a token.
@kornatzky Thank you very much. After your explanation, I think I understand. I reread the files in that directory. If I want to define a regular expression, I think the place to do it is https://github.com/RubixML/ML/blob/2.0/src/Tokenizers/Word.php , in WORD_REGEX. For example, to match the following two Latin scientific names, Ferula sinkiangensis and Ferula fukanensis, I would need to edit WORD_REGEX to: protected const WORD_REGEX = "/Ferula sinkiangensis|Ferula fukanensis|[\w'-]+/u"; Is my understanding correct? If it is, and if there are many terms to define (which is almost certain), then this WORD_REGEX will become extremely complex.
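Written out in place, the change I am imagining is just the following sketch. The phrase alternatives have to come before the generic [\w'-]+ branch, otherwise the single-word branch would match "Ferula" first and the name would still be split:

```php
<?php

// Sketch of the proposed edit to src/Tokenizers/Word.php: dictionary
// phrases are listed first, then the original catch-all word pattern.
protected const WORD_REGEX = "/Ferula sinkiangensis|Ferula fukanensis|[\w'-]+/u";
```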
Seems to me your understanding is correct, and it would be complex. So we need something that takes a list of words as a parameter.
@kornatzky I quite agree. I'm looking forward to it.
Hey @neosaganeo, is there a pattern that would capture the words correctly for your language besides just spelling the exact words out in a list?
What I'm thinking is that, if we could provide a tokenizer that could tokenize the words correctly, then Word Count Vectorizer, Token Hashing Vectorizer, and potentially a new "Dictionary" Vectorizer would all be compatible with arbitrary languages. So the trick is designing the tokenizer.
@andrewdalpino Hello, I see what you mean. But I think that no matter how well a tokenizer is designed, it will only be good at accurately splitting text into individual words; it may still fail to recognize terms such as Latin scientific names, phrases, and other multi-word terms. If a user-defined dictionary is not difficult to implement, I think it would be the best solution for handling Latin scientific names, phrases, and multi-word terms.
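For example, something along the following lines would let users pass in their own dictionary of multi-word terms. This is only a hypothetical sketch (the class does not exist in the library), combining the phrase-list idea above with the regex approach, and again assuming the Tokenizer interface only needs tokenize() and __toString():

```php
<?php

use Rubix\ML\Tokenizers\Tokenizer;

/**
 * Hypothetical dictionary-aware tokenizer - not part of the library.
 * Builds one alternation from the user's phrase list and falls back to
 * plain word matching for everything else.
 */
class DictionaryTokenizer implements Tokenizer
{
    /** @var string */
    protected $pattern;

    /**
     * @param string[] $phrases Multi-word terms to keep as single tokens.
     */
    public function __construct(array $phrases)
    {
        $quoted = array_map(
            fn (string $phrase) => preg_quote($phrase, '/'),
            $phrases
        );

        // Dictionary phrases are tried before the generic word pattern so
        // that "Ferula sinkiangensis" is matched whole, not as two words.
        $this->pattern = '/' . implode('|', $quoted) . "|[\w'-]+/u";
    }

    public function tokenize(string $text) : array
    {
        preg_match_all($this->pattern, $text, $matches);

        return $matches[0];
    }

    public function __toString() : string
    {
        return 'Dictionary Tokenizer';
    }
}

// Hypothetical usage - the Word Count Vectorizer argument order is assumed
// from the 2.0 docs (max vocabulary size, min document count, max document
// ratio, tokenizer):
// $vectorizer = new WordCountVectorizer(10000, 1, 0.8,
//     new DictionaryTokenizer(['Ferula sinkiangensis', 'Ferula fukanensis']));
```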