Add initial Chinese support for hero.lang.zh.preprocessing
This is the first step towards Chinese support. All the docstrings are the same as the original.

- Create a `hero.lang.hero_zh` module
- Implement `preprocessing.py` for Chinese, removing some inapplicable functions from the original API
- Add test cases and docstring tests in Chinese
See #68.
Hi @henrifroese, you've made a good point.
My starting point is that a lot of standard functions are not applicable to Chinese, for example `remove_digits`, `remove_*_brackets` (which should be done only after tokenization), `stem`, and `remove_diacritics`. Since the user chose to deal with Chinese, exposing these APIs may be confusing or misleading (refer to #84). So I basically put everything that supports Chinese into the `hero_zh` module. But, as you said, this also introduces lots of redundant code and makes it hard to maintain as more languages join.
Your suggestion is great, and I thought the same. But what can we do about the problem above? Should we hide these inapplicable APIs from the user?
Dear @AlfredWGA and @henrifroese,
Thank you for the PR and the respective review; good job! 👍
Regarding the order of operations in the `preprocessing` phase, it's probably more natural to first tokenize the text and then apply all the different functions to the tokenized text.
The only drawback of this solution is that it might be slower for certain operations. For instance, `remove_punctuation` on a string with regex is faster than removing punctuation by iterating over a list of tokens. Nonetheless, the second method can be highly parallelized and we might overcome the problem.
Regarding the fact that some of the functions are not relevant for a specific language (e.g. `stem`), I don't really know what the best solution is; I'm open to any ideas. Many other programmers have probably faced the same issue before us; we might want to look around and see how they solved it. Maybe there is even a design pattern to resolve that?
Regarding the other comment(s): agree with Henri
P.S. @AlfredWGA, should I call you Alfred or Guoao?
> My starting point is that a lot of standard functions are not applicable to Chinese, for example `remove_digits`, `remove_*_brackets` (which should be done only after tokenization), `stem`, and `remove_diacritics`. Since the user chose to deal with Chinese, exposing these APIs may be confusing or misleading (refer to #84). So I basically put everything that supports Chinese into the `hero_zh` module. But, as you said, this also introduces lots of redundant code and makes it hard to maintain as more languages join.
What do you think about this: when the language is set to e.g. Chinese and a user calls a function `texthero.f`, texthero goes to `texthero/lang/Chinese/supported_functions.py`. In that file, there are two lists (or rather dicts):

- functions overwritten / changed / added for Chinese -> if `f` is here, call the "Chinese version"
- functions from texthero that don't make sense for Chinese -> if `f` is here, throw an Exception (or a warning)

(A rough sketch of that file is below.) This also allows us to gradually add support. At the beginning, many functions might still be in the second list; after implementing solutions for Chinese, they are moved to the first list.
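A minimal sketch of what that dispatch could look like (all names here are hypothetical, not actual texthero code):

```python
# Hypothetical texthero/lang/Chinese/supported_functions.py -- a sketch of
# the two-dict dispatch idea, not actual texthero code.
import texthero.preprocessing as default_preprocessing


def zh_tokenize(s):
    """Placeholder for a Chinese-specific tokenizer."""
    ...


# 1) functions overwritten / changed / added for Chinese
OVERWRITTEN = {"tokenize": zh_tokenize}

# 2) functions from texthero that don't make sense for Chinese (yet)
NOT_SUPPORTED = {"stem", "remove_diacritics"}


def get_function(name):
    """Resolve `name` to the right implementation for Chinese."""
    if name in OVERWRITTEN:
        return OVERWRITTEN[name]  # the "Chinese version"
    if name in NOT_SUPPORTED:
        raise NotImplementedError(f"'{name}' is not supported for Chinese")
    # otherwise fall back to the default implementation
    return getattr(default_preprocessing, name)
```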
However, I also agree with this, maybe there's a well-proven path to doing this:
> Many other programmers have probably faced the same issue before us; we might want to look around and see how they solved it. Maybe there is even a design pattern to resolve that?
One other thing: maybe convert this to a draft PR: on the right side, under "Reviewers", it should say "Still in progress? Convert to draft", click that. This way, everyone can see this is a work in progress and not ready to be merged yet.
@jbesomi, thanks for the suggestion, I'll try to look for solutions through other NLP libraries. Also you can call me Guoao. 😃
> - functions overwritten / changed / added for Chinese -> if `f` is here, call the "Chinese version"
> - functions from texthero that don't make sense for Chinese -> if `f` is here, throw an Exception (or a warning)
@henrifroese thanks for your idea. But think of this situation: without documentation or a docstring, when a user tries to call a function, he/she won't know whether it supports Chinese until that function returns (a correct result or an Exception), which is kind of annoying, isn't it?
How about I do this under `texthero/lang/zh/preprocessing.py`:

```python
# Original functions that support Chinese
from texthero.preprocessing import fillna, drop_no_content, remove_whitespace, ...

# Functions overwritten/added for Chinese
def tokenize(s: pd.Series) -> pd.Series:
    ...
```
This way we still maintain a list of supported/not-supported functions, and also prevent users from calling unsupported functions and getting Exceptions. What do you think?
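For illustration, assuming the layout above, a user would see (sketch):

```python
import pandas as pd
import texthero.lang.zh.preprocessing as zh

s = pd.Series(["你好， 世界"])

zh.remove_whitespace(s)  # re-exported from texthero.preprocessing -> works
zh.tokenize(s)           # Chinese-specific implementation -> works
zh.stem(s)               # AttributeError: module has no attribute 'stem'
```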
> How about I do this under `texthero/lang/zh/preprocessing.py`:
That is extremely simple and does the job! Makes sense!
> This way we still maintain a list of supported/not-supported functions, and also prevent users from calling unsupported functions and getting Exceptions. What do you think?
I only fail to see how we still maintain a list of *not* supported functions. The way you suggest, if someone calls a texthero function that isn't implemented for Chinese yet and doesn't work for Chinese, Python will just give an error saying it doesn't know the function. Or would you keep a separate list of functions that need to be changed for Chinese but aren't implemented yet?
Note that this PR introduces these major changes:

- `replace_tags`/`replace_hashtags`: the regex pattern changed from `r"@[a-zA-Z0-9]+"` to `r"@[a-zA-Z0-9\u4e00-\u9fa5]+"` -> we might just add these changes to the "default" functions? (A sketch of the widened pattern follows this list.)
- `tokenize`: the function makes use of the spaCy Chinese tokenizer. We probably want to have spaCy as the default tokenizer. The question is then: do we need a separate `tokenize` function, or might a universal one be enough? See also #131
- `clean`: calls only `fillna`, `remove_whitespace` and `tokenize` -> a solution is required (see above)
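For reference, a minimal sketch of the widened pattern in action (the two pattern strings are from the PR; the helper name and example are just for illustration):

```python
import pandas as pd

# Pattern from the PR: ASCII alphanumerics plus the CJK Unified
# Ideographs block (U+4E00-U+9FA5).
PATTERN_TAGS_ZH = r"@[a-zA-Z0-9\u4e00-\u9fa5]+"


def replace_tags_zh(s: pd.Series, symbol: str) -> pd.Series:
    # Hypothetical helper mirroring texthero's replace_tags
    return s.str.replace(PATTERN_TAGS_ZH, symbol, regex=True)


s = pd.Series(["Hi @texthero英雄, nice work!"])
print(replace_tags_zh(s, "TAG"))  # 0    Hi TAG, nice work!
```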
Review:

- We probably do not need an extra `test_indexes.py` file; rather, adapt the actual one to test functions from any `lang` --> we might just have a list of "valid" function names and then iterate over `__DIR__(texthero.lang)` for `lang in {'default', 'en'}` or something similar?
@henrifroese I think every standard function that isn't included in the language module `hero.lang.*` can be taken as not supported, and will raise an `AttributeError` when called as `texthero.lang.zh.preprocessing.*`.
> @henrifroese I think every standard function that isn't included in the language module `hero.lang.*` can be taken as not supported, and will raise an `AttributeError` when called as `texthero.lang.zh.preprocessing.*`.
Makes sense, I think that's good enough for users :ok_hand:
> We probably do not need an extra `test_indexes.py` file; rather, adapt the actual one to test functions from any `lang` --> we might just have a list of "valid" function names and then iterate over `__DIR__(texthero.lang)` for `lang in {'default', 'en'}` or something similar?
@jbesomi Maybe put these functions in `__all__` under each language package and call them with `getattr()`:

```python
import importlib

import texthero.lang

test_cases_preprocessing = []
for module_str in dir(texthero.lang):
    if not module_str.startswith("__"):
        module = importlib.import_module(
            "texthero.lang." + module_str, package="texthero.lang"
        )
        test_cases_preprocessing.extend(
            [getattr(module, func) for func in module.__all__]
        )
```
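For illustration, the collected functions could then feed a parametrized index test along these lines (a sketch; it ignores functions that need extra arguments, and assumes the `parameterized` package used in the existing tests):

```python
import unittest

import pandas as pd
from parameterized import parameterized

# test_cases_preprocessing is collected as in the snippet above.


class TestIndexes(unittest.TestCase):
    @parameterized.expand(
        [(func.__name__, func) for func in test_cases_preprocessing]
    )
    def test_same_index(self, name, func):
        # Hypothetical check: the output keeps the input Series' index.
        s = pd.Series(["Test"], index=[5])
        self.assertTrue(func(s).index.equals(s.index))
```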
Exactly! What if we do that in a separate PR? Would you like to do it?
@jbesomi Sure, I'll look into it.
Thank you!! :)
Should we avoid calling variables and functions across modules?
For example, in https://github.com/jbesomi/texthero/blob/master/texthero/preprocessing.py#L295, `texthero.stopwords` is used by `remove_stopwords`. Suppose there's a package for French, `texthero.lang.fr`: the same `remove_stopwords` logic can apply properly, except that `stopwords` should point to `texthero.lang.fr.stopwords` instead of `texthero.stopwords`. The only way is to copy & paste the same code to `texthero.lang.fr.remove_stopwords` and add `from texthero.lang.fr import stopwords as _stopwords` in `texthero.lang.fr.preprocessing`, which is quite redundant.
I think in most cases it is unnecessary for modules to call each other. Regulating this might be better for supporting other languages; what do you guys think?
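For instance, one way to decouple this (just a sketch, not the current texthero API; it assumes tokenized input and a hypothetical `DEFAULT` stopword set per language package):

```python
from typing import Set

import pandas as pd


def remove_stopwords(s: pd.Series, stopwords: Set[str]) -> pd.Series:
    # Language-agnostic: the caller supplies the stopword set, so this
    # module never has to import texthero.stopwords (or
    # texthero.lang.fr.stopwords) itself.
    return s.apply(lambda tokens: [t for t in tokens if t not in stopwords])


# Each language package then only ships its data:
#   from texthero.lang.fr import stopwords as fr_stopwords
#   remove_stopwords(tokenized_series, fr_stopwords.DEFAULT)
```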
> For example, in ... which is quite redundant.
Hey Guoao, I agree with you that this might be redundant.
As a general comment: since for now we are not offering multilingual support, any ideas that provide flexibility, cleanness, or code robustness are very welcome.
I just don't understand what your advice and idea are for solving this.
Just add a custom Series type to align with the standard `preprocessing.py`.
I am thinking of simply not calling functions or variables from other modules, unless that module is language-agnostic. For example, `preprocessing` and `stopwords` are obviously language-specific, but `visualization`, `_helper`, and `_types` may not be. `import _helper` in `preprocessing` is okay, but `import preprocessing` in `visualization` may lead to the problem I mentioned above.
Currently I only found a few such cases in the code, like `stopwords` used in `remove_stopwords`, and `tokenize` used in functions of `representation`. Removing these calls would not seriously affect the original functions; we can tell the user in the docstring to do these steps outside the function.
By the way, the Chinese word segmentation package `jieba` is optional for users, but it's necessary for testing; should I add it to the dev dependencies?
> By the way, the Chinese word segmentation package `jieba` is optional for users, but it's necessary for testing; should I add it to the dev dependencies?
Yes. And we will have to make it clear to users that if they want to use `jieba` they need to install it separately. That's similar to what we are doing in #146 with Flair.
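Something like the usual lazy import guard would do (a sketch; the helper name and message wording are illustrative):

```python
def _check_jieba():
    # Fail with a clear message only when Chinese tokenization is used,
    # so jieba stays out of the hard requirements.
    try:
        import jieba  # noqa: F401
    except ImportError:
        raise ImportError(
            "Chinese tokenization requires the optional dependency jieba. "
            "Install it with: pip install jieba"
        )
```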
For the rest: I agree!
> For the rest: I agree!
Should I create another PR? That issue should be solved before this PR can continue.
@henrifroese Do you think we should proceed with that?
@AlfredWGA I'm not sure what you mean. It's difficult for us not to import functions from other modules at the moment (e.g. I'm not sure how we would avoid using the tokenize function in representation right now?). Or maybe I am misunderstanding you?
> It's difficult for us not to import functions from other modules at the moment (e.g. I'm not sure how we would avoid using the tokenize function in representation right now?).
Hi @henrifroese. Some modules need to deal with specific languages and therefore shouldn't be imported directly within other modules (because the user chooses what language they want to deal with). Currently `preprocessing` and `stopwords` seem like such modules. Other modules are definitely okay to import.
For `representation`, if all functions receive a `TokenSeries` as input, there's no need to use `not_tokenized_warning` or `preprocessing`; the tokenization step is already done by the users themselves.
Okay I get it now, thanks 👌.
So presumably the solution would be:

- tokenize: as you said above, we can switch to using the decorator `@InputSeries(TokenSeries)` to force a TokenSeries input and can then leave out the tokenization (see the sketch below)
- stopwords and other cases: I don't know, interested in how you solve this 💡

I think a separate PR for this definitely makes sense.
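For reference, roughly how the decorator would be applied (a sketch; `count_tokens` is a made-up example, and it assumes `TokenSeries`/`InputSeries` from the typing work land in `texthero._types`):

```python
import pandas as pd

from texthero._types import InputSeries, TokenSeries


@InputSeries(TokenSeries)
def count_tokens(s: pd.Series) -> pd.Series:
    # The decorator raises a helpful error if `s` is not tokenized, so the
    # function no longer has to call tokenize (or warn) itself.
    return s.apply(len)
```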
Why does the error `The command "black --check ." exited with 1.` occur in CI? I ran this check locally and there's nothing wrong.
See #171
Just fixed the problem, thank you! @jbesomi