
Add initial Chinese support for hero.lang.zh.preprocessing

ryangawei opened this issue 3 years ago • 24 comments

This is the first step towards Chinese support. All docstrings are the same as the originals.

  • Create a hero.lang.hero_zh module
  • Implement preprocessing.py for Chinese, removing some inapplicable functions from the original API
  • Add test cases and docstring tests in Chinese

See #68.

ryangawei avatar Jul 29 '20 03:07 ryangawei

Hi @henrifroese, you've made a good point.

My starting point is that a lot of standard functions are not applicable to Chinese, for example remove_digits, remove_*_brackets (which should be applied only after tokenization), stem, and remove_diacritics. Since the user has chosen to work with Chinese, exposing these APIs may be confusing or misleading (refer to #84). So I basically put everything that supports Chinese into the hero_zh module. But, as you said, that also introduces a lot of redundant code and makes it hard to maintain as more languages join.

Your suggestion is great, and I thought the same. However, how do we deal with the problem above? Should we hide these N/A APIs from the user?

ryangawei avatar Jul 30 '20 03:07 ryangawei

Dear @AlfredWGA and @henrifroese,

Thank you for the PR and the respective review; good job! 👍

Regarding the order of operations in the preprocessing phase, it's probably more natural to first tokenize the text and then apply all the different functions on a tokenized text.

The only drawback of this solution is that it might be slower for certain operations. For instance, remove_punctuation on a string with regex is faster than removing punctuation by iterating over a list of tokens. Nonetheless, the second method can be highly parallelized and we might overcome the problem.
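To make the trade-off concrete, here is a toy illustration of the two approaches (the sample string and token list are made up):

import re
import string

text = "Hello, world! (example)"

# Approach 1: regex on the raw string -- a single pass, usually fast
no_punct = re.sub(r"[^\w\s]", "", text)  # "Hello world example"

# Approach 2: filter an already-tokenized list -- slower per cell,
# but each cell is independent, so it parallelizes well across a Series
tokens = ["Hello", ",", "world", "!", "(", "example", ")"]
no_punct_tokens = [t for t in tokens if t not in string.punctuation]
# ["Hello", "world", "example"]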

Regarding the fact that some of the functions are not relevant for a specific language (e.g. stem), I don't really know what the best solution is. I'm open to any ideas. Many other programmers have probably faced the same issue before us; we might want to look around and see how they solved it. Maybe there is even a design pattern for that?

Regarding the other comment(s): agree with Henri

P.S @AlfredWGA should I call you Alfred or Guoao?

jbesomi avatar Jul 30 '20 09:07 jbesomi

My starting point is that a lot of standard functions are not applicable to Chinese, for example remove_digits, remove_*_brackets (which should be applied only after tokenization), stem, and remove_diacritics. Since the user has chosen to work with Chinese, exposing these APIs may be confusing or misleading (refer to #84). So I basically put everything that supports Chinese into the hero_zh module. But, as you said, that also introduces a lot of redundant code and makes it hard to maintain as more languages join.

What do you think about this: When the language is set to e.g. Chinese and a user calls a function texthero.f, texthero goes to texthero/lang/Chinese/supported_functions.py. In that file, there are two lists (or rather dicts):

  1. functions overwritten / changed / added for Chinese -> if f is here, call the "Chinese version"
  2. functions from texthero that don't make sense for Chinese -> if f is here, throw an Exception (or a warning).

This also allows us to gradually add support. At the beginning, many functions might still be in the second list; after implementing solutions for Chinese, they are moved to the first list.
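A minimal sketch of what such a dispatch file could look like; the module path, dict names, and function sets below are hypothetical, not actual texthero API:

# texthero/lang/zh/supported_functions.py (hypothetical layout)
from texthero import preprocessing as default_preprocessing
from texthero.lang.zh import preprocessing as zh_preprocessing

# 1. Functions overwritten / changed / added for Chinese
OVERWRITTEN = {
    "tokenize": zh_preprocessing.tokenize,
}

# 2. Functions that don't make sense for Chinese
NOT_APPLICABLE = {"stem", "remove_diacritics"}


def dispatch(name: str):
    """Return the Chinese version of a function, or raise if unsupported."""
    if name in OVERWRITTEN:
        return OVERWRITTEN[name]
    if name in NOT_APPLICABLE:
        raise NotImplementedError(name + " is not applicable to Chinese")
    # Everything else falls back to the default implementation
    return getattr(default_preprocessing, name)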

However, I also agree with this; maybe there's a well-proven path to doing it:

Many other programmers have probably faced the same issue before us; we might want to look around and see how they solved it. Maybe there is even a design pattern for that?

One other thing: maybe convert this to a draft PR. On the right side, under "Reviewers", it should say "Still in progress? Convert to draft"; click that. This way, everyone can see this is a work in progress and not ready to be merged yet.

henrifroese avatar Jul 30 '20 15:07 henrifroese

@jbesomi, thanks for the suggestion, I'll try to look for solutions in other NLP libraries. Also, you can call me Guoao. 😃

  1. functions overwritten / changed / added for Chinese -> if f is here, call the "Chinese version"
  2. functions from texthero that don't make sense for Chinese -> if f is here, throw an Exception (or a warning).

@henrifroese thanks for your idea. But consider this situation: without documentation or docstrings, when a user tries to call a function, they won't know whether it supports Chinese until the call returns (a correct result or an Exception), which is kind of annoying, isn't it?

How about I do this under texthero/lang/zh/preprocessing.py,

import pandas as pd

# Original functions that support Chinese
from texthero.preprocessing import fillna, drop_no_content, remove_whitespace, ...


# Functions overwritten/added for Chinese
def tokenize(s: pd.Series) -> pd.Series:
    ...

Through this we still maintain a list of supported/unsupported functions, and also prevent users from calling unsupported functions and getting Exceptions. What do you think?

ryangawei avatar Jul 30 '20 15:07 ryangawei

How about I do this under texthero/lang/zh/preprocessing.py

That is extremely simple and does the job! Makes sense!

Through this we still maintain a list of supported/unsupported functions, and also prevent users from calling unsupported functions and getting Exceptions. What do you think?

I just fail to see how we still maintain a list of unsupported functions. The way you suggest, if someone calls a texthero function that's not implemented for Chinese yet and doesn't work for Chinese, Python will just give an error saying it doesn't know the function? Or would you keep a separate list of functions that need to be changed for Chinese but aren't implemented yet?

henrifroese avatar Jul 30 '20 16:07 henrifroese

Note that this PR introduces these major changes:

  • replace_tags / replace_hashtags: The regex pattern changed from r"@[a-zA-Z0-9]+" to r"@[a-zA-Z0-9\u4e00-\u9fa5]+" -> We might just add this change to the "default" functions? (See the sketch after this list.)

  • tokenize: The function makes use of the spaCy Chinese tokenizer. We probably want to have spaCy as the default tokenizer. The question then is: do we need a separate tokenize function, or might a universal one be enough? See also #131

  • clean: Calls only fillna, remove_whitespace and tokenize -> A solution is required (see above)
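To illustrate the regex change in the first bullet, a toy example (the sample text is made up; \u4e00-\u9fa5 covers the common CJK Unified Ideographs range):

import re

default_pattern = r"@[a-zA-Z0-9]+"
zh_pattern = r"@[a-zA-Z0-9\u4e00-\u9fa5]+"

text = "@texthero 和 @文本英雄 都在这里"
print(re.findall(default_pattern, text))  # ['@texthero']
print(re.findall(zh_pattern, text))       # ['@texthero', '@文本英雄']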

Review:

  1. We probably do not need an extra test_indexes.py file; rather, adapt the existing one to test functions from any lang --> we might just have a list of "valid" function names and then iterate over dir(texthero.lang) for lang in {'default', 'en'}, or something similar?

jbesomi avatar Jul 30 '20 16:07 jbesomi

@henrifroese I think every standard function that isn't included in the language module hero.lang.* can be taken as not supported, and will raise an AttributeError when called via texthero.lang.zh.preprocessing.*

ryangawei avatar Jul 31 '20 07:07 ryangawei

@henrifroese I think every standard function that isn't included in the language module hero.lang.* can be taken as not supported, and will raise an AttributeError when called via texthero.lang.zh.preprocessing.*

Makes sense, I think that's good enough for users :ok_hand:

henrifroese avatar Jul 31 '20 13:07 henrifroese

We probably do not need an extra test_indexes.py file; rather, adapt the existing one to test functions from any lang --> we might just have a list of "valid" function names and then iterate over dir(texthero.lang) for lang in {'default', 'en'}, or something similar?

@jbesomi Maybe put these functions in __all__ under each language package and call them with getattr():

import importlib

import texthero.lang

test_cases_preprocessing = []
# Iterate over every language package under texthero.lang
for module_str in dir(texthero.lang):
    if not module_str.startswith("__"):
        module = importlib.import_module('texthero.lang.' + module_str, package='texthero.lang')
        # Collect every public function listed in the package's __all__
        test_cases_preprocessing.extend([getattr(module, func) for func in module.__all__])

ryangawei avatar Aug 01 '20 03:08 ryangawei

Exactly! What if we do that in a separate PR? Would you like to do it?

jbesomi avatar Aug 05 '20 13:08 jbesomi

@jbesomi Sure, I'll look into it.

ryangawei avatar Aug 05 '20 15:08 ryangawei

Thank you!! :)

jbesomi avatar Aug 05 '20 17:08 jbesomi

Should we avoid calling variables and functions across modules?

For example, in https://github.com/jbesomi/texthero/blob/master/texthero/preprocessing.py#L295, texthero.stopwords is used by remove_stopwords. Suppose there's a package for French, texthero.lang.fr: the same remove_stopwords logic applies just fine, except that stopwords should point to texthero.lang.fr.stopwords instead of texthero.stopwords. The only way is to copy & paste the same code into texthero.lang.fr.remove_stopwords and add from texthero.lang.fr import stopwords as _stopwords in texthero.lang.fr.preprocessing, which is quite redundant.

I think in most cases it is unnecessary for modules to call each other. Regulating this might make it easier to add support for other languages; what do you guys think?
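One hedged sketch of how the coupling could be avoided: let remove_stopwords take the stopword set as an explicit parameter, so the core function never imports a language-specific module. The signature and the token-list input below are illustrative, not texthero's actual API:

import pandas as pd

def remove_stopwords(s: pd.Series, stopwords: set) -> pd.Series:
    # Filter each (already tokenized) cell against the given stopword set
    return s.apply(lambda tokens: [t for t in tokens if t not in stopwords])

# Each language package would pass its own set, e.g. (hypothetically):
#   from texthero.lang.fr import stopwords as fr_stopwords
#   s = remove_stopwords(s, fr_stopwords.DEFAULT)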

ryangawei avatar Aug 14 '20 02:08 ryangawei

For example, in ... which is quite redundant.

Hey Guoao, I agree with you that this might be redundant.

As a general comment: as of now we are not offering multilingual support, so any idea that provides flexibility, cleanness, or code robustness is very welcome.

I just don't understand what your advice is and how you'd solve it.

jbesomi avatar Aug 14 '20 14:08 jbesomi

Just add a custom Series type to align with the standard preprocessing.py.

I am thinking of simply not calling functions or variables from other modules unless that module is language-agnostic. For example, preprocessing and stopwords are obviously language-specific, but visualization, _helper, and _types may not be. Importing _helper in preprocessing is okay, but importing preprocessing in visualization may lead to the problem I mentioned above.

Currently I've only found a few such cases in the code, like stopwords used in remove_stopwords and tokenize used in functions of representation. Removing these calls would not seriously affect the original functions; we can tell users in the docstrings to do these steps outside the function.

By the way, the Chinese word segmentation package jieba is optional for users, but it's necessary for testing; should I add it to the dev dependencies?

ryangawei avatar Aug 15 '20 03:08 ryangawei

By the way, the Chinese word segmentation package jieba is optional for users, but it's necessary for testing; should I add it to the dev dependencies?

Yes. And we will have to make it clear to users that if they want to use jieba they need to install it separately. That's similar to what we are doing in #146 with Flair.

For the rest: I agree!

jbesomi avatar Aug 20 '20 08:08 jbesomi

For the rest: I agree!

Should I create another PR? That issue should be solved before this PR can continue.

ryangawei avatar Aug 20 '20 15:08 ryangawei

@henrifroese Do you think we should proceed with that?

ryangawei avatar Sep 02 '20 12:09 ryangawei

@AlfredWGA I'm not sure what you mean. It's difficult for us not to import functions from other modules at the moment (e.g. I'm not sure how we would avoid using the tokenize function in representation right now). Or maybe I am misunderstanding you?

henrifroese avatar Sep 02 '20 19:09 henrifroese

It's difficult for us not to import functions from other modules at the moment (e.g. I'm not sure how we would avoid using the tokenize function in representation right now).

Hi @henrifroese. Some modules need to deal with specific languages, and therefore shouldn't be imported directly within other modules (because the user chooses what language they want to deal with). Currently preprocessing and stopwords seem like such modules. Other modules are definitely okay to import.

For representation, if all functions receive a TokenSeries as input, there's no need to use not_tokenized_warning or preprocessing; the tokenization step has already been done by the users themselves.

ryangawei avatar Sep 03 '20 02:09 ryangawei

Okay I get it now, thanks 👌.

So presumably the solution would be:

  • tokenize: as you said above, we can switch to using the decorator @InputSeries(TokenSeries) to force a TokenSeries input and can then leave out the tokenization (see the sketch after this list)

  • stopwords and other cases: I don't know, interested in how you solve this 💡
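A minimal sketch of the decorator idea from the first bullet, assuming texthero's _types module exposes InputSeries and TokenSeries; the function name and body are illustrative:

import pandas as pd
from texthero._types import InputSeries, TokenSeries

@InputSeries(TokenSeries)
def term_frequency(s: pd.Series) -> pd.Series:
    # The decorator raises early if s is not already tokenized,
    # so no internal call to preprocessing.tokenize is needed here
    ...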

I think a separate PR for this definitely makes sense.

henrifroese avatar Sep 03 '20 06:09 henrifroese

Why does the error The command "black --check ." exited with 1 occur in CI? I ran this check locally and there's nothing wrong.

ryangawei avatar Sep 07 '20 09:09 ryangawei

See #171

jbesomi avatar Sep 07 '20 18:09 jbesomi

Just fixed the problem, thank you! @jbesomi

ryangawei avatar Sep 08 '20 07:09 ryangawei