
wordseg_best unable to transform English words correctly

Open · jslim89 opened this issue 2 years ago · 7 comments

Description

I'm using wordseg_best, and the example given works as expected.

However, when I try it with a mix of English and Thai, the English words are not segmented properly.

Expected Behavior

+---------------------------------------------------------------------------------------------------------------------------------------+
|term_text                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------+
|[oem, loomma, สำหรับ, ฐาน, ลำโพง, apple, homepod, อุปกรณ์, เครื่อง, เสียง, ยึด, ขา, ตั้ง, ไม้, แข็ง, ตั้ง, พื้น, speaker, stands, null]|
|[v3i, 100, original, motorola, razr, v3i, quad, band, flip, gsm, bluetooth, mp3, unlocked, mobile, phone, console, gaming, controllers]|
+---------------------------------------------------------------------------------------------------------------------------------------+

Current Behavior

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|term_masterbrain                                                                                                                                                                                                                                                         |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[o, e, m, l, o, o, m, m, a, สำหรับฐาน, ล, ำ, โพง, a, p, p, l, e, h, o, m, e, p, o, d, อุปกรณ์, เครื่อง, เสียง, ยึด, ขา, ตั้ง, ไม้, แข็ง, ตั้ง, พื้น, s, p, e, a, k, e, r, s, t, a, n, d, snull]                                                                          |
|[v, 3, i1, 0, 0, o, r, i, g, i, n, a, l, m, o, t, o, r, o, l, a, r, a, z, r, v, 3, i, q, u, a, d, b, a, n, d, f, l, i, p, g, s, m, b, l, u, e, t, o, o, t, h, m, p3unlockedmobile, p, h, o, n, e, c, o, n, s, o, l, e, g, a, m, i, n, g, c, o, n, t, r, o, l, l, e, r, s]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Possible Solution

Steps to Reproduce

Run the following unit test:

import unittest

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from sparknlp.annotator import WordSegmenterModel
from sparknlp.base import DocumentAssembler, Finisher

# PySparkTestCase replaced with unittest.TestCase so the example is self-contained
class TestThaiNlp(unittest.TestCase):

    def setUp(self):
        self.spark = SparkSession.builder \
            .master('local') \
            .appName('vision') \
            .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.0") \
            .getOrCreate()
        # Note the trailing comma in each row: createDataFrame needs tuples, not bare strings
        self.df = self.spark.createDataFrame(
            [
                ('oem loomma สำหรับฐานลำโพง apple homepod อุปกรณ์เครื่องเสียงยึดขาตั้งไม้แข็งตั้งพื้น speaker stands null',),
                ('v3i 100 original motorola razr v3i quad band flip gsm bluetooth mp3 unlocked mobile phone console gaming controllers',),
            ],
            [
                'text',
            ]
        )

    def test_sparknlp(self):
        field = 'text'
        document_assembler = DocumentAssembler() \
            .setInputCol(field) \
            .setOutputCol(f'{field}_document')
        word_seg = WordSegmenterModel.pretrained('wordseg_best', 'th') \
            .setInputCols(f'{field}_document') \
            .setOutputCol(f'{field}_token')
        finisher = Finisher() \
            .setInputCols([f'{field}_token']) \
            .setIncludeMetadata(True)
        pipeline = Pipeline(stages=[document_assembler, word_seg, finisher])
        result = pipeline.fit(self.df).transform(self.df).withColumnRenamed(f'finished_{field}_token', f'term_{field}')
        result.select(f'term_{field}').show(2, False)

    def tearDown(self):
        self.spark.stop()

Context

I'm doing a benchmark with pythainlp

Your Environment

  • Spark NLP version: 4.0.0
  • Apache Spark version: 3.2.0
  • Java version: openjdk version "11.0.15" 2022-04-19
  • Operating System and version: Ubuntu 18.04.4 LTS

jslim89 · Jul 01 '22 11:07

Hi,

Short answer: WordSegmenterModel doesn't support multi-lingual word segmentation; it is always trained on a specific language.

The WordSegmenterModel is for languages that require segmentation, like the model you are using, which only supports Thai.

Since this annotator is always trained on a single language, you need a mix of WordSegmenterModel for Thai and Tokenizer for English. I would suggest using LanguageDetectorDL to detect the language of each row/document, and then, based on the value of that column, using one of those two annotators to tokenize the content. (Or, if you already have a way to split the DataFrame by language, you can run different pipelines for different languages.)
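A minimal sketch of that routing idea, assuming df is a DataFrame with a text column (the model name ld_wiki_tatoeba_cnn_21 is just one example of a pretrained LanguageDetectorDL model):

from pyspark.sql import functions as F
from sparknlp.annotator import LanguageDetectorDL
from sparknlp.base import DocumentAssembler

document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

# Detect the language of each document
lang_detector = LanguageDetectorDL.pretrained('ld_wiki_tatoeba_cnn_21', 'xx') \
    .setInputCols(['document']) \
    .setOutputCol('language')

detected = lang_detector.transform(document_assembler.transform(df)) \
    .withColumn('lang', F.col('language.result').getItem(0))

# Route each subset to the pipeline that fits its language
thai_df = detected.filter(F.col('lang') == 'th')  # -> WordSegmenterModel pipeline
eng_df = detected.filter(F.col('lang') == 'en')   # -> Tokenizer pipeline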

maziyarpanahi · Jul 01 '22 12:07

Sorry, closed by mistake. That being said, we will look into why content with Thai (even with a little English) is not performing well. @danilojsl

maziyarpanahi · Jul 01 '22 12:07

Hi @maziyarpanahi, what if a single field contains a mix of English and Thai words? Like oem loomma สำหรับฐานลำโพง apple homepod อุปกรณ์เครื่องเสียงยึดขาตั้งไม้แข็งตั้งพื้น speaker stands null. It's not possible to process that with spark-nlp, right?

jslim89 · Jul 01 '22 12:07

Hi @maziyarpanahi, what if a single field contains a mix of English and Thai words? Like oem loomma สำหรับฐานลำโพง apple homepod อุปกรณ์เครื่องเสียงยึดขาตั้งไม้แข็งตั้งพื้น speaker stands null. It's not possible to process that with spark-nlp, right?

That's what @danilojsl will investigate, to see if that's possible. For now, only the language the model was trained on can be segmented via WordSegmenterModel.

maziyarpanahi · Jul 01 '22 12:07

Alright. Thanks @maziyarpanahi

jslim89 · Jul 01 '22 12:07

Hi @jslim89, as @maziyarpanahi pointed out, WordSegmenter is not multi-lingual. All these models assume the document content is in a single language. If a sentence mixes languages, the model will segment/combine characters based on the language it was trained for (in this example, Thai); every other character will come out as a single-character token, since the model does not know how to segment/combine those.

One way to change this behavior would be for WordSegmenter to internally run a regular Tokenizer annotator, leave the tokens with non-Thai characters as they are, and only segment the Thai tokens. For this example, it would run the word-segmenter algorithm only on the [สำหรับฐานลำโพง, อุปกรณ์เครื่องเสียงยึดขาตั้งไม้แข็งตั้งพื้น] tokens, so the output would look something like this: [oem, loomma, word_segmenter_output, apple, homepod, word_segmenter_output, speaker, stands, null]
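As an illustration of that filtering step (a sketch of the idea, not the actual implementation), tokens that need segmentation can be detected with a regex over the Unicode Thai block:

import re

# Thai code points live in the U+0E00..U+0E7F Unicode block
THAI_CHARS = re.compile(r'[\u0E00-\u0E7F]')

def needs_segmentation(token: str) -> bool:
    """Return True if the token contains at least one Thai character."""
    return bool(THAI_CHARS.search(token))

tokens = 'oem loomma สำหรับฐานลำโพง apple homepod speaker stands null'.split()
# Only the Thai tokens would be handed to the word-segmenter algorithm;
# everything else passes through unchanged.
thai_tokens = [t for t in tokens if needs_segmentation(t)]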

This behavior requires a change in the code. @maziyarpanahi, let me know if we should proceed.

danilojsl · Aug 18 '22 18:08

@danilojsl that's interesting; it would certainly make the annotator more flexible. However, would this mean passing a TokenizerModel to WordSegmenter somehow? (That makes saving and serialization complicated.)

What we can do is have a RegexTokenizer inside and control it via some parameters:

  • enableRegexTokenizer
  • if enabled, the following parameters are used to configure the RegexTokenizer and get the results internally:
  • .setToLowercase(true)
  • .setPattern("\s+")

This way we can easily save those parameters, and it lets users customize how words with whitespace between them are tokenized.
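If implemented, usage might look like the sketch below. The setEnableRegexTokenizer, setToLowercase, and setPattern setters are hypothetical here, lifted from the parameter list above; they do not exist in Spark NLP 4.0.0.

# Hypothetical API: assumes the proposed parameters are added to WordSegmenterModel
word_seg = WordSegmenterModel.pretrained('wordseg_best', 'th') \
    .setInputCols(['document']) \
    .setOutputCol('token') \
    .setEnableRegexTokenizer(True) \
    .setToLowercase(True) \
    .setPattern('\\s+')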

maziyarpanahi · Aug 20 '22 08:08