spark-nlp
wordseg_best unable to transform english words correctly
Description
I'm using wordseg_best, and the example given works as expected.
However, when I try it on text that mixes English and Thai, the English words are not segmented properly.
Expected Behavior
+---------------------------------------------------------------------------------------------------------------------------------------+
|term_text |
+---------------------------------------------------------------------------------------------------------------------------------------+
|[oem, loomma, สำหรับ, ฐาน, ลำโพง, apple, homepod, อุปกรณ์, เครื่อง, เสียง, ยึด, ขา, ตั้ง, ไม้, แข็ง, ตั้ง, พื้น, speaker, stands, null]|
|[v3i, 100, original, motorola, razr, v3i, quad, band, flip, gsm, bluetooth, mp3, unlocked, mobile, phone, console, gaming, controllers]|
+---------------------------------------------------------------------------------------------------------------------------------------+
Current Behavior
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|term_masterbrain |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[o, e, m, l, o, o, m, m, a, สำหรับฐาน, ล, ำ, โพง, a, p, p, l, e, h, o, m, e, p, o, d, อุปกรณ์, เครื่อง, เสียง, ยึด, ขา, ตั้ง, ไม้, แข็ง, ตั้ง, พื้น, s, p, e, a, k, e, r, s, t, a, n, d, snull] |
|[v, 3, i1, 0, 0, o, r, i, g, i, n, a, l, m, o, t, o, r, o, l, a, r, a, z, r, v, 3, i, q, u, a, d, b, a, n, d, f, l, i, p, g, s, m, b, l, u, e, t, o, o, t, h, m, p3unlockedmobile, p, h, o, n, e, c, o, n, s, o, l, e, g, a, m, i, n, g, c, o, n, t, r, o, l, l, e, r, s]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Possible Solution
Steps to Reproduce
Run the unit test
import unittest

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from sparknlp.annotator import WordSegmenterModel
from sparknlp.base import DocumentAssembler, Finisher


class TestThaiNlp(unittest.TestCase):  # originally extended PySparkTestCase from the local test harness

    def setUp(self):
        self.spark = SparkSession.builder \
            .master('local') \
            .appName('vision') \
            .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.0") \
            .getOrCreate()
        # note the trailing commas: each row must be a one-element tuple
        self.df = self.spark.createDataFrame(
            [
                ('oem loomma สำหรับฐานลำโพง apple homepod อุปกรณ์เครื่องเสียงยึดขาตั้งไม้แข็งตั้งพื้น speaker stands null',),
                ('v3i 100 original motorola razr v3i quad band flip gsm bluetooth mp3 unlocked mobile phone console gaming controllers',),
            ],
            ['text'],
        )

    def test_sparknlp(self):
        field = 'text'
        document_assembler = DocumentAssembler() \
            .setInputCol(field) \
            .setOutputCol(f'{field}_document')
        word_seg = WordSegmenterModel.pretrained('wordseg_best', 'th') \
            .setInputCols(f'{field}_document') \
            .setOutputCol(f'{field}_token')
        finisher = Finisher() \
            .setInputCols([f'{field}_token']) \
            .setIncludeMetadata(True)
        pipeline = Pipeline(stages=[document_assembler, word_seg, finisher])
        result = pipeline.fit(self.df).transform(self.df) \
            .withColumnRenamed(f'finished_{field}_token', f'term_{field}')
        result.select(f'term_{field}').show(2, False)

    def tearDown(self):
        self.spark.stop()
Context
I'm running a benchmark against pythainlp.
Your Environment
- Spark NLP version:
4.0.0
- Apache Spark version:
3.2.0
- Java version:
openjdk version "11.0.15" 2022-04-19
- Operating System and version:
Ubuntu 18.04.4 LTS
Hi,
Short answer: WordSegmenterModel doesn't support multi-lingual word segmentation; it is always trained on a specific language.
The WordSegmenterModel is for languages that require segmentation, and the model you are using only supports Thai. Since this annotator is always trained on a single language, you need a mix of WordSegmenterModel for Thai and Tokenizer for English. I would suggest using LanguageDetectorDL to detect the language of each row/document and then, based on the value of that column, use one of those two annotators to tokenize the content (or, if you already have a way to separate the DataFrame by language, you can run different pipelines for different languages); a rough sketch of that routing follows.
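For reference, here is a minimal sketch of that routing idea, assuming the public ld_wiki_tatoeba_cnn_21 detector, a DataFrame named df, and a text column; adapt the names to your data.

from pyspark.ml import Pipeline
from pyspark.sql import functions as F
from sparknlp.annotator import LanguageDetectorDL, Tokenizer, WordSegmenterModel
from sparknlp.base import DocumentAssembler, Finisher

# 1) Detect the language of every row.
detect_pipeline = Pipeline(stages=[
    DocumentAssembler().setInputCol('text').setOutputCol('document'),
    LanguageDetectorDL.pretrained('ld_wiki_tatoeba_cnn_21', 'xx')
        .setInputCols(['document'])
        .setOutputCol('language'),
])
detected = detect_pipeline.fit(df).transform(df) \
    .withColumn('lang', F.expr('language.result[0]'))

# 2) Route rows to a language-specific tokenization pipeline.
thai_rows = detected.filter(F.col('lang') == 'th').select('text')
other_rows = detected.filter(F.col('lang') != 'th').select('text')

thai_pipeline = Pipeline(stages=[
    DocumentAssembler().setInputCol('text').setOutputCol('document'),
    WordSegmenterModel.pretrained('wordseg_best', 'th')
        .setInputCols(['document'])
        .setOutputCol('token'),
    Finisher().setInputCols(['token']),
])
english_pipeline = Pipeline(stages=[
    DocumentAssembler().setInputCol('text').setOutputCol('document'),
    Tokenizer().setInputCols(['document']).setOutputCol('token'),
    Finisher().setInputCols(['token']),
])

thai_tokens = thai_pipeline.fit(thai_rows).transform(thai_rows)
other_tokens = english_pipeline.fit(other_rows).transform(other_rows)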
Sorry, closed by mistake. That being said, we will look into why content in Thai (even with a few English words) is not performing well. @danilojsl
Hi @maziyarpanahi, what if a single field contains a mix of English and Thai words? For example: oem loomma สำหรับฐานลำโพง apple homepod อุปกรณ์เครื่องเสียงยึดขาตั้งไม้แข็งตั้งพื้น speaker stands null. It's not possible to process that with spark-nlp, right?
That's what @danilojsl will investigate to see if that's possible. For now, only the language the model was trained on can be segmented via WordSegmenterModel.
Alright. Thanks @maziyarpanahi
Hi @jslim89, as @maziyarpanahi pointed out, WordSegmenter is not multi-lingual. All these models assume the document contents are in a single language. So, suppose a sentence has a mix of languages. In that case, it will segment/combine the characters based on the language the model was trained on (in this example Thai); the other characters will each be treated as a single character since the model does not know how to segment/combine them.
One way to change this behavior would be for WordSegmenter to internally run a regular Tokenizer annotator, leave the tokens whose characters are not Thai as they are, and only segment the Thai tokens.
So, for this example, it would run the word-segmentation algorithm only on the [สำหรับฐานลำโพง, อุปกรณ์เครื่องเสียงยึดขาตั้งไม้แข็งตั้งพื้น] tokens.
The output would then look something like this:
[oem, loomma, word_segmenter_output, apple, homepod, word_segmenter_output, speaker, stands, null]
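For illustration, here is a small plain-Python sketch of that idea, with a hypothetical segment_thai callback standing in for the trained segmenter (this is not part of Spark NLP, just a demonstration of the proposed splitting):

import re

# Thai characters live in the Unicode block U+0E00-U+0E7F.
THAI_CHARS = re.compile(r'[\u0E00-\u0E7F]')

def segment_mixed(text, segment_thai):
    tokens = []
    for chunk in text.split():
        if THAI_CHARS.search(chunk):
            # Thai chunk: delegate to the word-segmentation algorithm.
            tokens.extend(segment_thai(chunk))
        else:
            # Latin/numeric chunk: keep it as a single token.
            tokens.append(chunk)
    return tokens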
This behavior requires a change in the code; @maziyarpanahi, let me know if we should proceed.
@danilojsl that's interesting; it will certainly make the annotator more flexible. However, would this require passing a TokenizerModel to WordSegmenter somehow? (That would complicate saving and serialization.)
What we can do is have a RegexTokenizer inside and control it via a few parameters:
- enableRegexTokenizer
- if enabled, the following parameters are used to configure the RegexTokenizer internally and get the results:
- .setToLowercase(true)
- .setPattern("\s+")
This way we can easily save those parameters, and it lets users customize how words separated by whitespace are tokenized; a tentative usage sketch follows.
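If this lands, usage could look roughly like the following; treat the setter names as tentative, since they mirror the parameters listed above rather than a released API.

# Tentative sketch of the proposed parameters (names are assumptions, not a released API).
word_seg = WordSegmenterModel.pretrained('wordseg_best', 'th') \
    .setInputCols(['document']) \
    .setOutputCol('token') \
    .setEnableRegexTokenizer(True) \
    .setToLowercase(True) \
    .setPattern('\\s+')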