Korean text is not split well
Hello,
First of all, thank you for the great work! I was excited to try out this powerful text segmentation model, so I tested it with both an English text and a translated Korean text. However, I encountered an issue where a large chunk of the Korean text was considered a single sentence. I tried another sample, but once again, the entire text was returned as a single sentence. Could you please help me figure out what I might be doing wrong?
Thank you in advance.
Code
from wtpsplit import SaT

# onnxruntime GPU
model_ort = SaT("sat-12l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
# warm-up sanity check (IPython)
%timeit list(model_ort.split("This is a test This is another test."))

# `text` and `korean` (defined elsewhere) hold the English README and its Korean translation
english_split = model_ort.split(text)
korean_split = model_ort.split(korean)
print(len(english_split))
print(len(korean_split))
for en, ko in zip(english_split, korean_split):
    print("English", en)
    print("Korean", ko)
    print("==============")
Result (Korean segments shown in English translation)
English wtpsplit🪓
Segment any Text - Robustly, Efficiently, Adaptably⚡
Korean wtpsplit🪓
Text segmentation - robust, efficient, and adaptable⚡
==============
English This repository allows you to segment text into sentences or other semantic units.
Korean This repository allows you to segment text into sentences or other semantic units.
==============
English It implements the models from:
Korean It implements the following models:
==============
English SaT – Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation by Markus Frohmann, Igor Sterner, Benjamin Minixhofer, Ivan Vulić and Markus Schedl (state-of-the-art, encouraged).
Korean SaT – Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation, by Markus Frohmann, Igor Sterner, Benjamin Minixhofer, Ivan Vulić and Markus Schedl (state-of-the-art, recommended).
==============
English WtP – Where's the Point?
Korean WtP – Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation, by Benjamin Minixhofer, Jonas Pfeiffer and Ivan Vulić (previous version, maintained for reproducibility).
==============
English Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation by Benjamin Minixhofer, Jonas Pfeiffer and Ivan Vulić (previous version, maintained for reproducibility).
Korean The name WtP is kept for consistency.
==============
English The namesake WtP is maintained for consistency.
Korean Our new follow-up SaT provides robust, efficient, and adaptable sentence segmentation across 85 languages at higher performance and lower compute cost. Check out the state-of-the-art results on 8 distinct corpora and 85 languages demonstrated in the Segment any Text paper.
==============
English Our new followup SaT provides robust, efficient and adaptable sentence segmentation across 85 languages at higher performance and less compute cost.
Korean System Figure
Installation
pip install wtpsplit
Usage
from wtpsplit import SaT
sat = SaT("sat-3l")
# optionally run on GPU for better performance
# also supports TPUs via e.g. sat.to("xla:0"); in that case pass `pad_last_batch=True` to sat.split
sat.half().to("cuda")
sat.split("This is a test This is another test.")
# returns ["This is a test ", "This is another test."]
# do this instead of calling sat.split on every text individually for much better performance
sat.split(["This is a test This is another test.", "And some more texts..."])
# returns an iterator yielding a list of sentences for every text
# use the '-sm' models for general sentence segmentation tasks
sat_sm = SaT("sat-3l-sm")
sat_sm.half().to("cuda") # optional, see above
sat_sm.split("this is a test this is another test")
# returns ["this is a test ", "this is another test"]
# use trained LoRA modules for strong adaptation to language & domain/style
sat_adapted = SaT("sat-3l", style_or_domain="ud", language="en")
sat_adapted.half().to("cuda") # optional, see above
sat_adapted.split("This is a test This is another test.")
# returns ['This is a test ', 'This is another test']
ONNX Support
🚀 You can now enable even faster ONNX inference for sat and sat-sm models! 🚀
sat = SaT("sat-3l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
>>> from wtpsplit import SaT
>>> texts = ["This is a sentence. This is another sentence."] * 1000
# PyTorch GPU
>>> model_pytorch = SaT("sat-3l-sm")
>>> model_pytorch.half().to("cuda");
>>> %timeit list(model_pytorch.split(texts))
# 144 ms ± 252 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# quite fast already, but...
# onnxruntime GPU
>>> model_ort = SaT("sat-3l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
>>> %timeit list(model_ort.split(texts))
# 94.9 ms ± 165 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#
==============
English Check out the state-of-the-art results in 8 distinct corpora and 85 languages demonstrated in our Segment any Text paper.
Korean ...this should be ~50% faster! (tested on RTX 3090)
==============
English System Figure
Korean To use LoRA in combination with an ONNX model:
Run scripts/export_to_onnx_sat.py with use_lora: True and an appropriate output_dir: <OUTPUT_DIR>.
==============
English Installation
Korean If you have a local LoRA module, use lora_path.
==============
English pip install wtpsplit
Korean To load a LoRA module from the HuggingFace hub, use style_or_domain and language.
==============
English Usage
from wtpsplit import SaT
Korean Load the ONNX model with merged LoRA weights: sat = SaT(<OUTPUT_DIR>, onnx_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
==============
English sat = SaT("sat-3l")
# optionally run on GPU for better performance
# also supports TPUs via e.g. sat.to("xla:0"), in that case pass `pad_last_batch=True` to sat.split
sat.half().to("cuda")
Korean Available Models
==============
English sat.split("This is a test This is another test.")
# returns ["This is a test ", "This is another test."]
# do this instead of calling sat.split on every text individually for much better performance
Korean If you need a general sentence segmentation model, use the -sm models (e.g., sat-3l-sm).
==============
English sat.split(["This is a test This is another test.", "And some more texts..."])
# returns an iterator yielding lists of sentences for every text
# use our '-sm' models for general sentence segmentation tasks
Korean For speed-sensitive applications, we recommend the 3-layer models (sat-3l and sat-3l-sm).
==============
English sat_sm = SaT("sat-3l-sm")
sat_sm.half().to("cuda") # optional, see above
Korean They provide a great tradeoff between speed and performance.
==============
English sat_sm.split("this is a test this is another test")
# returns ["this is a test ", "this is another test"]
# use trained lora modules for strong adaptation to language & domain/style
Korean The best models are our 12-layer models: sat-12l and sat-12l-sm.
==============
English sat_adapted = SaT("sat-3l", style_or_domain="ud", language="en")
Korean Model English score Multilingual score
sat-1l 88.5 84.3
sat-1l-sm 88.2 87.9
sat-3l 93.7 89.2
sat-3l-lora 96.7 94.8
sat-3l-sm 96.5 93.5
sat-6l 94.1 89.7
sat-6l-sm 96.9 95.1
sat-9l 94.3 90.3
sat-12l 94.0 90.4
sat-12l-lora 97.3 95.9
sat-12l-sm 97.4 96.0
The scores are the macro-average F1 score across all available datasets for "English", and the macro-average F1 score across all datasets and languages for "Multilingual".
==============
English sat_adapted.half().to("cuda") # optional, see above
sat_adapted.split("This is a test This is another test.")
# returns ['This is a test ', 'This is another test']
Korean "Adapted" means adaptation via LoRA.
==============
English ONNX Support
Korean For details, see the paper.
==============
English 🚀 You can now enable even faster ONNX inference for sat and sat-sm models! 🚀
Korean For comparison, here are the English scores of some other tools:
==============
English sat = SaT("sat-3l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
>>> from wtpsplit import SaT
>>> texts = ["This is a sentence. This is another sentence."] * 1000
# PyTorch GPU
>>> model_pytorch = SaT("sat-3l-sm")
>>> model_pytorch.half().to("cuda");
>>> %timeit list(model_pytorch.split(texts))
#
Korean Model English score
PySBD 69.6
SpaCy (sentencizer; monolingual) 92.9
SpaCy (sentencizer; multilingual) 91.5
Ersatz 91.4
Punkt (nltk.sent_tokenize) 92.2
==============
English 144 ms ± 252 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# quite fast already, but...
# onnxruntime GPU
>>> model_ort = SaT("sat-3l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
>>> %timeit list(model_ort.split(texts))
#
Korean WtP (3l) 93.9
==============
English 94.9 ms ± 165 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#
Korean This library also supports the previous WtP models.
==============
English ...this should be ~50% faster! (tested on RTX 3090)
Korean Like the SaT models, you can use them in essentially the same way.
==============
English If you wish to use LoRA in combination with an ONNX model:
Korean from wtpsplit import WtP
wtp = WtP("wtp-bert-mini")
# similar functionality as for the SaT models
wtp.split("This is a test This is another test.")
For more details on WtP and reproduction details, see the WtP doc.
==============
English Run scripts/export_to_onnx_sat.py with use_lora: True and an appropriate output_dir: <OUTPUT_DIR>.
Korean Paragraph Segmentation
Since SaT models are trained to predict newline probabilities, they can segment text into paragraphs in addition to sentences.
==============
English If you have a local LoRA module, use lora_path.
Korean # returns a list of paragraphs, each containing a list of sentences
==============
English If you wish to load a LoRA module from the HuggingFace hub, use style_or_domain and language.
Korean # adjust the paragraph threshold via the `paragraph_threshold` argument
==============
English Load the ONNX model with merged LoRA weights:
Korean sat.split(text, do_paragraph_segmentation=True)
Adaptation
SaT can be adapted to domains and styles via LoRA.
==============
English sat = SaT(<OUTPUT_DIR>, onnx_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
Korean For sat-3l and sat-12l in 81 languages, we provide LoRA modules trained on the Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speech) sentence styles.
==============
English Available Models
Korean We also provide LoRA modules for legal documents (laws and judgments) in 6 languages, code-switching in 4 language pairs, and tweets in 3 languages.
==============
English If you need a general sentence segmentation model, use -sm models (e.g., sat-3l-sm)
Korean For details, see the paper.
==============
English For speed-sensitive applications, we recommend 3-layer models (sat-3l and sat-3l-sm).
Korean We also provide verse segmentation modules for 16 genres for sat-12-no-limited-lookahead.
==============
English They provide a great tradeoff between speed and performance.
Korean Load the LoRA modules like this:
==============
English The best models are our 12-layer models: sat-12l and sat-12l-sm.
Korean # both lang_code and style_or_domain are required
==============
English Model
Korean # check the <model_repository>/loras folder for available modules
==============
English English Score
Korean sat_lora = SaT("sat-3l", style_or_domain="ud", language="en")
==============
English Multilingual Score
Korean sat_lora.split("Hello this is a test But this is different now Now the next one starts looool")
==============
English sat-1l 88.5 84.3
Korean # now for a very different domain
sat_lora_distinct = SaT("sat-12l", style_or_domain="code-switching", language="es-en")
==============
English sat-1l-sm 88.2 87.9
Korean sat_lora_distinct.split("in the morning over there cada vez que yo decía algo él me decía algo")
You can also freely adjust the split threshold.
==============
English sat-3l 93.7 89.2
Korean A higher threshold leads to more conservative splits.
==============
English sat-3l-lora 96.7 94.8
Korean sat.split("This is a test This is another test.", threshold=0.4)
==============
English sat-3l-sm 96.5 93.5
Korean # works similarly for LoRA, but the thresholds are higher
==============
English sat-6l 94.1 89.7
Korean sat_lora.split("Hello this is a test But this is different now Now the next one starts looool", threshold=0.7)
==============
English sat-6l-sm 96.9 95.1
Korean Advanced Usage
Get the newline or sentence boundary probabilities for a text:
==============
English sat-9l 94.3 90.3
Korean # returns newline probabilities (supports batching!)
==============
English sat-12l 94.0 90.4
Korean sat.predict_proba(text)
==============
English sat-12l-lora 97.3 95.9
Korean Load a SaT model in HuggingFace transformers:
==============
English sat-12l-sm 97.4 96.0
Korean # import the library to register the custom models
==============
English The scores are macro-average F1 score across all available datasets for "English", and macro-average F1 score across all datasets and languages for "Multilingual".
Korean import wtpsplit
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained("segment-any-text/sat-3l-sm") # or some other model name; see https://huggingface.co/segment-any-text
==============
English "adapted" means adapation via LoRA; check out the paper for details.
Korean Adapting to your own corpus via LoRA
Our models can be adapted efficiently and powerfully via LoRA.
==============
English For comparison, here are the English scores of some other tools:
Korean As few as 10-100 segmented training sentences should already improve performance considerably.
==============
English Model English Score
Korean To do so:
Clone the repository and install the requirements:
==============
English PySBD 69.6
Korean git clone https://github.com/segment-any-text/wtpsplit
cd wtpsplit
pip install -r requirements.txt
pip install adapters==0.2.1 --no-dependencies
cd ..
==============
English SpaCy (sentencizer; monolingual) 92.9
Korean Create data in this format:
==============
English SpaCy (sentencizer; multilingual) 91.5
Korean import torch
torch.save(
{
"language_code": {
"sentence": {
"dummy-dataset": {
"meta": {
"train_data": ["train sentence 1", "train sentence 2"],
},
"data": [
"test sentence 1",
"test sentence 2",
]
}
}
}
},
"dummy-dataset.pth"
)
Create or adapt a config; provide the base model via model_name_or_path and the training data .pth via text_path:
configs/lora/lora_dummy_config.json
Train the LoRA module:
python3 wtpsplit/train/train_lora.py configs/lora/lora_dummy_config.json
Once training is done, provide the path to the saved module to SaT:
sat_lora_adapted = SaT("model-used", lora_path="dummy_lora_path")
sat_lora_adapted.split("Some domain-specific or styled text")
Adjust the dataset name, language, and model above as needed.
==============
English Ersatz 91.4
Korean Reproducing the paper
configs/ contains the configs for the paper's experiments for the base and sm models as well as the LoRA modules.
==============
English Punkt (nltk.sent_tokenize) 92.2
Korean For each config, start training like this:
python3 wtpsplit/train/train.py configs/<config_name>.json
python3 wtpsplit/train/train_sm.py configs/<config_name>.json
python3 wtpsplit/train/train_lora.py configs/<config_name>.json
In addition:
wtpsplit/data_acquisition contains the code for obtaining the evaluation data and raw text from the mC4 corpus.
==============
English WtP (3l) 93.9
Korean wtpsplit/evaluation contains code for:
evaluation (i.e., sentence segmentation results) via intrinsic.py.
short-sequence evaluation (i.e., sentence segmentation results for sentence pairs/k-mers) via intrinsic_pairwise.py.
LLM baseline evaluation (llm_sentence.py) and legal baseline evaluation (legal_baselines.py)
baseline evaluation results (PySBD, nltk, etc.) in intrinsic_baselines.py and intrinsic_baselines_multi.py
==============
English Note that this library also supports previous WtP models.
Korean Raw results in JSON format are also in evaluation_results/.
==============
English You can use them in essentially the same way as SaT models:
Korean Statistical significance testing code and results are in stat_tests/.
==============
English from wtpsplit import WtP
Korean punctuation annotation experiments in punct_annotation.py and punct_annotation_wtp.py (WtP only)
extrinsic evaluation on machine translation in extrinsic.py (WtP only)
==============
English wtp = WtP("wtp-bert-mini")
Korean Install the packages from requirements.txt beforehand.
==============
English # similar functionality as for SaT models
Korean Supported Languages
For a table containing the supported languages ...
==============
English wtp.split("This is a test
Korean For details, see the Segment any Text paper.
==============
English This is another test.")
Korean Citations
For the SaT models, please cite the paper:
==============
English For more details on WtP and reproduction details, see the WtP doc.
Korean @article{frohmann2024segment,
title={Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation},
author={Frohmann, Markus and Sterner, Igor and Vuli{\'c}, Ivan and Minixhofer, Benjamin and Schedl, Markus},
journal={arXiv preprint arXiv:2406.16678},
year={2024},
doi={10.48550/arXiv.2406.16678},
url={https://doi.org/10.48550/arXiv.2406.16678},
}
For the library and the WtP models, please cite:
==============
English Paragraph Segmentation
Korean @inproceedings{minixhofer-etal-2023-wheres,
title = "Where{'}s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation",
author = "Minixhofer, Benjamin and
Pfeiffer, Jonas and
Vuli{\'c}, Ivan",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.398",
pages = "7215--7235"
}
==============
English Since SaT models are trained to predict newline probability, they can segment text into paragraphs in addition to sentences.
Korean Acknowledgments
==============
English # returns a list of paragraphs, each containing a list of sentences
Korean This research was funded in whole or in part by the Austrian Science Fund (FWF): P36413, P33526, and DFH-23, and by the State of Upper Austria and the Federal Ministry of Education, Science, and Research, through grant LIT-2021-YOU-215.
==============
English # adjust the paragraph threshold via the `paragraph_threshold` argument.
Korean In addition, Ivan Vulić and Benjamin Minixhofer were supported by a Royal Society University Research Fellowship "Inclusive and Sustainable Language Technology for a Truly Multilingual World" (no. 221137) awarded to Ivan Vulić.
==============
English sat.split(text, do_paragraph_segmentation=True)
Korean This research was also supported by Google's TPU Research Cloud (TRC) through Cloud TPUs.
==============
English Adaptation
Korean This work was also supported by compute credits from a Cohere For AI Research Grant; these grants are designed to support academic partners conducting research with the goal of releasing scientific artifacts and data for good projects.
==============
English SaT can be domain- and style-adapted via LoRA.
Korean We also thank Simone Teufel for the fruitful discussions.
==============
English We provide trained LoRA modules for Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speeches) sentence styles in 81 languages for sat-3l and sat-12l.
Korean If you have any questions, please create an issue or send an email to [email protected], and I will get back to you as soon as possible.
==============
I replaced all the newlines with spaces and it seems to work.
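i.e., roughly this (a minimal sketch, reusing the `korean` and `model_ort` variables from the code above):

# collapse hard-wrapped lines into a single line before splitting
korean_flat = korean.replace("\n", " ")
korean_split = model_ort.split(korean_flat)
print(len(korean_split))  # now yields many sentences instead of one huge chunk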
Thanks for raising this and finding the issue. This seems related to #131.
Hi, could you clarify what you mean by "it seems to work" now? As per #131, we find this rather surprising.
Additionally, I think it would be good to try this on more natural Korean text rather than this semi-translated documentation with many newlines (which is a bit unrealistic, no?). For example, something like the snippet below.
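A quick sketch of what I mean (the Korean sample is a hypothetical test input, chosen only to illustrate natural, punctuation-free text):

from wtpsplit import SaT

sat = SaT("sat-12l-sm")
# two colloquial Korean sentences with no separating punctuation:
# "The weather is really nice today" / "I hear it will rain tomorrow"
sat.split("오늘 날씨가 정말 좋네요 내일은 비가 온다고 합니다")

Inspecting sat.predict_proba(...) on such input (see the Advanced Usage section quoted above) would also show where the boundary probability mass ends up.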
Hi, would you happen to have any update on this @seungduk-yanolja? :)