Korean text is not split well
Hello,
First of all, thank you for the great work! I was excited to try out this powerful text segmentation model, so I tested it with both an English text and a translated Korean text. However, I encountered an issue where a large chunk of the Korean text was considered a single sentence. I tried another sample, but once again, the entire text was returned as a single sentence. Could you please help me figure out what I might be doing wrong?
Thank you in advance.
Code
from wtpsplit import SaT

# onnxruntime GPU
model_ort = SaT("sat-12l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
# warm-up sanity check (IPython)
%timeit list(model_ort.split("This is a test This is another test."))

# `text` and `korean` (defined elsewhere) hold the English README and its Korean translation
english_split = model_ort.split(text)
korean_split = model_ort.split(korean)
print(len(english_split))
print(len(korean_split))
for en, ko in zip(english_split, korean_split):
    print("English", en)
    print("Korean", ko)
    print("==============")
Result (Korean segments shown in English translation)
English wtpsplit🪓
Segment any Text - Robustly, Efficiently, Adaptably⚡
Korean wtpsplit🪓
Text segmentation - robust, efficient, and adaptable⚡
==============
English This repository allows you to segment text into sentences or other semantic units.
Korean This repository allows you to segment text into sentences or other semantic units.
==============
English It implements the models from:
Korean It implements the following models:
==============
English SaT – Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation by Markus Frohmann, Igor Sterner, Benjamin Minixhofer, Ivan Vulić and Markus Schedl (state-of-the-art, encouraged).
Korean SaT – Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation, by Markus Frohmann, Igor Sterner, Benjamin Minixhofer, Ivan Vulić and Markus Schedl (state-of-the-art, recommended).
==============
English WtP – Where's the Point?
Korean WtP – Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation, by Benjamin Minixhofer, Jonas Pfeiffer and Ivan Vulić (previous version, maintained for reproducibility).
==============
English Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation by Benjamin Minixhofer, Jonas Pfeiffer and Ivan Vulić (previous version, maintained for reproducibility).
Korean The name WtP is kept for consistency.
==============
English The namesake WtP is maintained for consistency.
Korean Our new follow-up SaT provides robust, efficient, and adaptable sentence segmentation across 85 languages at higher performance and lower compute cost. Check out the state-of-the-art results on 8 distinct corpora and 85 languages demonstrated in the Segment any Text paper.
==============
English Our new followup SaT provides robust, efficient and adaptable sentence segmentation across 85 languages at higher performance and less compute cost.
Korean System Figure
Installation
pip install wtpsplit
Usage
from wtpsplit import SaT
sat = SaT("sat-3l")
# optionally run on GPU for better performance
# also supports TPUs via e.g. sat.to("xla:0"); in that case pass `pad_last_batch=True` to sat.split
sat.half().to("cuda")
sat.split("This is a test This is another test.")
# returns ["This is a test ", "This is another test."]
# do this instead of calling sat.split on every text individually for much better performance
sat.split(["This is a test This is another test.", "And some more texts..."])
# returns an iterator yielding a list of sentences for every text
# use the '-sm' models for general sentence segmentation tasks
sat_sm = SaT("sat-3l-sm")
sat_sm.half().to("cuda") # optional, see above
sat_sm.split("this is a test this is another test")
# returns ["this is a test ", "this is another test"]
# use trained LoRA modules for strong adaptation to language & domain/style
sat_adapted = SaT("sat-3l", style_or_domain="ud", language="en")
sat_adapted.half().to("cuda") # optional, see above
sat_adapted.split("This is a test This is another test.")
# returns ['This is a test ', 'This is another test']
ONNX Support
🚀 You can now enable even faster ONNX inference for sat and sat-sm models! 🚀
sat = SaT("sat-3l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
>>> from wtpsplit import SaT
>>> texts = ["This is a sentence. This is another sentence."] * 1000
# PyTorch GPU
>>> model_pytorch = SaT("sat-3l-sm")
>>> model_pytorch.half().to("cuda");
>>> %timeit list(model_pytorch.split(texts))
# 144 ms ± 252 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# quite fast already, but...
# onnxruntime GPU
>>> model_ort = SaT("sat-3l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
>>> %timeit list(model_ort.split(texts))
# 94.9 ms ± 165 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#
==============
English Check out the state-of-the-art results in 8 distinct corpora and 85 languages demonstrated in our Segment any Text paper.
Korean ...this should be ~50% faster! (tested on RTX 3090)
==============
English System Figure
Korean To use LoRA in combination with an ONNX model:
Run scripts/export_to_onnx_sat.py with use_lora: True and an appropriate output_dir: <OUTPUT_DIR>.
==============
English Installation
Korean If you have a local LoRA module, use lora_path.
==============
English pip install wtpsplit
Korean To load a LoRA module from the HuggingFace hub, use style_or_domain and language.
==============
English Usage
from wtpsplit import SaT
Korean Load the ONNX model with merged LoRA weights: sat = SaT(<OUTPUT_DIR>, onnx_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
==============
English sat = SaT("sat-3l")
# optionally run on GPU for better performance
# also supports TPUs via e.g. sat.to("xla:0"), in that case pass `pad_last_batch=True` to sat.split
sat.half().to("cuda")
Korean Available Models
==============
English sat.split("This is a test This is another test.")
# returns ["This is a test ", "This is another test."]
# do this instead of calling sat.split on every text individually for much better performance
Korean If you need a general sentence segmentation model, use the -sm models (e.g., sat-3l-sm).
==============
English sat.split(["This is a test This is another test.", "And some more texts..."])
# returns an iterator yielding lists of sentences for every text
# use our '-sm' models for general sentence segmentation tasks
Korean For speed-sensitive applications, we recommend the 3-layer models (sat-3l and sat-3l-sm).
==============
English sat_sm = SaT("sat-3l-sm")
sat_sm.half().to("cuda") # optional, see above
Korean They provide a great tradeoff between speed and performance.
==============
English sat_sm.split("this is a test this is another test")
# returns ["this is a test ", "this is another test"]
# use trained lora modules for strong adaptation to language & domain/style
Korean The best models are our 12-layer models: sat-12l and sat-12l-sm.
==============
English sat_adapted = SaT("sat-3l", style_or_domain="ud", language="en")
Korean Model English score Multilingual score
sat-1l 88.5 84.3
sat-1l-sm 88.2 87.9
sat-3l 93.7 89.2
sat-3l-lora 96.7 94.8
sat-3l-sm 96.5 93.5
sat-6l 94.1 89.7
sat-6l-sm 96.9 95.1
sat-9l 94.3 90.3
sat-12l 94.0 90.4
sat-12l-lora 97.3 95.9
sat-12l-sm 97.4 96.0
The scores are the macro-average F1 score across all available datasets for "English", and the macro-average F1 score across all datasets and languages for "Multilingual".
==============
English sat_adapted.half().to("cuda") # optional, see above
sat_adapted.split("This is a test This is another test.")
# returns ['This is a test ', 'This is another test']
Korean "Adapted" means adaptation via LoRA.
==============
English ONNX Support
Korean For details, see the paper.
==============
English 🚀 You can now enable even faster ONNX inference for sat and sat-sm models! 🚀
Korean For comparison, here are the English scores of some other tools:
==============
English sat = SaT("sat-3l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
>>> from wtpsplit import SaT
>>> texts = ["This is a sentence. This is another sentence."] * 1000
# PyTorch GPU
>>> model_pytorch = SaT("sat-3l-sm")
>>> model_pytorch.half().to("cuda");
>>> %timeit list(model_pytorch.split(texts))
#
Korean Model English score
PySBD 69.6
SpaCy (sentencizer; monolingual) 92.9
SpaCy (sentencizer; multilingual) 91.5
Ersatz 91.4
Punkt (nltk.sent_tokenize) 92.2
==============
English 144 ms ± 252 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# quite fast already, but...
# onnxruntime GPU
>>> model_ort = SaT("sat-3l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
>>> %timeit list(model_ort.split(texts))
#
Korean WtP (3l) 93.9
==============
English 94.9 ms ± 165 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#
Korean This library also supports the previous WtP models.
==============
English ...this should be ~50% faster! (tested on RTX 3090)
Korean Like the SaT models, you can use them in essentially the same way.
==============
English If you wish to use LoRA in combination with an ONNX model:
Korean from wtpsplit import WtP
wtp = WtP("wtp-bert-mini")
# similar functionality as for the SaT models
wtp.split("This is a test This is another test.")
For more details on WtP and reproduction details, see the WtP doc.
==============
English Run scripts/export_to_onnx_sat.py with use_lora: True and an appropriate output_dir: <OUTPUT_DIR>.
Korean Paragraph Segmentation
Since SaT models are trained to predict newline probabilities, they can segment text into paragraphs in addition to sentences.
==============
English If you have a local LoRA module, use lora_path.
Korean # returns a list of paragraphs, each containing a list of sentences
==============
English If you wish to load a LoRA module from the HuggingFace hub, use style_or_domain and language.
Korean # adjust the paragraph threshold via the `paragraph_threshold` argument
==============
English Load the ONNX model with merged LoRA weights:
Korean sat.split(text, do_paragraph_segmentation=True)
Adaptation
SaT can be adapted to domains and styles via LoRA.
==============
English sat = SaT(<OUTPUT_DIR>, onnx_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
Korean For sat-3l and sat-12l in 81 languages, we provide LoRA modules trained on the Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speech) sentence styles.
==============
English Available Models
Korean We also provide LoRA modules for legal documents (laws and judgments) in 6 languages, code-switching in 4 language pairs, and tweets in 3 languages.
==============
English If you need a general sentence segmentation model, use -sm models (e.g., sat-3l-sm)
Korean For details, see the paper.
==============
English For speed-sensitive applications, we recommend 3-layer models (sat-3l and sat-3l-sm).
Korean We also provide verse segmentation modules for 16 genres for sat-12-no-limited-lookahead.
==============
English They provide a great tradeoff between speed and performance.
Korean Load the LoRA modules like this:
==============
English The best models are our 12-layer models: sat-12l and sat-12l-sm.
Korean # both lang_code and style_or_domain are required
==============
English Model
Korean # check the <model_repository>/loras folder for available modules
==============
English English Score
Korean sat_lora = SaT("sat-3l", style_or_domain="ud", language="en")
==============
English Multilingual Score
Korean sat_lora.split("Hello this is a test But this is different now Now the next one starts looool")
==============
English sat-1l 88.5 84.3
Korean # now for a very different domain
sat_lora_distinct = SaT("sat-12l", style_or_domain="code-switching", language="es-en")
==============
English sat-1l-sm 88.2 87.9
Korean sat_lora_distinct.split("in the morning over there cada vez que yo decía algo él me decía algo")
You can also freely adjust the split threshold.
==============
English sat-3l 93.7 89.2
Korean A higher threshold leads to more conservative splits.
==============
English sat-3l-lora 96.7 94.8
Korean sat.split("This is a test This is another test.", threshold=0.4)
==============
English sat-3l-sm 96.5 93.5
Korean # works similarly for LoRA, but the thresholds are higher
==============
English sat-6l 94.1 89.7
Korean sat_lora.split("Hello this is a test But this is different now Now the next one starts looool", threshold=0.7)
==============
English sat-6l-sm 96.9 95.1
Korean Advanced Usage
Get the newline or sentence boundary probabilities for a text:
==============
English sat-9l 94.3 90.3
Korean # returns newline probabilities (supports batching!)
==============
English sat-12l 94.0 90.4
Korean sat.predict_proba(text)
==============
English sat-12l-lora 97.3 95.9
Korean Load a SaT model in HuggingFace transformers:
==============
English sat-12l-sm 97.4 96.0
Korean # import the library to register the custom models
==============
English The scores are macro-average F1 score across all available datasets for "English", and macro-average F1 score across all datasets and languages for "Multilingual".
Korean import wtpsplit
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained("segment-any-text/sat-3l-sm") # or some other model name; see https://huggingface.co/segment-any-text
==============
English "adapted" means adapation via LoRA; check out the paper for details.
Korean Adapting to your own corpus via LoRA
Our models can be adapted efficiently and powerfully via LoRA.
==============
English For comparison, here are the English scores of some other tools:
Korean As few as 10-100 segmented training sentences should already improve performance considerably.
==============
English Model English Score
Korean To do so:
Clone the repository and install the requirements:
==============
English PySBD 69.6
Korean git clone https://github.com/segment-any-text/wtpsplit
cd wtpsplit
pip install -r requirements.txt
pip install adapters==0.2.1 --no-dependencies
cd ..
==============
English SpaCy (sentencizer; monolingual) 92.9
Korean Create data in this format:
==============
English SpaCy (sentencizer; multilingual) 91.5
Korean import torch
torch.save(
{
"language_code": {
"sentence": {
"dummy-dataset": {
"meta": {
"train_data": ["train sentence 1", "train sentence 2"],
},
"data": [
"test sentence 1",
"test sentence 2",
]
}
}
}
},
"dummy-dataset.pth"
)
Create or adapt a config; provide the base model via model_name_or_path and the training data .pth via text_path:
configs/lora/lora_dummy_config.json
Train the LoRA module:
python3 wtpsplit/train/train_lora.py configs/lora/lora_dummy_config.json
Once training is done, provide the path to the saved module to SaT:
sat_lora_adapted = SaT("model-used", lora_path="dummy_lora_path")
sat_lora_adapted.split("Some domain-specific or styled text")
Adjust the dataset name, language, and model above as needed.
==============
English Ersatz 91.4
Korean Reproducing the paper
configs/ contains the configs for the paper's experiments for the base and sm models as well as the LoRA modules.
==============
English Punkt (nltk.sent_tokenize) 92.2
Korean For each config, start training like this:
python3 wtpsplit/train/train.py configs/<config_name>.json
python3 wtpsplit/train/train_sm.py configs/<config_name>.json
python3 wtpsplit/train/train_lora.py configs/<config_name>.json
In addition:
wtpsplit/data_acquisition contains the code for obtaining the evaluation data and raw text from the mC4 corpus.
==============
English WtP (3l) 93.9
Korean wtpsplit/evaluation contains code for:
evaluation (i.e., sentence segmentation results) via intrinsic.py.
short-sequence evaluation (i.e., sentence segmentation results for sentence pairs/k-mers) via intrinsic_pairwise.py.
LLM baseline evaluation (llm_sentence.py) and legal baseline evaluation (legal_baselines.py)
baseline evaluation results (PySBD, nltk, etc.) in intrinsic_baselines.py and intrinsic_baselines_multi.py
==============
English Note that this library also supports previous WtP models.
Korean Raw results in JSON format are also in evaluation_results/.
==============
English You can use them in essentially the same way as SaT models:
Korean Statistical significance testing code and results are in stat_tests/.
==============
English from wtpsplit import WtP
Korean punctuation annotation experiments in punct_annotation.py and punct_annotation_wtp.py (WtP only)
extrinsic evaluation on machine translation in extrinsic.py (WtP only)
==============
English wtp = WtP("wtp-bert-mini")
Korean Install the packages from requirements.txt beforehand.
==============
English # similar functionality as for SaT models
Korean Supported Languages
For a table containing the supported languages ...
==============
English wtp.split("This is a test
Korean For details, see the Segment any Text paper.
==============
English This is another test.")
Korean Citations
For the SaT models, please cite the paper:
==============
English For more details on WtP and reproduction details, see the WtP doc.
Korean @article{frohmann2024segment,
title={Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation},
author={Frohmann, Markus and Sterner, Igor and Vuli{\'c}, Ivan and Minixhofer, Benjamin and Schedl, Markus},
journal={arXiv preprint arXiv:2406.16678},
year={2024},
doi={10.48550/arXiv.2406.16678},
url={https://doi.org/10.48550/arXiv.2406.16678},
}
For the library and the WtP models, please cite:
==============
English Paragraph Segmentation
Korean @inproceedings{minixhofer-etal-2023-wheres,
title = "Where{'}s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation",
author = "Minixhofer, Benjamin and
Pfeiffer, Jonas and
Vuli{\'c}, Ivan",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.398",
pages = "7215--7235"
}
==============
English Since SaT models are trained to predict newline probability, they can segment text into paragraphs in addition to sentences.
Korean Acknowledgments
==============
English # returns a list of paragraphs, each containing a list of sentences
Korean This research was funded in whole or in part by the Austrian Science Fund (FWF): P36413, P33526, and DFH-23, and by the State of Upper Austria and the Federal Ministry of Education, Science, and Research, through grant LIT-2021-YOU-215.
==============
English # adjust the paragraph threshold via the `paragraph_threshold` argument.
Korean In addition, Ivan Vulić and Benjamin Minixhofer were supported by a Royal Society University Research Fellowship "Inclusive and Sustainable Language Technology for a Truly Multilingual World" (no. 221137) awarded to Ivan Vulić.
==============
English sat.split(text, do_paragraph_segmentation=True)
Korean This research was also supported by Google's TPU Research Cloud (TRC) through Cloud TPUs.
==============
English Adaptation
Korean This work was also supported by compute credits from a Cohere For AI Research Grant; these grants are designed to support academic partners conducting research with the goal of releasing scientific artifacts and data for good projects.
==============
English SaT can be domain- and style-adapted via LoRA.
Korean We also thank Simone Teufel for the fruitful discussions.
==============
English We provide trained LoRA modules for Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speeches) sentence styles in 81 languages for sat-3l and sat-12l.
Korean If you have any questions, please create an issue or send an email to [email protected], and I will get back to you as soon as possible.
==============
I replaced all the newlines with spaces and it seems to work.
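i.e., roughly this (a minimal sketch, reusing the `korean` and `model_ort` variables from the code above):

# collapse hard-wrapped lines into a single line before splitting
korean_flat = korean.replace("\n", " ")
korean_split = model_ort.split(korean_flat)
print(len(korean_split))  # now yields many sentences instead of one huge chunk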
Thanks for raising this and finding the issue. This seems related to #131.
Hi, could you clarify what you mean by "it seems to work" now? As per #131, we find this rather surprising.
Additionally, I think it would be good to try this on more natural Korean text rather than this semi-translated documentation with many newlines (which is a bit unrealistic, no?). For example, something like the snippet below.
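A quick sketch of what I mean (the Korean sample is a hypothetical test input, chosen only to illustrate natural, punctuation-free text):

from wtpsplit import SaT

sat = SaT("sat-12l-sm")
# two colloquial Korean sentences with no separating punctuation:
# "The weather is really nice today" / "I hear it will rain tomorrow"
sat.split("오늘 날씨가 정말 좋네요 내일은 비가 온다고 합니다")

Inspecting sat.predict_proba(...) on such input (see the Advanced Usage section quoted above) would also show where the boundary probability mass ends up.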
Hi, would you happen to have any update on this @seungduk-yanolja? :)