How to just split the sentences?

Open sambaPython24 opened this issue 7 months ago • 3 comments

Is there a way to split text into sentences only, like NLTK's sent_tokenize function (from nltk.tokenize import sent_tokenize)?

sambaPython24, May 12 '25 14:05

The sentence splitter operates on tokenized input, so splitting sentences without first tokenizing the text is not possible.

However, there are two ways to extract untokenized sentences from SoMaJo's output. You could either detokenize the output or you could use the character offset information to access the character span in the input.

For the first option, detokenizing SoMaJo's output, see the suggested solution in https://github.com/tsproisl/SoMaJo/issues/17#issuecomment-653398494.
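
As a rough illustration of the detokenizing idea (a minimal sketch, not necessarily the solution from the linked comment): it assumes each token exposes `text`, `original_spelling` and `space_after`, as SoMaJo's token objects do. `FakeToken` is a hypothetical stand-in, not part of SoMaJo, so that the snippet runs without the library installed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeToken:
    """Hypothetical stand-in for a SoMaJo token (not part of SoMaJo)."""
    text: str
    space_after: bool = True
    original_spelling: Optional[str] = None


def detokenize(tokens):
    """Rebuild a sentence string from a list of token-like objects."""
    parts = []
    for token in tokens:
        # Prefer the original spelling if the tokenizer normalized the token.
        if token.original_spelling is not None:
            parts.append(token.original_spelling)
        else:
            parts.append(token.text)
        # Re-insert whitespace only where the input had it.
        if token.space_after:
            parts.append(" ")
    return "".join(parts).strip()


sentence = [
    FakeToken("Lust"),
    FakeToken("auf"),
    FakeToken("Film", space_after=False),
    FakeToken("?", space_after=False),
    FakeToken(";-)", space_after=False),
]
print(detokenize(sentence))  # Lust auf Film?;-)
```

With real SoMaJo output, each sentence yielded by the tokenizer could be passed to `detokenize` in place of the hand-built list.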

For the second option, accessing the corresponding character span in the input, something like this might suit your needs:

import io

from somajo import SoMaJo


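def extract_raw_sentence(tokens, raw_text):
    """Return the span of raw_text covered by the tokens of one sentence."""
    # character_offset is a (start, end) pair of indices into the raw input
    start = tokens[0].character_offset[0]
    end = tokens[-1].character_offset[1]
    return raw_text[start:end]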


pseudofile = io.StringIO(
    "der beste Betreuer?\n"
    "-- ProfSmith! : )\n"
    "\n"
    "Was machst du morgen Abend?! Lust auf Film?;-)"
)

tokenizer = SoMaJo("de_CMC", character_offsets=True)
# Keep a copy of the raw text, then rewind so the tokenizer can read the file
raw_text = pseudofile.read()
pseudofile.seek(0)

sentences = tokenizer.tokenize_text_file(pseudofile, paragraph_separator="empty_lines")
for sentence in sentences:
    print(extract_raw_sentence(sentence, raw_text))

This produces the following output:

der beste Betreuer?
-- ProfSmith! : )
Was machst du morgen Abend?!
Lust auf Film?;-)

Note that the second option will be slower, because the alignment algorithm that computes the character offsets adds overhead.

tsproisl, May 13 '25 10:05

Thank you very much for both answers! How can I customize the abbreviations, such as No. or Nr., that should not trigger a sentence split? In which file can they be found?

Or in other words: How do you decide where to split a sentence?

sambaPython24, May 13 '25 11:05

Sorry for the delayed response. Abbreviations are defined in src/somajo/data:

  • abbreviations_(de|en).txt: Abbreviations that are not matched by (?:[[:alpha:]]\.){2,}, i.e. that are not sequences of single letters, each followed by a single dot.
  • eos_abbreviations.txt: Abbreviations that frequently occur at the end of a sentence. If such an abbreviation is followed by a potential sentence start, e.g. by a capital letter, it will be interpreted as the end of a sentence.
  • single_token_abbreviations_(de|en).txt: Multi-dot abbreviations that represent single tokens and should not be split.
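
To get a feel for which abbreviations the pattern above already covers, here is a small sketch using only Python's standard library. The `re` module does not support `[[:alpha:]]`, so `[^\W\d_]` is used as an approximation, and treating the pattern as a whole-token match is an assumption made for illustration, not a description of SoMaJo's internals. The helper name is hypothetical.

```python
import re

# Approximation of (?:[[:alpha:]]\.){2,} with the standard library:
# [^\W\d_] matches a single Unicode letter, standing in for [[:alpha:]].
LETTER_DOT_SEQUENCE = re.compile(r"(?:[^\W\d_]\.){2,}")


def is_letter_dot_sequence(abbreviation):
    """True if the string consists of two or more 'letter + dot' pairs."""
    return LETTER_DOT_SEQUENCE.fullmatch(abbreviation) is not None


for candidate in ["z.B.", "e.g.", "Nr.", "No.", "usw."]:
    print(candidate, is_letter_dot_sequence(candidate))
```

Under these assumptions, z.B. and e.g. fit the pattern, while Nr., No. and usw. do not, so abbreviations of the latter kind are the ones that would have to be listed in abbreviations_(de|en).txt.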

tsproisl, May 19 '25 08:05