zemberek-nlp Consider adding extractFromDocument() to TurkishSentenceExtractor

Open ahmetaa opened this issue 8 years ago • 6 comments

Currently there are two methods:

List<String> extract(String paragraph)
List<String> extract(List<String> paragraphs)

Those methods do not treat line breaks as sentence boundaries. A common use case is to extract sentences from a complete document represented as a single String that contains line breaks. So the new method would do the following:

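List<String> extractFromDocument(String document) {
  // One possible way to split on line breaks (needs java.util.Arrays);
  // "\\R+" matches a run of line terminators, so blank lines are skipped.
  List<String> paragraphs = Arrays.asList(document.split("\\R+"));
  return extract(paragraphs);
}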
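To make the intended behavior concrete, here is a minimal self-contained sketch. It does not use Zemberek itself: the extract() methods below are naive stand-ins that only split after '.', '!' or '?', and splitFromLineBreaks() is a hypothetical helper name. The point is only to show that a line without end punctuation (such as a title) stays a separate sentence once line breaks are treated as paragraph endings.

```java
import java.util.ArrayList;
import java.util.List;

class DocumentSplitSketch {

  // Stand-in for extract(String paragraph): naive split after '.', '!' or '?'.
  // This is NOT Zemberek's boundary detection, only a placeholder for the sketch.
  static List<String> extract(String paragraph) {
    List<String> sentences = new ArrayList<>();
    for (String s : paragraph.trim().split("(?<=[.!?])\\s+")) {
      if (!s.isEmpty()) {
        sentences.add(s);
      }
    }
    return sentences;
  }

  // Stand-in for extract(List<String> paragraphs): collects sentences paragraph by paragraph.
  static List<String> extract(List<String> paragraphs) {
    List<String> result = new ArrayList<>();
    for (String p : paragraphs) {
      result.addAll(extract(p));
    }
    return result;
  }

  // Hypothetical helper: every non-blank line becomes one paragraph.
  static List<String> splitFromLineBreaks(String document) {
    List<String> paragraphs = new ArrayList<>();
    for (String line : document.split("\\R")) {
      String trimmed = line.trim();
      if (!trimmed.isEmpty()) {
        paragraphs.add(trimmed);
      }
    }
    return paragraphs;
  }

  // The proposed method: line breaks end paragraphs, then sentences are extracted per paragraph.
  static List<String> extractFromDocument(String document) {
    return extract(splitFromLineBreaks(document));
  }

  public static void main(String[] args) {
    String doc = "Haber\nAli okula gitti. Sonra eve geldi.\n\nBu bir deneme.";
    // prints [Haber, Ali okula gitti., Sonra eve geldi., Bu bir deneme.]
    System.out.println(extractFromDocument(doc));
  }
}
```

With the same input, the stand-in extract(String) alone would return "Haber\nAli okula gitti." as a single sentence, which is the situation the proposal is meant to avoid.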

Also, the other methods could be renamed to extractFromParagraph() if this is added.

@mdakin wdyt?

Feb 05 '17 11:02 ahmetaa

Maybe for now only providing this would be enough?

List<String> paragraphs = split document from line breaks;

Feb 05 '17 12:02 mdakin

Ok, like this?

List<String> splitFromLineBreaks(String foo)

in the same class, or in a utility class?

Feb 05 '17 12:02 ahmetaa

So the usage will be like:

extractor = ...;
for (String paragraph : extractor.splitFromLineBreaks(doc)) {
   extractor.extract(paragraph);
}

or directly

 extractor.extract(extractor.splitFromLineBreaks(doc))

So I am not sure now. Maybe your initial suggestion was not bad: adding a new method such as extractFromDocument, whose name makes clear that it treats line breaks as paragraph endings. Your call bro.

Feb 05 '17 12:02 mdakin

Thanks. This is not a pressing issue anyway. I will make one of these and see if it sticks.

Feb 05 '17 12:02 ahmetaa

There is a possibility of using objects like Document, Paragraph, etc., but that is another issue. So your initial suggestion is fine, I think. Maybe having separate method names like extractWords, extractSentences and extractParagraphs could be better than overloading.

Feb 06 '17 09:02 mdakin

Agreed, soon there will be a need for such structures. But when they come, overloading those methods may suffice. I will go with my initial suggestion then.

Feb 06 '17 10:02 ahmetaa
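For readers wondering how overloading could suffice once such structures exist, here is a rough sketch. Paragraph and Document are hypothetical wrapper types invented for illustration (nothing in this thread defines them), and the sentence splitting is a naive stand-in rather than Zemberek's logic. Because the two inputs have different types, the same extract name can treat line breaks differently without ambiguity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TypedInputSketch {

  // Hypothetical wrapper: text whose line breaks carry no meaning.
  static final class Paragraph {
    final String text;

    Paragraph(String text) {
      this.text = text;
    }
  }

  // Hypothetical wrapper: text whose line breaks end paragraphs.
  static final class Document {
    final String text;

    Document(String text) {
      this.text = text;
    }

    // Every non-blank line becomes one Paragraph.
    List<Paragraph> paragraphs() {
      List<Paragraph> result = new ArrayList<>();
      for (String line : text.split("\\R")) {
        if (!line.trim().isEmpty()) {
          result.add(new Paragraph(line.trim()));
        }
      }
      return result;
    }
  }

  // Naive stand-in for sentence boundary detection: split after '.', '!' or '?'.
  static List<String> extract(Paragraph paragraph) {
    return Arrays.asList(paragraph.text.split("(?<=[.!?])\\s+"));
  }

  // Overload selected by argument type: splits into paragraphs first.
  static List<String> extract(Document document) {
    List<String> sentences = new ArrayList<>();
    for (Paragraph p : document.paragraphs()) {
      sentences.addAll(extract(p));
    }
    return sentences;
  }
}
```

In this sketch, extract(new Paragraph("Haber\nAli geldi.")) yields one sentence, while extract(new Document("Haber\nAli geldi.")) yields two, so the caller states the intent through the type instead of the method name.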
