zemberek-nlp
Consider adding extractFromDocument() to TurkishSentenceExtractor
Currently there are two methods:
List<String> extract(String paragraph)
List<String> extract(List<String> paragraphs)
Those methods do not treat line breaks as sentence boundaries. A common use case is extracting sentences from a complete document that is represented as a single String and contains line breaks. The new method would do the following:
List<String> extractFromDocument(String document) {
List<String> paragraphs = split document from line breaks;
return extract(paragraphs);
}
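A minimal sketch of how that pseudocode could look in plain Java. It does not touch the real TurkishSentenceExtractor: the class name DocumentExtractorSketch is hypothetical, and the existing extract(List<String>) method is represented by a Function passed to the constructor. Trimming lines and dropping blank ones are assumptions, not something decided in this thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical wrapper; stands in for the class that would own extractFromDocument().
final class DocumentExtractorSketch {

    // Stand-in for the existing extract(List<String> paragraphs) method.
    private final Function<List<String>, List<String>> paragraphExtractor;

    DocumentExtractorSketch(Function<List<String>, List<String>> paragraphExtractor) {
        this.paragraphExtractor = paragraphExtractor;
    }

    // Treats every line break as a paragraph boundary, then delegates.
    List<String> extractFromDocument(String document) {
        List<String> paragraphs = Arrays.stream(document.split("\\R")) // \R matches \n, \r\n, \r, ...
                .map(String::trim)
                .filter(line -> !line.isEmpty()) // assumption: blank lines are not paragraphs
                .collect(Collectors.toList());
        return paragraphExtractor.apply(paragraphs);
    }
}
```

With the real class, the constructor argument would presumably be something like extractor::extract.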
Also, if this is added, the existing single-String method could be renamed to extractFromParagraph().
@mdakin wdyt?
Maybe for now providing only this would be enough?
List<String> paragraphs = split document from line breaks;
Ok, like this?
List<String> splitFromLineBreaks(String foo)
in the same class or in a utility class..
So the usage will be like:
extractor = ...;
for (String paragraph : extractor.splitFromLineBreaks(doc)) {
extractor.extract(paragraph);
}
or directly
extractor.extract(extractor.splitFromLineBreaks(doc))
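If the helper ends up in a utility class rather than in the extractor, it might look roughly like the sketch below. The class name ParagraphSplitter is hypothetical, and skipping blank lines is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical utility class; only the method name comes from this thread.
final class ParagraphSplitter {

    private ParagraphSplitter() {}

    // Returns the non-blank lines of the document, trimmed, in their original order.
    static List<String> splitFromLineBreaks(String document) {
        List<String> paragraphs = new ArrayList<>();
        for (String line : document.split("\\R")) { // \R covers \n, \r\n and \r
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }
}
```

The direct form would then read extractor.extract(ParagraphSplitter.splitFromLineBreaks(doc)).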
So I am not sure now. Maybe your initial suggestion was not bad: add a new method like extractFromDocument and document that it treats line breaks as paragraph endings. Your call bro.
Thanks. This is not a pressing issue anyway. I will implement one of these and see if it sticks.
There is a possibility of using objects like Document, Paragraph etc., but that is another issue. So your initial suggestion is fine I think. Maybe having separate method names like extractWords, extractSentences and extractParagraphs could be better than overloading.
Agreed, soon there will be a need for such structures. But when they come, overloading those methods may suffice. I will go with my initial suggestion then.