Need API to find reasonable split for large blob of text
If we take, say, the entire text of Crime and Punishment https://www.gutenberg.org/files/2554/2554-0.txt then it shouldn't be sent to the app all at once as one big string. So ideally we'd impose some sort of size limit (like a maxi-batch) and break it into pieces. Currently the API provides no way for the consumer to know where a good split point is without translating everything. We should provide a facility to find a sentence split point in the vicinity of an offset in the text, so that the client can limit incoming blob size. cc @ugermann
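For concreteness, here is the kind of client-side loop I have in mind, built on a hypothetical `find_split_point_near()` that does not exist in the API today (the name, signature, and size limit below are all made up for illustration):

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical facility: return a sentence boundary at or before max_bytes
// into `text`, or 0 if none is found in that window. Not part of the current API.
std::size_t find_split_point_near(std::string_view text, std::size_t max_bytes);

// What the consumer would do with it: cut the blob into bounded-size pieces
// that never end mid-sentence, and feed them to the translator one by one.
void feed_in_pieces(std::string_view blob) {
  constexpr std::size_t kMaxPiece = 1 << 20;  // ~1 MB per request, for example
  std::size_t begin = 0;
  while (begin < blob.size()) {
    std::size_t remaining = blob.size() - begin;
    std::size_t cut = remaining;
    if (remaining > kMaxPiece) {
      std::size_t split = find_split_point_near(blob.substr(begin), kMaxPiece);
      cut = (split > 0) ? split : kMaxPiece;  // hard cut only as a last resort
    }
    // translate(blob.substr(begin, cut));  // stand-in for the actual request
    begin += cut;
  }
}
```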
That's more or less why I implemented sentence splitting as a stream. You submit a blob of text and get back one sentence at a time. You can batch them off at your leisure once you have enough. If you use splitting mode one_paragraph_per_line or wrapped_text, you'll get an empty string_view at each paragraph boundary (the difference being that in one_paragraph_per_line, EOL counts as a paragraph boundary, whereas in wrapped_text, an empty line counts as a paragraph boundary and non-empty lines are concatenated into a paragraph). So you can either send off the text one paragraph at a time, and/or count the sentences as they come in and make your own decision as to when enough is enough (in the case of giant blobs of text without paragraph boundaries).
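Roughly like this on the consumer side (a sketch from memory: the class names, constructor arguments, and the `operator>>` usage are assumptions, and the include is omitted, so check the ssplit-cpp headers for the exact API):

```cpp
// Rough consumer-side sketch; ssplit-cpp types assumed, header include omitted.
#include <cstddef>
#include <string_view>
#include <vector>

void batch_and_translate(std::string_view blob) {
  ug::ssplit::SentenceSplitter splitter("nonbreaking_prefixes.en");  // prefix file: example path
  ug::ssplit::SentenceStream stream(
      blob, splitter, ug::ssplit::SentenceStream::splitmode::wrapped_text);

  std::vector<std::string_view> batch;
  std::size_t batch_bytes = 0;
  std::string_view snt;
  while (stream >> snt) {
    if (snt.empty()) {                 // empty view marks a paragraph boundary
      // flush(batch); batch.clear(); batch_bytes = 0;  // e.g. one paragraph per request
      continue;
    }
    batch.push_back(snt);
    batch_bytes += snt.size();
    if (batch_bytes >= (1u << 20)) {   // or decide "enough is enough" by size
      // flush(batch); batch.clear(); batch_bytes = 0;
    }
  }
  // flush(batch);                      // whatever is left at the end
}
```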
We could push that logic into the translator by way of returning a multi-part response, sending back the response in pieces if necessary. In terms of sentence splitting, the mechanics are there; you'll just have to let go of the idea of one giant string in, one giant string out. My original suggestion of returning a sequence of text chunks (sentences, paragraphs, whatever) for a single input was based on that idea, but it was immediately shot down at the time.
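Purely as an illustration of the shape (not an interface the translator exposes today), a multi-part response could look something like this:

```cpp
#include <functional>
#include <string_view>

// Hypothetical sketch only: instead of one giant string out, the translator
// would invoke a callback once per translated piece (sentence, paragraph,
// whatever), with a flag marking the last piece of the request.
using ChunkHandler = std::function<void(std::string_view translated_piece, bool last)>;

// Hypothetical entry point; the current API returns a single response object.
void translate_multipart(std::string_view input, ChunkHandler on_piece);
```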
See also: ssplit_stream() here: https://github.com/ugermann/ssplit-cpp/blob/master/src/command/ssplit_main.cpp. Replace ssplit_chunk() by translate_chunk().
Every case I see in https://github.com/ugermann/ssplit-cpp/blob/master/src/command/ssplit_main.cpp presumes the entire text to split is already in RAM or memory mapped. With that assumption, we've already lost the battle for bounded memory consumption. The browser shouldn't even be shipping us the entire document. (And your app shouldn't be reading unbounded input lines.)
So what I'm looking for is a means to say "here's the first 1 MB of text, tell me where the last sentence split boundary is, don't bother translating after that boundary, and the next request will start from there." This could be as simple as an API to efficiently find the LAST sentence boundary in the chunk of text. (And since we're capping sentence length, there's guaranteed to be one within a bounded distance of the end of the chunk.)
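A crude sketch of what I mean (a real implementation would reuse ssplit's boundary rules rather than this naive punctuation-plus-whitespace scan):

```cpp
#include <cstddef>
#include <string_view>

// Naive sketch of the requested call: given a chunk (e.g. the first 1 MB),
// return the offset just past the LAST sentence boundary, or 0 if none found.
// A real implementation would reuse ssplit's boundary detection instead of
// this crude "terminator followed by whitespace" scan.
std::size_t last_sentence_boundary(std::string_view chunk) {
  for (std::size_t i = chunk.size(); i-- > 1; ) {
    char prev = chunk[i - 1];
    bool terminator = (prev == '.' || prev == '!' || prev == '?');
    bool space = (chunk[i] == ' ' || chunk[i] == '\n' || chunk[i] == '\t');
    if (terminator && space) return i;   // next request starts here
  }
  return 0;  // no boundary in this chunk; caller falls back to a hard cut
}
```

The client would then translate `chunk.substr(0, boundary)` and start the next request at that offset.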