sentencex icon indicating copy to clipboard operation
sentencex copied to clipboard

Support destructive splitting

Open waldyrious opened this issue 2 years ago • 2 comments

It's understood that the library performs non-destructive splitting by default, but would it be possible to add an option to allow "destructive" splitting? In other words, trimming of whitespace around each sentence, and removal of whitespace-only splitted parts.

waldyrious avatar Nov 05 '23 23:11 waldyrious

I think I understood what you meant, but could you please provide an example to confirm?

There is a plan to provide an advanced API other than segment- may be get_boundaries which returns more information about the nature of splitting and downstream applications can augment the behavior. See https://github.com/wikimedia/sentencex-js/issues/1

This means, the default API can be opinionated, but the new API can defer such decisions to a caller. The exact API is not finalized. Still collecting requirements for that API.

santhoshtr avatar Nov 06 '23 04:11 santhoshtr

I think I understood what you meant, but could you please provide an example to confirm?

Yes, sure. Here's what I currently get with a file based on the README example in sentencex-js and using the content from this comment as a test:

import segment from 'sentencex'

const text = `This should? Produce a violation.

So should! This example.

Abbreviations (e.g. like) these (i.e. should) not.

"Sentences in." Quotes should, too.

Pausing... for... thought... should not?

This rule does weird things if a sentence is
already wrapped. It should maybe unwrap in
cases like this?`;

console.log(segment("en", text))

I saved this as sentencex-test.mjs and ran npm install --save sentencex followed by node sentencex-test.mjs. The result was this:

[
  'This should?',
  'Produce a violation.',
  '\n\n',
  'So should!',
  'This example.',
  '\n\n',
  'Abbreviations (e.g. like) these (i.e. should) not.',
  '\n\n',
  '"Sentences in." Quotes should, too.',
  '\n\n',
  'Pausing...',
  'for...',
  'thought...',
  'should not?',
  '\n\n',
  'This rule does weird things if a sentence is\nalready wrapped.',
  'It should maybe unwrap in\ncases like this?'
]

I would have expected the whitespace-only entries to not be present:

[
  'This should?',
  'Produce a violation.',
  'So should!',
  'This example.',
  'Abbreviations (e.g. like) these (i.e. should) not.',
  '"Sentences in." Quotes should, too.',
  'Pausing...',
  'for...',
  'thought...',
  'should not?',
  'This rule does weird things if a sentence is\nalready wrapped.',
  'It should maybe unwrap in\ncases like this?'
]

...but I understand that this is by design, hence the suggestion of an option.

That said, in further tests I noticed that, contrary to my assumption above, whitespace around sentences is not preserved (internal whitespace is preserved). Shouldn't that also be kept for full "reconstructability" of the original text?

p.s. - I realize I'm using sentencex-js above even though I am commenting in the Python-based sentencex repo, but I assumed the two projects to produce identical results, and this one to be the canonical one for decisions about the API.

waldyrious avatar Nov 11 '23 23:11 waldyrious