
Consider phrase breaks?

Open jlevy opened this issue 5 years ago • 7 comments

Some users want line breaks on phrases, e.g. at commas or clause boundaries. This might or might not be a good idea, so listing it here to track and discuss.

jlevy · Mar 17 '19

(From @ivanistheone in #13:)

I was excited to try this package, but it didn't wrap things according to semantics -- e.g. on commas or other logical clauses.

Expected:

Data is what makes machine learning work.
Sure clever math solutions and optimized algorithms play an important role too,
but it's really the data that is the differentiating factor.
Specifically,
we're talking about a source of plentiful, high quality, structured, clean,
and well labelled examples of the machine learning task to be performed.

The features we use to represent each instance of a machine learning task are of central importance for the overall success of a machine learning system.
Indeed,
machine learning practitioners in the industry often describe most of the performance gains they observe come from using better features,
rather then using fancy machine learning models.
Luckily there the field of \emph{feature engineering} exists,
which consists of an arsenal of best practices and tricks for associating the most useful feature vectors as possible for each instance of the dataset.

Observed after reformat + wrap:

Data is what makes machine learning work.
Sure clever math solutions and optimized algorithms play an important role too, but it's
really the data that is the differentiating factor.
Specifically, we're talking about a source of plentiful, high quality, structured, clean,
and well labelled examples of the machine learning task to be performed.

The features we use to represent each instance of a machine learning task are of central
importance for the overall success of a machine learning system.
Indeed, machine learning practitioners in the industry often describe most of the
performance gains they observe come from using better features, rather then using fancy
machine learning models.
Luckily there the field of \emph{feature engineering} exists, which consists of an
arsenal of best practices and tricks for associating the most useful feature vectors as
possible for each instance of the dataset.

Specifically, I'd expect the "but it's" to be on the next line.

jlevy · Mar 17 '19

Current intended behavior is (1) to break on sentences (unless they are so short they might not be sentences at all, in which case we err on the side of not breaking), and (2) to emphasize simplicity and language neutrality (e.g. not using any overly complex NLP or elaborate rules that would make this not work or be unpredictably nondeterministic as the package evolves).
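
In pseudocode-ish JavaScript, the rule is roughly this (splitSentences and MIN_SENTENCE_LEN are hypothetical names just to illustrate, not Flowmark's actual internals):

    // Hypothetical sketch, not Flowmark's actual code: break a paragraph
    // into one sentence per line, erring on the side of not breaking when
    // a candidate "sentence" is suspiciously short.
    const MIN_SENTENCE_LEN = 20; // hypothetical threshold, in characters

    // Simple, language-neutral splitter: break after ., !, or ? followed
    // by whitespace. (Deliberately avoids heavy NLP, per the goals above.)
    function splitSentences(text) {
      return text.split(/(?<=[.!?])\s+/);
    }

    function breakOnSentences(paragraph) {
      const lines = [];
      for (const sentence of splitSentences(paragraph)) {
        if (sentence.length < MIN_SENTENCE_LEN && lines.length > 0) {
          // Too short to be confident it's a sentence: glue it on.
          lines[lines.length - 1] += ' ' + sentence;
        } else {
          lines.push(sentence);
        }
      }
      return lines.join('\n');
    }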

It's possible though that breaking on longer phrases is a good idea, but we'd need simple rules. It might also be harder to explain and get used to.

@ivanistheone did you have any thoughts or use cases on why you'd prefer phrase breaks to sentence breaks?

This could also be a flag, but that comes at a cost too.

jlevy · Mar 17 '19

did you have any thoughts or use cases on why you'd prefer phrase breaks to sentence breaks?

The high-level reason is that phrases are the smallest coherent unit, so it makes sense to see them each on their own line. Similar to how one paragraph contains one idea, each phrase is one coherent building block used to construct that idea.

There are also several practical, low-level reasons for the one-phrase-per-line approach:

  • In my experience working on books, I do a lot of editing and moving around, which is made much easier when I can do "surgery" on the text with only vertical selection commands (always cutting entire chunks).
  • Long phrases stick out visually in the source and serve as red flags for parts that need to be simplified; e.g. if you write a run-on phrase of 200+ characters it will be clearly visible, and the ugliness of the source might prompt you to shorten or simplify it.
  • Similarly, the use of introductory words + comma like "Indeed," or "However," becomes apparent (ragged, ugly source), which forces me to use them sparingly.
  • One phrase per line makes GitHub diffs look nice, although thanks to diff --color-words and latexdiff this is less important when working on the command line.
  • More thoughts on this here: https://rhodesmill.org/brandon/2012/one-sentence-per-line/ (although I think sub-phrase line breaks might be going too far).

ivanistheone · Mar 17 '19

Thanks! Yes, I'm familiar with most of these goals (there's some more discussion here if you're interested).

Your first benefit is interesting, for sure, and perhaps works better with phrase breaks than sentence breaks. The 2nd and 3rd are interesting as well, but I'm not sure every editor would share this perception, so I'd hesitate to make it the default. Note the 4th and 5th are mostly already benefits of sentence-per-line-with-wrap-on-overflow, the current behavior.

At Holloway, we use Flowmark on large documents with several committers pretty effectively, and I've found it's a good, realistic compromise so far that balances semantic editing and stability with keeping the text sane-looking. But I will leave this open; perhaps this could be a setting in the future, and I'm glad to hear if anyone else asks for it!

jlevy · Apr 01 '19

This form of semantic break would be useful when formatting Markdown documentation. Adjusting the split rules to split-on-semantic-breaks-if-required would cover it.

For example, a heuristic that splits on semantic breaks past ~60% of the maximum text width, but falls back to non-semantic breaks if required to stay below the max width?

E.g.:

Data is what makes machine learning work.
Sure clever math solutions and optimized algorithms play an important role too,
but it's really the data that is the differentiating factor.
Specifically, we're talking about a source of plentiful, high quality, structured, clean,
and well labelled examples of the machine learning task to be performed.

The features we use to represent each instance of a machine learning task are of central
importance for the overall success of a machine learning system.
Indeed, machine learning practitioners in the industry often describe most of the
performance gains they observe come from using better features,
rather then using fancy machine learning models.

Luckily there the field of \emph{feature engineering} exists,
which consists of an arsenal of best practices and tricks for associating the most 
useful feature vectors as possible for each instance of the dataset.
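
In sketch form, such a heuristic might look like this (MAX_WIDTH, SEMANTIC_THRESHOLD, and findSemanticBreak are hypothetical names, and the comma/semicolon rule is just a stand-in for a real clause detector):

    const MAX_WIDTH = 90;           // hard wrap limit
    const SEMANTIC_THRESHOLD = 0.6; // prefer semantic breaks past ~60% of max width

    // Stand-in clause detector: position of the last ", " or "; " at or
    // before `limit`, or -1 if there is none.
    function findSemanticBreak(line, limit) {
      const slice = line.slice(0, limit);
      return Math.max(slice.lastIndexOf(', '), slice.lastIndexOf('; '));
    }

    function wrapLine(line) {
      const out = [];
      while (line.length > MAX_WIDTH) {
        const semantic = findSemanticBreak(line, MAX_WIDTH);
        let cut;
        if (semantic >= MAX_WIDTH * SEMANTIC_THRESHOLD) {
          cut = semantic + 1; // break just after the comma or semicolon
        } else {
          // No usable semantic break: fall back to a plain word break.
          cut = line.lastIndexOf(' ', MAX_WIDTH);
          if (cut < 0) cut = MAX_WIDTH; // no space at all: hard cut
        }
        out.push(line.slice(0, cut).trimEnd());
        line = line.slice(cut).trimStart();
      }
      out.push(line);
      return out.join('\n');
    }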

asford · Feb 15 '21

As a quick follow-up... the compromise library's .clauses selector may be a good ~80%-effect, ~10%-effort solution for this style of splitting.

I've played around with it on this test data, https://observablehq.com/d/59be2e7af575c8ad, and it definitely has warts but could be a good start.
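
Roughly, the idea is this (assuming compromise's .clauses() selector; the exact splits depend on its version and heuristics):

    // npm install compromise
    const nlp = require('compromise');

    // One clause per line: compromise's .clauses() splits a sentence at
    // clause boundaries (commas, conjunctions, etc.).
    function breakOnClauses(sentence) {
      return nlp(sentence).clauses().out('array').join('\n');
    }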

asford · Feb 16 '21