fs2 icon indicating copy to clipboard operation
fs2 copied to clipboard

added fs2.text.linesWithEndings

Open fizzy33 opened this issue 4 years ago • 4 comments
trafficstars

Worth noting that if we wanted to replace fs2.text.lines with

  /** Transforms a stream of `String` such that each emitted `String` is a line from the input. */
  def lines[F[_]]: Pipe[F, String, String] =
    s => linesWithEndings(s).map(_._1)

The performance hit using LinesBenchmark on my mac mini 2021 was down 3K ops/s from 96K ops/s to 93K ops/s. Which seems sensible as there are a few more allocations and one more map.

The unit test is a robotic copy of the test for lines.

So one open question I have is to replace fs2.text.lines or not.

While more LineEnding's could be expressed the current code doesn't support them so the effort was on expressing simply what the existing system does.

fizzy33 avatar May 23 '21 21:05 fizzy33

@diesalbla fwiw I spent about 2 hours in morbid curiousity trying to eek more performance including trying what you suggested in various incarnations and couldn't eek out any more performance. Any case where substring's were used were signficantly less performant. Honestly really surprised by this. I also wanted to see if I could get the performance we wanted using scanChunks but no success. Someone already worked really hard to get it to perform. I suspect the StringBuilder allocated on the call stack is very cheap along with the arrays that StringBuilder.toString generates. They show to be significantly cheaper than String.substring in the profiler I was using.

@nikiforo we have the use case for this in http4s where in roundtrip unit tests it is needed to properly reify the input data with the expected and actual data.

Agreed on the EOF I have similar mis-givings. Does EOS (end of string | end of stream) lessen it ? Suggested from the comment that lines is about streams not files. I toss and turn on the Option[LineEnding] versus having EOF as well. Chose EOF because it was less allocations and assuming all else is equal (which it isn't).

fizzy33 avatar May 24 '21 12:05 fizzy33

If we went the option route, we'd probably need to cache instances of Some(CRLF) and Some(CR), right?

The Java socket API uses end-of-input and end-of-output. I think either EOF or EOS would be fine here.

mpilquist avatar May 24 '21 12:05 mpilquist

If we went the option route, we'd probably need to cache instances of Some(CRLF) and Some(CR), right?

:smiley:

The Java socket API uses end-of-input and end-of-output. I think either EOF or EOS would be fine here.

Consider this line. If you Stream("one\r\ntwo").linesWithEndings you would get Stream(("one", CRLF), ("two", EOS)). That looks fine. However, if you Stream.empty[String].linesWithEndings you would get either Stream.empty and no EOS, or Stream("", EOS). Both of them look wrong from where I stand.

nikiforo avatar May 24 '21 13:05 nikiforo

So oddly enough I spent the morning hacking through ServerSentEvent unit tests in http4s. ServerSentEvent defines end event as an empty line terminated with a \n. The existing protocol decoder was using fs2.text.lines and in the absence of line endings assumed \n which created corner cases when combined with this related behaviour fs2.Stream("\n").through(fs2.text.lines).toVector.size == 2.

@nikiforo great point the little response I have for you there is the existing behaviour I mentioned of fs2.Stream("\n").through(fs2.text.lines).toVector.size == 2 (not quite an exact analogy)

I have no strong opinion on EOS or Option[LineEnding] I can make both work. Being new here I am not sure how we come to consensus. There are more "exotic" ways to encode this (just throwing spaghetti against the wall),

sealed trait TextOrLineEnding
case class Text(value: String) extends TextOrLineEnding
case class LineEnding(value: String) extends TextOrLineEnding
def linesWithEndings[F[_]]: Pipe[F, String, TextOrLineEnding] = {

Noting with this encoding you lose the obvious no two Text instances in a row but it does have legs.

Another alternate is have the text be optional

def linesWithEndings[F[_]]: Pipe[F, String, (Option[String],LineEnding)] = {

You could arguably pass the @nikiforo test with Stream.empty[String].linesWithEndings == Stream(None -> EOS)

Anyway I am having a bit of fun just to see if any bright ideas pop out. There is a practical need here. One needs the actual line endings if you are decoding protocols where \n \r\n \r and End are not considered equivalent

Anyone else find it fascinating something as simple as this can get so intricate ?

fizzy33 avatar May 26 '21 14:05 fizzy33