flix icon indicating copy to clipboard operation
flix copied to clipboard

Add `BufReader` trait to standard library

Open magnus-madsen opened this issue 7 months ago • 12 comments

Add something like:

import java.io.BufferedReader

enum BufReaderRef(BufferedReader)

trait BufReader[t] {
    type Aef: Eff
    pub def readln(rd: t): Result[IoError, String] \ BufReader.Aef[t]
}

instance BufReader[BufReaderRef] {
    type Aef = IO
    pub def readln(rd: BufReaderRef): Result[IoError, String] \ IO = match rd {
        case BufReaderRef.BufReaderRef(r) => ???
    }
}
  • [ ] Break into multiple files to follow standard format.
  • [ ] Add unit tests
  • [ ] Add more instances (e.g. one for List[String] or even f[String])

magnus-madsen avatar Jun 08 '25 09:06 magnus-madsen

@mlutze I was thinking Ry could look into this.

magnus-madsen avatar Jun 08 '25 15:06 magnus-madsen

I think my description is wrong.

We probably do want a Reader instance for java.io.BufferedReader, but we probably also want our own BufReader struct thingy that wraps anything with a Reader instance.

magnus-madsen avatar Jun 11 '25 10:06 magnus-madsen

@magnus-madsen

BufReader struct thingy that wraps anything with a Reader instance

Is it possible to add this constraint to the enum def itself? I wouldn't think so... like

enum BufReaderRef[a](a) with BufReader[a]

or would we just add the constraint to the instance def (and anywhere we return a BufReaderRef in the api):

enum BufReaderRef[a](a)

instance BufReader[BufReaderRef[a]] with BufReader[a]

What's the use case for BufReaderRef? enum BufReaderRef[a](a) just looks a little strange to me on its own 😅. If it's to avoid returning java types directly in our api, could we instead just return a (with BufReader[a])?

  1. trait instances for List[String] / f[String] wouldn't be legal right, since they are complex instances?

rywiese avatar Jun 13 '25 11:06 rywiese

Is it possible to add this constraint to the enum def itself? I wouldn't think so... like

enum BufReaderRef[a](a) with BufReader[a]

This is not possible.

or would we just add the constraint to the instance def (and anywhere we return a BufReaderRef in the api):

enum BufReaderRef[a](a)

instance BufReader[BufReaderRef[a]] with BufReader[a]

BufReader only needs to wrap Reader, not another BufReader

mlutze avatar Jun 13 '25 13:06 mlutze

What @mlutze said

magnus-madsen avatar Jun 13 '25 14:06 magnus-madsen

But BufReader should probably be a struct with its own array, buffer size, and region. (I think). @mlutze Can chip in here.

magnus-madsen avatar Jun 13 '25 14:06 magnus-madsen

Okay I get I think. BufReader is a struct that wraps a Reader + everything it needs to do its own buffering, just the way java BufferedReader is a class that wraps a java Reader and adds buffering.

I worked on this today but got tripped up on a subtle issue:

the java Reader reads into a char[] buffer, whereas our Reader (like the rust reader) reads into an Int8 buffer. So as proposed, our BufReader would have to decode the bytes into chars.

The way java does it:

  • InputStream reads into byte[].
  • InputStreamReader
    • wraps the InputStream along with a charset (new InputStreamReader(inputStream, charset))
    • reads into char[]
    • implements Reader
  • BufferedReader wraps the InputStreamReader, so it can just buffer the chars and read them line at a time.

(Note also that java InputStream is (now) an instance of Flix Reader, but not an instance of java Reader. But new InputStreamReader(inputStream, charset) is an instance of java Reader, cause the types match there.)

We could just give our BufReader a charset but maybe that's not the best separation of concerns. So maybe we should think up a middle layer for doing the decoding. I'll think some more on it tomorrow and see if I can find what rust does. Open to ideas @magnus-madsen @mlutze

rywiese avatar Jun 16 '25 14:06 rywiese

Okay I get I think. BufReader is a struct that wraps a Reader + everything it needs to do its own buffering, just the way java BufferedReader is a class that wraps a java Reader and adds buffering.

Agreed.

I think where we get confused is the difference between reading bytes vs. chars. Is there a BufferedInputStream in Java or how is this supposed to work?

magnus-madsen avatar Jun 16 '25 18:06 magnus-madsen

Yeah there is https://docs.oracle.com/javase/8/docs/api/java/io/BufferedInputStream.html

which does buffering, but reads bytes (not chars). so maybe this is the functionality we want to make instead? and not concern ourselves with char bufs (or strings, or reading lines).

this is also extends InputStream, so it can also implement our Reader trait

rywiese avatar Jun 16 '25 19:06 rywiese

Yeah there is https://docs.oracle.com/javase/8/docs/api/java/io/BufferedInputStream.html

which does buffering, but reads bytes (not chars). so maybe this is the functionality we want to make instead? and not concern ourselves with char bufs (or strings, or reading lines).

this is also extends InputStream, so it can also implement our Reader trait

I think so. But we will want our own BufReader. The names are indeed confusing.

magnus-madsen avatar Jun 17 '25 03:06 magnus-madsen

Yeah so rust has a BufReader which does what I described:

which does buffering, but reads bytes (not chars)

and also implements the Rust Reader.

So our BufReader should do the same I think... and then we can later build something that handles decoding into chars. That component could be built on top of a (byte) Reader - which could be a buffered implementation. Then we would be decoding after pulling data out of the buffer, which differs from the Java, which decodes before putting into the buffer. I think buffering raw bytes would make the buffer more reusable

rywiese avatar Jun 17 '25 07:06 rywiese

Sounds good. And then we need to figure out the naming too.

magnus-madsen avatar Jun 17 '25 07:06 magnus-madsen

and then we can later build something that handles decoding into chars

So i've been working on building this on top of the linked PR, and have discovered some interesting things that are making me re-think this design choice to buffer bytes vs buffer chars (in other words, decode before or after buffering).

Basically,

  • Rust strings (from what i'm gathering) have a utf8 representation at the byte level, effectively allowing [u8] to be unchecked-casted into a string (if utf8 is assumed).
  • This naturally leads to a rust design (which I have implemented in flix in the linked PR) where bytes are buffered, then copied into a string in an unsafe block on the fly during the readLine() call. (https://doc.rust-lang.org/std/io/struct.BufReader.html)
  • Java strings are not like this, so they must be decoded into Chars using Charset decode(). This function takes a ByteBuffer input and gives a CharBuffer output (both objects, not primitives).
  • Since these buffers are both objects, it would generate a lot of garbage to create them on the fly (sized dynamically) in the readLine() operation. It is more efficient to reuse a single buffer of fixed sized for decoding, which naturally leads to the design in java which is to buffer the chars. readLine() just scans through the char buffer looking for a \n char.

Assuming we want to use Charset.decode() for the flix implementation, we probably want to shift towards buffering chars instead of bytes.

The java way of doing this is new BufferedReader(new InputStreamReader(inputStream, Charset.UTF8)), where

  • inputStream allows reading of bytes and is provided on the socket
  • InputStreamReader wraps an InputStream (byte reader) along with a charset to implement a (char) Reader.
  • BufferedReader adds the readLine() functionality to a (char) Reader, by buffering the chars read from the underlying reader.

So we could:

  1. Rework my PR to support buffering of chars instead of bytes. BufReader would probably need a Charset member.
  2. Do something like this, but package up a new BufferedReader(...) in the Socket tuple (along with a LineReader typeclass). This probably seems easiest now that I think about it?
  3. Keep the byte buffering, and get creative on an efficient way to decode bytes on the fly when a new line is requested. This will likely involve unchecked casts, and extra care would have to be taken to consider multi-byte chars. @CadeMichael has some ideas.

rywiese avatar Jun 23 '25 11:06 rywiese

It sounds like we need to discuss this on the whiteboard. Maybe we can find time tomorrow. CC @mlutze and @JonathanStarup

magnus-madsen avatar Jun 23 '25 11:06 magnus-madsen

Could you outline (not implement) a design that "fits best" with Java?

magnus-madsen avatar Jun 23 '25 11:06 magnus-madsen

I had a brief talk with Ry earlier and one thing that might be overkill but seems like it could be useful would be a semi backtracking effect for "peeking" the next byte much like a lexer parsing tokens.

I believe mutli bit UTF characters prefix the first byte with how many trailing bytes compose the character. We could parse the buffer stream and peek the next byte until we get a valid character and stop when we hit an endline character, or return an error if the bytes don't correspond to valid characters.

The main issue Ry brought up with the charset library is it takes a buffer -> charbuffer. Meaning that it doesn't mutate the underlying buffer. So if you read a line from the buffer it wouldn't strip that line from the buffer, the buffer would contain the same bytes as it did before the readline operation.

CadeMichael avatar Jun 23 '25 11:06 CadeMichael

Could you outline (not implement) a design that "fits best" with Java?

Option 1: Rely on java types under the hood:

trait LineReader[t] with CharReader[t] {
    type Aef: Eff
    pub def readLine(r: t): Result[IoError, String] \ r + LineReader.Aef[t]
}

instance LineReader[java.io.BufferedReader] {
    // delegate to BufferedReader::readLine
}

- enum TcpSocket(Socket, InputStream, OutputStream)
+ enum TcpSocket(Socket, InputStream, BufferedReader, OutputStream)

instance LineReader[TcpSocket] {
    // delegate to `LineReader[BufferedReader]`
}

// Socket construction (in `TcpAccept`)
- TcpSocket.TcpSocket(socket, inputStream, outputStream)
+ let bufferedReader = new BufferedReader(new InputStreamReader(inputStream, Charset.UTF8)) // java types
+ TcpSocket.TcpSocket(socket, inputStream, bufferedReader, outputStream)

Option 2: Our own char buffering (change BufReader to buffer chars)

// currently `Reader`
trait ByteReader[t] {
    type Aef: Eff
    pub def read(b: Array[Int8, r], r: t): Result[IoError, Int32] \ r + ByteReader.Aef[t]
}

trait CharReader[t] {
    type Aef: Eff
    pub def read(b: Array[Char, r], r: t): Result[IoError, Int32] \ r + CharReader.Aef[t]
}

// or maybe one `Reader` with an associated type?

trait LineReader[t] with CharReader[t] {
    type Aef: Eff
    pub def readLine(r: t): Result[IoError, String] \ r + LineReader.Aef[t]
}

pub struct BufReader[t: Type, r: Region] {
    buffer: Array[Char, r], // instead of `Array[Int8, r]`
    ...
    reader: t // now a CharReader instead of (byte) Reader
}

instance CharReader[BufReader[t, r]] with CharReader[t] {
    read: same as written currently
}

instance LineReader[BufReader[t, r]] {
    readLine: scans through the char buffer looking for \n
}

struct DecodingCharReader[t] {
    byteReader: t,
    charset: Charset
}

instance CharReader[DecodingCharReader[t]] with ByteReader[t] {
    // read() combines `byteReader.read()` and `charset.decode()`
}

- enum TcpSocket(Socket, InputStream, OutputStream)
+ enum TcpSocket(Socket, InputStream, t, OutputStream)

instance LineReader[TcpSocket[t]] with LineReader[t] {
    // readLine delegates to `t`
}

// Socket construction (in `TcpAccept`)
- TcpSocket.TcpSocket(socket, inputStream, outputStream)
+ let bufReader = BufReader.fromCharReader(DecodingCharReader.fromByteReader(inputStream, utf8));
+ TcpSocket.TcpSocket(socket, inputStream, bufReader, outputStream)

rywiese avatar Jun 23 '25 13:06 rywiese

This is the partial implementation that I had already started writing, using the existing BufReader as is (built on bytes).

If we continue building on top of BufReader (buffering bytes), then a sketch of the implementation is as follows (I was already working on this)

trait BufReader[t] with Reader[t] {
    pub def readUntil(p: Int8 -> Bool, bufReader: BufReader[t, r]): Result[IoError, Vector[Int8]] \ r + BufReader.Aef[t]
}

mod BufReader {

    pub def readLine(charset: Charset, bufReader: BufReader[t, r]): Result[IoError, String] \ r + BufReadable.Aef[t] + IO with BufReadable[t] =
        region rc {
            let builder = StringBuilder.empty(rc);
            let newLine = unchecked_cast('\n' as Int8);
            let carriageReturn = unchecked_cast('\r' as Int8);
            forM(
                bytes <- BufReadable.readUntil(b -> (b == newLine) or (b == carriageReturn), bufReader)
            ) yield {
                let line = charset.decode(ByteBuffer.wrap(bytes)).array();
                foreach(char <- line) {
                    StringBuilder.append(char, builder)
                };
                StringBuilder.toString(builder)
            }
        }

}

The problems with this approach, which triggered me to dive deeper into rust vs java, are:

  1. We create garbage via ByteBuffer.wrap every line, vs every filling of the buffer.
  2. We still have these unchecked_cast to detect newline bytes. Maybe not such a big deal, but didn't seem ideal either.

But maybe the garbage from 1 is tolerable, in which case this is a non-issue? I don't have a good feel yet for what kind of performance concerns we're aiming for here. If this is too expensive, then it makes more sense to use one of the previous proposals where we decode into the buffer (new garbage for every buffer filling, not every newline).

rywiese avatar Jun 23 '25 13:06 rywiese

In Option 1:

+ TcpSocket.TcpSocket(socket, inputStream, bufferedReader, outputStream)

A stream can be both bytes or strings (e.g. images or text). If we have a bufferedReader here are we not assuming only text?

magnus-madsen avatar Jun 23 '25 15:06 magnus-madsen

Is it correct to say that there are two dimensions: Reading bytes and reading chars? Along another axis, reading buffered or unbuffered? Perhaps with the exception that when reading buffered chars we care about reading whole lines?

magnus-madsen avatar Jun 23 '25 17:06 magnus-madsen

In Option 1:

+ TcpSocket.TcpSocket(socket, inputStream, bufferedReader, outputStream)

A stream can be both bytes or strings (e.g. images or text). If we have a bufferedReader here are we not assuming only text?

Good point. I guess with this approach, if we're reading non-text we still implement Read via inputStream to read bytes. We also LineReader, but it wouldn't make sense for the caller to call readLine. So maybe it doesn't make sense to make TcpSocket implement LineReader directly, for that reason

rywiese avatar Jun 23 '25 20:06 rywiese

Is it correct to say that there are two dimensions: Reading bytes and reading chars? Along another axis, reading buffered or unbuffered?

Mostly yes, and i think it would be good to keep these concerns separate. Though as you point out the lines get blurred slightly:

Perhaps with the exception that when reading buffered chars we care about reading whole lines?

Any sort of readWhile/readUntil (I said readUntil in my proposal but maybe readWhile is more standard) logic has to be built on top of buffering. This includes readLine(), which is equivalent to readWhile(c -> c == '\n'). This is because we have to read some data into memory, iterate through the data until the predicate fails, and then the remaining suffix has to go somewhere, so we need a buffer to put it in.

So maybe it's a bit more accurate to say that if we care about reading whole lines (in some generic sense), we need buffering.

We could also consider making Read and BufRead polymorphic, like:

trait Read[t] {
    type Elem
    type Aef: Eff
    pub def read(buffer: Array[Read.Elem[t], r], reader: t)...
}

trait BufRead[t] with Read[t] {
    ...
    pub def readWhile(...
}

This would allow us to choose whether to buffer chars or bytes without changing buffering logic

rywiese avatar Jun 23 '25 21:06 rywiese

Any sort of readWhile/readUntil (I said readUntil in my proposal but maybe readWhile is more standard) logic has to be built on top of buffering. This includes readLine(), which is equivalent to readWhile(c -> c == '\n'). This is because we have to read some data into memory, iterate through the data until the predicate fails, and then the remaining suffix has to go somewhere, so we need a buffer to put it in.

one thing to consider is the platform dependent newline character, but it should be really simple to account for.

CadeMichael avatar Jun 23 '25 21:06 CadeMichael

The associated type to Read sounds pretty nice. What would it look like in practice? How much needs to be duplicated?

mlutze avatar Jun 24 '25 08:06 mlutze

The associated type to Read sounds pretty nice. What would it look like in practice? How much needs to be duplicated?

I'm working on a prototype to show what it looks like. I don't think anything needs to be duplicated, just updating the existing Reader to have an associated type, and adding an associated type of Int8 to all existing instances. Then more instances with type Char can be added also. The BufReader which is under review now would be changed to be type agnostic, but it shouldn't change any of the logic or tests.

rywiese avatar Jun 25 '25 07:06 rywiese

But with that said, I've made all requested changes to the BufReader. I think best would be to merge as-is, close this ticket, and then iterate. I can make some new issues to implement the new design that i'm working on/prototyping.

rywiese avatar Jun 25 '25 07:06 rywiese

I can make some new issues to implement the new design that i'm working on/prototyping.

https://github.com/flix/flix/issues/10964

rywiese avatar Jun 25 '25 09:06 rywiese

let me know when you merge once there is a Working implementation I can finish Process and underlying changes to the readline implementation shouldn't break it.

CadeMichael avatar Jun 25 '25 10:06 CadeMichael