Add `BufReader` trait to standard library
Add something like:
import java.io.BufferedReader

enum BufReaderRef(BufferedReader)

trait BufReader[t] {
    type Aef: Eff
    pub def readln(rd: t): Result[IoError, String] \ BufReader.Aef[t]
}

instance BufReader[BufReaderRef] {
    type Aef = IO
    pub def readln(rd: BufReaderRef): Result[IoError, String] \ IO = match rd {
        case BufReaderRef.BufReaderRef(r) => ???
    }
}
- [ ] Break into multiple files to follow standard format.
- [ ] Add unit tests
- [ ] Add more instances (e.g. one for List[String] or even f[String])
@mlutze I was thinking Ry could look into this.
I think my description is wrong.
We probably do want a Reader instance for java.io.BufferedReader, but we probably also want our own BufReader struct thingy that wraps anything with a Reader instance.
@magnus-madsen
> BufReader struct thingy that wraps anything with a Reader instance
Is it possible to add this constraint to the enum def itself? I wouldn't think so... like
enum BufReaderRef[a](a) with BufReader[a]
or would we just add the constraint to the instance def (and anywhere we return a BufReaderRef in the api):
enum BufReaderRef[a](a)
instance BufReader[BufReaderRef[a]] with BufReader[a]
What's the use case for BufReaderRef? enum BufReaderRef[a](a) just looks a little strange to me on its own 😅. If it's to avoid returning java types directly in our api, could we instead just return a (with BufReader[a])?
- trait instances for `List[String]`/`f[String]` wouldn't be legal, right, since they are complex instances?
> Is it possible to add this constraint to the enum def itself? I wouldn't think so... like
> `enum BufReaderRef[a](a) with BufReader[a]`
This is not possible.
> or would we just add the constraint to the instance def (and anywhere we return a `BufReaderRef` in the api):
> `enum BufReaderRef[a](a)`
> `instance BufReader[BufReaderRef[a]] with BufReader[a]`
BufReader only needs to wrap Reader, not another BufReader
What @mlutze said
But BufReader should probably be a struct with its own array, buffer size, and region (I think). @mlutze can chip in here.
Okay, I get it, I think. BufReader is a struct that wraps a Reader + everything it needs to do its own buffering, just the way Java BufferedReader is a class that wraps a Java Reader and adds buffering.
I worked on this today but got tripped up on a subtle issue:
the java Reader reads into a char[] buffer, whereas our Reader (like the rust reader) reads into an Int8 buffer. So as proposed, our BufReader would have to decode the bytes into chars.
The way java does it:
- `InputStream` reads into `byte[]`.
- `InputStreamReader`
  - wraps the `InputStream` along with a charset (`new InputStreamReader(inputStream, charset)`)
  - reads into `char[]`
  - implements `Reader`
- `BufferedReader` wraps the `InputStreamReader`, so it can just buffer the chars and read them a line at a time.
(Note also that java InputStream is (now) an instance of Flix Reader, but not an instance of java Reader. But new InputStreamReader(inputStream, charset) is an instance of java Reader, cause the types match there.)
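For reference, a minimal Java sketch of that layering (the readOneLine helper is mine, and UTF-8 is just an example charset):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class JavaLayering {
    // InputStream (bytes) -> InputStreamReader (decode bytes to chars) -> BufferedReader (buffer chars, add readLine)
    static String readOneLine(InputStream in) throws IOException {
        InputStreamReader decoded = new InputStreamReader(in, StandardCharsets.UTF_8);
        BufferedReader buffered = new BufferedReader(decoded);
        return buffered.readLine(); // returns null at end of stream
    }
}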
We could just give our BufReader a charset but maybe that's not the best separation of concerns. So maybe we should think up a middle layer for doing the decoding. I'll think some more on it tomorrow and see if I can find what rust does. Open to ideas @magnus-madsen @mlutze
> Okay, I get it, I think. `BufReader` is a struct that wraps a `Reader` + everything it needs to do its own buffering, just the way Java `BufferedReader` is a class that wraps a Java `Reader` and adds buffering.
Agreed.
I think where we get confused is the difference between reading bytes vs. chars. Is there a BufferedInputStream in Java or how is this supposed to work?
Yeah there is https://docs.oracle.com/javase/8/docs/api/java/io/BufferedInputStream.html
which does buffering, but reads bytes (not chars). so maybe this is the functionality we want to make instead? and not concern ourselves with char bufs (or strings, or reading lines).
This also extends InputStream, so it can also implement our Reader trait.
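As a tiny illustration (the helper is mine), BufferedInputStream just wraps another InputStream and keeps the byte-oriented API:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

class ByteBuffering {
    // BufferedInputStream buffers raw bytes but still exposes the InputStream API:
    // read() returns the next byte as an int in 0..255, or -1 at end of stream.
    static int firstByte(InputStream in) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in, 8192);
        return buffered.read();
    }
}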
> Yeah there is https://docs.oracle.com/javase/8/docs/api/java/io/BufferedInputStream.html which does buffering, but reads bytes (not chars). so maybe this is the functionality we want to make instead? and not concern ourselves with char bufs (or strings, or reading lines). This also extends `InputStream`, so it can also implement our `Reader` trait.
I think so. But we will want our own BufReader. The names are indeed confusing.
Yeah so Rust has a BufReader which does what I described:
> which does buffering, but reads bytes (not chars)
and also implements the Rust Reader.
So our BufReader should do the same, I think... and then we can later build something that handles decoding into chars. That component could be built on top of a (byte) Reader, which could be a buffered implementation. Then we would be decoding after pulling data out of the buffer, which differs from Java, which decodes before putting data into the buffer. I think buffering raw bytes would make the buffer more reusable.
Sounds good. And then we need to figure out the naming too.
> and then we can later build something that handles decoding into chars
So I've been working on building this on top of the linked PR, and have discovered some interesting things that are making me re-think this design choice to buffer bytes vs buffer chars (in other words, decode before or after buffering).
Basically,
- Rust strings (from what I'm gathering) have a UTF-8 representation at the byte level, effectively allowing `[u8]` to be unchecked-cast into a string (if UTF-8 is assumed).
- This naturally leads to a Rust design (which I have implemented in Flix in the linked PR) where bytes are buffered, then copied into a string in an unsafe block on the fly during the `readLine()` call. (https://doc.rust-lang.org/std/io/struct.BufReader.html)
- Java strings are not like this, so they must be decoded into `Char`s using `Charset.decode()`. This function takes a `ByteBuffer` input and gives a `CharBuffer` output (both objects, not primitives); see the sketch after this list.
- Since these buffers are both objects, it would generate a lot of garbage to create them on the fly (sized dynamically) in the `readLine()` operation. It is more efficient to reuse a single buffer of fixed size for decoding, which naturally leads to the Java design of buffering the chars: `readLine()` just scans through the char buffer looking for a `\n` char.
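A small Java sketch of that Charset.decode() behaviour (the input bytes are just an example); every call allocates a fresh CharBuffer:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeSketch {
    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        byte[] raw = "hello\nworld\n".getBytes(utf8);
        // decode() reads the ByteBuffer and returns a newly allocated CharBuffer,
        // so calling it per line generates garbage on every readLine().
        CharBuffer chars = utf8.decode(ByteBuffer.wrap(raw));
        System.out.println(chars); // prints "hello" and "world" on separate lines
    }
}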
Assuming we want to use Charset.decode() for the flix implementation, we probably want to shift towards buffering chars instead of bytes.
The Java way of doing this is `new BufferedReader(new InputStreamReader(inputStream, Charset.UTF8))`, where
- `inputStream` allows reading of bytes and is provided on the socket
- `InputStreamReader` wraps an `InputStream` (byte reader) along with a charset to implement a (char) `Reader`
- `BufferedReader` adds the `readLine()` functionality to a (char) `Reader`, by buffering the chars read from the underlying reader
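If I understand the Java internals correctly, InputStreamReader avoids per-call allocation by reusing a CharsetDecoder and fixed-size buffers; a rough sketch of that reuse pattern (class name and buffer size are mine, UTF-8 assumed):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

class ReusableDecoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    private final CharBuffer chars = CharBuffer.allocate(8192); // reused char buffer

    // Decodes the given bytes into the reused CharBuffer and returns how many
    // chars are now available to scan (e.g. for '\n').
    int decodeInto(ByteBuffer bytes) {
        chars.clear();
        decoder.decode(bytes, chars, false); // false: more input may follow later
        chars.flip();
        return chars.remaining();
    }
}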
So we could:
- Rework my PR to support buffering of chars instead of bytes. `BufReader` would probably need a `Charset` member.
- Do something like this, but package up a `new BufferedReader(...)` in the Socket tuple (along with a `LineReader` typeclass). This probably seems easiest now that I think about it?
- Keep the byte buffering, and get creative on an efficient way to decode bytes on the fly when a new line is requested. This will likely involve unchecked casts, and extra care would have to be taken to consider multi-byte chars. @CadeMichael has some ideas.
It sounds like we need to discuss this on the whiteboard. Maybe we can find time tomorrow. CC @mlutze and @JonathanStarup
Could you outline (not implement) a design that "fits best" with Java?
I had a brief talk with Ry earlier, and one thing that might be overkill but seems like it could be useful would be a semi-backtracking effect for "peeking" the next byte, much like a lexer parsing tokens.
I believe multi-byte UTF-8 characters encode in the first byte how many trailing bytes compose the character. We could parse the buffer stream and peek the next byte until we get a valid character, stop when we hit an end-of-line character, or return an error if the bytes don't correspond to valid characters.
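For what it's worth, a small Java sketch (the helper is mine) of reading the UTF-8 leading byte to find how many continuation bytes follow:

class Utf8Lead {
    // Returns how many continuation bytes follow a UTF-8 leading byte,
    // or -1 if the byte is a continuation byte or an invalid leading byte.
    static int trailingBytes(byte lead) {
        int b = lead & 0xFF;
        if ((b & 0x80) == 0x00) return 0; // 0xxxxxxx: ASCII, no trailing bytes
        if ((b & 0xE0) == 0xC0) return 1; // 110xxxxx: one continuation byte
        if ((b & 0xF0) == 0xE0) return 2; // 1110xxxx: two continuation bytes
        if ((b & 0xF8) == 0xF0) return 3; // 11110xxx: three continuation bytes
        return -1;                        // 10xxxxxx continuation or invalid
    }
}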
The main issue Ry brought up with the charset library is that it takes a byte buffer -> char buffer, meaning that it doesn't mutate the underlying buffer. So if you read a line from the buffer, it wouldn't strip that line from the buffer; the buffer would contain the same bytes as it did before the readLine operation.
> Could you outline (not implement) a design that "fits best" with Java?
Option 1: Rely on java types under the hood:
trait LineReader[t] with CharReader[t] {
    type Aef: Eff
    pub def readLine(r: t): Result[IoError, String] \ r + LineReader.Aef[t]
}

instance LineReader[java.io.BufferedReader] {
    // delegate to BufferedReader::readLine
}

- enum TcpSocket(Socket, InputStream, OutputStream)
+ enum TcpSocket(Socket, InputStream, BufferedReader, OutputStream)

instance LineReader[TcpSocket] {
    // delegate to `LineReader[BufferedReader]`
}

// Socket construction (in `TcpAccept`)
- TcpSocket.TcpSocket(socket, inputStream, outputStream)
+ let bufferedReader = new BufferedReader(new InputStreamReader(inputStream, Charset.UTF8)) // java types
+ TcpSocket.TcpSocket(socket, inputStream, bufferedReader, outputStream)
Option 2: Our own char buffering (change BufReader to buffer chars)
// currently `Reader`
trait ByteReader[t] {
    type Aef: Eff
    pub def read(b: Array[Int8, r], r: t): Result[IoError, Int32] \ r + ByteReader.Aef[t]
}

trait CharReader[t] {
    type Aef: Eff
    pub def read(b: Array[Char, r], r: t): Result[IoError, Int32] \ r + CharReader.Aef[t]
}
// or maybe one `Reader` with an associated type?

trait LineReader[t] with CharReader[t] {
    type Aef: Eff
    pub def readLine(r: t): Result[IoError, String] \ r + LineReader.Aef[t]
}

pub struct BufReader[t: Type, r: Region] {
    buffer: Array[Char, r], // instead of `Array[Int8, r]`
    ...
    reader: t // now a CharReader instead of (byte) Reader
}

instance CharReader[BufReader[t, r]] with CharReader[t] {
    // read: same as written currently
}

instance LineReader[BufReader[t, r]] {
    // readLine: scans through the char buffer looking for \n
}

struct DecodingCharReader[t] {
    byteReader: t,
    charset: Charset
}

instance CharReader[DecodingCharReader[t]] with ByteReader[t] {
    // read() combines `byteReader.read()` and `charset.decode()`
}

- enum TcpSocket(Socket, InputStream, OutputStream)
+ enum TcpSocket(Socket, InputStream, t, OutputStream)

instance LineReader[TcpSocket[t]] with LineReader[t] {
    // readLine delegates to `t`
}

// Socket construction (in `TcpAccept`)
- TcpSocket.TcpSocket(socket, inputStream, outputStream)
+ let bufReader = BufReader.fromCharReader(DecodingCharReader.fromByteReader(inputStream, utf8));
+ TcpSocket.TcpSocket(socket, inputStream, bufReader, outputStream)
This is the partial implementation that I had already started writing, using the existing BufReader as-is (built on bytes). If we continue building on top of BufReader (buffering bytes), a sketch of the implementation is as follows:
trait BufReader[t] with Reader[t] {
    pub def readUntil(p: Int8 -> Bool, bufReader: BufReader[t, r]): Result[IoError, Vector[Int8]] \ r + BufReader.Aef[t]
}

mod BufReader {
    pub def readLine(charset: Charset, bufReader: BufReader[t, r]): Result[IoError, String] \ r + BufReadable.Aef[t] + IO with BufReadable[t] =
        region rc {
            let builder = StringBuilder.empty(rc);
            let newLine = unchecked_cast('\n' as Int8);
            let carriageReturn = unchecked_cast('\r' as Int8);
            forM(
                bytes <- BufReadable.readUntil(b -> (b == newLine) or (b == carriageReturn), bufReader)
            ) yield {
                let line = charset.decode(ByteBuffer.wrap(bytes)).array();
                foreach(char <- line) {
                    StringBuilder.append(char, builder)
                };
                StringBuilder.toString(builder)
            }
        }
}
The problems with this approach, which triggered me to dive deeper into rust vs java, are:
1. We create garbage via `ByteBuffer.wrap` every line, vs every filling of the buffer.
2. We still have these `unchecked_cast`s to detect newline bytes. Maybe not such a big deal, but it didn't seem ideal either.
But maybe the garbage from 1 is tolerable, in which case this is a non-issue? I don't have a good feel yet for what kind of performance concerns we're aiming for here. If this is too expensive, then it makes more sense to use one of the previous proposals where we decode into the buffer (new garbage for every buffer filling, not every newline).
In Option 1:
> `+ TcpSocket.TcpSocket(socket, inputStream, bufferedReader, outputStream)`
A stream can be either bytes or strings (e.g. images or text). If we have a bufferedReader here, are we not assuming only text?
Is it correct to say that there are two dimensions: Reading bytes and reading chars? Along another axis, reading buffered or unbuffered? Perhaps with the exception that when reading buffered chars we care about reading whole lines?
> In Option 1:
> `+ TcpSocket.TcpSocket(socket, inputStream, bufferedReader, outputStream)`
> A stream can be either bytes or strings (e.g. images or text). If we have a bufferedReader here, are we not assuming only text?
Good point. I guess with this approach, if we're reading non-text we still implement Read via inputStream to read bytes. We also implement LineReader, but it wouldn't make sense for the caller to call readLine. So maybe it doesn't make sense to make TcpSocket implement LineReader directly, for that reason.
> Is it correct to say that there are two dimensions: Reading bytes and reading chars? Along another axis, reading buffered or unbuffered?
Mostly yes, and I think it would be good to keep these concerns separate. Though as you point out, the lines get blurred slightly:
> Perhaps with the exception that when reading buffered chars we care about reading whole lines?
Any sort of readWhile/readUntil (I said readUntil in my proposal but maybe readWhile is more standard) logic has to be built on top of buffering. This includes readLine(), which is equivalent to readWhile(c -> c != '\n'). This is because we have to read some data into memory, iterate through the data until the predicate fails, and then the remaining suffix has to go somewhere, so we need a buffer to put it in.
So maybe it's a bit more accurate to say that if we care about reading whole lines (in some generic sense), we need buffering.
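To make that concrete, here is a rough Java sketch (class and names are mine) of a readWhile built on top of a byte buffer; whatever fails the predicate stays buffered for the next call:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.IntPredicate;

// Sketch of why readWhile needs buffering: bytes read from the stream but not
// consumed by the predicate must be kept somewhere for the next call.
class TinyBufReader {
    private final InputStream in;
    private final byte[] buf = new byte[8192];
    private int pos = 0;
    private int end = 0; // unread data lives in buf[pos..end)

    TinyBufReader(InputStream in) {
        this.in = in;
    }

    // Reads bytes while the predicate holds; the first failing byte stays in the buffer.
    byte[] readWhile(IntPredicate p) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (true) {
            if (pos == end) {             // buffer exhausted: refill from the stream
                int n = in.read(buf);
                if (n == -1) {
                    return out.toByteArray(); // end of stream
                }
                pos = 0;
                end = n;
            }
            int b = buf[pos] & 0xFF;
            if (!p.test(b)) {
                return out.toByteArray(); // boundary byte remains buffered at buf[pos]
            }
            out.write(b);
            pos = pos + 1;
        }
    }
}

readLine would then be readWhile(b -> b != '\n') plus consuming the newline and decoding the collected bytes.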
We could also consider making Read and BufRead polymorphic, like:
trait Read[t] {
    type Elem
    type Aef: Eff
    pub def read(buffer: Array[Read.Elem[t], r], reader: t)...
}

trait BufRead[t] with Read[t] {
    ...
    pub def readWhile(...
}
This would allow us to choose whether to buffer chars or bytes without changing the buffering logic.
> Any sort of `readWhile`/`readUntil` (I said `readUntil` in my proposal but maybe `readWhile` is more standard) logic has to be built on top of buffering. This includes `readLine()`, which is equivalent to `readWhile(c -> c != '\n')`. This is because we have to `read` some data into memory, iterate through the data until the predicate fails, and then the remaining suffix has to go somewhere, so we need a buffer to put it in.
One thing to consider is the platform-dependent newline character, but it should be really simple to account for.
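For reference (a trivial Java detail, not part of the proposal): BufferedReader.readLine() already accepts '\n', '\r', or "\r\n" as line terminators, and the platform default separator is available via System.lineSeparator():

class Newlines {
    public static void main(String[] args) {
        // The platform-dependent separator ("\n" on Unix, "\r\n" on Windows).
        String sep = System.lineSeparator();
        System.out.println(sep.equals("\n") ? "LF" : "CRLF (or other)");
    }
}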
The associated type to Read sounds pretty nice. What would it look like in practice? How much needs to be duplicated?
> The associated type to Read sounds pretty nice. What would it look like in practice? How much needs to be duplicated?
I'm working on a prototype to show what it looks like. I don't think anything needs to be duplicated, just updating the existing Reader to have an associated type, and adding an associated type of Int8 to all existing instances. Then more instances with type Char can be added also. The BufReader which is under review now would be changed to be type agnostic, but it shouldn't change any of the logic or tests.
But with that said, I've made all the requested changes to the BufReader. I think it would be best to merge as-is, close this ticket, and then iterate. I can make some new issues to implement the new design that I'm working on/prototyping.
> I can make some new issues to implement the new design that I'm working on/prototyping.
https://github.com/flix/flix/issues/10964
Let me know when you merge. Once there is a working implementation I can finish Process, and underlying changes to the readLine implementation shouldn't break it.