[FEATURE] File I/O
File IO is obviously really important.
I'm interested in working on this, and it seems like doing it the "wrong" way (just getting something working and replaceable) is doable. But I want to open up a little discussion on what the "right" way to handle file IO would involve: the endgame, so that users would be able to implement the same types of things in library code.
I have been thinking about this. Here is a bag of thoughts I have. I don't think all of these ideas need to be in version 0.0, but we need to pick an approach that leaves room for all these things.
Three points to start that I think constrain the design:
One, Emily should be a language which understands modern notions of what file access is. That is: It's not guaranteed the file system will exist. (You could be in JavaScript in a browser.) It's not guaranteed you have access to the whole filesystem. (You could be on iPhone, or subject to some kinds of access restrictions the user chose.) You could have access only through some limited get-a-handle mechanism. (You could be sandboxed on OS X, or in a browser with the JS file extensions.) Ideally, the approach should eventually accommodate all these situations-- if the way to get a file is a global fopen() that takes a path string, we did something wrong.
Two, I haven't adequately documented this yet. But at the moment, Emily has a setup where all interactions with "magic" things— things only the VM can do— live in an object named "Internal". In the current VM you can see this in internalPackage.ml. The objects in the stdlib are implemented by using the members of "internal". The point is to require the vm (the implementation) to contain as little complexity as possible. Only the implementation details that have to live in the VM get put in the VM.
Three: Strings are unicode. Period.
With these things in mind, here's what I think an implementation eventually needs, listed in order from most feasible to least feasible. For an initial implementation, it would make sense to cover just the minimum set of these for usability.
- At the user level, there should be a kind of object which can be queried to get a file object given a path/key. The initially available object should just query the global filesystem. (Eventually more objects should exist which query, say, a documents folder or a files-bundled-with-current-application folder. Windows and Mac OS currently have APIs for these.)
- At the user level, there should be some kind of stream object (file object) that you can push content into (eventually this could mean either a stream of unicode strings or a stream of bytes). You should be able to get these for a path out of the filesystem object. You should be able to create one of these which reads from a string, or one which collects written content so you can extract a string afterward. There should be builtin stream objects for STDOUT, STDERR and STDIN (although I've no idea what those do in the JavaScript world?!).
- There should be some way to create convenience functions similar to the builtin "print, println, printsp" for any stream object.
- These objects should be based on some really minimal interface on the "internal" object. Maybe there's something like an fopen that returns a magic cookie that can be used to implement the other functions (an ocaml function or a new type of Value, maybe). (A rough sketch of what that minimal interface could look like follows this list.) Although I think we should try to put some effort into forward compatibility in the stdlib, I do not think we should need to think about forward compatibility in the internal object (that is, if the interface on "internal" changes totally between 0.3 and 0.4, I won't feel bad at all, because in theory nobody uses those functions except the stdlib).
- Long-term goals, when extracting data from a file reader: several languages have super awkward things like "scanf" or the cin >> object that allow you to get "typed" data out of a string or file. I would like to have something like this that rests on top of the minimal stream object and is really more like a lexer than anything, say like sedlex in the ocaml implementation, where you give it "next thing" patterns and it matches them. I do intend to start implementing language features inside Emily itself, so we're going to need good support for lexing/parsing kinda stuff at a pretty low level! :)
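For concreteness, here is a minimal sketch of what the "internal" file surface from the fourth bullet could look like, written as an OCaml signature. The names and the idea of an abstract handle type are illustrative assumptions, not the current contents of internalPackage.ml.

module type INTERNAL_FILE = sig
  type handle                              (* the opaque "magic cookie" *)
  val fopen  : string -> string -> handle  (* path and mode, returns a handle *)
  val fread  : handle -> int -> string     (* read up to n bytes *)
  val fwrite : handle -> string -> unit    (* write a string through the handle *)
  val fclose : handle -> unit
end

The point of the abstract handle type is that nothing outside the stdlib would need to know how the handle is represented.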
Does this all make sense?
Stray implementation thoughts:
- I refer above to "strings" and "bytes". There is no notion of a bytestring object currently. I'm only just now working on the string object. Probably all IO being strings to start would be acceptable.
- It might be a good idea for me to add an enum-type object if we're going to meaningfully communicate something like an fopen function (sketched below).
- I have no idea how errors should be reported at the current time :(
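A hedged sketch of the enum thought above, in OCaml terms: a small variant for open modes rather than a bare mode string. The names are hypothetical, and the Emily-level enum object would presumably mirror something like this.

type open_mode = Read | Write | Append | ReadWrite

(* map a mode to the Unix open flags it implies *)
let unix_flags_of_mode = function
  | Read      -> [ Unix.O_RDONLY ]
  | Write     -> [ Unix.O_WRONLY; Unix.O_CREAT; Unix.O_TRUNC ]
  | Append    -> [ Unix.O_WRONLY; Unix.O_CREAT; Unix.O_APPEND ]
  | ReadWrite -> [ Unix.O_RDWR;   Unix.O_CREAT ]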
Re: bytestrings, that makes me wonder about how to handle encoding/decoding from/to Unicode strings. Making it explicit, e.g. (file.read 12).decode "utf-8" or "naïveté".encode 'utf-8', might be for the better. Implicit decoding and encoding, like Python 3 does for parts of file I/O, would probably lead to pain. Assuming a specific encoding is particularly bad. Force people to specify.
Now, of course, you don't want people to rely on ASCII bytestrings. So don't include any text manipulation functions for bytestrings. ;)
The current wrong way I'm doing it is implementing an internal basic file interface, which basically consists of:
fopen fname mode : string -> string -> float (* returns a file handle as a float *)
fseek fd location : float -> float -> ()
fread fd count : float -> float -> string (* returns up to 'count' bytes from the file, as a string *)
fwrite fd data : float -> string -> () (* writes the string into the file *)
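For illustration, here is one way these four calls could be backed by OCaml's Unix module, using the float-encoded handle described next. The fd/float conversion leans on the Obj.magic trick quoted further down this thread, so treat this as a sketch of the current "wrong way", not a settled design.

let fd_of_float (f : float) : Unix.file_descr = Obj.magic (int_of_float f)
let float_of_fd (fd : Unix.file_descr) : float = float_of_int (Obj.magic fd)

let fopen fname mode =
  let flags = match mode with
    | "r" -> [ Unix.O_RDONLY ]
    | "w" -> [ Unix.O_WRONLY; Unix.O_CREAT; Unix.O_TRUNC ]
    | _   -> failwith "unrecognized mode" in
  float_of_fd (Unix.openfile fname flags 0o644)

let fseek fd location =
  ignore (Unix.lseek (fd_of_float fd) (int_of_float location) Unix.SEEK_SET)

let fread fd count =
  let buf = Bytes.create (int_of_float count) in
  let n = Unix.read (fd_of_float fd) buf 0 (Bytes.length buf) in
  Bytes.sub_string buf 0 n

let fwrite fd data =
  ignore (Unix.write_substring (fd_of_float fd) data 0 (String.length data))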
We don't need any new value types because the file handles are integers, and we can store that integer in a 64 bit float with no loss of precision. This would all work in conjunction with an interface written in Emily code, which I think can be done in terms of only these four functions. The Emily object could use these internal functions differently depending on whether you're in binary mode or text mode. (And we do want a binary mode and not just a unicode mode, so we can parse files and images and things like that.) The actual open function in Emily shouldn't return a float, it should return an object with these methods.
Some of the behaviors can be more complicated. For instance, in text mode the Emily side could check the string returned from fread and keep performing incremental reads so that it reads a number of characters instead of a number of bytes. If you read n bytes you are guaranteed to get no more than n characters; if you get fewer, you can just read more bytes until you have n characters.
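To make the text-mode idea concrete, here is a sketch of that incremental read loop in OCaml terms; in the proposal it would actually live on the Emily side, on top of the internal fread. count_utf8_chars is a hypothetical helper that counts complete UTF-8 characters in a byte string.

(* Keep reading bytes until the buffer holds n complete characters or the file ends. *)
let read_chars (fread : float -> float -> string)
    (count_utf8_chars : string -> int) (fd : float) (n : int) : string =
  let buf = Buffer.create n in
  let rec fill () =
    let have = count_utf8_chars (Buffer.contents buf) in
    if have < n then begin
      (* n bytes can hold at most n characters, so request the shortfall in bytes *)
      let chunk = fread fd (float_of_int (n - have)) in
      if chunk <> "" then begin Buffer.add_string buf chunk; fill () end
    end
  in
  fill ();
  Buffer.contents buf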
"We don't need any new value types because the file handles are integers, and we can store that integer in a 64 bit float with no loss of precision."
This sounds like a really bad idea. File handles should be opaque.
@porglezomp, that sounds sensible to me— these are all functions in the "internal" object, right?
Also, blocking I/O is not going to work well in the 21st century.
Also the text conversion (unicode interpretation) might need to be before the Emily part— it seems like we'd need some more machinery than we have now (a concept of byte strings, a way to parse unicode) for the unicode conversion to occur on the Emily side.
@whitequark recommended the "Uutf" (convert binary to/from unicode) and "Uucp" (unicode character classes) libraries for handling Unicode, and they look really good— I'm thinking about putting Uutf and maybe Uucp in 0.3. (Still trying to decide whether to keep using OCaml strings for strings or whether to switch to a codepoint array.) If we had Uutf on the OCaml side, we could transparently convert to unicode strings on file read.
EDIT: Oh, and Sedlex already has a lot of what Uutf/Uucp do built into it, but it isn't really documented for external use and I'm still totally unable to make sense of uncommented ocaml code :/
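For a sense of what the Uutf-based conversion could look like on the OCaml side, here is a minimal sketch that decodes a UTF-8 byte string (say, the result of an internal fread) into a list of code points. It assumes the Uutf decoder API as documented; where exactly this would plug into the VM is still an open question.

(* Decode UTF-8 bytes into code points, replacing malformed sequences with U+FFFD. *)
let codepoints_of_utf8 bytes =
  let dec = Uutf.decoder ~encoding:`UTF_8 (`String bytes) in
  let rec loop acc =
    match Uutf.decode dec with
    | `Uchar u     -> loop (u :: acc)           (* one decoded code point *)
    | `Malformed _ -> loop (Uutf.u_rep :: acc)  (* replace bad bytes with U+FFFD *)
    | `End         -> List.rev acc
    | `Await       -> List.rev acc              (* won't happen with a `String source *)
  in
  loop []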
@TazeTSchnitzel The goal right now is to get Emily to the point of at least being a credible toy language— zero file IO is a huge limitation and prevents bootstrap-y things like calling through to a C FFI. But yes, the stream/file objects we eventually present to the user will need nonblocking reading options.
Oh sure, but what you add is hard to later remove.
As far as I am aware, we are right now talking about the implementation of the "internal" object, i.e., the interface between the interpreter and the stdlib. This is not a stable interface and if it is being accessed from outside the stdlib then something is wrong.
Hmm, fair enough.
A bigger issue might be portability. Are file handles always 32-bit? I'm including non-UNIX systems here.
sizeof(HANDLE) == sizeof(void*) on Windows. Which, in practical terms, is only going to be 4 on ARM tablets these days; elsewhere Windows is 64-bit.
@porglezomp, here's a question-- where are you even getting an integer filehandle from inside ocaml?
That's the nasty part right now. It's apparently standard to do
let fd_of_int (x : int) : Unix.file_descr = Obj.magic x
let int_of_fd (x : Unix.file_descr) : int = Obj.magic x
because the Unix.file_descr is always going to be internally represented as an integer. (Also note that even though the module is called Unix, it still works on Windows.)
That raises the question of why you need to expose that nastiness to Emily anyway. Can't you hide it behind a method on an object?
I mean, even for the internals, would it not be better to have fopen return an object with read, write and close methods? That way you don't expose the raw file handle at all.
If you're talking about the nastiness of the conversion, that is all hidden from Emily.
But if you're talking about the nastiness of integer file handles, that's the question. I was planning to have the standard interface hide all that, so you do only get an open that returns an object, but I do think there's merit in eventually exposing the guts to the user as an alternative. If you give someone access to the spookyScaryOpenFunction then they can write their own file abstractions. If they're harder to use than the normal file methods, most people won't bother to use them anyway.
Well, no, they can't write their own file abstractions. That file handle integer is only as useful as the internal functions exposed by Emily that accept it. Outside of those functions, it has no meaning.
I was just saying that the methods in the object maybe shouldn't map 1:1 with the file operations, because dangerous things, but now I'm thinking you're right.
But this brings me to another point: there aren't people writing big systems with Emily yet, so do we want to limit ourselves to making stable interfaces when we're just heading toward 0.3 right now? Rust had lots of stdlib functions changing right up to 1.0; we don't have to go that far, but I think it's easier to develop good libraries if we're not constrained by having our first try be our final try—not yet at least.
Stable interfaces don't really matter within the internal thing. I just really don't think integer file handles are a good idea, even there. Hide it with a method.
"I mean, even for the internals, would it not be better to have fopen return an object with read, write and close methods? That way you don't expose the raw file handle at all."
@TazeTSchnitzel So… I feel like you're kind of jumping in and picking some very very small details without even understanding the system we're talking about and I feel like it's not helping this discussion? First off, I do not agree that it matters if the thing exposed to the stdlib is "opaque" or "transparent" if "transparent" means only that you can access a meaningless numeric filehandle value. The return value in this context can be a filehandle entity, which incidentally happens to be implemented as a float in this one interpreter. Second off, "An object with read, write and close methods" is not a meaningful thing to say in this context. Objects are implemented in Emily.
As far as the portability thing goes though, yes that is a pretty big problem-- we should not do anything which could in principle break Windows support. :( @porglezomp, since it doesn't appear we have any builtin types which can losslessly represent the integer filehandle, maybe it would be a better idea to add a FileHandle constructor to the value type in Value.ml? That would avoid having to use Obj.magic D: D: D: and allow us to represent the value in whatever form the file methods actually use it in.
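A rough sketch of that suggestion, with made-up constructor names standing in for whatever Value.ml actually contains: the handle rides inside the value type and is opaque to Emily code, so no float encoding or Obj.magic is needed.

type value =
  | Null
  | FloatValue of float
  | StringValue of string
  (* ... the existing constructors ... *)
  | FileHandle of Unix.file_descr  (* opaque to Emily code; only internal functions unwrap it *)

(* An internal fopen could then return the handle directly. *)
let internal_fopen (fname : string) : value =
  FileHandle (Unix.openfile fname [ Unix.O_RDONLY ] 0o644)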
"So… I feel like you're kind of jumping in and picking some very very small details without even understanding the system we're talking about and I feel like it's not helping this discussion?"
Possibly, I'm sorry if I'm being unhelpful.
"First off, I do not agree that it matters if the thing exposed to the stdlib is "opaque" or "transparent" if "transparent" means only that you can access a meaningless numeric filehandle value."
The main issue with that is just portability. I'm suggesting you could avoid doing some weird type-casting thing to transform a file handle into an integer, by simply not exposing a handle at all.
"Second off, "An object with read, write and close methods" is not a meaningful thing to say in this context. Objects are implemented in Emily."
Ah, okay. Can an internal function not return something that acts like an object, though? An object is just a function, right? So can you make an internal function return a function that can handle .read, .write and .close? That way, you avoid the handles issue by not having a handle exposed at all.
@mcclure even though a HANDLE uses 64 bits, you can't actually create all of those values. It appears that Windows is never willing to give you more than 2^24 handles (looking at this), because of how the table registering all the file handles is structured.
I'm not 100% sure on this though, so I could add a new value constructor. My objection here is that it just seems strange to have it as a special type in Emily.
@porglezomp That makes sense, but unless we can literally find a note in the documentation for either ocaml-Unix or Windows that only the bottom 52 bits of a filehandle value are utilized, I'd still consider it unsafe. Without a promise in the documentation, Windows would be free to, for example, start using the top bits abruptly in Windows 11.
I would also like to minimize the number of special Value types required, but I would also like to avoid using Obj.magic unless really necessary. I've not been writing OCaml that long, but my sense is that anything that invokes the Obj module is no longer guaranteed to be memory-safe… We have room for about another 100 base types in Value, so I think stealing a few for operating system abstractions is acceptable (although if the language already had something like an intptr_t, maybe I'd feel differently). I'm probably going to add a base Value type or two for the C FFI when I get there.
I think given the balance of concerns the new value type is the way to go.
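As a quick check on the precision point above: a 64-bit float represents every integer up to 2^53 exactly, so a handle index bounded by 2^24 round-trips through a float without loss, while an arbitrary pointer-sized HANDLE value need not. A two-line OCaml illustration:

let round_trips (n : int) : bool = int_of_float (float_of_int n) = n
(* round_trips (1 lsl 24)        is true  -- fits in a float exactly *)
(* round_trips ((1 lsl 53) + 1)  is false -- rounds on a 64-bit build *)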
If you don't want to go for the object idea, you could do what PHP does here. It has a "resource" type, an opaque value used to safely pass around file handles, pointers and so on.
To explain a bit more: "Resource" is basically completely opaque to PHP code. You can't really see what's inside, nor change it; the only thing you can find out is its number (whether it's the first, second, third etc. resource created). Inside the guts of PHP though, that resource contains some pointer or what have you. So you can pass it to fopen() and fclose() and they can work with it.
The nice thing about "resource" is that it can be freely repurposed for anything. I think that'd be useful to Emily, because then you don't need to add a new value type for each kind of handle. Just add a single opaque one.
I don't know how possible that would be, though. :/
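For what it's worth, here is a minimal sketch of how a single opaque "resource" constructor might look on the OCaml side, using an extensible variant so new kinds of handle can be registered without touching the value type again. The names are hypothetical.

type resource = ..                        (* extensible: new resource kinds can be added anywhere *)
type resource += File of Unix.file_descr

type value =
  | FloatValue of float
  | StringValue of string
  (* ... the existing constructors ... *)
  | Resource of resource                  (* opaque to Emily code, like PHP's resource type *)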