ocaml Immutable arrays

This PR proposes to add immutable arrays to OCaml. This is a feature originally by @antalsz and has been in use for a bit more than six months at Jane Street, in their fork of the compiler with extensions (below dubbed flambda-backend). Jane Street has asked for help from Tarides to port the feature onto OCaml 5.x.

Motivation

Immutable arrays represent an improvement over regular arrays in several situations:

Benefits on trunk

If a particular use-case does not require mutation — instead needing only random-access, packed memory — then an immutable array conveys this fact more clearly than the use of an ordinary array. The distinction improves safety, like the distinction between string and bytes.
An immutable array improves reasoning properties: we know a function cannot change its contents. This fact can also be used by verification tools.
Immutable arrays can safely be covariant, unlike mutable ones.

As an example use case, @dbuenzli’s uucp and uunf libraries store static Unicode data in some big arrays (among other data structures), themselves comprising sub-arrays. It has proved difficult in the past to compile this efficiently, and some array initialisation remains at runtime, since the optimizer is unable to verify that all the data is constant. Using immutable arrays, the optimizer can be extended to allocate it all statically in a read-only data section of the executable.

In addition, the presence of a standard module:

publicises the possibility of using immutable arrays, encouraging more people to consider using them to write safe interfaces;
avoids having several incompatible implementations of the same feature;
leaves open the possibility to implement more optimisations dedicated to immutable arrays in the future.

Benefits on flambda-backend

In addition to the benefits above, immutable arrays work better with locality modes in use at Jane Street: mutable data structures cannot be safely allocated on the stack as it would allow local variables to escape, but immutable arrays and records can.

Design

The Lambda intermediate language already has immutable arrays; therefore, there is little to add. A mutability flag is added to the typed AST node Texp_array, and a new predefined type 'a iarray is added.

~~Parser support is added for the form [: e1; e2; …; en :]~~ EDIT: Syntactic support has been removed for now. One can use type-directed disambiguation in array literals and patterns:

let a : int iarray = [| 1;2;3;4;5 |]

let () =
  let open Iarray in
  print_int a.:(2)   (* prints 3 *)

let _ =
  match a with
  | [| |]           -> "empty"
  | [| 1;2;3;4;5 |] -> "1--5"
  | _               -> "who knows?"

A new module Iarray — and the corresponding IarrayLabels — is added to the Stdlib, containing all the functions of Array, excluding the mutating functions. The memory representation of immutable arrays is the same as that of mutable arrays, so Iarray reuses the runtime primitives for regular arrays.

Documentation is yet to be written: we thought it made more sense to first see what are the opinions about the design.

Apr 12 '24 16:04 OlivierNicole

Documentation is yet to be written: we thought it made more sense to first see what are the opinions about the design.

My 2c after a quick look are below. Re-reading it I realize it sounds rather negative, but it is not my intention, please bear that in mind :)

The PR seems way too invasive for such a fringe freature. For example, there is a new syntax for immutable arrays, which we all know is a "big change" (eg we have no such syntax for floatarray). Also, we already have array and floatarray and the fact that they share most operations does not scale very well. Are we going to add a third "array" variant? Shouldn't we add immutable float arrays as well? How is that going to scale?

Also, why not do

module Iarray : sig
  (* signature without "set" functions *)
end = Array

and call it a day? I fail to see why this needs special support in the typechecker, pattern-matching engine, middle-end, compiler primitives, etc. Sure, once in a while having immutable arrays would be nice, but is it that common enough that it justifies adding this much code to the compiler?

publicises the possibility of using immutable arrays, encouraging more people to consider using them to write safe interfaces;

I think very few libraries would actually need immutable arrays as part of their public API. Do you have concrete examples?

avoids having several incompatible implementations of the same feature;

This is only a problem if libraries start using this data structure in their public APIs. Do you know of any doing so?

leaves open the possibility to implement more optimisations dedicated to immutable arrays in the future.

I don't think this alone would be reason enough to merge this PR.

Just my view at the moment, would love to hear other opinions! Cheers.

Apr 12 '24 16:04 nojb

To me, there are two separable concerns here, which we may want to debate separately.

Do we want immutable arrays as a first-class concept in the compiler? This means that the middle ends are aware of the immutability, and we expose primitives for conversion between mutable arrays and immutable ones. Furthermore, it means that we can say that iarray is covariant. (This is a big reason we can't just slap a more restrictive signature on top of Array.) To me, the argument here is not so hard: immutable arrays have some advantages of lists (immutability and covariance) and some advantages of arrays (tighter memory footprint and constant-time random access). I find achieving this combination to be somewhat convincing.
Do we want first-class syntax for immutable arrays? This is harder, because it's permanently(*) consuming a finite resource (the number of one-character widgets we can put inside of brackets for different ordered collections). We explored the possibility of not doing this within Jane Street, and the pushback was around the usefulness of pattern-matching to extract array contents, something impossible to do without dedicated syntax.

(*) In anticipation of the debate on this PR (I'm working with @OlivierNicole on the Jane Street side), I've been worried about this consumption of syntax. But in the end, I think it's OK. If we want to revisit this decision later, it seems easy enough to add the ability to bind different array-like syntaxes to different implementations. For example, we could have a module [: .. :] = struct (* the definitions needed to make and match against arrays *) end; when that module is in scope, the use of the syntax [: 1; 2; 3 :] uses the definitions in that module. This is very much like the existing ability to define custom array-access operators. I am not proposing such an addition now. Instead, I'm using this idea as a potential way, in the future, to claw back a decision to consume a little bit of syntax, if we so choose.

I do think the comments such as those from @nojb are very helpful here. We like this design within Jane Street, where I can find 158 jbuild files (our equivalent of dune files) that indicate a dependency on our iarray library, which is how we have exposed the functions in the Iarray stdlib module. The reason we've chosen to try to upstream this feature is that we'd like to start using it in our open-source library APIs. And this is all within the last 8 months or so. But the design constraints and desires of the open-source OCaml community are different than those within Jane Street, and so we welcome the debate about what the best overall design is.

Apr 12 '24 17:04 goldfirere

The PR seems way too invasive for such a fringe freature. For example, there is a new syntax for immutable arrays, which we all know is a "big change" (eg we have no such syntax for floatarray).

I understand the concern for unwarranted big changes. Just to give some sense of the contents of the patch here, the few hundreds of lines of the change (I’m not counting bootstrap commits and added tests) consist of the new Iarray module—essentially copied from Array—, syntax-related boilerplate changes, and straightforward modifications to typechecking and Lambda generation to account for the notion of immutable arrays. In addition, it seems to me that the changes would incur close to no maintenance burden.

Also, we already have array and floatarray and the fact that they share most operations does not scale very well. Are we going to add a third "array" variant? Shouldn't we add immutable float arrays as well? How is that going to scale?

Given that arrays and immutable arrays have the same runtime representation, support for immutable arrays does not incur any new runtime primitives (except for unsafe conversion between the two). It’s true that immutable arrays have their own type, which is part of the feature, and thus their have their own module. But unlike float arrays, immutable array support happens at the type level.

Shouldn't we add immutable float arrays as well?

It could be argued that a new module Float.Iarray should match Float.Array (it is not the case in this PR). If it is deemed useful enough to add, it would result in writing some additional OCaml functions, essentially copied from Float.Array, using the same primitives as Float.Array.

Apr 15 '24 08:04 OlivierNicole

For what it's worth, we've been tempted to write floatiarray, but just haven't gotten around to it. I agree that it make sense here.

Apr 15 '24 19:04 goldfirere

It's a small thing, but [::] is potentially confusing. It looks like a two pieces of list syntax, one inside the other. If gramatically possible, could it be changed to the two-token sequence [: :]?

Apr 15 '24 23:04 johnwhitington

In fact, that’s what it is, it just happens to be valid to leave no space between the two tokens. I agree that it is less confusing with a space.

Apr 16 '24 09:04 OlivierNicole

I believe that offering immutable arrays would be a useful feature. Documenting that an array is immutable helps clarify the code (and can help the compiler optimize it).

In fact, in Osiris, our ongoing program verification project for OCaml, we do intend to (eventually) distinguish immutable and mutable arrays. If this distinction existed in OCaml already, then we would not need to create it at the level of Osiris.

Regarding immutable array literals, I am less interested. Array literals are useful only when the length and content of the array are statically known, which in my experience is fairly rare, but perhaps this depends on the application area. It seems more common to me to 1- allocate a mutable array, 2- initialize it, 3- freeze it.

Apr 20 '24 17:04 fpottier

It is true that the syntax is technically not necessary to create immutable arrays, given our unsafe primitive %array_to_iarray. I see three drawbacks to not having a syntax:

Without a syntax for patterns, it becomes impossible to pattern-match on immutable arrays.
Without a syntax for immutable array literals, optimizing immutable arrays allocation (e.g. making it static) requires some ad hoc rules in the compiler to recognize the “initialise a regular array and then freeze it” pattern.
The unsafe primitive %array_to_iarray is currently not exposed as it allows to circumvent the typechecker. Doing away with the syntax would require to expose it, which, as with all unsafe primitives, opens the risk of introducing bugs with it.

Apr 25 '24 14:04 OlivierNicole

The unsafe primitive %array_to_iarray is currently not exposed

I imagine that this primitive operation will have to be exposed, in the same way that Bytes.unsafe_to_string is currently exposed. Otherwise, the construction of an immutable array would be quite inefficient; one would have to allocate a mutable array, initialize it, then copy it into a fresh immutable array.

Apr 25 '24 14:04 fpottier

To clarify, what I meant was that with immutable array literals and functions like Iarray.init, immutable arrays can be created without ever needing an unsafe conversion primitive, which is a benefit from the point of view of safety (from my point of view, the less one needs to use primitives that can crash the program, the better).

May 02 '24 08:05 OlivierNicole

What are use-cases of immutable arrays in practice, can we discuss some examples? @OlivierNicole mentioned that this could be useful for constant tables, as used in Unicode libraries. What are other applications that would be enticing? (@fpottier, why are you interested in adding specific support for immutable arrays in Osiris?)

You mention exposing immutable arrays in APIs to write safe interfaces. I suppose that the ability to do this depends quite a bit on the conversion mechanisms that exist between mutable and immutable arrays. Can you discuss examples of using immutable arrays to write safe interfaces, and detail the conversion mechanisms?

Two counterpoints could be:

Regarding the analogy with string/bytes: string is a datatype that many application domains are happy to use in an immutable way -- because they are willing to copy a lot of data around. Are arrays really similar?
Regarding safe interfaces: one scenario I can easily picture is that I own a mutable array, I want to pass it to some auxiliary function, but I don't expect that function to mutate the array itself. I would be happy to have the guarantee that this function treats the array as immutable. But this requires being able to view a mutable array as an immutable array, which corresponds to a type of "arrays that I am not allowed to mutate (but someone else may)", rather that a type of "arrays that no one will ever change", and those are two different notions. Which one are you proposing? (Or maybe the distinction does not occur in the JS setting due to the judicious use of local modalities to blur the distinction?)

May 13 '24 12:05 gasche

One example of immutable array would be the arrays that are used in the internal representation of Coq/Rocq terms (for reasons of memory compactness, I believe?): https://github.com/coq/coq/blob/d7d392191a367839b0f5a7772e48b8f24e9f1b3e/kernel/constr.ml#L105

May 13 '24 12:05 gasche

In graphical programming it can be more natural to have arrays of points rather than lists of points (a lot of algorithms fiddle with indices, you want to quickly find out how much of them you have etc.). But then when you pass things around you don't want people to fiddle with your arrays. If arrays were immutable I would perhaps expose the lists in this interface as arrays (and the underlying rings's point lists aswell).

But then arrays are much less flexible to construct, split and merge than lists. I would likely prefer an all round efficient persistent polymorphic vector on the line of RRB vectors in the stdlib (it could be based on immutable arrays, but then it could also be based on mutable arrays)

May 13 '24 13:05 dbuenzli

What are use-cases of immutable arrays in practice, can we discuss some examples? @OlivierNicole mentioned that this could be useful for constant tables, as used in Unicode libraries. What are other applications that would be enticing? (@fpottier, why are you interested in adding specific support for immutable arrays in Osiris?)

Immutable arrays are just one particular implementation of the abstract concept of "immutable sequence", which arguably is fairly useful and common. The upside of immutable arrays is that they are compact and support efficient random access; their downside is that concatenation and extraction of a subsequence are expensive (in terms of time and space). In some applications, these downsides are tolerable. For instance, Menhir uses immutable arrays in many places to represent sequences of symbols.

In Osiris (a program verification system that we are developing), immutable values are much easier to reason about than mutable values. In short, immutable values "are" just values, whereas mutable values are addresses in memory, and one must explicitly use Separation Logic assertions to describe who is allowed to read or write these pieces of memory. Osiris will have specific support for reasoning about "pure" expressions (expressions that do not allocate, read, or write mutable memory). A program that manipulates (primitive) immutable arrays can be considered pure, whereas a program that manipulates mutable arrays cannot.

May 13 '24 14:05 fpottier

I would likely prefer an all round efficient persistent polymorphic vector on the line of RRB vectors in the stdlib

The library Sek, developed by Arthur Charguéraud (@charguer) and myself, is intended to offer an efficient general-purpose implementation of both persistent and ephemeral sequences (that is, both immutable and mutable sequences). That said, we have received extremely little feedback about it, so I do not know whether it reaches its intended target. Also, it is probably too complex to be comfortably integrated in the standard library.

May 13 '24 14:05 fpottier

I could see how to use dynarrays as buffers to programmatically accumulate elements, and then "freeze" them as an immutable array. This might be a reasonable approach for @dbuenzli's GG example where the sequence data structure is used to represent polygons as sequences of rings. (And rings as sequences of points, etc.)

May 13 '24 14:05 gasche

Also, it is probably too complex to be comfortably integrated in the standard library.

Indeed I remember looking at it and finding it too complicated. I would rather like something along the lines of pvec (repo).

May 13 '24 14:05 dbuenzli

What are use-cases of immutable arrays in practice, can we discuss some examples? @OlivierNicole mentioned that this could be useful for constant tables, as used in Unicode libraries. What are other applications that would be enticing? (@fpottier, why are you interested in adding specific support for immutable arrays in Osiris?)

There are a few places in Flambda2 where we use either arrays or lists, but where immutable arrays would have been the best match. I think there are other places in the compiler where immutable arrays would make sense (a good proportion of the lists in the various compiler IRs would work as well as immutable arrays, and for some of them I expect it would be a better fit). It's possible, for instance, that using immutable arrays instead of lists in the typedtree would reduce a bit the size of the cmi files, leading to slightly shorter compilation times for very big applications (when demarshalling cmi files takes a noticeable amount of time).

May 13 '24 15:05 lthls

(Thinking out loud: changing the .cmi format (to use arrays) is fairly painful due to bootstrappin, so I wouldn't do this for an experiment, but it might be possible to just traverse the type to count the total size of lists of interest, and deduce potetial space savings from that.)

May 13 '24 15:05 gasche

Jumping into the discussion with a few comments.

The notion of immutable arrays is useful as a typing information. Sometimes, you'd like the array to be unmodified by a function, but subsequently recover the ability to modify it. For that, you need more than types, something like separation logic permissions. There are also situations where you'd like to separate the allocation from the initialization. Maybe even you'd like to initialize the cells of the array one-by-one, e.g., for filling a stack/vector. Thus, introducing an ML type for immutable arrays will be helpful in some case, but introduce yet more frustration in many other cases.
The notion of immutable arrays is useful to enable compiler optimizations. In particular, as already mentioned by others in the discussion, you can turn dynamic heap allocations into static allocations or into stack allocations. I speculate that such optimizations would apply almost exclusively to the arrays that can be assigned the ML type for immutable arrays, and would not apply to more complex scenarios with read-only arrays.
The implementation choice for immutable arrays is critical. Of course, you do need as core building block an implementation of an array as a single block of contiguous cells. But then, using such basic blocks, you can build various tree-shaped data structures that have the same interface as immutable arrays, but better complexity for certain operations. There is a whole range of structures with different trade-offs, depending on the arity of your tree. You can have an array of sqrt(n) arrays of size sqrt(n). Or if you fix the branching factor, e.g. 64, then you can have a complete tree with fairly small depth. If you organized your tree properly, you can even keep constant time access to the edges of the sequences, as done in Sek. Or, again as done in Sek, you can have the internal representation vary depending on the length the sequence. So, please keep in mind that you'd like to make a family of implementations available. I would advise against binding a new syntax to one specific implementation.
Regarding syntax, I would also suggest avoiding introducing new constructors in the parse tree. Afaik, doing so breaks all ppx modules. Here is a suggestion, which I believe would work, without being too ugly.

type 'a iarray = IArray of 'a array

val array_to_iarray : 'a array -> 'a iarray

let my_iarray = IArray [| 1; 2 |]

let _ = 
   match my_iarray with
   | IArray [| a; b |] -> a

First, you would extend the typechecker to check that IArrayBlock is only applied to array literals. Second, you extend the compiler to eliminate all reference to the constructor IArrayBlock and instead refer to your custom representation. I speculate that this approach could be later generalized to support not just one but several possible immutable array representations, using not just one constructor IArray but a family of such constructors.

May 14 '24 08:05 charguer

Another remark: Batteries has long exposed a design of "arrays with read/write capabilities" as the submodule BatArray.Cap (documentation). (As far as I know, there are few users of this feature; a quick search suggests one, two.) This design is polymorphic on the mutability, the type is ('a, 'mut) t. In the case of Batteries, 'mut is a polymorphic variant [< `Read | Write ], but this could also be done with GADTs today.

Having the mutability information as a parameter is strictly more expressive than the current proposal, in particular it makes it possible to express that a function will not mutate an argument that may otherwise be mutable, by having the type f : [< `Read | `Write ] t -> ... -- exposing that any subset of capabilities is accepted. This is different from requiring f : [ `Read ] t -> ..., which enforces that the input argument is read-only for everyone. (Constant literal arrays in read-only memory are still correct at this restricted type [ `Read ] t.)

May 14 '24 08:05 gasche

Having the mutability information as a parameter is strictly more expressive than the current proposal

In my experience these kind of phantom types are also strictly more annoying to work with in practice.

May 14 '24 08:05 dbuenzli

Having the mutability information as a parameter is strictly more expressive than the current proposal

In my experience these kind of phantom types are also strictly more annoying to work with in practice.

And yet, if you're working with mutable data structures, this kind of thing is very useful in giving you guarantees that approach those of immutable data structures. This is similar to C++'s const mechanism and I believe Rust's concept of mutability is built entirely on this concept. Without using phantom types for this purpose, OCaml's support for mutable data structures is quite inferior to C++'s.

May 14 '24 08:05 bluddy

And yet, if you're working with mutable data structures, this kind of thing is very useful in giving you guarantees that approach those of immutable data structures.

Not really. Personally I'm more interested in knowing that an array is immutable than if I'm allowed to mutate it. These are different things.

May 14 '24 08:05 dbuenzli

The notion of immutable arrays is useful to enable compiler optimizations

Indeed, and in particular, CSE: if the array a is immutable, then multiple occurrences of a.(i) can be shared, even if there are function calls in between.

May 14 '24 08:05 fpottier

The implementation choice for immutable arrays is critical. Of course, you do need as core building block an implementation of an array as a single block of contiguous cells.

Regarding syntax, I would also suggest avoiding introducing new constructors in the parse tree.

I believe that we should separate the discussion of primitive immutable arrays, implemented as a contiguous block of memory, and the discussion of various kinds of "immutable sequence" data structures, which can be packaged as separate libraries.

That said, regarding syntax, you are right: we should strive to avoid the proliferation of array access syntaxes, and we should allow libraries to define their own array access syntaxes, if possible.

May 14 '24 09:05 fpottier

Regarding the distinction between "truly immutable arrays" (immutable for everyone) and "temporarily immutable arrays" (immutable for me, but possibly mutable for someone else; or immutable now, but possibly mutable later), I believe it would be a good thing to discuss just the simpler proposal, namely having a primitive type of truly immutable arrays.

In my opinion, the example of string versus bytes shows that this simple approach can work quite well in practice. Is there any evidence that users have complained about the strict separation between string and bytes?

May 14 '24 09:05 fpottier

I'm making the distinction because I can easily think of use-cases for "temporarily immutable arrays", but I have a harder time thinking of use-cases for "truly immutable arrays". A few have been mentioned here.

(I forgot to make a humorous point. @fpottier mentions that they are considering immutable arrays in Osiris because they are much easier to specify and use in proofs. This reminds me of the joke about searching for keys under the street light.)

May 14 '24 11:05 gasche

This reminds me of the joke about searching for keys under the street light.

Point taken, but there is something to be said in favor of searching under the street light...

May 14 '24 11:05 fpottier

but I have a harder time thinking of use-cases for "truly immutable arrays".

Just think about your uses cases for lists :-)

Is there any evidence that users have complained about the strict separation between string and bytes?

One of the problem in the current state of affairs is that there is no good way to turn a bytes into a string without incurring a copy or using unsafe functions.

Of course an abstraction could be provided for that it just doesn't exist for now[^1] and I'm slightly uncomfortable each time I allocate a bytes value, modify it and then return it as a string using Bytes.unsafe_to_string – even though the safety is easy to assert that's not what OCaml was supposed to be to me.

Other than that mismatches between what you have and what an API asks for can be annoying. Something which R/W modalities do solve, but phantom types tend not to be very usable (unwieldy error message, modality propagation in your definitions, less trodden path of variant subtyping syntax), I'm wondering if that's something that wouldn't be better solved by something built-in the language rather than encoded. Note however that this particular problem is not limited to bytes and string, bigarrays of bytes also comme into play. So many byte sequences to choose from.

[^1]: And I'm not sure it's worth adding one. Perhaps Buffer could be cajoled into that, if we stop presenting it as an append-only structure and provide a few guarantees. Same for iarrays, perhaps Dynarray could provide enough guarantees, to offer this (make n has a backing array of n, if I you don't exceed n then Dynarray.to_iarray gives me a zero-copy iarray, and resets its backing array to an empty array).

May 14 '24 12:05 dbuenzli