CompCert icon indicating copy to clipboard operation
CompCert copied to clipboard

RFC: stable memory block names for globals

Open jeremie-koenig opened this issue 7 years ago • 6 comments

These changes introduce a more structured namespace for memory blocks: a block id can either be a global ident (new), or a dynamically allocated positive (as before). Using this approach, we can ensure that global environments always assign the same block name to a global identifier id, so that for all ge1, ge2 we have:

Genv.find_symbol ge1 id = Some b1 ->
Genv.find_symbol ge2 id = Some b2 ->
b1 = b2.

Rationale

In all work around compositional semantics, we need to ensure that the memory blocks for globals are named consistently across modules. For instance,

  • In CompCertX, we force programs to follow a predetermined structure so that the same identifiers will always be mapped to the same block ids.
  • Compositional CompCert assumes that the global environments of linked modules are consistent, arguing that we can always add external definitions to ensure that's the case.

Our changes should make it possible to eliminate these hacks/assumptions.

Status

We have updated all of CompCert v3.2 to use this new naming scheme.

We plan to use these changes in our own ongoing work on compositional compilation, and we may tweak them further over the coming weeks. However, since these changes have a broader potential applicability and we're hoping for them to be merged into CompCert at some point, we want to collect some feedback as we go forward with that plan.

Notes:

  • The new namespace is specified as an abstract Module Type, making it easy to explore richer namespace structures in the future.
  • In Initializers.v, it is no longer necessary for constval/transl_init to abuse Vptr to store the global identifier of referenced variables, since it is now possible to use a pointer value that makes sense.
  • Since global pointers are preserved by compilation, it is now possible to use val directly in events, as demonstrated by our globmem-eventval branch, which defines eventval := val.
  • With the uniform block naming convention, it should be possible to define a linker for global environments corresponding to the existing linker for programs, and to relate new notions of semantic composition to the already implemented syntactic composition.

Known remaining issue:

  • Our implementation of BMap using EMap may be problematic for the performance of the interpreter. This can be easily fixed with some more work.
  • Since block names for globals are predetermined, it might be advantageous to replace find_symbol with a simpler has_symbol boolean predicate (for example), or remove it entirely.

jeremie-koenig avatar Feb 05 '18 21:02 jeremie-koenig

Thanks for the proposal. I haven't had time yet to study the diff but will do soon. @andrew-appel, @jhjourdan, @robbertkrebbers, any opinions on this proposal?

xavierleroy avatar Feb 10 '18 16:02 xavierleroy

We will discuss this in the VST group and get back to you.

andrew-appel avatar Feb 10 '18 19:02 andrew-appel

Today we (@andrew-appel, @lennartberinger, @scuellar, @jeremie-koenig, @pwilke) discussed this by phone. We agree that the proposal addresses part of the problem, but does not address the problem of consistent (across compilation units) mapping of ASCII strings ("names") to AST.ident. We believe that very possibly this pull request could be a useful part of the solution to that larger problem, but we are still studying and discussing it.

andrew-appel avatar Feb 15 '18 19:02 andrew-appel

One thing @pwilke and I did not think to mention this during the call: in this pull request, the ASCII names <-> AST.ident mapping does in fact show up, because the interpreter needs to be able to print out pointers:

  • In BlockNames.v we use an axiom ident_to_string: ident -> string to implement a Block.to_string : block -> string function;
  • In extraction.v we realize this axiom as (fun i -> Camlcoq.coqstring_of_camlstring (Camlcoq.extern_atom i));
  • In Interp.ml we use Block.to_string in the definition of format_value.

This allows us to print out a global pointer &globvar[12] as <glob:globvar+48>, and a dynamic one &localvar as <dyn:123+0> (say). Previously both would have been printed out using numerical block-ids. We could move the definition of ident_to_string to a more central place if we want to make that a more general feature or use it in other ways.


This does not really address the problem mentioned by @andrew-appel, which I understand as: What if I want to use strings that never went through the parser, therefore are not in the map? This happens anytime you use CompCert as a library rather than a self-contained compiler: in VST, when you use the output of clightgen; likewise in CertiKOS, our "source-ish" code is Clight/Asm ASTs in .v files.

VST/CertiKOS use things like:

Definition foo : ident := 1%positive.
Definition bar : ident := 2%positive.
...
Definition myast : function := ... foo ... bar ...

Instead, we would want to use something like:

Definition foo : ident := ident_of_string "foo".
Definition bar : ident := ident_of_string "bar",

or perhaps even declare ident_of_string as a coercion and use strings directly when defining myast. This way we would never again need to mess around with potentially inconsistent tables of names <-> ident mappings in VST and CertiKOS, and there would never be a problem putting together modules that have been developped independently.


I believe we can accomplish this in the following way:

In the Coq part of CompCert, we axiomatize ident_to_string and string_to_ident as a predefined, but unspecified bijection between strings and ident's. Such a mapping does exists, but any particular one will be inefficient in general (the strings we do happen to use in a given case will correpond to sparsely distributed, large numbers).

But since we're free to use any such mapping, in the ML part, we can simply implement string_to_ident using a stateful process that allocates small numbers on-demand (as is already done). The key thing here is that although string_to_ident will be effectful, this can never be observed from the outside. In effect, we progressively "discover" the specific mapping we want to use.

Using this approach, we don't need to deal with the statefulness of the table in Coq, but still get the benefits. I believe implementing this can be done independently of our changes to the block namespace.

jeremie-koenig avatar Feb 15 '18 23:02 jeremie-koenig

Here's a proof-of-concept https://github.com/CertiKOS/compcert/commit/85d2b87ccd1202bf9041abf662bce0768fcc7349

jeremie-koenig avatar Feb 16 '18 04:02 jeremie-koenig

At this point I thought I'd write some kind of "diff walk-through" to facilitate any digging people would want to do.

The main changes occur in:

  • common/AST.v declares the string_of_ident parameter (realized together with string_of_positive with Caml code in extraction.v). This is needed in BlockNames.v to print out block names as strings.
  • common/BlockNames.v is a new file which defines the interface and implementation of the new block namespace. It includes an empirical eauto-based blomega tactic that can discharge most of the inequalities about blocks that xomega used to handle.
  • common/Values.v now defines block := Block.t, rather than positive.
  • common/Memtype.v is modified to reflect changes to the memory model interface:
    • There is a new invariant that Block.init <= nextblock m.
    • There is a new operation alloc_at used for allocating global blocks by name.
    • inject_neutral no longer needs a threshold (Block.init is always what we want).
  • common/Memory.v implements those changes.
    • Note that memory states now use BMap rather than PMap. In the most recent version BMap is implemented using IMap through an embedding of the new block type into positive, meaning the performance implications for the interpreter should be minimal.
  • common/Globalenvs.v is somewhat simplified:
    • genv_next is no longer necessary (the constant Block.init now plays this role)
    • genv_defs is now indexed by AST.ident values rather than block
    • the remaining maps and invariants in Genv.t are no longer necessary (they can be simple definitions and lemmas based on genv_defs and block operations); only genv_public remains
    • init_mem is modified to use alloc_at

Less remarkable changes are:

  • common/Maps.v: make sure the MAP interface has everything needed of BMap, and make sure EMap and IMap implement it.
  • cfrontend/Initializers(proof).v has one hack pleasantly removed, whereby ident values would be stored as block in pointers (now stored as Block.glob i).
  • driver/Interp.ml now uses block_compare in state comparisons, and Block.to_string to print out pointers.

The rest of the patch consists of replacing positive operators, predicates, lemmas, and tactics by their new block equivalents --- a set of numerous but fairly trivial changes. The ~550 new lines introduced by the patch mainly are BlockNames.v and the new Mem.alloc_at.

jeremie-koenig avatar Feb 20 '18 00:02 jeremie-koenig