typed-racket icon indicating copy to clipboard operation
typed-racket copied to clipboard

Reduce expansion from type and contract generation.

Open samth opened this issue 8 years ago • 40 comments
trafficstars

This reduces zo size in plot-gui-lib by about 11x.

samth avatar Oct 17 '17 15:10 samth

This reduces zo size in plot-gui-lib by about 11x.

Whoa

jackfirth avatar Oct 17 '17 17:10 jackfirth

@samth can you say a little bit about high-level strategy here? I'm of the naive opinion that when one writes a module in typed racket that exports, say, identifiers f and g with types T S, then there could be another module made at that same point that contains, roughly, (provide (contract-out [f T] [g S])) and then at each reference to the typed module from an untyped context, there doesn't need to be any contracts generated at all.

I'm clearly missing something (possibly not so) subtle and I'm definitely not on the critical path, but I am curious how wrong my understanding is, so if you have the time to fill me in, that'd be great.

rfindler avatar Oct 17 '17 21:10 rfindler

@rfindler What you describe as your naive opinion is in fact the case currently, although it's a bit more complicated than that. Roughly, a module like this:

(module m typed/racket
  (provide f)
  (: f : Integer -> Integer)
  (define (f x) (+ x 5)))

currently expands to the following (with a few simplifications):

(module m typed/racket
  (provide (rename-out [f* f]))
  (define (f x) (+ x 5))
  (define-syntax (f* stx) (if typed-context? #'f #'contracted-f))
  (module* #%contract-defs #f 
    (define contracted-f (contract (-> exact-integer? exact-integer?) f 'typed 'untyped))
    (provide contracted-f))
  (add-mod! (#%variable-reference))
  (begin-for-syntax
    (module* #%type-decl #f 
      (hash-set! types-of-defined-things #'f (make-Function
                                              (list (make-BaseType 'Integer)) 
                                              (make-BaseType 'Integer)))))

Note that I've left out how the f* macro gets access to contracted-f, the answer involves some trickery. Also, contracted-f is defined using the same machinery that implements contract-out so that it gets the blame right etc.

One question you might have at this point is why all the submodules. First, the #%contract-defs submodule means that you don't need to load and execute the contracts if you don't need the contracted version of the f. Second, the #%type-decl submodule means that you don't need to construct the types and initialize the hash table if you don't need that (for example, when just running the module after it's been compiled).

Another question you might have is what does add-mod! do. Well, since we stashed the hash table initialization in the #%type-decl submodule, we need to actually do that initialization in order to be able to type check other modules that might refer to f and need to know f's type. To do that, add-mod! registers the name of this module in a list which is created for the type checking of that other module. Then the type checker for the other module iterates over the list and dynamic-requires every module in it (or really the #%type-decl submodule), running the hash-set! calls and populating the environment. If the module requiring m isn't typed, then no one looks at the list, and those modules aren't required.

So, where does duplication come in? Well, imagine that you had two modules, both of which have an (-> Integer Integer) function provided from them. The #%contract-defs modules for each of them will have basically the same (-> exact-integer? exact-integer?) contract, which is duplicated work, and if that actual contract was very large, we'd have enormous compiled files, which is roughly what currently happens.

That's the background, and in the next comment I'll write about how this PR changes things to reduce duplication.

samth avatar Oct 17 '17 21:10 samth

I see I was not clear in my writing. From what I can tell, the compiled version of this program:

#lang typed/racket
(require typed/racket/gui)
(provide f)
(: f (-> (Instance Frame%)))
(define f
  (λ () (new frame% [label ""])))

contains an entire copy of the Frame% contract, instead of a reference to that contract.

(I will note that this is not like the duplication you discuss; I didn't write out the Frame% type in this file!)

rfindler avatar Oct 17 '17 22:10 rfindler

(PS: it also seems to contain copies of many other contracts; I think I see the text% contract and the dc<%> contract in there ... any maybe more?)

rfindler avatar Oct 17 '17 22:10 rfindler

Ok, now that you've read the background, here's the high-level strategy:

  1. Maintain a table (at expansion time) mapping types to identifiers which are bound to the contract for that type.
  2. Use that table to generate just a reference to that identifier instead of the actual implementation of a contract when needed.
  3. Update that table in roughly the same manner that the type environment is currently updated in the #%type-decl submodules (note that this technique was originally invented by @mflatt in the You Want It When paper).

Oh, and we're going to do the same thing for types as well, since the serialization of types can be large.

With that said, here's the outline of the new expansion (note that this isn't fully implemented yet):

(module m typed/racket
  (provide (rename-out [f* f]))
  (define (f x) (+ x 5))
  (define-syntax (f* stx) (if typed-context? #'f #'contracted-f))
  (add-mod! (#%variable-reference))
  (module* #%contract-defs #f 
    (define C (-> exact-integer? exact-integer?)) ;; new def
    (define contracted-f (contract C f 'typed 'untyped))
    (provide contracted-f))
  (begin-for-syntax
     (module* #%contract-defs-names #f
        (require (for-template (submod ".." #%contract-defs))
                      (submod ".." #%type-decl))
        (hash-set! predefined-contracts T #'C))
  (begin-for-syntax
    (module* #%type-decl #f 
      (define T (make-Function (list (make-BaseType 'Integer)) (make-BaseType 'Integer))) ;; new def
      (provide T)
      (hash-set! predefined-types T #'T)
      (hash-set! types-of-defined-things #'f T)))

Several things have changed here. First, we've added definitions of C and T in the submodules that use them. That's helpful for being able to refer to them later (note that defining C this way would make the contract less optimized, so we wouldn't do that in practice). Second, we've got a new submodule, #%contract-defs-names. Third, we are mutating two new tables, predefined-types and predefined-contracts.

The new mutations set up two new tables, which are mappings that tell us that if we want an expression that evaluates to the type T, we can use #'T, and that if we want an expression that evaluates to a contract for T, we can use #'C. We have to have the update of predefined-contracts happen in a new separate submodule because types such as T, and the generation of syntax for contract expressions such as #'C, happens at phase 1, but the actual contract value C is at phase 0. We don't need this split for #%type-decl because types and expressions that produce types both happen at phase 1.

Given that, what can we do now? Consider this module:

(module user typed/racket
   (require 'm) (provide g)
   (: g : Integer -> Integer)
   (define (g x) (f (f x)))

Now, when we go to generate a contract for g, we can look in our predefined-contracts table, see that we have a match, and just use #'C for the contract, instead of generating the syntax for g's contract. Similarly, we have to put an expression that constructs g's type in the expansion of user, but we can just use T for that. So now we've avoided lots of duplication.

However, there's one more issue. Lots of Typed Racket files don't depend on any other modules written in typed/racket, but they do depend on core Racket libraries to which Typed Racket has ascribed a type. If those types are big, then everyone who uses them directly will have their own duplicated copy, even if any individual path through the dependency tree would have only one copy, even in the scheme I've outlined.

We've already faced this with type serialization, and the solution is to initialize the table with some commonly-used types so there's no duplication at all. That's what's going on in the manually-constructed #%contract-defs and #%contract-defs-names submodules in this pull request, in the base.rkt and draw.rkt files. We generate those contracts once, even though that file doesn't use them, and register them in the predefined-contracts table, so that everyone who tries to generate that contract can share the definition.

samth avatar Oct 17 '17 22:10 samth

This doesn’t seem to take advantage of the fact that some of the types already have names and thus the contracts can also have corresponding names (without having to make guesses about when to make a C).

Is that not needed? Not helpful?

Does this strategy always avoid duplicating a contract for any type that is already just a name?

rfindler avatar Oct 17 '17 22:10 rfindler

Right, this doesn't use yet another table, which implements the meaning of define-type. Currently, the strategy for when to create a definition for a contract (as opposed to just inlining it) is that it's done whenever there's not a loss to doing it from the contract system's optimizations for functions. I'll probably change that to do it only when the contract would be generated multiple times. But more generally, we will create a lot more contract definitions than we will type names in the sense of define-type, and we want to avoid duplicating them as well.

Also, there's another wrinkle, which is that for classes, we have type names for the class types, but most of the contracts refer to the instances, and sadly those don't share a contract.

Note that the part where it generates the contents of the #%contract-defs-names submodule is not yet implemented, but the rest of it, including several predefined contracts, is implemented. Thus your example module now generates a contract that looks like (->* () n1435) where n1435 is defined in (submod typed/racket/gui/base #%contract-defs).

samth avatar Oct 17 '17 22:10 samth

Just a point of clarification: when you write "....and sadly those don't share a contract", you mean that there isn't a type name? Surely if one had a contract that corresponds to the type Frame%, then one could just write (instanceof/c Frame/c) and avoid duplication.

Regardless, that n1435 is very exciting! (Although I must confess that Frame/c would be slightly more exciting. Perhaps for a future refinement 😄 )

rfindler avatar Oct 18 '17 01:10 rfindler

#%no-add-mod is because I forgot at that point about the trick with #%plain-module-begin.

Yes, if other modules use extra-env-lang, they'll also have to grow these submodules. Probably we should add them by default and have a way to not add them for the ones that are needed.

samth avatar Oct 18 '17 13:10 samth

@rfindler Unfortunately, it really can't always reuse class contracts to produce instance contracts. There are a few reasons for this. One is that when an instance is provided from Typed Racket, we need to use an opaque object contract; that's what happens in this program:

#lang typed/racket
(define c (class object%
            (define/public (m x) x)
            (super-new)))

(define d (new c))
(provide c d)

A second reason is that there's not just one contract for a given type, there's both the contract when the value comes from untyped code and the one when the value goes to untyped code. (There's also a third one when the same contract has to handle both.)

Fortunately, this isn't really a source of much duplication, because most modules have instances in their interfaces, rather than classes, and because the individual method contracts can be shared.

samth avatar Oct 18 '17 15:10 samth

I've improved some of the bigger problems with this PR, and it's become clear that implementing the remaining big piece will be harder than I thought, so I'm considering moving forward with just this part first, especially since it's already a big win. I still plan to address the outstanding comments, of course.

samth avatar Oct 18 '17 17:10 samth

That sounds good to me. I'm interested to see the build plot (http://build-plot.racket-lang.org/) after this change, but I haven't gotten around to adding a way to make a build plot using your "reduce-expansion" branch instead of the one that pkgs.racket-lang.org reports.

mflatt avatar Oct 18 '17 18:10 mflatt

@samth how is this coming? It seems like it could be a great improvement for all people who build racket!

rfindler avatar Nov 02 '17 20:11 rfindler

We'll see how Travis feel's about this. I've addressed all of the outstanding comments.

samth avatar Dec 17 '17 23:12 samth

This now builds properly on my machine with some plot fixes that I've pushed. We'll see how Travis feels about this commit tomorrow.

samth avatar Dec 18 '17 03:12 samth

All the tests for this pass locally for me.

samth avatar May 01 '18 20:05 samth

The PR builds and passes tests on my machine, but the size of compiled libraries (e.g. math and plot) seem to have increased when compared to the nightly build from yesterday.

e.g. from plot/private/plot2d/compiled/

  • vector_rkt.zo went from 20K (nightly) to 62K (reduce-expansion)

  • rectangle_rkt.dep went from 2.6K (nightly) to 15K (reduce-expansion)

  • etc

Resulting in the math library increasing in size from 9.4 MB (nightly) to 13.4 MB (reduce-expansion) and plot increasing in size from 8.5 MB (nightly) to 11.2 MB (reduce-expansion).

pnwamk avatar May 01 '18 21:05 pnwamk

Here are gists with the aforementioned files:

rectangle_rkt.dep on nightly and reduce-expansion

vector_rkt.zo (via raco decompile ...) on nightly and reduce-expansion

pnwamk avatar May 02 '18 14:05 pnwamk

The increase in size of .dep files is because the file now explicitly depends on all its typed transitive dependencies, so that types or contracts that those files create can be re-used. We can reduce this by omitting dependencies that are implied because we already depend on an intermediate module.

The increase in size of .zo files seems to mostly be in syntax objects, which I don't yet understand.

samth avatar May 02 '18 15:05 samth

Here's the nightly vs reduce-expansion diff for vector_rkt.zo:

https://gist.github.com/pnwamk/b3db1aa5ce8496c362feef6ae2b125d0

pnwamk avatar May 02 '18 21:05 pnwamk

The vector_rkt.zo change appears to be 90% from syntax object references to all of Typed Racket's internal identifiers. @mflatt could this be caused by quote-syntax capturing too much/needing to use #:local?

samth avatar May 03 '18 17:05 samth

I have not investigated at all, but using #:local could indeed increase marshaled information by a lot.

mflatt avatar May 03 '18 17:05 mflatt

Oh, I got the sign wrong. The code in this diff doesn't introduce any uses of #:local.

samth avatar May 03 '18 17:05 samth

Here's the minimal example for the weird zo size explosion:

#lang typed/racket/base
(provide f)
(: f (-> (Vectorof Real) (Vectorof Real)))
(define (f v) (error 'fail))

Changing the second (Vectorof Real) to Real reduces the expanded code size from 181 lines to 177 lines. But it reduces the output of raco decompile from 9890 lines to 724 lines.

The diff between the expansions is:

[samth@huor:~/.../plot/private/plot2d (master) plt] diff -u small.rktl large.rktl 
--- small.rktl	2018-05-08 15:40:09.320377948 -0400
+++ large.rktl	2018-05-08 15:40:22.904528386 -0400
@@ -19,10 +19,12 @@
        (just-meta 0 (rename racket/private/sort vector-sort! vector-sort!))
        (just-meta 0 (rename racket/private/sort vector-sort vector-sort))
        (only racket/private/sort))
+      (define-values (g658) (#%app make-Vector -Real))
+      (#%app hash-set! predefined-type-table g658 (quote-syntax g658))
       (#%app
        register-type
        (quote-syntax f)
-       (#%app simple-> (#%app list (#%app make-Vector -Real)) -Real)))))
+       (#%app simple-> (#%app list g658) g658)))))
    (begin-for-syntax
     (#%app
      add-mod!
@@ -114,10 +116,10 @@
          (#%app list)
          '#f
          '#f
+         (#%app list g6)
          '#f
-         '#f
-         popular-plus-one-key-id5
-         popular-chaperone-key-id44
+         popular-plus-one-key-id3
+         popular-chaperone-key-id42
          '#f)))
      (define-values
       (idZ3)
@@ -165,7 +167,9 @@
    (define-values
     ()
     (begin
-      (quote-syntax (:-internal f (-> (Vectorof Real) Real)) #:local)
+      (quote-syntax
+       (:-internal f (-> (Vectorof Real) (Vectorof Real)))
+       #:local)
       (#%plain-app values)))
    (define-values (f) (lambda (v) (#%app error 'fail)))
    (define-syntaxes (f) (#%app make-redirect2 (quote-syntax f)))

samth avatar May 08 '18 19:05 samth

I tried using quote-syntax/prune in the one place where a new quote-syntax appears in the expansion, but it didn't change anything.

samth avatar May 08 '18 19:05 samth

At this point, I'm pretty stumped. The expanded and decompiled code from that example is at https://gist.github.com/samth/80a866b55f7f60539716edf9bdba64ec but I don't understand why the one additional quote-syntax brings in another 1000 bindings.

samth avatar May 08 '18 20:05 samth

I'll fix the docs to explain that identifier-prune-lexical-context is a no-op in the set-of-scopes expander.

The problem in "vector.rkt" is that type->sexp generates an identifier with (gensym), which produces a symbol instead of an identifier, and then it coerces to an identifier using quasisyntax. As a result, the identifier gets the context of the "init-envs.rkt" library, pulling along its imports and bindings. Don't use gensym to generate an identifier. Using generate-temporaries avoids the problem, since generate-temporaries produces identifiers.

mflatt avatar May 21 '18 19:05 mflatt

I've fixed the use of gensym, and now the zo file sizes in plot seem similar, but I haven't done a full comparison. @pnwamk can you redo the comparison you did earlier in this discussion?

samth avatar May 22 '18 04:05 samth

It looks at a glance like library sizes (at least math and plot) have slightly increased:

Nightly -> PR
-------------------
9.2M -> 10M (math)
8.0M -> 8.5 (plot)

pnwamk avatar May 22 '18 14:05 pnwamk