ChezScheme icon indicating copy to clipboard operation
ChezScheme copied to clipboard

Add a compilation pass to cp0 to reduce expression using types

Open gus-massa opened this issue 7 years ago • 39 comments

cc: @mflatt

This is still work in progress, so the files still have some unnecessary comments and rough parts, but it's already big and I think I need some feedback before continuing.

In general, this changes are useful with optimization-level 2, because most of the type checks are skipped in level 3.

The idea was to reduce most common idioms, for example:

(lambda (x) (vector-set! x 0 0) (vector? x)) 
  ==> (lambda (x) (vector-set! x 0 0) #t)

(lambda (x) (when (vector? x) (vector? x))) 
  ==> (lambda (x) (when (vector? x) #t))

(lambda (x) (when (vector? x) (box? x))) 
  ==> (lambda (x) (when (vector? x) #f))

(lambda (x) (unless (vector? x) (error 'e "")) (vector? x))
  ==> (lambda (x) (unless (vector? x) (error 'e "")) #t)

More examples of the reductions are in the first part of cptypes.ms

The main part of the code is in cptypes.ss. It has an explanation of the parameters of the main function at the beginning of the file.

(Some auxiliary functions are copied from cp0.ss. There is also an hamt implementation in hamt.ss. It's necessary to get fast lookup of the discovered types.)


The first commit is just to recalculate automatically the bootfiles in travis-ci. It's handy for testing while writing the initial versions, but my idea is to remove/revert it in the final version.

The second commit adds the signature information available in primdata.ss to the primref. so the other parts of the code can use it. This step is tricky, so I left it in an isolated commit, but my plan is to squash it in the final version. (The third commit has the bootfiles.)

The last commit is the main part of the implementation.

gus-massa avatar Jul 17 '17 21:07 gus-massa

Michael Adams created a type-recovery algorithm and implemented it for Chez Scheme as part of his dissertation research, with help from Andy Keep and others, but I never took the time to put it into the release stream, which would have involved fleshing out the primdata information, among other things. Before incorporating another solution, it would be nice to give that work a close look. This paper describes the algorithm.

dybvig avatar Jul 18 '17 03:07 dybvig

I read the preprint only twice, so I hope I didn't misunderstand too much.

There are a lot of small differences that can be ported in either way. For example, his implementation reduces #2%car to #3%car when posible, but I didn't add that. Also, he is using a lattice of types with a very small height, but I'm using 4 or 5 steps in the numeric tower.

The main difference is that he is using a graph to handle cycles, like (let loop ([n 100]) ... ) and I just give up and don't try anything smart with cycles, so all the analysis can be done in one pass.

The other difference is that I'm always copying the info from each environment to the surrounding environment. This can lead to quadratic run time in cases like

(lamdba (x1 ... xn)
  (let ([r1 (random)])
    ...
       (let ([r2 (random)])
         (display (list (unbox x1) ... (unbox xn)))))
  (display (list (box? x1) ... (box xn))))

He is using an abstract analysis of the flow of the types an structure similar to a skiplist to avoiding to update the types in each step, and the types just teletransport to the correct spot. So this is analyzed in O(n log(n)) instead of O(n^2).

As an informal benchmark, I'm comparing the runtime between the master branch and my branch. With something like

cd a6nt/s; time make allx o=3

my branch, o=3: 28.6 sec master, o=3: 26.4 sec

(with o=2, it runs twice, so you must divide the time by 2 to compare) my branch, o=2: 148.8 sec (~74.4s sec) master, o=2: 170.6 sec (~85.3s sec)

I'm "cheating" a little because most of the times I'm not saving the type of a variable when it is a procedure?. (My code is correct (I hope), but it miss some optimization opportunities, but not too much in usual code.) If I add all the procedures, the code has to track too many variables and the compilation gets ¿50%? slower. He is saving all the info for all procedures, with more details, so this can have an important effect.

gus-massa avatar Jul 23 '17 13:07 gus-massa

Sorry for the long delay. I updated the PR. This is a short description of the commits:

  • fix enumerate signature A small fix, otherwise the PR doesn’t compile.

  • add signatures field to primref record
  • Add cptypes pass to cp0 to reduce expression using types These are the main parts of the old PR. See the description in the initial post.

  • revert to version that can update primref record (skip)
  • recalculate the bootfiles automatically in travis-ci (skip)
  • bootfiles (signatures) These are some commits to try to test the intermediate versions in Travis without updating the bootfiles with each change. These commits should be deleted of the final version.

  • Fix hashing of prelex
  • fxmap version of cptypes
  • change fxmap api to hide internal structure These commits change the hamt by an intmap/fxmap made by @97jaz. In particular the new structure can merge two similar intmap in O(d log(n)), where d is the number of differences and n is the total amount of variables. Also, the prelex have a new field that is a counter, that is used as the hash key for the map.

  • remove some unnecessary pred-env merges
  • update comments and fix exact-integer? and result type of ìf`
  • merge types after if
  • rename merge -> union/intersection
  • refactor pref-union/intersect These are some assorted changes, mostly to avoid unnecessary merges in the pred-env to try to avoid quadratic behavior in long chains of nested let’s. They also provide more information about the return type of an if. (They can probably be squashed in a single commit, but some of them are mixed with the previous commits and they are difficult to reorder.)

  • upper and lower bound for types
  • cache for primref->result/argument-predicate
  • add safeongoodargs flag to primref Try to replace some applications of the primitives with the unsafe version when the types of all the arguments are well enough. This changes (car <pair>) to (#3%car <pair>) but it doesn’t change (vector-ref <vector> <n>) to (#3%vector-ref <vector> <n>) because it’s difficult to verify if <n> is in the correct interval. There are surprisingly many primitives where all the arguments can be checked.


About the thesis of Michael Adams: The implementation is quite different but most of the results are equivalent. (Something similar to a list of features is in mats/cptypes.ss) My PR uses nanopass instead of a custom graph. My version currently passes the checks in Travis. The runtime of the o=3 version increase only a little and the runtime of the o=2 version is smaller. It can be merged after checking a few points that I’m not 100% sure that are correct and fixing a few/many style problems. IIUC the main features that are missing in my version are:

  • Types for loops: I don’t track the types in recursive functions like
    (let loop ([n 0])
      (when (< n 100)
        (loop (+ n 1))) ; <- obviously n is a number

I think this can be added to my current implementation later. It’s not straightforward but I have a vague idea about the implementation, but I have not written a single line of it.

  • [fixed] Quadratic behavior with too many let’s, for example in
    (let ([x1 (read)])
      (let ([xn (read)])
        (+ x1 xn)
      ) ; <- merge the types
    ) ; <- merge the types

In my first version the extended version of this code has quadratic behavior because the types of each x1 and xn should be merged after each let. With the new version I just keep the dead variables in the environment. The new hashing is fast and these additional variables don’t make the code slower.

  • Quadratic behavior with too many if’s, for example in
    (if x1 
      (if x2 1 2)
      (if x3 3 4)
    ) ; <- merge the types discovered in each branch 

This example is not quadratic, and I’m still not sure if something like this can be quadratic in the current version. But if in the future I add support for loops and function calls, it will more easy to find quadratic examples. The thesis has a chapter to fix the problem of the nested if. I think it’s very interesting but the solution is difficult to integrate with my version. In particular, IIUC in the thesis the position of each if is important, but I try to reduce the if’s as soon as possible (for example if x2 is a number, the second if is reduced to the constant 1). I think that long chains of if are not so common, so my version is good enough without this feature.

  • [half fixed] Avoid unnecessary type checks
    (car <pair>) ==> (#3%car <pair>) [fixed]
    (vector-ref <vector> <n>) ==> (#3%vector-ref <vector> <n>) [not fixed]
    (bitwise-copy-bit <sint> <uint> <bit>) ==> (bitwise-copy-bit <sint> <uint> <bit>) [not fixed]

I added a reduction for the cases where it is possible to check all the arguments and replace the application with the unsafe version. This is possible with car, cdr, and many (~200) primitives, but it’s not possible with vector-ref and most vector primitives. It’s possible in the future to add some support for more general cases, in particular the operations on vectors. A special case are the primitives like bitwise-copy-bit. They are marked as safeongoodarguments, but the type system I currently use can’t track the variables with enough precision, so they are not changed in the current versions, but it’s possible to extend the type system in the future.

gus-massa avatar Nov 05 '17 18:11 gus-massa

Note that large uses of match (at least in Racket) can generate quite deeply nested if expressions. I'm not sure if that's likely to be a problem with the issue you mentioned.

samth avatar Nov 13 '17 02:11 samth

I´ll take a look at the expansion match in Racket, but I expect them not to be so big to cause a problem.

I think that it's easier to explain the problem with an example. In the current version of my PR this code:

(lambda (t)
  (unless (symbol? t)
    (error 'e "msg"))
  (let ([x1 (read)]
        [x2 (read)]
        [xn (read)])
    (if (eq? t 'v1)
        (display (+ 1 x1 x2 xn))
        (if (eq? t 'v2)
            (display (+ 2 x1 x2 xn))
            (if (eq? t 'vm)
                (display (+ -1 x1 x2 xn))
                (display (+ -2 x1 x2 xn)))))
    (list (number? x1) (number? x2) (number? xn))))

The optimization step for this code is O(n*m*log(N)), where n is the number of auxiliary variables, m is the number of conditions and N is the total amount of variables (not shown in the example). But the length of the code is O(n*m) because each variable must be used in (almost) each branch condition. If a variable is used only in one single branch in a long chain of ifs, after the first type merge it will dispersar and all the the other merges will not see the variable.

So this simpler code code is optimized in a time essentially O(n+m) (ignoring something like log(N))

    (if (eq? t 'v1)
        0
        (if (eq? t 'v2)
            1
            (if (eq? t 'vm)
                (+ -1 x1 x2 xn)
                (+ -2 x1 x2 xn))))))

I'm (slightly) worried than in a future version, I (or someone) will add some tracking of types inside function calls. Something like:

(lambda (t)
  (unless (symbol? t)
    (error 'e "msg"))
  (let ([x1 (read)]
        [x2 (read)]
        [xn (read)])
    (let ([sum (lambda (z) (dislay (+ z x1 x2 xn)))]) ; let's assume it is not inined
      (if (eq? t 'v1)
          (sum 1)
          (if (eq? t 'v2)
              (sum 2)
              (if (eq? t 'vm)
                  (sum -1)
                  (sum -2))))
      (list (number? x1) (number? x2) (number? xn))))

In this case, if the optimizer is smart enough to realize that after calling sum all the captured variables x1, x2, xn are integers, then the merge of the types will be O(n*m*log(N)) but the code will be of length O(n+m) only because sum is not inlined.

This is not a problem in my current version, because it just ignore function calls. Only the primitives may mark the variables.

gus-massa avatar Nov 20 '17 20:11 gus-massa

The complexity of the merge algorithm is given in Okasaki and Gill as worst case O(m+n).

97jaz avatar Nov 20 '17 20:11 97jaz

Hey @gus-massa,

Thanks for all your work on this!

It looks like this pull request is not quite ready to go though, because when I run the Chez Scheme unit tests I end up with a variety of failures. I would note that you should not need to change the .travis.yml file to get things working, if your boot files are up to date.

Two types of errors I saw

I saw two types of unit test failures when running the mats.

First, there are a few places where the output of cp0 is tested and the $cptypes pass improves on the output by using unsafe primitives in place of the type-checking version. These (if you think they are correct) you should go ahead and update. As part of the review process, these changes will be looked at as well.

Second, there are places where the changes in the PR are leading to some errors in the mats. The sint-list->bytevector procedure is incorrectly optimized when compiling the Chez Scheme libraries, and this is leading a handful to tests related to it to fail, and there are places where a specific error was expected and there is now an invalid memory reference error.

I spent some time looking into the sint-list->bytevector problem, and I was able to isolate it to an erroneous optimization that is getting rid of a loop in the case of a list of 16-bit signed integers. The original bytevector.ss file is a little big to operate on, so I managed to pair it down to a smaller test that demonstrates the problem:

(define $s16-native-list->bytevector
  (lambda (ls)
    (let ((v (make-bytevector (fx* (length ls) 2))))
      (let loop ((ls ls) (i 0))
        (unless (null? ls)
          (let ((k (car ls)))
            (unless (and (fixnum? k) (fx<= -32768 k 32767)) (errorf #f "invalid value ~s" k))
            (bytevector-s16-native-set! v i k))
          (loop (cdr ls) (fx+ i 2))))
      v)))

When I run expand/optimize on this with the current Chez Scheme from the git-repo, 2d1d3ad, I get the following:

(begin
  (set! $s16-native-list->bytevector
    (lambda (ls.0)
      (let ((v.1 (#2%make-bytevector (#2%fx* 2 (#2%length ls.0)))))
        ((letrec ((loop.2 (lambda (ls.3 i.4)
                            (if (#2%null? ls.3)
                                (#2%void)
                                (begin
                                  (let ((k.5 (#2%car ls.3)))
                                    (if (if (#2%fixnum? k.5)
                                            (#2%fx<= -32768 k.5 32767)
                                            #f)
                                        (#2%void)
                                        (#2%errorf #f "invalid value ~s" k.5))
                                    (#2%bytevector-s16-native-set! v.1 i.4 k.5))
                                  (loop.2 (#2%cdr ls.3) (#2%fx+ 2 i.4)))))))
           loop.2)
          ls.0 0)
        v.1)))
  (#2%void))

However, when I run expand/optimize on this with the current Chez Scheme patched with your changes, it (incorrectly) optimizes away the loop:

(begin
  (set! $s16-native-list->bytevector
    (lambda (ls.0)
      (let ((v.1 (#2%make-bytevector (#2%fx* 2 (#2%length ls.0)))))
        (if (#2%null? ls.0)
            (#2%void)
            (let ((k.2 (#2%car ls.0)))
              (if (if (#2%fixnum? k.2) (#3%fx<= -32768 k.2 32767) #f)
                  (#2%void)
                  (#2%errorf #f "invalid value ~s" k.2))
              (#2%bytevector-s16-native-set! v.1 0 k.2)))
        v.1)))
  (#2%void))

I also verified it was cptypes making the change by switching up run-cp0 with the following:

(run-cp0
  (let ()
    (define (echo msg x)
      (display msg)
      (newline)
      (pretty-print (#%$uncprep x))
      x)
    (lambda (cp0 x)
      (let* ((x (#%$cp0 (echo "original input" x)))
             (x (#%$cptypes (echo "output of cp0 (1)" x)))
             (x (#%$cpletrec (echo "output of cptypes (1)" x)))
             (x (#%$cp0 (echo "output of cpletrec (1)" x)))
             (x (#%$cptypes (echo "output of cp0 (2)" x)))
             (x (#%$cpletrec (echo "output of cptypes (2)" x))))
        (cp0 (echo "output of cpletrec (2)" x))))))

This printed out (amongst other things):

---
output of cp0 (1)
(begin
  (set! $s16-native-list->bytevector
    (lambda (ls.6)
      (let ((v.7 (#2%make-bytevector (#2%fx* 2 (#2%length ls.6)))))
        ((letrec ((loop.8 (lambda (ls.9 i.10)
                            (if (#2%null? ls.9)
                                (#2%void)
                                (begin
                                  (let ((k.11 (#2%car ls.9)))
                                    (if (if (#2%fixnum? k.11)
                                            (#2%fx<= -32768 k.11 32767)
                                            #f)
                                        (#2%void)
                                        (#2%errorf #f "invalid value ~s" k.11))
                                    (#2%bytevector-s16-native-set! v.7 i.10 k.11))
                                  (loop.8 (#2%cdr ls.9) (#2%fx+ 2 i.10)))))))
           loop.8)
          ls.6 0)
        v.7)))
  (#2%void))
output of cptypes (1)
(begin
  (set! $s16-native-list->bytevector
    (lambda (ls.6)
      (let ((v.7 (#2%make-bytevector (#2%fx* 2 (#2%length ls.6)))))
        ((letrec ((loop.8 (lambda (ls.9 i.10)
                            (if (#2%null? ls.9)
                                (#2%void)
                                (let ([k.11 (#2%car ls.9)])
                                  (if (if (#2%fixnum? k.11)
                                          (#3%fx<= -32768 k.11 32767)
                                          #f)
                                      (#2%void)
                                      (#2%errorf #f "invalid value ~s" k.11))
                                  (#2%bytevector-s16-native-set! v.7 i.10 k.11))))))
           loop.8)
          ls.6 0)
        v.7)))
  (#2%void))
---

You'll notice the recursive call to loop.8 was eliminated. My suspicion (though I've not tracked this through the cptypes code) is that the something with the merge points around the (unless (and (fixnum? k) (fx<= -32768 k 32767)) (errorf ---)) is merging to bottom when it should not be, or the pred-implies? is incorrectly identifying the result as bottom, which is causing it not to generate the `loop`` call in the output. I could be wrong though, there could be something more subtle going on here, and it is worth noting that similar shaped code for the 32-bit signed integer does not manifest the same problem.

A bug like this might also be what is causing the invalid memory exceptions I'm seeing.

A note on style

I noticed it looks like you are using a box to represent an input/output parameter. Instead of using boxes I recommend using multiple-value returns. The advantage is that you avoid allocation, since (in the worst case) multiple-value returns are handled similar to a function call, and in the best case, it is possible for the multiple-value consumer to be pushed with in the multiple-value producer (see: An Efficient Implementation of Multiple Return Values in Scheme by Ashley & Dybvig).

A note on asymptotic behavior

Like @samth, I'm a little concerned about the potential for O(n^2) behavior, traditionally we've tried to keep optimizations to O(n log n), only allowing for O(n^2) passes where it was required for correctness. That said, if we can get this into shape, we can do some testing to see how deeply nested the if expressions have to get before we really start to see performance degradation in the compiler, and in the worst case, we can always cut off the analysis if we discover we are going down a path that is going to lead to bad performance. In practice, as you've said, we may find we rarely hit the limit.

Let me know if you'd like help on this, and thanks again for the work you've already put in!

akeep avatar Dec 04 '17 03:12 akeep

Thanks for the minimized example. The problem with bytevector-[u/s]16-native-set! was that it has a wrong signature. (The 32 bits versions is correct.) So cptypes detects that the variable k is a fixnum? and it uses the wrong signature to detect that k must be a symbol?, so it expect that the bytevector-u16-native-set! will generate an error, so cptypes drops the rest of the expression.

I submitted a fix in a different PR https://github.com/cisco/ChezScheme/pull/240 because this looks isolated enough and simple enough to be merged soon.

I'll reply to the rest of your comment later.

gus-massa avatar Dec 05 '17 15:12 gus-massa

I'm working mostly in Windows, but I never could fix the problems with iconv.dll, so if I run all the test I get a lot of errors. I'm using a VM with Linux now, so I can see the actual errors.

I found a few more (~5) errors with the signatures, that were causing explicit error or invalid memory errors. (I'll submit the PR soon.) I also found a a few places were I marked the primitive erroneously as safeongoodars. I also fixed the checks of the expanded code. (I'll update this PR in a week after I fix most of the problems.)

I still have to fix the errors with records. @mflatt found an error that is probably related to records, but he has no an small/easy example. Some of the messages of the errors in the mats look similar, so I hope that fixing the mats will fix it.

About Style (box -> values): OK, I'll change it

About asymptotic behavior: I'll try to think the details later. I'll try to fix the other problems before.

gus-massa avatar Dec 09 '17 16:12 gus-massa

That all sounds good, thanks again for all of your hard work on this!

As the first real consumer of the signature data in primdata.ss, I'm afraid you've ended up being the tester for that data.

I'm not a windows user, but you might consider putting in an issue about the problems you've run into running the tests there, as this is something that should be possible, and there are windows users in the Chez Scheme community that might be able to help out.

akeep avatar Dec 09 '17 21:12 akeep

@gus-massa, try https://github.com/win-iconv/win-iconv to make an iconv.dll that works on Windows. I use this for x86 and x64.

burgerrg avatar Dec 11 '17 13:12 burgerrg

@burgerrg I tried now to compile win-iconv, but I got an error (Perhaps because I'm using Cygwin instead of MinGW?) Anyway, I downloaded the compiled version from https://mlocati.github.io/articles/gettext-iconv-windows.html and I copied the file libiconv-2.dll to ChezScheme\a6nt\bin\a6nt and it works.

I still get some errors in the master branch in Windows. I'll try to classify them and send a pull request or a bug report if I can't fix it.

@akeep In my "secret" branch, I fixed all the errors that the test in "allx" can detect. But I want to make a few more tweaks before updating the PR. I'm optimistic about the problems with the quadratic time, I think it's solvable with a few modifications to the current structures. I also tried to change the box to values, but some parts are painful. For example to look at all the arguments of a function:

;--- with box ---
[(call ,preinfo ,pr ,e* ...)
 (let* ([r* (map (lambda (e) (box #f)) e*)]
        [t* (map (lambda (e) (box (unbox types))) e*)]
        [e* (map (lambda (e r t) (cptypes e 'value r t #f #f))
                 e* r* t*)])
   (set-box! types (something-with (unbox types) e* r*))
   (for-each (lambda (t) (set-box! types (pred-env-intersect (unbox types) (unbox t)))) t*)
   `(call ,preinfo ,pr ,e* ...))]

and

;--- with box ---
[(call ,preinfo ,pr ,e* ...)
 (let* ([e/r/t*  (map (lambda (e)
                        (let-values ([(e r t t-t f-t)
                                      (cptypes e 'value types)])
                          (list e r t)))
                      e*)]
        [e* (map car e/r/t*)]
        [r* (map cadr e/r/t*)]
        [t* (map caddr e/r/t*)])
   (let ([types (something-with types e* r*)])
     (values `(call ,preinfo ,pr ,e* ...)
             #f
             (fold-left (lambda (types t) (pred-env-intersect types t)) types t*)
             #f #f)))]

gus-massa avatar Dec 21 '17 16:12 gus-massa

@gus-massa I agree that looks pretty hairy, but I would approach this a little differently.

First, instead of using map and generating a list of lists, then mapping over the lists, I'd probably consider either building a loop to do this in place, or I'd write a multi-valued map macro that allowed for specification for the base-case null list values. In this case though, the nanopass framework can help out, because it can already create the multi-valued map for you:

[(call ,preinfo ,pr ,[e* r* t* t-t* f-t*])
 (let ([types (something-with types e* r*)])
   (values `(call ,preinfo ,pr ,e* ...)
           #f
           (fold-left (lambda (types t) (pred-env-interset types t)) types t*)
           #f #f))]

If you take a peak at cpnanopas.ss, you'll also see places where we've written custom loops, or extracted that into a function when we were going to use that several times.

akeep avatar Dec 28 '17 03:12 akeep

After another long delay ... Fixing the quadratic time was possible, but it was harder than I expected.

I squashed most of the commits and the fix in a few big commits. Short description:


  • add signatures field to primref record …
  • Version to upate primref record during rebase: step 1 (skip)
  • Version to upate primref record during rebase: step 2 (skip)
  • Version to upate primref record during rebase: step 3 (skip)
  • recalculate the bootfiles automatically in travis-ci (skip)
  • bootfiles (signatures)

Add the signatures to the primref. This step is tricky, so I left the intermediate steps to make a future rebase easier, but most of these commits should be removed in the final version.


  • Hashing of prelex for cptypes
  • Add fxmap for cptypes
  • Add more operations to fxmap
  • test for fxmap-{remove,reset}/base (skip)

Implementation of an intmap that will hold the types recovered during the cptype pass.


  • Remove special case for (#2%map p '()) in cp0 [fix in master] …
  • Add cptypes pass to cp0 to reduce expression using types …

This is the main part of the PR. This includes the previous version, some improvements and many fixes.


  • add safeongoodargs flag to primref …

Reductions to replace a primitive by the unsafe version. I left the safeongoodargs separated because it has a lot of instances to be reviewed, so perhaps it's better to merge it later.



About the previous comments:

box -> values

done

fix tests

done

records

I changed like the 90% of the implementation for records. I think that now it is correct.

quadratic time

The main idea is that the types are stored in a immutable intmap/fxmap that remembers how much changes were used to create it. So each time it is necessary to merge two intmaps it is possible to select the smaller one and add to the biggest one only the recent differences. The main advantage of this intmaps/fxmaps is that the difference operation can be very efficient.

Some construction like let use a special operation to forget the temporal variables. This operation not only removes the variable but it also tries to reconstruct the inmap before the temporal variable was added. [In particular, with this reconstruction some trees that look equal? are eq?, like in the additional test. Most of the times only a part of the tree is reconstructed.] This helps the following parts of the code to compare two inmaps faster, because the iterations can skip the common parts using eq? without comparing the subtrees.

The analysis of the if is more tricky, because it's necessary to use a different strategy when the test part is longer than both branches. Also, if one one of the branches is longer than the test it's important to use the intmap of the smaller branch for the iterations and compare using lookups in the intmap of the biggest branch. (The "length" is actually calculated the number of changes in the relevant intmap, that is bounded by the actual length of the code.)

This pass runs in O(N log(N) (log(N)+R)) where N is the total length of the code and R is the depth of the record hierarchy, because to compare two records the program must walk from a rtd to the parents until it finds the other rtd or the base rtd. In practice R is small, I tested with the Chez Scheme compilation and I got R=3, and I expect that most programs don't have rtd with parent with parent with parent with parent with parent. So the effective bound is O(N log^2(N)).

[It is possible to fix the problem of the records with too many ancestors, but in that case I think it's better to change it in another time.]

gus-massa avatar Feb 13 '18 03:02 gus-massa

This all sounds good. It will take me a few days to get some time to look through and test all these changes, but I'm looking forward to going through it. Thanks for all the work 👍

akeep avatar Feb 14 '18 03:02 akeep

Hi @gus-massa,

Sorry it has taken me so long to get back to you. I found a few bugs in testing and ended up fixing them myself as I was reviewing and experimenting a bit with the code. Unfortunately, I was unable to push the changes directly into this pull request, so I have them up on a GitHub desktop created branch called pr/192.

Before I get into the list of the things that I changed (and where), there were a few things that I did not do, that are minor changes we still might want to consider.

In the fxmap code, it looks like $empty is probably a singleton and we might get away with an ever so slightly faster eq? check using the singleton instead of using $empty? (It is even possible to force the $empty record protocol to produce a singleton, but that is likely overkill.

In the cptypes code:

We might consider making $record/rtd $record/ref lists into records... it is both more space efficient (2 words for the record instead of 4 for 2 pairs) and more performance efficient when checking for membership.

I think there are still a handful of places where we can avoid doing multiple car/cadr/etc. calls on the same object. Less important for car/cdr, but for cadr, etc. slightly more important since we're walking multiple pointers repeatedly doing that.

The primitive call handling is a bit complicated, and I'm wondering if it is worth it to follow the cp0 example of creating handlers for the primitives. There may be some code we can communize in here, and I think it might clean things up a bit, though we are sort of on the borderline of needing more complicated handling.

On the performance side, I suspect there is still quadratic behavior lurking in here somewhere, but with the changes you've made I suspect it would be very difficult to construct the program that exposed this. Part of the reason I say this is that the merging of fxmaps has a linear worst case, and the handling of call arguments also does a linear walk over the items bound in the types maps when the pred-env-rebase function is called.

That said, I'm happy enough with where the performance is on this. I think you've done a lot of good things to curtail the cases where we might run into trouble, and it doesn't seem to impact the compile time much if at all on my mac, though I've not done any formal benchmarking of it.

So, onto the changes I've made on my branch:

Removed counter field from prelex, using the operand field instead to provide the index into the fxmap. This follows other uses within the compiler where we use the operand field as a little place for state that is used within a single pass. This has a few advantages. First, it keeps the record a little smaller. Second, it means that the prelex numbering can start from 0 for each compilation unit, which should help keep the numbers for the fxmap a bit smaller in longer running sessions with multiple calls to the compiler. Finally, it avoids adding to the burden of the tc-mutex, since the counter is now a local variable bound within the pass and it is safe for us to set the prelexes, since only the instance of the pass holding this block of code has a handle on it. As part of this change prelex-counter is now defined in cptypes and the operand is cleared after the variables go out of scope.

  base-lang.ss, cptypes.ss

Fixed the highest-set-bit function in fxmap so that it will work in the 32-bit versions of Chez Scheme. The fxsrl by 32 raises an exception in the 32-bit compiler, and was leading to tests to fail in 32-bit mode.

  fxmap.ss

Restructured predicate-implies? so that it uses committed choice instead of uncommitted choice in comparing x and y. Basically, this means, instead of doing:

(or (and (predicate-1? x) (predicate-1? y) ---)
     (and (predicate-2? x) (predicate-2? y) ---)
     ...)

we now do:

(cond
  [(predicate-1? x) (and (predicate-1? y) ---)]
  [(predicate-2? x) (and (predicate-2? y) ---)]
  ...)

This avoids running predicates on x that we know will fail because an earlier predicate matches, generally getting out of the predicate-implies? faster. This did require a little restructuring, because in some cases x was dominant and in other cases y was dominant. This is now restructured with y dominate, after the eq? and x 'bottom check.

  cptypes.ss

Replaced let-values calls with cata-morphism syntax, including removal of map calls that built up a list of values that then needed to be separated out with (map car ...) (map cadr ...) etc. calls. This avoid building up structures we don't need, since the nanopass framework will generate a mutltivalued map for these situations.

  cptypes.ss

The if clause in cptypes/raw now uses types1 (the result of the recursive call on e1) in place of the incoming types value when processing the e2 or e3 expressions in the cases where e1 is known statically to produce either a false or non-false value.

  cptypes.ss

Fixed a bug with directly-applied variable-arity lambda. The original code marked all directly-applied variable arity lambda's as producing bottom, because it was checking for the interface to be equal to the number of arguments. However, variable arity functions are represented with a negative number. For instance, the original code would transform the expression:

(begin
  ((lambda (a . b) (set! t (cons* b a t))) 'a 'b 'c)
  t)

to

((lambda (a . b) (set! t (cons* b a t))) 'a 'b 'c)

anticipating that the call would raise an error, however, it is a perfectly valid (if some what unusual) expression. I ended up adding tests that run cptypes without cp0 to get this type of expression to cptypes, so we can test it. The code also chooses the correct clause from a case-lambda with multiple clauses.

  cptypes.ss

Fixed make-time, time-second-set!, and time-second to indicate that second can be an exact-integer, since it is not really restricted to the fixnum range (and in fact we even test this fact in the mats on 32-bit machines).

  primdata.ss

Changed check of prelex-was-assigned (which is not reliably set on the input to any given pass) with prelex-assigned, which should always have an accurate, if conservative, value in it.

  cptypes.ss

Added enable-type-recovery parameter to allow the type recover to be turned on and off (defaults to on), and added cptype to the cp0 not run path that runs cpletrec, so that cptypes can be run independent of cp0. This is helpful for testing and allows us to benefit from type recovery, even in cases where we do not want cp0 to do anything that would re-arrange the input code

  compile.ss, front.ss, primdata.ss

Stylistic changes, mostly for consistency with other parts of the compiler, though I'm not married to these changes if you'd really prefer to keep things the way the are.

  1. clauses of define-record type now use parenthesis instead of square brackets.
  2. indented by 2 spaces where things were only indented by one space
  3. define, let, define-pass, nanopass pass productions clauses, now use parenthesis for outer markers instead of square brackets.
  fxmap.ss, cptypes.ss

Sorry for the changes being written to a branch, there is probably a better way for us to collaborate on this code, but sadly I'm not as experienced with the GitHub tooling as I probably should be and didn't want to spend the whole evening searching for a better solution.

Thanks again for all your hard work on this, let me know what you think and how you'd like to proceed.

-andy:)

akeep avatar Apr 04 '18 01:04 akeep

One thing I almost forgot, I have not yet updated the documentation to add the enable-type-reccovery to CSUG, but we'll need to do that as well before we add pull in these changes.

akeep avatar Apr 04 '18 02:04 akeep

I have no problem with using an auxiliary branch to get the fixes. I was already doing something like that with @97jaz and @mflatt .

About the changes:

I have read the changes and they look good, but I need a few more days to read them more carefully. (Also, my notebook has a problem with the screen, and I'm using the old one for a few days ...)

I agree with the change of types -> types1 in the if clause of cptypes/raw. (It was a mistake. I was probably shadowing types and later decided to rename it as types1 to get more consistent names, and I forgot to update the name of the variable in these sites.)

About the ideas for changes:

  • change ($empty? x) -> (eq? x <singleton>)

OK. I'm only worry about the possibility in the future of reusing fxmap.ss in another pass (like cp0) and getting a different "singleton". This would be a problem is the fxmaps are somehow stored and shared. This problem is too hypothetical, so it can be solved with a comment in the file with a warning for the future.

Another small improvement would be writing a version of fxmap-merge that doesn't try to build a result fxmap. I'm using it in fxmap-for-each/diff, trying to minimize the result and then discarding it at the end.

  • making $record/rtd and $record/ref lists into records

OK.

  • remove temporal ir in the call clause of cptypes/raw.

OK. I think I wrote that for laziness, to be able to return the default ir when necessary without thinking. (Sometimes it's nice because it's clear that the reduction affects only the types and not the main expression.) I used it only 4 times, so it's easy to fix it with cut&paste.

  • use define-inline like cp0

I like the idea, but I'd prefer to wait. I like that define-inline makes the code much easier to parse because it's clear where each primitive is handled. But I still not sure about the details of the interface. In particular, the argument types can be a) the original fxmap that is the argument of cptypes/raw b) the result after adding the types recovered in the arguments c) the result of adding also the types that are checked in the primitive definition/signature Currently, the effective value is c) because it makes the implementation easier, but in an isolated reduction b) is more natural, but it may be more error prone than a). The value of the argument is important because all the recovered types must be added first to the "main" fxmap in the result, and later add the additional types to the results that will be used in the "then" and "else" branches. If you add them in other order, you must use some of the unusual functions of the fxmaps. [Mmm. This description is probably not very clear, but I hope it's clear that I don't have the details clear enough yet. So, I'd prefer to think about this and discuss the implementation details later before writing them.]

About the quadratic time:

I'm sure I fixed all the corner cases (but I have been wrong in the past ...). The first important point is that the numeration of the prelex is consecutive or quasi-consecutive so the fxmaps are almost balanced, or to be more precise the height is bounded by log(N) +k where N is the total number of variables in the program and k is a small number, I guess 2 or 3. So there are not degenerated cases where a big fxmap is actually a long list. Also, the fxmap with the types in the results are constructed in such a way that in an expression like (+ e1 e2) the types1 of e1 and the types2 of e2 share as much as posible the structure with the original types. So the differences between typesX and types can be iterated in less than ((changes typesX) - (changes types)) * log(N) time. To ensure this property it is necessary be careful and in particular to use fxmap-remove/base to remove the temporal variables and rebuild the branches where they were stored. So it is possible to pick from types1 and types2 the one with fewer changes and copy only those changes to the other one. In particular if one of them has a few changes from the original types and the other has a lot of changes from the original types, then the merge operation must copy only a few changes and is fast. Selecting the one with fewer changes is the main idea to ensure that the global runtime is O(N log^2(N)). (Actually, merging all the fxmaps in a if clause is trickier if you want to avoid quadratic time, so I have to use some weird functions with the fxmaps.)

gus-massa avatar Apr 06 '18 03:04 gus-massa

There are three lines in cptypes.ss that have a #;CHECK reminder/warning. These are parts that I think that are correct, but I'm not sure. (For example, the clause for immutable-list in cptypes/raw.) I think that you already looked at the code and thought it was fine, but can you take another look at them so I remove the reminder/warning?

gus-massa avatar Apr 06 '18 21:04 gus-massa

About the ideas for changes:

  • change ($empty? x) -> (eq? x )

OK. I'm only worry about the possibility in the future of reusing fxmap.ss in another pass (like cp0) and getting a different "singleton". This would be a problem is the fxmaps are somehow stored and shared. This problem is too hypothetical, so it can be solved with a comment in the file with a warning for the future.

Actually, you are already treating this as a singleton in places, where you return empty-fxmap in fxmap-remoe and fxmap-remove/base and when you initialize pred-env-empty, it is treated as potentially different only in fxmap-for-each/diff

There are three lines in cptypes.ss that have a #;CHECK reminder/warning. These are parts that I think that are correct, but I'm not sure. (For example, the clause for immutable-list in cptypes/raw.) I think that you already looked at the code and thought it was fine, but can you take another look at them so I remove the reminder/warning?

The immutable-list and cte-optimization-lock look good to me. immutable-list is a special case that allows for cp0 to look inside a list it knows was constructed during expansion of some internal syntax when we know there is no way it could be modified, so we can treat it as constant without doing any escape analysis on it.

The predicate-implies? check is a good question. Obviously, eq? is the safest check here, though I think eqv? is also safe enough. In theory, I believe that these are quoted constants which should not mutated, though Chez Scheme does not prevent this mutation. That said, I think I'd rather stick with eqv? right now, unless we find some places where we think this is being overly conservative. If we do loosen it up, we may also want to consider checking for record equality, which r6rs equal? does not do.

akeep avatar Apr 07 '18 20:04 akeep

The predicate-implies? check is a good question. ...

I think it's better to use eqv? because predicate-implies? is used in predicate-implies-not?. So we get these reductions

(eq? 3.3 3.3) => #t      [unespecified]
(eq? 3.3 7.7) => #f
(eqv? 3.3 3.3) => #t
(eqv? 3.3 7.7) => #f

The first one is officialy unespecified, but using eqv? in predicate-implies? makes cptypes reduce it to #t allways. Using eq? would make the it sometimes be reduced to #f, because the result would depend on some of the internal details, for example if both 3.3 are constant propagated from the same origin, or if they are calculated independently, or if there was a round trip to fasl.

Something similar applies to the third one, but the result of the third one must be #t , so comparing internaly with eq? would be a problem.


I also found another small optimization of the program in cptypes. I removed the wrapper around cptypes/raw and now the recursion calls cptypes directly. The wrapper was there to replace t-types and f-types with the value of types when one of them was #f (that means the there are no additional predicates for the #t or #f branches). I now take care of that case inside the clause of if when it is necesary, instead of in every recursive call. So it must save a few milliseconds here and there. (I also had to make a trivial tiny modification in the clause of let/let*/letrec, but I have to debug the code for a while until I notice that I must fix it too.) preview

gus-massa avatar Apr 08 '18 15:04 gus-massa

Sounds good. If the recursion is now within cptypes and the body expression is just a call to Expr, then all of the cata-morphisms can have the cptypes : removed. This will cause Expr to recur to Expr directly for the cata-morphisms. It probably won't have a measurable impact on performance, but it does simplify the code a bit.

I agree on the eqv?, I think getting the numeric equivalence is worthwhile.

akeep avatar Apr 08 '18 15:04 akeep

I rebased the PR, and in the last commit I added all the proposed changes exept define-inline.

I removed the auxiliary commits, so this version doesn't have the correct bootfiles and it gets an error in travis. I have the outdated bootfiles and the trick to make travis recompile them in another branch gus-massa/17-5-Types-Pass-Extended (tests)

The main missing parts are the updates to the LOG and CSUG.

gus-massa avatar Apr 13 '18 03:04 gus-massa

Cool. I'll can rebuild the boot files after merging this in to a local copy of master.

We should get the LOG and CSUG changes pulled in too if we can. I can take a look at this over the weekend, and hopefully get things put together.

Just so I understand, it sounds like the things you checked in on this pull request are up to date, with the exception of the LOG and CSUG changes. As far as the travis changes, I've actually re-done how travis checks things on master and it should be doing the checking correctly now, though it will still run afoul of the changes that invalid the boot files.

akeep avatar Apr 13 '18 03:04 akeep

It took me a bit more effort than I expected to get the boot files built, but I managed to get these built and tested.

Unfortunately, in testing I ran into what looks like it might be an unexpected error. I was testing on a6osx, i3osx, ti3osx, and ta6osx, and ran into a strange bug on one of the runs of ta6osx.

I'm getting the error: pred-env-union/base: when running one of the threaded tests. My suspicion is that something is getting fouled up with all the threads running (there is a call to eval in the thread test, which is invoking the cptypes pass:

219     (define (pred-env-union/from from base types new-base)
220       ; Calculate the union of types and from, and intersect it with new-base
221       ; Iterate over the difference of from and base.
222       (let ([ret new-base])
223         (fxmap-for-each/diff (lambda (key x y)
224                                (let ([z (fxmap-ref types key #f)])
225                                  ;x-> from
226                                  ;y-> base
227                                  ;z-> types
228                                  (set! ret (pred-env-add/key ret key (pred-union x z)))))
229                              (lambda (key x)
230                                (let ([z (fxmap-ref types key #f)])
231                                  ;x-> from
232                                  ;z-> types
233                                  (set! ret (pred-env-add/key ret key (pred-union x z)))))
234                              (lambda (key x) (error 'pred-env-union/base "") (void))
235                              from
236                               base)
237           ret))

It looks like line 234, in this test is getting something unexpected, but I have not yet figured out why, nor do I have a good way to reproduce this quite yet.

The fact that this only showed up in one machine type of the two that run the thread tests and then only in one of the twelve runs of this tells me that something intermittent (and well buried) is going on in this code, so it might take some effort to ferret it out.

Just FYI, we should update this error message to use $oops (or $impoops if you think this should be impossible) with a who argument like compiler-internal and a little bit more meaningful error message to give us some kind of indication as to what might have failed. We try to stick to using $oops or $impoops inside the compiler, though I noticed a few error or error calls have leaked in from place to place.

The actual error I saw was in the following summary from a ta6osx run, the three errors are all in a row and the second one is almost certainly the result of the first, though the third one may or may not be related:

% cat ../mats/summary
-------- o=0 --------
-------- o=3 --------
-------- o=0 cp0=t cl=3 --------
-------- o=3 cp0=t cl=3 --------
-------- o=0 spi=t p=t --------
-------- o=3 spi=t p=t --------
-------- o=0 eval=interpret cl=6 --------
-------- o=3 eval=interpret cl=6 --------
-------- o=0 cp0=t eval=interpret --------
-------- o=3 cp0=t eval=interpret --------
0a1,3
> thread.mo:Error in mat thread clause 57: "pred-env-union/base: " at line 982, char 3 of thread.ms
> thread.mo:Error in mat thread clause 58: "extra threads running (#<thread 913> #<thread 912> #<thread 911> #<thread 910> #<thread 909> #<thread 908> ...)" at line 1035, char 3 of thread.ms
> thread.mo:Bug in mat thread clause 59 at line 1036, char 3 of thread.ms
-------- o=0 ehc=t eoc=f cl=9 --------
-------- o=3 ehc=t eval=interpret --------

akeep avatar Apr 22 '18 02:04 akeep

Oh, almost forgot. I'll continue to poke at this a bit, but any insight you can provide would be helpful.

Thanks!

akeep avatar Apr 22 '18 02:04 akeep

[I still have problem with my main computer, so I have to guess a little ...]

One important part of (pred-env-union/from from base types new-base) is it must be possible to construct from starting with base and using a few add operations. Sometimes the process is more complicated, but the operations like remove/base ensure that there is an alternative path that uses only add operations.

The error you found in the test is triggered because base has a binding that is not in from. This is impossible if you use only add operations (add actually intersects the new type with the old value, it's not a simple overwrite, it never cleans a binding). I think the error procedure should be changed to $impoops because in a bug free universe it is impossible to reach this part of the program.

I still don't understand where the problem is, but something like the following example should cause a problem. Each variable has a friendly name and also a number that is the counter that is temporally assigned to the prelex-operand field. The intmap uses only the counters, not the name or the prelex itself. I assume that there is a rogue prelex, that is shared between two threads, and the different threads reuse the numbers.

(lambda (t1 x2 y3)
  (when (box? shared4)    ;  <-- the number 4 was assigned by another thread
    (if t1
        (let ([temp4 (read)])
          (display (list (unbox temp4) (unbox temp4)))
          )    ;  <-- here temp4 is removed, but this "removes" shared4 too
        (display (list (unbox x2) (unbox y3))
          )    ;  <-- enough recovered types to make this the long branch
        )    ;  <-- here the types are merged, the missing shared4 is a problem
    ))

The error will be raised only if the numbers in each tread are luckily aligned. In other cases there may be silent errors. This code is not similar to the code in thread.ms#L979 so I can't guarantee this is the problem. My guess is that the imported identifiers in the examples are the shared prelex. This is probably not a problem in cp0 because they don't get any value overwritten in the prelex-operand field.

gus-massa avatar Apr 23 '18 02:04 gus-massa

I believe the only place more than one compiler session should have its hands on the same prelex is as part of the $cte-optimization-info related to a symbol or as part of the box in the cte-optimization-loc. cp0, depends on this being true for more than just the prelex-operand info, it is also modifying the prelex-name, which is uses to replace incoming prelex records with new prelex record. cpnanopass also relies on this when it replaces the prelex with uvar records. So, if we can have multiple compiler sessions running simultaneously with their hands on the same prelex I would expect these places to also be a problem, though the nature of race conditions is that they might not show up very often, so it is possible we've just not had quite the right conditions to expose this.

All of that said, we should be able to look into this and check for it by putting in some auxiliary information in the cptypes pass to identify a prelex we are seeing for the first time that already has its prelex-operand set. We've certainly had cases where we were allowing a single prelex record to be used at more then one binding form within the system, which led to some problems.

One other thing worth noting on this is that prelex records are created as part of the expander process, and each compiler session runs its own expander, so it is creating new prelex records for the code it produces. In order for multiple sessions to have a handle on the same prelex it would need to be the case that output from a single expander session got sent to multiple compiler instances. The only way I know of that this can happen is from a code fragment being inlined from the cte-optimization-loc box (aka the cross-library optimization information) from another library, which I don't think we should be seeing here, but that is definitely worth looking in to.

It might be a few of days before I get a chance to spend much time on this again, but hopefully we can get a reproducible, which will make me more confident in our fix :)

One question I did have in the fxmap code is why you decided to have the merge function run by side-effect instead of constructing the output functionally? Was this just to try to optimize around the O(m + n) worst case of the merge function?

akeep avatar Apr 23 '18 17:04 akeep

I made a small debugging change to prelex-counter function to remember the prelex that it has modified adding the counter, and so detect any prelex numbered by another thread and raise an error. It is in my branch 17-5-Types-Pass-Error, with two auxiliary commits to make travis-ci happy.

With the two auxiliary commits all the tests pass, but with the main commit I get an error exactly in all the threaded versions. (This change is more sensitive than the test error you found, because it will catch any bad prelex, instead of catching only the cases where the numbering and recovered types are correctly aligned.) (I couldn't find a way to see the error in travis-ci, grep is eating the output.)

I hope it's helpful.


About the side effects instead of the functional approach in the merge: To avoid the O(n+m) worst case I need that:

  • the counters of the prelex are almost continuous (or at lest that they don't increase exponentially) so the big trees look like trees instead of linked lists
  • the branches that look equal? are actualy eq?, so if one tree is constructed from nother tree and they are almost equal then the common parts can be detected with eq? and skipped in block, without looking at each leaf.

So I need some merge operations that use tree (or four?) intmaps. There was a functional merge written by @97jaz , but it was not enough because it uses two trees and the result is a combination of the two trees. But I sometime must paste the results into a third tree to ensure that the common part between the result and the third tree is eq? and not just equal?.

Once I thought about adding a functional version that calls the configurable function with an additional argument that acts like an accumulator, but I didn't understand the code well enough. (Now I think I can, but the code is still tricky.) (In case this is not clear, in the equivalent code with lists, it's like replacing a combination of a for-each and a set! by a foldr, the lambda needs an additional argument.)

gus-massa avatar Apr 25 '18 02:04 gus-massa

Sorry that I've been a little slow getting back to this, work was a bit hectic and it has taken me a little bit of time to get back to this.

Just to make sure I understand, the differences currently in this PR are with your changes that seem to run without error across the board, but if I take a look at your branch: 17-5-Types-Pass-Error it will fail reliably in the threaded tests.

Is your suspicion that the changes in the current PR still have the bug, but have chased it into hiding?

I've made some changes in a local copy of this to match recent commits, merged master into this set of changes, and I'm currently running all the tests in a6osx, i3osx, ta6osx, and ti3osx on my mac.

So far this seems to be running along fine, but if we think there are still bugs hiding at the intersection of the prelex-counter code changes and the threaded versions, I can try to dig into that a bit (though I'm not sure if I'll have a chance to finish it up this week, and I'm traveling over the long weekend).

Thanks for continuing to push this along, I feel like we're really close to getting this in.

akeep avatar May 23 '18 03:05 akeep