stanc3 AST refactor & handing user defined vs built-in functions / distributions

I've just started looking at the AST refactor and realized that the current definitions of expression patterns isn't going to play nicely with the Fixed.Make functor because we have a type parameter for function kinds:

type ('e, 'f) expression =
  | ...
  | FunApp of 'f * identifier * 'e list
  | CondDistApp of 'f * identifier * 'e list
  | ...

My suggestion for this would be:

have separate union cases for user-defined and built-in functions
provide a dictionary of built-in function names to the parser so we can syntactically decide if something is a built in function.

This would give us:

type 'e expression = 
   | ...
   | FunApp of identifier * 'e list 
   | UDFunApp of identifier * 'e list
   | ...

Since we would be making a change to expressions anyway, this would also be an opportunity to pull out distributions and make it a separate syntactic object to expressions. Again, we need to provide a dictionary of built-in distributions to the parser. Would have the following for distributions:

type 'e dist = 
  | UDDist of identifier * 'e list 
  | Dist of identifier * 'e list

and replace CondDistApp with:

type 'e expression = 
  | ... 
  | CondDistApp of 'e * 'e dist
  | ...

This design is in-line with SlicStan's syntax and would pave the way for further work on redesigning how we handle groups of functions related to the same distribution and bijectors.

@VMatthijs @seantalts @rybern what are your thoughts? This is blocker for the AST refactor so would be great to agree a way forward.

Oct 25 '19 11:10 enetsee

I think it would be slightly more complicated than specific names - iirc users can override nullary function names, and possibly there are other weird things too...

I like the separation of the AST as a purely syntactic construct. What do you think about moving the AST to be more purely syntactic and either sorting this out in AST to Mir or adding another piece of metadata to everything in the decorated AST to hold function kind (and for most expressions it would be empty).

Oct 25 '19 13:10 seantalts

I think it would be slightly more complicated than specific names - iirc users can override nullary function names, and possibly there are other weird things too...

Ah, I see. Out of interest, do you know if that is in the language spec or a 'bug' that people have made use of? I had a quick scan through but couldn't find anything on overloading other than the following:

Two user-defined functions may not have the same name even if they have different sequences of argument types.

It seems odd that users are allowed to shadow built-in functions but not their own!

What do you think about moving the AST to be more purely syntactic and either sorting this out in AST to Mir or adding another piece of metadata to everything in the decorated AST to hold function kind

The thing I had in the back of my head is that we may want to perform type checking of built-in functions (and distributions) differently from user defined functions when considering vectorization. We could handle this directly by distinguishing between user-functions and built-ins in the type definitions and only consider the vectorization relation for builtins. This way we wouldn't even need to attach any additional meta-data.

To type check a functions application (or distribution) we would maintain two type environments for builtins and user defined functions, then check against the user-defined environment first.

Does this make sense?

Oct 25 '19 15:10 enetsee

I'm not sure which things were bugs and which were features :P Would guess in Stan 3 we'd try to move towards uniform shadowing rules.

The thing I had in the back of my head is that we may want to perform type checking of built-in functions (and distributions) differently from user defined functions when considering vectorization.

Not sure I get what you're describing here. This is a hypothetical future, right? Right now we just check against the signatures dictionary and Semantic_check has no notion of vectorization, right?

Are you saying that the only reason the type checker needs to know about user vs. stan math functions is for future vectorization checks? We also need that information in code generation btw, but so it's fine to just put it on the MIR.

I would run these ideas past Matthijs - I don't know the semantic check code well enough to know if any of these schemes would work in all cases.

Oct 25 '19 15:10 seantalts

This is a hypothetical future, right?

Yes - if we are making changes it would be great to have a design that would work with some of the features we've talked about like vectorization, effects etc. and move there incrementally.

Are you saying that the only reason the type checker needs to know about user vs. stan math functions is for future vectorization checks? We also need that information in code generation btw, but so it's fine to just put it on the MIR.

No, the reason that semantic check needs to know user-defined vs builtin is because it is looking up types in different places. User functions types are retrieved from the symbol table where as the set of signatures for math functions are retrieved from Stan_math_signatures.

let semantic_check_fn ~is_cond_dist ~loc id es =
  match fn_kind_from_application id es with
  | StanLib -> semantic_check_fn_stan_math ~is_cond_dist ~loc id es
  | UserDefined -> semantic_check_fn_normal ~is_cond_dist ~loc id es

let semantic_check_fn_normal ~is_cond_dist ~loc id es =
  Validate.(
    match Symbol_table.look vm id.name with ...

let semantic_check_fn_stan_math ~is_cond_dist ~loc id es =
  match
    Stan_math_signatures.stan_math_returntype id.name (get_arg_types es)
  with ...

What I'm suggesting is to unify the representation of built in and user-defined functions then check with the user environment first then the built in environment to deal with overloading.

The vectorization part isn't necessary for this to work but it would mean you didn't need to list all possible signatures for each built-in function and instead rely on a functions to determine when the vectorized version was well typed, a la SlicStan.

Since we would still need to represent the difference between built-in and user defined functions (e.g for code gen) we would have

type t =
  | UInt
  | UReal
  | UVector
  | URowVector
  | UMatrix
  | UArray of t
  | UUserFun of (autodifftype * t) list * returntype
  | UFun of (autodifftype * t) list * returntype

This encodes the same information as having an additional 'fun_kind' field in typed meta-data but doesn't have the redundancy for non function types.

I would run these ideas past Matthijs - I don't know the semantic check code well enough to know if any of these schemes would work in all cases.

Definitely - I mentioned to Matthijs that it would be good to spec the type system out before we started implementing it!

Oct 25 '19 15:10 enetsee

Yes - if we are making changes it would be great to have a design that would work with some of the features we've talked about like vectorization, effects etc. and move there incrementally.

Sounds good - just double checking some of my assumptions about the context.

No, the reason that semantic check needs to know user-defined vs builtin is because it is looking up types in different places. User functions types are retrieved from the symbol table where as the set of signatures for math functions are retrieved from Stan_math_signatures.

Ahhh.

What I'm suggesting is to unify the representation of built in and user-defined functions then check with the user environment first then the built in environment to deal with overloading.

Why not unify the two type environments as well? Then the semantic checker really doesn't care if something is UDF or Stan Math, right? Even for vectorization, we'll have some functions that support it and some that don't, so we can represent that however (plus in the future we'd like to allow folks to use vectorized forms of functions that don't yet have efficient vectorized implementations by just doing the silly thing and generating for loops).

The vectorization part isn't necessary for this to work but it would mean you didn't need to list all possible signatures for each built-in function and instead rely on a functions to determine when the vectorized version was well typed, a la SlicStan.

Just to be clear I definitely want to move in this direction - see above and also this is very useful for code generation and optimization.

This encodes the same information as having an additional 'fun_kind' field in typed meta-data but doesn't have the redundancy for non function types.

Now that I understand more about what you're proposing I'm still feeling like we should just have the AST be about syntax and move that kind of representation to the MIR.

Oct 25 '19 15:10 seantalts

Now that I understand more about what you're proposing I'm still feeling like we should just have the AST be about syntax and move that kind of representation to the MIR.

Precisely this! We would have only one representation of functions without a function kind:

Expressions:

   module Pattern = struct
      type 'e t =
          | TernaryIf of 'e * 'e * 'e
          | BinOp of 'e * Middle.Operator.t * 'e
          | PrefixOp of Middle.Operator.t * 'e
          | PostfixOp of 'e * Middle.Operator.t
          | Variable of Identifier.t
          | IntNumeral of string
          | RealNumeral of string
          | FunApp of  Identifier.t * 'e list
          | ....

The untyped expressions are now straightforward.

Your suggestion was then to have an additional field in Expr.Typed.Meta; since this is only a property of functions it would either be optional or just ignored for non-function expressions:

module Meta = struct
  { loc: Location_span.t sexp_opaque [@compare.ignore]
  ; ad_level: UnsizedType.autodifftype
  ; type_: UnsizedType.t 
  ; kind : Fun_kind.t option
  }
end

I'm agreeing with you except that if instead you just add another union case to UnsizedType.t (as in the definition given in my previous comment) you no longer need the kind field and you don't have the redundancy for non-function expressions.

By distinguishing between built-in and user-defined function by their types we also leave the door open for all the other type checking stuff we are thinking about and can perform the translation to MIR or at code-gen time.

Oct 25 '19 16:10 enetsee

Hm, that makes sense. Seems like it doesn't make the tree incorrect at any point.

Oct 26 '19 16:10 seantalts

Can you give me access to the design-docs repo so I can create a branch for the type system work? I think it would be best to set out the big picture there then decide on the steps .

Oct 28 '19 13:10 enetsee

I want to keep to a model there where folks are operating from their own forks - could you use that for now? Thank you. I’m used to working that way in companies and it makes all of the permissions upkeep work better and has better branch hygiene.

On Mon, Oct 28, 2019 at 09:04 Michael Thomas [email protected] wrote:

Can you give me access to the design-docs repo so I can create a branch for the type system work? I think it would be best to set out the big picture there then decide on the steps .

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/stan-dev/stanc3/issues/361?email_source=notifications&email_token=AAGET3HUJOPHO4MREUZOMLTQQ3PL5A5CNFSM4JFCBVAKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOECMZICQ#issuecomment-546935818, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGET3D62DURPGXB3CDWFETQQ3PL5ANCNFSM4JFCBVAA .

Oct 28 '19 13:10 seantalts

stanc3 stanc3 copied to clipboard

AST refactor & handing user defined vs built-in functions / distributions

stanc3
stanc3 copied to clipboard