MCMCChains.jl icon indicating copy to clipboard operation
MCMCChains.jl copied to clipboard

[RFC] MCMCChains Goals

Open cpfiffer opened this issue 3 years ago • 11 comments

I wanted to collate some of the longer-term goals for MCMCChains. I think many of us would agree that MCMCChains is quite stable at this point. It generally works well, has an acceptably defined set of indexing and analysis tools, and has lots of interactions with the rest of the ecosystem (not just Turing anymore!).

However, I do think it's worth putting down in words the various issues we still have with MCMCChains. This issue will be the hub for general improvements to the package, big or small, that we hope we can get some other people to help out on.

Comments are appreciated.

Improvements:

Performance, particularly for display calls.

The first time the user calls display (explicitly or through REPL usage), it takes my machine about 6.5 seconds just to handle to display call. If you're a REPL user this is absolutely brutal. Granted, display(chain) does a ton of numerical work and thus lights up a lot of method compilation, but I wonder if there's a way to make the user experience here slightly better if possible.

Adding tools to reshape parameters into their original forms

We have a recent issue on this (https://github.com/TuringLang/MCMCChains.jl/issues/231) but it has appeared consistently for as long as I have been working on MCMCChains (nearly 2 years now). For example, if you have two parameters x[1]=4 and x[2]=2, it's pretty easy for us to say that these could be repacked into their semantic form x = [4, 2] and just given to the user that way. As of right now, it's not super easy to do this kind of thing -- you typically have to roll your own code that matches your model setup. get kind of solves this for you, but it could use a lot more work to create the (x=[4,2],) named tuple I would like to see.

Visual improvements to plotting and generally better heuristics

The plotting functionality of MCMCChains is pretty good, but it suffers from a lack of polish and doesn't have as many intuitive safeguards. For example, if you plot a chain with many parameters, by default all parameters will be thrown up with their own tiny density and trace plots that are completely unreadable. Obviously you can fix this quickly by just subsetting the chain (plot(chain["param1"])), but there should be a more palatable plot if the number of parameters is obscenely large. I think rstan has a good approach to this where it throws up just a mean and 95% posterior density for each parameter -- it's very compact.

On another note, we don't seem to specify plot spacing, so the text on plots can get smushed into other subplots. I'm not sure if this is a Plots.jl issue or our issue.

Heterogeneous variable types

My least favorite feature of MCMCChains behind the scenes is that it's all just a giant array when you get down to it. Arrays are great at lots of things, but not what we're using it for -- parameters might be Int or Float64 or Missing, and so including them all in one big array either type casts everything to Float64 or Union{Float64, Missing} (thus eliminating the value of Int parameters) or sets the type to Any which is garbage from the compiler's perspective.

I'm not really sure what the fix is here, other than providing either an alternative backend (with StructArrays or something else) or providing an entirely different AbstractChains object which would be in a separate package and evolve independently from MCMCChains.

I'm also not sure how valuable it is to fix this, considering the fact that arrays are actually really good for general purpose applications, and don't need too much extra work.

cpfiffer avatar Oct 20 '20 01:10 cpfiffer

I agree, these are points we could/should probably work on.

Regarding display: A simple alternative to the current approach would be to not perform any statistical evaluations when displaying a Chains object but only information such as number of samples, chains, variable names, etc. (basically the top part of what is displayed currently). It always felt a bit weird to me that we mix statistical evaluations and printing to the REPL. Instead one could point users to the describe function for a summary.

Regarding plot: IIRC I had the impression that the code could be updated and simplified a bit - I am still a bit confused by the different recipe types but it seemed like the implementation didn't completely adhere to the convention and ideas in Plots. I tried some things in a branch somewhere, I'll have to check if I can find it again :smile:

devmotion avatar Oct 20 '20 08:10 devmotion

Do you have thoughts on providing an alternative backend for nonlinear parameters? I would prefer to move that kind of thing over to a new package entirely rather than have it inside MCMCChains, mostly because MCMCChains is too stable to make huge shifts in the representation backend. Unless of course we could just have it be NamedTuples behind the curtain and provide the exact same array output functionality we have now.

cpfiffer avatar Oct 20 '20 19:10 cpfiffer

@devmotion I wrote a lot of the recipe code and I know Plots quite well, but it's true that it uses some unideomatic tricks at times. Happy to discuss this and help finding improvements.

mkborregaard avatar Oct 20 '20 21:10 mkborregaard

We may end up having someone work on some of these issues, so I wanted to type up the two projects I think would be the most beneficial in terms of scope and difficulty.

Project 1: Agnostic Backends

MCMCChains currently enforces an AxisArray backend, which is bad for all kinds of reasons:

  • It makes hierarchical and non-linear parameter spaces difficult to work with (try reconstructing alpha[1][5] into it's original shape -- it blows).
  • It doesn't handle mixed types like Int and Float well. This is an even bigger problem if you have non-numeric data like String!
  • It hamstrings us into working with a single data structure, AxisArrays, which is bad for hackability and future growth.

For this project, I would want someone to overhaul the MCMCChains ecosystem to

  1. Add a set of API functions that MCMCChains would use internally to find the locations of individual parameters by iteration and by chain. getindex, etc.
  2. Remove the AxisArray backend by default -- you should be able to give it a Vector{NamedTuple}, a StructArray, an Array, or any other reasonable data type that implements some common API functions build in step 1.
  3. Add implementations of the API functions for a couple of common backend types: a. AxisArray b. StructArray c. Array{T, 3} d. Array{NamedTuple, 3}

This is a pretty complex project that would touch a lot of the codebase, but I think it's worth the investment. I hate the fact that it's difficult to work with vectors of parameters exported by Turing, and allowing any kind of backend would be enormously cool.

Project 2: Everything else

The second project covers lots of ground, and is mostly composed of lots of little things.

  1. Add convenient defaults when plotting lots of parameters, such as RStan's default plot.
  2. Fix plot spacing for plot. When plotting multiple parameters, the plots tend to get smushed together and overlap one another.
  3. Write a guide on how to use MCMCChains and all it's random bits. Currently we only have a README which is insufficient.

Anything else I should stick on here?

cpfiffer avatar Nov 24 '20 19:11 cpfiffer

I was thinking about the back end and wonder: what is currently included in the info and logevidence fields of Chains? Given my limited understanding of MCMCChains, I think that a DataFrame could be a nice data structure. It would allow for removal of about 20% of the code and makes the API super easy for users (but again, I have limited understanding; therefore, I might be missing something).

rikhuijzer avatar Mar 13 '21 17:03 rikhuijzer

Nah, we actually did a ton of work to remove DataFrames from MCMCChains internals to cut down on load times. The idea is generally that you can put whatever you want in there as long as it's wrapped in a NamedTuple. Turing for example will dump the model, the sampler, and some other stuff in there.

cpfiffer avatar Mar 13 '21 18:03 cpfiffer

I'm not really sure what the fix is here, other than providing either an alternative backend (with StructArrays or something else) or providing an entirely different AbstractChains object which would be in a separate package and evolve independently from MCMCChains.

Would like to say that I'd absolutely love to see a fully fleshed out AbstractChains.jl package that implements a common API across all the different Julia samplers. Trying to get ParetoSmooth to work with half a dozen different implementations has been annoying, to say the least, and if we start putting a lot more work into plotting we'll see that become a bigger issue too -- none of the work on MCMCChains' plots will be translatable to Soss or Gen or Stan.

ParadaCarleton avatar Jul 23 '21 00:07 ParadaCarleton

@cpfiffer

MCMCChains currently enforces an AxisArray backend, which is bad for all kinds of reasons: It makes hierarchical and non-linear parameter spaces difficult to work with (try reconstructing alpha[1][5] into its original shape -- it blows). It doesn't handle mixed types like Int and Float well. This is an even bigger problem if you have non-numeric data like String! It hamstrings us into working with a single data structure, AxisArrays, which is bad for hackability and future growth.

Some of these look like they might be fixable with a smaller set of changes than would be required by a fully generic interface, even if the fully-generic interface would be better for future growth. Namely, I think you could swap out AxisArrays with a DimensionalData.jl wrapper around a StructArray or a TupleVector, since DimensionalData is just a wrapper around any other kind of array. You might even be able to get away with requiring only the DimensionalData wrapper, and letting the internal type be generic, since you can index a DimensionalData wrapper the same way you can index an AxisArray.

ParadaCarleton avatar Oct 16 '21 19:10 ParadaCarleton

I thought AxisKeys or NamedDims are the most common replacements for and improvements over AxisArrays these days but I didn't keep track of recent developments. There are many different approaches and packages in this area.

devmotion avatar Oct 16 '21 19:10 devmotion

Namely, I think you could swap out AxisArrays with a DimensionalData.jl wrapper around a StructArray or a TupleVector, since DimensionalData is just a wrapper around any other kind of array. You might even be able to get away with requiring only the DimensionalData wrapper, and letting the internal type be generic, since you can index a DimensionalData wrapper the same way you can index an AxisArray.

I actually quite like this -- as long as we have some common interface to some indexable data, I am happy.

cpfiffer avatar Oct 16 '21 21:10 cpfiffer

I thought AxisKeys or NamedDims are the most common replacements for and improvements over AxisArrays these days but I didn't keep track of recent developments. There are many different approaches and packages in this area.

I believe AxisKeys (Which has NamedDims as a dependency) and DimensionalData are roughly equally common, but DimensionalData.jl has ArrayInterface support, avoids the method ambiguities introduced by indexing with round parentheses, and loads much faster (in version 1.7; in 1.6 ArrayInterface.jl takes around 1.5s to load, so both are about as fast).

ParadaCarleton avatar Oct 17 '21 14:10 ParadaCarleton