
Some changes to the InferenceData schema

Open sethaxen opened this issue 3 years ago • 8 comments

Tell us about it

With https://github.com/arviz-devs/ArviZ.jl/pull/191, ArviZ.jl has an independent implementation of the InferenceData schema, which had previously only been implemented in Python, using xarray for storage. This motivates a few minor additions and changes to the schema:

Additions

  • New default attribute: arviz_language: currently either "python" or "julia". This is necessary for interpreting the "arviz_version", since the Python and Julia versions do not correspond.
  • Require that the arviz_version attribute always be included, unless we intend for packages besides ArviZ to define their own implementations of the schema. In that case, there are other attributes we should add (e.g. created_by).
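A minimal sketch of what these default attributes could look like, using a plain dict as a stand-in for a group's attributes (the timestamp, version strings, and the validator below are hypothetical illustrations, not part of any implementation):

```python
# Hypothetical default attributes for an InferenceData group under this
# proposal; the values are illustrative, not real releases.
default_attrs = {
    "created_at": "2022-07-03T17:00:00Z",
    "arviz_version": "0.12.1",     # version of the creating package
    "arviz_language": "python",    # needed to interpret arviz_version
}

def check_attrs(attrs):
    """Sketch of a validator: arviz_version is only interpretable
    alongside a recognized arviz_language."""
    if "arviz_version" in attrs and attrs.get("arviz_language") not in ("python", "julia"):
        raise ValueError("arviz_version requires arviz_language")

check_attrs(default_attrs)  # passes silently
```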

Changes

  • Remove the constraint that inference_library and inference_library_version are included. These may not always make sense to include, e.g. when samples were generated directly with scipy, or when multiple inference libraries were used together to generate samples. In Julia, the object storing MCMC chains is sometimes implemented in a different package from the PPL, and the actual inference algorithm might live in its own package (e.g. MCMCChains implements chains and is used by both Turing and CmdStan/Stan), so the storage object itself is often not sufficient to infer the inference library, and it's not even clear which package the inference library should be.
  • Remove the constraint that [chain, draw] be the first dimensions. In general, I think we've agreed this should not be assumed by ArviZ's functions; plots/diagnostics should consult the axes by dimension name.
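To illustrate the second change: a function that looks up an axis by dimension name works regardless of where chain and draw sit in the ordering. The dims and shapes below are made up for illustration:

```python
# Dims deliberately place chain/draw after a model dimension to show that
# nothing should depend on positional order; names and sizes are invented.
dims = ("school", "chain", "draw")
shape = (8, 4, 1000)

def axis_of(dims, name):
    """Return the positional axis of a named dimension."""
    return dims.index(name)

chain_axis = axis_of(dims, "chain")  # 1: found by name, not by convention
n_chains = shape[chain_axis]         # 4
```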

It might make sense to define the schema separately from the implementations: ArviZ.jl and Python ArviZ might both implement the same schema but with different conventions. For example, they could adopt different conventions for where [chain, draw] are placed in the dimensions list, or ArviZ.jl might use an API where the attrs keyword is replaced by metadata, even though both are serialized to NetCDF attributes. The schema could then be defined at arviz.org, and each implementation could document any additional conventions it uses.

sethaxen avatar Jul 03 '22 17:07 sethaxen

Everything sounds great. I completely agree with not assuming chain, draw are the first dims. I think I already said somewhere that the code doesn't really make this assumption: it is only used in from_dict and other converters that take numpy arrays, to know how to label them. Once we have xarray objects, only the dimension name matters, and this is already the case for plots and stats functions.

Or maybe ArviZ.jl would use an API where the attrs keyword is replaced by metadata, though they are both serialized to NetCDF attributes.

This also sounds good. I'll check that the attrs part is not part of the schema. In the Python case this is hardcoded by xarray, but I agree that what's important/necessary for compatibility is that these attributes (whatever their name in each language) serialize to netCDF/Zarr attributes.

OriolAbril avatar Jul 06 '22 17:07 OriolAbril

Not sure if there is another issue on this or if so far we have only discussed it on Slack, but I also think it would be good to move the schema description to the arviz-project repo, and also to add a Jupyter notebook with a Julia example in Turing or Soss, whichever you prefer.

OriolAbril avatar Jul 06 '22 17:07 OriolAbril

Require that the arviz_version attribute always be included, unless we intend for packages besides ArviZ to define their own implementations of the schema. In that case, there are other attributes we should add (e.g. created_by).

A counterpoint to this is that in Julia land at least, we may want to define InferenceData independently of ArviZ. So then it would be better to not have arviz_version as part of the official spec but rather documented as something ArviZ does when it creates an InferenceData.

A few more suggestions/questions. The rules state

Coordinate values can be repeated and should not necessarily be numerical values.

We should document what (if any) restrictions should apply to coordinate values. For netCDF serialization, they need to be real numbers or strings, right?
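One possible reading of that restriction, sketched as a check that every coordinate value is a real number or a string. This is an illustration of the question being asked, not an agreed rule:

```python
import numbers

def serializable_coord(values):
    """Heuristic check: are these coordinate values plausibly
    serializable as netCDF coordinate data (real numbers or strings)?"""
    return all(isinstance(v, (numbers.Real, str)) for v in values)

serializable_coord(["a", "a", "b"])    # True; repeated values are allowed
serializable_coord([0, 1.5, "mixed"])  # True
serializable_coord([1 + 2j])           # False; complex values don't qualify
```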

log_likelihood

Pointwise log likelihood data. Samples should match with posterior ones and its variables should match observed_data variables. The observed_data counterpart variable may have a different name. Moreover, some cases such as a multivariate normal may require some dimensions or coordinates to be different.

predictions

Out of sample posterior predictive samples p(y’|y). Samples should match posterior samples. Its variables should have a counterpart in posterior_predictive. However, variables in predictions and their counterpart in posterior_predictive can have different coordinate values.

The last sentence of each of these sections could clarify what is expected here. e.g. suppose y in observed_data has dimensions (y_dim_0, y_dim_1) and y has a multivariate normal likelihood. In log_likelihood, it just has dimension (y_dim_0,). Now, do we require that y_dim_0 match in both groups? Or can y_dim_0 in log_likelihood correspond to y_dim_1 in observed_data?
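The ambiguity can be made concrete with the shapes involved. All names and sizes here are hypothetical:

```python
# Hypothetical dims for the multivariate-normal example: observed y has two
# dims, while the pointwise log likelihood collapses the event dimension.
sizes = {"chain": 4, "draw": 500, "y_dim_0": 20, "y_dim_1": 3}
group_dims = {
    "observed_data":  {"y": ("y_dim_0", "y_dim_1")},
    "log_likelihood": {"y": ("chain", "draw", "y_dim_0")},
}

obs_shape = tuple(sizes[d] for d in group_dims["observed_data"]["y"])  # (20, 3)
ll_shape = tuple(sizes[d] for d in group_dims["log_likelihood"]["y"])  # (4, 500, 20)

# The open question: must the shared name be y_dim_0 as here, or could the
# log_likelihood dimension legitimately correspond to y_dim_1 instead?
shared = set(group_dims["observed_data"]["y"]) & set(group_dims["log_likelihood"]["y"])
```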

Similarly, in what way might predictions have different coordinates? Do we mean it should have the same dimension but different coordinates? Or potentially a different dimension and coordinates?

sethaxen avatar Jul 30 '22 08:07 sethaxen

Not sure if there is another issue on this or if so far we have only discussed it on Slack, but I also think it would be good to move the schema description to the arviz-project repo, and also to add a Jupyter notebook with a Julia example in Turing or Soss, whichever you prefer.

Agreed!

sethaxen avatar Jul 30 '22 08:07 sethaxen

Also the warm-up groups are not listed in the schema so technically they violate the schema. Is this just an oversight?

sethaxen avatar Jul 30 '22 09:07 sethaxen

A counterpoint to this is that in Julia land at least, we may want to define InferenceData independently of ArviZ. So then it would be better to not have arviz_version as part of the official spec but rather documented as something ArviZ does when it creates an InferenceData.

This approach is probably better. PyMC already handles the conversion to InferenceData within its own codebase, for example, so the arviz version, while still added, isn't very relevant.

We should document what (if any) restrictions should apply to coordinate values. For netCDF serialization, they need to be real numbers or strings, right?

I would not add this to the schema. We can add a note about being careful with the types used, but afaik netCDF capabilities are still improving over time, and Zarr is also a valid, 100% compatible serialization format (which I suspect will become the default at some point) that already supports more features than netCDF.

The last sentence of each of these sections could clarify what is expected here. e.g. suppose y in observed_data has dimensions (y_dim_0, y_dim_1) and y has a multivariate normal likelihood. In log_likelihood, it just has dimension (y_dim_0,). Now, do we require that y_dim_0 match in both groups? Or can y_dim_0 in log_likelihood correspond to y_dim_1 in observed_data?

The dimensions in observed_data will have names, and one of those names should also be the one present in the log_likelihood group. Which of the two it is currently does not matter, and I think it should not.
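That rule could be sketched as a check that, after dropping the sample dims, a log_likelihood variable shares at least one dimension name with its observed_data counterpart. The names below are illustrative:

```python
# Sketch of the rule: the log_likelihood variable must share a dimension
# name with observed_data, but it doesn't matter which one it is.
def shares_a_dim(obs_dims, loglik_dims):
    sample_dims = {"chain", "draw"}
    return bool(set(obs_dims) & (set(loglik_dims) - sample_dims))

shares_a_dim(("y_dim_0", "y_dim_1"), ("chain", "draw", "y_dim_0"))  # True
shares_a_dim(("y_dim_0", "y_dim_1"), ("chain", "draw", "y_dim_1"))  # also True
shares_a_dim(("y_dim_0", "y_dim_1"), ("chain", "draw", "obs_id"))   # False
```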

Similarly, in what way might predictions have different coordinates? Do we mean it should have the same dimension but different coordinates? Or potentially a different dimension and coordinates?

This was intentionally vague because at the time no ArviZ function was using it, but the idea was to have plot_ts use it. I am still not sure, as I didn't follow the progress on plot_ts closely and don't know what use it makes of the data available in the different groups.

Also the warm-up groups are not listed in the schema so technically they violate the schema. Is this just an oversight?

I agree we should document them, but having extra groups is not a violation of the schema. Similarly, having variables in sample_stats that are not among the ones listed in the schema is also perfectly fine. What would be a violation is storing the divergence info in a variable called diverging_sample instead of diverging.
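A sketch of that conformance idea, mapping schema-defined meanings to required variable names (the name set here is an illustrative subset, not the full schema):

```python
# Schema-defined sample_stats names (illustrative subset). Extra variables
# are fine; a violation is storing a defined quantity under another name.
SCHEMA_NAMES = {"divergence": "diverging", "log_probability": "lp"}

def conforms(stored_names):
    """stored_names maps the meaning of each variable to the name used.
    Unknown meanings (custom stats) are always allowed."""
    return all(
        stored_names[meaning] == name
        for meaning, name in SCHEMA_NAMES.items()
        if meaning in stored_names
    )

conforms({"divergence": "diverging", "custom_stat": "tree_depth"})  # True
conforms({"divergence": "diverging_sample"})                        # False
```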

OriolAbril avatar Jul 30 '22 13:07 OriolAbril

@OriolAbril the schema describes attributes as being an ordered dictionary. Should we relax the condition that it be ordered? Zarr.jl returns attributes as a plain unordered Dict. Should we also define any conditions for the keys? Not certain if Zarr or NetCDF can handle keys besides strings.
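The ordering question can be illustrated with plain Python dicts, which preserve insertion order but compare by content only, so a schema that never relies on attribute order works with both ordered (Python) and unordered (Zarr.jl) mappings:

```python
# Two attribute mappings with different insertion order; content is equal.
attrs_a = {"arviz_version": "0.12.1", "created_at": "2022-08-22"}
attrs_b = {"created_at": "2022-08-22", "arviz_version": "0.12.1"}

order_independent = attrs_a == attrs_b  # True: equality ignores order
```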

sethaxen avatar Aug 22 '22 10:08 sethaxen

I actually have no clue why it says ordered dictionary, and I'm quite sure it is not used anywhere. I might have copy-pasted too aggressively. I can remove it in the PR once I get to it again (in 1-2 weeks).

Not sure about conditions for keys 🤔

OriolAbril avatar Aug 23 '22 22:08 OriolAbril