
zuko_maf versus maf

Open dbxifu opened this issue 5 months ago • 13 comments

❓ Question

I have been using zuko_maf instead of maf, as suggested. In an identical setup, I have found that the posteriors of some parameters from zuko_maf were significantly narrower than those from maf. When I compute my posteriors with traditional MCMC, only the maf ones are consistent with the MCMC results. Has this problem been encountered elsewhere?

Thanks a lot! ...

fic_0001_10.0s_src_grp_opt_aec64_zuko_maf_hf100_nt10_nc20_MRI_5_10000_round_5_posteriors.pdf

fic_0001_10.0s_src_grp_opt_aec64_maf_hf100_nt10_nc20_MRI_5_10000_round_5_posteriors.pdf

dbxifu avatar Jul 15 '25 06:07 dbxifu

Hi there!

Individual training runs can indeed differ between these two backends. However, my guess would be that this is largely due to the random seed. To verify this, you could build an ensemble of several training runs (with varying seeds) and check whether the results differ consistently.

I.e., train multiple posteriors, and then:

from sbi.inference import EnsemblePosterior

# Build one posterior per training run (each trained with a different seed).
posterior1 = inference1.build_posterior()
posterior2 = inference2.build_posterior()
posteriors = [posterior1, posterior2]

# Combine them into a single ensemble posterior.
ensemble = EnsemblePosterior(posteriors)
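
Once built, the ensemble behaves like a regular posterior. A minimal usage sketch, where x_o is a placeholder for an observation tensor from your own setup:

# Draw samples from the ensemble given an observation.
samples = ensemble.sample((10_000,), x=x_o)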

Michael

michaeldeistler avatar Jul 15 '25 08:07 michaeldeistler

Dear Michael,

Thanks for your quick answer. I should have added that this is a very consistent finding across very different setups and models. It affects one or more parameters, but I don't yet have a clue as to why it affects those parameters and not the others.

Regards, Didier


dbxifu avatar Jul 15 '25 13:07 dbxifu

If it is a consistent finding, then I have to admit that I am not sure what the reason could be. Any ideas @janfb @manuelgloeckler @gmoss13? Is there any chance you could provide a minimal reproducible example? How many simulations are you using?

Michael

michaeldeistler avatar Jul 15 '25 13:07 michaeldeistler

Indeed. I typically run with 10000 samples per round. I have tried different hyper-parameter setups, but the result is the same. I have also increased the number of samples by up to a factor of 10, but the issue remains. Strange. Didier


dbxifu avatar Jul 15 '25 17:07 dbxifu

Could you run expected coverage for both setups and see how they perform?
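
For reference, a minimal sketch of such a diagnostic with sbi's tools; thetas and xs stand in for your own held-out (parameter, simulation) pairs, and the exact signatures may differ across sbi versions:

from sbi.diagnostics import run_sbc, check_sbc
from sbi.analysis import sbc_rank_plot

# thetas, xs: held-out parameters and their corresponding simulations.
ranks, dap_samples = run_sbc(thetas, xs, posterior, num_posterior_samples=1000)
stats = check_sbc(ranks, thetas, dap_samples, num_posterior_samples=1000)
fig, ax = sbc_rank_plot(ranks, num_posterior_samples=1000, plot_type="hist")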

michaeldeistler avatar Jul 15 '25 17:07 michaeldeistler

Does this issue remain once you increase the number of simulations (i.e., give it more data)? This can have various causes, e.g., how exactly the flow is implemented. There are differences between Zuko and nflows, which are the two backends.

For example, in the "affine" transform, Zuko uses an exponential reparameterization, i.e., exp(a)*x + b, whereas nflows uses a softplus parameterization, softplus(a)*x + b. These minor differences can have a pretty big effect on what is ultimately learned, given a specific initialization of the network on a finite training dataset.
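
As a rough standalone illustration (plain PyTorch, not the actual flow code), the same raw network output a maps to very different scales under the two parameterizations:

import torch
import torch.nn.functional as F

a = torch.tensor([-4.0, 0.0, 4.0])  # raw, unconstrained network outputs
print(torch.exp(a))      # tensor([ 0.0183,  1.0000, 54.5981])
print(F.softplus(a))     # tensor([0.0181, 0.6931, 4.0181])

The exponential reacts much more sharply to changes in a, which makes very narrow (or very wide) conditionals easier to reach.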

manuelgloeckler avatar Jul 17 '25 08:07 manuelgloeckler

Yes, the issue remains if I increase the number of simulations to 50000 instead of 10000 (10000 is already on the high side for me). I have run SBC, which indeed shows the issues with two model parameters (sigma and redshift). This is something I get consistently across different models and setups. The maf backend doesn't have similar issues.

fic_0001_10.0s_src_grp_opt_aec64_hd1221_zuko_maf_hf100_nt10_nc20_MRI_5_50000_round_5_sbc_hist.pdf

dbxifu avatar Jul 17 '25 16:07 dbxifu

Hmm, I would have expected that the exponential parametrization makes it much easier to reach extreme values, although, yes, this should have gotten better with more samples. I can't really see what could go wrong in sbi (we just wrap Zuko). This could indicate a problem within Zuko, or just a very unlucky combination of your problem and the inductive biases of Zuko's parametrization (resulting in a bad convergence rate). To debug this, one would probably need to set up a simpler minimal reproducing example that is somewhat similar to your problem.

To make things more similar to nflows, you can try to turn on permutations (see the sketch below), and if this doesn't help, clone Zuko and try to change exp to softplus.
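
A hypothetical sketch of the first suggestion, building the flow directly with Zuko; the sizes are placeholders, and randperm shuffles the feature order between transforms, similar to the permutation layers nflows inserts by default:

import zuko

flow = zuko.flows.MAF(
    features=5, context=3, transforms=5,
    hidden_features=(100, 100), randperm=True,
)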

manuelgloeckler avatar Jul 18 '25 17:07 manuelgloeckler

Are you using the default zuko_maf and maf builders? I am asking because the problematic parameters seem to be the ones with the narrowest parameter ranges - I am wondering if there is a problem with z-standardizing the parameters (which should be the same, but is implemented through different functions for maf and zuko_maf). I also second the points above: a minimal reproducing example would help with debugging.
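
For context, a minimal sketch of how the z-scoring options reach the default builders, assuming a recent sbi version; the values shown are the documented defaults, and prior is a placeholder for your own prior:

from sbi.inference import NPE
from sbi.neural_nets import posterior_nn

# Both backends should apply the same "independent" z-scoring to theta and x.
density_estimator = posterior_nn(
    model="zuko_maf", z_score_theta="independent", z_score_x="independent"
)
inference = NPE(prior=prior, density_estimator=density_estimator)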

gmoss13 avatar Jul 22 '25 09:07 gmoss13

My two cents:

  1. Good point by @gmoss13, I remember we had a z-scoring bug in the zuko flow builders, see https://github.com/sbi-dev/sbi/pull/1492. This is not released yet, so maybe an installation from the most recent main fixes the issue?
  2. One could also quickly test this for other tasks using the pytest --bm mini-sbibm, see https://github.com/sbi-dev/sbi/pull/1335 (commands sketched below).
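
Hypothetical shell commands for both points; the exact pytest invocation may differ, so check the linked PRs:

pip install git+https://github.com/sbi-dev/sbi.git@main    # unreleased fixes from main
pytest --bm                                                # mini-sbibm benchmark run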

janfb avatar Jul 24 '25 09:07 janfb

I have basically run all the tests I could. I always use the default zuko_maf and maf builders. Everything else being equal in my runs of multiple-round inference, the best way for me to show that zuko_maf behaves very differently from maf is to look at the training history. Here you can see the erratic behavior of zuko_maf. maf is clearly more stable and, coupled with importance sampling afterwards, yields much better results than zuko_maf. Thanks for your investigations!

SIXSA MAF MRI ROUND 5_training_history.pdf

SIXSA ZUKO_MAF MRI ROUND 5_training_history.pdf

The reason I believe this is noteworthy is that the documentation invites users to consider zuko_maf here: https://sbi.readthedocs.io/en/latest/advanced_tutorials/03_density_estimators.html

dbxifu avatar Sep 16 '25 06:09 dbxifu

Hi all 👋 Zuko's code is well tested, so this is quite surprising. I would like to fix this issue.

I think @manuelgloeckler has a good point that Zuko's exponential parametrization could explain the difference. softplus prevents very "sharp" distributions, which could mean more stable training. There is actually a hyper-parameter in MonotonicAffineTransform that sets the maximum/minimum slope (Jacobian) of the transform, which could also help; see the sketch below. However, the z-scoring bug mentioned by @gmoss13 is likely to lead to training issues too.
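
A hypothetical sketch of that knob; the argument name slope and its usage here are assumptions based on the comment above, so check the Zuko docs for the exact signature:

import torch
from zuko.transforms import MonotonicAffineTransform

# slope is assumed to bound the transform's Jacobian away from 0 and infinity,
# which should prevent overly sharp conditionals.
t = MonotonicAffineTransform(torch.zeros(2), torch.zeros(2), slope=1e-3)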

I will do some tests on my side, but let me know if you find the issue!

francois-rozet avatar Oct 01 '25 22:10 francois-rozet

I ran some experiments on simple datasets and could not reproduce this issue. Zuko's and nflows's MAF (without using sbi) have very similar training dynamics.

It should be noted, however, that nflows uses residual hypernetworks by default, which could be another reason for the behavior observed by @dbxifu. Compare the parameter counts:

>>> flow = zuko.flows.MAF(features=2, transforms=3, hidden_features=(64, 64))
>>> sum(p.numel() for p in flow.parameters())
13836
>>> flow = nflows.flows.MaskedAutoregressiveFlow(features=2, num_layers=3, hidden_features=64, num_blocks_per_layer=2)
>>> sum(p.numel() for p in flow.parameters())
51276
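
For a more like-for-like comparison, one could (hypothetically) switch off nflows' residual blocks via its use_residual_blocks flag and count again:

>>> flow = nflows.flows.MaskedAutoregressiveFlow(
...     features=2, num_layers=3, hidden_features=64,
...     num_blocks_per_layer=2, use_residual_blocks=False)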

@dbxifu Could you ensure that sbi is up to date? As mentioned by @janfb, some bugs were fixed recently.

francois-rozet avatar Oct 03 '25 18:10 francois-rozet