nnpdf
nnpdf copied to clipboard
Produce experimental covmat as a pandas dataframe
On doing #1528 we've discovered that things are not always loaded in the order one necessarily expects. In addition during Friday meeting we have been told that this has been already a problem in the past.
At the moment (in #1528) the theory covmat is loaded as a pandas dataframe and afterwards reordered according to dataset_inputs. I believe both the theory covmat and the experimental covmat should stay as pandas dataframe throughout.
Note that for covmat this is way less problematic than it is for fktable, since covmats are 2D anyway so there's no advantage representation-wise to having numpy arrays, it can be numpyed at any point.
see https://github.com/NNPDF/nnpdf/issues/1547#issuecomment-1091965421 for a course of action
I am a bit confused as to what is going on, but I guess will have to make some time tomorrow to follow the discussion. That said, I thought we had all that figured out in 2020. Is there no code that saves and loads the covmats with their indices properly?
E.g. I remember this was discussed quite a bit at the time https://github.com/NNPDF/nnpdf/blob/c4084ded70721dbe550d2bce47df9cb8c8978838/validphys2/src/validphys/theorycovariance/construction.py#L418 has it been considered as a solution to this issue? Sorry if it was discussed somewhere.
I am a bit confused as to what is going on, but I guess will have to make some time tomorrow to follow the discussion
Tomorrow?
Is there no code that saves and loads the covmats with their indices properly?
The experimental covmat was never loaded or saved was it? And anyway, we always have it as a numpy array https://github.com/NNPDF/nnpdf/blob/c4084ded70721dbe550d2bce47df9cb8c8978838/validphys2/src/validphys/covmats.py#L230 I don't even know if it gets made into a dataframe at any point.
I think it would be nice to, instead of making the thcovmat into a numpy array with the right ordering, make the exp covmat into a dataframe and only the final product be made into a numpy array.
That said, I thought we had all that figured out in 2020
Maybe it was the phone call you were sick? But we were told that there had been many problems with the ordering of the covmat in the past. Tbh didn´t know that function existed but seems too complicated given that we have all the information we need for a (two) liner (like in #1528)
There is groups_covmat which s a dataframe. And a bunch of stuff in results.py. I thought these were all to remove the indexing problems, but can't say I remember how it worked...
I guess the solution to this issue then is to find the right covmat function then and use that for the fit and the replicas.
FWIW there are some relevant docs here
https://docs.nnpdf.science/tutorials/general_th_covmat.html?highlight=external
In particular it seems to me that there is the issue of how to wire the existing covmats using the functionality described there rather than building new stuff.
Those are only for nnfit I'm afraid (for vp they are either outdated like bigexp or are not used at all, like the sampling flag)
Isn't the indexing stuff taken care of? I guess what I am asking is what do you think needs to be (re)implemented and what needs rewiring. Mostly on account of not remembering the state of affairs at this point.
For the thcovmat that's cared for in #1528, @andreab1997 didn't have to reinvent (many) wheels, only implement all the sampling flags that were actually only used in nnfit. The reordering is so trivial I can understand it (which talking about pandas df it must mean it's very trivial)
For the experimental covmat (which is what this issue is about) we are currently use numpy arrays with no indices. I would like to instead keep an indexed covmat all the way until it is inverted. Since you pointed me to functions already doing that and it is a QoL thing for me I'll try to do it t some point after the thcovmat is merged and well tested (don't want to add more sources of uncertainties at this point :laughing:)
FWIW I wholly agree that covmats, particularly group ones should always fly around as dataframes.