nnpdf icon indicating copy to clipboard operation
nnpdf copied to clipboard

Produce experimental covmat as a pandas dataframe

Open scarlehoff opened this issue 3 years ago • 10 comments

On doing #1528 we've discovered that things are not always loaded in the order one necessarily expects. In addition during Friday meeting we have been told that this has been already a problem in the past.

At the moment (in #1528) the theory covmat is loaded as a pandas dataframe and afterwards reordered according to dataset_inputs. I believe both the theory covmat and the experimental covmat should stay as pandas dataframe throughout. Note that for covmat this is way less problematic than it is for fktable, since covmats are 2D anyway so there's no advantage representation-wise to having numpy arrays, it can be numpyed at any point.

see https://github.com/NNPDF/nnpdf/issues/1547#issuecomment-1091965421 for a course of action

scarlehoff avatar Apr 04 '22 08:04 scarlehoff

I am a bit confused as to what is going on, but I guess will have to make some time tomorrow to follow the discussion. That said, I thought we had all that figured out in 2020. Is there no code that saves and loads the covmats with their indices properly?

Zaharid avatar Apr 07 '22 16:04 Zaharid

E.g. I remember this was discussed quite a bit at the time https://github.com/NNPDF/nnpdf/blob/c4084ded70721dbe550d2bce47df9cb8c8978838/validphys2/src/validphys/theorycovariance/construction.py#L418 has it been considered as a solution to this issue? Sorry if it was discussed somewhere.

Zaharid avatar Apr 07 '22 16:04 Zaharid

I am a bit confused as to what is going on, but I guess will have to make some time tomorrow to follow the discussion

Tomorrow?

Is there no code that saves and loads the covmats with their indices properly?

The experimental covmat was never loaded or saved was it? And anyway, we always have it as a numpy array https://github.com/NNPDF/nnpdf/blob/c4084ded70721dbe550d2bce47df9cb8c8978838/validphys2/src/validphys/covmats.py#L230 I don't even know if it gets made into a dataframe at any point.

I think it would be nice to, instead of making the thcovmat into a numpy array with the right ordering, make the exp covmat into a dataframe and only the final product be made into a numpy array.

That said, I thought we had all that figured out in 2020

Maybe it was the phone call you were sick? But we were told that there had been many problems with the ordering of the covmat in the past. Tbh didn´t know that function existed but seems too complicated given that we have all the information we need for a (two) liner (like in #1528)

scarlehoff avatar Apr 07 '22 16:04 scarlehoff

There is groups_covmat which s a dataframe. And a bunch of stuff in results.py. I thought these were all to remove the indexing problems, but can't say I remember how it worked...

Zaharid avatar Apr 07 '22 16:04 Zaharid

I guess the solution to this issue then is to find the right covmat function then and use that for the fit and the replicas.

scarlehoff avatar Apr 07 '22 16:04 scarlehoff

FWIW there are some relevant docs here

https://docs.nnpdf.science/tutorials/general_th_covmat.html?highlight=external

In particular it seems to me that there is the issue of how to wire the existing covmats using the functionality described there rather than building new stuff.

Zaharid avatar Apr 08 '22 08:04 Zaharid

Those are only for nnfit I'm afraid (for vp they are either outdated like bigexp or are not used at all, like the sampling flag)

scarlehoff avatar Apr 08 '22 08:04 scarlehoff

Isn't the indexing stuff taken care of? I guess what I am asking is what do you think needs to be (re)implemented and what needs rewiring. Mostly on account of not remembering the state of affairs at this point.

Zaharid avatar Apr 08 '22 08:04 Zaharid

For the thcovmat that's cared for in #1528, @andreab1997 didn't have to reinvent (many) wheels, only implement all the sampling flags that were actually only used in nnfit. The reordering is so trivial I can understand it (which talking about pandas df it must mean it's very trivial)

For the experimental covmat (which is what this issue is about) we are currently use numpy arrays with no indices. I would like to instead keep an indexed covmat all the way until it is inverted. Since you pointed me to functions already doing that and it is a QoL thing for me I'll try to do it t some point after the thcovmat is merged and well tested (don't want to add more sources of uncertainties at this point :laughing:)

scarlehoff avatar Apr 08 '22 08:04 scarlehoff

FWIW I wholly agree that covmats, particularly group ones should always fly around as dataframes.

Zaharid avatar May 05 '22 07:05 Zaharid