chainladder-python
Exception raised with .valuation_correlation().z_critical
Hello,
I received the following traceback when trying to run the .valuation_correlation method:
Traceback (most recent call last):
  File "/usr/lib/python3.10/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "/home/ubuntu/chainladder-python/chainladder/core/display.py", line 18, in __repr__
    data = self._repr_format()
  File "/home/ubuntu/chainladder-python/chainladder/core/display.py", line 87, in _repr_format
    return pd.DataFrame(out, index=origin, columns=development)
  File "/home/ubuntu/chainladder-python/venv/lib/python3.10/site-packages/pandas/core/frame.py", line 694, in __init__
    mgr = ndarray_to_mgr(
  File "/home/ubuntu/chainladder-python/venv/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 351, in ndarray_to_mgr
    _check_values_indices_shape_match(values, index, columns)
  File "/home/ubuntu/chainladder-python/venv/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 422, in _check_values_indices_shape_match
    raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (1, 10), indices imply (1, 9)
The code used to trigger the exception is below, although I did have to execute the last line twice to see the output in the console:
import chainladder as cl
import pandas as pd
df_xyz = pd.read_csv('friedland_xyz_auto_bi.csv')
tri_xyz = cl.Triangle(
    data=df_xyz,
    origin='Accident Year',
    development='Calendar Year',
    columns=['Paid Claims'],
    cumulative=True
)
res = tri_xyz.valuation_correlation(p_critical=0.1, total=False).z_critical
The data are from the XYZ example in the Friedland paper, a csv of which can be found here:
https://github.com/casact/FASLR/blob/main/faslr/samples/friedland_xyz_auto_bi.csv
Notably, this is an incomplete triangle as the first two diagonals are missing, which I think may be the reason for the exception:
12 24 36 48 60 72 84 96 108 120 132
1998 NaN NaN 6309.0 8521.0 10082.0 11620.0 13242.0 14419.0 15311.0 15764.0 15822.0
1999 NaN 4666.0 9861.0 13971.0 18127.0 22032.0 23511.0 24146.0 24592.0 24817.0 NaN
2000 1302.0 6513.0 12139.0 17828.0 24030.0 28853.0 33222.0 35902.0 36782.0 NaN NaN
2001 1539.0 5952.0 12319.0 18609.0 24387.0 31090.0 37070.0 38519.0 NaN NaN NaN
2002 2318.0 7932.0 13822.0 22095.0 31945.0 40629.0 44437.0 NaN NaN NaN NaN
2003 1743.0 6240.0 12683.0 22892.0 34505.0 39320.0 NaN NaN NaN NaN NaN
2004 2221.0 9898.0 25950.0 43439.0 52811.0 NaN NaN NaN NaN NaN NaN
2005 3043.0 12219.0 27073.0 40026.0 NaN NaN NaN NaN NaN NaN NaN
2006 3531.0 11778.0 22819.0 NaN NaN NaN NaN NaN NaN NaN NaN
2007 3529.0 11865.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2008 3409.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I'm wondering: what should the output be in this case? I'm not too familiar with the details of the method, so I'm not sure whether we should expect the whole thing to fail, or just get NaNs for the year or years containing missing data points. If it's the former, maybe the error message could state that the test cannot be performed with missing diagonals; if it's the latter, we could output results for the years that can be computed, along with a warning that not all years could be computed due to missing data.
Thanks,
Gene
I did some digging and I think the following line directly causes the exception:
https://github.com/casact/chainladder-python/blob/3ef523ab2c4e041b4665e07b4de9fcb4c258b205/chainladder/core/correlation.py#L148
The test results in an array of 10 boolean values, which is what we'd expect, but because of the missing data in the triangle, this line drops two years instead of one and produces a single-row triangle with only 9 development periods. The 10-element array is then assigned back to that triangle's values, so we end up with a dimension mismatch.
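The mismatch can be reproduced without the library. Below is a minimal sketch, with made-up index and column labels, of what `_repr_format` effectively ends up asking the pandas `DataFrame` constructor to do:

```python
import numpy as np
import pandas as pd

# Stand-in for the 10-element boolean result of the test (values made up)
z_flags = np.zeros((1, 10), dtype=bool)

# After two origin years are dropped instead of one, the surviving triangle
# implies only 9 development periods, so the DataFrame constructor balks.
origin = ["(All)"]
development = list(range(1, 10))  # 9 labels for 10 columns of values

msg = ""
try:
    pd.DataFrame(z_flags, index=origin, columns=development)
except ValueError as err:
    msg = str(err)
print(msg)  # Shape of passed values is (1, 10), indices imply (1, 9)
```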
This raises another question of mine: the code executes without raising an exception the first time, and it's only when I try to access the attribute .z_critical that the exception is raised. I think the first call succeeds because assigning an array to the triangle's .values attribute seems to be a valid operation even when the array has more elements than the number of development periods implies. Should this be the case?
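To illustrate the deferred failure, here is a toy class of my own (not the library's implementation): when an object validates its values only at display time, a bad assignment succeeds silently and the error only surfaces on the next repr:

```python
import numpy as np

class LazyTriangle:
    """Toy stand-in for a triangle that checks its shape only on display."""

    def __init__(self, n_dev):
        self.n_dev = n_dev
        self.values = np.zeros((1, n_dev))

    def __repr__(self):
        # Validation happens here, not at assignment time
        if self.values.shape[-1] != self.n_dev:
            raise ValueError("values no longer match development periods")
        return repr(self.values)

tri = LazyTriangle(n_dev=9)
tri.values = np.zeros((1, 10))  # succeeds silently, like the real assignment
# repr(tri)                     # only this statement would raise the ValueError
```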
If we were to rewrite this line so that it drops just one year, a result can be obtained without a ValueError. However, I'm not sure whether this would be semantically correct until I've read the Mack 1997 paper myself.
Thanks for finding this and researching it. I too would need to go back to the paper. Actuarial theory aside, executing twice with different responses is a problem. The state of a triangle must be consistent after every transaction, and it sounds like that is being violated here.
I think I can take on the correlation issue with a PR. I'll keep an eye out for the .values assignment issue and may open another issue if I can pinpoint what's going on.
According to the paper (starting from Appendix H), we're checking each diagonal:
Therefore, in order to check for such calendar year influences we only have to subdivide all development factors into 'smaller' and 'larger' ones and then to examine whether there are diagonals where the small development factors or the large ones clearly prevail
So if we have two missing diagonals, as in our example, technically we are still able to carry out the remaining calculations: we can still classify the available link ratios as large or small, much like the example from the paper, but with some blank values in the upper-left corner.
But whether we should is another question: for our first two columns, the median link ratio could differ from what it would have been had the missing data been available, potentially changing which ratios are classified as "S" or "L".
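As a sketch of the classification step (the function name and frame layout are my own, not the library's): labeling each available link ratio against its column median naturally leaves missing cells, and median ties, unlabeled:

```python
import numpy as np
import pandas as pd

def classify_link_ratios(tri: pd.DataFrame) -> pd.DataFrame:
    """Label each age-to-age factor 'S' (below column median) or 'L' (above).

    Median ties and cells touching missing data are left blank.
    Illustrative sketch only, not the library's implementation.
    """
    ratios = (tri.shift(-1, axis=1) / tri).iloc[:, :-1]  # age-to-age factors
    medians = ratios.median(axis=0)                      # per-column medians
    labels = pd.DataFrame("", index=ratios.index, columns=ratios.columns)
    labels[ratios < medians] = "S"
    labels[ratios > medians] = "L"
    return labels

# Tiny cumulative triangle with a missing upper-left corner
tri = pd.DataFrame(
    [[np.nan, 3.0, 6.0],
     [1.0, 4.0, 8.0],
     [2.0, 5.0, np.nan],
     [3.0, np.nan, np.nan]],
    index=[1998, 1999, 2000, 2001], columns=[12, 24, 36],
)
print(classify_link_ratios(tri))
```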
However, the same could be said even of a full triangle, for example if we willingly excluded accident years older than those on the triangle. For this reason, I believe we should choose option 2 of the following proposals for dealing with the issue:
1. Throw an exception, notifying the user that the test cannot be performed due to missing data.
2. Perform the calculation, with CY 1999 labeled as "NA", and warn the user about the missing data.
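Option 2 could look something like the sketch below. The function name, label layout, and diagonal bookkeeping are hypothetical simplifications, not the library's internals: compute Z = min(S, L) per diagonal, but return None plus a warning for any diagonal that touches missing data.

```python
import warnings

def z_by_diagonal(labels):
    """labels[i][j]: 'S', 'L', '' (median tie) or None (missing data).

    Returns {diagonal index: min(S, L)}, with None where data are missing.
    Illustrative sketch only; the indexing convention is an assumption.
    """
    n_rows, n_cols = len(labels), len(labels[0])
    results = {}
    for d in range(1, n_rows + n_cols - 1):  # anti-diagonals i + j == d,
        cells = [labels[i][d - i]            # skipping the lone corner cell
                 for i in range(n_rows) if 0 <= d - i < n_cols]
        if any(c is None for c in cells):
            warnings.warn(f"diagonal {d}: Z not computed due to missing data")
            results[d] = None
        else:
            results[d] = min(cells.count("S"), cells.count("L"))
    return results

labels = [
    ["S", "L", "L"],
    ["L", "S", None],
    [None, None, None],
]
print(z_by_diagonal(labels))  # the incomplete diagonals come back as None
```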
Just catching up on this, and I agree with you, we should go with option 2.
One special case we have to look out for is a diagonal filled entirely, or almost entirely, with median cases. In that situation, I don't think we will be able to carry out the hypothesis test, and we will need an NA value for the result and/or a warning.
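For reference, the moments of Z under the null hypothesis, as I read them from the paper's appendix (please verify against the original), break down when n = S + L = 0, which is exactly the all-median-diagonal case:

```python
import math

def z_moments(n):
    """Mean and variance of Z = min(S, L) under the null, where n = S + L.

    Formulas transcribed from the paper's appendix (please double-check):
    E[Z] = n/2 - C(n-1, m) * n / 2**n, with m = floor((n - 1) / 2).
    Returns None when the diagonal carries no S/L labels at all, i.e.
    every cell on it was a median tie.
    """
    if n == 0:
        return None  # degenerate diagonal: the test cannot be performed
    m = (n - 1) // 2
    ez = n / 2 - math.comb(n - 1, m) * n / 2 ** n
    var = (n * (n - 1) / 4
           - math.comb(n - 1, m) * n * (n - 1) / 2 ** n
           + ez - ez ** 2)
    return ez, var

print(z_moments(0))  # None: an all-median diagonal yields no test statistic
print(z_moments(2))  # (0.5, 0.25), matching direct enumeration of min(S, L)
```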
I'll try to construct a test triangle for this.