chainladder-python
Exception raised with .valuation_correlation().z_critical
Hello,
I received the following traceback when trying to run the .valuation_correlation method:
Traceback (most recent call last):
  File "/usr/lib/python3.10/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "/home/ubuntu/chainladder-python/chainladder/core/display.py", line 18, in __repr__
    data = self._repr_format()
  File "/home/ubuntu/chainladder-python/chainladder/core/display.py", line 87, in _repr_format
    return pd.DataFrame(out, index=origin, columns=development)
  File "/home/ubuntu/chainladder-python/venv/lib/python3.10/site-packages/pandas/core/frame.py", line 694, in __init__
    mgr = ndarray_to_mgr(
  File "/home/ubuntu/chainladder-python/venv/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 351, in ndarray_to_mgr
    _check_values_indices_shape_match(values, index, columns)
  File "/home/ubuntu/chainladder-python/venv/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 422, in _check_values_indices_shape_match
    raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (1, 10), indices imply (1, 9)
The code used to trigger the exception is below, although I did have to execute the last line twice to see the output in the console:
import chainladder as cl
import pandas as pd
df_xyz = pd.read_csv('friedland_xyz_auto_bi.csv')
tri_xyz = cl.Triangle(
    data=df_xyz,
    origin='Accident Year',
    development='Calendar Year',
    columns=['Paid Claims'],
    cumulative=True
)
res = tri_xyz.valuation_correlation(p_critical=0.1, total=False).z_critical
The data are from the XYZ example in the Friedland paper, a csv of which can be found here:
https://github.com/casact/FASLR/blob/main/faslr/samples/friedland_xyz_auto_bi.csv
Notably, this is an incomplete triangle as the first two diagonals are missing, which I think may be the reason for the exception:
12 24 36 48 60 72 84 96 108 120 132
1998 NaN NaN 6309.0 8521.0 10082.0 11620.0 13242.0 14419.0 15311.0 15764.0 15822.0
1999 NaN 4666.0 9861.0 13971.0 18127.0 22032.0 23511.0 24146.0 24592.0 24817.0 NaN
2000 1302.0 6513.0 12139.0 17828.0 24030.0 28853.0 33222.0 35902.0 36782.0 NaN NaN
2001 1539.0 5952.0 12319.0 18609.0 24387.0 31090.0 37070.0 38519.0 NaN NaN NaN
2002 2318.0 7932.0 13822.0 22095.0 31945.0 40629.0 44437.0 NaN NaN NaN NaN
2003 1743.0 6240.0 12683.0 22892.0 34505.0 39320.0 NaN NaN NaN NaN NaN
2004 2221.0 9898.0 25950.0 43439.0 52811.0 NaN NaN NaN NaN NaN NaN
2005 3043.0 12219.0 27073.0 40026.0 NaN NaN NaN NaN NaN NaN NaN
2006 3531.0 11778.0 22819.0 NaN NaN NaN NaN NaN NaN NaN NaN
2007 3529.0 11865.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2008 3409.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I'm wondering: what should the output be in this case? I'm not too familiar with the details of the method, so I'm not sure whether we should expect the whole thing to fail, or just get NaNs for the year or years containing missing data points. If it's the former, maybe the error message could state that the test cannot be performed with missing diagonals; if it's the latter, we could output results for the years that can be computed, along with a warning that not all years could be computed due to missing data.
Thanks,
Gene
I did some digging and I think the following line directly causes the exception:
https://github.com/casact/chainladder-python/blob/3ef523ab2c4e041b4665e07b4de9fcb4c258b205/chainladder/core/correlation.py#L148
The test results in an array of 10 boolean values, which is what we'd expect, but because of the missing data in the triangle, this line drops two years instead of one and produces a single-row triangle with only 9 development periods. The 10-element array is then assigned back to that triangle's values, so we end up with a dimension mismatch.
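The mismatch can be reproduced without the library. Below is a minimal sketch, with made-up index and column labels, of what `_repr_format` effectively ends up asking the pandas `DataFrame` constructor to do:

```python
import numpy as np
import pandas as pd

# Stand-in for the 10-element boolean result of the test (values made up)
z_flags = np.zeros((1, 10), dtype=bool)

# After two origin years are dropped instead of one, the surviving triangle
# implies only 9 development periods, so the DataFrame constructor balks.
origin = ["(All)"]
development = list(range(1, 10))  # 9 labels for 10 columns of values

msg = ""
try:
    pd.DataFrame(z_flags, index=origin, columns=development)
except ValueError as err:
    msg = str(err)
print(msg)  # Shape of passed values is (1, 10), indices imply (1, 9)
```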
This raises another question of mine: the code executes without raising an exception the first time, and it's only when I try to access the attribute .z_critical that the exception is raised. I think the first call succeeds because assigning an array to the triangle's .values attribute seems to be a valid operation even when the array has more elements than the number of development periods implies. Should this be the case?
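To illustrate the deferred failure, here is a toy class of my own (not the library's implementation): when an object validates its values only at display time, a bad assignment succeeds silently and the error only surfaces on the next repr:

```python
import numpy as np

class LazyTriangle:
    """Toy stand-in for a triangle that checks its shape only on display."""

    def __init__(self, n_dev):
        self.n_dev = n_dev
        self.values = np.zeros((1, n_dev))

    def __repr__(self):
        # Validation happens here, not at assignment time
        if self.values.shape[-1] != self.n_dev:
            raise ValueError("values no longer match development periods")
        return repr(self.values)

tri = LazyTriangle(n_dev=9)
tri.values = np.zeros((1, 10))  # succeeds silently, like the real assignment
# repr(tri)                     # only this statement would raise the ValueError
```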
If we were to rewrite this line so that it drops just one year, a result can be obtained without a ValueError. However, I'm not sure whether this would be semantically correct until I've read the Mack 1997 paper myself.
Thanks for finding this and researching it. I too would need to go back to the paper. Actuarial theory aside, executing twice with different responses is a problem. The state of a triangle must be consistent after every transaction, and it sounds like that is being violated here.
I think I can take on the correlation issue with a PR. I'll keep an eye out for the .values assignment issue and may open another issue if I can pinpoint what's going on.
According to the paper (starting from Appendix H), we're checking each diagonal:
Therefore, in order to check for such calendar year influences we only have to subdivide all development factors into 'smaller' and 'larger' ones and then to examine whether there are diagonals where the small development factors or the large ones clearly prevail
So if we have two missing diagonals, as in our example, technically we are still able to carry out the remaining calculations: we can still classify the available link ratios as large or small, much like the example from the paper, but with some blank values in the upper-left corner.
But whether we should is another question: for our first two columns, the median link ratio could differ from what it would have been had the missing data been available, potentially changing which ratios are classified as "S" or "L".
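As a sketch of the classification step (the function name and frame layout are my own, not the library's): labeling each available link ratio against its column median naturally leaves missing cells, and median ties, unlabeled:

```python
import numpy as np
import pandas as pd

def classify_link_ratios(tri: pd.DataFrame) -> pd.DataFrame:
    """Label each age-to-age factor 'S' (below column median) or 'L' (above).

    Median ties and cells touching missing data are left blank.
    Illustrative sketch only, not the library's implementation.
    """
    ratios = (tri.shift(-1, axis=1) / tri).iloc[:, :-1]  # age-to-age factors
    medians = ratios.median(axis=0)                      # per-column medians
    labels = pd.DataFrame("", index=ratios.index, columns=ratios.columns)
    labels[ratios < medians] = "S"
    labels[ratios > medians] = "L"
    return labels

# Tiny cumulative triangle with a missing upper-left corner
tri = pd.DataFrame(
    [[np.nan, 3.0, 6.0],
     [1.0, 4.0, 8.0],
     [2.0, 5.0, np.nan],
     [3.0, np.nan, np.nan]],
    index=[1998, 1999, 2000, 2001], columns=[12, 24, 36],
)
print(classify_link_ratios(tri))
```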
However, the same could be said even of a full triangle, for example if we willingly excluded accident years older than those on the triangle. For this reason, I believe we should choose option 2 of the following proposals for dealing with the issue:
1. Throw an exception, notifying the user that the test cannot be performed due to missing data.
2. Perform the calculation, with CY 1999 labeled as "NA", and warn the user about the missing data.
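Option 2 could look something like the sketch below. The function name, label layout, and diagonal bookkeeping are hypothetical simplifications, not the library's internals: compute Z = min(S, L) per diagonal, but return None plus a warning for any diagonal that touches missing data.

```python
import warnings

def z_by_diagonal(labels):
    """labels[i][j]: 'S', 'L', '' (median tie) or None (missing data).

    Returns {diagonal index: min(S, L)}, with None where data are missing.
    Illustrative sketch only; the indexing convention is an assumption.
    """
    n_rows, n_cols = len(labels), len(labels[0])
    results = {}
    for d in range(1, n_rows + n_cols - 1):  # anti-diagonals i + j == d,
        cells = [labels[i][d - i]            # skipping the lone corner cell
                 for i in range(n_rows) if 0 <= d - i < n_cols]
        if any(c is None for c in cells):
            warnings.warn(f"diagonal {d}: Z not computed due to missing data")
            results[d] = None
        else:
            results[d] = min(cells.count("S"), cells.count("L"))
    return results

labels = [
    ["S", "L", "L"],
    ["L", "S", None],
    [None, None, None],
]
print(z_by_diagonal(labels))  # the incomplete diagonals come back as None
```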
Just catching up on this, and I agree with you, we should go with option 2.
One special case we have to look out for is a diagonal filled entirely, or almost entirely, with median cases. In that situation, I don't think we will be able to carry out the hypothesis test, and we will need an NA value for the result and/or a warning.
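For reference, the moments of Z under the null hypothesis, as I read them from the paper's appendix (please verify against the original), break down when n = S + L = 0, which is exactly the all-median-diagonal case:

```python
import math

def z_moments(n):
    """Mean and variance of Z = min(S, L) under the null, where n = S + L.

    Formulas transcribed from the paper's appendix (please double-check):
    E[Z] = n/2 - C(n-1, m) * n / 2**n, with m = floor((n - 1) / 2).
    Returns None when the diagonal carries no S/L labels at all, i.e.
    every cell on it was a median tie.
    """
    if n == 0:
        return None  # degenerate diagonal: the test cannot be performed
    m = (n - 1) // 2
    ez = n / 2 - math.comb(n - 1, m) * n / 2 ** n
    var = (n * (n - 1) / 4
           - math.comb(n - 1, m) * n * (n - 1) / 2 ** n
           + ez - ez ** 2)
    return ez, var

print(z_moments(0))  # None: an all-median diagonal yields no test statistic
print(z_moments(2))  # (0.5, 0.25), matching direct enumeration of min(S, L)
```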
I'll try to construct a test triangle for this.