Handling of observations with birth_time==death_time
import lifelines

kmf = lifelines.KaplanMeierFitter()
# The first subject enters at t=1 and dies at t=1 (birth_time == death_time).
kmf.fit([1, 2], event_observed=[1, 0], entry=[1, 0])
print(kmf.survival_function_)
Expected:
          KM_estimate
timeline
0.0               1.0
1.0               0.5
2.0               0.5
Actual:
          KM_estimate
timeline
0.0               1.0
1.0               0.0
2.0               0.0
I've read https://github.com/CamDavidsonPilon/lifelines/issues/497 and the corresponding comments in the source code:
# Why subtract entrants like this? see https://github.com/CamDavidsonPilon/lifelines/issues/497
# specifically, we kill people, compute the ratio, and then "add" the entrants.
# This can cause a problem if there are late entrants that enter but population=0, as
# then we have log(0 - 0). We later ffill to fix this.
# The only exception to this rule is the first period, where entrants happen _prior_ to deaths.
But I can't wrap my head around what this is saying. How could entrants not happen prior to deaths? If an observation has birth_time==death_time, does that mean it died before it was born?
I thought that the likelihood contribution of an observation with birth time $b$ and death time $d$ is
- $P(T = d \mid T \ge b)$ for observed events
- $P(T > d \mid T \ge b)$ for unobserved events
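To make the question concrete, here is a minimal hand-rolled sketch of a Kaplan-Meier estimate with late entry. It is not the lifelines implementation; `km_survival` and its `entrants_first` flag are hypothetical names introduced only for illustration. With `entrants_first=True`, a subject is in the risk set at time t when entry <= t <= duration, which matches the conditioning on $T \ge b$ above. With `entrants_first=False`, the entry inequality is strict, so a subject entering at t is not yet at risk at t.

```python
def km_survival(durations, events, entries, entrants_first=True):
    """Illustrative Kaplan-Meier sketch with late entry (not lifelines code).

    entrants_first=True  -> at risk at t if entry <= t <= duration
    entrants_first=False -> at risk at t if entry <  t <= duration
    Returns a dict mapping each observed time to the survival estimate.
    """
    survival = 1.0
    curve = {}
    for t in sorted(set(durations)):
        # Observed deaths at t are counted regardless of the risk-set convention.
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        if entrants_first:
            at_risk = sum(1 for d, b in zip(durations, entries) if b <= t <= d)
        else:
            at_risk = sum(1 for d, b in zip(durations, entries) if b < t <= d)
        # Skip the update when nobody is at risk to avoid dividing by zero.
        if at_risk > 0:
            survival *= 1.0 - deaths / at_risk
        curve[t] = survival
    return curve


# Same data as in the report: subject 1 enters and dies at t=1.
print(km_survival([1, 2], [1, 0], [1, 0], entrants_first=True))   # {1: 0.5, 2: 0.5}
print(km_survival([1, 2], [1, 0], [1, 0], entrants_first=False))  # {1: 0.0, 2: 0.0}
```

Under these assumptions, the inclusive convention gives the expected 0.5 and the strict convention gives 0.0, the same values as the actual output. The sketch omits the t=0 row that lifelines adds to the timeline, and it makes no claim about how lifelines arrives at its numbers internally.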
This is an interesting issue, and I want to agree with your expected case. However, I'm also inclined to reject the case birth_time==death_time as pathological for lifelines. Based on that highlighted comment, it sounds like the birth time is effectively treated as birth_time + $\epsilon$. So if you want a true birth_time==death_time, you would add an epsilon to the death time:
import lifelines

kmf = lifelines.KaplanMeierFitter()
# Nudge the first death time just past its entry time (1 -> 1 + 1e-10).
kmf.fit([1+1e-10, 2], event_observed=[1, 0], entry=[1, 0])
print(kmf.survival_function_)
          KM_estimate
timeline
0.0               1.0
1.0               1.0
1.0               0.5
2.0               0.5
This is terrible and not at all how I expect users to fix this. I'll have to think more about this.