cv19index
inpatient days mismatch
Hello, when reviewing the results, I found that the value of Inpatient Days does not match what I see in the original claims input file: for example, patients with no inpatient visits have 24 inpatient days, or vice versa. After debugging, the problem seems to lie in the part where inpatient_days is created with an index taken from claim_df. This actually selects only the values of date_diff whose index label equals a personId.
preprocessed_df['# of Admissions (12M)'] = inpatient_rows.groupby('personId').admitDate.nunique()
date_diff = pd.to_timedelta(inpatient_rows['dischargeDate'].dt.date - inpatient_rows['admitDate'].dt.date)
inpatient_days = pd.Series(date_diff.dt.days, index=claim_df['personId'])
preprocessed_df['Inpatient Days'] = inpatient_days.groupby('personId').sum()
Example of date_diff:
date_diff.dt.days
10         8
29         2
53         2
56         9
60         2
..
1333281    3
1333325    2   --> if there were a personId == 1333325, their inpatient days would be 2, even though this is the row index of claim_df and is not related to personId.
1333336    10
1333337    5
1333340    5
Length: 74609, dtype: int64
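The misalignment can be reproduced on a few rows. This is a minimal sketch on made-up data, not the package's actual preprocessing: the `inpatient` flag column and all values are hypothetical, and `groupby(level=0)` stands in for grouping on the `personId` index. It relies on the fact that `pd.Series(data, index=...)` reindexes by label when `data` is already a Series.

```python
import pandas as pd

# Hypothetical toy claims; the row index (0..3) is unrelated to personId.
claim_df = pd.DataFrame({
    "personId": [2, 2, 3, 0],
    "inpatient": [True, False, True, False],  # hypothetical flag column
    "admitDate": pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"]),
    "dischargeDate": pd.to_datetime(
        ["2020-01-11", "2020-02-01", "2020-03-06", "2020-04-01"]),
})
inpatient_rows = claim_df[claim_df["inpatient"]]  # keeps row labels 0 and 2

# Length of stay per inpatient row, still labelled by claim_df row index.
date_diff = (inpatient_rows["dischargeDate"].dt.normalize()
             - inpatient_rows["admitDate"].dt.normalize())
days = date_diff.dt.days  # index [0, 2], values [10, 5]

# Same pattern as the reported line: the data is a Series, so pandas looks up
# each personId value as a *row label* of `days` instead of pairing by position.
buggy = pd.Series(days, index=claim_df["personId"])
buggy_total = buggy.groupby(level=0).sum()

print(buggy_total)
```

Here person 0, who has no inpatient stay, ends up with 10 days (taken from row label 0), while person 3, who had a 5-day stay, ends up with 0.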
The claim_df and demo_df were set up as suggested:
- demo_df has a unique row for each patient, with age and gender
- claim_df has one or multiple rows for each patient (only patients with claims are included).
Please let me know if you have any suggestions. Thank you.
If you make this change, does it work correctly?
inpatient_days = pd.Series(date_diff.dt.days, index=inpatient_rows['personId'])
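Because `pd.Series` reindexes by label whenever its data is already a Series, a variant that avoids depending on index labels at all may be worth checking as well. The sketch below is one such option on hypothetical toy data, not a confirmed patch: it groups the day counts directly by the `personId` column of the same rows, so both share the same row labels and pair up row by row.

```python
import pandas as pd

# Hypothetical inpatient rows; the row labels (0, 2, 7) are unrelated to personId.
inpatient_rows = pd.DataFrame(
    {
        "personId": [2, 3, 2],
        "admitDate": pd.to_datetime(["2020-01-01", "2020-03-01", "2020-05-01"]),
        "dischargeDate": pd.to_datetime(["2020-01-11", "2020-03-06", "2020-05-03"]),
    },
    index=[0, 2, 7],
)

date_diff = (inpatient_rows["dischargeDate"].dt.normalize()
             - inpatient_rows["admitDate"].dt.normalize())

# Group by the personId column of the same rows; both Series carry identical
# row labels, so each stay is credited to the patient on that row.
inpatient_days = date_diff.dt.days.groupby(inpatient_rows["personId"]).sum()

# Assigning to a frame indexed by personId aligns on personId; patients with
# no inpatient rows come out as NaN and are filled with 0 here.
preprocessed_df = pd.DataFrame(index=pd.Index([0, 2, 3], name="personId"))
preprocessed_df["Inpatient Days"] = inpatient_days
preprocessed_df["Inpatient Days"] = (
    preprocessed_df["Inpatient Days"].fillna(0).astype(int))

print(preprocessed_df)
```

An equivalent alternative would be to pass plain values instead of a Series, for example `pd.Series(date_diff.dt.days.values, index=inpatient_rows['personId'])`, which pairs values and ids by position.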
Thanks, I have already modified the code so that it works in the meantime, but I was wondering whether this affects how the "inpatient days" feature was created for the test set (if it was created the same way) and then used to generate the risk_score distribution, as described here:
risk_score - This percentile which indicates where this prediction lies in the distribution of predictions on the test set. A value of 95 indicates that the prediction was higher than 95% of the test population, which was designed to be representative of the overall US population.
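To make the concern concrete, a percentile of this kind can be computed roughly as below. This is only an illustration of the description quoted above, not the package's actual code: `percentile_rank` and the sample predictions are hypothetical.

```python
import numpy as np

def percentile_rank(test_predictions, new_prediction):
    """Percent of test-set predictions strictly below new_prediction."""
    sorted_preds = np.sort(np.asarray(test_predictions, dtype=float))
    # Number of reference predictions smaller than the new one.
    below = np.searchsorted(sorted_preds, new_prediction, side="left")
    return 100.0 * below / len(sorted_preds)

# Hypothetical reference distribution of model outputs on a test set.
test_predictions = [0.01, 0.02, 0.05, 0.10, 0.40]
score = percentile_rank(test_predictions, 0.20)
print(score)  # 80.0: four of the five reference predictions are lower
```

Under this reading, the reference distribution depends on the features the test set was scored with, which is why the way "Inpatient Days" was built for that set matters.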
Additionally, we observed this difference, but just to confirm: will the xgboost_all_age model give a higher risk_score than the xgboost model, which was trained on Medicare members only? Have you compared the risk_score of the two models on the same population, for example Medicare members?
Thank you.