wtte-rnn
Question about preprocessing functions
Hi,
I've two questions regarding the preprocessing functions:
- Regarding `prep_tensors`, the lines
y = y[:,1:]
x = np.roll(x, shift=1, axis=1)[:,1:,]
simply throw away the first event, right? Is this necessary? In my data, a significant portion of the churners churn at the beginning, and I'd be happy to try to predict these as well.
- Regarding the `nanmask_to_keras_mask` function: as far as I understand, the `y` variable returned by this function has dimension `(n_subjects, t_timesteps, 2)`, such that `y[i]` is the matrix whose rows are the different timesteps and whose columns are time-to-event and censoring indicator, respectively, for subject `i`. In my data, each subject is either churned or not churned (no recurrent events). This means that for each subject, the second column is either all ones (if it's a churned subject) or all zeros (if it's a censored subject); this, of course, without taking into account the 0.95 mask. Is this the correct input format for training the model?
Hi, great questions. You understood it right: those lines throw away the first timestep. There are alternatives, but I think this was the most generally safe choice.
From the data pipeline template:
# 1. Disalign features and targets otherwise truth is leaked.
# 2. drop first timestep (that we now don't have features for)
# 3. nan-mask the last timestep of features. (that we now don't have targets for)
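events = events[:,1:,]  # drop the first timestep of the event indicator
y = y[:,1:]  # drop the first timestep of the targets
x = np.roll(x, shift=1, axis=1)[:,1:,]  # shift features one step later, drop the wrapped-around first step
x = x + 0*np.expand_dims(events,-1)  # propagate any NaNs in events into x (0*nan is nan)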
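To see what the roll does, here is a minimal sketch that runs the same four operations on toy arrays. The shapes, the values and the NaN-padded `events` row are assumptions made up for illustration, not data from the template.

```python
import numpy as np

# Toy data (assumed): 2 subjects, 4 timesteps, 1 feature.
x = np.arange(8, dtype=float).reshape(2, 4, 1)         # x[i, t, 0] = 4*i + t
y = np.arange(16, dtype=float).reshape(2, 4, 2) + 100  # stand-in targets
events = np.array([[0., 1., 0., np.nan],               # NaN marks padding (assumed)
                   [1., 0., 0., 0.]])

# Same operations as in the template snippet.
events_s = events[:, 1:,]
y_s = y[:, 1:]
x_s = np.roll(x, shift=1, axis=1)[:, 1:,]
x_s = x_s + 0 * np.expand_dims(events_s, -1)

# Features at new step j are the original features at step j,
# while targets at new step j are the original targets at step j + 1.
print(x_s[1, :, 0])  # features from original steps 0, 1, 2
print(y_s[1, :, 0])  # targets from original steps 1, 2, 3
print(x_s[0, :, 0])  # last step is NaN because events was NaN there
```

So each target is paired with the features from the step before it, and a NaN in `events` turns the matching feature step into NaN.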
The most thorough explanation can be found here
- If a customer purchases something ("event") at 13.30, I can use this as feature input for the 23.59 batch job that predicts when the customer will purchase again (i.e. tomorrow, the day after tomorrow, ...), so we always need to disalign, i.e. roll, the features.
- If we leave an empty feature at the first step, we have a target value and can train, but in cases when `event <-> datapoint`, i.e. sequence birth comes from `event`, it's always `TTE=0`, so it'll overfit.
- If we also track clicks, logins, language etc., then `event -> datapoint` but `datapoint -/-> event`, so now there's uncertainty about TTE and you could probably use the first timestep.
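The overfitting case, where every sequence is born by an event, can be seen on toy data. This is only a sketch: `time_to_next_event` is a hypothetical helper written for this example (not a wtte-rnn function), and it assumes the convention that TTE is 0 on a step that has an event.

```python
import numpy as np

def time_to_next_event(events):
    """Hypothetical helper: steps until the next event, 0 on an event step.
    Steps after the last event get NaN here (censoring is ignored)."""
    tte = np.full(len(events), np.nan)
    next_event = np.nan
    for t in range(len(events) - 1, -1, -1):
        if events[t] == 1:
            next_event = t
        tte[t] = next_event - t
    return tte

# Every sequence is born by an event, so step 0 always has an event.
born_by_event = np.array([[1, 0, 0, 1],
                          [1, 1, 0, 0],
                          [1, 0, 1, 0]])
tte = np.stack([time_to_next_event(row) for row in born_by_event])
print(tte[:, 0])  # the first-step target is 0 for every sequence
```

With a constant target at the first step, the model has nothing to learn there except that constant. This is why dropping the first step is the safer default.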
So TL;DR: in your case (non-recurrent events) it might be safe, but does it make sense for inference? I.e., when does your data arrive?
I guess you want to predict whether there will be an event today? But if at signup at 13.30 we get language, region, signup method etc., this query is going to be tainted by the time of arrival of the data (things like a lower likelihood of an event the later the data arrives that day). I'm not saying it doesn't make sense, I'm saying it adds things to think about 😄
About question 2: Yes this sounds correct!
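
For reference, here is a minimal sketch of the target layout described in question 2. `build_targets` is a hypothetical helper, not part of wtte-rnn. It assumes equal-length sequences, no padding or masking, and that a churned subject's event falls on the last observed timestep.

```python
import numpy as np

def build_targets(n_timesteps, churned):
    """Hypothetical helper: one (n_timesteps, 2) target matrix per subject.

    Column 0: time to the event (churned) or to the end of observation (censored).
    Column 1: 1 if the event was observed, 0 if the subject is censored.
    """
    tte = np.arange(n_timesteps - 1, -1, -1, dtype=float)  # counts down to 0
    u = np.full(n_timesteps, 1.0 if churned else 0.0)
    return np.stack([tte, u], axis=-1)

# Two subjects observed for 4 timesteps each: one churned, one censored.
y = np.stack([build_targets(4, churned=True),
              build_targets(4, churned=False)])

print(y.shape)     # (2, 4, 2)
print(y[0, :, 1])  # all ones for the churned subject
print(y[1, :, 1])  # all zeros for the censored subject
```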