Question about preprocessing functions

Open adam-haber opened this issue 7 years ago • 1 comments

Hi,

I've two questions regarding the preprocessing functions:

Regarding prep_tensors - the lines

    y  = y[:,1:]
    x  = np.roll(x, shift=1, axis=1)[:,1:,]

simply throw away the first event, right? Is this necessary? In my data, a significant portion of the churners churn at the beginning, and I'd be happy to try to predict these as well.

Regarding the nanmask_to_keras_mask function: As far as I understand, the y variable returned by this function is of dimension (n_subjects,t_timesteps,2), such that y[i] is the matrix whose rows are the different times and its columns are time-to-event and censoring indicator, respectively, for subject i. In my data, each subject is either churned or not churned (no recurrent events). This means that for each subject, the second column is either all ones (if it's a churned subject) or all zeros (if it's a censored subject); this, of course, without taking into account the 0.95 mask. Is this the correct input format for training the model?

Dec 28 '17 14:12 adam-haber

Hi, great questions. You understood it right: the first timestep is thrown away. There are alternatives, but I think this was the safest choice in general.

From the data pipeline template:

    # 1. Disalign features and targets, otherwise the truth is leaked.
    # 2. Drop the first timestep (which we now don't have features for).
    # 3. Nan-mask the last timestep of features (which we now don't have targets for).
    events = events[:,1:,]
    y  = y[:,1:]
    x  = np.roll(x, shift=1, axis=1)[:,1:,]
    x  = x + 0*np.expand_dims(events,-1)

The most thorough explanation can be found here

If a customer purchases something (an "event") at 13.30, I can use this as feature input for the 23.59 batch job that predicts when the customer will purchase again (i.e. tomorrow, the day after tomorrow, ...), so we always need to disalign, i.e. roll, the features.
If we leave an empty feature at the first step, we have a target value and can train, but in cases where event <-> datapoint, i.e. the sequence is born from an event, it's always TTE=0, so it'll overfit.
If we also track clicks, logins, language etc., then event -> datapoint but datapoint -/-> event, so now there's uncertainty about the TTE and you could probably use the first timestep.
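To make the disalignment concrete, here is a minimal sketch that runs the same four template lines on a toy array. The values and shapes are made up for illustration (one subject, four timesteps, one feature), and `y` is a plain 2-D stand-in rather than the real target tensor.

```python
import numpy as np

# Hypothetical toy data: 1 subject, 4 timesteps, 1 feature.
x = np.arange(4, dtype=float).reshape(1, 4, 1)   # features 0, 1, 2, 3
y = np.array([[10.0, 11.0, 12.0, 13.0]])         # stand-in targets per timestep
events = np.array([[1.0, 0.0, 1.0, np.nan]])     # nan = timestep without a target

# Same steps as in the template above.
events = events[:, 1:]                     # drop first timestep
y = y[:, 1:]                               # drop first timestep
x = np.roll(x, shift=1, axis=1)[:, 1:, :]  # feature from step t-1 now sits at step t
x = x + 0 * np.expand_dims(events, -1)     # nan in events propagates into x

print(x[0, :, 0])  # feature 0 pairs with target 11, feature 1 with target 12
print(y[0])
```

After the roll, each remaining target is paired with the feature observed one step earlier, and the feature at the nan-marked step becomes nan so it can be masked.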

So TL;DR: in your case (non-recurrent events) it might be safe, but does it make sense for inference? I.e., when does your data arrive?

I guess you want to predict whether there will be an event today? But if at signup at 13.30 we get language, region, signup method etc., this query is going to be tainted by the time of arrival of the data (for example, an event becomes less likely the later in the day the data arrives). I'm not saying it doesn't make sense, I'm saying it adds things to think about 😄

About question 2: Yes this sounds correct!
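As a rough illustration of the format described in question 2, the sketch below builds a target tensor of shape (n_subjects, t_timesteps, 2) by hand for one churned and one censored subject. It does not call any library function, the numbers are invented, and the exact counting convention for time-to-event (here counting down to 0 at the last step) is an assumption.

```python
import numpy as np

n_timesteps = 4
steps = np.arange(n_timesteps, dtype=float)

# Assumed convention: time-to-event counts down to 0 at the last observed step.
tte = (n_timesteps - 1) - steps                              # 3, 2, 1, 0

# Column 0: time-to-event, column 1: censoring indicator (1 = event observed).
churned = np.stack([tte, np.ones(n_timesteps)], axis=-1)     # indicator all ones
censored = np.stack([tte, np.zeros(n_timesteps)], axis=-1)   # indicator all zeros

y = np.stack([churned, censored])   # shape (n_subjects, t_timesteps, 2)
print(y.shape)
```

With non-recurrent events the indicator column is constant within each subject, which matches the description in the question.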

Dec 29 '17 02:12 ragulpr