EasyTemporalPointProcess

[Question] How to sample an event time delta range within [0, 10,000]

Open SiriusHou opened this issue 10 months ago • 6 comments

In your example data, time_since_last_event is always within the range [0, 10]. If my time_since_last_event values can fall anywhere in [0, 10,000], can you guide me on how to sample them?

SiriusHou avatar Feb 03 '25 00:02 SiriusHou

Hi, there is a 'dtime_max' in the thinning algo params that determines the range of the

model_config:
.....
    thinning:
      .....
      dtime_max: 5.    <-------------------- HERE
....
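
For readers unfamiliar with thinning, here is a minimal sketch of why the search window caps the sampled gap. It is a generic textbook thinning loop, not EasyTPP's implementation: `sample_next_dtime`, `slow_process`, the constant intensity, and the choice to return `dtime_max` when no candidate is accepted are all assumptions made for illustration.

```python
import random


def sample_next_dtime(intensity, intensity_upper, dtime_max, rng):
    """Sample one inter-event time by thinning, searching only [0, dtime_max].

    intensity: hypothetical callable, elapsed time -> conditional intensity.
    intensity_upper: constant bound with intensity(t) <= intensity_upper.
    Returns dtime_max when no candidate is accepted inside the window
    (an assumption of this sketch, not a documented library behavior).
    """
    t = 0.0
    while True:
        # Propose the next candidate from a homogeneous process at the bound rate.
        t += rng.expovariate(intensity_upper)
        if t > dtime_max:
            return dtime_max
        # Accept the candidate with probability intensity(t) / intensity_upper.
        if rng.random() * intensity_upper <= intensity(t):
            return t


def slow_process(t):
    # Toy constant intensity: on average one event every 1,000 time units.
    return 0.001


rng = random.Random(0)
narrow = [sample_next_dtime(slow_process, 0.001, 5.0, rng) for _ in range(200)]
wide = [sample_next_dtime(slow_process, 0.001, 10_000.0, rng) for _ in range(200)]
print(max(narrow), sum(wide) / len(wide))
```

With the narrow window no sample can exceed 5, however sparse the events are, which is why a window much smaller than the typical gap in the data would distort the sampled times.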

iLampard avatar Feb 03 '25 03:02 iLampard

Thank you for your answer. In fact, I adjusted this dtime_max but it didn't help.

SiriusHou avatar Feb 03 '25 03:02 SiriusHou

Even after adjusting dtime_max while reproducing the retweet results (#49), the event type prediction accuracy remains lower than when I normalize the delta times in the data to the range [0, 10].
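
The normalization mentioned here could look roughly like the sketch below. The helper names `rescale_dtimes` and `restore_dtimes` and the sample values are hypothetical and not part of EasyTPP; the idea is simply to divide every delta by a constant derived from the training data and to undo that scaling on any predicted delta.

```python
def rescale_dtimes(dtimes, target_max=10.0):
    """Linearly rescale non-negative time deltas so the largest equals target_max.

    Returns the rescaled list and the scale factor, which should be computed
    once on the training split and reused for validation and test data.
    """
    largest = max(dtimes)
    if largest <= 0:
        # All deltas are zero: nothing to rescale.
        return list(dtimes), 1.0
    scale = target_max / largest
    return [dt * scale for dt in dtimes], scale


def restore_dtimes(scaled, scale):
    """Map rescaled deltas (for example model predictions) back to original units."""
    return [dt / scale for dt in scaled]


raw = [0.0, 2_500.0, 10_000.0]  # made-up deltas in the original units
scaled, scale = rescale_dtimes(raw)
restored = restore_dtimes(scaled, scale)
print(scaled, restored)
```

Since the rescaling is linear, it changes only the units of the deltas and not the order of events, so a dtime_max chosen for the [0, 10] range applies directly to the rescaled data.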

SiriusHou avatar Feb 03 '25 03:02 SiriusHou

let me have a look

iLampard avatar Feb 03 '25 04:02 iLampard

I'm not sure I understand your code correctly. Here I found you used pad_token_id to pad time_delta_sequence. Suppose we have 10 event types, but the delta time can be as large as 100. Should we use 100 to pad the time_delta_sequence?

SiriusHou avatar Feb 03 '25 05:02 SiriusHou

Hi,

Ideally, time_delta_sequence would indeed have its own pad token.

The current implementation reuses the event type pad token as a simple workaround. When computing the loss, we use masks derived from type_sequence to exclude padded events, so the pad values in time_delta_sequence never contribute to it.

See https://github.com/ant-research/EasyTemporalPointProcess/blob/main/easy_tpp/model/torch_model/torch_basemodel.py#L110
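
A small sketch of the masking idea, written in plain Python rather than taken from the linked file: `PAD_TOKEN_ID`, `squared_errors`, and `masked_mean` are hypothetical names, and the squared error stands in for whatever per-event loss the model actually uses. It only illustrates the loss computation, not the rest of the forward pass.

```python
PAD_TOKEN_ID = 10  # assumption: 10 event types use ids 0..9, so 10 marks padding


def squared_errors(dtimes, predictions):
    """Toy per-event loss: squared error between true and predicted deltas."""
    return [(d - p) ** 2 for d, p in zip(dtimes, predictions)]


def masked_mean(per_event_loss, type_seq, pad_token_id=PAD_TOKEN_ID):
    """Average the loss over real events only, using a mask built from type_seq."""
    mask = [tok != pad_token_id for tok in type_seq]
    kept = [loss for loss, keep in zip(per_event_loss, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


type_seq = [3, 1, 7, PAD_TOKEN_ID, PAD_TOKEN_ID]  # last two positions are padding
predictions = [1.0, 40.0, 90.0, 0.0, 0.0]

# The same sequence of deltas, padded once with the type pad id and once with 100.
dtimes_pad_10 = [0.0, 42.0, 100.0, 10.0, 10.0]
dtimes_pad_100 = [0.0, 42.0, 100.0, 100.0, 100.0]

loss_a = masked_mean(squared_errors(dtimes_pad_10, predictions), type_seq)
loss_b = masked_mean(squared_errors(dtimes_pad_100, predictions), type_seq)
print(loss_a, loss_b)
```

Both paddings give the same loss because the padded positions are dropped before averaging. A genuine delta equal to 10 is also not mistaken for padding, since the mask is built from the type sequence and never from the delta values.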

Another reason is that there is no straightforward way to determine a pad value for time sequences. One option is to compute statistics over the time delta sequences and then choose a sufficiently large number, but this adds computation and is not very user-friendly.
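
The statistics-based option could be sketched as below. `choose_time_pad_value`, `pad_dtime_sequences`, and the `margin` parameter are hypothetical names for illustration only; nothing here exists in EasyTPP.

```python
def choose_time_pad_value(dtime_seqs, margin=1.0):
    """Pick a pad value strictly larger than every observed time delta.

    This needs a full pass over the dataset before batching, which is the
    extra computation referred to above.
    """
    largest = max(dt for seq in dtime_seqs for dt in seq)
    return largest + margin


def pad_dtime_sequences(dtime_seqs, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in dtime_seqs)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in dtime_seqs]


dtime_seqs = [[0.0, 3.5], [0.0, 100.0, 20.0]]  # made-up example data
pad_value = choose_time_pad_value(dtime_seqs)
padded = pad_dtime_sequences(dtime_seqs, pad_value)
print(pad_value, padded)
```

A value chosen this way depends on the dataset, so it would have to be recomputed or configured by the user for each new dataset, unlike the type pad token, which follows directly from the number of event types.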

iLampard avatar Feb 05 '25 05:02 iLampard