Do any of the algorithms support unequal-length input?

Open Okroshiashvili opened this issue 2 years ago • 5 comments

I have time series data with rows of unequal length. There are no missing points, so there is nothing to handle in that respect. I want to perform outlier detection.

Here is a small example of the structure of my data:

timestamp    feature_1  feature_2  feature_3  feature_4  feature_5
timestamp_1  34         112        24         46         87
timestamp_2  99         12         66         16
timestamp_3  54         19
timestamp_4  1100

Do any of the algorithms implemented in PyOD support such input data? If not, can you suggest possible solutions?

Okroshiashvili, Oct 04 '22 09:10

I am not an expert on this, but I think you first have to answer this question: "From an outlier-detection perspective, what does it mean for a feature to have no value?" For example, if it doesn't matter that timestamp_4 has no values for features 2-5, you could impute the missing values with the average. Most outlier detection methods will treat the average as a "normal" value. The problem with this approach, however, is that it could "dilute" potentially abnormal values in the features you do care about.
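As a rough illustration of the imputation idea, here is a minimal sketch in plain Python. It assumes that "the average" means the per-feature (column) mean and that a missing value is always a trailing feature, as in the example data from the question. The function name `impute_with_column_means` is made up for this example and is not part of PyOD.

```python
from statistics import mean

# Rows copied from the example in the question; shorter rows simply lack
# the trailing features.
rows = [
    [34, 112, 24, 46, 87],
    [99, 12, 66, 16],
    [54, 19],
    [1100],
]


def impute_with_column_means(rows):
    """Pad every row to the longest length using per-column means.

    Assumption: a missing value is always a trailing feature.
    (Hypothetical helper, not a PyOD function.)
    """
    width = max(len(row) for row in rows)
    # Mean of each column, computed only over the rows that have that column.
    col_means = [
        mean(row[j] for row in rows if len(row) > j)
        for j in range(width)
    ]
    # Keep the observed values and append the means for the missing tail.
    return [list(row) + col_means[len(row):] for row in rows]


filled = impute_with_column_means(rows)
```

After this step every row has five values, so the result can be passed to any detector that expects a rectangular matrix. The first feature of timestamp_4 (1100) is kept as observed, but the imputed features around it all look "normal", which is the dilution effect described above.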

Another approach would be to use multiple detectors, each trained on the subset of samples that have values for the same features.
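A minimal sketch of the multiple-detector idea, assuming that rows with the same length share the same features (that is, missing features are always trailing). The names `group_rows_by_length`, `score_per_group`, and `make_detector` are hypothetical. `make_detector` stands for any zero-argument factory that returns an object with `fit(X)` and `decision_function(X)` methods, which is the interface PyOD detectors expose.

```python
from collections import defaultdict


def group_rows_by_length(rows):
    """Map each row length to a list of (original_index, row) pairs."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[len(row)].append((index, row))
    return dict(groups)


def score_per_group(rows, make_detector):
    """Fit one detector per length group and return a score for every row.

    `make_detector` is a hypothetical factory; each detector it returns
    must provide `fit(X)` and `decision_function(X)`.
    """
    scores = [None] * len(rows)
    for members in group_rows_by_length(rows).values():
        X = [row for _, row in members]  # rectangular within one group
        detector = make_detector()
        detector.fit(X)
        # Write each score back to the position of its original row.
        for (index, _), score in zip(members, detector.decision_function(X)):
            scores[index] = score
    return scores


# Example data from the question: here every group contains a single row.
rows = [[34, 112, 24, 46, 87], [99, 12, 66, 16], [54, 19], [1100]]
groups = group_rows_by_length(rows)
```

With PyOD installed, something like `score_per_group(rows, IForest)` with `from pyod.models.iforest import IForest` should fit this pattern, provided each group contains enough rows to train on. Scores produced by different detectors are not directly comparable, so they would probably need to be normalized or thresholded per group.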

I hope this helped a little bit.

mbongaerts, Oct 14 '22 13:10

@mbongaerts Thanks a lot for your suggestions.

I don't think imputing values is a good idea, because these numbers are "strictly discrete": each number has its own meaning, so imputing by the average or by any other method won't work.

Using multiple detectors seems like a good option. However, I'm not sure how it would perform in production with a large amount of data.

Anyway, thanks for your input 👍

Okroshiashvili, Oct 17 '22 06:10

I think this is typical of time series or sequence data. Have you tried https://github.com/datamllab/tods for this? You will likely need some kind of sliding window or padding.
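For reference, a minimal sketch of what padding and sliding windows could look like in plain Python. The helper names `pad_rows` and `sliding_windows` are made up for this example and are not TODS or PyOD functions. The fill value `-1` is only an assumption: it has to be a value that never occurs in the real data, which matters when the numbers are discrete codes.

```python
def pad_rows(rows, fill_value):
    """Right-pad every row with `fill_value` up to the longest row's length."""
    width = max(len(row) for row in rows)
    return [list(row) + [fill_value] * (width - len(row)) for row in rows]


def sliding_windows(sequence, size, step=1):
    """Return every full window of `size` consecutive items from `sequence`.

    Sequences shorter than `size` yield no windows.
    """
    return [
        list(sequence[start:start + size])
        for start in range(0, len(sequence) - size + 1, step)
    ]


# Example data from the question.
rows = [[34, 112, 24, 46, 87], [99, 12, 66, 16], [54, 19], [1100]]

# Padding: every row becomes length 5, with -1 marking "no value".
padded = pad_rows(rows, fill_value=-1)

# Sliding window: fixed-length slices taken from one longer sequence.
windows = sliding_windows(rows[0], size=3)
```

Either result is rectangular and can be passed to a detector that expects equal-length vectors. Padding keeps one vector per row, while windowing produces several fixed-length vectors per sequence and drops sequences shorter than the window.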

yzhao062, Oct 17 '22 18:10

Thanks for your input, @yzhao062. I will definitely try that.

Okroshiashvili, Oct 17 '22 20:10

I am almost positive that all of these algorithms will fail with jagged feature vectors as input. I had the same issue initially and had to make all my input data equal in length. I am curious, though, whether anyone has used any of them with unequal-length vectors.

RyanZurrin, Nov 09 '22 21:10