pyod
Do any of the algorithms support unequal-length input?
I have time series data with rows of unequal length. There are no missing points, so there is nothing to handle on that front. I want to perform outlier detection.
Here is a small example of the structure of my data:
timestamp | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 |
---|---|---|---|---|---|
timestamp_1 | 34 | 112 | 24 | 46 | 87 |
timestamp_2 | 99 | 12 | 66 | 16 | |
timestamp_3 | 54 | 19 | | | |
timestamp_4 | 1100 | | | | |
Do any of the algorithms implemented in PyOD support such input data? If not, could you suggest possible solutions?
I am not an expert on this, but personally I feel you first have to answer this question: "From an outlier-detection perspective, what does it mean for a feature to have no value?" For example, if you don't care that timestamp_4 has no values for features 2 to 5, then you can impute all of those values with the feature average. Most outlier detection methods will see the average as a "normal" value. The problem with this approach, however, is that it could "dilute" potentially abnormal values in features you do care about.
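To make the averaging idea concrete, here is a minimal sketch in plain Python that fills each empty cell with the mean of the values observed in that column, using the rows from the table above. The helper name `impute_column_means` and the use of `None` to mark an empty cell are assumptions made for illustration; nothing here is a PyOD function.

```python
from statistics import mean

# Rows from the example table; None marks a feature with no value (assumed encoding).
rows = [
    [34, 112, 24, 46, 87],
    [99, 12, 66, 16, None],
    [54, 19, None, None, None],
    [1100, None, None, None, None],
]

def impute_column_means(rows):
    """Replace each None with the mean of the observed values in its column."""
    columns = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]

imputed = impute_column_means(rows)
```

After imputation every row has five values, so the result has the rectangular shape a detector expects. Note that the last row now carries average values in four of its five features, which is the "dilution" risk described above.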
Another approach would be to use multiple detectors, each trained on the subset of samples that have values for the same set of features.
I hope this helps a little.
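A rough sketch of the grouping step for that approach, in plain Python: rows are bucketed by how many features they contain, so each bucket is a rectangular matrix. The function name `group_by_length` is hypothetical, and `timestamp_5` and `timestamp_6` are made-up rows added only so that one bucket holds more than a single sample.

```python
from collections import defaultdict

# The first four rows come from the example table.
# timestamp_5 and timestamp_6 are invented for illustration only.
rows = {
    "timestamp_1": [34, 112, 24, 46, 87],
    "timestamp_2": [99, 12, 66, 16],
    "timestamp_3": [54, 19],
    "timestamp_4": [1100],
    "timestamp_5": [61, 23],
    "timestamp_6": [48, 17],
}

def group_by_length(rows):
    """Bucket rows by feature count so every bucket has equal-length rows."""
    groups = defaultdict(list)
    for key, values in rows.items():
        groups[len(values)].append((key, values))
    return dict(groups)

groups = group_by_length(rows)

# Each bucket can be turned into a matrix for its own detector.
matrices = {n: [values for _, values in members] for n, members in groups.items()}
```

Each entry of `matrices` could then be passed to a separate detector, for example the `fit` method of `pyod.models.iforest.IForest`; PyOD is deliberately not imported here so the sketch runs on its own. Buckets with very few rows, such as the single five-feature row above, are too small to train on.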
@mbongaerts Thanks a lot for your suggestions.
I don't think imputing values would be a good idea here, because these numbers are "strictly discrete": each number has its own definition, so imputation by the average or by any other method won't work.
Using multiple detectors sounds good. However, I'm not sure how it will perform in production with a large amount of data.
Anyway, thanks for your input 👍
I think this is typical for time series or sequence data. Have you tried https://github.com/datamllab/tods for this? You will likely need a sliding window or padding.
Thanks @yzhao062 for your input. I will definitely try that.
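For reference, a minimal plain-Python sketch of the two preprocessing ideas mentioned here. Neither helper is part of PyOD or TODS; `pad_rows`, `sliding_windows`, and the fill value `-1` are assumptions for illustration. The fill value in particular must be a code that never occurs in the real data.

```python
def pad_rows(rows, fill=-1):
    """Right-pad every row with `fill` so all rows match the longest one."""
    width = max(len(row) for row in rows)
    return [list(row) + [fill] * (width - len(row)) for row in rows]

def sliding_windows(sequence, size, step=1):
    """Cut one sequence into fixed-length windows; sequences shorter than `size` yield none."""
    return [sequence[i:i + size] for i in range(0, len(sequence) - size + 1, step)]

# Rows from the example table, without the timestamp column.
rows = [
    [34, 112, 24, 46, 87],
    [99, 12, 66, 16],
    [54, 19],
    [1100],
]

padded = pad_rows(rows)
windows = sliding_windows(rows[0], size=3)
```

Padding keeps one sample per original row, while windowing produces several fixed-length samples per row and drops rows shorter than the window. Either result is rectangular and can be passed to a detector that expects equal-length input.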
I am almost positive all of these algorithms will fail with jagged feature vectors as input. I had the same issue initially and had to make all my input data equal in length. I am curious, though, whether anyone has used any of them with unequal-length vectors.