AnomalyDetection icon indicating copy to clipboard operation
AnomalyDetection copied to clipboard

AnomalyDetectionTs drops timezone from POSIXct objects and converts POSIXlt to POSIXct

Open zachmayer opened this issue 10 years ago • 29 comments

AnomalyDetectionTs should keep the timestamp column of the output dataset as-is, rather than converting to POSIXct and dropping the timezone attribute:

library(AnomalyDetection)
data(data_raw)
data <- raw_data
data$timestamp <- as.POSIXct(data$timestamp)
attr(data$timestamp, "tzone") 
attr(data$timestamp, "tzone") <- "America/New_York"

res = AnomalyDetectionTs(data, max_anoms=0.002, direction='both', plot=FALSE)
attr(res$anoms, 'tzone')

Dropping the timezone is problematic if you wish to merge the anomalies back to the main dataset:

> merge(data, res$anoms, by='timestamp')
             timestamp   count    anoms
1  1980-09-28 22:40:00 114.308 193.1036
2  1980-09-30 12:26:00 130.222 180.8990
3  1980-09-30 12:30:00 126.721 178.8220
4  1980-09-30 12:31:00 152.956 198.3260
5  1980-09-30 12:32:00 136.004 203.9010
6  1980-09-30 12:33:00 134.589 200.3090
7  1980-09-30 12:34:00 122.490 178.4910
8  1980-09-30 12:36:00 126.806 183.0180
9  1980-09-30 12:38:00 117.334 186.8230
10 1980-09-30 12:39:00 121.061 183.6600
11 1980-09-30 12:40:00 116.924 179.2760
12 1980-09-30 12:41:00 129.097 197.2830
13 1980-09-30 12:42:00 119.566 191.0970
14 1980-09-30 12:43:00 137.694 194.6700
15 1980-09-30 12:46:00 136.876 200.8160
16 1980-09-30 12:47:00 125.126 186.2350
17 1980-09-30 12:48:00 122.008 185.4210
18 1980-09-30 12:49:00 127.935 178.9580
19 1980-09-30 12:51:00 138.159 203.2310
20 1980-09-30 12:52:00 130.939 181.3540
21 1980-09-30 12:53:00 122.351 186.7780
22 1980-09-30 12:55:00 121.120 176.1250
23 1980-09-30 12:56:00 122.707 181.5140
24 1980-09-30 12:57:00 118.378 175.2610
25 1980-10-05 05:18:00 101.332  40.0000
26 1980-10-05 05:28:00 103.798 250.0000
27 1980-10-05 05:38:00 100.839  40.0000

zachmayer avatar Jan 06 '15 19:01 zachmayer

Thanks, Owen and I will look into it Zach!

jhochenbaum avatar Jan 07 '15 01:01 jhochenbaum

Thanks!

zachmayer avatar Jan 07 '15 01:01 zachmayer

This looks like a useful package, thanks!

Perhaps the 'timestamp' column should be required to be of class POSIXct? I don't believe base::merge works when the 'by' column is of class POSIXlt. For example:

x <- Sys.time()
test1 <- data.frame(timestamp = as.POSIXct(x:(x + 9), origin = "1970-01-01"),
                    y = rnorm(1:10))
test2 <- data.frame(timestamp = test1$timestamp, z = rnorm(1:10))

## ok when timestamp is POSIXct
merge(test1, test2, by = "timestamp")

## not ok when timestamp is POSIXlt
test1$timestamp <- test2$timestamp <- as.POSIXlt(test1$timestamp)
merge(test1, test2, by = "timestamp")

With the timezone issue, if the 'timestamp' column is POSIXct, the timezone on the object returned can be made to match that of the input data. If the timestamp column is not POSIXct, it could be converted using as.POSIXct instead of strptime in format_timestamp, and given a default time zone of UTC possibly.

erikriverson avatar Jan 07 '15 05:01 erikriverson

@erikeldridge There's really 2 issues here:

  1. POSIXlt is silently converted to POSIXct. Perhaps forcing the use of POSIXct is the way to go, but we should at least get a warning before conversion. Personally, I'm in favor of the output using the exact same data type as the input and letting the user worry about how to merge a POSIXlt object.
  2. Even if the user provided a POSIXct column, the timezone is silently dropped, which means the return data has the wrong timezone and merges incorrectly with the original data (see my example above: the data returned by AnomalyDetectionTs is off by 4 hours).

zachmayer avatar Jan 07 '15 18:01 zachmayer

@erikeldridge I didn't see the last paragraph of your comment. I agree, if the timestamp is of class POSIXct, the timezone should be kept, and if the timestamp must be converted the timezone should be preserved if possible.

zachmayer avatar Jan 07 '15 18:01 zachmayer

Thanks for the comments guys -- Owen and I will look into this later today, but on glance you're right, we should preserve the timestamp/timezone. Stay tuned...

jhochenbaum avatar Jan 07 '15 20:01 jhochenbaum

@jhochenbaum Awesome, thank you!

zachmayer avatar Jan 07 '15 20:01 zachmayer

@zachmayer, would you like to submit a patch and @owenvallis and I can review?

jhochenbaum avatar Jan 09 '15 23:01 jhochenbaum

@jhochenbaum I haven't had time to take a crack on this yet, but if I do I'll submit a PR.

zachmayer avatar Jan 12 '15 14:01 zachmayer

Thanks @zachmayer

jhochenbaum avatar Jan 15 '15 06:01 jhochenbaum

@zachmayer After my PR https://github.com/twitter/AnomalyDetection/pull/92, the timestamp column is now stored as POSIXct rather than POSIXlt and the timezone attribute is kept. Now your code demo produces following output

> library(AnomalyDetection)
> data <- raw_data
> str(data)
'data.frame':	14398 obs. of  2 variables:
 $ timestamp: POSIXct, format: "1980-09-25 14:01:00" "1980-09-25 14:02:00" "1980-09-25 14:03:00" ...
 $ count    : num  182 176 184 178 165 ...
> attr(data$timestamp, "tzone")
[1] "UTC"
> 
> res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = FALSE)
> attr(res$anoms$timestamp, 'tzone')
[1] "UTC"

You should check timezone attribute for res$anoms$timestamp instead of res$anoms. Therefore, the timezone attribute was not dropped.

However, your code snippet exposed another problem of AnomalyDetection package, which is the input timestamp with non UTC timezone. The problem is explained by following code output.

> attr(data$timestamp, "tzone") <- "America/New_York"
> attr(data$timestamp, "tzone")
[1] "America/New_York"
> res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = FALSE)
> attr(res$anoms$timestamp, 'tzone')
[1] "UTC"
> head(res$anoms)
            timestamp    anoms
1 1980-09-25 12:05:00  21.3510
2 1980-09-29 02:40:00 193.1036
3 1980-09-30 16:26:00 180.8990
4 1980-09-30 16:30:00 178.8220
5 1980-09-30 16:31:00 198.3260
6 1980-09-30 16:32:00 203.9010
> data[120:130, ]
              timestamp   count
120 1980-09-25 12:00:00 134.646
121 1980-09-25 12:01:00 157.175
122 1980-09-25 12:02:00 150.374
123 1980-09-25 12:03:00 151.579
124 1980-09-25 12:04:00 133.844
125 1980-09-25 12:05:00  21.351
126 1980-09-25 12:06:00 120.827
127 1980-09-25 12:07:00 144.293
128 1980-09-25 12:08:00 137.726
129 1980-09-25 12:09:00 131.608
130 1980-09-25 12:10:00 135.320

As you see, the 1st row of res$anoms is corresponding to the 125th row of data. The timestamp columns are of the same value and the only difference is the timezone attribute.

I will try to fix the problem.

caijun avatar Dec 20 '17 07:12 caijun

I have fixed it by PR https://github.com/twitter/AnomalyDetection/pull/92

Now the AnomalyDetectionTs keeps the original timezone attr as the input timestamp, see following output.

> attr(data$timestamp, "tzone") <- "America/New_York"
> attr(data$timestamp, "tzone")
[1] "America/New_York"
> res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = TRUE)
> attr(res$anoms$timestamp, 'tzone')
[1] "America/New_York"
> head(res$anoms)
            timestamp    anoms
1 1980-09-25 12:05:00  21.3510
2 1980-09-29 02:40:00 193.1036
3 1980-09-30 16:26:00 180.8990
4 1980-09-30 16:30:00 178.8220
5 1980-09-30 16:31:00 198.3260
6 1980-09-30 16:32:00 203.9010
> data[120:130, ]
              timestamp   count
120 1980-09-25 12:00:00 134.646
121 1980-09-25 12:01:00 157.175
122 1980-09-25 12:02:00 150.374
123 1980-09-25 12:03:00 151.579
124 1980-09-25 12:04:00 133.844
125 1980-09-25 12:05:00  21.351
126 1980-09-25 12:06:00 120.827
127 1980-09-25 12:07:00 144.293
128 1980-09-25 12:08:00 137.726
129 1980-09-25 12:09:00 131.608
130 1980-09-25 12:10:00 135.320

caijun avatar Dec 20 '17 12:12 caijun

Hi team,

Sorry for this up, I mean really.

But with this script :

library(AnomalyDetection) data(raw_data) data <- raw_data data$timestamp <- as.POSIXct(data$timestamp) attr(data$timestamp, "tzone") <- "America/New_York" attr(data$timestamp, "tzone") res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = TRUE) attr(res$anoms$timestamp, 'tzone') res$plot

Here what I have :

library(AnomalyDetection)

data(raw_data) data <- raw_data data$timestamp <- as.POSIXct(data$timestamp) attr(data$timestamp, "tzone") <- "America/New_York" attr(data$timestamp, "tzone") [1] "America/New_York" res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = TRUE) attr(res$anoms$timestamp, 'tzone') [1] "UTC" res$plot Error: Column x is a date/time and must be stored as POSIXct, not POSIXlt

I cannot understand why, I've checked all what you've said, but no result... May I have some help ?

Maryoda2 avatar Jul 05 '18 07:07 Maryoda2

@Maryoda2 The Error: Column x is a date/time and must be stored as POSIXct, not POSIXlt has been fixed by my PR https://github.com/twitter/AnomalyDetection/pull/92, which has not been merged into the master branch (It seems that the twitter/AnomalyDetection repository has not been maintained since Sep, 2015). You could install AnomalyDetection package that I modified.

devtools::install_github("caijun/AnomalyDetection")

Then rerunning your script will produce the expected results without errors (I tested your script).

caijun avatar Jul 06 '18 03:07 caijun

Oh thanks a lot !

Maryoda2 avatar Jul 06 '18 06:07 Maryoda2

Well... in fact sometimes it writes :

Error in aggregate.data.frame(x[2], format(x[1], "%Y-%m-%d %H:%M:00"), : 'by' must be a list

Maryoda2 avatar Jul 06 '18 12:07 Maryoda2

@Maryoda2 could you give an example reproducing the error?

caijun avatar Jul 06 '18 16:07 caijun

library(AnomalyDetection) data <-dems data$timestamp <- as.POSIXct(data$timestamp) attr(data$timestamp, "tzone") <- "America/New_York" attr(data$timestamp, "tzone") res <- AnomalyDetectionTs(data, max_anoms = 0.02, direction = 'both', plot = TRUE) attr(res$anoms$timestamp, 'tzone') res$plot

With another dataset, I've tried to translate your script. I created a dataset called dems and attempted to format it exactly like raw_data, but it doesn't work

Maryoda2 avatar Jul 06 '18 17:07 Maryoda2

Please run str(data) to see the structure of data and post the result. Otherwise without your data, I cannot reproduce the error to provide help.

caijun avatar Jul 07 '18 00:07 caijun

str(raw_data) 'data.frame': 14398 obs. of 2 variables: $ timestamp: POSIXct, format: "1980-09-25 10:01:00" "1980-09-25 10:02:00" ... $ count : num 182 176 184 178 165 ... str(dems) Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1127 obs. of 2 variables: $ timestamp: POSIXct, format: "2018-07-04 09:51:41" "2018-07-04 09:51:51" ... $ count : num 2.6 2.6 2.6 2.6 2.5 2.4 2.4 2.4 2.5 2.7 ...

Maryoda2 avatar Jul 07 '18 07:07 Maryoda2

This is because your timestamp is not in the format of "%Y-%m-%d %H:%M:00", which means AnomalyDetectionTs() detect anomalies every minute, while your timestamp appears to be every 10 seconds. If you really need to detect anomalies at a time resolution of less than a minute, then the code of twitter/AnomalyDetection has to be modified to support such a feature.

caijun avatar Jul 08 '18 01:07 caijun

Oh I see, How can I edit the code of twitter/AnomalyDetection, in fact the caijun/Twitter (yours) ?

Maryoda2 avatar Jul 08 '18 10:07 Maryoda2

@Maryoda2 Could you share a sample of your data, which I can use for reproducing the error and testing?

I think the problem you encountered has been reported in issue https://github.com/twitter/AnomalyDetection/issues/77, which has been resolved by PRs https://github.com/twitter/AnomalyDetection/pull/98 and https://github.com/twitter/AnomalyDetection/pull/79

Although these PRs have not been merged into the twitter/AnomalyDetection master branch, but has been merged into isalgueiro/AnomalyDetection. You can try this fork of AnomalyDetection by

devtools::install_github("isalgueiro/AnomalyDetection")

caijun avatar Jul 10 '18 09:07 caijun

Well thanks for your answer With isalgueiro it doesn't work, same error. Here is a sample

Maryoda2 avatar Jul 10 '18 09:07 Maryoda2

@Maryoda2 the data you provided is not enough to reproduce the error. Please provide data with at least 20000 rows.

caijun avatar Jul 10 '18 10:07 caijun

20.000 is no more a sample, but we can't detect anomalies on dataset less than 20.000 ?

Maryoda2 avatar Jul 10 '18 11:07 Maryoda2

@Maryoda2 Using the data you provided, I encountered the error "Anom detection needs at least 2 periods worth of data". Hence, I need more data to make sure this error was not because of limited data.

caijun avatar Jul 10 '18 11:07 caijun

In fact for the moment this is all what I have... So game over then ?

Maryoda2 avatar Jul 10 '18 12:07 Maryoda2

@Maryoda2 this error has been discussed in issues https://github.com/twitter/AnomalyDetection/issues/15 and https://github.com/twitter/AnomalyDetection/issues/42, perhaps they can provide some useful information for you to fix this error by yourself, e.g. fork this repository and modify the source code.

caijun avatar Jul 10 '18 14:07 caijun