to.period is not working as expected for the first record

Open cloudcell opened this issue 8 years ago • 10 comments

The first endpoint is being assigned to the first record, regardless of its time. In the following example, the record at 04:01 is returned as a bar of its own instead of being aggregated with the 04:02 record.

Reproduction of the error:

  1. get this data
  2. use this code
xx <- to.period(GBPUSD,period = 'minutes', k=2)
head(GBPUSD)

                      Open   High    Low  Close     Volume
2002-10-21 04:01:00 1.5501 1.5501 1.5501 1.5501 22.2181704
2002-10-21 04:02:00 1.5501 1.5501 1.5501 1.5501 93.3404328
2002-10-21 04:03:00 1.5501 1.5501 1.5501 1.5501 25.7178698
2002-10-21 04:04:00 1.5501 1.5501 1.5501 1.5501  8.0730374
2002-10-21 04:05:00 1.5500 1.5500 1.5500 1.5500  1.9565426
2002-10-21 04:06:00 1.5493 1.5497 1.5493 1.5497 39.7676101

head(xx)

                    GBPUSD.Open GBPUSD.High GBPUSD.Low GBPUSD.Close GBPUSD.Volume
2002-10-21 04:01:00      1.5501      1.5501     1.5501       1.5501     22.218170
2002-10-21 04:03:00      1.5501      1.5501     1.5501       1.5501    119.058303
2002-10-21 04:05:00      1.5501      1.5501     1.5500       1.5500     10.029580
2002-10-21 04:07:00      1.5493      1.5498     1.5493       1.5498    122.681942
2002-10-21 04:09:00      1.5498      1.5498     1.5492       1.5492     62.382992
2002-10-21 04:11:00      1.5492      1.5492     1.5491       1.5491     63.479716

cloudcell avatar Aug 24 '16 08:08 cloudcell

I'm not convinced this is a bug. If you look at the output of endpoints, you will see why to.period is returning the first row unchanged.

head(endpoints(GBPUSD, 'minutes', k=2))
#[1] 0 1 3 5 7 9

This is because the index for GBPUSD contains values at the beginning of the minute, not the end. So the end point for the first two minutes of 2002-10-21T04:00:00 is 2002-10-21T04:01:59.999.

If you subtract a small amount from each index value, you get the behavior you seem to expect. This is because the endpoints output changes to reflect the shifted index values.

.index(GBPUSD) <- .index(GBPUSD) - 0.0001
options(digits.secs=6, width=120)
head(xx <- to.period(GBPUSD,period = 'minutes', k=2))
#                         GBPUSD.Open GBPUSD.High GBPUSD.Low GBPUSD.Close GBPUSD.Volume
#2002-10-20 19:01:59.9998      1.5501      1.5501     1.5501       1.5501     115.55860
#2002-10-20 19:03:59.9998      1.5501      1.5501     1.5501       1.5501      33.79091
#2002-10-20 19:05:59.9998      1.5500      1.5500     1.5493       1.5497      41.72415
#2002-10-20 19:07:59.9998      1.5498      1.5498     1.5497       1.5498     144.68929
#2002-10-20 19:09:59.9998      1.5494      1.5494     1.5492       1.5492      41.73473
#2002-10-20 19:11:59.9998      1.5491      1.5493     1.5491       1.5493     113.86047
head(endpoints(GBPUSD, 'minutes', k=2))
#[1]  0  2  4  6  8 10

joshuaulrich avatar Aug 24 '16 13:08 joshuaulrich

...the index for GBPUSD contains values at the beginning of the minute, not the end. So the end point for the first two minutes of 2002-10-21T04:00:00 is 2002-10-21T04:01:59.999.

Note that there is no data point for T = 2002-10-21T04:00:00. Therefore, the first two minutes must span the range from 2002-10-21T04:01:00 to 2002-10-21T04:02:59.999.

cloudcell avatar Aug 24 '16 15:08 cloudcell

endpoints does not produce output based on first observed time in the data passed to it. The first element of endpoints' output is always zero, and in your example the second element will always be the location of the observation with an index value at or before 04:01:59.999, whether your data start at 04:00:00 or 04:01:59.998.
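A minimal sketch of this point, using synthetic one-minute data rather than the GBPUSD file. The series names `from_0400` and `from_0401` are made up for illustration, and UTC is assumed so the printed times line up with the 2-minute grid:

```r
library(xts)

# Synthetic one-minute timestamps in UTC (illustrative, not the GBPUSD data)
start <- as.POSIXct("2002-10-21 04:00:00", tz = "UTC")
from_0400 <- xts(1:6, order.by = start + 60 * 0:5)  # 04:00 ... 04:05
from_0401 <- xts(1:6, order.by = start + 60 * 1:6)  # 04:01 ... 04:06

# Both series are cut on the same fixed 2-minute grid
ep_0400 <- endpoints(from_0400, on = "minutes", k = 2)
ep_0401 <- endpoints(from_0401, on = "minutes", k = 2)

print(ep_0400)  # 0 2 4 6
print(ep_0401)  # 0 1 3 5 6
```

The series starting at 04:01 gets a one-row first period because 04:01 is the only observation it has inside the 04:00:00 to 04:01:59.999 window.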

joshuaulrich avatar Aug 24 '16 15:08 joshuaulrich

My question is: how does endpoints find the first element? From what I read above, it seems that endpoints chooses 04:00:00 as a baseline (rounding down to the beginning of the first hour), regardless of what the time (in minutes) of our first record is. So if the first record had a timestamp of, say, 06:53:48, the baseline for endpoints would be 06:00:00.000. Then it would increment 2 minutes at a time from there, and the records would be 'sifted' through the "grid" that endpoints generates, regardless of what data is being sifted. Is that correct?

cloudcell avatar Aug 25 '16 07:08 cloudcell

endpoints finds the first element the same way it finds all the elements. It doesn't actively "choose" any time from the input data as a baseline.

The baseline is determined by the period you choose. If you choose on = "hours", then endpoints will use XX:59:59.999 as the cutoff. For on = "seconds", the cutoff is XX:XX:XX.999. For on = "months", the cutoff is the end of the last day of each month, so a new period starts on the first day of each month.
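As a rough illustration of how the cutoff depends on `on`, here is a small sketch with synthetic data (the objects `intraday` and `daily` are invented for this example, and UTC is assumed):

```r
library(xts)

# Synthetic intraday timestamps straddling an hour boundary (UTC assumed)
stamps <- as.POSIXct(c("2002-10-21 04:58:30", "2002-10-21 04:59:59",
                       "2002-10-21 05:00:00", "2002-10-21 05:00:30",
                       "2002-10-21 05:01:10"), tz = "UTC")
intraday <- xts(seq_along(stamps), order.by = stamps)

ep_hours   <- endpoints(intraday, on = "hours")    # cut after 04:59:59
ep_minutes <- endpoints(intraday, on = "minutes")  # cut at the end of each minute

print(ep_hours)    # 0 2 5
print(ep_minutes)  # 0 1 2 4 5

# Synthetic daily data crossing a month boundary
days <- as.Date(c("2002-10-30", "2002-10-31", "2002-11-01"))
daily <- xts(seq_along(days), order.by = days)
ep_months <- endpoints(daily, on = "months")       # cut after the last October row

print(ep_months)   # 0 2 3
```

In each case the returned positions mark the last observation before a period boundary, plus the leading 0 and the final row.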

joshuaulrich avatar Aug 25 '16 13:08 joshuaulrich

So the cutoff for on = "minutes" is the same as for on = "hours", that is XX:59:59.999. Is that correct?

cloudcell avatar Aug 25 '16 13:08 cloudcell

No, the cutoff for on = "minutes" would be XX:XX:59.999. I.e. the end of every minute (assuming k = 1).

joshuaulrich avatar Aug 25 '16 13:08 joshuaulrich

OK, we're getting somewhere. So if k = 2, as in our case, what will the cutoff be?

cloudcell avatar Aug 25 '16 13:08 cloudcell

Every 2 minutes. So, xx:x1:59.999, xx:x3:59.999, etc.

The first 2 minutes of every hour are from xx:00:00.000-xx:01:59.999. The next two minutes are xx:02:00.000-xx:03:59.999... and the last 2 minutes are xx:58:00.000-xx:59:59.999.
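The fixed grid described above can be imitated in base R. This is only a sketch: `window_start` is a hypothetical helper, not an xts function, and it assumes the windows are counted from the Unix epoch in UTC, which for k = 2 coincides with the start of each hour.

```r
# Hypothetical helper, not part of xts: start of the fixed k-minute window
# (counted from the Unix epoch) that contains each timestamp
window_start <- function(times, k = 2) {
  width <- 60 * k
  as.POSIXct((as.numeric(times) %/% width) * width,
             origin = "1970-01-01", tz = "UTC")
}

# Six one-minute timestamps, 04:01 ... 04:06, as in the example data
times <- as.POSIXct("2002-10-21 04:01:00", tz = "UTC") + 60 * 0:5
starts <- format(window_start(times), "%H:%M")
print(starts)  # "04:00" "04:02" "04:02" "04:04" "04:04" "04:06"

# Position of the last row in each window, with a leading 0
ep <- c(0, cumsum(rle(starts)$lengths))
print(ep)      # 0 1 3 5 6

# A first record at 06:53:48 falls in the window that starts at 06:52
late <- format(window_start(as.POSIXct("2002-10-21 06:53:48", tz = "UTC")), "%H:%M")
print(late)    # "06:52"
```

The grid never moves to fit the data: a record is simply assigned to whichever window already contains its timestamp.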

joshuaulrich avatar Aug 25 '16 13:08 joshuaulrich

So, in a way, you can say that for periods longer than 1 minute, the grid is anchored at the boundary between the end of one hour and the beginning of the next, or XX:59:59.999, just like I wrote above. Well, it's not a bug then, but the help page for endpoints would be clearer if it mentioned these cutoff rules.

cloudcell avatar Aug 25 '16 14:08 cloudcell