xts
to.period is not working as expected for the first record
The first endpoint is being assigned to the first record, regardless of time. In the following example the record with time 04:01 ends up not aggregated in the processed xts records.
Reproduction of the error:
- get this data
- use this code
xx <- to.period(GBPUSD,period = 'minutes', k=2)
head(GBPUSD)
                      Open   High    Low  Close     Volume
2002-10-21 04:01:00 1.5501 1.5501 1.5501 1.5501 22.2181704
2002-10-21 04:02:00 1.5501 1.5501 1.5501 1.5501 93.3404328
2002-10-21 04:03:00 1.5501 1.5501 1.5501 1.5501 25.7178698
2002-10-21 04:04:00 1.5501 1.5501 1.5501 1.5501  8.0730374
2002-10-21 04:05:00 1.5500 1.5500 1.5500 1.5500  1.9565426
2002-10-21 04:06:00 1.5493 1.5497 1.5493 1.5497 39.7676101
head(xx)
                    GBPUSD.Open GBPUSD.High GBPUSD.Low GBPUSD.Close GBPUSD.Volume
2002-10-21 04:01:00      1.5501      1.5501     1.5501       1.5501     22.218170
2002-10-21 04:03:00      1.5501      1.5501     1.5501       1.5501    119.058303
2002-10-21 04:05:00      1.5501      1.5501     1.5500       1.5500     10.029580
2002-10-21 04:07:00      1.5493      1.5498     1.5493       1.5498    122.681942
2002-10-21 04:09:00      1.5498      1.5498     1.5492       1.5492     62.382992
2002-10-21 04:11:00      1.5492      1.5492     1.5491       1.5491     63.479716
I'm not convinced this is a bug. If you look at the output of endpoints, you will see why to.period is returning the first row unchanged.
head(endpoints(GBPUSD, 'minutes', k=2))
#[1] 0 1 3 5 7 9
This is because the index for GBPUSD contains values at the beginning of the minute, not the end. So the end point for the first two minutes of 2002-10-21T04:00:00 is 2002-10-21T04:01:59.999.
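The effect can be reproduced without the original file. The following is a minimal sketch, assuming the xts package is installed; the object name bars, the rounded volumes, and tz = "UTC" are illustrative choices, not part of the original data set.

```r
library(xts)

# Six one-minute bars stamped at the START of each minute. Prices are copied
# from the head(GBPUSD) output above; volumes are rounded, and tz = "UTC"
# is an assumption made for this sketch.
idx <- as.POSIXct("2002-10-21 04:01:00", tz = "UTC") + 60 * (0:5)
bars <- xts(
  cbind(Open   = c(1.5501, 1.5501, 1.5501, 1.5501, 1.5500, 1.5493),
        High   = c(1.5501, 1.5501, 1.5501, 1.5501, 1.5500, 1.5497),
        Low    = c(1.5501, 1.5501, 1.5501, 1.5501, 1.5500, 1.5493),
        Close  = c(1.5501, 1.5501, 1.5501, 1.5501, 1.5500, 1.5497),
        Volume = c(22.2, 93.3, 25.7, 8.1, 2.0, 39.8)),
  order.by = idx
)

# 04:01 falls in the 04:00:00-04:01:59.999 window, so it is alone in the
# first bar; 04:02 and 04:03 share the next window, and so on.
ep <- endpoints(bars, on = "minutes", k = 2)
print(ep)

# to.period aggregates between consecutive endpoints, so the first output
# row is just the first input row.
two_min <- to.period(bars, period = "minutes", k = 2)
print(two_min)
```

With these six rows the endpoints should come out as 0 1 3 5 6, which matches the 0 1 3 5 pattern shown for the full data set.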
If you subtract a small amount from each index value, you get the behavior you seem to expect. This is because the endpoints output changes to reflect the shifted index values.
.index(GBPUSD) <- .index(GBPUSD) - 0.0001
options(digits.secs=6, width=120)
head(xx <- to.period(GBPUSD,period = 'minutes', k=2))
# GBPUSD.Open GBPUSD.High GBPUSD.Low GBPUSD.Close GBPUSD.Volume
#2002-10-20 19:01:59.9998 1.5501 1.5501 1.5501 1.5501 115.55860
#2002-10-20 19:03:59.9998 1.5501 1.5501 1.5501 1.5501 33.79091
#2002-10-20 19:05:59.9998 1.5500 1.5500 1.5493 1.5497 41.72415
#2002-10-20 19:07:59.9998 1.5498 1.5498 1.5497 1.5498 144.68929
#2002-10-20 19:09:59.9998 1.5494 1.5494 1.5492 1.5492 41.73473
#2002-10-20 19:11:59.9998 1.5491 1.5493 1.5491 1.5493 113.86047
head(endpoints(GBPUSD, 'minutes', k=2))
#[1] 0 2 4 6 8 10
...the index for GBPUSD contains values at the beginning of the minute, not the end. So the end point for the first two minutes of 2002-10-21T04:00:00 is 2002-10-21T04:01:59.999.
Note that there is no data point for T= 2002-10-21T04:00:00. Therefore, the first two minutes must be in the range of 2002-10-21T04:01:00 and 2002-10-21T04:02:59.999.
endpoints does not produce output based on the first observed time in the data passed to it. The first element of endpoints' output is always zero, and in your example the second element will always be the location of the observation with an index value at or before 04:01:59.999, whether your data start at 04:00:00 or 04:01:59.998.
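A small check of that claim, as a sketch: the two series below are hypothetical (they are not taken from the GBPUSD data) and assume xts is installed and the index is in UTC.

```r
library(xts)

day <- "2002-10-21"

# Hypothetical series A starts exactly on the hour.
a <- xts(1:4, as.POSIXct(paste(day, c("04:00:00", "04:01:00",
                                      "04:02:00", "04:03:00")), tz = "UTC"))

# Hypothetical series B starts one second before the first 2-minute cutoff.
b <- xts(1:3, as.POSIXct(paste(day, c("04:01:59", "04:02:00",
                                      "04:03:00")), tz = "UTC"))

ep_a <- endpoints(a, on = "minutes", k = 2)
ep_b <- endpoints(b, on = "minutes", k = 2)

# In both cases the second element points at the last observation at or
# before 04:01:59.999, regardless of where the series starts.
print(index(a)[ep_a[2]])
print(index(b)[ep_b[2]])
```

Series A closes its first bar on the 04:01:00 row and series B on the 04:01:59 row; neither bar extends past 04:01:59.999.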
My question is: how does endpoints find the first element? From what I read above, it seems that endpoints chooses 04:00:00 as a baseline (rounding down to the beginning of the first hour), regardless of the time (in minutes) of our first record. So if the first record had a timestamp of, say, 06:53:48, the baseline for endpoints would be 06:00:00.000. It would then increment by 2 minutes from there, and the records would be 'sifted' through the "grid" that endpoints generates, regardless of what data has to be 'sifted' through. Is that correct?
endpoints finds the first element the same way it finds all the elements. It doesn't actively "choose" any time from the input data as a baseline.
The baseline is determined by the period you choose. If you choose on = "hours", then endpoints will use XX:59:59.999 as the cutoff. For on = "seconds", the cutoff is XX:XX:XX.999. For on = "months", the cutoff is the end of the last day of each month.
So the cutoff for on = "minutes" is the same as for on = "hours", that is XX:59:59.999. Is that correct?
No, the cutoff for on = "minutes" would be XX:XX:59.999, i.e. the end of every minute (assuming k = 1).
OK, we're getting somewhere. So if k = 2, as in our case, what will the cutoff be?
Every 2 minutes. So, xx:x1:59.999, xx:x3:59.999, etc.
The first 2 minutes of every hour are from xx:00:00.000-xx:01:59.999. The next two minutes are xx:02:00.000-xx:03:59.999... and the last 2 minutes are xx:58:00.000-xx:59:59.999.
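One way to picture this fixed grid is to assign each timestamp to a k-minute block counted from a fixed origin. The base R sketch below is an illustration of the idea only; it is not code taken from xts, and the variable names and the UTC time zone are assumptions.

```r
k <- 2  # minutes per bar
times <- as.POSIXct("2002-10-21 04:01:00", tz = "UTC") + 60 * (0:5)

# Block id: number of whole k-minute blocks elapsed since 1970-01-01 UTC.
# Because 60 is divisible by k = 2, the blocks line up with the hour.
bucket <- as.numeric(times) %/% (60 * k)

# Start of the window each timestamp falls into.
window_start <- format(
  as.POSIXct(bucket * 60 * k, origin = "1970-01-01", tz = "UTC"),
  "%H:%M:%S"
)
print(data.frame(time = format(times, "%H:%M:%S"), window_start))

# The last row of each block, with 0 prepended and the final row appended,
# mimics the shape of the endpoints output.
ep_like <- c(0, which(diff(bucket) != 0), length(bucket))
print(ep_like)
```

Here 04:01 lands in the window starting at 04:00:00, while 04:02 and 04:03 share the window starting at 04:02:00, which is why the first aggregated bar contains a single record.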
So, in a way, you can state that for periods larger than 1 minute, the cutoff is between the end of one hour and the beginning of the next hour, or XX:59:59.999, just like I wrote above. Well, it's not a bug then, but the help for endpoints would be clearer if it mentioned these cutoff rules.