Frequency Offset Aliases - Future-proofing for the unknown
Hi!
As you may know, tickers pulled from yfinance are date-indexed like so:
YYYY-MM-DD --> 2019-07-08
The natural frequency for this is business days, i.e. the pandas offset alias B:
from darts import TimeSeries

ticker_series = TimeSeries.from_dataframe(
    data,
    time_col="date",
    value_cols=["Close"],
    freq="B",
    fill_missing_dates=True,
)
But what if a user were to upload a time series that has no inherent frequency among those listed by pandas? https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
For instance, this time series appears to be affected by daylight saving time:
2021-11-04 04:00:00,60394613.0
2021-11-08 05:00:00,55020868.0
2021-11-09 05:00:00,55020868.0
.
.
.
2022-11-04 04:00:00,60394453.0
Eight times out of ten, setting freq=None and fill_missing_dates=True does not work: Darts cannot work out the frequency of the new time series. You can try to inspect seasonality, but this does not help as much as you would like.
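One way to see the problem concretely: pandas itself cannot assign an offset alias to a DST-shifted index. A minimal sketch (the example timestamps below are adapted from the data above):

```python
import pandas as pd

# Timestamps like the ones above: the wall-clock hour jumps from 04:00 to
# 05:00 when daylight saving time ends, so the spacing is not uniform.
idx = pd.DatetimeIndex([
    "2021-11-04 04:00:00",
    "2021-11-05 04:00:00",
    "2021-11-08 05:00:00",
    "2021-11-09 05:00:00",
])

# pandas cannot find a fixed offset alias for this index.
print(pd.infer_freq(idx))  # None
```

Since `pd.infer_freq` returns None here, any downstream library that relies on a pandas offset alias has nothing to work with.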
One could work around this by ignoring the datetime and using the index:
ticker_series = TimeSeries.from_values(data["Close"].values)
But this removes the important element of time, especially when training global models with covariates that all have different lengths and timestamps. Some of those covariates may be daily while others have minute-by-minute resolution. You want the times to line up with each other.
I could also do some fancy string manipulation, BUT that only works if I know in advance what kind of timestamping I will be working with.
So this brings up the bigger question:
- What would the folks at Darts do when they do not know the offset alias BUT have timestamps that carry important insight into the data being processed, especially for global models? A time series freq can be ms/sec/min/hour/day/every-other-day/every_third_day/every_N_days/quarter/year/every_N_unit/etc.
- Is there something I am completely overlooking that will sort this all out automatically? 😄
Update: the only other thing I thought of was converting all incoming time series to unix timestamps in units of seconds. In theory, this standardizes the indexes of all covariates and target outputs on a real-world timeframe, all relative to one another.
new_date_in_sec = pd.Timestamp('2021-11-08 05:00:00').timestamp()
new_date_in_sec = pd.Timestamp('2021-11-08').timestamp()
That way, no matter what the time series is, all dates are still unique, and I can forecast based on these unique timestamp indexes.
The only problem with this approach is converting the result into a TimeSeries object.
pd.Timestamp("2019-05-14").timestamp() ==> 1557792000.0 (float)
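For what it's worth, the float does survive a round trip back to a Timestamp (a minimal sketch; note that `pd.Timestamp.timestamp()` treats tz-naive inputs as UTC):

```python
import pandas as pd

# Convert a date string to unix seconds (tz-naive inputs are assumed UTC).
seconds = pd.Timestamp("2019-05-14").timestamp()
print(seconds)  # 1557792000.0

# The float converts back to the same Timestamp without loss.
restored = pd.to_datetime(seconds, unit="s")
print(restored)  # 2019-05-14 00:00:00
```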
For instance:
date Open
0 2019-05-14 8.320000
1 2019-05-15 8.440000
2 2019-05-16 8.740000
3 2019-05-17 8.540000
4 2019-05-20 8.460000
.. ... ...
754 2022-05-10 102.980003
755 2022-05-11 93.470001
756 2022-05-12 83.040001
757 2022-05-13 99.000000
758 2022-05-16 98.800003
If I then apply this to my dataframe:
import pandas as pd

def convert_str_to_sec(x):
    # convert string --> Timestamp
    x = pd.Timestamp(x)
    # convert Timestamp --> unix seconds
    x = x.timestamp()
    return x

ticker_df = pd.read_csv("ticker.csv")
ticker_df['date'] = ticker_df['date'].apply(convert_str_to_sec)
print(ticker_df.iloc[:, :2])
Results in:
date Open
0 1.557792e+09 8.320000
1 1.557878e+09 8.440000
2 1.557965e+09 8.740000
3 1.558051e+09 8.540000
4 1.558310e+09 8.460000
.. ... ...
754 1.652141e+09 102.980003
755 1.652227e+09 93.470001
756 1.652314e+09 83.040001
757 1.652400e+09 99.000000
758 1.652659e+09 98.800003
print(type(ticker_df['date'].iloc[0])) --> <class 'numpy.float64'>
How can I then create a TimeSeries from this resulting df, with X = ["date"] and Y = ["Open"]?
Nothing seems to accept the date column format I have made.
ticker_series = TimeSeries.??( )
Note:
- I cannot convert the column to int values because the decimals are important to keep.
Hi @martinb-bb, that's not a very easy question, in fact. A solution might be to use UTC datetimes - those are not affected by daylight saving and are a good universal way to represent dates without ambiguity. You can do all the processing in UTC, and then "translate back" to whatever time zone you were using. In your example, I think you could translate the timestamps into UTC datetimes to get more readable dates. Let me know if that solves your problem.
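To illustrate the idea with pandas time zone handling: if the timestamps in the question are UTC recordings of a fixed local-time event (that interpretation, and the US/Eastern zone, are my assumptions, not something stated in the data), converting them to the local zone makes the series regular again:

```python
import pandas as pd

# The 04:00 -> 05:00 jump is consistent with a fixed local event (e.g.
# midnight US/Eastern) recorded in UTC across the end of daylight saving.
utc = pd.DatetimeIndex(
    ["2021-11-04 04:00:00", "2021-11-05 04:00:00", "2021-11-08 05:00:00"],
    tz="UTC",
)

# Viewed in the assumed local zone, every entry falls at midnight:
local = utc.tz_convert("US/Eastern")
print(local)
```

The same pattern works in reverse: `tz_localize` a naive local index to its zone, then `tz_convert("UTC")` to get unambiguous timestamps for processing.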
We need to work with known offsets in Darts because we need ways to extrapolate the dates of future data points (when forecasting, or when doing "date arithmetic", e.g. to align covariates with targets). Using pandas offsets is by far the most convenient choice because it's more or less universal, but it might be too limited to address all use cases. We may consider using another date system at some point in the future if we find a better one... We would need a very strong motivation, though.
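A small sketch of the "date arithmetic" this refers to: with a known pandas offset, future timestamps can be generated mechanically, which is what makes forecasting horizons and covariate alignment tractable.

```python
import pandas as pd

# With a known offset alias, extrapolating future dates is mechanical.
# "B" (business day) even skips weekends automatically.
offset = pd.tseries.frequencies.to_offset("B")
friday = pd.Timestamp("2022-05-13")  # a Friday

print(friday + offset)      # next business day: Monday 2022-05-16
print(friday + 3 * offset)  # three business days later: 2022-05-18
```

Without an offset alias (e.g. the DST-shifted index above), there is no such rule to apply, which is exactly the limitation being discussed.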