DataFrames.jl
time sequences with holes --> no holes in times, missings for data in introduced rows
This comes up as a question too often. Given an ordered sequence of N DateTimes and data vectors of N values each, DataFrames makes it easy to form a frame of time-indexed rows of data.
Often, the sequence of [increasing] DateTimes is not complete in itself. For example, with daily price and volume values for some easily traded financial market asset, weekends and political holidays occur, and no corresponding entry exists in the original sequence of "daily" times with price and volume. Series that report weekly may be used to fill in some detail for a more bedrock monthly series.
The data frame that is constructed has gaps in its timestamp sequence. Analyzing the information is made more direct (and often faster, if less veridical inside the imputed 20% of the observations) by advancing the initial frame to a fully periodic (Daily, Weekly, Monthly) derived/calculated/estimated second frame, where any introduced row (e.g. each Sunday) is populated with missing values [except for the introduced time values].
This has been a driver of the design for R's zoo; we should live at least as large 🏖️
How does this requirement fit into the scope of https://github.com/xKDR/TSx.jl?
CC @chiraganand
Yes, it does. There are plans to incorporate features related to the regularity of series, similar to R's zoo package. For now, some workaround is required to fill in missing dates, for example using ts.index.
Inserting implicitly missing values in a sequence is more general than just time. Tidyr has a nice function, full_seq, for this: https://tidyr.tidyverse.org/reference/full_seq.html
full_seq(c(1, 2, 4, 5, 10), 1)
#> [1] 1 2 3 4 5 6 7 8 9 10
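A rough Julia counterpart to the tidyr call above might look like the following. This is only a sketch: the name full_seq and its two-argument signature are assumptions borrowed from tidyr, not an existing DataFrames.jl or Base function, and the tolerance handling that tidyr offers is left out.

```julia
using Dates

# Hypothetical helper, loosely modeled on tidyr's full_seq.
# Not part of DataFrames.jl or Base.
function full_seq(xs, period)
    lo, hi = extrema(xs)
    seq = lo:period:hi
    # Every observed value must fall on the regular grid starting at the minimum
    all(in(seq), xs) ||
        throw(ArgumentError("values do not all lie on a grid with the given period"))
    return collect(seq)
end

full_seq([1, 2, 4, 5, 10], 1)                           # the integers 1 through 10
full_seq([Date(2024, 1, 5), Date(2024, 1, 8)], Day(1))  # Jan 5, 6, 7, 8
```

Because it relies only on ranges, the same sketch covers integers and Dates alike, which matches the point that the problem is more general than time.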
Such a full_seq function makes sense, but I think it should be added to some utility package rather than to DataFrames.jl (as it is not related to DataFrame).
Just as an additional comment, currently the easiest way to get what you want with DataFrames.jl is leftjoin!:
julia> df_gaps = DataFrame(stamp = [2, 4, 5, 7], id=1:4)
4×2 DataFrame
 Row │ stamp  id
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     4      2
   3 │     5      3
   4 │     7      4

julia> leftjoin!(DataFrame(stamp=1:8), df_gaps, on=:stamp)
8×2 DataFrame
 Row │ stamp  id
     │ Int64  Int64?
─────┼────────────────
   1 │     1  missing
   2 │     2        1
   3 │     3  missing
   4 │     4        2
   5 │     5        3
   6 │     6  missing
   7 │     7        4
   8 │     8  missing
where in stamp=1:8 you explicitly specify what sequence you consider to be complete.
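The same join pattern carries over to the daily-series case from the original question. A minimal sketch, assuming made-up illustrative prices and the hypothetical names prices, calendar, and filled; the non-mutating leftjoin is used here, followed by a sort because row order after a join is not relied upon:

```julia
using DataFrames, Dates

# Illustrative (invented) daily observations; the weekend of
# 2024-01-06 and 2024-01-07 has no rows
prices = DataFrame(
    date  = [Date(2024, 1, 4), Date(2024, 1, 5), Date(2024, 1, 8)],
    close = [101.2, 100.7, 102.3],
)

# The sequence treated as complete: every calendar day in the observed span
calendar = DataFrame(date = minimum(prices.date):Day(1):maximum(prices.date))

# Rows that exist only in `calendar` get `missing` in the data columns
filled = leftjoin(calendar, prices, on = :date)
sort!(filled, :date)
```

Swapping Day(1) for Week(1) or Month(1) changes which sequence counts as complete, provided the observed dates actually fall on that grid.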
This is how I did it when I needed to (but with outerjoin instead of leftjoin). Unfortunately, I was unaware of TSx.jl, so I made my own small extension package for working with DateTime "indexed" DataFrames (it is a WIP): DateTimeDataFrames. The relevant function is called expandindex.
TSx currently has DataFrames.join() wrappers for four types of joins: JoinInner, JoinOuter, JoinLeft, and JoinRight. These handle missing values as DataFrames does, based on the join type used.
API reference: https://xkdr.github.io/TSx.jl/dev/api/#Base.join-Tuple{TS,%20TS}
The reason we prefer such extensions to live in add-on packages is that the DataFrames.jl core is already complex enough to maintain. Even a small change or decision in DataFrames.jl is usually hard and requires a lot of discussion, so it is better to have such things outside, where progress can be faster. Additionally, such modularity helps with package load time (i.e. users load only what they need), but this is a minor issue.