DataFrames.jl

time sequences with holes --> no holes in times, missings for data in introduced rows

Open JeffreySarnoff opened this issue 2 years ago • 7 comments

This comes up as a question too often. Given an ordered sequence of N DateTimes, say, and data vectors of N values each, DataFrames makes forming a frame of time-indexed rows of data easy-peasy.

Often, the sequence of [increasing] DateTimes is not complete in itself. For example, with daily price and volume values for some easily traded financial market asset, weekends and political holidays occur, and the original sequence of "daily" times has no corresponding price and volume entries. Series that report weekly may be used to fill in some detail for a more bedrock monthly series.

The data frame that is constructed has gaps in its timestamp sequence. Analyzing the information is made more direct (and often faster, if less veridical inside the imputed 20% of the observations) by advancing the initial frame to a fully periodic (Daily, Weekly, Monthly) derived/calculated/estimated second frame, where every introduced row (e.g. each Sunday) is populated with missing values, except for the introduced time values.

This has been a driver of the design of R's zoo; we should live at least as large 🏖️

JeffreySarnoff • Jun 09 '22 18:06

How does this requirement fit into the scope of https://github.com/xKDR/TSx.jl?

CC @chiraganand

bkamins • Jun 09 '22 19:06

> How does this requirement fit into the scope of https://github.com/xKDR/TSx.jl?

Yes, it does fit. There are plans to incorporate features related to the regularity of series, similar to R's zoo package. For now, some workaround is required to fill in missing dates, for example using ts.index.

chiraganand • Jun 17 '22 10:06

Inserting implicitly missing values in a sequence is more general than just time. Tidyr has a nice function full_seq for this. https://tidyr.tidyverse.org/reference/full_seq.html

full_seq(c(1, 2, 4, 5, 10), 1)
#>  [1]  1  2  3  4  5  6  7  8  9 10
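
For comparison, a minimal Julia sketch of the same idea. The name `full_seq` and its signature here are hypothetical (borrowed from tidyr for illustration); no such function is assumed to exist in Base or DataFrames.jl, and tidyr's tolerance argument is left out.

```julia
# Hypothetical helper, loosely modelled on tidyr::full_seq.
# Intended for integers or Dates; floating-point tolerance is not handled.
function full_seq(x, period)
    lo, hi = extrema(x)
    seq = lo:period:hi
    # Every observed value must lie on the regular grid.
    all(in(seq), x) || throw(ArgumentError("values do not fit a regular sequence with this period"))
    return collect(seq)
end

full_seq([1, 2, 4, 5, 10], 1)
```

Because ranges are generic, the same sketch should also accept `Date` values with a period such as `Day(1)` once `Dates` is loaded.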

jariji • Aug 31 '22 07:08

Such a full_seq function makes sense, but I think it should be added to some utility package rather than to DataFrames.jl (as it is not related to DataFrame).

Just as an additional comment, currently the easiest way to get what you want with DataFrames.jl is leftjoin!:

julia> df_gaps = DataFrame(stamp = [2, 4, 5, 7], id=1:4)
4×2 DataFrame
 Row │ stamp  id
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     4      2
   3 │     5      3
   4 │     7      4

julia> leftjoin!(DataFrame(stamp=1:8), df_gaps, on=:stamp)
8×2 DataFrame
 Row │ stamp  id
     │ Int64  Int64?
─────┼────────────────
   1 │     1  missing
   2 │     2        1
   3 │     3  missing
   4 │     4        2
   5 │     5        3
   6 │     6  missing
   7 │     7        4
   8 │     8  missing

where stamp=1:8 explicitly specifies the sequence you consider to be complete.
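
The same pattern carries over to the time-indexed case from the original question. The following is only a sketch under assumptions: the column names `date` and `price` and the sample values are made up for illustration, and the complete sequence is taken to be every calendar day between the first and last observation.

```julia
using DataFrames, Dates

# Illustrative data: June 8 is absent from the observed dates.
df_gaps = DataFrame(date = [Date(2022, 6, 6), Date(2022, 6, 7), Date(2022, 6, 9)],
                    price = [10.0, 10.5, 11.0])

# The sequence considered complete: one row per day, first to last observation.
full = DataFrame(date = minimum(df_gaps.date):Day(1):maximum(df_gaps.date))

# Introduced rows get `missing` in every non-key column.
leftjoin!(full, df_gaps, on = :date)
```

Swapping `Day(1)` for `Week(1)` or `Month(1)` gives a weekly or monthly grid, provided the observed dates actually fall on that grid.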

bkamins • Aug 31 '22 08:08

> Such a full_seq function makes sense, but I think it should be added to some utility package rather than to DataFrames.jl (as it is not related to DataFrame).
>
> Just as an additional comment, currently the easiest way to get what you want with DataFrames.jl is leftjoin!: ...

This is how I did it when I needed to (but with outerjoin instead of leftjoin). Unfortunately, I was unaware of TSx.jl, so I made my own small extension package for working with DateTime "indexed" DataFrames (it is a WIP): DateTimeDataFrames. The relevant function is called expandindex.
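
A rough sketch of the outerjoin variant, reusing the `df_gaps` example from above. This is not the implementation of expandindex, just an illustration of the join itself. The explicit sort is there because the row order of an outerjoin result should not be relied on.

```julia
using DataFrames

df_gaps = DataFrame(stamp = [2, 4, 5, 7], id = 1:4)

# Unlike leftjoin!, outerjoin would also keep rows of df_gaps whose stamp
# falls outside the 1:8 sequence, had there been any.
df_full = outerjoin(DataFrame(stamp = 1:8), df_gaps, on = :stamp)

# Restore the time order explicitly.
sort!(df_full, :stamp)
```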

kpa28-git • Sep 27 '22 01:09

TSx currently has DataFrames.join() wrappers for four types of joins: JoinInner, JoinOuter, JoinLeft, and JoinRight. These handle missing values as DataFrames does, based on the join type used.

API reference: https://xkdr.github.io/TSx.jl/dev/api/#Base.join-Tuple{TS,%20TS}

chiraganand • Sep 27 '22 06:09

The reason we prefer such extensions to live in add-on packages is that the DataFrames.jl core is already complex enough to maintain. Even a small change or decision in DataFrames.jl is usually hard and requires a lot of discussion, so it is better to have such things outside, where progress can be faster. Additionally, such modularity helps with package load time (i.e. users load only what they need), though this is a minor issue.

bkamins • Sep 27 '22 06:09