ggplotnim

Date example

Open lf-araujo opened this issue 3 years ago • 2 comments

Hi Vindaar,

First thank you for this excellent module.

For future reference, here is a common use of Date types in R; supporting it would improve ggplotnim's plotting capabilities. That said, I believe this may be better handled in a separate data management module.

So, for the data set:

https://gist.github.com/lf-araujo/5da7266c44b3824b824d578411dee73c

The dates are entered at arbitrary intervals, but one would usually want the axis standardised by year. Typical R code uses as.Date() and looks like the following:

library(ggplot2)
library(magrittr)  # provides %>%

read.csv("finances.csv") %>%
  ggplot(aes(x = as.Date(date), y = yrate)) +
  geom_line() +
  geom_point()

Resulting in:

[image: plot produced by the R code above]

However, current ggplotnim code would look like:

import ggplotnim

proc plot() =
  let df = toDf(readCsv("./finances.csv"))
  ggplot(df, aes(x = "date")) +
    geom_line(aes(y = "yrate")) +
    ylab("Rate") +
    xlab(rotate = -45.0, margin = 1.75, alignTo = "right") +
    ggtitle("Rate evolution") +
    ggsave("finances.png")

and finances.png looks rather confusing:

[image: finances.png]

It is not handling negative values correctly and the dates are not standardised by year. This is currently not possible and not supported. This example is just for future testing.

lf-araujo, Nov 08 '20 14:11

Hey!

Thanks for the example. From this I can deduce two typical usage examples:

  • have dates as a string and allow for parsing to Time (your example)
  • have dates as timestamp and allow for parsing to Time

with useful choice of ticks / formatting.
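
The two cases can be sketched with Nim's std/times alone. This is only an illustration under stated assumptions: the helper names `dateToUnix` and `unixToLabel` are hypothetical, and the "yyyy-MM-dd" layout is assumed to match the data.

```nim
import std/times

# Case 1: a date stored as a string, parsed with an explicit format.
# `dateToUnix` is a hypothetical helper name, not part of ggplotnim.
proc dateToUnix(s: string): int64 =
  parseTime(s, "yyyy-MM-dd", utc()).toUnix

# Case 2: a date already stored as a Unix timestamp (seconds since epoch),
# turned back into a label. Also a hypothetical helper.
proc unixToLabel(ts: int64): string =
  fromUnix(ts).format("yyyy-MM-dd", utc())

let ts = dateToUnix("2020-11-08")
echo ts               # seconds since 1970-01-01 UTC
echo unixToLabel(ts)  # round-trips to the input string
```

Using the same time zone for parsing and formatting keeps the round trip stable regardless of the machine's local zone.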

I will probably first try to accomplish this without requiring the data frame to handle a new Time data type. Such a type would make the plotting logic easier, but it would make the implementation significantly more complex (the Column and Value variant types would have to be extended). Instead, I plan to handle this implicitly, using either parse or fromUnix. The internal data will still be treated as float / string.

> and finances.png looks rather confusing: ... It is not handling negative values correctly and the dates are not standardised by year. This is currently not possible and not supported.

The issue you're seeing here both for the x and y axis is a simple one.

For x: your dates are given as strings, so the data is interpreted as discrete (continuous string data is not a useful concept). The saner approach is to convert the dates to timestamps and format the labels manually. As timestamps, the dates are treated as continuous data and the number of ticks comes out reasonable; custom formatting then gives nice labels. For the moment the conversion to timestamps has to be done manually (which can be a bit ugly, but we can easily add sugar for this).

For y: this one is really simple. Your data is not "continuous" according to the heuristics used by ggplotnim, an unfortunate side effect of "trying to do the right thing". (Arguably it might be a good idea to always treat float data as continuous.) Essentially, ggplotnim looks at a subset of 100 rows of a column (random indices) and checks whether the number of distinct values exceeds a certain percentage. If not, the data is treated as discrete. Since most of your entries are 0.019, that threshold is not crossed. And because the labels are then compared as strings, the negative values suddenly appear at the top and the distance between values no longer corresponds to their numerical difference. One can easily force the scale to be continuous using scale_y_continuous().
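
The heuristic described above can be sketched roughly as follows. This is a hypothetical re-implementation for illustration, not ggplotnim's actual code: the name `looksContinuous`, the 0.5 threshold, and the fixed seed are all assumptions. The second half shows why string-compared labels no longer follow numeric order.

```nim
import std/[algorithm, random, sequtils, sets, strutils]

proc looksContinuous(data: seq[float]; sampleSize = 100;
                     threshold = 0.5; seed = 42): bool =
  ## Hypothetical sketch: sample up to `sampleSize` random rows and call the
  ## column continuous if the share of distinct values exceeds `threshold`.
  ## The threshold value is an assumption, not ggplotnim's real setting.
  if data.len == 0:
    return false
  var rng = initRand(seed)
  var seen = initHashSet[float]()
  let n = min(sampleSize, data.len)
  for _ in 0 ..< n:
    seen.incl data[rng.rand(data.high)]
  result = seen.len.float / n.float > threshold

# A column dominated by one value versus a column of all-distinct values.
let mostlyConstant = newSeqWith(95, 0.019) & @[-0.01, -0.005, 0.002, 0.01, 0.03]
let spread = toSeq(0 ..< 1000).mapIt(it.float / 1000.0)
echo looksContinuous(mostlyConstant)
echo looksContinuous(spread)

# Discrete labels compared as strings do not sort in numeric order.
let labels = @["0.019", "-0.005", "0.002", "-0.01"]
echo labels.sorted()                     # string comparison
echo labels.sortedByIt(parseFloat(it))   # numeric comparison
```

In the string ordering, "-0.005" sorts before "-0.01", the reverse of their numeric order, which is the kind of mismatch that scrambles a discrete axis.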

Full example with a few comments:

import ggplotnim, times

let df = toDf(readCsv("./data/finances.csv"))
  # perform calculation of the timestamp using `parseTime`
  # have to give type hints
  .mutate(f{string -> int64: "timestamp" ~ parseTime(df["date"][idx],
                                                     "YYYY-MM-dd",
                                                     utc()).toUnix})
# alternatively we could do (if `df` is mutable):
# df["timestamp"] = df["date"].toTensor(string).map_inline(
#   parseTime(x, "YYYY-MM-dd", utc()).toUnix)

proc formatDate(f: float): string =
  ## format timestamp to YYYY-MM-dd. We do not only format via
  ## YYYY, because that will fool us. The ticks will ``not`` be placed
  ## at year change (31/12 -> 01/01), but rather at the "sensible" positions
  ## in unix timestamp space. So if you only format via YYYY we end up with
  ## years not sitting where we expect! That's the major downside of having
  ## no "understanding" of what dates mean.
  # format in UTC, matching the zone used when parsing the dates
  result = fromUnix(f.int).format("YYYY-MM-dd", utc())

ggplot(df, aes(x = timestamp, y = yrate)) + # can use raw identifiers if not ambiguous
  geom_line() +
  geom_point() +
  xlab("Date") + ylab("Rate") +
  scale_y_continuous() + # force y continuous
  # use `formatDate` to format the dates from timestamp
  scale_x_continuous(labels = formatDate) +
  ggtitle("Rate evolution") +
  ggsave("finances.png")
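
The caveat in the `formatDate` comment can be made concrete with a small sketch. It assumes ticks are spaced evenly in Unix-timestamp space; `evenTicks` is a hypothetical stand-in and not how ggplotnim actually chooses its ticks, and the two dates are arbitrary.

```nim
import std/[sequtils, times]

proc evenTicks(lo, hi: int64; n: int): seq[int64] =
  ## `n` evenly spaced positions from `lo` to `hi` inclusive (needs n >= 2).
  ## Hypothetical stand-in for a tick-placement routine.
  for i in 0 ..< n:
    result.add lo + (hi - lo) * i.int64 div (n - 1).int64

let lo = parseTime("2016-03-15", "yyyy-MM-dd", utc()).toUnix
let hi = parseTime("2020-11-08", "yyyy-MM-dd", utc()).toUnix
let ticks = evenTicks(lo, hi, 5)

# Year-only labels look tidy but hide where each tick really sits.
let yearLabels = ticks.mapIt(fromUnix(it).format("yyyy", utc()))
# Full dates reveal that no tick falls on a year boundary.
let fullLabels = ticks.mapIt(fromUnix(it).format("yyyy-MM-dd", utc()))

echo yearLabels
echo fullLabels
```

With year-only formatting, every tick carries a plausible year label even though none of them sits on 1 January, which is why the example above formats the full date.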

This issue will remain open until the handling of such things is more convenient.

[image: finances.png]

Vindaar, Nov 09 '20 10:11

As I said, I'll leave this open as a reminder for myself (and for others to find it easier) until a cleaner solution with less manual work is available.

Vindaar, Nov 09 '20 13:11