readxl icon indicating copy to clipboard operation
readxl copied to clipboard

Allow user to specify col type is just date or time (vs full datetime)

Open billdenney opened this issue 5 years ago • 4 comments

https://github.com/tidyverse/readxl/blob/5fbe9975d3afb97224332489f05d6688974e79a8/src/ColSpec.h#L129-L134

In the above lines of code, the numFmt values, in most cases, specify dates without times. For example:

  • numFmtId 14: mm-dd-yy only has date components, and numFmtId 15 as well
  • 16 and 17 are imprecise dates without all components (16 is missing year; 17 is missing day)
  • 18-21 are time-only formats with no dates
  • 22 is a date-time

All of the above, and their equivalents in other languages, should be returned to the R user with the correct date precision.

For the formats that do not include times, they should be returned as a class Date object while formats that do include times should return as a class POSIXct object. Returning everything as a POSIXct object gives an inaccurate picture of the precision.

Returning more than what is visually presented (especially the time-only formats showing as being on the day 1899-12-31) misrepresents the available data.

The representation of more data than are provided is also related to tidyverse/lubridate#690.

billdenney avatar Aug 31 '18 03:08 billdenney

As I've been thinking about this more, it may be simpler to use a heuristic like the following:

  • If all values in a column are >=0 and <1, assume that they are time-only values.
    • This heuristic would cause a bug if all values in the column where for the date of 1899-12-31; that seems sufficiently rare compared to a time-only value. It should probably issue a warning for this reason.
    • Time-only values seem best represented to me as a difftime or lubridate period class.
  • If all values in a column are integers, make them date columns (not date-time). This should cause no issue.
  • If any value in the column is >=1 or <0 and any value is not an integer, make it a date-time column.

billdenney avatar Sep 01 '18 01:09 billdenney

This is tied up with other issue clusters around column typing, col spec, formats.

  • #198 Add column specification as in readr
  • #118 Convert character <--> date

Returning everything as a POSIXct object gives an inaccurate picture of the precision. Returning more than what is visually presented (especially the time-only formats showing as being on the day 1899-12-31) misrepresents the available data.

I know what you're getting at, but this isn't technically true. Excel stores all of this as a floating point, serial date time. Full stop. The formats only control what is presented to the user visually. You can switch between all the formats listed above and it does not change the numeric value stored for a cell.

But yes it would be nice for a user to be able to specify they expect a date, with no time, for example. We can't guess this automatically because this would actually throw away data. From my experience in general, I will say that lots of people are deeply confused about and not intentionally managing their cell formats. So guessing col type based on format will cause new problems.

jennybc avatar Dec 14 '18 00:12 jennybc

@jennybc, I know that Excel stores everything as a floating point, serial number, but the file specification provides an indication of intent via the format (as defined on page 1777 pdf page 1787 of the standard).

From my experience in general, I will say that lots of people are deeply confused about and not intentionally managing their cell formats. So guessing col type based on format will cause new problems.

My experience is the same as yours, but what I see is that most people expect what is read into R (or more exactly what they are sending me to read into R) is what they see in Excel. If they see a floating point value represented by number format 18, 19, 20, 21, 45, 46, or maybe 47, they think that I will receive something that is a time without a date.

A few things that occur to me for brainstorming ways to handle formats:

  • Do number to string conversion based on the format.
    • With that, number formats that are explicitly just re-representations of numbers (0-4, 9-11, 37-40, and 48) would come to R as numbers and everything else would be formatted as in Excel? Then, the R user could convert to what they want based on the text.
  • Allow the user an option for how to interpret formatting by format code (though that requires some deep knowledge by the user and some very flexible format options in readxl).
  • Provide an option not to interpret number formats and have all columns return with an attribute of the format value.

billdenney avatar Dec 14 '18 01:12 billdenney

I'm sorry for necroposting, but I wanted to add that currently, col_types=c("date") will not return a date but a datetime. Even the import dialog in RStudio explicitly lists "Date" when clicking on a column header in the preview, but it will still generate a datetime column. Sure it's no problem to convert it with e.g. as.Date() after importing, but it seems confusing (a bug?) so I thought I'd mention it.

cbrnr avatar Nov 30 '22 07:11 cbrnr