dataverse icon indicating copy to clipboard operation
dataverse copied to clipboard

Date columns ingest incorrectly

Open adam3smith opened this issue 4 years ago • 2 comments

When importing an Excel sheet (.xlsx) with a date column, the dates in the ingested .tab file are converted to large numbers (roughly corresponding to the days since January 1, 1900). That's regardless of how the dates are stored in Excel (I tried both YYYY-MM-DD and MM/DD/YYYY)

You can see this in this deposit: https://demo.dataverse.org/dataset.xhtml?persistentId=doi%3A10.70122%2FFK2%2FFL31GB

Here's the Excel image

And here's the preview of the .tab (same behavior when downloaded) image

I think it's pretty clear that this is undesirable. FWIW, Excel itself exports the dates as formatted to CSV when using Save as...

adam3smith avatar Apr 21 '21 21:04 adam3smith

Some notes: Looks like Excel does store all dates (and date/times) as # of days since Jan 1 1900 and the fact that it's a date isn't obvious from the value itself. However, some of the xml attributes for the cell do connect to the format. Specifically, the attribute t="s" implies a string whereas no t attribute is numeric. An s= <#> is a reference to the set of cellXfs entries in the styles.xml subdocument. Those entries have a numFmtId value that may reference an internal date format or a custom one listed in the numFmts element. I found a list of internal date format numbers: @date [14,15,16,17,18,19,20,21,22,27,30,36,45,46,47,50,57] but can't vouch for accuracy beyond the fact that the example file above uses numFmtId=14 for one of the two columns. For custom numFmts, there is a formatCode attribute (="m/d/yy;@" for the right column in the sample). I don't see any way to decide if that's a date format beyond looking for the yyyy mm dd m d yy standard values. With all that, I think it would be possible to catch the attributes in the current xslx parser and decide whether the values are dates.

qqmyers avatar Apr 22 '21 22:04 qqmyers

Reproduced to date on V5.3 and the Dataverse demo site (V5.11). When ingesting a file in xlsx format, the values in the file in DATE format are incorrectly ingested: when preview or download .tab format. But when convert to CSV, the values in the file in DATE format are correctly ingested.

sirineREKIK avatar Aug 01 '22 13:08 sirineREKIK