ReadStat icon indicating copy to clipboard operation
ReadStat copied to clipboard

CSV -> DTA Conversion Fails to Format Dates Properly

Open FlipperPA opened this issue 1 year ago • 3 comments

I'm attempting to convert a csv file to STATA dta format using the JSON metadata format.

I've put together an example with just a few columns:

permno,date,eom
10107,2020-01-31,2020-01-31
10107,2020-02-28,2020-02-29
[...]

This is the JSON file I've generated to provide the metadata for the conversion:

{
    "type": "STATA",
    "variables": [
        {
            "type": "NUMERIC",
            "name": "permno",
            "label": "CRSP permno (permno)",
            "format": "UNSPECIFIED"
        },
        {
            "type": "NUMERIC",
            "name": "date",
            "label": "Day of last price observation (date)",
            "format": "DATE"
        },
        {
            "type": "NUMERIC",
            "name": "eom",
            "label": "End of month (eom)",
            "format": "DATE"
        }
    ]
}

The conversion seems to run properly: Converted 3 variables and 120 rows in 0.01 seconds

However, when I open up the file in STATA v18, it doesn't properly format the dates. I've read through the code, and I seem to be triggering the proper type and format selection, as if I change them, it triggers an error. I'm a STATA novice, but here's what the CLI is reporting for the generated file:

. use "~/test.dta"
((null))

. describe

Contains data from ~/test.dta
 Observations:           120                  (null)
    Variables:             3                  16 Apr 2024 14:04
-------------------------------------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------------------------------------------------------
permno          double  %10.0g                CRSP permno (permno)
date            long    %12.0g                Day of last price observation (date)
eom             long    %12.0g                End of month (eom)
-------------------------------------------------------------------------------------------------------------------------------
Sorted by: 

. tabulate eom

     End of |
month (eom) |      Freq.     Percent        Cum.
------------+-----------------------------------
      21945 |          1        2.50        2.50
      21974 |          1        2.50        5.00
[...]

As you can see, it doesn't seem to be properly applying type or format, and when I tabulate, it displays as a number rather than a date. Have I uncovered a bug, or am I doing something wrong? I tried to follow the README and the spec.

FlipperPA avatar Apr 16 '24 18:04 FlipperPA

Stata date variables are just longs with a certain formatting applied. Try running format %tdCCYY-NN-DD date eom on your imported data.

NilsEnevoldsen avatar Apr 16 '24 19:04 NilsEnevoldsen

@NilsEnevoldsen Thanks so much for the quick response, and further information. I'll give it a whirl and come back to close the ticket. 💯

FlipperPA avatar Apr 17 '24 11:04 FlipperPA

We're continuing to see this issue, and several more experienced STATA users have told me that it is likely the format should be applied during the conversion process, since we are explicitly setting the DATE and DATE_TIME format in the JSON map. Could someone confirm for me that the format should be applied during the readstat conversion process? Am I doing something wrong during my conversion attempt?

When I use SAS to export the same data, the display format seems to be properly set to %td instead of %12.0g:

-----------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-----------------------------------------------------------
permno          double  %8.0g                 PERMNO
date            long    %td                   Names Date
eom             long    %td                   Names Ending Date

And it shows up properly in the data:

     +---------------------------------+
     | permno        date   eom        |
     |---------------------------------|
  1. |  10107   31jan2020          .   |
  2. |  10107   03feb2020          .   |
  3. |  10107   04feb2020          .   |
[...]

FlipperPA avatar Oct 09 '24 15:10 FlipperPA