
inflated .dta files due to data storage type

Open maxecharel opened this issue 3 years ago • 5 comments

I often have to export data frames in .dta format, which I do using haven::write_dta. The problem is that the Stata users with whom I share these files often have to run Stata's compress command to reduce their size. Would it be possible to add a feature to write_dta that adjusts the storage type, mimicking what compress does in Stata?

maxecharel avatar Nov 25 '21 11:11 maxecharel
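
Until such a feature exists, part of what compress does can be approximated on the R side before writing. The following is a minimal sketch, not part of haven: `shrink_numeric` is a hypothetical helper, and it assumes that haven writes R integer vectors to a narrower Stata storage type than R doubles. It converts double columns that hold only whole numbers within integer range to integer and leaves everything else untouched.

```r
# Hypothetical helper (not part of haven): downcast whole-number doubles to
# integer so they can be written with a narrower storage type.
shrink_numeric <- function(df) {
  int_max <- .Machine$integer.max
  for (nm in names(df)) {
    x <- df[[nm]]
    # Only touch plain doubles; skip classed vectors such as dates or labelled.
    if (!is.double(x) || is.object(x)) next
    vals <- x[!is.na(x)]
    whole <- all(is.finite(vals)) && all(vals == trunc(vals))
    if (whole && all(abs(vals) <= int_max)) {
      # Preserve attributes (for example variable labels) across the conversion.
      attrs <- attributes(x)
      x <- as.integer(x)
      attributes(x) <- attrs
      df[[nm]] <- x
    }
  }
  df
}

df <- data.frame(id = c(1, 2, 3), score = c(0.5, 1.25, NA), n = c(10, NA, 30))
small <- shrink_numeric(df)
```

The result can then be passed to haven::write_dta as usual. This only narrows doubles to integers; it does not pick among Stata's byte, int and long types the way compress does.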

Hi!

Thanks for the feature request. Currently we only have a simple mapping from R to Stata types, and I suspect that allowing custom mappings would make our file writing code significantly more complex, but I'll have a look into it.

gorcha avatar Nov 29 '21 02:11 gorcha

Hi @gorcha, thanks for considering this request. I think it could be a great feature, especially for organizations in which some employees cannot even imagine working with anything other than Stata and will seize on any argument they can find to stick with it ;)

maxecharel avatar Nov 30 '21 09:11 maxecharel

Thanks @gorcha for looking into this. It would be really great if this could be fixed. I just had a case in which Stata's compress reduced the size of the .dta file from 190 MB to 14 MB (and this was only a small snapshot of the full data set) - probably because this dataset included a lot of strings which were exported as huge strings (Stata type str2045) although they were actually pretty short (often fewer than 10 characters).

maraab23 avatar Feb 22 '22 18:02 maraab23

Hi @maraab23, string widths should be set to the maximum length of any character value in the vector by default, so this seems like it might be a different issue.

Are the character vectors in question completely blank by any chance? (See #558)

gorcha avatar Feb 27 '22 04:02 gorcha
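
To check both points raised above before writing a file, the character columns can be inspected directly. This is a minimal sketch in base R; `string_widths` is a hypothetical helper, not a haven function, and measuring length in bytes is an assumption about how the string width is determined.

```r
# Hypothetical helper: for each character column, report the longest value
# in bytes and whether every value is empty or missing.
string_widths <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  chr <- df[is_chr]
  max_bytes <- vapply(chr, function(x) {
    # Drop NAs first, and fall back to 0 when nothing is left.
    max(nchar(x[!is.na(x)], type = "bytes"), 0L)
  }, integer(1))
  all_blank <- vapply(chr, function(x) all(is.na(x) | x == ""), logical(1))
  data.frame(column = names(chr), max_bytes = unname(max_bytes),
             all_blank = unname(all_blank), row.names = NULL)
}

df <- data.frame(a = c("x", "hello", NA), b = c("", "", NA), n = 1:3,
                 stringsAsFactors = FALSE)
widths <- string_widths(df)
```

Columns flagged `all_blank` are the ones relevant to the question about completely blank vectors, while a small `max_bytes` next to a str2045 variable in Stata would point to a different cause.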

I hate to toot our own horn, but if this is important to you and you are willing to spend some money, Stat/Transfer automatically optimizes the output such that Stata files are as small as those compressed by Stata. It also will put variables into strl's with a user-settable length threshold. (stattransfer.com)

sjdubnoff avatar Apr 29 '22 21:04 sjdubnoff