haven
inflated .dta files due to data storage type
I often have to export data frames in .dta format, which I do using haven::write_dta. The problem is that the Stata users with whom I share these files often have to run Stata's compress command to reduce their size. Would it be possible to add an option to write_dta that adjusts the storage types, mimicking what compress does in Stata?
Hi!
Thanks for the feature request. Currently we only have a simple mapping from R to Stata types, and I suspect that allowing custom mappings would make our file writing code significantly more complex, but I'll have a look into it.
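In the meantime, one possible workaround on the R side is to convert whole-number columns stored as doubles to integers before exporting. This is only a sketch: shrink_whole_doubles() is a hypothetical helper, not part of haven, and it assumes that write_dta writes R doubles as 8-byte Stata doubles and R integers as 4-byte Stata longs. It does not go as far as compress, which can also demote variables to int or byte.

```r
# Hypothetical helper (not part of haven): convert plain double columns that
# contain only whole numbers to integer storage.
# Assumption: haven writes doubles as Stata `double` and integers as `long`.
shrink_whole_doubles <- function(df) {
  # Largest non-missing value a Stata `long` can hold.
  stata_long_max <- 2147483620
  for (name in names(df)) {
    x <- df[[name]]
    # Skip classed vectors (factors, dates, labelled) and non-doubles.
    if (!is.double(x) || is.object(x)) next
    vals <- x[!is.na(x)]
    whole <- all(is.finite(vals)) && all(vals == trunc(vals))
    if (whole && all(abs(vals) <= stata_long_max)) {
      # `storage.mode<-` changes the type but keeps attributes such as labels.
      storage.mode(x) <- "integer"
      df[[name]] <- x
    }
  }
  df
}

df <- data.frame(
  id = c(1, 2, 3),          # whole numbers stored as double
  score = c(0.5, 1.5, NA),  # genuinely fractional, left alone
  name = c("a", "bb", "ccc"),
  stringsAsFactors = FALSE
)
small <- shrink_whole_doubles(df)
sapply(small, typeof)
# haven::write_dta(small, "small.dta")  # export as usual afterwards
```

Columns with fractional values, classed vectors, and character columns are left untouched, so the helper should be safe to run on a whole data frame before export.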
Hi @gorcha, thanks for considering this request. I think it could be a great feature, especially for organizations in which some employees cannot even imagine working with anything other than Stata and will use any argument they can find to stick with it ;)
Thanks @gorcha for looking into this. It would be really great if this could be fixed. I just had a case in which Stata's compress reduced the size of the .dta file from 190 MB to 14 MB (and this was only a small snapshot of the full data set), probably because the dataset included a lot of strings which were exported as huge strings (Stata type str2045) although they were actually quite short (often fewer than 10 characters).
Hi @maraab23, string widths should be set to the maximum length of any character value in the vector by default, so this seems like it might be a different issue.
Are the character vectors in question completely blank by any chance? (See #558)
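To check this before exporting, something like the following could report the longest value in each character column. It is a rough diagnostic only: max_string_widths() is a hypothetical helper, not part of haven, and measuring in bytes is an assumption about how the str# width is derived.

```r
# Hypothetical diagnostic (not part of haven): longest value, in bytes,
# of every character column in a data frame.
max_string_widths <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  vapply(df[is_chr], function(x) {
    # keepNA = TRUE makes missing strings count as NA rather than 2 bytes.
    n <- nchar(x, type = "bytes", keepNA = TRUE)
    if (all(is.na(n))) 0L else max(n, na.rm = TRUE)
  }, integer(1))
}

df <- data.frame(
  a = c("x", "hello", NA),
  b = c("", "", ""),  # completely blank column, as in #558
  n = 1:3,
  stringsAsFactors = FALSE
)
widths <- max_string_widths(df)
widths
```

A column that reports a width of 0 is completely blank, which is the case asked about above; a column with short values that still ends up as str2045 in Stata would point to a different cause.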
I hate to toot our own horn, but if this is important to you and you are willing to spend some money, Stat/Transfer automatically optimizes the output so that Stata files are as small as those compressed by Stata. It will also put variables into strLs with a user-settable length threshold. (stattransfer.com)