sas7bdat
New code for reading compressed SAS data
Just found this: http://ggasoftware.com/opensource/parso
I haven't tried it yet, but knew they'd been working on it for some time.
Thanks for the link, Harry. I couldn't find the source code on their web page. Judging by their file naming, it looks like they may have used the information that we gathered about the file format, which is great. There are several other sas7bdat reader implementations that are faster than mine. My focus has been mostly on the reverse-engineering effort, rather than the reader-implementation effort.
Best, Matt
On Mon, Apr 14, 2014 at 7:05 AM, Harry Southworth [email protected]:
— Reply to this email directly or view it on GitHub: https://github.com/BioStatMatt/sas7bdat/issues/5
They started with something called SassyReader which is a Java library reverse engineered from your code. That library had trouble dealing with format catalogues (or at least broken ones) and compressed SAS files so GGA put some extra effort in and got it working much more generally.
I got parso running via a Java wrapper with some help from a friend. I've transformed a few dozen SAS datasets now, including compressed ones, and including at least one that had newlines in character fields that caused the SAS csv writer to create garbage, and parso has done the trick every time. If you want to figure out how to add it as an alternative engine to your sas7bdat package, let me know and I'll help out as much as I'm able.
Harry
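The garbage Harry describes is a common failure mode when character fields containing embedded newlines are written to CSV without quoting. A minimal sketch of RFC 4180-style field quoting that avoids the problem (the class and method names here are purely illustrative, not part of Parso or the sas7bdat package):

```java
// Sketch: RFC 4180-style quoting for CSV fields that contain embedded
// newlines, quotes, or commas -- the kind of fields that can garble a
// naive CSV writer.
public class CsvQuote {
    static String quoteField(String field) {
        // Quote only when the field contains a delimiter, a quote,
        // or a line break; double any embedded quotes.
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static void main(String[] args) {
        System.out.println(quoteField("line one\nline two"));
        System.out.println(quoteField("plain"));
    }
}
```

A writer that quotes this way keeps multi-line character fields inside a single logical CSV record, which standard CSV parsers (including R's read.csv) handle correctly.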
Harry,
Thanks for your offer to help. I am familiar with Parso. However, I'm not sure it would be worth the effort to interface the sas7bdat package with the Java code (unless you are very familiar with that mechanism). It might be best implemented as a separate package.
Ultimately, I'd like to incorporate the extra work those authors have done (compression, etc.) into the sas7bdat package in a more natural way. Unfortunately, adding new functionality to the package is a low priority at the moment.
If you are willing and able to help, I think the biggest contribution that you could make is to either wrap the Parso library into an R package, if its license permits it, or to help improve the sas7bdat file format documentation in the sas7bdat package (i.e., the 'sas7bdat' vignette). I'm sure that we can learn a lot from reading their source code. I would, of course, acknowledge any effort you provide within the 'sas7bdat' documentation.
Thanks for your interest!
Best, Matt
On Sun, Sep 7, 2014 at 7:13 AM, Harry Southworth [email protected] wrote:
OK. I went ahead and wrote an R package that wraps the Parso library. It's a bit slow, though.
https://github.com/biostatmatt/sas7bdat.parso
It /is/ slow, but it works where read.sas7bdat fails due to data compression! I had to give that .parso repo its first star for that!
I just added a function s7b2csv that should be a bit faster at converting the file, since the full read occurs without switching control back and forth between the Java and R code. It's still a sequential process (i.e., read a bit of the sas7bdat file, then write a bit of the CSV file). This is a safety play, since some files may not fit completely into memory. Ideally, the code would look at the size of the data (from the header information) and then decide whether to do a sequential read, or to read everything into memory and then write a CSV, if, say, the data size were less than 2 GB.
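The header-based strategy choice described above could be sketched as follows. The class name, method name, and 2 GB cutoff are hypothetical; in practice the row count and row length would come from the sas7bdat header:

```java
// Sketch of the strategy choice described above: estimate the uncompressed
// data size from header fields (names here are hypothetical), then pick an
// in-memory or sequential conversion.
public class ConvertStrategy {
    // Assumed cutoff: datasets under 2 GB are read entirely into memory.
    static final long IN_MEMORY_LIMIT = 2L * 1024 * 1024 * 1024;

    static String chooseStrategy(long rowCount, long rowLength) {
        long estimatedBytes = rowCount * rowLength;
        return estimatedBytes < IN_MEMORY_LIMIT ? "in-memory" : "sequential";
    }

    public static void main(String[] args) {
        System.out.println(chooseStrategy(1_000_000, 200));   // small dataset
        System.out.println(chooseStrategy(50_000_000, 500));  // large dataset
    }
}
```

Estimating from the uncompressed row length matters here: a compressed sas7bdat file that looks small on disk can still expand well past the cutoff once decompressed.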
Great, duly noted.
FWIW, I could not find the source code on their website, but it is apparently accessible from: http://search.maven.org/remotecontent?filepath=com/ggasoftware/parso/1.2.1/parso-1.2.1-sources.jar