ReadStatTables.jl

Reading specific rows from a large `sas7bdat` file

Open BERENZ opened this issue 1 year ago • 5 comments

Is there a way to add functionality to read specific rows from a large sas7bdat file? The issue I'm facing is that I have large SAS files (around 10GB) along with text files (an exact, flat copy of the SAS file). Based on the text file, I can specify the subset of rows that I'm interested in (around 10% of the file).

Another option is to specify a filter while reading, for example, reading rows based on a column. However, I understand that this may be more challenging to implement.

BERENZ avatar Sep 10 '24 09:09 BERENZ

Hi! Have you tried the keyword arguments `row_limit` and `row_offset`? They should allow reading just a portion of the file.

junyuan-chen avatar Sep 11 '24 14:09 junyuan-chen

Hi, yes, but it would only work if the rows I want to select are in order. In my case, they're spread out over the dataset.

BERENZ avatar Sep 11 '24 14:09 BERENZ

@BERENZ All right, now I see your point. Filtering the rows of the data file by a general condition is not built into the parser. However, a workaround is to cut the file into partitions of consecutive rows that are small enough to fit in memory and then filter each partition one by one. The entire file is therefore still read into memory at some point, just not all at once.
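A minimal sketch of this workaround, not a tested implementation: the helper names `partitions` and `local_rows` are hypothetical, and `wanted` is assumed to be a sorted vector of 1-based row numbers taken from the text file. The commented usage assumes that `row_offset` counts the rows skipped from the start of the file and that `readstatmeta` reports the total row count as `row_count`; please check both against the package documentation.

```julia
# Split `nrows` rows into consecutive (row_offset, row_limit) partitions.
function partitions(nrows::Integer, chunk_size::Integer)
    chunk_size > 0 || throw(ArgumentError("chunk_size must be positive"))
    return [(offset, min(chunk_size, nrows - offset))
            for offset in 0:chunk_size:(nrows - 1)]
end

# Positions (1-based, relative to the partition) of the wanted rows that
# fall inside the partition starting after `offset` rows with `limit` rows.
# `wanted` must be sorted and hold 1-based row numbers of the whole file.
function local_rows(wanted::AbstractVector{<:Integer}, offset::Integer, limit::Integer)
    lo = searchsortedfirst(wanted, offset + 1)
    hi = searchsortedlast(wanted, offset + limit)
    return [w - offset for w in view(wanted, lo:hi)]
end

# Hypothetical usage (assumes ReadStatTables.jl and DataFrames.jl are installed;
# `path` and the chunk size are placeholders):
#
#   using ReadStatTables, DataFrames
#   nrows = readstatmeta(path).row_count
#   parts = DataFrame[]
#   for (offset, limit) in partitions(nrows, 1_000_000)
#       rows = local_rows(wanted, offset, limit)
#       isempty(rows) && continue          # skip partitions with no wanted rows
#       tb = readstat(path; row_offset = offset, row_limit = limit)
#       push!(parts, DataFrame(tb)[rows, :])
#   end
#   result = reduce(vcat, parts)
```

Partitions that contain none of the wanted rows are skipped without being read, so only the partitions that overlap the selection are loaded into memory.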

junyuan-chen avatar Sep 11 '24 15:09 junyuan-chen

Sure, this is what I currently do (splitting the data into chunks). Do I understand correctly that making this possible would require changes to the underlying ReadStat C library?

BERENZ avatar Sep 12 '24 06:09 BERENZ

Yes. When reading files, the iteration across rows is handled within the C library, and there is no interface for skipping rows depending on their values.

junyuan-chen avatar Sep 12 '24 06:09 junyuan-chen