
Reading subsets of fst-stored data.tables

statquant opened this issue on Mar 30 '17 · 3 comments

Something I would find incredibly useful is the ability to run select-like queries when reading from fst. Given that data.tables have keys, I was thinking that either this data.table feature could be leveraged, or, by reusing some of its code, we might get this feature.

Just to be clear: say we have a data.table with columns date, id, col1, col2, col3 saved as an fst file. I'd like to be able to do something like read.fst(path = myPath, columns = myColumns, select = "date == '2017-01-01' & id %like% 'fst*'")

I realize that this almost makes fst a database, and I don't know whether it's doable, but that's my 2 cents. You might ask what this brings over loading the whole file and sub-selecting: for people like me who work remotely over a network connection, avoiding the full read could make a real difference.
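For illustration, this is roughly what I do today (the path and column names are made up). Reading only the columns I need already helps, since fst is columnar, but the row filter still pulls every row of those columns over the network:

library(data.table)
library(fst)

# Made-up path and column names, for illustration only
myPath    <- "my_table.fst"
myColumns <- c("date", "id", "col1")

# Read only the needed columns, then filter locally; the row filter
# still transfers every row of those columns over the network
dt <- setDT(read.fst(myPath, columns = myColumns))
res <- dt[date == as.Date("2017-01-01") & id %like% "^fst"]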

Regards and thanks

statquant avatar Mar 30 '17 11:03 statquant

Hi @statquant, thanks for the feature request! Your request is related to issues #16 and #30. As you say, for sorted tables we can implement a binary search to retrieve a range of rows for a specified key range. A binary search is very fast: with only about 30 seek operations on the fst file you can locate a key among a billion records (log2(1e9) ≈ 30); a sketch of such a lookup follows later in this comment. For selections that are not based on a stored key, we could use the selection mechanism from data.table, but on chunks of data instead of the whole table. The problem, however, is that you can't use aggregate statements for selection in that case, for example:

library(data.table)

dt <- data.table(X = 1:10, Y = 10:1)
dt[X < mean(Y)]  # mean(Y) is 5.5, so rows where X < 5.5 are selected

   X  Y
1: 1 10
2: 2  9
3: 3  8
4: 4  7
5: 5  6

This works for a complete table, but it won't work when the data is chunked into multiple subsets: the mean of each chunk differs from the mean of the whole column, so the selection comes out wrong. Possible solutions might be:

  1. A selection requires the specification of a grouping variable, so the selection is done per group. If the groups are small enough, there will be no problems for large data sets.
  2. No aggregate selections are allowed, only simple operators. The advantage of this solution is that we can implement these simple operators in C++, increasing performance (a rough prototype follows at the end of this comment).
  3. A more elaborate framework where we allow custom methods as operators on the data. These methods should have a map-reduce-like character. For the mean example above:
library(data.table)

# Two chunks (random samples, so exact results vary from run to run)
dt1 <- data.table(X = sample(1:20, 10), Y = sample(1:20, 10))
dt2 <- data.table(X = sample(1:20, 10), Y = sample(1:20, 10))

# Calculate per-chunk sums and counts
r1 <- dt1[, .(Sum = sum(Y), Count = .N)]
r2 <- dt2[, .(Sum = sum(Y), Count = .N)]

# Combine results and calculate the overall mean
rTot <- rbindlist(list(r1, r2))
rTot[, sum(Sum) / sum(Count)]

[1] 10.35

So we calculated the overall mean from per-chunk sums and counts. The fst package could provide methods like fst.sum, fst.mean, etc. to perform these operations.
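And to illustrate the sorted-key case from the top of this comment, here is a minimal sketch of a binary search over a key column. It assumes read.fst's from/to row-range arguments for random row access; the row count n is passed in explicitly to avoid assuming a particular metadata accessor:

library(fst)

# Binary search for the first row with key >= value, assuming the file is
# sorted on column 'key' and has 'n' rows. Each step reads a single row,
# so ~30 reads locate a key among a billion rows (log2(1e9) ~ 30).
lower_bound_fst <- function(path, key, value, n) {
  lo <- 1L; hi <- n + 1L
  while (lo < hi) {
    mid <- lo + (hi - lo) %/% 2L
    k <- read.fst(path, columns = key, from = mid, to = mid)[[1]]
    if (k < value) lo <- mid + 1L else hi <- mid
  }
  lo  # first row with key >= value (n + 1 if no such row)
}

# A key range [a, b] could then be fetched with a single ranged read:
# from <- lower_bound_fst(path, "date", a, n)
# to   <- lower_bound_fst(path, "date", b + 1, n) - 1
# read.fst(path, from = from, to = to)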

For your use-case I think that option 2 is probably enough?
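To make option 2 concrete, a rough R-level prototype of a chunked filter. read_fst_filtered is a hypothetical name, and again the from/to arguments of read.fst are assumed; a real implementation would evaluate the simple operators in C++ instead of in R:

library(data.table)
library(fst)

# Hypothetical helper: apply a simple (non-aggregate) filter chunk by
# chunk, so only matching rows are kept in memory. 'columns' must include
# every column used in 'expr'.
read_fst_filtered <- function(path, expr, n, columns = NULL,
                              chunk_size = 1e6L) {
  starts <- seq(1L, n, by = chunk_size)
  chunks <- lapply(starts, function(from) {
    to <- min(from + chunk_size - 1L, n)
    dt <- setDT(read.fst(path, columns = columns, from = from, to = to))
    dt[eval(expr)]  # the filter sees one chunk at a time, so no aggregates
  })
  rbindlist(chunks)
}

# Usage with a quoted filter expression:
# read_fst_filtered(myPath, quote(date == as.Date("2017-01-01")), n = 1e9)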

MarcusKlik avatar Mar 30 '17 12:03 MarcusKlik

@MarcusKlik thanks for the prompt reply; indeed, option 2 is enough for me.

Honestly, I think it would be enough for most people too: when you want to aggregate, I'd guess you usually want the whole data anyway, so you can check and revise what you've done.

statquant avatar Mar 30 '17 12:03 statquant

Nice, I will make sure that your feature is on the list for one of the next versions of fst.

MarcusKlik avatar Mar 30 '17 13:03 MarcusKlik