fluo
Need to prevent GC for lifetime of M/R job
Work done on #8 made the Fluo GC iterator collect based on which transactions are currently active. When using the Fluo input format, all mappers use the same timestamp for reading data and therefore all read from the same snapshot. Currently nothing is done to ensure that data for this snapshot is not GCed while the job runs.
#33 is a possible solution to this
AFAIK M/R has no mechanism for the input format to clean up after the entire job is completed. The solution for this issue should avoid an issue like ACCUMULO-829, where killing the process that started the M/R job borks the entire job.
That's why I think #33 may be a good solution. A user could do something like the following.
- create named snapshot
- run job against named snapshot
- delete named snapshot
It should be easy for a user to do the above steps in a single process, but we should not require that it be done in a single process. I think #33 will be fairly easy to implement now that #8 is done.
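To make the create / run / delete steps concrete, here is a minimal sketch using a purely in-memory stand-in for a named-snapshot service. Everything in it (`NamedSnapshotWorkflow`, `runJobAgainst`, `oldestPinnedTimestamp`) is hypothetical and is not Fluo's API; it only illustrates that a named snapshot pins a timestamp for GC and that a single-process caller would release it in a `finally` block.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongPredicate;

// Hypothetical in-memory stand-in for a named-snapshot service; not Fluo's API.
class NamedSnapshotWorkflow {
    // snapshot name -> timestamp that the snapshot pins
    private final Map<String, Long> snapshots = new ConcurrentHashMap<>();

    void create(String name, long timestamp) {
        if (snapshots.putIfAbsent(name, timestamp) != null) {
            throw new IllegalArgumentException("snapshot already exists: " + name);
        }
    }

    void delete(String name) {
        snapshots.remove(name);
    }

    boolean exists(String name) {
        return snapshots.containsKey(name);
    }

    // Oldest timestamp a GC pass would have to retain;
    // Long.MAX_VALUE means no named snapshot is pinning anything.
    long oldestPinnedTimestamp() {
        return snapshots.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(Long.MAX_VALUE);
    }

    // Single-process variant of the three steps: create, run, delete.
    // The job is modeled as a predicate over the snapshot timestamp.
    boolean runJobAgainst(String name, long timestamp, LongPredicate job) {
        create(name, timestamp);
        try {
            return job.test(timestamp);
        } finally {
            // Runs whether the job succeeds or throws, but not if this process is killed.
            delete(name);
        }
    }
}
```

The `finally` block only covers the case where the launching process survives, which is why the three steps should also be usable independently from separate processes.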
It sounds like the easiest-to-use approach might be something in the middle.
- Automatically create a snapshot (named or otherwise) in InputFormat.getSplits and make whatever advertisement is necessary to ensure that snapshot isn't deleted. You don't try to solve any automatic cleanup.
- Provide some sort of API that users can call to instrument their Tool to create a snapshot, pass that snapshot into the Configuration for the InputFormat to use (instead of making a new snapshot), and then let the Tool clean up after the job finishes (assuming that the process that's running the Tool is still alive).
Long term, you can then try to think about something more fancy that can ensure snapshots aren't leaked.
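A rough sketch of how the two options could coexist, with `java.util.Properties` standing in for the Hadoop Configuration so the example stays self-contained. The key name `fluo.input.snapshot.name`, the `auto-` prefix, and all method names are assumptions made for illustration only.

```java
import java.util.Properties;
import java.util.UUID;

// Hypothetical sketch; Properties stands in for the Hadoop Configuration.
class SnapshotConfigSketch {
    // Assumed configuration key, not an existing Fluo property.
    static final String SNAPSHOT_KEY = "fluo.input.snapshot.name";

    // Option 2: the Tool creates a snapshot itself and hands its name to the input format.
    static void setSnapshot(Properties conf, String name) {
        conf.setProperty(SNAPSHOT_KEY, name);
    }

    // What getSplits might do: reuse the Tool's snapshot if one was configured,
    // otherwise (option 1) create one automatically and record it in the config.
    static String resolveSnapshot(Properties conf) {
        String name = conf.getProperty(SNAPSHOT_KEY);
        if (name == null) {
            name = "auto-" + UUID.randomUUID();
            conf.setProperty(SNAPSHOT_KEY, name);
        }
        return name;
    }

    // Auto-created snapshots are the ones nobody is guaranteed to clean up.
    static boolean isAutoCreated(String name) {
        return name.startsWith("auto-");
    }
}
```

With this shape, a Tool that calls `setSnapshot` owns the cleanup, while a job that falls through to the automatic path leaves behind a snapshot that something else has to remove later.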
@joshelser I was thinking of only providing option 2 that you described. The user would be required to pass a named snapshot to FluoInputFormat. For Option 1 it seems like it would leave a snapshot around forever that would prevent GC?
Sorry, I thought you meant option 1 as the default where they could come back "later" and clean it up -- e.g. some CLI tool. I guess if the Tool dies, it would be nice to provide some way that doesn't force the user to write some 5 line java class to clean up the snapshot and let GC happen again.
> I guess if the Tool dies, it would be nice to provide some way that doesn't force the user to write some 5 line java class to clean up the snapshot and let GC happen again
Right, that would be nice. If we don't do something, then the burden will be placed on every user to find a solution. Maybe named snapshots could optionally be created with a TTL? The user's tool could do the following.
// not sure about the API or where it would live; just picked fluoClient
// maybe it should be admin?
NamedSnapshotOptions nso = new NamedSnapshotOptions();
nso.generateUniqueName(); //or could give it a name
nso.setTTL(1, TimeUnit.DAYS);
String namedSnap = fluoClient.createNamedSnap(nso);
// configure FluoInputFormat to read from namedSnap
// submit job & wait
// if the process that started the M/R job dies, then something will delete the named snapshot after 1 day
fluoClient.deleteNamedSnapshot(namedSnap);
If named snapshots can expire, then StaleScanException becomes something that can happen in normal use, and that exception should be moved to the public API.
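To illustrate how expiry and the stale-scan failure could interact, here is a small in-memory sketch. The registry class, its method names, and the explicit `nowMillis` parameter are all hypothetical, and the nested `StaleScanException` is a local stand-in that only borrows the name, not Fluo's actual class.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical in-memory model of named snapshots with a TTL; not Fluo's API.
class TtlSnapshotSketch {
    // Local stand-in that only borrows the name of Fluo's exception.
    static class StaleScanException extends RuntimeException {
        StaleScanException(String message) {
            super(message);
        }
    }

    private static final class Entry {
        final long timestamp;
        final long expiresAtMillis;

        Entry(long timestamp, long expiresAtMillis) {
            this.timestamp = timestamp;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> snapshots = new HashMap<>();

    // The clock is passed in explicitly so the behavior is deterministic.
    void create(String name, long timestamp, long ttl, TimeUnit unit, long nowMillis) {
        snapshots.put(name, new Entry(timestamp, nowMillis + unit.toMillis(ttl)));
    }

    // Normal path: the Tool deletes its snapshot when the job finishes.
    void delete(String name) {
        snapshots.remove(name);
    }

    // Fallback path: a periodic sweep drops snapshots whose TTL has passed.
    // Returns how many snapshots were removed.
    int expire(long nowMillis) {
        int before = snapshots.size();
        snapshots.values().removeIf(e -> e.expiresAtMillis <= nowMillis);
        return before - snapshots.size();
    }

    // A reader resolving the snapshot fails once it is deleted or expired.
    long timestampFor(String name, long nowMillis) {
        Entry e = snapshots.get(name);
        if (e == null || e.expiresAtMillis <= nowMillis) {
            throw new StaleScanException("snapshot expired or missing: " + name);
        }
        return e.timestamp;
    }
}
```

A job that outlives its TTL would hit the exception on its next read, so the TTL would need to be chosen comfortably longer than the expected job runtime.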