Scan and Garbage collection

Open dtcray opened this issue 8 years ago • 1 comments

Would it be difficult to modify the GC code to be more parallel? The workflow would be (I might be missing some details):

  • Define new flavor of worker or stage that manages GC (STAGE_GC)

  • (stage gc) push non-dir entries to SOFT_RM as needed and remove them from other tables

  • (stage gc) push dir entries to the gc temp table

  • (gc thread) at end of scan:

  1. (gc thread) create gc temp table
  2. get list of "old" file ids (serial)
  3. (gc thread) push entries in stage queue
  4. (gc thread) wait for STAGE_GC to be empty
  5. get entries from gc temp table, push to SOFT_RM as needed and remove from other tables
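The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not robinhood code: the tables are replaced by in-memory Python structures, `run_gc`, `stage_gc`, `entries` and `seen_in_scan` are hypothetical names, and directories are always pushed to SOFT_RM in step 5, whereas the real decision ("as needed") would depend on configuration.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def run_gc(entries, seen_in_scan, n_workers=4):
    """Hypothetical sketch of the proposed GC workflow.

    entries: dict mapping entry id -> {"is_dir": bool}; stands in for the DB tables.
    seen_in_scan: set of ids found by the scan; everything else is "old".
    Returns the list of ids pushed to the SOFT_RM stand-in.
    """
    lock = threading.Lock()
    soft_rm = []   # stand-in for the SOFT_RM table
    gc_temp = []   # step 1: stand-in for the gc temp table

    def stage_gc(entry_id):
        # STAGE_GC worker: non-dirs are handled right away, dirs are deferred.
        with lock:
            if entries[entry_id]["is_dir"]:
                gc_temp.append(entry_id)
            else:
                soft_rm.append(entry_id)
                del entries[entry_id]

    # Step 2: get the list of "old" ids (serial).
    old_ids = [i for i in entries if i not in seen_in_scan]

    # Steps 3 and 4: push the entries to the workers and wait for them to finish.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(stage_gc, old_ids))

    # Step 5: process the deferred directories serially.
    for entry_id in gc_temp:
        soft_rm.append(entry_id)  # assumption: always needed in this sketch
        del entries[entry_id]

    return soft_rm
```

With this layout, directories are only removed once all the non-directory entries have gone through the parallel stage.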

dtcray, Jul 24 '17 16:07

Define new flavor of worker or stage that manages GC (STAGE_GC)

It is easy to define a new pipeline stage, but probably a simple queue with a configurable number of workers would be enough to manage this case.
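As an illustration of what such a queue could look like, here is a minimal sketch using Python's standard `queue` and `threading` modules. It is not robinhood code; `process_with_workers` and `handle` are hypothetical names, and `handle` stands for whatever per-entry GC work would be done.

```python
import queue
import threading


def process_with_workers(items, handle, n_workers=4):
    """Run handle(item) for each item using n_workers threads.

    Returns the list of exceptions raised by handle (empty if all went well).
    """
    q = queue.Queue()
    errors = []

    def worker():
        while True:
            item = q.get()
            try:
                if item is None:      # sentinel: no more work for this worker
                    return
                handle(item)
            except Exception as exc:  # keep the worker alive, report later
                errors.append(exc)
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()

    for item in items:
        q.put(item)

    q.join()                # wait until the queue has been fully processed

    for _ in threads:       # one sentinel per worker to stop it
        q.put(None)
    for t in threads:
        t.join()

    return errors
```

The `q.join()` call plays the role of "wait for the queue to be empty", and `n_workers` is the configurable part.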

(stage gc) push non-dir entries to SOFT_RM as needed and remove them from other tables

(stage gc) push dir entries to the gc temp table

Does that mean there would be 2 GC requests at the end of the scan, one to select dirs and one to select non-dirs? Or are they split by the GC thread you mention below?

(gc thread) at end of scan:

The previous tasks are also run at the end of the scan, aren't they?

  1. (gc thread) create gc temp table
  2. get list of "old" file ids (serial)
  3. (gc thread) push entries in stage queue
  4. (gc thread) wait for STAGE_GC to be empty
  5. get entries from gc temp table, push to SOFT_RM as needed and remove from other tables

So to summarize, if I understand correctly, this would parallelize the steps of inserting entries into SOFT_RM and dropping them from the other tables. In your experience, is that the longest operation? I guess creating the temp table is also a long step...

tl-cea, Sep 05 '17 14:09