fairseq2

Dynamic bucketing

Open artemru opened this issue 2 months ago • 0 comments

Is your feature request related to a problem? Please describe: Samples flowing through a data pipeline can differ in characteristics such as length, number of tokens, and so on. Currently, .bucket(...) can only group a fixed number of elements, which is poorly suited to more complex situations. The idea is therefore to offer a new method (open to discussion) like this:

.dynamic_bucket(threshold: float = 1000,
                cost_fn=lambda sample: len(sample["tokens"]),
                nb_min: Optional[int] = 2,
                nb_max: Optional[int] = 20)

which would have the following behavior:

  • cost_fn is a callable that returns a positive float/int for each element of the pipeline
  • consecutive elements are added to a bucket until their total cost exceeds threshold
  • nb_min and nb_max optionally control the minimum and maximum number of elements in a bucket.
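
To make the intended behavior concrete, here is a minimal pure-Python sketch of the three rules above. It is not fairseq2 code. The standalone function form, the handling of the trailing partial bucket (emitted as-is), and the precedence of nb_max over the cost rule are assumptions of this sketch. The element that pushes the total past threshold is kept in the bucket it closes.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional


def dynamic_bucket(
    items: Iterable[Any],
    threshold: float,
    cost_fn: Callable[[Any], float],
    nb_min: Optional[int] = None,
    nb_max: Optional[int] = None,
) -> Iterator[List[Any]]:
    """Group consecutive items until their total cost exceeds `threshold`."""
    bucket: List[Any] = []
    total = 0.0
    for item in items:
        bucket.append(item)
        total += cost_fn(item)
        over_cost = total > threshold
        # Assumption: a bucket below nb_min keeps growing even if it is over cost.
        enough = nb_min is None or len(bucket) >= nb_min
        # Assumption: reaching nb_max always closes the bucket.
        full = nb_max is not None and len(bucket) >= nb_max
        if (over_cost and enough) or full:
            yield bucket
            bucket, total = [], 0.0
    if bucket:
        # Assumption: the trailing partial bucket is emitted rather than dropped.
        yield bucket


samples = [{"tokens": [0] * n} for n in (3, 4, 5, 1, 1, 1, 9)]
buckets = list(
    dynamic_bucket(
        samples,
        threshold=6,
        cost_fn=lambda sample: len(sample["tokens"]),
        nb_min=2,
        nb_max=3,
    )
)
bucket_costs = [[len(s["tokens"]) for s in b] for b in buckets]
```

With these inputs, bucket_costs comes out as [[3, 4], [5, 1, 1], [1, 9]]: the first and last buckets close on cost, and the middle one reaches both limits at once.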

Describe the solution you would like: There should be a native C++ implementation of this dynamic bucket creation.

Describe the alternatives you have considered: I don't know of any simple workaround (except maybe creating very large buckets first and then splitting them into smaller dynamic buckets with yield_from).
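
The two-stage workaround can be emulated in plain Python to see its main limitation. In this sketch, fixed_buckets stands in for the existing fixed-size bucketing step and split_by_cost for the function that would be applied to each large bucket. Both names are hypothetical, and the actual fairseq2 wiring (for example, wrapping the split result in a pipeline for yield_from) is assumed and not shown.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List


def fixed_buckets(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    # Stand-in for a fixed-size bucketing step.
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def split_by_cost(
    big: List[Any], threshold: float, cost_fn: Callable[[Any], float]
) -> Iterator[List[Any]]:
    # Splits one large bucket into smaller cost-bounded buckets.
    bucket: List[Any] = []
    total = 0.0
    for item in big:
        bucket.append(item)
        total += cost_fn(item)
        if total > threshold:
            yield bucket
            bucket, total = [], 0.0
    if bucket:
        yield bucket


def two_stage(
    items: Iterable[Any], size: int, threshold: float, cost_fn: Callable[[Any], float]
) -> Iterator[List[Any]]:
    # Large fixed buckets first, then dynamic splitting inside each one.
    for big in fixed_buckets(items, size):
        yield from split_by_cost(big, threshold, cost_fn)


costs = [4, 4, 4, 4, 4, 4, 4]
result = list(two_stage(costs, size=3, threshold=7, cost_fn=lambda c: c))
```

Here result is [[4, 4], [4], [4, 4], [4], [4]]. Small buckets cannot span the boundary between two large buckets, so each large bucket can leave an undersized remainder. A single-pass implementation would avoid that, which is part of the case for a native method.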

Additional Context: I would like to keep cost_fn very open so that it can fit various situations.

artemru, Apr 15 '24 10:04