fairseq2
Dynamic bucketing
Is your feature request related to a problem? Please describe:
Samples going through a data pipeline can have different characteristics (length, number of tokens, ...).
Currently we can only bucket a fixed number of elements with .bucket(...),
which is not well suited to more complex situations. The idea is therefore to offer a new method (open to discussion) like this:
.dynamic_bucket(threshold: float = 1000,
                cost_fn=lambda sample: len(sample["tokens"]),
                nb_min: Optional[int] = 2,
                nb_max: Optional[int] = 20)
which would have the following behavior:
- cost_fn is a callable that returns a positive float/int for each element of the pipeline
- we bucket consecutive elements until their total cost becomes bigger than threshold
- we also optionally control the min/max number of elements to put into a bucket (nb_min, nb_max).
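To make the intended behavior concrete, here is a minimal pure-Python sketch of the semantics described above. It is not fairseq2 code: the free function `dynamic_bucket` and its `samples` argument are hypothetical, and the handling of edge cases is an assumption on my side (a bucket is emitted once its total cost is strictly bigger than threshold and it holds at least nb_min elements, or as soon as it holds nb_max elements; the trailing partial bucket is kept).

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional


def dynamic_bucket(
    samples: Iterable[Any],
    threshold: float,
    cost_fn: Callable[[Any], float],
    nb_min: Optional[int] = None,
    nb_max: Optional[int] = None,
) -> Iterator[List[Any]]:
    """Group consecutive samples into buckets driven by accumulated cost."""
    bucket: List[Any] = []
    total_cost = 0.0
    for sample in samples:
        bucket.append(sample)
        total_cost += cost_fn(sample)

        over_cost = total_cost > threshold
        has_min = nb_min is None or len(bucket) >= nb_min
        at_max = nb_max is not None and len(bucket) >= nb_max

        # Assumption: nb_min delays emission, nb_max forces it.
        if (over_cost and has_min) or at_max:
            yield bucket
            bucket, total_cost = [], 0.0

    # Assumption: the last, possibly undersized, bucket is not dropped.
    if bucket:
        yield bucket


# Toy samples whose cost is their number of tokens.
samples = [{"tokens": [0] * n} for n in (3, 4, 1, 1, 1, 9, 2)]
buckets = list(
    dynamic_bucket(
        samples,
        threshold=6,
        cost_fn=lambda s: len(s["tokens"]),
        nb_min=2,
        nb_max=3,
    )
)
sizes = [[len(s["tokens"]) for s in b] for b in buckets]
print(sizes)  # [[3, 4], [1, 1, 1], [9, 2]]
```

In this toy run the first bucket closes on cost, the second on nb_max, and the sample of cost 9 waits for a second element because of nb_min.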
Describe the solution you would like: There should be a native C++ implementation of this dynamic bucket creation.
Describe the alternatives you have considered:
I don't know of any simple workaround, except perhaps creating very large buckets first and then splitting them into smaller dynamic buckets with yield_from.
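A rough plain-Python emulation of that two-stage workaround is sketched below, to show why it is only an approximation. The helpers `fixed_bucket`, `split_by_cost` and `two_stage` are hypothetical stand-ins for a fixed-size .bucket(...) step followed by a yield_from step; they are not fairseq2 calls, and min/max bucket sizes are left out for brevity.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List


def fixed_bucket(samples: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Stage 1: fixed-size buckets, as a stand-in for .bucket(size)."""
    it = iter(samples)
    while chunk := list(islice(it, size)):
        yield chunk


def split_by_cost(
    chunk: List[Any], threshold: float, cost_fn: Callable[[Any], float]
) -> Iterator[List[Any]]:
    """Stage 2: re-split one large bucket by accumulated cost."""
    bucket: List[Any] = []
    total = 0.0
    for sample in chunk:
        bucket.append(sample)
        total += cost_fn(sample)
        if total > threshold:
            yield bucket
            bucket, total = [], 0.0
    if bucket:
        # Leftover at the end of the large bucket: flushed early.
        yield bucket


def two_stage(
    samples: Iterable[Any],
    large_size: int,
    threshold: float,
    cost_fn: Callable[[Any], float],
) -> Iterator[List[Any]]:
    for chunk in fixed_bucket(samples, large_size):
        yield from split_by_cost(chunk, threshold, cost_fn)


costs = [4, 4, 4, 4, 4, 4]  # each sample is its own cost here
small_stage = list(two_stage(costs, large_size=3, threshold=7, cost_fn=lambda c: c))
one_pass = list(two_stage(costs, large_size=6, threshold=7, cost_fn=lambda c: c))
print(small_stage)  # [[4, 4], [4], [4, 4], [4]]
print(one_pass)     # [[4, 4], [4, 4], [4, 4]]
```

Dynamic buckets cannot span the boundary between two large buckets, so undersized leftovers appear at each boundary unless the first-stage buckets are very large, which in turn costs memory and latency.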
Additional Context:
I would like to keep the cost_fn function very open so that it can fit various situations.