data
data copied to clipboard
[BE] Unify `buffer_size` across datapipes
The buffer_size
parameter is currently fairly inconsistent across datapipes:
name | default buffer_size |
infinite buffer_size |
warn on infinite |
---|---|---|---|
Demultiplexer | 1e3 | -1 | yes |
Forker | 1e3 | -1 | yes |
Grouper | 1e4 | N/A | N/A |
Shuffler | 1e4 | N/A | N/A |
MaxTokenBucketizer | 1e3 | N/A | N/A |
UnZipper | 1e3 | -1 | yes |
IterKeyZipper | 1e4 | None | no |
Here are my suggestion on how to unify this:
- Use the same default
buffer_size
everywhere. It makes little difference whether we use1e3
or1e4
given that it is tightly coupled with the data we know nothing about. Given today's hardware / datasets, I would go with 1e4, but no strong opinion. - Give every datapipe with buffer the ability for an infinite buffer. Otherwise users will just be annoyed and use a workaround. For example,
torchvision
simply usesINFINITE_BUFFER_SIZE = 1_000_000_000
, which for all intents and purposes lives up to its name. Which sentinel we use, i.e.-1
orNone
, again makes little difference. I personally would useNone
to have a clear separation, but again no strong opinion other than being consistent. - Do not warn on infinite buffer sizes. Especially since infinite buffer is not the default behavior, the user is expected to know what they are doing when setting
buffer_size=None
. I'm all for having a warning like this in the documentation, but I'm strongly against a runtime warning. For example,torchvision
datasets need to use an infinite buffer everywhere. Thus, by using the infinite buffer sentinel, users would always get runtime warnings although neither them nor we did anything wrong.
@NivekT does the :+1: mean you agree with my suggestions and I can send a PR?
@pmeier Sorry I missed this issue before. We are currently exploring a plan to add a BufferSpec
class (configuration) for buffer. We should provide an instance for default value of all DataPipe
. And, the infinite_buffer
should be another special instance of such class. WDYT about this idea
We are currently exploring a plan to add a
BufferSpec
class (configuration) for buffer
Is there a design document / issue for this? If not, what would a an instance of this class contain?
We are currently exploring a plan to add a
BufferSpec
class (configuration) for bufferIs there a design document / issue for this? If not, what would a an instance of this class contain?
Not for now. This is a requirement from internal usage that they want to provide some special buffer instance because they would be able to replace a certain instance of DataPipe in the graph but using the BufferSpec
instance. For OSS side, this class is really straight forward.
class BufferSpec:
buffer_size: int
If users use an integer as an argument, we should automatically change it into an instance of BufferSpec
.
I can't comment on the internal use case, but if the public API still accepts int
's and any sentinel for the infinite buffer, I'm ok with that.
Yeah. And, for this issue, we should definitely unify the buffer_size.
For the infinite
one, I agree None
is better than -1
. WDYT whether or not we should raise Error for negative value. If not raising value, we are kind supporting both sentinels for infinite.
WDYT whether or not we should raise Error for negative value. If not raising value, we are kind supporting both sentinels for infinite.
Let's not have two ways to do the same thing. If we decide on a sentinel, everything else should error if not isinstance(buffer_size, int) and buffer_size > 1
.
Makes sense to me