Experimental bucket-by-sequence-length sampler
@zhangguanheng66 I'm proposing a sampler class with similar functionality as the BucketIterator. Let me know what you think of this. Thanks!
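For context, a minimal sketch of what such a bucket-by-length batch sampler could look like. The class name, signature, and bucket-boundary semantics here are illustrative assumptions, not the PR's actual code:

```python
import random
from typing import Iterator, List, Sequence


class BucketBatchSampler:
    """Hypothetical sketch: group example indices into user-provided
    length buckets, then batch within each bucket so that sequences of
    similar length are padded together (BucketIterator-style)."""

    def __init__(self, lengths: Sequence[int], buckets: Sequence[int],
                 batch_size: int, shuffle: bool = True):
        self.batch_size = batch_size
        self.shuffle = shuffle
        # Assign each example to the first bucket boundary its length fits under.
        self.bucketed = {boundary: [] for boundary in buckets}
        for idx, length in enumerate(lengths):
            for boundary in buckets:
                if length <= boundary:
                    self.bucketed[boundary].append(idx)
                    break

    def __iter__(self) -> Iterator[List[int]]:
        batches = []
        for indices in self.bucketed.values():
            if self.shuffle:
                random.shuffle(indices)
            # Cut each bucket into batches of at most batch_size indices.
            for i in range(0, len(indices), self.batch_size):
                batches.append(indices[i:i + self.batch_size])
        if self.shuffle:
            random.shuffle(batches)  # vary batch order across epochs
        yield from batches
```

With `shuffle=False`, short examples come out first, bucket by bucket; for example, lengths `[3, 10, 4, 9, 2, 8]` with buckets `[5, 12]` and `batch_size=2` yield the index batches `[0, 2]`, `[4]`, `[1, 3]`, `[5]`.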
Codecov Report
Merging #859 into master will increase coverage by 0.09%. The diff coverage is 91.66%.

@@            Coverage Diff             @@
##           master     #859      +/-   ##
==========================================
+ Coverage   76.99%   77.09%   +0.09%
==========================================
  Files          44       45       +1
  Lines        3052     3065      +13
==========================================
+ Hits         2350     2363      +13
  Misses        702      702

| Impacted Files | Coverage Δ | |
|---|---|---|
| torchtext/experimental/utils/samplers.py | 91.17% <91.17%> (ø) | |
| torchtext/experimental/utils/__init__.py | 100.00% <100.00%> (ø) | |
| torchtext/data/batch.py | 70.37% <0.00%> (-1.06%) | :arrow_down: |
| torchtext/data/iterator.py | 62.16% <0.00%> (-0.51%) | :arrow_down: |
| torchtext/data/field.py | 92.69% <0.00%> (-0.15%) | :arrow_down: |
| torchtext/experimental/vocab.py | 100.00% <0.00%> (ø) | |
| torchtext/nn/modules/multiheadattention.py | | |
| torchtext/nn/modules/__init__.py | | |
| torchtext/nn/__init__.py | | |
| ... and 4 more | | |

Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update a93cad5...19f07b8. Read the comment docs.
Thanks @akurniawan. Since there have been several issues related to samplers, I agree it's worth some investigation, and sampling by lengths is an important tool for NLP problems. Could you add a simple example in the test file? Also, regarding the location of the functionality, should it be `torchtext.experimental.utils`?
@zhangguanheng66 I initially put the seqlen_sampler under the dataset folder, as I was following torchvision, but it has now been moved to utils.samplers.
By that you mean with a real dataset? Because I have already put a dummy dataset with DataLoader on the test file
Right. I missed that. Then, we are fine with that.
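For reference, a rough sketch of the kind of dummy-dataset test mentioned above: plugging a list of index batches into `DataLoader` via `batch_sampler` and padding within each batch. The dataset class, collate function, and hand-built batch grouping are all made up for illustration; the PR's actual test may differ:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class DummyTextDataset(Dataset):
    """Hypothetical dataset of variable-length token-id tensors."""

    def __init__(self, lengths):
        self.data = [torch.arange(n) for n in lengths]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


def pad_collate(batch):
    # Pad every sequence in the batch to the longest one in that batch.
    return torch.nn.utils.rnn.pad_sequence(batch, batch_first=True)


dataset = DummyTextDataset([5, 2, 5, 2])
# A real run would pass the proposed sampler here; a plain list of index
# batches stands in for it, grouping similar lengths together by hand.
batch_sampler = [[1, 3], [0, 2]]
loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=pad_collate)
shapes = [tuple(b.shape) for b in loader]  # each batch padded only to its own max length
```

Because similar lengths are batched together, the padded batch shapes come out as `(2, 2)` and `(2, 5)` rather than both being padded to length 5.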
Thanks! Let me know if you have some other thoughts; otherwise this is ready for review.
@zhangguanheng66 are we still interested in continuing this PR?
Since we are going to remove the Field class and associated legacy code in the v0.8.0 release, I'm wondering if we can make this Sampler more generic. What I mean here is a sampler which groups examples with similar lengths together. This is useful for some NLP problems, since it pads similar-length sentences together. How do you feel about the idea?
@zhangguanheng66 Thank you for the idea. If you don't mind, could you elaborate more on "groups examples with similar lengths together"? I would like to understand the difference between your idea and the current implementation in this PR.
They are very similar. One difference is that the current implementation in this PR requires a boundary or "bucket". What happens if users don't provide the buckets?
Got it. So basically you're thinking that, instead of users providing the "bucket", we provide it for them, like what's already implemented in the legacy code?
Kind of. I think this is more useful for text problems. What do you think about it?
Makes sense. I have updated the implementation to use the sorting approach, and it is ready for another review. Please let me know if you have any other concerns or ideas.
If I understand this correctly, this approach will return the smaller sequence lengths first even with shuffle=True. That may not be the ideal behavior?
@nth-attempt It depends on the amount of data. If the dataset is small (<= batch_size * 100), then yes, using shuffle=True will not change anything. However, with a dataset of bigger size (> batch_size * 100), shuffling will work, since currently we sort within pools of that size and not over the whole dataset.
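A rough sketch of the pool-based behavior described in this exchange. The function name, the `pool_factor` parameter, and the exact pooling rule are assumptions for illustration, not the PR's actual code: shuffle indices globally, sort by length only within pools of `batch_size * pool_factor` examples, then shuffle the batch order:

```python
import random
from typing import Iterator, List, Sequence


def pooled_length_batches(lengths: Sequence[int], batch_size: int,
                          shuffle: bool = True,
                          pool_factor: int = 100) -> Iterator[List[int]]:
    """Hypothetical sketch of a sort-within-pool sampler: examples in a
    batch have similar lengths, but with shuffle=True the pool contents
    (and hence which lengths land where) vary each epoch."""
    indices = list(range(len(lengths)))
    if shuffle:
        random.shuffle(indices)  # global shuffle before pooling
    pool_size = batch_size * pool_factor
    batches = []
    for start in range(0, len(indices), pool_size):
        # Sort by length only inside each pool, not across the dataset.
        pool = sorted(indices[start:start + pool_size],
                      key=lambda i: lengths[i])
        for j in range(0, len(pool), batch_size):
            batches.append(pool[j:j + batch_size])
    if shuffle:
        random.shuffle(batches)  # avoid always yielding short batches first
    yield from batches
```

With `shuffle=False` and a small `pool_factor`, each pool is sorted independently: lengths `[7, 1, 5, 3]` with `batch_size=2` and `pool_factor=1` yield the batches `[1, 0]` and `[3, 2]`.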