
Experimental bucket by sequence length sampler

Open akurniawan opened this issue 5 years ago • 15 comments

@zhangguanheng66 I'm proposing a sampler class with similar functionality to the BucketIterator. Let me know what you think of this. Thanks!

akurniawan avatar Jun 29 '20 12:06 akurniawan
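For readers following the discussion: a bucket-by-sequence-length sampler in the spirit of the legacy BucketIterator can be sketched roughly as below. This is a minimal illustration only, not the PR's actual code; the function name `bucket_by_length` and its parameters are hypothetical.

```python
import random

def bucket_by_length(lengths, bucket_boundaries, batch_size, shuffle=True):
    """Group example indices into length buckets, then yield batches per bucket.

    `lengths` holds one sequence length per example; `bucket_boundaries` are
    the upper length bounds of each bucket. Illustrative sketch, not the PR's API.
    """
    # One bucket per boundary, plus one for examples longer than every boundary.
    buckets = [[] for _ in range(len(bucket_boundaries) + 1)]
    for idx, length in enumerate(lengths):
        for b, bound in enumerate(bucket_boundaries):
            if length <= bound:
                buckets[b].append(idx)
                break
        else:
            buckets[-1].append(idx)  # longer than every boundary
    if shuffle:
        for bucket in buckets:
            random.shuffle(bucket)  # shuffle examples within each bucket
    batches = [
        bucket[i:i + batch_size]
        for bucket in buckets
        for i in range(0, len(bucket), batch_size)
    ]
    if shuffle:
        random.shuffle(batches)  # randomize batch order across buckets
    return batches
```

Each batch contains indices from a single bucket, so the padding needed within a batch stays small.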

Codecov Report

Merging #859 into master will increase coverage by 0.09%. The diff coverage is 91.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #859      +/-   ##
==========================================
+ Coverage   76.99%   77.09%   +0.09%     
==========================================
  Files          44       45       +1     
  Lines        3052     3065      +13     
==========================================
+ Hits         2350     2363      +13     
  Misses        702      702              
Impacted Files Coverage Δ
torchtext/experimental/utils/samplers.py 91.17% <91.17%> (ø)
torchtext/experimental/utils/__init__.py 100.00% <100.00%> (ø)
torchtext/data/batch.py 70.37% <0.00%> (-1.06%) :arrow_down:
torchtext/data/iterator.py 62.16% <0.00%> (-0.51%) :arrow_down:
torchtext/data/field.py 92.69% <0.00%> (-0.15%) :arrow_down:
torchtext/experimental/vocab.py 100.00% <0.00%> (ø)
torchtext/nn/modules/multiheadattention.py
torchtext/nn/modules/__init__.py
torchtext/nn/__init__.py
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update a93cad5...19f07b8. Read the comment docs.

codecov[bot] avatar Jun 29 '20 12:06 codecov[bot]

> @zhangguanheng66 I'm proposing a sampler class with similar functionality to the BucketIterator. Let me know what you think of this. Thanks!

Thanks @akurniawan. Since there have been several issues related to samplers, I agree it's worth some investigation, and sampling by length is an important tool for NLP problems. Could you add a simple example in the test file? Also, for the location of the func, should it be `torchtext.experimental.utils`?

zhangguanheng66 avatar Jun 29 '20 14:06 zhangguanheng66

@zhangguanheng66 I initially put the seqlen_sampler under the dataset folder, following torchvision, but it has now been moved to `utils.samplers`.

akurniawan avatar Jun 29 '20 23:06 akurniawan

By that, do you mean with a real dataset? Because I have already put a dummy dataset with DataLoader in the test file.

akurniawan avatar Jul 01 '20 14:07 akurniawan

> By that, do you mean with a real dataset? Because I have already put a dummy dataset with DataLoader in the test file.

Right, I missed that. Then we are fine with that.

zhangguanheng66 avatar Jul 01 '20 15:07 zhangguanheng66

Thanks! Let me know if you have any other thoughts; otherwise this is ready for review.

akurniawan avatar Jul 02 '20 06:07 akurniawan

@zhangguanheng66 are we still interested in continuing this PR?

akurniawan avatar Jul 29 '20 10:07 akurniawan

> @zhangguanheng66 are we still interested in continuing this PR?

Since we are going to remove the Field class and associated legacy code in the v0.8.0 release, I'm wondering if we can make this Sampler more generic. What I mean here is a sampler that groups examples with similar lengths together. This is useful for some NLP problems, since it pads similar-length sentences together. How do you feel about the idea?

zhangguanheng66 avatar Jul 29 '20 14:07 zhangguanheng66

@zhangguanheng66 Thank you for the idea. If you don't mind, could you elaborate more on grouping the examples with similar lengths together? I would like to understand the difference between your idea and the current implementation in this PR.

akurniawan avatar Jul 29 '20 14:07 akurniawan

> @zhangguanheng66 Thank you for the idea. If you don't mind, could you elaborate more on grouping the examples with similar lengths together? I would like to understand the difference between your idea and the current implementation in this PR.

They are very similar. One difference is that the current implementation in this PR requires a boundary or "bucket". What happens if users don't provide the buckets?

zhangguanheng66 avatar Jul 29 '20 15:07 zhangguanheng66

> They are very similar. One difference is that the current implementation in this PR requires a boundary or "bucket". What happens if users don't provide the buckets?

Got it. So basically you're thinking that, instead of users providing the "bucket", we provide it for them, like the one already implemented in the legacy code?

akurniawan avatar Jul 30 '20 00:07 akurniawan

> They are very similar. One difference is that the current implementation in this PR requires a boundary or "bucket". What happens if users don't provide the buckets?

> Got it. So basically you're thinking that, instead of users providing the "bucket", we provide it for them, like the one already implemented in the legacy code?

Kind of. I think this is more useful for text problems. What do you think about it?

zhangguanheng66 avatar Jul 30 '20 14:07 zhangguanheng66

> They are very similar. One difference is that the current implementation in this PR requires a boundary or "bucket". What happens if users don't provide the buckets?

> Got it. So basically you're thinking that, instead of users providing the "bucket", we provide it for them, like the one already implemented in the legacy code?

Kind of. I think this is more useful for text problems. What do you think about it?

Makes sense. I have updated the implementation to use the sorting approach, and it's ready for another review. Please let me know if you have any other concerns or ideas.

akurniawan avatar Jul 31 '20 02:07 akurniawan
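The bucket-free sorting approach discussed above can be sketched as follows: sort example indices by length, chunk them into batches, and (optionally) shuffle the batch order. This is a minimal illustration under those assumptions; `sorted_batches` and its parameters are hypothetical names, not the PR's actual API.

```python
import random

def sorted_batches(lengths, batch_size, shuffle=True):
    """Sort example indices by length and chunk into length-homogeneous batches.

    No bucket boundaries are needed: sorting alone groups similar lengths.
    Illustrative sketch, not the PR's actual implementation.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    if shuffle:
        random.shuffle(batches)  # randomize batch order; batches stay homogeneous
    return batches
```

Compared with the bucket version, users no longer supply boundaries, which is the generalization @zhangguanheng66 asked about.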

If I understand this correctly, this approach will return the smaller sequence lengths first even with shuffle=True. That may not be the ideal behavior?

nth-attempt avatar Aug 03 '20 14:08 nth-attempt

@nth-attempt It depends on the amount of data. If the dataset is small (<= batch_size * 100), then yes, using shuffle=True won't change anything. However, with a bigger dataset (> batch_size * 100), shuffling will work, since currently we sort within each pool of batch_size * 100 examples and not across the whole dataset.

akurniawan avatar Aug 05 '20 09:08 akurniawan
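The pooling behavior described above can be sketched like this: shuffle the indices globally, split them into pools of `batch_size * pool_factor`, and sort only within each pool. With a dataset larger than one pool, the global shuffle changes which examples share a pool, so shorter sequences are no longer always emitted first. The names here are illustrative assumptions, not the PR's actual API.

```python
import random

def pooled_sorted_batches(lengths, batch_size, pool_factor=100, shuffle=True):
    """Shuffle globally, then sort within each pool of batch_size * pool_factor.

    Small datasets fit in one pool, so shuffling has no visible effect there;
    larger datasets get a different pool composition each epoch.
    Illustrative sketch, not the PR's actual implementation.
    """
    indices = list(range(len(lengths)))
    if shuffle:
        random.shuffle(indices)  # decides which examples end up in which pool
    pool_size = batch_size * pool_factor
    batches = []
    for start in range(0, len(indices), pool_size):
        # Sorting is local to the pool, not the whole dataset.
        pool = sorted(indices[start:start + pool_size], key=lambda i: lengths[i])
        batches.extend(pool[i:i + batch_size] for i in range(0, len(pool), batch_size))
    return batches
```

This mirrors the pool trick used by the legacy `torchtext.data.BucketIterator`, which also sorts inside a window of shuffled examples rather than over the full dataset.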