
Experimental bucket by sequence length sampler

Open akurniawan opened this issue 5 years ago • 15 comments

@zhangguanheng66 I'm proposing a sampler class with similar functionality to the BucketIterator. Let me know what you think of this. Thanks!

akurniawan avatar Jun 29 '20 12:06 akurniawan
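For readers following the discussion: a bucket-by-sequence-length sampler in the spirit of the legacy BucketIterator can be sketched roughly as below. This is a minimal illustration only, not the PR's actual code; the function name `bucket_by_length` and its parameters are hypothetical.

```python
import random

def bucket_by_length(lengths, bucket_boundaries, batch_size, shuffle=True):
    """Group example indices into length buckets, then yield batches per bucket.

    `lengths` holds one sequence length per example; `bucket_boundaries` are
    the upper length bounds of each bucket. Illustrative sketch, not the PR's API.
    """
    # One bucket per boundary, plus one for examples longer than every boundary.
    buckets = [[] for _ in range(len(bucket_boundaries) + 1)]
    for idx, length in enumerate(lengths):
        for b, bound in enumerate(bucket_boundaries):
            if length <= bound:
                buckets[b].append(idx)
                break
        else:
            buckets[-1].append(idx)  # longer than every boundary
    if shuffle:
        for bucket in buckets:
            random.shuffle(bucket)  # shuffle examples within each bucket
    batches = [
        bucket[i:i + batch_size]
        for bucket in buckets
        for i in range(0, len(bucket), batch_size)
    ]
    if shuffle:
        random.shuffle(batches)  # randomize batch order across buckets
    return batches
```

Each batch contains indices from a single bucket, so the padding needed within a batch stays small.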

Codecov Report

Merging #859 into master will increase coverage by 0.09%. The diff coverage is 91.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #859      +/-   ##
==========================================
+ Coverage   76.99%   77.09%   +0.09%     
==========================================
  Files          44       45       +1     
  Lines        3052     3065      +13     
==========================================
+ Hits         2350     2363      +13     
  Misses        702      702              
Impacted Files Coverage Δ
torchtext/experimental/utils/samplers.py 91.17% <91.17%> (ø)
torchtext/experimental/utils/__init__.py 100.00% <100.00%> (ø)
torchtext/data/batch.py 70.37% <0.00%> (-1.06%) :arrow_down:
torchtext/data/iterator.py 62.16% <0.00%> (-0.51%) :arrow_down:
torchtext/data/field.py 92.69% <0.00%> (-0.15%) :arrow_down:
torchtext/experimental/vocab.py 100.00% <0.00%> (ø)
torchtext/nn/modules/multiheadattention.py
torchtext/nn/modules/__init__.py
torchtext/nn/__init__.py
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update a93cad5...19f07b8. Read the comment docs.

codecov[bot] avatar Jun 29 '20 12:06 codecov[bot]

> @zhangguanheng66 I'm proposing a sampler class with similar functionality to the BucketIterator. Let me know what you think of this. Thanks!

Thanks @akurniawan. Since there have been several issues related to samplers, I agree it's worth some investigation, and sampling by length is an important tool for NLP problems. Could you add a simple example in the test file? Also, for the location of the func, should it be `torchtext.experimental.utils`?

zhangguanheng66 avatar Jun 29 '20 14:06 zhangguanheng66

@zhangguanheng66 I initially put the seqlen_sampler under the dataset folder, following torchvision, but it has now been moved to `utils.samplers`.

akurniawan avatar Jun 29 '20 23:06 akurniawan

By that, do you mean with a real dataset? Because I have already put a dummy dataset with DataLoader in the test file.

akurniawan avatar Jul 01 '20 14:07 akurniawan

> By that, do you mean with a real dataset? Because I have already put a dummy dataset with DataLoader in the test file.

Right, I missed that. Then we are fine with that.

zhangguanheng66 avatar Jul 01 '20 15:07 zhangguanheng66

Thanks! Let me know if you have any other thoughts; otherwise this is ready for review.

akurniawan avatar Jul 02 '20 06:07 akurniawan

@zhangguanheng66 are we still interested in continuing this PR?

akurniawan avatar Jul 29 '20 10:07 akurniawan

> @zhangguanheng66 are we still interested in continuing this PR?

Since we are going to remove the Field class and associated legacy code in the v0.8.0 release, I'm wondering if we can make this Sampler more generic. What I mean here is a sampler that groups examples with similar lengths together. This is useful for some NLP problems, since it pads similar-length sentences together. How do you feel about the idea?

zhangguanheng66 avatar Jul 29 '20 14:07 zhangguanheng66

@zhangguanheng66 Thank you for the idea. If you don't mind, could you elaborate more on grouping the examples with similar lengths together? I would like to understand the difference between your idea and the current implementation in this PR.

akurniawan avatar Jul 29 '20 14:07 akurniawan

> @zhangguanheng66 Thank you for the idea. If you don't mind, could you elaborate more on grouping the examples with similar lengths together? I would like to understand the difference between your idea and the current implementation in this PR.

They are very similar. One difference is that the current implementation in this PR requires a boundary or "bucket". What happens if users don't provide the buckets?

zhangguanheng66 avatar Jul 29 '20 15:07 zhangguanheng66

> They are very similar. One difference is that the current implementation in this PR requires a boundary or "bucket". What happens if users don't provide the buckets?

Got it. So basically you're thinking that, instead of users providing the "bucket", we provide it for them, like the one already implemented in the legacy code?

akurniawan avatar Jul 30 '20 00:07 akurniawan

> They are very similar. One difference is that the current implementation in this PR requires a boundary or "bucket". What happens if users don't provide the buckets?

> Got it. So basically you're thinking that, instead of users providing the "bucket", we provide it for them, like the one already implemented in the legacy code?

Kind of. I think this is more useful for text problems. What do you think about it?

zhangguanheng66 avatar Jul 30 '20 14:07 zhangguanheng66

> They are very similar. One difference is that the current implementation in this PR requires a boundary or "bucket". What happens if users don't provide the buckets?

> Got it. So basically you're thinking that, instead of users providing the "bucket", we provide it for them, like the one already implemented in the legacy code?

Kind of. I think this is more useful for text problems. What do you think about it?

Makes sense. I have updated the implementation to use the sorting approach, and it's ready for another review. Please let me know if you have any other concerns or ideas.

akurniawan avatar Jul 31 '20 02:07 akurniawan
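The bucket-free sorting approach discussed above can be sketched as follows: sort example indices by length, chunk them into batches, and (optionally) shuffle the batch order. This is a minimal illustration under those assumptions; `sorted_batches` and its parameters are hypothetical names, not the PR's actual API.

```python
import random

def sorted_batches(lengths, batch_size, shuffle=True):
    """Sort example indices by length and chunk into length-homogeneous batches.

    No bucket boundaries are needed: sorting alone groups similar lengths.
    Illustrative sketch, not the PR's actual implementation.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    if shuffle:
        random.shuffle(batches)  # randomize batch order; batches stay homogeneous
    return batches
```

Compared with the bucket version, users no longer supply boundaries, which is the generalization @zhangguanheng66 asked about.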

If I understand this correctly, this approach will return the smaller sequence lengths first even with shuffle=True. That may not be the ideal behavior?

nth-attempt avatar Aug 03 '20 14:08 nth-attempt

@nth-attempt It depends on the amount of data. If the dataset is small (<= batch_size * 100), then yes, using shuffle=True won't change anything. However, with a bigger dataset (> batch_size * 100), shuffling will work, since currently we sort within each pool of batch_size * 100 examples and not across the whole dataset.

akurniawan avatar Aug 05 '20 09:08 akurniawan
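The pooling behavior described above can be sketched like this: shuffle the indices globally, split them into pools of `batch_size * pool_factor`, and sort only within each pool. With a dataset larger than one pool, the global shuffle changes which examples share a pool, so shorter sequences are no longer always emitted first. The names here are illustrative assumptions, not the PR's actual API.

```python
import random

def pooled_sorted_batches(lengths, batch_size, pool_factor=100, shuffle=True):
    """Shuffle globally, then sort within each pool of batch_size * pool_factor.

    Small datasets fit in one pool, so shuffling has no visible effect there;
    larger datasets get a different pool composition each epoch.
    Illustrative sketch, not the PR's actual implementation.
    """
    indices = list(range(len(lengths)))
    if shuffle:
        random.shuffle(indices)  # decides which examples end up in which pool
    pool_size = batch_size * pool_factor
    batches = []
    for start in range(0, len(indices), pool_size):
        # Sorting is local to the pool, not the whole dataset.
        pool = sorted(indices[start:start + pool_size], key=lambda i: lengths[i])
        batches.extend(pool[i:i + batch_size] for i in range(0, len(pool), batch_size))
    return batches
```

This mirrors the pool trick used by the legacy `torchtext.data.BucketIterator`, which also sorts inside a window of shuffled examples rather than over the full dataset.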