Avoid unnecessary memory allocations in language modeling datasets.

Open cpuhrsch opened this issue 5 years ago • 1 comment

Language modeling datasets construct every dataset even if only a subset is requested. They also store the fully numericalized version of the dataset if it is stored as "a single line" (word by word), but not otherwise.
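As an illustration of the idea only (this is a minimal sketch, not the code in this PR and not the torchtext API), the function below builds just the datasets that the caller asks for and keeps the token ids in a compact integer array. The names `build_splits`, `readers`, `requested`, and `unk_index` are hypothetical and chosen for this example.

```python
from array import array
from typing import Callable, Dict, Iterable, Sequence


def build_splits(
    readers: Dict[str, Callable[[], Iterable[str]]],
    vocab: Dict[str, int],
    requested: Sequence[str] = ("train",),
    unk_index: int = 0,
) -> Dict[str, array]:
    """Numericalize only the requested datasets (hypothetical helper)."""
    datasets: Dict[str, array] = {}
    for name in requested:
        if name not in readers:
            raise KeyError(f"unknown split: {name!r}")
        ids = array("q")  # signed 64-bit integers, stored without per-item objects
        # The reader is only called here, so unrequested data is never loaded.
        for line in readers[name]():
            ids.extend(vocab.get(tok, unk_index) for tok in line.split())
        datasets[name] = ids
    return datasets


# Usage: record which readers actually run.
calls = []


def make_reader(name, lines):
    def reader():
        calls.append(name)
        return iter(lines)
    return reader


readers = {
    "train": make_reader("train", ["the cat sat", "the dog"]),
    "valid": make_reader("valid", ["a cat"]),
    "test": make_reader("test", ["a dog sat"]),
}
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "dog": 4}
data = build_splits(readers, vocab, requested=("train",))
```

Passing zero-argument reader callables instead of already-loaded text is the design choice that matters here: nothing is read or allocated for a dataset until it is requested.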

Sep 19 '20 20:09 cpuhrsch

Codecov Report

Merging #992 into master will increase coverage by 0.27%. The diff coverage is 86.95%.

@@            Coverage Diff             @@
##           master     #992      +/-   ##
==========================================
+ Coverage   77.70%   77.98%   +0.27%     
==========================================
  Files          44       44              
  Lines        3100     3102       +2     
==========================================
+ Hits         2409     2419      +10     
+ Misses        691      683       -8

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...rchtext/experimental/datasets/language_modeling.py | `84.84% <86.95%> (+11.41%)` | :arrow_up: |
| ...ext/experimental/datasets/raw/language_modeling.py | `80.00% <0.00%> (+1.53%)` | :arrow_up: |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 618795d...cba98e9.

Sep 20 '20 22:09 codecov[bot]

closing stale PR

Mar 14 '23 18:03 rshraga
