Avoid unnecessary memory allocations in language modeling datasets.

Open • cpuhrsch opened this issue 5 years ago • 1 comment

The language modeling datasets currently construct every split even when only a subset is requested. They also store the fully numericalized version of a dataset when it is stored as "a single line" (word by word), but not otherwise.
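
To make the intent concrete, here is a minimal sketch of building only the requested splits and keeping the numericalized tokens in one compact buffer. It is not the code in this PR and not the torchtext API: `build_vocab`, `numericalize`, `setup_datasets`, `make_loader`, and the lazy-loader dictionary are hypothetical names invented for illustration, and it assumes whitespace tokenization with the vocabulary built from the train split.

```python
from array import array
from typing import Callable, Dict, Iterable, List, Sequence

# Hypothetical type: a zero-argument callable that yields the raw lines of a split.
Loader = Callable[[], Iterable[str]]


def build_vocab(lines: Iterable[str]) -> Dict[str, int]:
    """Map each whitespace-separated token to an integer id (0 is reserved for <unk>)."""
    vocab: Dict[str, int] = {"<unk>": 0}
    for line in lines:
        for token in line.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def numericalize(lines: Iterable[str], vocab: Dict[str, int]) -> array:
    """Stream token ids into a single typed array rather than nested Python lists."""
    ids = array("l")
    for line in lines:
        ids.extend(vocab.get(token, 0) for token in line.split())
    return ids


def setup_datasets(loaders: Dict[str, Loader],
                   requested: Sequence[str] = ("train",)) -> Dict[str, array]:
    """Load and numericalize only the requested splits.

    Assumption: the vocabulary always comes from the train split, so the
    train loader is called once even if "train" itself is not requested.
    """
    vocab = build_vocab(loaders["train"]())
    return {name: numericalize(loaders[name](), vocab) for name in requested}


# Usage: record which loaders actually run.
calls: List[str] = []


def make_loader(name: str, lines: List[str]) -> Loader:
    def load() -> Iterable[str]:
        calls.append(name)
        return lines
    return load


loaders = {
    "train": make_loader("train", ["the cat sat", "the dog sat"]),
    "valid": make_loader("valid", ["the cat ran"]),
    "test": make_loader("test", ["a dog sat"]),
}
data = setup_datasets(loaders, requested=("valid",))
```

In this sketch the test split is never read when only the validation split is requested, and each split is held as one flat integer array regardless of how the raw text was laid out.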

cpuhrsch avatar Sep 19 '20 20:09 cpuhrsch

Codecov Report

Merging #992 into master will increase coverage by 0.27%. The diff coverage is 86.95%.

@@            Coverage Diff             @@
##           master     #992      +/-   ##
==========================================
+ Coverage   77.70%   77.98%   +0.27%     
==========================================
  Files          44       44              
  Lines        3100     3102       +2     
==========================================
+ Hits         2409     2419      +10     
+ Misses        691      683       -8     

| Impacted Files | Coverage Δ |
|---|---|
| ...rchtext/experimental/datasets/language_modeling.py | 84.84% <86.95%> (+11.41%) :arrow_up: |
| ...ext/experimental/datasets/raw/language_modeling.py | 80.00% <0.00%> (+1.53%) :arrow_up: |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 618795d...cba98e9.

codecov[bot] avatar Sep 20 '20 22:09 codecov[bot]

closing stale PR

rshraga avatar Mar 14 '23 18:03 rshraga