
Prototype using spans in Model

Open stephentoub opened this issue 1 year ago • 1 comments

@tarekgh, this isn't for merging, but it shows approximately what I had in mind for incorporating spans into Model (I know you're currently revising the surface area, so take this with a grain of salt). This eliminates most of the remaining allocation that occurs when using Tokenizer.CountTokens/EncodeToIds, because it avoids allocating a string for each token that's already in the cache.

Feel free to crib liberally from the second commit and close this PR. Ignore the first commit, which I submitted separately.
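To make the allocation point concrete, here is a minimal sketch of a span-keyed cache lookup. It is not the code from this PR: the coverage report shows the PR adding its own `StringSpanOrdinalKey` helper, which is not reproduced here. The sketch instead assumes .NET 9 or later and uses `Dictionary<TKey,TValue>.GetAlternateLookup`, and the `TokenCache` type and its member names are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only: TokenCache is not a type from Microsoft.ML.Tokenizers.
public sealed class TokenCache
{
    // Ordinal comparison, so a span and a string with the same chars hash and compare equal.
    private readonly Dictionary<string, int> _ids = new(StringComparer.Ordinal);

    public void Add(string token, int id) => _ids[token] = id;

    // Allocating path: Substring creates a new string for every lookup,
    // even when the token is already in the cache.
    public bool TryGetIdAllocating(string text, int start, int length, out int id)
        => _ids.TryGetValue(text.Substring(start, length), out id);

    // Span path (assumes .NET 9+): the slice is hashed and compared in place,
    // so a cache hit does not allocate a string for the token.
    public bool TryGetId(ReadOnlySpan<char> token, out int id)
        => _ids.GetAlternateLookup<ReadOnlySpan<char>>().TryGetValue(token, out id);
}
```

With a cache containing "hello" and "world", calling `TryGetId("hello world".AsSpan(6, 5), out int id)` finds the entry for "world" without creating a substring, whereas `TryGetIdAllocating("hello world", 6, 5, out id)` returns the same result but allocates a temporary string first.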

stephentoub, Feb 20 '24 04:02

Codecov Report

Attention: 69 lines in your changes are missing coverage. Please review.

Comparison is base (f976424) 68.81% compared to head (e78ab0f) 68.81%. Report is 6 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7018   +/-   ##
=======================================
  Coverage   68.81%   68.81%           
=======================================
  Files        1258     1259    +1     
  Lines      250643   250665   +22     
  Branches    25606    25608    +2     
=======================================
+ Hits       172479   172501   +22     
+ Misses      71540    71534    -6     
- Partials     6624     6630    +6     
Flag         Coverage Δ
Debug        68.81% <62.50%> (+<0.01%) ↑
production   63.28% <62.50%> (+<0.01%) ↑
test         88.44% <ø> (+<0.01%) ↑

Flags with carried forward coverage won't be shown.

Files                                                    Coverage Δ
...rc/Microsoft.ML.Tokenizers/Model/EnglishRoberta.cs    67.36% <100.00%> (ø)
...crosoft.ML.Tokenizers/Utils/Helpers.netstandard.cs    75.00% <100.00%> (+15.00%) ↑
src/Microsoft.ML.Tokenizers/Model/BPE.cs                 75.29% <75.00%> (+0.29%) ↑
...rosoft.ML.Tokenizers/Utils/StringSpanOrdinalKey.cs    94.44% <94.44%> (ø)
src/Microsoft.ML.Tokenizers/Model/Model.cs               10.00% <50.00%> (+10.00%) ↑
src/Microsoft.ML.Tokenizers/Model/Cache.cs               75.00% <76.92%> (+34.01%) ↑
src/Microsoft.ML.Tokenizers/Tokenizer.cs                 82.64% <52.17%> (-0.97%) ↓
src/Microsoft.ML.Tokenizers/Utils/LruCache.cs            77.77% <64.70%> (+11.11%) ↑
src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs            54.92% <48.00%> (-0.64%) ↓

... and 6 files with indirect coverage changes

codecov[bot], Feb 20 '24 05:02

Closing this in favor of the following: https://github.com/dotnet/machinelearning/pull/7035

tarekgh, Feb 28 '24 18:02