machinelearning
Prototype using spans in Model
@tarekgh, this isn't for merging, but it shows approximately what I had in mind for incorporating spans into Model (I know you're currently revising the surface area, so take this with a grain of salt). This eliminates most of the remaining allocation that occurs when using Tokenizer.CountTokens/EncodeToIds, because it avoids allocating a string for each token that's already in the cache.
Feel free to crib liberally from the second commit and close this PR. Ignore the first commit, which I submitted separately.
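For readers unfamiliar with the idea, here is a rough, language-neutral sketch of a span-keyed cache lookup. It is written in Rust, where a borrowed `&str` slice plays roughly the role of a `ReadOnlySpan<char>`, and it is not the code from this PR. `TokenCache`, `count_tokens`, and `toy_encode` are hypothetical names invented for the illustration, and splitting on whitespace is a stand-in for real pre-tokenization.

```rust
use std::collections::HashMap;

/// Hypothetical cache: maps a previously seen piece of text to its token ids.
struct TokenCache {
    entries: HashMap<String, Vec<u32>>,
}

impl TokenCache {
    fn new() -> Self {
        TokenCache { entries: HashMap::new() }
    }

    /// Look up a borrowed slice of the input. `HashMap<String, _>::get`
    /// accepts `&str` (through `Borrow<str>`), so a cache hit does not
    /// allocate a new `String` for the key.
    fn lookup(&self, piece: &str) -> Option<&[u32]> {
        self.entries.get(piece).map(|ids| ids.as_slice())
    }

    /// Only a cache miss pays for an owned copy of the key.
    fn insert(&mut self, piece: &str, ids: Vec<u32>) {
        self.entries.insert(piece.to_owned(), ids);
    }
}

/// Count tokens across the whitespace-separated pieces of `text`.
/// `encode_piece` stands in for the real model encoder (an assumption).
fn count_tokens(
    cache: &mut TokenCache,
    text: &str,
    encode_piece: fn(&str) -> Vec<u32>,
) -> usize {
    let mut total = 0;
    for piece in text.split_whitespace() {
        // Counting only needs the length of the cached id list.
        let cached_len = cache.lookup(piece).map(|ids| ids.len());
        match cached_len {
            Some(n) => total += n,
            None => {
                let ids = encode_piece(piece);
                total += ids.len();
                cache.insert(piece, ids);
            }
        }
    }
    total
}

/// Toy encoder used only for the illustration: one id per byte.
fn toy_encode(piece: &str) -> Vec<u32> {
    piece.bytes().map(u32::from).collect()
}
```

Calling `count_tokens(&mut cache, "ab cd ab", toy_encode)` on an empty cache encodes `ab` and `cd` once each; the second `ab` is served from the cache using only the borrowed slice. The design point is that the owned key is created on a miss only, so the hit path stays allocation-free.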
Codecov Report
Attention: 69 lines in your changes are missing coverage. Please review.
Comparison is base (f976424) 68.81% compared to head (e78ab0f) 68.81%. Report is 6 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #7018 +/- ##
=======================================
Coverage 68.81% 68.81%
=======================================
Files 1258 1259 +1
Lines 250643 250665 +22
Branches 25606 25608 +2
=======================================
+ Hits 172479 172501 +22
+ Misses 71540 71534 -6
- Partials 6624 6630 +6
| Flag | Coverage Δ | |
|---|---|---|
| Debug | 68.81% <62.50%> (+<0.01%) | :arrow_up: |
| production | 63.28% <62.50%> (+<0.01%) | :arrow_up: |
| test | 88.44% <ø> (+<0.01%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
| Files | Coverage Δ | |
|---|---|---|
| ...rc/Microsoft.ML.Tokenizers/Model/EnglishRoberta.cs | 67.36% <100.00%> (ø) | |
| ...crosoft.ML.Tokenizers/Utils/Helpers.netstandard.cs | 75.00% <100.00%> (+15.00%) | :arrow_up: |
| src/Microsoft.ML.Tokenizers/Model/BPE.cs | 75.29% <75.00%> (+0.29%) | :arrow_up: |
| ...rosoft.ML.Tokenizers/Utils/StringSpanOrdinalKey.cs | 94.44% <94.44%> (ø) | |
| src/Microsoft.ML.Tokenizers/Model/Model.cs | 10.00% <50.00%> (+10.00%) | :arrow_up: |
| src/Microsoft.ML.Tokenizers/Model/Cache.cs | 75.00% <76.92%> (+34.01%) | :arrow_up: |
| src/Microsoft.ML.Tokenizers/Tokenizer.cs | 82.64% <52.17%> (-0.97%) | :arrow_down: |
| src/Microsoft.ML.Tokenizers/Utils/LruCache.cs | 77.77% <64.70%> (+11.11%) | :arrow_up: |
| src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs | 54.92% <48.00%> (-0.64%) | :arrow_down: |
Closing this in favor of the following: https://github.com/dotnet/machinelearning/pull/7035