machinelearning
machinelearning copied to clipboard
Implement DataFrameColumn Apply and DropNulls methods
Fixes #7107 as was asked in https://github.com/dotnet/machinelearning/issues/6144#issuecomment-2018430389
Additionaly:
- Fix incorrect IsNumeric method
- Fix error FillNulls crashes with NotImplemented exception on DataFrame with VBufferDataFrameColumn
- Improve speed and redesign API (legacy API marked as obsolete), internal methods are rewritten to use new API
- Add extra unit tests
Reasons for refactoring:
- Legacy implementation of ApplyElementwise is written not only to apply function on each element in the column, but also to be used for fast columns iteration, that's why it has rowIndex as one of it's arguments. This duplication of responsibilities is very confusing and absolutely not straightforward. More over, it is slow, as instead of only reading values, ApplyElementwise writes function results into into memory, also it converts read only columns to editable (and that again involves memory copy). So in some circumstances memory can be excessively copied even twice.
- Legacy method requires to provide function, that takes into account null values - this is not userfriendly. For working with Nulls there are already FillNulls and DropNulls methods. New method applies user function only to valuable values, this also leads to better performance
cc @JakeRadMSFT
Codecov Report
Attention: Patch coverage is 75.00000%
with 82 lines
in your changes missing coverage. Please review.
Project coverage is 68.67%. Comparing base (
4bc753a
) to head (9af789d
). Report is 1 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #7123 +/- ##
========================================
Coverage 68.66% 68.67%
========================================
Files 1262 1263 +1
Lines 257774 257955 +181
Branches 26660 26698 +38
========================================
+ Hits 176991 177140 +149
- Misses 73971 74000 +29
- Partials 6812 6815 +3
Flag | Coverage Δ | |
---|---|---|
Debug | 68.67% <75.00%> (+<0.01%) |
:arrow_up: |
production | 62.93% <61.86%> (-0.01%) |
:arrow_down: |
test | 88.88% <100.00%> (+0.02%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
Files | Coverage Δ | |
---|---|---|
src/Microsoft.Data.Analysis/DataFrame.cs | 85.67% <100.00%> (+0.25%) |
:arrow_up: |
src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs | 75.84% <ø> (ø) |
|
...st/Microsoft.Data.Analysis.Tests/DataFrameTests.cs | 99.90% <100.00%> (+<0.01%) |
:arrow_up: |
...ta.Analysis.Tests/PrimitiveDataFrameColumnTests.cs | 99.74% <100.00%> (+0.03%) |
:arrow_up: |
....Data.Analysis.Tests/StringDataFrameColumnTests.cs | 100.00% <100.00%> (ø) |
|
test/Microsoft.ML.Tokenizers.Tests/TitokenTests.cs | 99.55% <100.00%> (ø) |
|
...icrosoft.Data.Analysis/PrimitiveColumnContainer.cs | 86.52% <96.96%> (+0.43%) |
:arrow_up: |
...icrosoft.Data.Analysis/PrimitiveDataFrameColumn.cs | 72.53% <97.29%> (+0.09%) |
:arrow_up: |
src/Microsoft.Data.Analysis/DataFrameColumn.cs | 65.46% <0.00%> (+2.28%) |
:arrow_up: |
...nalysis/DataFrameColumns/VBufferDataFrameColumn.cs | 44.57% <30.30%> (-0.04%) |
:arrow_down: |
... and 2 more |
/azp run
Azure Pipelines successfully started running 2 pipeline(s).
TestTokenizerUsingExternalVocab test fails, because external vocabulary is not available by https://pythia.blob.core.windows.net/public/encoding/gpt2.tiktoken url
TestTokenizerUsingExternalVocab test fails, because external vocabulary is not available by https://pythia.blob.core.windows.net/public/encoding/gpt2.tiktoken url
@tarekgh ?
TestTokenizerUsingExternalVocab test fails, because external vocabulary is not available by https://pythia.blob.core.windows.net/public/encoding/gpt2.tiktoken url
Yes, the gpt-2 link became invalid. I'll submit a PR soon to fix that.
@asmirnov82 if you are intersted, you may add the fix to this PR.
- yield return new object[] { GPT2, @"https://pythia.blob.core.windows.net/public/encoding/gpt2.tiktoken" };
+ yield return new object[] { GPT2, @"https://fossies.org/linux/misc/whisper-20231117.tar.gz/whisper-20231117/whisper/assets/gpt2.tiktoken?m=b" };
otherwise, I can submit PR for that.
@tarekgh, I implemented the fix. Now all tokenizer tests passed. Thank you for your help