machinelearning icon indicating copy to clipboard operation
machinelearning copied to clipboard

Implement DataFrameColumn Apply and DropNulls methods

Open asmirnov82 opened this issue 10 months ago • 2 comments

Fixes #7107 as was asked in https://github.com/dotnet/machinelearning/issues/6144#issuecomment-2018430389

Additionaly:

  1. Fix incorrect IsNumeric method
  2. Fix error FillNulls crashes with NotImplemented exception on DataFrame with VBufferDataFrameColumn
  3. Improve speed and redesign API (legacy API marked as obsolete), internal methods are rewritten to use new API
  4. Add extra unit tests

Reasons for refactoring:

  1. Legacy implementation of ApplyElementwise is written not only to apply function on each element in the column, but also to be used for fast columns iteration, that's why it has rowIndex as one of it's arguments. This duplication of responsibilities is very confusing and absolutely not straightforward. More over, it is slow, as instead of only reading values, ApplyElementwise writes function results into into memory, also it converts read only columns to editable (and that again involves memory copy). So in some circumstances memory can be excessively copied even twice.
  2. Legacy method requires to provide function, that takes into account null values - this is not userfriendly. For working with Nulls there are already FillNulls and DropNulls methods. New method applies user function only to valuable values, this also leads to better performance

asmirnov82 avatar Apr 09 '24 11:04 asmirnov82

cc @JakeRadMSFT

asmirnov82 avatar Apr 09 '24 11:04 asmirnov82

Codecov Report

Attention: Patch coverage is 75.00000% with 82 lines in your changes missing coverage. Please review.

Project coverage is 68.67%. Comparing base (4bc753a) to head (9af789d). Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff            @@
##             main    #7123    +/-   ##
========================================
  Coverage   68.66%   68.67%            
========================================
  Files        1262     1263     +1     
  Lines      257774   257955   +181     
  Branches    26660    26698    +38     
========================================
+ Hits       176991   177140   +149     
- Misses      73971    74000    +29     
- Partials     6812     6815     +3     
Flag Coverage Δ
Debug 68.67% <75.00%> (+<0.01%) :arrow_up:
production 62.93% <61.86%> (-0.01%) :arrow_down:
test 88.88% <100.00%> (+0.02%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
src/Microsoft.Data.Analysis/DataFrame.cs 85.67% <100.00%> (+0.25%) :arrow_up:
src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs 75.84% <ø> (ø)
...st/Microsoft.Data.Analysis.Tests/DataFrameTests.cs 99.90% <100.00%> (+<0.01%) :arrow_up:
...ta.Analysis.Tests/PrimitiveDataFrameColumnTests.cs 99.74% <100.00%> (+0.03%) :arrow_up:
....Data.Analysis.Tests/StringDataFrameColumnTests.cs 100.00% <100.00%> (ø)
test/Microsoft.ML.Tokenizers.Tests/TitokenTests.cs 99.55% <100.00%> (ø)
...icrosoft.Data.Analysis/PrimitiveColumnContainer.cs 86.52% <96.96%> (+0.43%) :arrow_up:
...icrosoft.Data.Analysis/PrimitiveDataFrameColumn.cs 72.53% <97.29%> (+0.09%) :arrow_up:
src/Microsoft.Data.Analysis/DataFrameColumn.cs 65.46% <0.00%> (+2.28%) :arrow_up:
...nalysis/DataFrameColumns/VBufferDataFrameColumn.cs 44.57% <30.30%> (-0.04%) :arrow_down:
... and 2 more

... and 5 files with indirect coverage changes

codecov[bot] avatar Apr 09 '24 12:04 codecov[bot]

/azp run

JakeRadMSFT avatar Jun 07 '24 18:06 JakeRadMSFT

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines[bot] avatar Jun 07 '24 18:06 azure-pipelines[bot]

TestTokenizerUsingExternalVocab test fails, because external vocabulary is not available by https://pythia.blob.core.windows.net/public/encoding/gpt2.tiktoken url

asmirnov82 avatar Jun 08 '24 06:06 asmirnov82

TestTokenizerUsingExternalVocab test fails, because external vocabulary is not available by https://pythia.blob.core.windows.net/public/encoding/gpt2.tiktoken url

@tarekgh ?

JakeRadMSFT avatar Jun 10 '24 17:06 JakeRadMSFT

TestTokenizerUsingExternalVocab test fails, because external vocabulary is not available by https://pythia.blob.core.windows.net/public/encoding/gpt2.tiktoken url

Yes, the gpt-2 link became invalid. I'll submit a PR soon to fix that.

tarekgh avatar Jun 10 '24 17:06 tarekgh

@asmirnov82 if you are intersted, you may add the fix to this PR.

-            yield return new object[] { GPT2, @"https://pythia.blob.core.windows.net/public/encoding/gpt2.tiktoken" };
+            yield return new object[] { GPT2, @"https://fossies.org/linux/misc/whisper-20231117.tar.gz/whisper-20231117/whisper/assets/gpt2.tiktoken?m=b" };

otherwise, I can submit PR for that.

tarekgh avatar Jun 10 '24 17:06 tarekgh

@tarekgh, I implemented the fix. Now all tokenizer tests passed. Thank you for your help

asmirnov82 avatar Jun 10 '24 21:06 asmirnov82