
Weird issue with TruncatedSVD and NumericStringConverter

Open MihailoJoksimovic opened this issue 3 years ago • 1 comments

So it took me ages to figure out the WHY, but I finally pinpointed some extremely weird behavior.

Namely, here's the simplest code that reproduces the issue:

$dataset = \Rubix\ML\Datasets\Labeled::build([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3, 1.4, 0.2]
], ['setosa', 'variosa'])->apply(new \Rubix\ML\Transformers\NumericStringConverter());

$transformer = new \Rubix\ML\Transformers\TruncatedSVD(2);

$dataset->apply($transformer);

var_dump($dataset);

Output

object(Rubix\ML\Datasets\Labeled)#2 (2) {
  ["labels":protected]=>
  array(2) {
    [0]=>
    string(6) "setosa"
    [1]=>
    string(7) "variosa"
  }
  ["samples":protected]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      float(0)
      [1]=>
      float(0)
    }
    [1]=>
    array(2) {
      [0]=>
      float(0)
      [1]=>
      float(0)
    }
  }
}

As you can see - it's all zeros.

Now, removing the NumericStringConverter:

$dataset = \Rubix\ML\Datasets\Labeled::build([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3, 1.4, 0.2]
], ['setosa', 'variosa']);

$transformer = new \Rubix\ML\Transformers\TruncatedSVD(2);

$dataset->apply($transformer);

var_dump($dataset);

gives the following output:

object(Rubix\ML\Datasets\Labeled)#2 (2) {
  ["labels":protected]=>
  array(2) {
    [0]=>
    string(6) "setosa"
    [1]=>
    string(7) "variosa"
  }
  ["samples":protected]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      float(-6.3431263560806)
      [1]=>
      float(-0.1573150685585)
    }
    [1]=>
    array(2) {
      [0]=>
      float(-5.9145190147327)
      [1]=>
      float(0.16871521675666)
    }
  }
}

Now, it took me hours to figure out WTF is happening, because, apparently, nothing spectacular is going on ... BUT ... BUT! I pinpointed the issue to the following loop in NumericStringConverter:

    protected function convertToNumber(array &$sample) : void
    {
        foreach ($sample as &$value) {
            if (is_string($value)) {
                if (is_numeric($value)) {
                    $value = (int) $value == $value
                        ? (int) $value
                        : (float) $value;

                    continue;
                }

This foreach loop, which iterates over $sample with $value taken by reference, is the culprit! By replacing it with:

        foreach ($sample as $key => $value) {
            if (is_string($value)) {
                if (is_numeric($value)) {
                    $sample[$key] = (int) $value == $value
                        ? (int) $value
                        : (float) $value;

                    continue;
                }

all works as expected really!
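
For what it's worth, the difference between the two loops can be reproduced in plain PHP, with no Rubix ML or Tensor code involved. The sketch below is only an illustration: the variable names are made up, and it relies on documented PHP behavior (a by-reference foreach leaves its loop variable aliased to the last element, and references stored inside an array survive a plain copy). Whether the Tensor extension actually trips over such reference-wrapped elements is an assumption, not something verified here.

```php
<?php

// By-reference loop, same shape as the one in convertToNumber().
$byRef = ['5.1', '3.5'];
foreach ($byRef as &$value) {
    $value = (float) $value;
}
// No unset($value): $value is still an alias of $byRef[1].

$byRefCopy = $byRef; // looks like an independent copy...
$value = 0.0;        // ...but this write goes through the shared reference

// Key-based loop, same shape as the proposed replacement.
$byKey = ['5.1', '3.5'];
foreach ($byKey as $key => $item) {
    $byKey[$key] = (float) $item;
}

$byKeyCopy = $byKey;
$item = 0.0;         // $item is an ordinary variable, nothing else changes

var_dump($byRef[1]);     // float(0)
var_dump($byRefCopy[1]); // float(0)
var_dump($byKey[1]);     // float(3.5)
var_dump($byKeyCopy[1]); // float(3.5)
```

Inside convertToNumber() the alias goes out of scope when the method returns, so this exact side effect is not visible from PHP code there. The sketch only shows that the two loops do not leave the array in the same state.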

This leads me to the conclusion that, for whatever WEIRD reason, something happens internally that messes up the SVD process. Now the problem is that SVD is written as a C extension and I honestly have no clue how to debug that :)
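
Until the root cause is known, one possible stopgap is to rebuild the samples so that no element is a PHP reference before they reach the extension. The helper below, stripReferences(), is hypothetical and not part of Rubix ML. It is a sketch under the assumption that leftover references are what confuses the SVD code, and it has not been tested against TruncatedSVD.

```php
<?php

/**
 * Hypothetical helper (not part of Rubix ML): rebuild a 2-D samples array
 * so that every element is a plain value rather than a PHP reference.
 */
function stripReferences(array $samples): array
{
    $clean = [];

    foreach ($samples as $i => $sample) {
        $clean[$i] = [];

        // Iterating by value hands us the underlying value, not the reference.
        foreach ($sample as $j => $value) {
            $clean[$i][$j] = $value;
        }
    }

    return $clean;
}

// Build samples with by-reference loops, leaving live aliases behind.
$samples = [['5.1', '3.5'], ['4.9', '3']];
foreach ($samples as &$row) {
    foreach ($row as &$cell) {
        $cell = (float) $cell;
    }
}

$clean = stripReferences($samples);

$cell = 0.0; // writes through the alias into $samples[1][1]

var_dump($samples[1][1]); // float(0)
var_dump($clean[1][1]);   // float(3)
```

In the reproduction above, this would presumably be applied to the converted dataset's samples before building a fresh dataset for TruncatedSVD, though that step is untested.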

My question is: do you see this as a bug in NumericStringConverter or in the C extension? If it's the former, I'd be happy to submit a bugfix!

MihailoJoksimovic, May 10 '22 18:05

Hey @MihailoJoksimovic, yeah, I've run into this problem before with SVD. Unfortunately, I have not had the time to debug the issue. Maybe create an issue in the Tensor repo and see if someone can fix it. Thanks!

andrewdalpino, May 15 '22 01:05