tfjs icon indicating copy to clipboard operation
tfjs copied to clipboard

Download error value for model debugging

Open axinging opened this issue 3 years ago • 5 comments

For model with too many nodes, such as benchmark=FaceLandmarkDetection&architecture=attention_mesh, debug mode may possibly print very long logs in console. This is too hard to investigate the difference between reference and prediction.

Instead of printing logs in console, this change downloads the logs into json file. This will simplify the model debugging.

This change also adds a new url parameter intermediateIndex, this parameter can be used to control which intermediate tensors to be downloaded. For example, below parameter will download intermediate tensor of index: from 0 to 3, 9, from 10 to 100.

intermediateIndex=0-3,9,10-100

To download all:

intermediateIndex=-1

Default download 0-9:

intermediateIndex=0-9

To see the logs from the Cloud Build CI, please join either our discussion or announcement mailing list.


This change is Reviewable

axinging avatar Aug 19 '22 01:08 axinging

@qjia7 @xhcao @haoyunfeix @gyagp PTAL

axinging avatar Aug 29 '22 01:08 axinging

Could we merge all logs into one file? Otherwise, for each pair of nodes with values diff, two JSON files will be downloaded, and then, if 26 nodes have diff values, 52 JSON files will be downloaded, which will be a little more difficult to parse or read.

Probably like:

[
  {
      "node": "25_StatefulPartitionedCall-StatefulPartitionedCall-predict-MobilenetV3-expanded_conv_3-depthwise-depthwise_bn_offset",
      "webgl":  tensorData1,
       "cpu": tensorData2
  }
]
  1. Single file as below would be too big. For example, for bodypix-resnet50, this will be more than 100MB at some worse condition. Beside this, "webgl": tensorData1 may also be very long, we have seen a tensorData1 of 73983 items.
  2. For debug purpose, we just need to find out which node(s) is(are) incorrect. So when there is any mismatch, we need to compare this node by node. The seperated file make this comparision very simple with tools like(Beyond compare). From my experience, most time, simply compare the smallest indexed (please always check the smallest index first) nodes will get the wrong operator.

So seperated file is convenient from my debug experience. But this takes too long time to download those files (sleep 200 ms), especially when there are too many files (>1000).

[
  {
      "node": "25_StatefulPartitionedCall-StatefulPartitionedCall-predict-MobilenetV3-expanded_conv_3-depthwise-depthwise_bn_offset",
      "webgl":  tensorData1,
       "cpu": tensorData2
  }
]

axinging avatar Aug 30 '22 07:08 axinging

Regarding to @Linchenn's comments:

  • If yes, we could add one more flag for download nodes with diff values functionality in the UI.

The first version is print all logs in console. It's ok when the logs are short. But in order to investigate errors like below(from facelandmark, attention_mesh), I have to save it as a json file, and write a tool to split it into many files node by node:

debug_facelandmark

The idea here I proposed, is to save extra efforts to save and split. I didn't see any necessary to preserve these logs in console. So my opinion is: download. Do you have any use case for this purpose (print on console)? @Linchenn @lina128.

  • If no, we have to rename Print intermediate flag in the UI. Yes, this is good catching. Rename it to : 'Download tensor diff' (The UI doesn't support very long string)

  • Single file or multiple file In the latest change, I turned it into a single file. PTAL.

As to @pyu10055, currently each file name contains an prefixed index. This index can be used as a timestamp.

axinging avatar Aug 31 '22 02:08 axinging

@pyu10055 @Linchenn PTAL

This change also adds a new url parameter intermediateIndex, this parameter can be used to control which intermediate tensors to be downloaded. For example, below parameter will download intermediate tensor of index: from 0 to 3, 9, from 10 to 100.

intermediateIndex=0-3,9,10-100

To download all:

intermediateIndex=-1

Default download 0-9:

intermediateIndex=0-9

axinging avatar Sep 05 '22 08:09 axinging

@pyu10055 @Linchenn PTAL

axinging avatar Sep 08 '22 06:09 axinging

This change has been replaced by : https://github.com/tensorflow/tfjs/pull/6850 So I will close this.

axinging avatar Oct 25 '22 07:10 axinging