Dmitri Smirnov

Results: 141 comments by Dmitri Smirnov

/azp run MacOS NoContribops CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows WebAssembly CI Pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

> Pre-allocation could work, but that would require a rewrite of all the output processing in both the Java and the native code. I'd missed the update to `Run` which...

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux CPU x64 NoContribops CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar...

/azp run onnxruntime-binary-size-checks-ci-pipeline, onnxruntime-python-checks-ci-pipeline, ONNX Runtime Web CI Pipeline

> Can we get this integrated now the 1.11 release has happened? I can rebase it on master if necessary, then it'll be easier to work on the native binding...

Copying or not, we want to make sure that the output tensors are deallocated when no longer needed. --- In reply to: [1054694679](https://github.com/microsoft/onnxruntime/pull/10653#issuecomment-1054694679)
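For illustration, here is a minimal sketch of that contract on the native side, using the public C++ API (the Java binding wraps the equivalent C calls; on the Java side the analogous mechanism is `OrtSession.Result` implementing `AutoCloseable`). The function, names, and single input/output below are placeholders, not code from this PR:

```cpp
#include <onnxruntime_cxx_api.h>

#include <vector>

// Sketch: output-tensor lifetime on the native side. Run() returns
// std::vector<Ort::Value>; each Value owns its native OrtValue and
// releases it when destroyed, so the outputs are freed deterministically.
void RunOnce(Ort::Session& session,
             const char* input_name, Ort::Value& input,
             const char* output_name) {
  std::vector<Ort::Value> outputs =
      session.Run(Ort::RunOptions{nullptr}, &input_name, &input, 1,
                  &output_name, 1);
  // ... consume outputs ...
}  // `outputs` goes out of scope here; the native tensor memory is freed
```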

> public static final int MAX_DIMENSIONS = 8;

We did some profiling; the max dim we ever hit was 5. --- In reply to: [1306361595](https://github.com/microsoft/onnxruntime/pull/10653#issuecomment-1306361595) --- Refers to: java/src/main/java/ai/onnxruntime/TensorInfo.java:15...
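Purely to illustrate what such a cap buys (a hypothetical helper, not code from `TensorInfo.java`): any shape with more dimensions than the cap can be rejected up front, and per the profiling above real models stayed at 5 dims, so 8 leaves headroom.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mirror of the Java-side cap; not from the onnxruntime sources.
constexpr std::size_t kMaxDimensions = 8;

bool ShapeWithinLimit(const std::vector<int64_t>& shape) {
  return shape.size() <= kMaxDimensions;
}
```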

I think that captures it. Interestingly enough, when thrust::inclusive_scan was used the output was correct; however, the switch was made to cub::DeviceScan because of its ability to take cuda...

I also found [this](https://forums.developer.nvidia.com/t/how-to-use-thrust-for-each-with-cuda-streams/177797/3) in the forums, implying that in CUDA 11.4 the problem may have been addressed.
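For context, a sketch of the stream-handling difference discussed above: `cub::DeviceScan` accepts an explicit `cudaStream_t` argument, whereas with thrust the stream has to be routed through an execution policy (`thrust::cuda::par.on(stream)`), which is what the forum thread covers. The function and variable names here are illustrative, not the actual onnxruntime kernel:

```cpp
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Sketch: an inclusive prefix sum enqueued on the caller's stream via CUB.
void InclusiveSumOnStream(const int* d_in, int* d_out, int num_items,
                          cudaStream_t stream) {
  void* d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
  // First call only computes the required temporary storage size.
  cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes,
                                d_in, d_out, num_items, stream);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  // Second call performs the scan, enqueued on `stream`.
  cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes,
                                d_in, d_out, num_items, stream);
  cudaFree(d_temp_storage);
}
```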