Warning: FALLBACK path has been taken inside: torch::jit::fuser::cuda::runCudaFusionGroup.
Description
Running the code specified below, I get a number of warnings at the beginning, apparently harmless.
Training: 0% |= | Accuracy: _, SoftmaxCrossEntropyLoss: _
Training: 0% |= | Accuracy: _, SoftmaxCrossEntropyLoss: _
[W C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\codegen\cuda\manager.cpp:336] Warning: FALLBACK path has been taken inside: torch::jit::fuser::cuda::runCudaFusionGroup. This is an indication that codegen Failed for some reason. To debug try disable codegen fallback path via setting the env variable
export PYTORCH_NVFUSER_DISABLE=fallback
(function runCudaFusionGroup)
[... the same warning is repeated five more times ...]
Training: 0% |= | Accuracy: _, SoftmaxCrossEntropyLoss: _, speed: 129,14 items/sec
Training: 0% |= | Accuracy: _, SoftmaxCrossEntropyLoss: _, speed: 24,89 items/sec
Training: 1% |= | Accuracy: 0,01, SoftmaxCrossEntropyLoss: 5,06, speed: 154,95 items/sec
Training: 2% |= | Accuracy: 0,01, SoftmaxCrossEntropyLoss: 5,06, speed: 145,35 items/sec
I set the env variable as requested, and I report the full error message in the corresponding section. Are these warnings really harmless, or should I worry?
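As a side note, `export` only works in a shell, while the training here is launched from Eclipse as a Java program. One way to make sure the variable reaches the JVM that runs PyTorch is to set it on a child process environment. This is a minimal sketch, not the issue's actual launch setup: the classpath is a placeholder, and only the class name `it.algaware.mrjvs.djl.Test2` comes from the stack trace above.

```java
import java.util.Map;

public class NvfuserEnvDemo {
    public static void main(String[] args) {
        // Hypothetical child-JVM launch of the training class; the
        // classpath "build/classes" is a placeholder, not from the issue.
        ProcessBuilder pb = new ProcessBuilder(
                "java", "-cp", "build/classes", "it.algaware.mrjvs.djl.Test2");
        // Disable the nvFuser fallback so codegen failures become hard
        // errors, as the warning message itself suggests.
        Map<String, String> env = pb.environment();
        env.put("PYTORCH_NVFUSER_DISABLE", "fallback");
        System.out.println(env.get("PYTORCH_NVFUSER_DISABLE"));
        // pb.inheritIO().start(); // uncomment to actually launch
    }
}
```

Alternatively, the variable can be set per run configuration in Eclipse (Run Configurations → Environment tab), which avoids the child process entirely.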
Expected Behavior
No warning if it is harmless (or at least a clearer warning); otherwise, a fix.
Error Message
Training: 0% |= | Accuracy: _, SoftmaxCrossEntropyLoss: _
Training: 0% |= | Accuracy: _, SoftmaxCrossEntropyLoss: _
ai.djl.engine.EngineException: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: false INTERNAL ASSERT FAILED at "C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\codegen\cuda\executor_utils.cpp":1181, please report a bug to PyTorch.
namespace CudaCodeGen {
typedef signed char int8_t;
typedef unsigned char uint8_t;
typedef short int int16_t;
typedef unsigned short int uint16_t;
typedef int int32_t;
typedef unsigned int uint32_t;
typedef long long int int64_t;
typedef unsigned long long int uint64_t;
typedef int nvfuser_index_t;
#define POS_INFINITY __int_as_float(0x7f800000)
#define INFINITY POS_INFINITY
#define NEG_INFINITY __int_as_float(0xff800000)
#define NAN __int_as_float(0x7fffffff)
namespace std {
template <class _Tp> _Tp&& __declval(int);
template <class _Tp> _Tp __declval(long);
template <class _Tp> decltype(__declval<_Tp>(0)) declval() noexcept;
template <class _Tp, _Tp __v> struct integral_constant { static const _Tp value = __v; typedef _Tp value_type; typedef integral_constant type; };
typedef integral_constant<bool, true> true_type;
typedef integral_constant<bool, false> false_type;
// is_same, functional
template <class _Tp, class _Up> struct is_same : public false_type {};
template <class _Tp> struct is_same<_Tp, _Tp> : public true_type {};
// is_integral, for some types.
template <class _Tp> struct is_integral : public integral_constant<bool, false> {};
[************ OMITTED: posting the full dump failed with "Comment is too long (maximum is 65536 characters)", so I cut many lines ************]
NVFUSER_UPDATE_MAGIC_ZERO
if ((((((nvfuser_index_t)threadIdx.x) * 4) + 3) < T0.size[0])) {
  loadLocalToGlobal<float, 4, false>(
      &T18[(((nvfuser_index_t)blockIdx.x) * T0.size[0]) + i256], &T23[0]);
} } } }
CUDA NVRTC compile error: nvrtc: error: failed to open nvrtc-builtins64_117.dll. Make sure that nvrtc-builtins64_117.dll is installed correctly.
at ai.djl.pytorch.jni.PyTorchLibrary.moduleForward(Native Method)
at ai.djl.pytorch.jni.IValueUtils.forward(IValueUtils.java:47)
at ai.djl.pytorch.engine.PtSymbolBlock.forwardInternal(PtSymbolBlock.java:154)
at ai.djl.nn.AbstractBaseBlock.forwardInternal(AbstractBaseBlock.java:128)
at ai.djl.nn.AbstractBaseBlock.forward(AbstractBaseBlock.java:93)
at ai.djl.nn.SequentialBlock.forwardInternal(SequentialBlock.java:209)
at ai.djl.nn.AbstractBaseBlock.forward(AbstractBaseBlock.java:93)
at ai.djl.training.Trainer.forward(Trainer.java:189)
at ai.djl.training.EasyTrain.trainSplit(EasyTrain.java:122)
at ai.djl.training.EasyTrain.trainBatch(EasyTrain.java:110)
at ai.djl.training.EasyTrain.fit(EasyTrain.java:58)
at it.algaware.mrjvs.djl.Test2.getIntentAll(Test2.java:239)
at it.algaware.mrjvs.djl.Test2.main(Test2.java:82)
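The root cause in the trace above is the NVRTC compile error: `nvrtc-builtins64_117.dll` (the NVRTC builtins library shipped with CUDA 11.7) could not be opened. As a quick diagnostic, a sketch like the following scans the PATH entries for that DLL. Only the DLL name is taken from the error message; the rest assumes a standard Windows setup where DLLs are resolved via PATH.

```java
import java.io.File;

public class NvrtcDllCheck {
    public static void main(String[] args) {
        // DLL name copied verbatim from the NVRTC compile error above.
        String dll = "nvrtc-builtins64_117.dll";
        String path = System.getenv("PATH");
        boolean found = false;
        if (path != null) {
            // On Windows, native libraries are loaded from PATH directories.
            for (String dir : path.split(File.pathSeparator)) {
                if (new File(dir, dll).isFile()) {
                    System.out.println(dll + " found in " + dir);
                    found = true;
                }
            }
        }
        if (!found) {
            System.out.println(dll + " not found on PATH");
        }
    }
}
```

If the DLL is not found, adding the CUDA 11.7 `bin` directory to PATH (or reinstalling a matching CUDA toolkit) is a plausible fix, though that is an assumption and not something confirmed in this issue.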
How to Reproduce?
I run the code I already posted in https://github.com/deepjavalibrary/djl/issues/2144#issuecomment-1356405023, but with the following change when loading the model path (with distilbert it seems to work; the above issue occurs only with bert, and it seems to me there is no relation between the two):
.optModelPath(Paths.get("build/pytorch/traced_distilbert_wikipedia_uncased"))
//.optModelPath(Paths.get("build/pytorch/bert/bertBase"))
Steps to reproduce
Create the class and run main.
What have you tried to solve it?
Nothing; it seems harmless. I am just tracing it and asking.
Environment Info
Please run the command ./gradlew debugEnv
from the root directory of DJL (if necessary, clone DJL first). It will output information about your system, environment, and installation that can help us debug your issue. Paste the output of the command below:
I receive an error from the terminal, but I'm working from Eclipse. If you need other information, please let me know.
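Since the terminal invocation fails and the work is done inside Eclipse anyway, one workaround is to launch the Gradle task from a small Java program run in Eclipse. This is only a sketch: the clone location is a placeholder, and it assumes the standard Gradle wrapper scripts in the DJL repository root.

```java
import java.io.File;

public class DebugEnvLauncher {
    public static void main(String[] args) {
        // Pick the Gradle wrapper script matching the current OS.
        boolean windows =
                System.getProperty("os.name").toLowerCase().contains("win");
        ProcessBuilder pb = new ProcessBuilder(
                windows ? "gradlew.bat" : "./gradlew", "debugEnv");
        pb.directory(new File("/path/to/djl")); // placeholder: your DJL clone
        pb.inheritIO();
        // Print the command for verification; uncomment the last line to run.
        System.out.println(String.join(" ", pb.command()));
        // pb.start().waitFor();
    }
}
```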
So, is this just a flush of warnings instead of an error?
It seems that these warnings are thrown from the PyTorch library. One of the reasons could be that the bert model *.pt file is not loaded properly. As mentioned in https://github.com/deepjavalibrary/djl/issues/2144#issuecomment-1360144067, it is possible that the model is not exactly the same if you do the switch:
.optModelPath(Paths.get("build/pytorch/traced_distilbert_wikipedia_uncased"))
//.optModelPath(Paths.get("build/pytorch/bert/bertBase"))
This is model-level debugging. Could you narrow down the issue?
It is a flush of warnings, apparently harmless. I opened this issue to ask whether you understand it better and whether it could be harmful in other contexts; the overall training and use of the trained model seem to be fine.
The switch above referenced previously posted code, which in turn comes from an example (if you reread issue #2144 from the beginning, you'll find there are really few changes). The switch was a switch back to the previous code: #2144 began with a distilbert example, then I switched to a bert example to understand the differences in modelling and in the blocks, then I switched back to distilbert. The distilbert model is loaded in the same way as in the original example.
I have no idea how to narrow down the issue in this case. But cleaning up the code may be a good start, along with comparing it to the original example. I'll do that and come back with a new comment in this issue.
Thank you
Hi, do you have any new ideas about the warning? The same warning appeared in https://github.com/ultralytics/yolov5/issues/10333#issue-1467648446.