Some GPU machines crashing

Open torronen opened this issue 3 years ago • 5 comments

Two machines, each with 2x RTX 3090 on Windows 11, are crashing (they seem fully unresponsive) on about 2-5% of runs. They have different motherboards and separate manual installations, with the X570 chipset. However, 7 machines with 3070s are okay.

I am running LightGBM version 2.3.1 for compatibility with .NET.

Is there any advice on what I should investigate or which logs to review to find potential causes?

torronen avatar Feb 28 '22 10:02 torronen

@torronen Thanks for using LightGBM. Could you please provide us with the output logs from LightGBM? In addition, it would be great if you could provide the following information:

  1. Which device type are you using to train LightGBM? (cuda or gpu).
  2. Is there any error information (like a segmentation fault, or anything else) available?
  3. Are you running a distributed version of LightGBM?

shiyu1994 avatar Mar 01 '22 13:03 shiyu1994

Thanks for the quick reply.

  1. GPU
  2. No, the computer is unresponsive and neither GPU provides any output. However, both machines run fine on other GPU tasks, such as 3D, gaming or cryptomining. They use the latest Nvidia GeForce "Game Ready" drivers.
  3. non distributed

Do you know if the DLL version (compiled from the 2.3.1 tag following the instructions on the website) saves logs somewhere by default? If not, I still need to implement that, as it is probably just printing to the screen for now. I will follow up after the logs have been saved.

torronen avatar Mar 01 '22 17:03 torronen

@torronen Thanks for your feedback. Could you please save the screen output to a file and post or attach it here? Maybe we can find some problems from the logs.
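
One way to capture the screen output is a small wrapper that runs the training program and appends every line to a file as soon as it arrives. This is only a minimal sketch: the helper name `run_and_log`, the log file name, and the executable name in the usage comment are hypothetical, and the wrapper is generic Python rather than anything provided by LightGBM.

```python
import os
import subprocess
import sys
from datetime import datetime


def run_and_log(cmd, log_path):
    """Run cmd, echo its output, and append each line to log_path right away."""
    with open(log_path, "a", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so warnings are captured too
            text=True,
            errors="replace",
        )
        for line in proc.stdout:
            stamped = f"{datetime.now().isoformat(timespec='seconds')} {line}"
            sys.stdout.write(stamped)
            log.write(stamped)
            # Push the line to disk immediately so it survives a hard hang.
            log.flush()
            os.fsync(log.fileno())
        return proc.wait()


# Hypothetical usage; replace the executable with the real training program:
# exit_code = run_and_log(["MyAutoMLExperiment.exe"], "lightgbm_run.log")
```

Note that a program whose output is piped may buffer it internally, so the last lines before a freeze can still be missing from the file.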

shiyu1994 avatar Mar 02 '22 15:03 shiyu1994

Sorry for the delay; my runs take some time, so the machines are reserved.

This is what I get by default. It seems that if I set the display on the secondary GPU when the computer starts, I can still use the computer as normal after the crash. I am just no longer able to query that GPU with any tools.


=============== Running AutoML experiment ===============
#########################################################
Running AutoML binary classification experiment...
Press 'q' key to stop the experiment run...
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Number of positive: 687280, number of negative: 670316
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 416060
[LightGBM] [Info] Number of data: 1357596, number of used features: 6766
[LightGBM] [Info] Using requested OpenCL platform 0 device 0
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce RTX 3090, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
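
The line "Using requested OpenCL platform 0 device 0" reflects the LightGBM parameters `gpu_platform_id` and `gpu_device_id`. As a rough sketch, the settings below would point the OpenCL trainer at a different device and raise the log verbosity. The parameter names are LightGBM's; the values, the assumption that device 1 is the second 3090, and the helper `to_config_text` are illustrative only, and how these options are passed through the .NET wrapper is not shown.

```python
# Assumed values for illustration; only the parameter names come from LightGBM.
params = {
    "device_type": "gpu",   # OpenCL-based GPU trainer, as in the log above
    "gpu_platform_id": 0,   # OpenCL platform index reported in the log
    "gpu_device_id": 1,     # hypothetical: the GPU that does not drive the display
    "verbosity": 2,         # values above 1 enable debug-level messages
}


def to_config_text(params):
    """Render parameters as 'key = value' lines, one per parameter."""
    return "\n".join(f"{key} = {value}" for key, value in params.items())


print(to_config_text(params))
```

If the crashes follow the display GPU rather than a specific card, changing `gpu_device_id` between runs is one way to narrow that down.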

torronen avatar Mar 13 '22 16:03 torronen

The log did not provide any clue about the cause of the crash.

Windows 11 are crashing (seem fully unresponsive) on about 2-5% of runs

Did you mean that the whole operating system is unresponsive?

shiyu1994 avatar Mar 24 '22 02:03 shiyu1994