
Issue when training in Colab

Open fermions75 opened this issue 1 year ago • 7 comments

While running training in Colab, this error is shown:

Something went wrong
Connection errored out.

How can I solve this?

fermions75 avatar Apr 09 '23 12:04 fermions75

Getting this error too.

alior101 avatar Apr 09 '23 13:04 alior101

I'm guessing it's running out of RAM? Are you using a high-RAM environment?
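
If you want to check which RAM profile your runtime actually got, something like this works in a Colab cell (a minimal sketch using psutil, which is preinstalled on Colab; the tier sizes below are approximate):

```python
import psutil

# Report total and available system RAM in GiB. The free tier gives
# roughly 12 GiB, the high-RAM profile roughly 25 GiB, and A100
# instances considerably more.
gib = 1024 ** 3
mem = psutil.virtual_memory()
print(f"Total RAM:     {mem.total / gib:.1f} GiB")
print(f"Available RAM: {mem.available / gib:.1f} GiB")
```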

lxe avatar Apr 11 '23 04:04 lxe

No, I did not. I just tried using Colab Pro with the base model cerebras/Cerebras-GPT-2.7B. When I press Train, the following error shows in Colab:

[screenshot of the error in Colab]
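
For reference, one way to cut the load-time RAM spike for this model is 8-bit loading. This is a sketch assuming bitsandbytes and accelerate are installed, not necessarily how simple-llm-finetuner loads the model internally:

```python
from transformers import AutoModelForCausalLM

# device_map="auto" lets accelerate place weights directly on the GPU,
# avoiding a full copy sitting in system RAM; load_in_8bit quantizes
# the weights via bitsandbytes, roughly quartering their footprint
# compared to fp32.
model = AutoModelForCausalLM.from_pretrained(
    "cerebras/Cerebras-GPT-2.7B",
    load_in_8bit=True,
    device_map="auto",
)
```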

fermions75 avatar Apr 11 '23 08:04 fermions75

No, it's broken. It works on Hugging Face now, but you can't download LoRAs. xD

MillionthOdin16 avatar Apr 12 '23 00:04 MillionthOdin16

I have the same issue. I even tried running it with a third-party tunnel instead of Gradio's, but I get the same error.
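
For context, switching Gradio's tunnel off is just a launch flag; a minimal sketch with a hypothetical stand-in app (the finetuner's actual UI code differs):

```python
import gradio as gr

# Hypothetical placeholder app, only to illustrate the launch flags.
def echo(text):
    return text

demo = gr.Interface(fn=echo, inputs="text", outputs="text")

# share=False skips Gradio's public tunnel entirely. The UI shows
# "Connection errored out" when the backend stops responding, so if
# the same error appears with a different tunnel, the Python process
# itself is likely dying (which matches the OOM-killer diagnosis
# later in this thread).
demo.launch(share=False, server_port=7860)
```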

rs189 avatar Apr 22 '23 20:04 rs189

I should note that Colab does in fact work for me, but only on an A100 Colab instance with more than 64 GB of RAM. RAM usage seemed to spike to ~36+ GB, which is more than the maximum of the free tier/standard RAM profile. This leads me to think it's simply the RAM limitation of the lower Colab tiers.

Trying it on the standard RAM profile with a V100 (~20-24 GB of RAM), I had the issue listed in the original post. Trying it locally on a machine with 32 GB of RAM and a P100, I have the same problem: RAM spikes, the machine starts the OOM killer, and the process is ended.
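
One way to confirm the spike from inside the process is the standard library's resource module (a sketch; on Linux, ru_maxrss is reported in KiB):

```python
import resource

# Peak resident set size of the current process so far. Printing this
# before and after the training call shows how close the spike gets to
# the machine's RAM limit before the OOM killer steps in.
peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Peak RAM used so far: {peak_kib / 1024 ** 2:.1f} GiB")
```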

Clybius avatar Apr 23 '23 19:04 Clybius

What model and dataset are you using to generate and train? In my case this happens even with a half-precision 7B LLaMA model and the default "unhelpful" example dataset. I can even generate with it on my PC, which has only 8 GB of VRAM; I can't train, however. But I don't believe that fine-tuning a half-precision 7B LLaMA should demand more than the 15 GB of VRAM that Colab provides for free? As you can see, the crash/"Connection errored out" error occurs well before RAM and/or VRAM is saturated.

[screenshot: Colab resource monitor showing RAM and VRAM usage well below their limits at the time of the crash]
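
For what it's worth, a back-of-the-envelope estimate of why naive full fine-tuning of a half-precision 7B model cannot fit in 15 GB, assuming fp16 weights and gradients plus standard Adam with fp32 moments (LoRA, which this finetuner trains, is far cheaper):

```python
params = 7e9           # LLaMA-7B parameter count
GB = 1e9

weights = params * 2   # fp16 weights: ~14 GB, needed even just to hold the model
grads = params * 2     # fp16 gradients: ~14 GB
adam = params * 4 * 2  # two fp32 Adam moment buffers: ~56 GB

print(f"Naive full fine-tune: ~{(weights + grads + adam) / GB:.0f} GB")
# Roughly 84 GB, far beyond 15 GB. LoRA freezes the base weights and
# trains only small adapter matrices, so gradients and optimizer states
# shrink to well under 1 GB on top of the frozen weights.
```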

rs189 avatar Apr 24 '23 10:04 rs189