llm-foundry

LLM training code for Databricks foundation models

Results: 267 llm-foundry issues, sorted by recently updated

Just a simple script to generate eval data in the right format for addition samples
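A minimal sketch of what such a script might look like, assuming a JSONL layout with `context`/`answer` fields (the field names are an assumption here, not confirmed by the issue):

```python
# Hypothetical sketch: emit addition problems as JSONL eval data.
# The "context"/"answer" field names are an assumption about the
# expected eval format, not taken from the issue itself.
import json
import random

random.seed(0)
with open("addition_eval.jsonl", "w") as f:
    for _ in range(1000):
        a, b = random.randint(0, 999), random.randint(0, 999)
        sample = {"context": f"{a} + {b} = ", "answer": str(a + b)}
        f.write(json.dumps(sample) + "\n")
```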

Hi, I want to test the model's inference on my hardware. I am using a single **A100** GPU instance with 60 GB of memory. I have created 4 processes (and 4 model instances,...
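A hedged sketch of the setup being described: several worker processes, each holding its own model copy on the same GPU. The model name and dtype are assumptions; four fp16 7B copies roughly fit in 60 GB but leave little headroom for activations.

```python
# Hypothetical sketch of multi-process, single-GPU inference.
import torch
import torch.multiprocessing as mp
from transformers import AutoModelForCausalLM, AutoTokenizer

def worker(rank: int, prompt: str) -> None:
    tok = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")
    model = AutoModelForCausalLM.from_pretrained(
        "mosaicml/mpt-7b", torch_dtype=torch.float16,
        trust_remote_code=True).to("cuda:0")
    out = model.generate(**tok(prompt, return_tensors="pt").to("cuda:0"),
                         max_new_tokens=32)
    print(rank, tok.decode(out[0]))

if __name__ == "__main__":
    mp.spawn(worker, args=("Hello",), nprocs=4)  # 4 processes, 4 instances
```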

question

This PR adds torch 2.0-based tensor parallel support for the FFN block. It's ported over from https://github.com/mosaicml/examples/pull/255. Currently the trained weights don't match between parallel/no-parallel versions even in a...
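For context, a minimal sketch of how torch 2.0's prototype tensor-parallel API shards a two-layer FFN across GPUs; the layer names (`up_proj`/`down_proj`) are hypothetical stand-ins, not the PR's actual module names.

```python
# Hedged sketch using the torch 2.0 prototype TP API.
# Assumes torch.distributed is already initialized across the GPUs.
import torch
import torch.nn as nn
from torch.distributed._tensor import DeviceMesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel, RowwiseParallel, parallelize_module)

class FFN(nn.Module):
    def __init__(self, d_model: int = 1024, expansion: int = 4):
        super().__init__()
        self.up_proj = nn.Linear(d_model, expansion * d_model)
        self.down_proj = nn.Linear(expansion * d_model, d_model)

    def forward(self, x):
        return self.down_proj(torch.relu(self.up_proj(x)))

mesh = DeviceMesh("cuda", list(range(torch.distributed.get_world_size())))
# Shard the first linear column-wise and the second row-wise, so the
# intermediate activation stays sharded and only one all-reduce is needed.
ffn = parallelize_module(
    FFN().cuda(), mesh,
    {"up_proj": ColwiseParallel(), "down_proj": RowwiseParallel()})
```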

This PR refactors the logging to: * centralize verbosity controls in the Python logging levels, instead of passing `verbose` arguments to many helper methods * downgrade most warnings to info...
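A sketch of the pattern the PR describes: one module-level logger whose level replaces per-function `verbose` flags (function and config names here are hypothetical).

```python
import logging

log = logging.getLogger(__name__)

def build_dataloader(cfg):
    # Previously gated behind a `verbose` argument; now surfaced only
    # when the caller opts into INFO-level verbosity.
    log.info("building dataloader with cfg=%s", cfg)

logging.basicConfig(level=logging.INFO)  # verbosity set once, by the caller
build_dataloader({"batch_size": 8})
```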

Hi MosaicML. AutoGPTQ is a package that aims to provide support for quantizing various LLMs. However, doing so has a few requirements. Here are a few issues: - MPTForCausalLM...
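For reference, a hedged sketch of the generic AutoGPTQ flow the issue is asking to support for MPT; at the time of the issue, MPTForCausalLM was not yet handled, so read this as the intended usage rather than a working recipe.

```python
# Sketch of the standard AutoGPTQ quantization flow (assumed to be
# the target workflow; MPT support is what the issue is requesting).
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

name = "mosaicml/mpt-7b"
tok = AutoTokenizer.from_pretrained(name)
examples = [tok("Quantization calibration text.", return_tensors="pt")]

model = AutoGPTQForCausalLM.from_pretrained(
    name, BaseQuantizeConfig(bits=4, group_size=128),
    trust_remote_code=True)
model.quantize(examples)          # calibrate and pack 4-bit weights
model.save_quantized("mpt-7b-4bit")
```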

Got this error while finetuning the instruct model on an 8xA100 machine: ``` ERROR: expected to be in states [] but current state is TrainingState_.BACKWARD_PRE File "/usr/local/lib/python3.10/dist-packages/composer/core/engine.py", line 526, in _close callback.close(state, logger)...

I got the error below when finetuning with [mpt-7b_dolly_sft.yaml](https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/yamls/finetune/mpt-7b_dolly_sft.yaml) Dataset: [mosaicml/dolly_hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) ``` [E ProcessGroupNCCL.cpp:828] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, Timeout(ms)=600000) ran for 606142 milliseconds before timing out....
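A common mitigation for NCCL collective timeouts like the one above (the 600000 ms in the log corresponds to a 10-minute limit) is to raise the process-group timeout; a sketch with plain torch.distributed, assuming this is what the training setup would need:

```python
# Raise the NCCL process-group timeout so long-running collectives
# (e.g. a slow rank-0 broadcast of a large dataset) do not trip the
# watchdog. The 2-hour value is an illustrative choice, not a fix
# confirmed by the issue.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```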

Hi, I am trying to finetune the MPT-7B model using a local dataset on two A100 80GB GPUs. Below is the complete log. Torch version: 1.13.1+cu117. Appreciate any help...

Hi, I am trying to reproduce your zero-shot evals from Table 1 in the blog: https://www.mosaicml.com/blog/mpt-7b but the numbers I am seeing are much worse than the ones reported...

# G-Eval This is an implementation of G-Eval, which uses GPT-4 to judge model outputs without a ground truth (some groups have started using GPT-3.5 as a...
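A minimal sketch of the G-Eval idea: ask GPT-4 to score an output against criteria instead of comparing to a gold answer. The prompt wording and 1-10 scale are assumptions, not the PR's actual implementation, and the snippet uses the older openai<1.0 client.

```python
# Hypothetical LLM-as-judge call illustrating the G-Eval approach.
import openai

def judge(prompt: str, output: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (f"Rate the response to the prompt on a 1-10 scale "
                        f"for helpfulness and correctness.\n"
                        f"Prompt: {prompt}\nResponse: {output}\nScore:")
        }])
    return resp["choices"][0]["message"]["content"]
```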