llm-foundry
mpt-7b generates no output when finetuned with 2.0.1 based llm-foundry, works with 1.13.1
Environment
I'm using the Docker images for llm-foundry, training on 8xA100:
mosaicml/llm-foundry:1.13.1_cu117-latest
mosaicml/llm-foundry:2.0.1_cu118-latest
Collecting system information...
System Environment Report
Created: 2023-06-24 17:52:44 UTC
PyTorch information
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.31
Python version: 3.10.12 (main, Jun 7 2023, 12:45:35) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-67-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB
Nvidia driver version: 525.85.12
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.13.1+cu117
[pip3] torch-optimizer==0.3.0
[pip3] torchmetrics==0.11.3
[pip3] torchtext==0.14.1
[pip3] torchvision==0.14.1+cu117
[conda] Could not collect
Composer information
Composer version: 0.14.1
Composer commit hash: None
Host processor model name: AMD EPYC 7J13 64-Core Processor
Host processor core count: 124
Number of nodes: 1
Accelerator model name: NVIDIA A100-SXM4-40GB
Accelerators per node: 1
CUDA Device Count: 8
To reproduce
1. Fine-tune an mpt-7b-instruct model using the 2.0.1-based llm-foundry.
2. Convert it to HuggingFace format using the conversion script (as per the instructions in the llm-foundry README).
3. Attempt inference with the HuggingFace inference script (as per the instructions in the llm-foundry README); a quick verification sketch for this step is shown below.
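For reference, a minimal way to check step 3 independently of hf_generate.py is to load the converted checkpoint directly with transformers and count the generated tokens. This is only a sketch: the model path and prompt are placeholders, and trust_remote_code=True is assumed because MPT ships custom model code.

# Sketch with placeholder paths: load the converted HF checkpoint and check
# whether generate() returns any new tokens at all.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/var/tmp/converted-hf-model"  # placeholder for the converted checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda").eval()

inputs = tokenizer("Summarize average of field cpu_load", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=100, do_sample=False)

new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(f"generated {len(new_tokens)} new tokens")
print(tokenizer.decode(new_tokens, skip_special_tokens=True))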
Expected behavior
Inference comes back empty on any model fine-tuned using the 2.0.1-based llm-foundry container; I expect it to produce output. The same process works fine when fine-tuning with the PyTorch 1.13.1-based container.
Additional context
It doesn't seem to matter which version of torch is used to convert the data or which version is used to run the inference; the version used for fine-tuning appears to be the determining factor.
FWIW, the cross-entropy and perplexity values in the training run look reasonable with either version of the container, and the validation inference during training seems to work, so the problem might be somewhere in how the model checkpoint is saved out. Not 100% sure here.
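If the problem is in how the checkpoint is saved, it should be visible on disk. A rough way to check (sketch only, placeholder paths) is to load each converted checkpoint and scan its weights for NaN/inf or all-zero tensors; a degenerate 2.0.1-trained checkpoint would point at the save/convert path rather than at generation.

# Sketch with placeholder paths: scan a converted HF checkpoint for
# NaN/inf or all-zero weight tensors.
import torch
from transformers import AutoModelForCausalLM

def scan_checkpoint(path):
    model = AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True)
    for name, param in model.state_dict().items():
        t = param.float()
        if torch.isnan(t).any() or torch.isinf(t).any():
            print(f"{path}: {name} contains NaN/inf")
        elif t.abs().max() == 0:
            print(f"{path}: {name} is all zeros")

scan_checkpoint("/var/tmp/model-finetuned-with-1.13.1")  # placeholder
scan_checkpoint("/var/tmp/model-finetuned-with-2.0.1")   # placeholder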
Interesting. Thanks for bringing this to our attention. Would it be possible to capture the hf_generate call you're making and the empty outputs so we can better understand exactly what's happening and what you're seeing?
Unfortunately, I've deleted the non-working models. The inference is bog standard, though, and the output is just the empty string.
python inference/hf_generate.py --name_or_path /var/tmp/o11y-hf-busted --max_new_tokens 400 --prompts "Summarize average of field cpu_load"
A possibly related problem:
I installed PyTorch 2.0.1 with CUDA 11.8 support on an H100 (80GB) Ubuntu 20.04 instance.
I ran the following inference request on mpt-30b-instruct.
(I ran this on a bare instance, where I had run pip install -e '.[gpu]' from llm-foundry, not the container instance.)
python inference/hf_generate.py -n 'mosaicml/mpt-30b-instruct' --prompts 'How do I summarize metrics?' --autocast_dtype=bf16 --model_dtype=bf16 --max_new_tokens=400
It generates only empty output.
2023-06-28 01:23:07.605653 Generating responses...
2023-06-28 01:23:07.645752 ####################################################################################################
How do I summarize metrics?
####################################################################################################
inference/hf_generate.py:338: RuntimeWarning: divide by zero encountered in divide
latency_per_output_token = total_latency / total_output_tokens
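The RuntimeWarning above is consistent with the model emitting zero new tokens (total_output_tokens ends up 0), which matches the empty response. As an aside, a small guard like the following hypothetical helper (not the actual hf_generate.py code) would surface that case explicitly instead of dividing by zero:

# Hypothetical helper: report an empty generation explicitly rather than
# hitting a divide-by-zero in the per-token latency math.
def latency_per_output_token(total_latency: float, total_output_tokens: int) -> float:
    if total_output_tokens == 0:
        print("WARNING: 0 output tokens generated; generation came back empty")
        return float("inf")
    return total_latency / total_output_tokens

# Example: the run above produced no new tokens at all.
print(latency_per_output_token(0.04, 0))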
ubuntu@mpt-30b:~$ composer_collect_env
2023-06-28 01:26:56.142020: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-28 01:26:56.295774: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: 209-20-157-54
Device name: mlx5_0
Device vendor ID: 0x02c9
Device vendor part ID: 4122
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: 209-20-157-54
Local device: mlx5_0
Local port: 1
CPCs attempted: udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: 209-20-157-54
Location: mtl_ofi_component.c:610
Error: No data available (61)
--------------------------------------------------------------------------
Collecting system information...
---------------------------------
System Environment Report
Created: 2023-06-28 01:26:59 UTC
---------------------------------
PyTorch information
-------------------
PyTorch version: 2.0.1+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.31
Python version: 3.8.10 (default, May 26 2023, 14:05:08) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-73-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA H100 PCIe
Nvidia driver version: 525.105.17
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 57 bits virtual
CPU(s): 26
On-line CPU(s) list: 0-25
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 26
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Platinum 8480+
Stepping: 8
CPU MHz: 2000.000
BogoMIPS: 4000.00
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 832 KiB
L1i cache: 832 KiB
L2 cache: 104 MiB
L3 cache: 416 MiB
NUMA node0 CPU(s): 0-25
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Unknown: No mitigations
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk avx512_fp16 arch_capabilities
Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] pytorch-ranger==0.1.1
[pip3] torch==2.0.1+cu118
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==2.0.2+cu118
[pip3] torchdata==0.6.1
[pip3] torchmetrics==0.11.3
[pip3] torchtext==0.15.2
[pip3] torchvision==0.15.2+cu118
[conda] Could not collect
Composer information
--------------------
Composer version: 0.14.1
Composer commit hash: None
Host processor model name: Intel(R) Xeon(R) Platinum 8480+
Host processor core count: 26
Number of nodes: 1
Accelerator model name: NVIDIA H100 PCIe
Accelerators per node: 1
CUDA Device Count: 1
Closing as we haven't had any more reports of this and are unable to repro. Please feel free to open a new issue if you are still encountering it!