Sam Foreman

Results 8 issues of Sam Foreman

Updates `deepspeed/monitor/monitor.py` to instantiate objects with the correct configs. Specifically, this fixes an issue when trying to use W&B:

```Shell
  File "/soft/datascience/conda/2023-01-10/mconda3/lib/python3.10/site-packages/deepspeed/monitor/wandb.py", line 14, in __init__
    self.group = wandb_config.group
AttributeError: 'CSVConfig' object has...
```
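The failure mode in the traceback can be reproduced in isolation. The sketch below is not DeepSpeed's code: `WandbConfig`, `CSVConfig`, `MonitorConfig`, `WandbMonitor`, and `build_wandb_monitor` are hypothetical stand-ins, and only the `self.group = wandb_config.group` line mirrors the traceback. It shows how handing a monitor the wrong sub-config raises an `AttributeError`, and how passing the matching one avoids it.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins for illustration; not DeepSpeed's actual classes.
@dataclass
class WandbConfig:
    enabled: bool = False
    group: str = ""
    project: str = ""


@dataclass
class CSVConfig:
    enabled: bool = False
    output_path: str = ""


@dataclass
class MonitorConfig:
    wandb: WandbConfig = field(default_factory=WandbConfig)
    csv_monitor: CSVConfig = field(default_factory=CSVConfig)


class WandbMonitor:
    def __init__(self, wandb_config):
        # Mirrors the line in the traceback: fails if the config has no `group`.
        self.group = wandb_config.group


def build_wandb_monitor(config: MonitorConfig) -> WandbMonitor:
    # Pass the sub-config that matches the monitor being built.
    return WandbMonitor(config.wandb)


config = MonitorConfig(wandb=WandbConfig(enabled=True, group="my-group"))

try:
    WandbMonitor(config.csv_monitor)  # wrong sub-config: CSVConfig has no `group`
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

monitor = build_wandb_monitor(config)
```

Keeping each monitor's construction next to the sub-config it consumes makes this kind of mix-up easier to spot in review.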

Hello,

When trying to enable W&B monitoring (as shown below in the snippet from my `ds_config.json`):

```json
"wandb": {
  "enabled": True,
  "project": projectName,
  "group": groupName
},
```

I get the...
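The `True` and the bare names `projectName` and `groupName` in the snippet are Python syntax rather than literal JSON, which suggests the config was assembled in Python. A minimal sketch of that, assuming the config is built as a `dict` and then serialized: the key names come from the snippet, while the project and group values are placeholders.

```python
import json

# Placeholder values; the issue does not say what `projectName` and
# `groupName` were set to.
project_name = "my-project"
group_name = "my-group"

ds_config = {
    "wandb": {
        "enabled": True,  # Python bool; json.dumps writes it as lowercase `true`
        "project": project_name,
        "group": group_name,
    },
}

# Serialize to the text that would be written to `ds_config.json`.
serialized = json.dumps(ds_config, indent=2)
print(serialized)
```

Serializing through `json.dumps` rather than writing the file by hand avoids mixing Python and JSON literals.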

I'm not sure of the cause, but when trying to run multi-node training (launched with [mpich](https://www.mpich.org/)), I get the following error:

```bash
  File "/lus/grand/projects/datascience/foremans/locations/polaris/projects/saforem2/Megatron-DeepSpeed/dist.py", line 106, in init_deepspeed
    deepspeed.init_distributed()
  File "/lus/grand/projects/datascience/foremans/locations/polaris/miniconda3/envs/2022-09-08-hvd-nccl/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 646,...
```

Explicitly:

- Use [`Hydra`](https://hydra.cc) for all aspects of configuration
- Modularize and move source code into `src/ngpt`
- Add `pyproject.toml`
- Add (Google Colab compatible) self-contained notebooks for training various...

Issue coming from: https://github.com/intel/intel-extension-for-pytorch/blob/a7f9edebd5fc102a7f290613987c380668d2a297/intel_extension_for_pytorch/__init__.py#L36

Trying:

```python
>>> from intel_extension_for_transformers.transformers import ViTImageProcessor
>>> processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
/lus/gila/projects/Aurora_deployment/foremans/locations/sunspot/projects/saforem2/stormer-dev/venvs/sunspot/q4-drop/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image...
```

Bug
CPU

I was seeing:

- `ModuleNotFoundError` in `components/checkpoint.py`:

  ```python
  Traceback (most recent call last):
    File "/opt/aurora/25.190.0/frameworks/aurora_frameworks-2025.2.0/lib/python3.10/runpy.py", line 196, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "/opt/aurora/25.190.0/frameworks/aurora_frameworks-2025.2.0/lib/python3.10/runpy.py", line 86, in _run_code
      exec(code,...
  ```

CLA Signed

For whatever reason, it seems like this commit:

- https://github.com/intel/torch-ccl/commit/c27ded5190a6b115ec68c7a8c28f40cfe7f0a32a:

  ```diff
  diff --git a/version.txt b/version.txt
  index feb74ff..994e3f7 100644
  --- a/version.txt
  +++ b/version.txt
  @@ -1 +1 @@
  -2.7.0+xpu
  +2.8.0+xpu
  ```

never...