Feature/group offload pinning
What does this PR do?
Fixes #11966
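For context, group offloading is enabled through the existing `enable_group_offload` API on diffusers models. The sketch below shows baseline usage only; the checkpoint and devices are placeholders, and the CPU-pinning behaviour this PR adds is not shown because its exact option is not spelled out in this description.

```python
import torch
from diffusers import AutoencoderKL

# Baseline group offloading usage that this PR builds on; the checkpoint and
# devices below are illustrative placeholders, not part of the PR itself.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)

# Parameters stay on the offload device and are moved to the onload device in
# groups right before they are needed; `use_stream=True` overlaps transfers
# with compute on CUDA devices.
vae.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
)
```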
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline?
- [x] Did you read our philosophy doc (important for complex PRs)?
- [x] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@sayakpaul
Thanks for your PR. However, it's being worked on in https://github.com/huggingface/diffusers/pull/12721.
Could we resolve conflicts so that it's a bit easier to review? Seems like there's some overlap from https://github.com/huggingface/diffusers/pull/12692.
Done! Rebased on latest main and resolved conflicts with #12692. Should be much cleaner to review now.
Thank you for the initial comment! We are working on the solutions right now
@bot /style
Style bot fixed some files and pushed the changes.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@sayakpaul need help with approval
@sayakpaul On my local device there are still a few failing tests in tests/models/autoencoders/test_models_autoencoder_kl.py: a safetensors I/O serialization error and a small numerical difference in test_output_pretrained. I have pulled the latest diffusers main and the same errors persist, so we would like to check whether they also appear in the GitHub workflow test environment.
We have fixed the previous error in tests/models/autoencoders/test_models_autoencoder_wan.py from the last GitHub workflow run.
Thanks!
Could you list the failures you are seeing?
These were the error logs:
_____________________________________________________ AutoencoderKLTests.test_layerwise_casting_memory _____________________________________________________
self = <tests.models.autoencoders.test_models_autoencoder_kl.AutoencoderKLTests testMethod=test_layerwise_casting_memory>
    @require_torch_accelerator
    @torch.no_grad()
    def test_layerwise_casting_memory(self):
        MB_TOLERANCE = 0.2
        LEAST_COMPUTE_CAPABILITY = 8.0

        def reset_memory_stats():
            gc.collect()
            backend_synchronize(torch_device)
            backend_empty_cache(torch_device)
            backend_reset_peak_memory_stats(torch_device)

        def get_memory_usage(storage_dtype, compute_dtype):
            torch.manual_seed(0)
            config, inputs_dict = self.prepare_init_args_and_inputs_for_common()
            inputs_dict = cast_maybe_tensor_dtype(inputs_dict, torch.float32, compute_dtype)
            model = self.model_class(**config).eval()
            model = model.to(torch_device, dtype=compute_dtype)
            model.enable_layerwise_casting(storage_dtype=storage_dtype, compute_dtype=compute_dtype)

            reset_memory_stats()
            model(**inputs_dict)
            model_memory_footprint = model.get_memory_footprint()
            peak_inference_memory_allocated_mb = backend_max_memory_allocated(torch_device) / 1024**2

            return model_memory_footprint, peak_inference_memory_allocated_mb

        fp32_memory_footprint, fp32_max_memory = get_memory_usage(torch.float32, torch.float32)
        fp8_e4m3_fp32_memory_footprint, fp8_e4m3_fp32_max_memory = get_memory_usage(torch.float8_e4m3fn, torch.float32)
        fp8_e4m3_bf16_memory_footprint, fp8_e4m3_bf16_max_memory = get_memory_usage(
            torch.float8_e4m3fn, torch.bfloat16
        )
        compute_capability = get_torch_cuda_device_capability() if torch_device == "cuda" else None

        self.assertTrue(fp8_e4m3_bf16_memory_footprint < fp8_e4m3_fp32_memory_footprint < fp32_memory_footprint)
        # NOTE: the following assertion would fail on our CI (running Tesla T4) due to bf16 using more memory than fp32.
        # On other devices, such as DGX (Ampere) and Audace (Ada), the test passes. So, we conditionally check it.
        if compute_capability and compute_capability >= LEAST_COMPUTE_CAPABILITY:
>           self.assertTrue(fp8_e4m3_bf16_max_memory < fp8_e4m3_fp32_max_memory)
E           AssertionError: False is not true
tests\models\test_modeling_common.py:1757: AssertionError
_____________________________________________ AutoencoderKLTests.test_lora_adapter_wrong_metadata_raises_error _____________________________________________
self = <tests.models.autoencoders.test_models_autoencoder_kl.AutoencoderKLTests testMethod=test_lora_adapter_wrong_metadata_raises_error>
    @torch.no_grad()
    @unittest.skipIf(not is_peft_available(), "Only with PEFT")
    def test_lora_adapter_wrong_metadata_raises_error(self):
        from peft import LoraConfig

        from diffusers.loaders.lora_base import LORA_ADAPTER_METADATA_KEY
        from diffusers.loaders.peft import PeftAdapterMixin

        init_dict, _ = self.prepare_init_args_and_inputs_for_common()
        model = self.model_class(**init_dict).to(torch_device)

        if not issubclass(model.__class__, PeftAdapterMixin):
            pytest.skip(f"PEFT is not supported for this model ({model.__class__.__name__}).")

        denoiser_lora_config = LoraConfig(
            r=4,
            lora_alpha=4,
            target_modules=["to_q", "to_k", "to_v", "to_out.0"],
            init_lora_weights=False,
            use_dora=False,
        )
        model.add_adapter(denoiser_lora_config)
        self.assertTrue(check_if_lora_correctly_set(model), "LoRA layers not set correctly")

        with tempfile.TemporaryDirectory() as tmpdir:
            model.save_lora_adapter(tmpdir)
            model_file = os.path.join(tmpdir, "pytorch_lora_weights.safetensors")
            self.assertTrue(os.path.isfile(model_file))

            # Perturb the metadata in the state dict.
            loaded_state_dict = safetensors.torch.load_file(model_file)
            metadata = {"format": "pt"}
            lora_adapter_metadata = denoiser_lora_config.to_dict()
            lora_adapter_metadata.update({"foo": 1, "bar": 2})
            for key, value in lora_adapter_metadata.items():
                if isinstance(value, set):
                    lora_adapter_metadata[key] = list(value)

            metadata[LORA_ADAPTER_METADATA_KEY] = json.dumps(lora_adapter_metadata, indent=2, sort_keys=True)

>           safetensors.torch.save_file(loaded_state_dict, model_file, metadata=metadata)
tests\models\test_modeling_common.py:1315:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tensors = {'decoder.mid_block.attentions.0.to_k.lora_A.weight': tensor([[-0.0836, 0.4591, -0.4989, -0.0175],
[-0.2300, ..., 0.2900, 0.3015],
[ 0.2199, 0.0162, -0.3994, -0.0383],
[ 0.2069, 0.4327, -0.3422, -0.0724]]), ...}
filename = 'C:\\Users\\Bryan\\AppData\\Local\\Temp\\tmpbwk5t9sf\\pytorch_lora_weights.safetensors'
metadata = {'format': 'pt', 'lora_adapter_metadata': '{\n "alora_invocation_tokens": null,\n "alpha_pattern": {},\n "arrow_con...e": null,\n "trainable_token_indices": null,\n "use_dora": false,\n "use_qalora": false,\n "use_rslora": false\n}'}
def save_file(
    tensors: Dict[str, torch.Tensor],
    filename: Union[str, os.PathLike],
    metadata: Optional[Dict[str, str]] = None,
):
    """
    Saves a dictionary of tensors into raw bytes in safetensors format.

    Args:
        tensors (`Dict[str, torch.Tensor]`):
            The incoming tensors. Tensors need to be contiguous and dense.
        filename (`str`, or `os.PathLike`)):
            The filename we're saving into.
        metadata (`Dict[str, str]`, *optional*, defaults to `None`):
            Optional text only metadata you might want to save in your header.
            For instance it can be useful to specify more about the underlying
            tensors. This is purely informative and does not affect tensor loading.

    Returns:
        `None`

    Example:

    ```python
    from safetensors.torch import save_file
    import torch

    tensors = {"embedding": torch.zeros((512, 1024)), "attention": torch.zeros((256, 256))}
    save_file(tensors, "model.safetensors")
    ```
    """
>   serialize_file(_flatten(tensors), filename, metadata=metadata)
E safetensors_rust.SafetensorError: Error while serializing: I/O error: The requested operation cannot be performed on a file with a user-mapped section open. (os error 1224)
C:\Users\Bryan\miniconda3\envs\diffusers_contrib\lib\site-packages\safetensors\torch.py:307: SafetensorError
________________________________________________________ AutoencoderKLTests.test_output_pretrained _________________________________________________________
self = <tests.models.autoencoders.test_models_autoencoder_kl.AutoencoderKLTests testMethod=test_output_pretrained>
    def test_output_pretrained(self):
        model = AutoencoderKL.from_pretrained("fusing/autoencoder-kl-dummy")
        model = model.to(torch_device)
        model.eval()

        # Keep generator on CPU for non-CUDA devices to compare outputs with CPU result tensors
        generator_device = "cpu" if not torch_device.startswith(torch_device) else torch_device
        if torch_device != "mps":
            generator = torch.Generator(device=generator_device).manual_seed(0)
        else:
            generator = torch.manual_seed(0)

        image = torch.randn(
            1,
            model.config.in_channels,
            model.config.sample_size,
            model.config.sample_size,
            generator=torch.manual_seed(0),
        )
        image = image.to(torch_device)
        with torch.no_grad():
            output = model(image, sample_posterior=True, generator=generator).sample

        output_slice = output[0, -1, -3:, -3:].flatten().cpu()

        # Since the VAE Gaussian prior's generator is seeded on the appropriate device,
        # the expected output slices are not the same for CPU and GPU.
        if torch_device == "mps":
            expected_output_slice = torch.tensor(
                [
                    -4.0078e-01,
                    -3.8323e-04,
                    -1.2681e-01,
                    -1.1462e-01,
                    2.0095e-01,
                    1.0893e-01,
                    -8.8247e-02,
                    -3.0361e-01,
                    -9.8644e-03,
                ]
            )
        elif generator_device == "cpu":
            expected_output_slice = torch.tensor(
                [
                    -0.1352,
                    0.0878,
                    0.0419,
                    -0.0818,
                    -0.1069,
                    0.0688,
                    -0.1458,
                    -0.4446,
                    -0.0026,
                ]
            )
        else:
            expected_output_slice = torch.tensor(
                [
                    -0.2421,
                    0.4642,
                    0.2507,
                    -0.0438,
                    0.0682,
                    0.3160,
                    -0.2018,
                    -0.0727,
                    0.2485,
                ]
            )

>       self.assertTrue(torch_all_close(output_slice, expected_output_slice, rtol=1e-2))
tests\models\autoencoders\test_models_autoencoder_kl.py:171:
E AssertionError: Max diff is absolute 0.000513467937707901. Diff tensor is tensor([4.3437e-05, 6.3509e-05, 2.4503e-04, 5.1347e-04, 3.8743e-05, 2.0981e-04,
E 1.7959e-04, 1.4303e-04, 2.2203e-06]).
tests\testing_utils.py:129: AssertionError
------------------------------------------------------------------- Captured stderr call -------------------------------------------------------------------
An error occurred while trying to fetch fusing/autoencoder-kl-dummy: fusing/autoencoder-kl-dummy does not appear to have a file named diffusion_pytorch_model.safetensors.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
================================================================= short test summary info ==================================================================
FAILED tests/models/autoencoders/test_models_autoencoder_kl.py::AutoencoderKLTests::test_layerwise_casting_memory - AssertionError: False is not true
FAILED tests/models/autoencoders/test_models_autoencoder_kl.py::AutoencoderKLTests::test_lora_adapter_wrong_metadata_raises_error - safetensors_rust.SafetensorError: Error while serializing: I/O error: The requested operation cannot be performed on a file with a user-mapped section o...
FAILED tests/models/autoencoders/test_models_autoencoder_kl.py::AutoencoderKLTests::test_output_pretrained - AssertionError: Max diff is absolute 0.000513467937707901. Diff tensor is tensor([4.3437e-05, 6.3509e-05, 2.4503e-04, 5.1347e-04, 3.8743e-05, 2.0981e-04,
======================================================== 3 failed, 50 passed, 22 skipped in 19.32s =========================================================
The three failures are a safetensors I/O serialization error, a memory-check assertion (the comments in the test note that it usually passes on Ampere and Ada hardware, neither of which is my current environment), and a slight numerical difference in test_output_pretrained.
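As a side note on the safetensors failure: the error message points at a Windows-specific memory-mapping conflict rather than anything in this branch. `load_file` can keep the file memory-mapped while the returned tensors are alive, and Windows refuses to overwrite a file that still has a mapped section open (os error 1224). Below is a minimal sketch of that pattern under those assumptions; the path and the clone-based workaround are illustrative, not the test code itself.

```python
import torch
import safetensors.torch

# Illustrative path only; this mirrors the save -> load -> re-save pattern in
# test_lora_adapter_wrong_metadata_raises_error.
path = "pytorch_lora_weights.safetensors"
safetensors.torch.save_file({"w": torch.zeros(2, 2)}, path)

# On Windows, the loaded tensors may be backed by a memory map of `path`.
state_dict = safetensors.torch.load_file(path)

# Overwriting `path` while that mapping is still open can raise os error 1224.
# Cloning the tensors (or writing to a fresh path) is a common way to avoid
# the conflict before rewriting the same file.
state_dict = {k: v.clone() for k, v in state_dict.items()}
safetensors.torch.save_file(state_dict, path, metadata={"format": "pt"})
```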
@sayakpaul Also, the current checks show a coding style error. Can you help us run the automatic style correction?
@bot /style
Style fix is beginning .... View the workflow run here.
The style bot cannot automatically do it. See: https://github.com/huggingface/diffusers/actions/runs/20191712208/job/57970027213
I would recommend the following:
- Create a fresh Python env.
- Run `pip install -e ".[style]"` from the root of the repository directory.
- Run `make style && make quality`.
Thanks for the pointer @sayakpaul
@Aki-07 @bconstantine I ran those failing tests on my end with this branch and also on main. I didn't notice any failures.
@sayakpaul Thank you for testing! Glad to hear there are no failures on your end.
Hey @sayakpaul, the WanVACE LoRA failures came from the hook offloading the module immediately when it was attached: it saved the weights before the LoRA was added and put them back later, so the adapters never took effect. I removed that eager offload so that the first offload happens after the adapters are loaded. We would need your help to re-run the pipelines.
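To make the intent of the change concrete, here is an illustrative sketch (not the actual diffusers hook code) of the difference between eagerly offloading on attach and deferring the first offload until the module runs:

```python
import torch


class LazyOffloadHookSketch:
    """Illustrative only, not the diffusers implementation: the first offload
    is deferred until the module actually runs, so weights added after the
    hook is attached (e.g. LoRA adapters) are included in the offload cycle."""

    def __init__(self, onload_device="cuda", offload_device="cpu"):
        self.onload_device = torch.device(onload_device)
        self.offload_device = torch.device(offload_device)

    def attach(self, module: torch.nn.Module) -> torch.nn.Module:
        # The buggy behaviour was equivalent to offloading right here, which
        # snapshots the weights before any adapters exist. The fix keeps
        # attachment passive.
        return module

    def pre_forward(self, module, *args, **kwargs):
        # Weights are brought onto the accelerator lazily, at first use.
        module.to(self.onload_device)
        return args, kwargs

    def post_forward(self, module, output):
        # The first real offload now happens only after a forward pass,
        # i.e. after adapters have been loaded.
        module.to(self.offload_device)
        return output
```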
Hi @DN6 @sayakpaul We’ve updated the fix according to the review. Could you take a quick look and share any feedback when you have a moment? Thank you in advance!
Hey @DN6 @sayakpaul, as mentioned above, we have addressed the review comments. Could you guide us through the next steps?
Hey @sayakpaul, I noticed the diff is confusing because the branch history got complicated after the earlier force-push and imperfect rebase (around #12692), and some commits no longer line up cleanly with main for both the tests and the hooks. To make the review straightforward and avoid further churn, I can open a fresh PR off the current main that contains only the intended final changes and link it here. Would that be okay?