Jiani Wang
> I think I may have found the reason for this. For the last step in each training job, the loss seems to be incorrect, at least in the plot....
> We would expect the lr to increase in each step by 0.0008/5 = 0.00016

I tried to reproduce with the `main` branch. Here's my setting: ``` [optimizer] name = "AdamW"...
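For reference, here is a minimal standalone sketch of that arithmetic (this is not torchtitan's actual scheduler; the model, optimizer, and `warmup_steps` below are illustrative assumptions) showing why a linear warmup to lr = 8e-4 over 5 steps should raise the lr by 8e-4 / 5 = 1.6e-4 per step:

```python
import torch

model = torch.nn.Linear(8, 8)
max_lr, warmup_steps = 8e-4, 5
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr)

# LambdaLR scales the base lr by the returned factor: ramp linearly from
# 1/warmup_steps up to 1.0, then stay flat.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min((step + 1) / warmup_steps, 1.0)
)

for step in range(8):
    optimizer.step()  # normally preceded by a forward/backward pass
    print(step, scheduler.get_last_lr()[0])
    scheduler.step()
# Expected: 0.00016, 0.00032, 0.00048, 0.00064, 0.0008, then constant 0.0008
```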
> @wwwjn Have you encountered this issue when running on the devgpu? Landing the PR looks not harmful but wants to understand why this is required specific to Flux encoder....
Hi @cli99, I want to follow up on this PR, and thanks for contributing! I haven't run into this issue during my training. Do you know under what circumstances the tensor...
Closing this PR because I cannot reproduce the issue; ignoring it for now.
Nice catch! LGTM, please sign the CLA so the PR can be processed.
@CarlosGomes98 one quick note: `flux-train` is a little behind the main branch, so let's just address the comments and create a PR against the main branch instead.
> > By turning off classifier-free guidance (in the dataloader), eval steps, and loading from the downloaded dataset, the hash of each batch is identical across different runs. The issue is around deterministic...
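For anyone trying to reproduce this, a hedged sketch of the kind of batch-hashing check described above (the helper name and batch layout are hypothetical, not taken from the torchtitan code):

```python
import hashlib
import torch

def batch_hash(batch: dict[str, torch.Tensor]) -> str:
    """Return a stable digest of a batch of tensors, key by key.

    Assumes CPU-convertible dtypes (e.g. int64 token ids, float32 images);
    bfloat16 tensors would need a cast or byte reinterpretation first.
    """
    h = hashlib.sha256()
    for key in sorted(batch):
        t = batch[key].detach().cpu().contiguous()
        h.update(key.encode())
        h.update(t.numpy().tobytes())
    return h.hexdigest()

# Usage idea: log batch_hash(batch) at every training step in two runs and diff
# the logs. Identical hashes mean the dataloader side is deterministic, so any
# remaining divergence must come from the model/optimizer side.
```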
cc @CarlosGomes98 @tianyu-l, here's a centralized tracker of Flux issues and next steps.
Preprocessing code is here: https://github.com/pytorch/torchtitan/tree/flux-train. The preprocessed data will take a huge amount of storage, because the generated T5 encoding for each sample is a 256 * 4096 tensor.
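As a rough back-of-the-envelope sketch (the dtype and sample counts below are assumptions for illustration, not measured numbers), the storage adds up quickly:

```python
# 256 x 4096 comes from the comment above; assume bfloat16 (2 bytes per value).
seq_len, hidden_dim, bytes_per_value = 256, 4096, 2
per_sample = seq_len * hidden_dim * bytes_per_value  # 2,097,152 bytes ~= 2 MiB

for num_samples in (100_000, 1_000_000):
    total_gib = num_samples * per_sample / 2**30
    print(f"{num_samples:>9,} samples -> ~{total_gib:,.0f} GiB")
# -> ~195 GiB for 100k samples, ~1,953 GiB (about 2 TiB) for 1M samples
```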