monaspace
Update install_linux.sh cp: cannot `stat' on './fonts/otf/*': No such file or directory
I found that I could not execute the script. After many tests of deleting the folders and creating them again, I discovered that the error was that the fonts folder could not be found, so I added one more dot to the path so that it goes up a level and finds the fonts folder.
Injecting a new line:
def on_train_epoch_end(self):
    loss = torch.stack(self.batch_losses).mean()
    self.log('loss', loss, on_step=False, on_epoch=True, prog_bar=True)
    print(f"Epoch {self.current_epoch} --> loss={loss.item()}")
    self.batch_losses.clear()
    print("")
Output will look like this:
@jojje Do you mean this approach? Also, Epoch 2 is printed twice. cc @Borda Do you consider this an issue or not? If it is, could you please share your insights so I can work on it?
Hey @jojje It could be seen as an issue or not; it depends. Fixing this might be very hard. If I interpret this correctly, it has to do with the fact that the callback hooks run before the LightningModule hooks.
If you self.log in your training step with on_epoch=True, it will work correctly.
Regarding "why does Epoch 2 show twice": it is because you have print statements, and the TQDM bar will continue to write updates to the progress bar after your prints. If you want to avoid that, use self.print(...) instead.
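For reference, a minimal sketch of those two suggestions combined (SketchNet is a stand-in module assembled for illustration, reusing the 784-to-10 linear layer from the example in this thread, not code from the actual project):

import torch
import pytorch_lightning as pl

class SketchNet(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(784, 10)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)

    def training_step(self, batch, batch_idx):
        x, y = batch
        x = x.reshape(x.size(0), -1)
        loss = torch.nn.functional.cross_entropy(self.fc(x), y)
        # on_epoch=True lets Lightning accumulate and log the epoch mean itself
        self.log('loss', loss, on_step=False, on_epoch=True, prog_bar=True)
        return loss

    def on_train_epoch_end(self):
        # self.print (suggested above) avoids the duplicated progress-bar line that plain print causes
        self.print("")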
@awaelchli I tried the two changes you proposed. They solved the "off by one" problem, but at the cost of a performance hit. They also don't solve the problem of the individual epoch progress bars vanishing, which causes data loss in the console output.
Change:
@@ -7,5 +7,4 @@ class DemoNet(pl.LightningModule):
         super().__init__()
         self.fc = torch.nn.Linear(784, 10)
-        self.batch_losses = []

     def configure_optimizers(self):
@@ -17,12 +16,9 @@ class DemoNet(pl.LightningModule):
         yh = self.fc(x)
         loss = torch.nn.functional.cross_entropy(yh, y)
-        self.batch_losses.append(loss)
+        self.log('loss', loss, on_step=False, on_epoch=True, prog_bar=True)
         return loss

     def on_train_epoch_end(self):
-        loss = torch.stack(self.batch_losses).mean()
-        self.log('loss', loss, on_step=False, on_epoch=True, prog_bar=True)
-        self.batch_losses.clear()
-        print("")
+        self.print("")

 ds = torchvision.datasets.MNIST(root="dataset/", train=True, transform=torchvision.transforms.ToTensor(), download=True)
Resulting output:
Epoch 2: 100%|███████| 938/938 [00:04<00:00, 217.09it/s, v_num=6, loss=0.300]`Trainer.fit` stopped: `max_epochs=3` reached.
Epoch 2: 100%|███████| 938/938 [00:04<00:00, 216.87it/s, v_num=6, loss=0.300]
As you can see,
- the previous progress bars are closed and thus not retained. That's why there are two leading blank lines.
- there is still a duplicate of the final progress bar for the third epoch (Epoch 2).
The reason why I didn't let Lightning calculate the stats automatically via the on_epoch (end) flag is that it's expensive. On my test run above, training takes a 25% performance (throughput) hit when logging on each training step with on_step=True, on_epoch=True, and about a 7% hit with on_step=False, on_epoch=True. I've researched the issues and discussion forums, and the consensus seems to be "log as little and as seldom as possible, and calculate statistics only when you need to, in order not to slow down training". That's why I'm performing the cheapest operation possible in the training step, just storing the losses, and then at the end of the epoch doing the expensive tensor creation, mean calculation, and logging, since it's only at the end of the epoch that it's relevant to log the loss for the epoch. I'm simply trying to find a near "zero cost" stats logging solution here that keeps the training observability ergonomics from our pure PyTorch training loops.
Right now I'm just in an evaluation phase seeing if Lightning might be something we can use going forward, but these initial 101 training ergonomics have put such notions on ice. I like the idea of bringing more structure to training, but can unfortunately not sell the idea of a new framework without even the basics being handled correctly, so that's why I opened this issue. I look forward to hearing further suggestions on how to leverage lightning correctly, so as to pass the initial sniff test ;)
To reiterate the composite objective:
- Log loss or any other statistic at the end of each epoch.
- Retain the progress bar and statistics for each epoch.
- Avoid incurring significant training slowdown due to logging.
Update: a workaround that makes Lightning log as expected:
import torch
import torchvision
import pytorch_lightning as pl
from pytorch_lightning.callbacks.progress.tqdm_progress import TQDMProgressBar

class LitProgressBar(TQDMProgressBar):
    def on_train_end(self, *_):
        # self.train_progress_bar.close()
        pass

    def on_validation_end(self, trainer, pl_module):
        # self.val_progress_bar.close()
        self.reset_dataloader_idx_tracker()
        if self._train_progress_bar is not None and trainer.state.fn == "fit":
            self.train_progress_bar.set_postfix(self.get_metrics(trainer, pl_module))

    def on_test_end(self, trainer, pl_module):
        # self.test_progress_bar.close()
        self.reset_dataloader_idx_tracker()

    def on_predict_end(self, trainer, pl_module):
        # self.predict_progress_bar.close()
        self.reset_dataloader_idx_tracker()

class DemoNet(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(784, 10)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)

    def training_step(self, batch: torch.Tensor, _):
        x, y = batch
        x = x.reshape(x.size(0), -1)
        yh = self.fc(x)
        loss = torch.nn.functional.cross_entropy(yh, y)
        self.log('loss', loss, on_step=False, on_epoch=True, prog_bar=True)
        return loss

    def on_train_epoch_end(self):
        # import ipdb; ipdb.set_trace(context=15)
        print("")
        pass

ds = torchvision.datasets.MNIST(root="dataset/", train=True, transform=torchvision.transforms.ToTensor(), download=True)
train_loader = torch.utils.data.DataLoader(dataset=ds, batch_size=64, shuffle=False)
trainer = pl.Trainer(max_epochs=3, callbacks=[LitProgressBar()])
trainer.fit(DemoNet(), train_loader)
The key bit of information here is the need to subclass the TQDMProgressBar, just to be able to disable all the hard-coded *bar.close() calls you make in the default progress bar.
It would be great if every user didn't have to deal with all that boilerplate for every project, and the TQDMProgressBar constructor instead took an optional argument such as leave: bool (same as in tqdm) that would then be checked in the code to decide whether or not to close the progress bars.
E.g.
class TQDMProgressBar(ProgressBar):
    def __init__(self, refresh_rate: int = 1, process_position: int = 0, leave: bool = False):
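As a rough illustration of how such a flag could gate those close() calls, here is a hedged sketch (not the submitted PR; LeavableProgressBar and _leave are made-up names for this example):

from pytorch_lightning.callbacks.progress.tqdm_progress import TQDMProgressBar

class LeavableProgressBar(TQDMProgressBar):
    """Sketch only: keep the finished training bar on screen when leave=True."""

    def __init__(self, refresh_rate: int = 1, process_position: int = 0, leave: bool = False):
        super().__init__(refresh_rate=refresh_rate, process_position=process_position)
        self._leave = leave

    def on_train_end(self, *args):
        # The default implementation closes (and thereby erases) the training bar;
        # only let that happen when the user did not ask to keep it.
        if not self._leave:
            super().on_train_end(*args)

With something like that built into TQDMProgressBar itself, the LitProgressBar workaround above would collapse to passing leave=True.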
A PR for discussion and review has been submitted to address this issue. If anyone has time to look at it and provide feedback, that'd be great.
Reviewer note: there was a failed test, but it seems entirely unrelated. In fact, the change was made such that there is zero change in behavior by default, and explicitly setting a new flag (which no existing tests could possibly be aware of) is required to enable the new behavior, so I don't see how this change could possibly be related to the failure of core/test_metric_result_integration.py::test_result_reduce_ddp.