
Metrics are in random order

peastman opened this issue

At some point, the metrics.csv file seems to have changed to putting the columns in a random order. Literally. As in, every training run puts them in a different order! This is confusing and makes the file hard to read.

How about sorting the columns alphabetically? That will give a consistent and logical order: epoch first, all the training losses grouped together, then all the validation losses grouped together in the same order as the training losses.

Bonus points if it can output the epoch as an integer rather than a floating point number.
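Until the ordering is fixed upstream, the columns can be sorted after the fact. A minimal stdlib sketch (the column names and file paths here are placeholders, not what TorchMD-Net actually emits):

```python
import csv

def sort_metrics_columns(rows):
    """Return fieldnames in a stable order: 'epoch' and 'step' first,
    then the remaining keys alphabetically."""
    keys = set()
    for row in rows:
        keys.update(row)
    first = [k for k in ("epoch", "step") if k in keys]
    return first + sorted(keys - set(first))

def rewrite_csv(src_path, dst_path):
    """Rewrite a metrics CSV with columns in the stable order above."""
    with open(src_path, newline="") as f:
        rows = list(csv.DictReader(f))
    with open(dst_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sort_metrics_columns(rows), restval="")
        writer.writeheader()
        writer.writerows(rows)
```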

peastman avatar Jan 18 '24 21:01 peastman

That is curious behavior indeed... Lightning takes care of writing the metrics; all TMDNet does is call log_dict on it: https://github.com/torchmd/torchmd-net/blob/2c2b5f059b89b2a2199a8ecc3bbf92707f2cabe6/torchmdnet/module.py#L229

The fact that it is random makes me think of this: https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED https://stackoverflow.com/questions/2053021/is-the-order-of-a-python-dictionary-guaranteed-over-iterations But that looks like old (pre-3.7) behavior...

This is how CSVLogger writes metrics down the line:

    def log_metrics(self, metrics_dict: Dict[str, float], step: Optional[int] = None) -> None:
        """Record metrics."""

        def _handle_value(value: Union[Tensor, Any]) -> Any:
            if isinstance(value, Tensor):
                return value.item()
            return value

        if step is None:
            step = len(self.metrics)

        metrics = {k: _handle_value(v) for k, v in metrics_dict.items()}
        metrics["step"] = step
        self.metrics.append(metrics)

    def save(self) -> None:
        """Save recorded metrics into files."""
        if not self.metrics:
            return

        new_keys = self._record_new_keys()
        file_exists = self._fs.isfile(self.metrics_file_path)

        if new_keys and file_exists:
            # we need to re-write the file if the keys (header) change
            self._rewrite_with_new_header(self.metrics_keys)

        with self._fs.open(self.metrics_file_path, mode=("a" if file_exists else "w"), newline="") as file:
            writer = csv.DictWriter(file, fieldnames=self.metrics_keys)
            if not file_exists:
                # only write the header if we're writing a fresh file
                writer.writeheader()
            writer.writerows(self.metrics)

        self.metrics = []  # reset
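Note that `csv.DictWriter` itself is deterministic: the header simply mirrors whatever order `fieldnames` is in. A quick sketch with made-up column names:

```python
import csv
import io

rows = [{"val_loss": 0.2, "train_loss": 0.3, "epoch": 1}]

def header_for(fieldnames):
    """Write rows with the given fieldnames and return the header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().splitlines()[0]

# Columns come out exactly in fieldnames order, so if metrics_keys is
# assembled in a different order on each run, the columns move -- the
# randomness must originate before the writer is invoked.
print(header_for(["epoch", "train_loss", "val_loss"]))  # epoch,train_loss,val_loss
print(header_for(["val_loss", "epoch", "train_loss"]))  # val_loss,epoch,train_loss
```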

I do not see anything that would make this random after Python 3.6.

AFAICT there is also nothing random in how LNNP constructs the metrics dict: https://github.com/torchmd/torchmd-net/blob/2c2b5f059b89b2a2199a8ecc3bbf92707f2cabe6/torchmdnet/module.py#L219-L231

RaulPPelaez avatar Jan 19 '24 10:01 RaulPPelaez

BTW I cannot think of a reason for epoch to be a float; let's change that... I will include it in #231

RaulPPelaez avatar Jan 19 '24 10:01 RaulPPelaez

Found this issue: https://github.com/Lightning-AI/pytorch-lightning/issues/18978

RaulPPelaez avatar Jan 19 '24 10:01 RaulPPelaez

Fix is already merged and will be included in lightning 2.2. So we just have to wait for this one. https://github.com/Lightning-AI/pytorch-lightning/pull/19159

RaulPPelaez avatar Jan 19 '24 10:01 RaulPPelaez

Are you storing the data in a dictionary?


giadefa avatar Jan 19 '24 10:01 giadefa

The different losses and other metadata are packed into a dictionary at the end of each epoch and sent to log_dict from lightning.

RaulPPelaez avatar Jan 19 '24 10:01 RaulPPelaez

Dictionaries preserve insertion order since Python 3.7, so that's not the issue. I guess the keys are being added to the dict in a random order.
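That matches CPython's documented behavior: a dict iterates in insertion order regardless of `PYTHONHASHSEED`, so any run-to-run variation has to come from the order in which keys are inserted. A tiny illustration (metric names are placeholders):

```python
# Dicts preserve insertion order (guaranteed since Python 3.7), so the
# same insertion sequence always yields the same key order.
metrics = {}
for name in ["train_loss_y", "val_loss_y", "epoch"]:
    metrics[name] = 0.0
print(list(metrics))  # ['train_loss_y', 'val_loss_y', 'epoch']

# If keys were instead inserted by iterating over an unordered container
# (e.g. a set, whose iteration order can depend on PYTHONHASHSEED), the
# resulting dict order would vary between runs.
```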

stefdoerr avatar Jan 19 '24 10:01 stefdoerr

The issue is explained here: https://github.com/Lightning-AI/pytorch-lightning/issues/18978 It's an internal bug in PyTorch Lightning.

RaulPPelaez avatar Jan 19 '24 10:01 RaulPPelaez

Ok, good to know.

peastman avatar Jan 19 '24 17:01 peastman

Thanks, Raul

guillemsimeon avatar Jan 19 '24 17:01 guillemsimeon