torchmd-net
torchmd-net copied to clipboard
Metrics are in random order
At some point, the metrics.csv file seems to have changed to putting the columns in a random order. Literally. As in, every training run puts them in a different order! This is confusing and makes the file hard to read.
How about sorting the columns alphabetically? That will give a consistent and logical order: epoch first, all the training losses grouped together, then all the validation losses grouped together in the same order as the training losses.
Bonus points if it can output the epoch as an integer rather than a floating point number.
That is curious behavior indeed... Lightning takes care of writing the metrics, all TMDNet does is calling log_dict from it: https://github.com/torchmd/torchmd-net/blob/2c2b5f059b89b2a2199a8ecc3bbf92707f2cabe6/torchmdnet/module.py#L229
The fact that it is random makes me think of this: https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED https://stackoverflow.com/questions/2053021/is-the-order-of-a-python-dictionary-guaranteed-over-iterations But looks like an old behavior...
This is how CSVLogger writes metrics down the line:
def log_metrics(self, metrics_dict: Dict[str, float], step: Optional[int] = None) -> None:
"""Record metrics."""
def _handle_value(value: Union[Tensor, Any]) -> Any:
if isinstance(value, Tensor):
return value.item()
return value
if step is None:
step = len(self.metrics)
metrics = {k: _handle_value(v) for k, v in metrics_dict.items()}
metrics["step"] = step
self.metrics.append(metrics)
def save(self) -> None:
"""Save recorded metrics into files."""
if not self.metrics:
return
new_keys = self._record_new_keys()
file_exists = self._fs.isfile(self.metrics_file_path)
if new_keys and file_exists:
# we need to re-write the file if the keys (header) change
self._rewrite_with_new_header(self.metrics_keys)
with self._fs.open(self.metrics_file_path, mode=("a" if file_exists else "w"), newline="") as file:
writer = csv.DictWriter(file, fieldnames=self.metrics_keys)
if not file_exists:
# only write the header if we're writing a fresh file
writer.writeheader()
writer.writerows(self.metrics)
self.metrics = [] # reset
I do not see anything that would make this random after python 3.6.
AFAICT there is also nothing random in how LNNP constructs the metrics dict: https://github.com/torchmd/torchmd-net/blob/2c2b5f059b89b2a2199a8ecc3bbf92707f2cabe6/torchmdnet/module.py#L219-L231
BTW I cannot think of a reason for epoch to be a float, lets change that... I will include it in #231
Found this issue: https://github.com/Lightning-AI/pytorch-lightning/issues/18978
Fix is already merged and will be included in lightning 2.2. So we just have to wait for this one. https://github.com/Lightning-AI/pytorch-lightning/pull/19159
Are you storing the data in a dictionary?
On Fri, Jan 19, 2024 at 11:20 AM Raul @.***> wrote:
Fix is already merged and will be included in lightning 2.2. So we just have to wait for this one. Lightning-AI/pytorch-lightning#19159 https://github.com/Lightning-AI/pytorch-lightning/pull/19159
— Reply to this email directly, view it on GitHub https://github.com/torchmd/torchmd-net/issues/255#issuecomment-1900133560, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3KUOXKS56V52OTRLPYKIDYPJCIDAVCNFSM6AAAAABCA533POVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMBQGEZTGNJWGA . You are receiving this because you are subscribed to this thread.Message ID: @.***>
The different losses and other metadata are packed into a dictionary at the end of each epoch and sent to log_dict from lightning.
Dictionaries are sorted after python 3.7 by the insertion order so it's not the issue. I guess the issue is that they are added to the dict in a random order.
The issue is explained here: https://github.com/Lightning-AI/pytorch-lightning/issues/18978 Is an internal bug in pytorch lightning
Ok, good to know.
Thanks, Raul