pytorch-lightning icon indicating copy to clipboard operation
pytorch-lightning copied to clipboard

WandbLogger : use random name instead of project as default name

Open ierezell opened this issue 1 year ago • 9 comments

Proposed refactor

Use the wandb behavior (creating a random name) if no name is passed for the run or just let it None and wandb will deal with creating the name.

Motivation

I got 20 runs with all the same name which is not easy to navigate through

Pitch

Use a random name or name with a date or anything that would distinguish two runs.

Additional context

It's here: https://github.com/Lightning-AI/lightning/blob/9d02ad761c6c9360cba96e339d340c0cc2e572a8/src/pytorch_lightning/loggers/wandb.py#L300

cc @awaelchli @morganmcg1 @borisdayma @scottire @manangoel99

ierezell avatar Aug 05 '22 19:08 ierezell

Hi, this modification was introduced in this https://github.com/Lightning-AI/lightning/pull/12604#issuecomment-1094844274, and I have also discussed the modification aspect of combining with third-party loggers at https://github.com/Lightning-AI/lightning/issues/12028#issuecomment-1088325894, and currently PL logger is not perfect.

tshu-w avatar Aug 06 '22 01:08 tshu-w

Something must've changed in 1.7.0 though. The names of the runs in the wandb dashboard are now no longer randomly generated.

kabouzeid avatar Aug 06 '22 15:08 kabouzeid

This small change has been deteriorating a simple unique identification method provided by WandB which helps distinguishing multiple runs. I understand there are bigger changes in which this fits, but I do wonder if this specific change is required? It's particularly helpful to get an automatic name generated.

kotchin avatar Aug 08 '22 09:08 kotchin

Can confirm that after updating to PL 1.7 the former random wandb run names (e.g., “noble-snowball-37”) all have the same (project) name. Not useful (for me). Since 1.7 introduced other bugs/changes with wandb, I am back to 1.6.5

hogru avatar Aug 08 '22 15:08 hogru

Hi! Engineer from W&B here. In PL 1.7 this change of using the project as a default run name instead of random wandb run names (I love those too) was added because using the default None value lead to a broken directiry structure for PL Logs. More discussion here. This is because of the interface between the PL logger and WandbLogger requires some adjustment which we are looking into it. For now, there is a workaround which is call wandb.init before your initialization of the WandbLogger. This will initialize a run with a random name and the WandbLogger will use this run automatically. For e.g.

import wandb
from pytorch_lightning.loggers import WandbLogger

wandb.init(project='my-project')
logger = WandbLogger(project='my-project')

I hope this helps.

manangoel99 avatar Aug 08 '22 15:08 manangoel99

Thank you, @manangoel99, that's good to know. For now, I stay with 1.6.5 mainly for this reason: https://github.com/Lightning-AI/lightning/issues/14021 but once it's resolved, and I adapt my code to work with some other minor changes this workaround will be useful (I had it this way anyway until recently, but I think wandb (?) issues a warning that a run is already existing. Does not hurt, just wanted to cleanse the console output.

hogru avatar Aug 08 '22 16:08 hogru

We can try to revert it but keep in mind that there are efforts in standardizing things across loggers which may contradict certain logger specific usage patterns, so this back and forth of naming conventions will (I predict) always happen as people disagree in how folders and experiments should be named.

For example, in wandb, there is a difference between a version and a run name, whereas with the tensorboard logger the version is sort of the same concept as the run name.

The best we can do is make things configurable and hope we choose good defaults. Note that this is all not at all trivial. If we did this

wandb.init(project='my-project')
logger = WandbLogger(project='my-project')

directly inside the initialization of your wandb logger, it would create an experiment with a random name (great!). But you will see N experiments created when running DDP, which is not at all desired.

I hope this adds some context, and I will see what can be done.

awaelchli avatar Aug 09 '22 12:08 awaelchli

Hello @awaelchli,

For my use case, your solution (wandb.init first) is perfectly fine and it's easy enough so everyone can do it and it let you focus on bigger problems. So we can close the issue if you want.

For more complex problems, (like DDP) I guess the user will have to make more small hacks (like only logging on rank 0 )) but it seems reasonable not to break the whole "PR loggers API".

Another solution is to use Faker with something like

from faker import Faker
fake = Faker()
name = fake.name().split()[0] + "_" + fake.color_name()

Which will give a name and a color (like John_DarkBlue) which mimics the wandb behavior.

It's three lines but is still usable by many and is not too much overhead.

Hope it helps someone. Have a great day

ierezell avatar Aug 09 '22 14:08 ierezell

Thank you, @ierezell, for mentioning Faker, I did not know that. I think, I (we) cannot expect that a single API (in this case pytorch lightning) can cover all features/cases of all loggers it wants to integrate with. Neither are the features identical, nor the semantics/conventions/assumptions.

But… it would be nice to have transparency/documentation about what works with what in which ways. The combination of frameworks is (for me at least) often the big issue (in this case, on the API level, sometimes on more abstract levels). The websites/documentation of those frameworks sometimes suggest “with only 3 lines of code you can integrate x and y”. And yes, that's true, and I genuinely understand this from a marketing perspective in a changing and growing field, but if you really want to dig deeper, it's sometimes not as easy. Ok, enough whining, I guess it's a plea for some fine-print in the documentation / release notes about the conditions that apply when you intend to integrate two frameworks/packages/utils.

hogru avatar Aug 09 '22 18:08 hogru

Hi, as the submitter of that PR, I'm sorry for the inconvenience, the intention was to fix the name property of WandbLogger.

The reason for setting wandb run_name to name or project is that now PL stores the log content as follows, so the experiment path will be filled with run folders.

experiment_name (run_name in wandb logger if user doesn't provide after fix)
├── version_id
│   └── checkpoints
└── version_id2

Personally, I think the ideal solution would be to adjust the log save path as commented in https://github.com/Lightning-AI/lightning/issues/12028#issuecomment-1088325894:

save_dir 
└── experiment_name (PL logger own experiment_name, may same to wandb project name but not mandatory)
    ├── version_1 (PL logger own version_id, synchronize with run_name)
    └── version_2

Are there any potential problems I haven't thought about?

tshu-w avatar Aug 10 '22 12:08 tshu-w

After reading a bit of https://github.com/Lightning-AI/lightning/issues/12028#issuecomment-1088325894 I know that I am not deep enough in all the details that I feel comfortable to give any advice or raise concerns. I did not have problems with directory structures in the past (besides my own) because I do not rely on the defaults, but rather define the structure myself. I even create my last-something links myself often to understand/control what's going on. E.g., until PL 1.6.5 I had a “last-checkpoint.ckpt” in a directory and I used this to refer to the last experiment in a subsequent program. Now, with PL 1.7, it appears to me that this link gets a v1 suffix (which IS in the release notes). Which might lead to me doing a “very last to last link” ;-) I understand the reasoning behind it, but it also tells me that it is hard to provide a fitting solution for everyone / every use case. In the end, I am happy that I have access to those solutions.

hogru avatar Aug 10 '22 13:08 hogru

Hi Can you have a look at #14145 and try it with your projects please. It is the best I could come up with for now.

Alternatively, I also tried to generate a run name myself. I couldn't find anything in the public wandb API, closest I found was wandb.apis.util.generate_id but it is for ids and not for human comfortable run names. I'm pretty sure it gets done in their backend, so this is not going to work.

awaelchli avatar Aug 10 '22 14:08 awaelchli

Alternatively, I also tried to generate a run name myself. I couldn't find anything in the public wandb API,

Is this what you're looking for? https://docs.wandb.ai/guides/track/launch#can-i-just-set-the-run-name-to-the-run-id It shows that the run name can be set/modified using:

import wandb
wandb.init()
wandb.run.name = something
wandb.run.save()

EDIT: Haven't tried, just sharing what I found. EDIT2: it works. There's also a name field in the wandb.init() which allows to set it at initialization. Tested it and it works too. Documented here: https://docs.wandb.ai/ref/python/init

kotchin avatar Aug 10 '22 14:08 kotchin

No, @kotchin, I was looking for a way to do it offline. It's not possible for us to create a Run at the time of initialization there.

awaelchli avatar Aug 10 '22 15:08 awaelchli

@awaelchli I think #14145 looks really cool and seems like a simple elegant solution. I'm testing it on my end. I think it is really great for the user experience! Thanks for jumping on this

manangoel99 avatar Aug 10 '22 16:08 manangoel99

Since this affects a couple of people, I would appreciate if some of you could run the logger in your script and let me know if you are pleased with it. Please take note of the folder structure and the logging in the wandb UI. I will of course add a test but I'm looking for a confirmation of expected behavior. @kotchin @hogru @ierezell @kabouzeid @manangoel99 Appreciate anyone who finds the time.

You can check out the branch bugfix/wandb (PR https://github.com/Lightning-AI/lightning/pull/14145 ) on this repo or pip install from my branch directly:

pip install git+https://github.com/Lightning-AI/lightning@bugfix/wandb

Thank you

awaelchli avatar Aug 12 '22 13:08 awaelchli

I think it makes more sense to leave the run name out of the folder structure. The run name can be changed later from the wandb UI, which I often do.

Instead of this:

this
  helpful-puddle-4
    3hkpkuej
      checkpoints
        abc.ckpt

it should be this (which is also the old behavior):

this
  3hkpkuej
    checkpoints
      abc.ckpt

kabouzeid avatar Aug 12 '22 13:08 kabouzeid

That's problematic. In Lightning, the place where trainer saves the checkpoints is determined like so:

os.path.join(logger.save_dir, str(logger.name), logger.version, "checkpoints")

On master:

  • save_dir = cwd
  • name = project, when name is not specified
  • version = 3hkpkuej

On my branch:

  • save_dir = cwd/project
  • name = helpful-puddle-4
  • version = 3hkpkuej

If we wanted to implement what you are saying, we would have to define logger.name as something that does NOT represent the name you see in the UI. Is this appropriate? It would also not represent the same as the "name" parameter you pass into the WandbLogger.

awaelchli avatar Aug 12 '22 14:08 awaelchli

pip install git+https://github.com/Lightning-AI/lightning@bugfix/wandb Thank you

Thanks for the update. Here's my feedback (using LightningCLI only): What worked:

  • The automatic 2-words+exp-number is back, no need to run wandb.init() which is solving this issue, thank you!
  • Additionally (bonus?) the wandb directory is now also placed in the log_dir which is great as it's not common to all projects anymore, which I find is an improvement.

What doesn't work:

  • the wandb directory created within the log_dir yields the follow error:
wandb: Currently logged in as: WANDBUSER. Use `wandb login --relogin` to force relogin
wandb: WARNING Path log_dir/wandb/ wasn't writable, using system temp directory.
wandb: WARNING Path log_dir/wandb/ wasn't writable, using system temp directory
wandb: Tracking run with wandb version 0.13.1
wandb: Run data is saved locally in /tmp/1230112.1.q_very_short_gpu/wandb/run-20220812_170737-3bsppmq6

so in fact this new wandb dir is not actually used/usable, despite having been correctly created. These are the folders properties (don't know if it's of interest or if it helps) to verify that there's nothing strange with the permissions of the created wandb folder:

drwxr-sr-x 2 username usergroup 4096 Aug 12 17:08 wandb
drwxr-sr-x 3 username usergroup 4096 Aug 12 17:19 pretty-pond-1

Additionally, the directory the run data was allegedly saved in doesn't seem to exist? Not sure what happened there. The end result of the log_dir after my experiment finished:

log_dir/
├── config.yaml
├── pretty-pond-1
│   └── 3bsppmq6
│       └── checkpoints
│           └── epoch=0-step=549.ckpt
└── wandb
  • (This was not the scope of this PR, but I'd just like to link back to this issue https://github.com/Lightning-AI/lightning/issues/14162 which would also need to be addressed and is still applicable to this PR).

Edit: updated portion on wandb folder permissions.

kotchin avatar Aug 12 '22 15:08 kotchin

@kotchin We released this fix in the latest version 1.7.2

We currently don't have control over where the wandb directory gets stored. We just let wandb do its default behavior. The LightningCLI config file not being stored to the experiment folder is a conflict we can solve in the future when we have #14188.

awaelchli avatar Aug 18 '22 08:08 awaelchli