'NoneType' object has no attribute 'get_logger' - clearML with pytorch distributed
Hi, I'm trying to follow your examples and use ClearML with a PyTorch distributed run.
My script looks as follows:

```python
import argparse
import os

from clearml import Task, Logger


def main(args):
    # Only rank 0 initializes the task
    if int(os.environ.get('LOCAL_RANK', 0)) == 0:
        task = Task.init(project_name='DETR', task_name='all_bn_detr')
    for epoch in range(args.start_epoch, args.epochs):
        train_stats = train_one_epoch(
            model, criterion, data_loader_train, optimizer, device, epoch,
            args.clip_max_norm)
        Task.current_task().get_logger().report_scalar(
            "test", "mAP", iteration=epoch, value=a.stats[0])


if __name__ == '__main__':
    args = parser.parse_args()
    main(args)
```
I'm getting an error message saying:

```
File "main.py", line 275, in main
    Task.current_task().get_logger().report_scalar("test", "mAP", iteration=epoch, value=a.stats[0])
AttributeError: 'NoneType' object has no attribute 'get_logger'
```
When I change the reporting call to:

```python
Logger.current_logger().report_scalar("train", "loss bbox", iteration=epoch, value=train_stats['loss_bbox'])
```

I also get a similar message.
What am I missing?
Thanks
Hi,

```python
if int(os.environ.get('LOCAL_RANK', 0)) == 0:
    task = Task.init(project_name='DETR', task_name='all_bn_detr')
```

How is a task associated with the execution when the script doesn't enter the conditional statement? You have to be sure a task is associated with the execution before you call `current_task` or `current_logger`. If your code never associates a task with its execution (i.e. it never calls `Task.init`), then those functions return `None`, and you get exactly this kind of error.
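For example, a minimal guard (assuming only rank 0 called `Task.init`, and with `epoch` and `map_value` as placeholder names) would be to check the return value before reporting:

```python
from clearml import Task

# current_task() returns None in any process that has no task associated
task = Task.current_task()
if task is not None:
    # Placeholder scalar report; only runs where a task actually exists
    task.get_logger().report_scalar("test", "mAP", iteration=epoch, value=map_value)
```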
Hey, thanks for the reply! The reason I wrote:

```python
if int(os.environ.get('LOCAL_RANK', 0)) == 0:
    task = Task.init(project_name='DETR', task_name='all_bn_detr')
```

is that my run command is of the form:

```
python -m torch.distributed.launch --nproc_per_node=8
```

Without that condition, a single run creates 8 different tasks in the ClearML UI.
I suggest you launch the distributed training from within your main script file, so that all the nodes report into one single task, also created in the main process.

Here is a detailed example: https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch/pytorch_distributed_example.py
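For reference, a condensed sketch of the pattern from that example (the project/task names and reported values here are placeholders, and the real example also sets up the distributed process group): `Task.init` runs once in the parent process, the workers are launched from it, and each worker then retrieves the same task via `Task.current_task()`:

```python
from torch.multiprocessing import Process
from clearml import Task


def run(rank, world_size):
    # This process was launched from the parent that called Task.init,
    # so current_task() resolves to that same single task instead of None
    Task.current_task().get_logger().report_scalar(
        "train", "loss", iteration=0, value=0.0)  # placeholder report


if __name__ == '__main__':
    # One task for the whole run, created once in the main process
    task = Task.init(project_name='DETR', task_name='all_bn_detr')
    world_size = 8
    processes = []
    for rank in range(world_size):
        p = Process(target=run, args=(rank, world_size))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```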
Thanks!