
Compute metric per image and handler mutex in DistributedDataParallel

Nic-Ma opened this issue 4 years ago · 16 comments

❓ Questions/Help/Support

Hi @vfdev-5,

I am writing an ignite handler to write the segmentation metric of every image into one CSV file as a summary, for example:

metrics.csv:
/data/spleen/image_1    0.85
/data/spleen/image_2    0.87
/data/spleen/image_3    0.91
... ...

The problems are:

  1. I tried to add logic in metrics.update() to cache every record and write them to CSV in metrics.compute(), but ignite metrics only accept an output_transform, so I can't extract the filenames from engine.state.batch.
  2. Then I switched to writing a separate handler for this feature, but ignite metrics only save the final average values into engine.state.metrics, so it is not easy for a handler to get the metric value for every image.
  3. Another problem is DistributedDataParallel: when I run the handler with multiple processes, how do you usually use a multi-processing lock to save content into one CSV file on both Unix and Windows?

Thanks.

Nic-Ma · Jan 20 '21 02:01

@Nic-Ma Hi!

  1. You are right, metrics work on the (transformed) outputs of the process function, but the batch is embedded in engine.state. I suppose the filenames are provided by the dataset and packed into every batch when __getitem__ is called (see the sketch after this list). I could try on my side to see if I can help.
  2. Again, if we define the dataflow from dataset to metrics, we could go further together. What do you think?
  3. No idea atm
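
As an illustration of point 1, a minimal sketch of a dataset that packs the filename into every sample, so the default collate function gathers them into a list per batch (SegDataset and load_pair are hypothetical names, not part of ignite):

from torch.utils.data import Dataset


class SegDataset(Dataset):

    def __init__(self, filenames):
        self.filenames = filenames

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        path = self.filenames[idx]
        image, label = load_pair(path)  # hypothetical loader for an image/label pair
        # the default collate function turns the per-sample strings into a list
        return {"x": image, "y": label, "filenames": path}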

sdesrozis · Jan 20 '21 07:01

Hi @Nic-Ma ,

As far as I understand your question, it is about evaluating a pretrained model on images and outputting the metric per image.

For a single process, IMO, this can be done like this:

import torch

from ignite.engine import Engine, Events
from ignite.metrics import Accuracy


num_classes = 10
batch_size = 4

# dummy data: each batch carries the inputs, the targets and the filenames
data = [
    {
        "x": torch.rand(batch_size, 3, 32, 32),
        "y": torch.randint(0, num_classes, size=(batch_size, 32, 32)),
        "filenames": [f"file_{i + 0 * batch_size}" for i in range(batch_size)]
    },
    {
        "x": torch.rand(batch_size, 3, 32, 32),
        "y": torch.randint(0, num_classes, size=(batch_size, 32, 32)),
        "filenames": [f"file_{i + 1 * batch_size}" for i in range(batch_size)]
    },
]


def infer_step(engine, batch):
    # in a real application, run the pretrained model on batch["x"] here;
    # random predictions stand in for the model output
    return {
        "y_pred": torch.rand(batch_size, num_classes, 32, 32),
        "y": batch["y"],
    }


evaluator = Engine(infer_step)

acc_metric = Accuracy()


@evaluator.on(Events.ITERATION_COMPLETED)
def compute_metric_per_image(engine):
    print("compute_metric_per_image")

    output = engine.state.output
    batch = engine.state.batch
    assert len(output["y_pred"]) == len(output["y"])
    assert len(batch["filenames"]) == len(output["y"])

    # compute the metric sample by sample: reset, update on a single
    # image, then read the result
    engine.state.metrics = []
    for y_pred, y, filename in zip(output["y_pred"], output["y"], batch["filenames"]):
        acc_metric.reset()
        acc_metric.update((y_pred.unsqueeze(0), y.unsqueeze(0)))
        o = {
            "filename": filename,
            "accuracy": acc_metric.compute(),
        }
        engine.state.metrics.append(o)


@evaluator.on(Events.ITERATION_COMPLETED)
def save_files(engine):
    print("append data to CSV:", engine.state.metrics)


evaluator.run(data)

Another problem is DistributedDataParallel: when I run the handler with multiple processes, how do you usually use a multi-processing lock to save content into one CSV file on both Unix and Windows?

Maybe the idea is to gather all data from all participating processes onto a single one, write to the file from that single process only, and use a barrier (idist.barrier()) to synchronize, as sketched below.
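
A minimal sketch of that approach with ignite.distributed, assuming engine.state.metrics holds the per-image records built in the example above and that idist.all_gather on a string returns the list of strings from all ranks:

import ignite.distributed as idist


# attach with: evaluator.add_event_handler(Events.EPOCH_COMPLETED, save_metrics_csv)
def save_metrics_csv(engine):
    # serialize this rank's per-image results into one chunk of CSV text
    lines = "".join(f"{m['filename']}\t{m['accuracy']}\n" for m in engine.state.metrics)
    # gather the serialized chunks from all participating processes
    all_lines = idist.all_gather(lines)  # list with one string per rank
    if idist.get_rank() == 0:
        with open("metrics.csv", "a") as f:
            f.writelines(all_lines)
    idist.barrier()  # other ranks wait until rank 0 has finished writing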

vfdev-5 · Jan 20 '21 09:01

Hi @vfdev-5,

Thanks very much for your detailed example! Your program works for the metrics-saving case, but it requires manually calling metrics.reset() and metrics.update() and setting values in evaluator.state.metrics, which should be done by the ignite.Metric base class when we attach metrics to the engine. Maybe we can keep the existing metrics logic, save the metric of every image into a value in engine.state from update(), and then develop a MetricsSaver handler (sketched below) that reads that value from engine.state and writes it to the CSV file? To simplify the DistributedDataParallel problem, I am thinking maybe every GPU process could just write a separate CSV file.
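
For illustration, a minimal sketch of such a handler; the engine.state.metric_details attribute and the (filename, value) record format are assumptions, not an existing ignite API:

from ignite.engine import Engine, Events


class MetricsSaver:

    def __init__(self, filename: str = "metrics.csv"):
        self.filename = filename

    def attach(self, engine: Engine) -> None:
        engine.add_event_handler(Events.EPOCH_COMPLETED, self)

    def __call__(self, engine: Engine) -> None:
        # read the per-image records a metric stored on the engine state
        records = getattr(engine.state, "metric_details", [])
        with open(self.filename, "a") as f:
            for filename, value in records:
                f.write(f"{filename}\t{value}\n")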

Thanks.

Nic-Ma · Jan 20 '21 15:01

Your program works for the metrics-saving case, but it requires manually calling metrics.reset() and metrics.update() and setting values in evaluator.state.metrics, which should be done by the ignite.Metric base class when we attach metrics to the engine.

Yes, that's true. As we need to compute the metric per image of the batch, we cannot currently simply attach a metric to an engine and compute batch_size metric values on each iteration (that is, without averaging). A hacky solution could be to set batch_size = 1 and use the BatchWise metric usage, as sketched below.
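
A minimal sketch of that workaround, reusing the evaluator and the batch layout from the example above and assuming its data is rebuilt with batch_size = 1:

from ignite.engine import Events
from ignite.metrics import Accuracy, BatchWise

# with BatchWise usage the metric is reset, updated and computed on every
# iteration, so with batch_size = 1 engine.state.metrics["acc"] holds the
# value for the single image of the current batch
acc = Accuracy()
acc.attach(evaluator, "acc", usage=BatchWise())


@evaluator.on(Events.ITERATION_COMPLETED)
def log_per_image(engine):
    filename = engine.state.batch["filenames"][0]
    print(filename, engine.state.metrics["acc"])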

develop a MetricsSaver handler that reads that value from engine.state and writes it to the CSV file?

Yes, this makes sense and we can put such a handler into the contrib module!

To simplify the DistributedDataParallel problem, I am thinking maybe every GPU process could just write a separate CSV file.

Yes, this can also be a possible solution :+1:

vfdev-5 · Jan 20 '21 15:01

which should be done by the ignite.Metric base class when we attach metrics to the engine.

@sdesrozis maybe we can think about another MetricUsage like InstanceWise?

vfdev-5 · Jan 20 '21 15:01

Thanks, let me do more experiments tomorrow.

Nic-Ma · Jan 20 '21 15:01

which should be done by the ignite.Metric base class when we attach metrics to the engine.

@sdesrozis maybe we can think about another MetricUsage like InstanceWise?

Let's think about it. From my side, I have a similar need, so I'm very interested in such a feature.

sdesrozis · Jan 20 '21 21:01

Hi @vfdev-5 and @sdesrozis,

I think it's at least valuable to add engine as an arg for Metric.update() and Metric.compute(), so we can put some context into engine.state to share with other components, just as regular handlers do.

Thanks.

Nic-Ma · Jan 22 '21 07:01

@Nic-Ma thanks for the idea. Which use case do you have in mind for using engine inside Metric.update/compute? Thanks

vfdev-5 · Jan 22 '21 08:01

Hi @vfdev-5 ,

I think context information is widely used in most system software architectures, like a data bus that every component can produce to and consume from. Your engine is a good context object: all handlers can communicate through engine, but metrics can't access it directly. For example, if I want to save some temporary values (the mean Dice of every image) during metric computation and print them out in another handler later, I can save them into engine.state.metrics_report after every iteration (see the sketch below).
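
As an illustration, a minimal sketch of the consuming side; engine.state.metrics_report and its (filename, value) record format are hypothetical:

from ignite.engine import Events


# another handler consumes what the metric published on the engine state
@evaluator.on(Events.EPOCH_COMPLETED)
def print_report(engine):
    for filename, mean_dice in getattr(engine.state, "metrics_report", []):
        print(f"{filename}: {mean_dice:.4f}")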

Thanks.

Nic-Ma · Jan 22 '21 09:01

I see, thanks for explaining!

add engine as an arg for Metric.update() and Metric.compute()

In my mind, the update/compute methods are sort of independent of Engine etc. Maybe it would make sense to store the engine once it is passed to attach:

def attach(self, engine: Engine, name: str, usage: Union[str, MetricUsage] = EpochWise()) -> None:
    self.engine = engine  # keep a reference so all methods can access it
    ...

thus, it would be accessible in all methods. We discussed that previously and it seems like it was not retained... @sdesrozis what do you think?

On the other hand, out-of-the-box metrics do not use engine (except for writing to state.metrics), so a custom metric could also accept the engine as a constructor argument, as sketched below ...
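
A minimal sketch of that alternative; PerImageDice is a hypothetical metric name and the per-image pixel accuracy is only a placeholder score:

from ignite.engine import Engine
from ignite.metrics import Metric


class PerImageDice(Metric):

    def __init__(self, engine: Engine, output_transform=lambda x: x):
        # keep the engine so update() can read engine.state.batch
        self.engine = engine
        super().__init__(output_transform=output_transform)

    def reset(self):
        self._records = []

    def update(self, output):
        y_pred, y = output  # shapes: (batch, num_classes, H, W) and (batch, H, W)
        filenames = self.engine.state.batch["filenames"]
        for p, t, filename in zip(y_pred, y, filenames):
            # placeholder score: fraction of correctly classified pixels
            score = (p.argmax(dim=0) == t).float().mean().item()
            self._records.append((filename, score))
        # publish the details so other handlers can consume them
        self.engine.state.metric_details = self._records

    def compute(self):
        return self._records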

vfdev-5 · Jan 22 '21 10:01

I agree that having engine as a hub for handlers is very convenient. It enables very strong customisation possibilities.

In the case of Metric (metrics are handlers like any others), engine is passed as an argument to each call of the attached methods (iteration_completed, etc.). The point is that Metric also offers a user API that is completely agnostic of engine. I think it is a good idea to keep this separation.

I even think we could go further in the separation and adopt a composite pattern rather than inheritance. But that's irrelevant here.

Keeping a reference to engine when calling attach, as suggested by @vfdev-5, is a smart and simple option. If it solves @Nic-Ma's issue, I think we should do it.

sdesrozis · Jan 22 '21 21:01

@sdesrozis the problem here is that we already have an existing API with the detach and is_attached methods, which always require engine, and adding Metric.engine would require updating them too, I think.

vfdev-5 · Jan 23 '21 15:01

Yes, there would be side effects to keep under control...

sdesrozis · Jan 23 '21 16:01

Hi @vfdev-5 and @sdesrozis,

Thanks for your discussion. I added self.engine to the metrics in this draft MONAI PR: https://github.com/Project-MONAI/MONAI/pull/1497. I will delete it later, once you add engine to the Metric base class.

Thanks.

Nic-Ma · Jan 25 '21 02:01

Sounds good! Thanks for the update @Nic-Ma

vfdev-5 · Jan 25 '21 08:01