FastFlow

Q&A

Open mjack3 opened this issue 3 years ago • 73 comments

Hello.

I would like to open this issue to talk about this project. I am also interested in developing this project, and it would be great to share information, since the paper doesn't give deep information about the implementation and the official code is not available.

If you agree with this initiative, we could first simplify the project to use Wide-ResNet50 in order to get results comparable with previous research. I would like to start from the beginning of the paper, where it says:

For ResNet, we directly use the features of the last layer in the first three blocks, and put these features into three corresponding FastFlow model.

This makes me think that in the implementation we need to use the features after the input layer, layer 1, and layer 2. This way, Table 6 makes sense:

[image: Table 6 from the paper]

But I cannot imagine how to combine this information so that it is consistent with the following:

In the forward process, it takes the feature map from the backbone network as input

Depending on which part you read, it seems that either just one feature map or three are taken.

mjack3 avatar Feb 09 '22 09:02 mjack3

Hi @mjack3, I'm really glad to find some help on this project. Thank you very much for your proposal; I accept. This paper is quite obscure. The problem you are addressing is explained in paragraph 4.7:

For ResNet18 and Wide-ResNet50-2, we directly use the features of the last layer in the first three blocks, put these features into the 2D flow model to obtain their respective anomaly detection and localization results, and finally take the average value as the final result.

I think that the paper wants us to build three different models and average their anomaly scores. But how do we compute this anomaly score? That is the question I can't solve. In the introduction we can find:

We propose a 2D normalizing flow denoted as FastFlow for anomaly detection and localization with fully convolutional networks and two-dimensional loss function to effectively model global and local distribution.

But I can't find how this two-dimensional loss is defined. If you have an idea for a good two-dimensional loss for this problem, I'm all ears. For the part I do understand, my reading of "build three models and average" is sketched below. Best, Alessio
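A minimal sketch of that reading (all names here are hypothetical, and the per-head scoring is exactly the open question):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeHeadFastFlow(nn.Module):
    # Hypothetical wrapper: one 2D flow head per backbone stage;
    # the final anomaly map is the average of the per-head maps.
    def __init__(self, backbone, flow_heads):
        super().__init__()
        self.backbone = backbone              # assumed to return 3 feature maps
        self.flow_heads = nn.ModuleList(flow_heads)

    def forward(self, x, out_size=(256, 256)):
        feats = self.backbone(x)              # [f1, f2, f3]
        maps = []
        for f, head in zip(feats, self.flow_heads):
            z, log_jac_det = head(f)          # 2D normalizing flow
            # Placeholder score: squared norm of z per pixel; how to
            # score z properly is the question discussed in this thread.
            m = 0.5 * (z ** 2).sum(dim=1, keepdim=True)
            maps.append(F.interpolate(m, size=out_size, mode="bilinear",
                                      align_corners=False))
        return torch.stack(maps).mean(dim=0)  # averaged anomaly map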

AlessioGalluccio avatar Feb 10 '22 16:02 AlessioGalluccio

Hmm, yes, you are right; we definitely need to create 3 FastFlow models. I will try. By the way, you can find my implementation here: https://github.com/mjack3/EasyFastFlow. Feel free to use whatever you want.

mjack3 avatar Feb 11 '22 07:02 mjack3

Have you tried contacting any of the main authors of the paper? I googled them but didn't find an e-mail address.

mjack3 avatar Feb 11 '22 08:02 mjack3

@mjack3 Hi, have you taken a look at CFLOW-AD? It is also implemented with a flow model; maybe it can help you understand how the 3 FastFlow modules work. I'm trying to implement FastFlow by modifying CFlow-AD. If you need any help or want to discuss, I would like to help (if I can).

Howeng98 avatar Feb 11 '22 09:02 Howeng98

@Howeng98 you are welcome =)

Yes, I also looked at the CFLOW-AD code, but I am not sure whether here we need to create 3 individual FastFlow models and train them with 3 optimizers (one per FastFlow model), or do something similar to CFLOW-AD.

mjack3 avatar Feb 11 '22 10:02 mjack3

@mjack3 I tried to contact Yushuang Wu through a university e-mail I found, but I got no answer. I haven't found the e-mails of the other authors.

AlessioGalluccio avatar Feb 11 '22 17:02 AlessioGalluccio

When did you contact them, @AlessioGalluccio?

mjack3 avatar Feb 11 '22 17:02 mjack3

Hi @mjack3, can you please share your implementation of FastFlow? The link seems to be deactivated. Thanks

rafalfirlejczyk avatar Feb 14 '22 14:02 rafalfirlejczyk

Currently I am obliged to keep the code private because of my job contract. I hope to open it soon. Anyway, I will share information in this same thread if it is needed :)

mjack3 avatar Feb 14 '22 15:02 mjack3

@AlessioGalluccio just a small remark: for anomaly score calculation (global and pixel-wise) you need to use p(z), and not z, which you are currently using.

You can estimate log p(z) (and therefore p(z)) analogously to the PyTorch implementation of CFlow-AD.
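A minimal sketch of that estimate (my own helper name), assuming the flow head returns z of shape (B, C, H, W) and a per-sample log-det of shape (B,):

import math
import torch

def log_likelihood_map(z, log_jac_det):
    # log p(x) = log N(z; 0, I) + log|det J|, evaluated per pixel
    c = z.shape[1]
    log_pz = -0.5 * (z ** 2).sum(dim=1) - 0.5 * c * math.log(2 * math.pi)
    return log_pz + log_jac_det.view(-1, 1, 1)  # larger = more normal

# anomaly map: invert the sign, e.g. -log_likelihood_map(z, log_jac_det)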

maaft avatar Mar 04 '22 10:03 maaft

Hi @maaft, did you manage to achieve results similar to the claimed ones? I tried both the CFlow way and the DifferNet way, but I am still far below the performance in the paper.

Another point of confusion for me is that I cannot reproduce the A.D. param counts. I take each FlowStep as one AllInOneBlock from FrEIA, with 2 convolution layers. These are my counts (the paper's counts in parentheses):

CaiT:  7,043,780 (14.8M)
DeiT:  7,043,780 (14.8M)
Resnet18:  4,650,240 (4.9M)
WideResnet50:  41,309,184 (41.3M) -> this one is matched

Here's the code I used to compute the param counts:

def count_params_per_flow_step(k, cin, ratio):
    # One coupling subnet: conv(cin -> cmed) + conv(cmed -> cout),
    # where cout = 2 * cin (scale and shift) and k is the kernel size.
    cout = 2 * cin
    cmed = int(cin * ratio)
    w1 = k * k * cin * cmed
    b1 = cmed
    w2 = k * k * cmed * cout
    b2 = cout
    return w1 + w2 + b1 + b2

def count_total_params(num_steps, conv3x3_only, feature_channels, ratio):
    # Alternate 3x3 and 1x1 kernels unless conv3x3_only is set.
    s = 0
    for channels in feature_channels:
        for i in range(num_steps):
            k = 1 if (i % 2 == 1 and not conv3x3_only) else 3
            s += count_params_per_flow_step(k, channels // 2, ratio)
    return s

print("CaiT: ", count_total_params(20, False, [768], 0.16))
print("DeiT: ", count_total_params(20, False, [768], 0.16))
print("Resnet18: ", count_total_params(8, True, [64, 128, 256], 1.0))
print("WideResnet50: ", count_total_params(8, False, [256, 512, 1024], 1.0))

gathierry avatar Mar 06 '22 08:03 gathierry

@gathierry no, I don't think that I can match the scores in the paper (I haven't evaluated it yet, only visually). In particular, the transistor class (broken legs) does not learn at all.

I'll evaluate AUROC etc. next week and report back.

I also tried backbones other than ResNet18 that achieve higher accuracy on ImageNet (e.g. EfficientNet) and noticed that training has a very hard time converging at all. No idea why this is the case.

maaft avatar Mar 06 '22 09:03 maaft

My tests of this code (24 epochs) show acceptable results only for ResNet18, and only for three MVTec classes:

Class    AUROC-MAX   AUCPR-MAX
Bottle   0.9849      0.9955
Screw    0.9859      0.9959
Wood     0.9956      0.9987

The other classes performed badly. I have not tested it with WideResnet50 yet. Feature extractors based on Vision Transformers like DeiT or CaiT do not learn at all.

brm738 avatar Mar 06 '22 10:03 brm738

That's weird, right? Do the ResNet18 features follow some kind of special/nice distribution that the other architectures' features don't?

Has anyone tried different feature extractors with CFLOW-AD or other flow-based approaches?

maaft avatar Mar 06 '22 13:03 maaft

I found out that ResNet18 works well because the extracted features have a low magnitude. When I use e.g. EfficientNet and simply scale the features by 0.1, the NF head seems to learn quite well.

~~I'll try to add a learnable scaling parameter to make my model backbone-agnostic.~~ Doesn't work: features will collapse to 0.

maaft avatar Mar 07 '22 13:03 maaft

Another thing: according to the architecture image (Fig. 2) from the paper, I think we should use RNVPCouplingBlock and not AllInOneBlock. The former includes two alternating coupling networks, while the latter is only single-sided.

Furthermore, the AllInOneBlock applies ActNorm and PermuteRandom at the end of the coupling block, not at the beginning. We therefore need to add those manually before every RNVPCouplingBlock.

Does anyone know if the permutation indeed needs to be fixed during training, or do we need to use a different permutation at every training step? I'm asking because the PermuteRandom module from FrEIA is fixed during training.

Edit: Does it really matter, though? I think the reason for alternating coupling networks in RealNVP was to also train the upper half of the channels. But when permuting randomly multiple times, we also train every channel. Hmm, I'm a bit clueless here.

Edit 2: ActNorm at the beginning is paramount. When you do this, all backbones work like magic. No manual scaling needed.
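In FrEIA terms, something like the following is what I mean (a sketch, not tested against the paper; the two-conv subnet is my own choice):

import torch.nn as nn
import FrEIA.framework as Ff
import FrEIA.modules as Fm

def subnet_conv(kernel_size):
    def constructor(cin, cout):
        pad = kernel_size // 2
        return nn.Sequential(
            nn.Conv2d(cin, cout, kernel_size, padding=pad),
            nn.ReLU(),
            nn.Conv2d(cout, cout, kernel_size, padding=pad),
        )
    return constructor

def build_flow(channels, height, width, num_blocks=8):
    inn = Ff.SequenceINN(channels, height, width)
    for i in range(num_blocks):
        k = 3 if i % 2 == 0 else 1            # alternate 3x3 / 1x1 subnets
        inn.append(Fm.ActNorm)                # normalization first (Edit 2)
        inn.append(Fm.PermuteRandom, seed=i)  # fixed during training
        inn.append(Fm.RNVPCouplingBlock,
                   subnet_constructor=subnet_conv(k), clamp=2.0)
    return inn

Calling build_flow(256, 64, 64)(features) then returns the latent z and the log-det.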

maaft avatar Mar 08 '22 09:03 maaft

I think PermuteRandom is actually a more flexible version of the lower-half/upper-half alternation, so essentially I don't see a big difference. I was also trying to figure out whether it's a coupling block or an AllInOneBlock from the A.D. params in Table 1, but as mentioned earlier, I can never match all of them.

Based on our experiments, PermuteRandom must be fixed from initialization; otherwise, the NF cannot learn anything useful.

gathierry avatar Mar 10 '22 12:03 gathierry

Yes, I think you are right.

To match parameters: which layers are you using from the ResNet?

Per the paper:

  • use the first three block outputs (64, 64, 128 channels) for ResNet18
  • use RNVPCouplingBlock (or Glow; parameter-wise I think it shouldn't matter)
  • use ActNorm followed by PermuteRandom before every block
  • use a total of 8 coupling blocks per layer output (3x3 and 1x1 alternating)

The only free variable to play with in this case is the number of mid-channels for both subnets (see the sketch below).
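For example, a hypothetical subnet constructor where that mid-channel count is the only knob (mirroring the ratio in the counting script above):

import torch.nn as nn

def subnet_with_ratio(kernel_size, ratio=1.0):
    # ratio sets the hidden width relative to the input channels
    def constructor(cin, cout):
        cmed = max(1, int(cin * ratio))
        pad = kernel_size // 2
        return nn.Sequential(
            nn.Conv2d(cin, cmed, kernel_size, padding=pad),
            nn.ReLU(),
            nn.Conv2d(cmed, cout, kernel_size, padding=pad),
        )
    return constructor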

Unfortunately, my GPU memory is too small to use the first 3 feature maps. Please let me know if you can achieve any good results with the above configuration.

Count parameters with:

nf_params = sum(p.numel() for p in self.nf.parameters() if p.requires_grad) # self.nf is the flow head

maaft avatar Mar 10 '22 12:03 maaft

I opened a question in the FrEIA GitHub:

https://github.com/VLL-HD/FrEIA/issues/113

mjack3 avatar Mar 10 '22 12:03 mjack3

@maaft

  • I think the "first 3 blocks" for ResNet18 means stride 4x, 8x, and 16x, so the channel numbers should be (64, 128, 256). See Table 6.
  • In fact, section 6.1 and the caption of Table 7 indicate the mid-channel numbers of the subnets.

I tried to move Permute and ActNorm from the end to the beginning of the block, as you suggested, but I didn't see a significant improvement. Maybe there are some other issues in my code.
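For reference, a minimal torchvision sketch of those three stages (my reading of the paper, not the authors' code):

import torch
from torchvision.models import resnet18

backbone = resnet18(pretrained=True).eval()

@torch.no_grad()
def extract_stages(x):
    # stem: conv1 + bn1 + relu + maxpool
    x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
    f1 = backbone.layer1(x)   # stride 4x,  64 channels
    f2 = backbone.layer2(f1)  # stride 8x,  128 channels
    f3 = backbone.layer3(f2)  # stride 16x, 256 channels
    return f1, f2, f3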

gathierry avatar Mar 10 '22 16:03 gathierry

I am getting NaN when ActNorm is at the beginning of the block in an INN sequence. Could you share an image?

mjack3 avatar Mar 10 '22 17:03 mjack3

I guess I could share my model later. No idea why you get NaNs.

Maybe your data is already bad and contains NaNs? Are you normalizing your images?

maaft avatar Mar 10 '22 17:03 maaft

Btw, for ResNet the outputs of layer1, layer2, and layer3 are used.

Currently, I have a model that achieves [0.98, 1.0] in classification for every class, with 25 epochs instead of 500.

It needs some adjustments, but I hope to open the code soon for community participation.

Note: the code of this repo is wrong (sorry).

mjack3 avatar Mar 10 '22 17:03 mjack3

For that, I am emulating the process:

x = torch.rand(16, 3, 256, 256)
o = model(x)

But yes, I also tested with real images normalized in the standard PyTorch way.

mjack3 avatar Mar 10 '22 17:03 mjack3

The permutation of channels must be fixed during training. As @gathierry mentioned, it's necessary for normalizing flows.

@AlessioGalluccio just a small remark: For anomaly score calculation (global and pixelwise) you need to use p(z) and not z which you are currently using.

you can estimate logp(z) (and therefore p(z)) analogous to the pytorch implementation of CFlow AD.

For the anomaly score I apply anomaly_score.append(t2np(torch.mean(z_grouped_temp ** 2, dim=(-2, -1)))), as it is done in DifferNet. Do you mean that I should add a factor of 1/2 so that it matches the negative log-likelihood of a normal distribution?

AlessioGalluccio avatar Mar 10 '22 17:03 AlessioGalluccio

Looking at CFlow-AD, in utils.py it does logp = C * _GCONST_ - 0.5*torch.sum(z**2, 1) + logdet_J. He computes the positive log-likelihood instead of the negative one; in fact, he calculates a normality score, not an anomaly score. Then, in train.py, he computes:

# invert probs to anomaly scores
super_mask = score_mask.max() - score_mask

In this way he gets the anomaly score, so it's basically the same. I think that adding the Jacobian to the anomaly score is useless, since it is the same for every output: the Jacobian depends on the weights of the net, not on the input image.

AlessioGalluccio avatar Mar 10 '22 18:03 AlessioGalluccio

I'll share my model, loss function, and anomaly map generation tomorrow.

maaft avatar Mar 10 '22 21:03 maaft

@AlessioGalluccio In CFlow there's also an exponential converting logp to p. It's the same if there's only one feature level (as for DeiT and CaiT). But if there are 3 feature levels (ResNet), it is different, since exp is applied before the three score maps from the three levels are summed. logp is in (-inf, 0] but p is in [0, 1], so sum(logp) and sum(p) can give totally different values.
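A toy illustration of the difference (made-up numbers):

import torch

# per-pixel log-likelihoods of one pixel at three feature levels
logp = torch.tensor([-0.1, -5.0, -0.2])

print(logp.sum())         # -5.3: in log space the bad level dominates
print(logp.exp().mean())  # ~0.58: in prob space it barely matters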

gathierry avatar Mar 11 '22 02:03 gathierry

And for logp = C * _GCONST_ - 0.5*torch.sum(z**2, 1) + logdet_J: does it make sense to reduce only dim=1 in the sums, so that logdet_J keeps the H and W axes? I subclassed AllInOneBlock to do that:

class AllInOneBlock2D(Fm.AllInOneBlock):
    def __init__(self, dims_in, **kwargs):
        super().__init__(dims_in, **kwargs)
        # reduce only the channel dim, so the log-det keeps H and W
        self.sum_dims = (1,)
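A hypothetical follow-up, assuming the whole flow is built from such blocks so that the summed log_jac_det also keeps shape (B, H, W):

import torch

def per_pixel_logp(inn, feature_map):
    # inn: a flow built from AllInOneBlock2D, returning (z, log_jac_det)
    z, log_jac_det = inn(feature_map)        # z: (B, C, H, W)
    # constant term omitted; fine for ranking pixels by abnormality
    return -0.5 * (z ** 2).sum(dim=1) + log_jac_det  # (B, H, W)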

gathierry avatar Mar 11 '22 02:03 gathierry

@mjack3 As for ActNorm, I simply moved the _permute of AllInOneBlock to the beginning of forward and removed the original call. I don't think this is the root cause of the NaNs, but it might somehow amplify your gradient.

def forward(self, x, c=[], rev=False, jac=True):
    '''See base class docstring'''
    if self.householder:
        self.w_perm = self._construct_householder_permutation()
        if rev or self.reverse_pre_permute:
            self.w_perm_inv = self.w_perm.transpose(0, 1).contiguous()

    # ==== ActNorm ====
    x0, global_scaling_jac = self._permute(x[0], rev=False)
    # ==== ActNorm end ====
    x1, x2 = torch.split(x0, self.splits, dim=1)

    if self.conditional:
        x1c = torch.cat([x1, *c], 1)
    else:
        x1c = x1

    if not rev:
        a1 = self.subnet(x1c)
        x2, j2 = self._affine(x2, a1)
    else:
        a1 = self.subnet(x1c)
        x2, j2 = self._affine(x2, a1, rev=True)

    log_jac_det = j2
    x_out = torch.cat((x1, x2), 1)

    # add the global scaling Jacobian to the total.
    # trick to get the total number of non-channel dimensions:
    # number of elements of the first channel of the first batch member
    n_pixels = x_out[0, :1].numel()
    log_jac_det += (-1)**rev * n_pixels * global_scaling_jac

    return (x_out,), log_jac_det

gathierry avatar Mar 11 '22 02:03 gathierry