
[Re] Network Deconvolution

Open rochanaro opened this issue 1 year ago • 28 comments

Original article: C. Ye, M. Evanusa, H. He, A. Mitrokhin, T. Goldstein, J. A. Yorke, C. Fermüller, and Y. Aloimonos. “Network Deconvolution.” In: ICLR (2020).

PDF URL: https://github.com/lamps-lab/rep-network-deconvolution/blob/master/article.pdf Metadata URL: https://github.com/lamps-lab/rep-network-deconvolution/blob/master/metadata.yaml Code URL: https://github.com/lamps-lab/rep-network-deconvolution

Scientific domain: Machine Learning Programming language: Python Suggested editor:

rochanaro avatar Sep 19 '24 20:09 rochanaro

Thanks for your submission. We'll assign an editor soon.

rougier avatar Oct 14 '24 06:10 rougier

By the way, is this submission part of the ICLR reproducibility challenge? If yes, are there any open reviews available somewhere?

rougier avatar Oct 14 '24 06:10 rougier

Thanks for your submission. We'll assign an editor soon.

Thank You!

rochanaro avatar Oct 14 '24 06:10 rochanaro

By the way, is this submission part of the ICLR reproducibility challenge? If yes, are there any open reviews available somewhere?

No, our work was not submitted to the ICLR reproducibility challenge.

rochanaro avatar Oct 14 '24 06:10 rochanaro

Very sorry for such a long delay; hopefully things will get better in 2025. I was asking because the format of the PDF is very similar to that of the ICLR challenge. Note that this is not a problem at all; the idea was to re-use reviews if they were available.

I'll edit your submission and assign reviewers soon, hopefully.

In the meantime, can you have a look at other submissions and propose yourself to review?

rougier avatar Jan 21 '25 10:01 rougier

@birdortyedi @MiWeiss Could you review this submission?

rougier avatar Jan 21 '25 10:01 rougier

Very sorry for such a long delay; hopefully things will get better in 2025. I was asking because the format of the PDF is very similar to that of the ICLR challenge. Note that this is not a problem at all; the idea was to re-use reviews if they were available.

I'll edit your submission and assign reviewers soon, hopefully.

In the meantime, can you have a look at other submissions and propose yourself to review?

Thank you for the update! We understand how busy things can get.
Yes, for organizing the content, we followed the structure of the MLRC 2022 template purely to ensure clarity. I’ll certainly take a look at other submissions and will propose myself as a reviewer where I can contribute.

rochanaro avatar Jan 23 '25 03:01 rochanaro

Sounds like an interesting paper and a good match for me. Unfortunately, though, I won't be able to review a paper in the next months.

@rougier

MiWeiss avatar Jan 23 '25 19:01 MiWeiss

@ReScience/reviewers We're looking for two reviewers (Machine learning / Python, ICLR), any takers?

rougier avatar Feb 17 '25 15:02 rougier

@rochanaro Don't hesitate to post here to ask for updates.

rougier avatar Feb 17 '25 15:02 rougier

@rougier I would like to review this one.

alchemi5t avatar Feb 17 '25 15:02 alchemi5t

@alchemi5t Many thanks! You can start the review now and let's target mid-March (or sooner if you can)

rougier avatar Feb 17 '25 15:02 rougier

@rougier I am interested in reviewing this one if a second review is needed.

jsta avatar Feb 18 '25 19:02 jsta

@jsta That would be great, many thanks. Do you think mid-March would work for you?

rougier avatar Feb 19 '25 05:02 rougier

Yes

jsta avatar Feb 19 '25 18:02 jsta

@rougier I'm not getting responses to PRs and issues in the code repo. Can we delay the due date pending a response from @rochanaro or team?

jsta avatar Feb 26 '25 15:02 jsta

@rougier I'm not getting responses to PRs and issues in the code repo. Can we delay the due date pending a response from @rochanaro or team?

@jsta Apologies for the delayed response. We encountered an issue with notification delivery for pull requests and issues. However, we have now addressed all queries received thus far.

rochanaro avatar Feb 26 '25 17:02 rochanaro

@alchemi5t @jsta Any progress on your reviews?

rougier avatar Mar 12 '25 06:03 rougier

@rougier I'll have my review by the end of the week.

alchemi5t avatar Mar 12 '25 06:03 alchemi5t

@rougier Yes, I have some general comments prepared. I will post them once I have finished a couple of tests to verify general behavior. I do not plan to run all the tests due to computational constraints.

jsta avatar Mar 12 '25 14:03 jsta

Dear Authors,

Here's my review of your work.

Review of "[Re] Network Deconvolution"

This reproduction study provides a thorough evaluation of the network deconvolution technique introduced by Ye et al. (2020). After examining both the original paper and this reproduction, I find that the authors have validated the primary claim about network deconvolution for the most part. While there are many cases where network deconvolution improves model performance compared to batch normalization, the reproduction results show a few exceptions that contradict the original paper's universal claim. For example, in the ResNet-18 architecture with CIFAR-100 at 100 epochs, batch normalization (97.42%) actually outperformed network deconvolution (94.31%), which contradicts both the original paper's results and its central claim.
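
To make concrete what is being compared: network deconvolution removes the batch-normalization layers and instead decorrelates (whitens) the inputs to each convolution before the weights are applied. Below is a rough, simplified sketch of that idea in PyTorch. It uses a full eigendecomposition for the inverse square root and a hypothetical module name (SimpleDeconv2d); the original repository relies on a faster approximate implementation, so treat this purely as an illustration, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeconv2d(nn.Module):
    """Whiten im2col patches with the inverse square root of their covariance,
    then apply the convolution weights (stride 1, 'same' padding, odd kernel)."""

    def __init__(self, in_channels, out_channels, kernel_size=3, eps=1e-5):
        super().__init__()
        self.kernel_size = kernel_size
        self.eps = eps
        self.weight = nn.Parameter(
            0.01 * torch.randn(out_channels, in_channels * kernel_size ** 2))
        self.bias = nn.Parameter(torch.zeros(out_channels))

    def forward(self, x):
        n, _, h, w = x.shape
        # (N, C*k*k, H*W) patches, flattened into rows of length C*k*k.
        patches = F.unfold(x, self.kernel_size, padding=self.kernel_size // 2)
        flat = patches.transpose(1, 2).reshape(-1, patches.shape[1])
        flat = flat - flat.mean(dim=0, keepdim=True)
        # Covariance of the patches and its inverse square root (the "deconvolution").
        cov = flat.t() @ flat / flat.shape[0]
        cov += self.eps * torch.eye(cov.shape[0], device=x.device)
        evals, evecs = torch.linalg.eigh(cov)
        whiten = evecs @ torch.diag(evals.clamp_min(self.eps).rsqrt()) @ evecs.t()
        # Apply the convolution weights to the decorrelated patches.
        out = (flat @ whiten) @ self.weight.t() + self.bias      # (N*H*W, out)
        return out.view(n, h * w, -1).transpose(1, 2).reshape(n, -1, h, w)
```

In the architectures studied, such a layer would stand in for the usual convolution + batch-normalization pair, which is the substitution whose effect both the original paper and the reproduction measure.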

Strengths

  • Comprehensive testing: The authors tested 10 modern neural network architectures on CIFAR-10/100 and ImageNet datasets, over 3 runs
  • Training time analysis: The authors went beyond the original paper by analyzing computational overhead, showing that network deconvolution requires more training time

Key Observations

The few exceptions where BN seems to have outperformed ND have not been noted, and there is no analysis around them. Furthermore, this statement is not true given the results: "The results show that the model performance with network deconvolution is always better than the model performance with batch normalization."

Another notable finding is that the reproduced accuracy values were often significantly higher than those reported in the original paper. For example:

  • For VGG-16 with CIFAR-100 at 100 epochs, the original paper reported 75.32% accuracy with network deconvolution, while the reproduction achieved 99.30%
  • Similar large improvements were observed across most architectures and datasets

The authors attribute this systematic improvement to:

  1. Advancements in numerical stability of libraries (NumPy 1.16.1 → 1.23.5, PyTorch 1.0 → 1.13)
  2. Improved optimization algorithms and parallelism in Tensorflow

While these explanations are plausible, the magnitude of improvement (sometimes exceeding 20%) suggests there might be additional factors at play that weren't fully investigated. Given that this improvement has pushed most accuracies (both ND and BN) into the 99.xx range, where the comparison comes down to differences on the order of 10⁻² points (vs. the 1-2 points in the original paper), I expected deeper analysis and stronger evidence for these claims.

Code

As far as I understand, running full training runs is the only way to test the codebase. Due to a lack of hardware I am unable to do so, but for anyone who can and would like to quickly verify the claims, releasing the trained weights at 100 epochs and a minimal script to infer the results would be greatly helpful.

Opinion

This reproduction study provides valuable insights but reveals important discrepancies with the original paper's claims. While network deconvolution often outperforms batch normalization, the reproduction found notable exceptions.

The reproduction yielded substantially higher accuracy values for both techniques compared to those reported in the original paper. These significant discrepancies make it difficult to draw direct comparisons with the original results, and the proposed explanations for these differences (library improvements, optimization algorithms) remain speculative without rigorous empirical validation.

Rather than enhancing confidence in network deconvolution as a universal improvement over batch normalization, this reproduction suggests a more nuanced view: network deconvolution appears to be a viable alternative that performs better in many but not all scenarios. The authors' detailed reporting of computational costs and performance characteristics across architectures provides essential practical context for researchers considering which normalization technique to employ.

alchemi5t avatar Mar 17 '25 04:03 alchemi5t

Ok, I have completed my review. Please see the comments below. In addition to reading the paper and git repository, I verified that I could run 2 arbitrary architectures (vgg16 and pnasnetA) for 100 epochs for both the BN and ND cases, and got similar relative accuracy results.

major comments

  • Section 1, last paragraph, says "our study attempts to reproduce the results reported in the original paper with the most recent versions of software libraries.". This is not accurate, for example, Tensorflow 2.12 is years old. Maybe change to "the most recent version of software libraries as of [some date]"?

minor comments

  • Section 4.1, first sentence, says "compared it against the state‐of‐the‐art against". This seems like a typo or grammar mistake.
  • Is there a typo in imagenet_single_experiment_densenet121.sh, should -a densenet121 -> -a densenet121d?

highly optional and/or minor curiosity questions

  • Section 5.1, first paragraph, says "[...] we averaged the results of three attempts and compared them with the original study’s reported values.". Was there variation among the attempts such that running each model/architecture combination that many times was worthwhile?
  • Figure 3, why were the reproduced values so much more accurate than the original for CIFAR100 but not CIFAR10 or Imagenet?
  • There is some information in the repository README that duplicates the paper. I wonder if you could remove information about the steps you did to create the repo and focus the README on how users will interact with it. Get them up and running quickly without wading through extraneous details. Conversely, the paper says that torchvision datasets are downloaded on-the-fly, but this is not in the README.
  • It would be nice to add a time argument to the sbatch files so users have a rough idea of how long they're expected to run.
  • It's strange that Table 8 lists out the model names while Table 4 keys them out by number code.

jsta avatar Mar 19 '25 21:03 jsta

@alchemi5t @jsta Many thanks for your detailed reviews. @rochanaro Can you address the comments and, specifically, @alchemi5t's concerns about performance? It seems the change in numerical stability might not be a plausible reason for the enhanced performance. Any idea on that? Would it be possible to rerun your code (without too much hassle) with the exact same stack as the original paper?

rougier avatar Mar 24 '25 06:03 rougier

@alchemi5t @jsta Many thanks for your detailed reviews. @rochanaro Can you address the comments and, specifically, @alchemi5t's concerns about performance? It seems the change in numerical stability might not be a plausible reason for the enhanced performance. Any idea on that? Would it be possible to rerun your code (without too much hassle) with the exact same stack as the original paper?

Thank you @alchemi5t and @jsta for the reviews. @rougier It would be challenging to redo the experiment using the exact same dependencies as the original paper. During our study, we attempted to contact the original authors to obtain the precise library versions, as the original study repository does not mention them, but we did not receive a response. Therefore, we adopted the following approach: we used the latest available versions of each library as of 2020 and only opted for more recent versions when compatibility issues arose.

We will soon get back with answers, revisions, and updates addressing the above concerns in our upcoming response.

rochanaro avatar Mar 24 '25 20:03 rochanaro

Full response to ReScience C reviews (submission #89)

Dear editor (@rougier) and the reviewers (@alchemi5t, @jsta) ,

We appreciate the reviewers’ careful evaluation of our reproduction study. Below is our detailed response addressing each of the reviewers’ points:

Performance Discrepancies and Exceptional Cases

While there are many cases where network deconvolution improves model performance compared to batch normalization, the reproduction results show a few exceptions that contradict the original paper's universal claim. For example, in the ResNet-18 architecture with CIFAR-100 at 100 epochs, batch normalization (97.42%) actually outperformed network deconvolution (94.31%), which contradicts both the original paper's results and its central claim.

We appreciate the observation that batch normalization (BN) outperformed network deconvolution in certain cases (e.g., ResNet-18 on CIFAR-100 at 100 epochs), and we acknowledge these exceptions. Our study aimed to carefully reproduce the experiments rather than optimize hyperparameters or modify architectural choices. In the literature, such variations are not uncommon when replicating older experiments (see, e.g., Examining the Effect of Implementation Factors on Deep Learning Reproducibility, A Study on Reproducibility and Replicability of Table Structure Recognition Methods, Towards Training Reproducible Deep Learning Models). We do not claim that network deconvolution universally outperforms BN; rather, our findings suggest that while ND often provides benefits, its effectiveness may be context-dependent (explained in the manuscript, Section 6, Paragraph 4).

Systematic Accuracy Improvements

Another notable finding is that the reproduced accuracy values were often significantly higher than those reported in the original paper. For example:

  • For VGG-16 with CIFAR-100 at 100 epochs, the original paper reported 75.32% accuracy with network deconvolution, while the reproduction achieved 99.30%
  • Similar large improvements were observed across most architectures and datasets

We acknowledge the reviewer’s observation regarding systematically higher accuracy values (e.g., VGG-16 with CIFAR-100) and agree on the importance of understanding these differences. We believe that several factors contribute to these improvements. Notably, advancements in numerical stability, as documented in recent studies (e.g., Recent Advances and Applications of Deep Learning Methods in Materials Science; A Comprehensive Review of Deep Learning: Architectures, Recent Advances, and Applications), and enhancements in optimization algorithms may play a significant role.

Additionally, prior research supports the idea that updated frameworks can improve model performance. Mienye et al. (2024), Qin et al. (2024), and Vaishnav et al. (2022) demonstrated that deep learning models consistently achieve higher accuracy when trained with updated frameworks due to improvements in optimization techniques and regularization strategies. Similarly, Coakley et al. (2023) found that updated hardware and software environments alone can introduce accuracy differences of over 6%, highlighting the impact of implementation-level changes on reproducibility.

While the observed accuracy gains (e.g., 75% → 99% for VGG-16 on CIFAR-100) are substantial, we emphasize that both network deconvolution (ND) and batch normalization (BN) baselines improved comparably, suggesting that these gains are driven by systemic factors rather than being specific to any particular method. Although we have provided plausible explanations based on updates in software libraries, we acknowledge that these factors may not fully account for the magnitude of the observed differences. Further investigation into these aspects is warranted, but we reiterate that the primary objective was to assess reproducibility rather than to deconstruct every underlying cause of performance variation.

We provided detailed explanations in Section 6 Paragraphs 2 and 3 in the revised manuscript.

Sharing trained weights for verification

As far as I understand, running training runs seems to be the only way to test the codebase. Due to lack of hardware I am unable to do so, but to anyone who can and would like to quickly verify the claims, releasing the trained weights at 100 epochs and a minimal script to infer the results would be greatly helpful.

We acknowledge the reviewer's (@alchemi5t) request and have taken steps to facilitate easy verification of our results. The trained model weights at 100 epochs have been made publicly available at https://osf.io/hp3ab/files/osfstorage, corresponding to the results presented in Tables 1 and 2. Additionally, we provide two minimal inference scripts (test_script_for_table_1.py and test_script_for_table_2.py) that allow users to reproduce our reported accuracy without requiring extensive computational resources. This ensures that our findings can be easily validated by downloading the weights and running a single command.

The steps have been explained in the newly added section, "To Validate the Results Using the Trained Weights (Direct Inference Without Training)" in the GitHub README.md file and in Section 5.4 of the revised manuscript.

For example, the following command can be used to verify results for the VGG-16 architecture on CIFAR-10 with batch normalization:

python test_script_table_1.py --arch vgg16 --dataset cifar10 --deconv False --model_path "checkpoints/cifar10_vgg16_BN.pth.tar"
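
For reference, here is a minimal sketch of what such an inference script could look like. The flag names mirror the command above, but build_model, the checkpoint key handling, and the CIFAR normalization constants are illustrative assumptions rather than the repository's actual code.

```python
# Hedged sketch of a minimal evaluation script; not the authors' implementation.
import argparse

import torch
import torchvision
import torchvision.transforms as T


def build_model(arch, num_classes, deconv):
    # Placeholder for the repository's model factory; the real script must build the
    # exact architecture (BN or deconv variant) that the checkpoint was trained with.
    if arch == "vgg16" and not deconv:
        return torchvision.models.vgg16_bn(num_classes=num_classes)
    raise NotImplementedError("plug in the repository's model definitions here")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--arch", default="vgg16")
    parser.add_argument("--dataset", default="cifar10", choices=["cifar10", "cifar100"])
    parser.add_argument("--deconv", default="False")
    parser.add_argument("--model_path", required=True)
    args = parser.parse_args()
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Standard CIFAR test-time preprocessing (normalization constants assumed).
    transform = T.Compose([
        T.ToTensor(),
        T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    ])
    ds_cls = (torchvision.datasets.CIFAR10 if args.dataset == "cifar10"
              else torchvision.datasets.CIFAR100)
    test_set = ds_cls(root="./data", train=False, download=True, transform=transform)
    loader = torch.utils.data.DataLoader(test_set, batch_size=256, num_workers=2)

    model = build_model(args.arch, num_classes=len(test_set.classes),
                        deconv=args.deconv.lower() == "true").to(device)
    ckpt = torch.load(args.model_path, map_location=device)
    model.load_state_dict(ckpt.get("state_dict", ckpt))
    model.eval()

    # Top-1 accuracy on the held-out test split.
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    print(f"{args.arch}/{args.dataset}: top-1 accuracy = {100.0 * correct / total:.2f}%")


if __name__ == "__main__":
    main()
```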

Clarification on Software Versions (Reviewer 2’s Comment @jsta)

Section 1, last paragraph, says "our study attempts to reproduce the results reported in the original paper with the most recent versions of software libraries.". This is not accurate, for example, Tensorflow 2.12 is years old. Maybe change to "the most recent version of software libraries as of [some date]"?

In response to Reviewer 2’s remark regarding our statement in Section 1, we have revised the language. The original phrase “with the most recent versions of software libraries” has been updated to “with the most recent versions of software libraries as of 2020” to prevent any misunderstanding (Section 1 Page 3). During our study, we attempted to contact the original authors to obtain the precise library versions used in their experiments, as the repository did not include this information. Unfortunately, we did not receive a response. Therefore, we adopted the following approach: we used the latest available versions of each library as of 2020 and opted for more recent versions only when compatibility issues arose. The details are explained in Section 4.5 of the revised manuscript.

Summary and a list of changes made based on the reviewers’ and editor’s comments

*All the updates in the manuscript are highlighted in blue.

Reviewer 1 (@alchemi5t):

  • Made trained weights public and introduced two scripts for direct model inference without training (revised manuscript Section 5.4)
  • Explained our main objective and provided explanation on accuracy discrepancies (revised manuscript Section 6)

Reviewer 2 (@jsta):

  • major comments:
    • revised the sentence in the manuscript Section 1 on Page 3.
  • minor comments:
    • grammar mistake corrected (Section 4.1)
    • -a densenet121 in GitHub file imagenet_single_experiment_densenet121.sh isn’t a typo
  • highly optional and/or minor curiosity questions:
    • The results were similar for all attempts, so the variation was minimal.
    • Since these are three different datasets, dataset-inherent factors may play a role. However, we suspect that the higher accuracy in the reproduced results for CIFAR-100 compared to CIFAR-10 and ImageNet is primarily due to improvements in regularization techniques and optimization methods in newer versions of PyTorch. (revised manuscript Section 6)
    • Mentioned downloading the CIFAR-10 and CIFAR-100 datasets via torchvision datasets in the README.md section “To reproduce results of our reproducibility study”, list Item 2
    • Added a --time parameter to the sbatch commands (updated in the README.md section “Steps we have followed to reproduce the original study”, list Items 5 and 6)
    • The intention was to include model names in the columns, but due to space constraints in Tables 4 and 5, we had to use number codes instead

Editor (@rougier):

  • Explained why it is not possible to use the exact same software stack of the original study with our reproducibility study (explained in the revised manuscript Section 4.5 and in this response)
  • Provided an explanation to Reviewer 1’s accuracy discrepancies (revised manuscript Section 6 newly added content in blue and in this GitHub response)

Revised Manuscript URL: https://github.com/lamps-lab/rep-network-deconvolution/blob/master/article.pdf Metadata URL: https://github.com/lamps-lab/rep-network-deconvolution/blob/master/metadata.yaml Code URL: https://github.com/lamps-lab/rep-network-deconvolution

rochanaro avatar Apr 09 '25 23:04 rochanaro

@rochanaro Thanks for your detailed answer. @jsta I guess your thumbs-up means you're satisfied with the answer and the revised manuscript. @alchemi5t are you ok with the answer and revised manuscript?

rougier avatar Apr 11 '25 07:04 rougier

@rougier yes, I am satisfied with the response.

alchemi5t avatar Apr 11 '25 13:04 alchemi5t

@rougier yes, I am satisfied with the response.

jsta avatar Apr 11 '25 14:04 jsta

Thank you @alchemi5t and @jsta, we appreciate the replies.

rochanaro avatar Apr 17 '25 16:04 rochanaro

Great, then we can accept your replication! Congratulations.

For the actual publication, I will need a link to a GitHub repo with the sources of your article. And you'll need to archive your code repo on Software Heritage so as to get a SWHID.

rougier avatar Apr 17 '25 16:04 rougier