[Re] Network Deconvolution
Original article: C. Ye, M. Evanusa, H. He, A. Mitrokhin, T. Goldstein, J. A. Yorke, C. Fermüller, and Y. Aloimonos. “Network Deconvolution.” In: ICLR (2020).
PDF URL: https://github.com/lamps-lab/rep-network-deconvolution/blob/master/article.pdf Metadata URL: https://github.com/lamps-lab/rep-network-deconvolution/blob/master/metadata.yaml Code URL: https://github.com/lamps-lab/rep-network-deconvolution
Scientific domain: Machine Learning Programming language: Python Suggested editor:
Thanks for your submission. We'll assign an editor soon.
By the way, is this submission part of the ICLR reproducibility challenge? If yes, are there any open reviews somewhere?
Thank You!
No, our work was not submitted to ICLR reproducibility challenge.
Very sorry for such a long delay, hopefully things will get better in 2025. I was asking the question because the format of the PDF is very similar to the ICLR challenge. Note that this is not a problem at all; the idea was to re-use reviews if they were available.
I'll edit your review and assign reviewers soon hopefully.
In the meantime, can you have a look at other submissions and propose yourself to review?
@birdortyedi @MiWeiss Could you review this submission?
Thank you for the update! We understand how busy things can get.
Yes, for organizing the content, we tried to follow the structure of the MLRC 2022 template, simply to ensure clarity.
I’ll certainly take a look at other submissions and will propose myself as a reviewer where I can contribute.
Sounds like an interesting paper and a good match for me. Unfortunately, though, I won't be able to review a paper in the next months.
@rougier
@ReScience/reviewers We're looking for two reviewers (Machine learning / Python / ICLR), any takers?
@rochanaro Don't hesitate to post here to ask for update
@rougier I would like to review this one.
@alchemi5t Many thanks! You can start the review now and let's target mid-March (or sooner if you can)
@rougier I am interested in reviewing this one if a second review is needed.
@jsta That would be great, many thanks. Do you think mid-March would work for you?
Yes
@rougier I'm not getting responses to PRs and issues in the code repo. Can we delay the due date pending a response from @rochanaro or their team?
@jsta Apologies for the delayed response. We encountered an issue with notification delivery for pull requests and issues. However, we have now addressed all queries received thus far.
@alchemi5t @jsta Any progress on your reviews?
@rougier I'll have my review by the end of the week.
@rougier Yes, I have some general comments prepared. I will post them once I have finished a couple of tests to verify general behavior. I do not plan to run all the tests due to computational constraints.
Dear Authors,
Here's my review of your work.
Review of "[Re] Network Deconvolution"
This reproduction study provides a thorough evaluation of the network deconvolution technique introduced by Ye et al. (2020). After examining both the original paper and this reproduction, I find that the authors have validated the primary claim about network deconvolution for the most part. While there are many cases where network deconvolution improves model performance compared to batch normalization, the reproduction results show a few exceptions that contradict the original paper's universal claim. For example, for the ResNet-18 architecture on CIFAR-100 at 100 epochs, batch normalization (97.42%) actually outperformed network deconvolution (94.31%), which contradicts both the original paper's results and its central claim.
Strengths
- Comprehensive testing: The authors tested 10 modern neural network architectures on CIFAR-10/100 and ImageNet datasets, over 3 runs
- Training time analysis: The authors went beyond the original paper by analyzing computational overhead, showing that network deconvolution requires more training time
Key Observations
The few exceptions where BN seems to have outperformed ND are not noted, and there is no analysis around them. Furthermore, this statement is not true given the results: "The results show that the model performance with network deconvolution is always better than the model performance with batch normalization."
Another notable finding is that the reproduced accuracy values were often significantly higher than those reported in the original paper. For example:
- For VGG-16 with CIFAR-100 at 100 epochs, the original paper reported 75.32% accuracy with network deconvolution, while the reproduction achieved 99.30%
- Similar large improvements were observed across most architectures and datasets
The authors attribute this systematic improvement to:
- Advancements in numerical stability of libraries (NumPy 1.16.1 → 1.23.5, PyTorch 1.0 → 1.13)
- Improved optimization algorithms and parallelism in Tensorflow
While these explanations are plausible, the magnitude of improvement (sometimes exceeding 20%) suggests there might be additional factors at play that weren't fully investigated. Given that this improvement has pushed most accuracies (both ND and BN) up to 99.xx, where the comparison comes down to differences on the order of 10e-2 (vs. the 1-2 point gaps in the original paper), I expected deeper analysis and stronger evidence for these claims.
Code
As far as I understand, running full training runs seems to be the only way to test the codebase. Due to a lack of hardware I am unable to do so, but releasing the trained weights at 100 epochs and a minimal script to infer the results would be greatly helpful to anyone who can and would like to quickly verify the claims.
Opinion
This reproduction study provides valuable insights but reveals important discrepancies with the original paper's claims. While network deconvolution often outperforms batch normalization, the reproduction found notable exceptions.
The reproduction yielded substantially higher accuracy values for both techniques compared to those reported in the original paper. These significant discrepancies make it difficult to draw direct comparisons with the original results, and the proposed explanations for these differences (library improvements, optimization algorithms) remain speculative without rigorous empirical validation.
Rather than enhancing confidence in network deconvolution as a universal improvement over batch normalization, this reproduction suggests a more nuanced view: network deconvolution appears to be a viable alternative that performs better in many but not all scenarios. The authors' detailed reporting of computational costs and performance characteristics across architectures provides essential practical context for researchers considering which normalization technique to employ.
Ok, I have completed my review. Please see the comments below. In addition to reading the paper and git repository, I verified that I could run 2 arbitrary architectures (vgg16 and pnasnetA) for 100 epochs for both the BN and ND cases, and got similar relative accuracy results.
major comments
- Section 1, last paragraph, says "our study attempts to reproduce the results reported in the original paper with the most recent versions of software libraries.". This is not accurate, for example, Tensorflow 2.12 is years old. Maybe change to "the most recent version of software libraries as of [some date]"?
minor comments
- Section 4.1, first sentence, says "compared it against the state‐of‐the‐art against". This seems like a typo or grammar mistake.
- Is there a typo in imagenet_single_experiment_densenet121.sh: should -a densenet121 -> -a densenet121d?
highly optional and/or minor curiosity questions
- Section 5.1, first paragraph, says "[...] we averaged the results of three attempts and compared them with the original study’s reported values.". Was there variation among the attempts such that running each model/architecture combination that many times was worthwhile?
- Figure 3, why were the reproduced values so much more accurate than the original for CIFAR100 but not CIFAR10 or Imagenet?
- There is some information in the repository README that duplicates the paper. I wonder if you could remove the information about the steps you took to create the repo and focus the README on how users will interact with it. Get them up and running quickly without wading through extraneous details. Conversely, the paper says that torchvision datasets are downloaded on-the-fly, but this is not in the README.
- It would be nice to add a time argument to the sbatch files so users have a rough idea of how long they're expected to run.
- It's strange that Table 8 lists out the model names while Table 4 keys them out by number code.
@alchemi5t @jsta Many thanks for your detailed reviews. @rochanaro Can you address the comments and specifically address @alchemi5t's concerns about performance? It seems the change in numerical stability might not be a plausible reason for the enhanced performance. Any idea on that? Would it be possible to rerun your code (without too much hassle) with the exact same stack as the original paper?
Thank you @alchemi5t and @jsta for the reviews. @rougier It would be challenging to redo the experiment using the exact same dependencies as the original paper. During our study, we attempted to contact the original authors to obtain the precise library versions, as the original study repository does not mention them, but we did not receive a response. Therefore, we adopted the following approach: we used the latest available versions of each library as of 2020 and only opted for more recent versions when compatibility issues arose.
We will soon get back with answers, revisions, and updates addressing the above concerns in our upcoming response.
Full response to ReScience C reviews (submission #89)
Dear editor (@rougier) and the reviewers (@alchemi5t, @jsta) ,
We appreciate the reviewers’ careful evaluation of our reproduction study. Below is our detailed response addressing each of the reviewers’ points:
Performance Discrepancies and Exceptional Cases
While there are many cases where network deconvolution improves model performance compared to batch normalization, the reproduction results show a few exceptions that contradict the original paper's universal claim. For example, for the ResNet-18 architecture on CIFAR-100 at 100 epochs, batch normalization (97.42%) actually outperformed network deconvolution (94.31%), which contradicts both the original paper's results and its central claim.
We appreciate the observation that batch normalization (BN) outperformed network deconvolution in certain cases (e.g., ResNet-18 on CIFAR-100 at 100 epochs), and we acknowledge these exceptions. Our study aimed to carefully reproduce the experiments rather than optimize hyperparameters or modify architectural choices. In the literature, such variations are not uncommon when replicating older experiments (see, e.g., Examining the Effect of Implementation Factors on Deep Learning Reproducibility, A Study on Reproducibility and Replicability of Table Structure Recognition Methods, Towards training reproducible deep learning models). We do not claim that network deconvolution universally outperforms BN; rather, our findings suggest that while ND often provides benefits, its effectiveness may be context dependent (explained in manuscript section 6 paragraph 4).
Systematic Accuracy Improvements
Another notable finding is that the reproduced accuracy values were often significantly higher than those reported in the original paper. For example: • For VGG-16 with CIFAR-100 at 100 epochs, the original paper reported 75.32% accuracy with network deconvolution, while the reproduction achieved 99.30% • Similar large improvements were observed across most architectures and datasets
We acknowledge the reviewer’s observation regarding systematically higher accuracy values (e.g., VGG-16 with CIFAR-100) and agree on the importance of understanding these differences. We believe that several factors contribute to these improvements. Notably, advancements in numerical stability, as documented in recent studies (e.g., Recent advances and applications of deep learning methods in materials science, A Comprehensive Review of Deep Learning: Architectures, Recent Advances, and Applications), and enhancements in optimization algorithms may play a significant role.
Additionally, prior research supports the idea that updated frameworks can improve model performance. Mienye et al. (2024), Qin et al. (2024), and Vaishnav et al. (2022) demonstrated that deep learning models consistently achieve higher accuracy when trained with updated frameworks due to improvements in optimization techniques and regularization strategies. Similarly, Coakley et al. (2023) found that updated hardware and software environments alone can introduce accuracy differences of over 6%, highlighting the impact of implementation-level changes on reproducibility.
While the observed accuracy gains (e.g., 75% → 99% for VGG-16 on CIFAR-100) are substantial, we emphasize that both network deconvolution (ND) and batch normalization (BN) baselines improved comparably, suggesting that these gains are driven by systemic factors rather than being specific to any particular method. Although we have provided plausible explanations based on updates in software libraries, we acknowledge that these factors may not fully account for the magnitude of the observed differences. Further investigation into these aspects is warranted, but we reiterate that the primary objective was to assess reproducibility rather than to deconstruct every underlying cause of performance variation.
We provided detailed explanations in Section 6 Paragraphs 2 and 3 in the revised manuscript.
Sharing trained weights for verification
As far as I understand, running full training runs seems to be the only way to test the codebase. Due to a lack of hardware I am unable to do so, but releasing the trained weights at 100 epochs and a minimal script to infer the results would be greatly helpful to anyone who can and would like to quickly verify the claims.
We acknowledge the reviewer's (@alchemi5t) request and have taken steps to facilitate easy verification of our results. The trained model weights at 100 epochs have been made publicly available at https://osf.io/hp3ab/files/osfstorage , corresponding to the results presented in Tables 1 and 2. Additionally, we provide two minimal inference scripts (test_script_for_table_1.py and test_script_for_table_2.py) that allow users to reproduce our reported accuracy without requiring extensive computational resources. This ensures that our findings can be easily validated by downloading the weights and running a single command.
The steps have been explained in the newly added section, "To Validate the Results Using the Trained Weights (Direct Inference Without Training)" in the GitHub README.md file and in Section 5.4 of the revised manuscript.
For example, the following command can be used to verify results for the VGG-16 architecture on CIFAR-10 with batch normalization:
python test_script_table_1.py --arch vgg16 --dataset cifar10 --deconv False --model_path "checkpoints/cifar10_vgg16_BN.pth.tar"
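To give a sense of what such a minimal check involves, here is a rough sketch of an inference-only evaluation for the VGG-16 / CIFAR-10 / BN case. This is only an illustration under assumptions: the model builder (models.vgg.VGG), the "state_dict" checkpoint key, and the normalization constants are placeholders and may differ from what the actual test_script_for_table_1.py in the repository does.

import argparse
import torch
import torchvision
import torchvision.transforms as transforms

# Hypothetical sketch: load a saved checkpoint and report CIFAR-10 test accuracy.
# The model factory and checkpoint layout are assumptions, not the repo's exact code.

def evaluate(model, device):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    ])
    testset = torchvision.datasets.CIFAR10(root="./data", train=False,
                                           download=True, transform=transform)
    loader = torch.utils.data.DataLoader(testset, batch_size=256, shuffle=False)
    model.to(device).eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return 100.0 * correct / total

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", required=True)
    args = parser.parse_args()

    from models.vgg import VGG  # assumed import path; replace with the repo's model builder
    model = VGG("VGG16")
    checkpoint = torch.load(args.model_path, map_location="cpu")
    model.load_state_dict(checkpoint["state_dict"])

    device = "cuda" if torch.cuda.is_available() else "cpu"
    print("Test accuracy: {:.2f}%".format(evaluate(model, device)))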
Clarification on Software Versions (Reviewer 2’s Comment @jsta)
Section 1, last paragraph, says "our study attempts to reproduce the results reported in the original paper with the most recent versions of software libraries.". This is not accurate, for example, Tensorflow 2.12 is years old. Maybe change to "the most recent version of software libraries as of [some date]"?
In response to Reviewer 2’s remark regarding our statement in Section 1, we have revised the language. The original phrase “with the most recent versions of software libraries” has been updated to “with the most recent versions of software libraries as of 2020” to prevent any misunderstanding (Section 1 Page 3). During our study, we attempted to contact the original authors to obtain the precise library versions used in their experiments, as the repository did not include this information. Unfortunately, we did not receive a response. Therefore, we adopted the following approach: we used the latest available versions of each library as of 2020 and opted for more recent versions only when compatibility issues arose. The details are explained in Section 4.5 of the revised manuscript.
Summary and a list of changes made based on the reviewers’ and editor’s comments
*All updates in the manuscript are highlighted in blue.
Reviewer 1 (@alchemi5t):
- Made trained weights public and introduced two scripts for direct model inference without training (revised manuscript Section 5.4)
- Explained our main objective and provided explanation on accuracy discrepancies (revised manuscript Section 6)
Reviewer 2 (@jsta):
- major comments:
- revised the sentence in the manuscript Section 1 on Page 3.
- minor comments:
- grammar mistake corrected (Section 4.1)
- -a densenet121 in the GitHub file imagenet_single_experiment_densenet121.sh isn't a typo
- highly optional and/or minor curiosity questions:
- The results were similar for all attempts, so the variation was minimal.
- Since these are three different datasets, dataset-inherent factors may play a role. However, we suspect that the higher accuracy in the reproduced results for CIFAR-100 compared to CIFAR-10 and ImageNet is primarily due to improvements in regularization techniques and optimization methods in newer versions of PyTorch. (revised manuscript Section 6)
- Mentioned downloading the CIFAR-10 and CIFAR-100 datasets via torchvision datasets in the README.md section “To reproduce results of our reproducibility study”, list Item 2 (see the short sketch after this list)
- Added the --time parameter for sbatch commands (updated in README.md section “Steps we have followed to reproduce the original study”, list Items 5 and 6)
- The intention was to include model names in the columns, but due to space constraints in Tables 4 and 5, we had to use number codes instead
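As a small illustration of the on-the-fly download mentioned above, a minimal sketch using the standard torchvision API; the root path ./data is an arbitrary choice, not necessarily the one used in the repository.

import torchvision
import torchvision.transforms as transforms

# CIFAR-10/100 are fetched automatically on first use when download=True;
# later runs reuse the local copy under `root`.
transform = transforms.ToTensor()
cifar10_train = torchvision.datasets.CIFAR10(root="./data", train=True,
                                             download=True, transform=transform)
cifar100_train = torchvision.datasets.CIFAR100(root="./data", train=True,
                                               download=True, transform=transform)
print(len(cifar10_train), len(cifar100_train))  # 50000 50000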
Editor (@rougier):
- Explained why it is not possible to use the exact same software stack of the original study with our reproducibility study (explained in the revised manuscript Section 4.5 and in this response)
- Provided an explanation to Reviewer 1’s accuracy discrepancies (revised manuscript Section 6 newly added content in blue and in this GitHub response)
Revised Manuscript URL: https://github.com/lamps-lab/rep-network-deconvolution/blob/master/article.pdf Metadata URL: https://github.com/lamps-lab/rep-network-deconvolution/blob/master/metadata.yaml Code URL: https://github.com/lamps-lab/rep-network-deconvolution
@rochanaro Thanks for your detailed answer. @jsta I guess your thumbs-up means you're satisfied with the answer and the revised manuscript. @alchemi5t are you ok with the answer and the revised manuscript?
@rougier yes, I am satisfied with the response.
Thank you @alchemi5t and @jsta, we appreciate the replies.
Great, then we can accept your replication! Congratulations.
For the actual publication, I would need a link to a GitHub repo with the sources of your article. And you'll need to save your code repo on Software Heritage so as to get a SWHID.