
[Re] Teaching CLIP to Count to Ten

Open Harshvardhan-Mestha opened this issue 1 year ago • 12 comments

Original article: Teaching CLIP to Count to Ten (https://arxiv.org/pdf/2302.12066)

PDF URL: https://github.com/SforAiDl/CountCLIP/blob/main/resc/ReScience.pdf
Metadata URL: https://github.com/SforAiDl/CountCLIP/blob/main/resc/metadata.yaml
Code URL: https://github.com/SforAiDl/CountCLIP/blob/main/model.ipynb

Scientific domain: Computer Vision
Programming language: Python
Suggested editor:

Harshvardhan-Mestha avatar Jun 05 '24 16:06 Harshvardhan-Mestha

Thanks for your submission and very sorry for the delay. We'll soon assign an editor.

rougier avatar Jun 24 '24 09:06 rougier

@benoit-girard @gdetor @koustuvsinha Can any of you edit this paper? I may have some names for reviewers.

rougier avatar Jun 24 '24 09:06 rougier

Hello reviewers,

This is Karan, one of the authors. Are there any updates for us?

Thanks.

karannb avatar Aug 14 '24 17:08 karannb

Very sorry for the long delay. I'll edit your submission. Don't hesitate to post here to remind us.

rougier avatar Oct 11 '24 13:10 rougier

@anthony-strock @ilemhadri @bengioe @hhihn Would you be interested in reviewing this submission: Teaching CLIP to Count to Ten (https://arxiv.org/pdf/2302.12066)?

rougier avatar Oct 11 '24 13:10 rougier

Hello, are there any updates on this?

Harshvardhan-Mestha avatar Apr 01 '25 19:04 Harshvardhan-Mestha

@rougier I can review this submission.

gdetor avatar Apr 01 '25 19:04 gdetor

@gdetor Thanks. I'll ask anthony-strock by email.

rougier avatar Apr 11 '25 08:04 rougier

@rougier Happy to review this submission.

a-strock avatar Apr 11 '25 13:04 a-strock

Hi @rougier and @Harshvardhan-Mestha, here is my report:

Text

Overall, the text is coherent and well-written. There are a few typos here and there; for instance, the parameter λ sometimes appears as λ and other times as "lambda". Please fix that. Moreover, Figures 2 and 3 should be larger. My primary concern is that the authors claim in the main text that the baseline accuracy in the work of (Paiss et al., 2023) is 27.5%. Paiss reports the official baseline accuracy on the CountBench dataset to be 31.67% (CLIP-B/32) (Table 1A in Paiss). I didn't manage to run the code due to some errors and broken dependencies (see my comments below).

Source Code

  1. The script experiment.py does not run due to a missing merged.json file.
  2. The google-colab library is not included in requirements.txt. You can add a line like git+https://github.com/<project_owner>/<project_name> to your requirements.txt.
  3. Also, what version of Google Colab did you use? There were some broken dependencies on my installation.
  4. There is a missing license file. Please add the appropriate license.
  5. I recommend cleaning all cells' outputs in your notebooks. Moreover, adding some comments/text in the notebooks explaining the basic steps and operations would be better.
  6. In the usage section of the README file, note that requirements.txt is located in the parent directory, not in the scripts directory.
  7. Please pass all your Python scripts through autopep8 to comply with PEP 8.
  8. Finally, cleaning up and organizing the repository would be nice.

gdetor avatar Apr 19 '25 01:04 gdetor

Hi @rougier and @Harshvardhan-Mestha, here is my report.

Review of the article

  1. I am not sure what the standard in ReScience is for distinguishing between the paper being replicated and the paper submitted to ReScience. In the following, I use "original paper" for the former and "submitted paper" for the latter.
    • It would be helpful for the reader if an explicit distinction of this sort were also made consistently in the submitted paper, for instance by consistently using "the original paper" for the original paper and "this paper" for the submitted paper.

Major concern

  1. The original paper seems to have replicability issues which, I believe, are not discussed enough in the submitted paper.

    • None of the datasets used for training seem to have been made public in the original paper; only the test dataset was. The process used to build one of the training datasets has been shared, but with some missing details. For instance, I'm not sure what the "very large dataset collected from the Web that contains general in-the-wild images and captions" from the original paper was, or which dataset was used to build dataset C from the original paper (the former one?). The original paper says that the test dataset CountBench is built from LAION-400M, but does not seem to say whether the training datasets are also created from LAION-400M. From what I understand, because the training datasets were missing, this replication attempted to build them on a smaller scale from LAION-400M. The authors of the original paper seem to have been contacted, but from reading the submitted paper it is not clear to me which information was missing and requested.
    • The exact "off-the-shelf object detector" used to build the training dataset in the original paper is not provided (from the reference it seemed that it was built from one of the MobileNetV3 variants [1]). This replication used another one (YOLOv8), without mentioning in the submitted paper why a different one was used.
      • [1] Howard, A., Sandler, M., Chu, G., Chen, L. C., Chen, B., Tan, M., ... & Adam, H. (2019). Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 1314-1324).
  2. I'm not convinced that the submitted paper can claim that the original paper has been successfully replicated, as the major numerical result of the original paper does not seem to be replicated. The best accuracy reported here is ~30%, which is less than the baseline reported in the original paper and far below the ~75% reported there. The submitted paper reports a ~1% increase over the baseline, whereas in the original paper it was about ~45%.

  3. I'm not convinced that the submitted paper can claim that the original paper has not been successfully replicated either, as training was performed on much smaller datasets, potentially obtained from a different source, which might also have influenced the results shown in the original paper.

  4. The submitted paper omits some of the results presented in the original paper without discussing this.

    • The submitted paper does not provide the mean deviation from the correct number, which was present in the original paper. In my opinion, this metric has to be included as it provides a complementary understanding of the numerical estimation capabilities of the model.
    • The submitted paper does not replicate the relevancy map results (Figures 5-6 of the original paper). In my opinion, if there is a good reason for this, it has to be discussed in the submitted paper.
    • The submitted paper does not replicate the image retrieval (Figure 7 of the original paper) or image generation results (Figure 8 of the original paper). In my opinion, this decision has to be discussed in the submitted paper.

Methods

  1. From what I understand, there are multiple datasets used here (subsets of LAION-400M); it could help the reader to make this clear in Section 2.1, e.g. with a separate subsubsection for each dataset.

  2. "Due to our size and computational constraints, we passed over 2 million images [...]" -> The process to choose the 2M subset from the initial 400M has to be described.

  3. "The method to generate the counting set was highly unfeasible when running over the full dataset." -> I am not sure how it is possible to assess if it is highly unfeasible without numerical estimation of computational/time costs and constraints.

  4. "We have made our code and counting set public." -> A direct reference here to the code link (https://github.com/SforAiDl/CountCLIP) and data link (https://zenodo.org/records/10981852) could help reader.

  5. "We contacted the authors of the papers, and they stated that while they possess the image files, it was against their company policy to share raw images sourced from publicly available data." -> In this sentence, it is not clear to me if it refers to the missing training dataset or the non-functioning URL in CountBench. How was handled the missing data in CountBench? In my opinion, it has to be described in here.

  6. "The benchmark is the only one of its kind, and it was carefully curated, to ensure class balance." -> Not clear to me if this sentence refers to CountBench or to the dataset the submitted paper is providing. It should probably be clarified in text.

  7. In my opinion, in section 2.2 it is not enough to only describe the losses used. It should also describe how many steps/epochs were used, which optimizer is used with which parameters, and how all of this compares to the original paper. Some of this information is in the Results section and should probably be moved here.

  8. There is no explicit discussion of the process used to match the original paper as closely as possible, or of why some values were changed.

    • For instance, the learning rate is the same in both (i.e. 5e-6), but the batch size in the original paper was 32768, while in the submitted paper it is 5, which is known to affect the effective learning rate [2].
      • [2] Smith, S. L., Kindermans, P. J., Ying, C., & Le, Q. V. (2018, February). Don't Decay the Learning Rate, Increase the Batch Size. In International Conference on Learning Representations.
    • As the dataset size differs between the original and the submitted paper, I believe it might be insightful to compare the duration of training in number of images seen (see the back-of-the-envelope sketch after this list). The original paper trains for 10000 steps with batch size 32768, so in total training used about 330M images. The submitted paper trains for 10 epochs on a dataset of size 15605, so in total its training used about 150K images, i.e. roughly 2000 times fewer. This might be a potential reason why the accuracy results reported in the submitted paper are worse than those reported in the original paper, i.e. training might have been stopped too early.
    • "Paiss et al.[1]: bsize = 32, 768, p = 1/32, ncount = 200, 000; Ours: bsize = 5, p = 1/5, ncount = 2000." -> p has to be defined. I'm not sure to understand what led to the decision p=1/5 in the submitted paper, when the original paper seems to show that smaller proportion were better (Table 2 from supplementary). In any case, in my opinion, the process that led to take a different decision from the original paper should be discussed in the submitted paper.

Results

  1. "The baseline accuracy for a CLIP B/32 model on the CountBench dataset is 27.5% with no training." -> For ease of reading, baseline can be added to Table 1.

Figures

  1. Labels on the figures are very small, which makes all figures difficult to read.

    • Figure 2: Can be handled by either (a) reducing the figsize of the matplotlib figure or (b) increasing the size of the figure in the PDF.
    • Figures 3-5: Same as Figure 2. Option (a) might cause the x-axis labels to overlap, which can be fixed by either replacing the word description of each number with the numerical values 2-10 or rotating the x-axis labels as in the original paper (see the sketch after this list).
  2. Figures 3-5: The colorbar label is missing.

  3. The criterion for early stopping is not described for Figure 3.
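As an illustration of points 1-2 above, here is a minimal matplotlib sketch (not the authors' plotting code): a smaller figsize, numeric/rotated x-axis labels, and a labelled colorbar. The data matrix and the "accuracy" label are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.rand(9, 9)              # placeholder matrix for the numbers 2-10

fig, ax = plt.subplots(figsize=(4, 3))   # (a) smaller figsize -> relatively larger labels
im = ax.imshow(data)
ax.set_xticks(range(9))
ax.set_xticklabels([str(n) for n in range(2, 11)])  # numerals instead of words...
ax.tick_params(axis="x", rotation=45)               # ...or rotate labels as in the original paper
cbar = fig.colorbar(im, ax=ax)
cbar.set_label("accuracy")               # add the missing colorbar label (placeholder name)
fig.tight_layout()
```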

Review of the code

General comment

  1. I'm not sure at what level I am expected to review/rerun the code before deciding whether it is possible to claim that the submitted paper does or does not replicate the original paper, but I tried to run some of the code anyway.

  2. I'm not sure what I am expected to evaluate here: (1) the replication of the dataset creation, (2) the replication of model performance in the zero-shot counting task, or (1) + (2). For now, I've focused mainly on (2).

  3. The code is included in Jupyter notebooks that have to be run on Google Colab. I am not sure whether I am expected to run the code on Google Colab, or whether the code provided is expected to be runnable outside Google Colab as well. I have tried both: running the notebooks on Google Colab works, but the notebooks I've tried cannot be run as-is outside of Google Colab. Some packages are missing from requirements.txt, and some packages might not be accessible outside of Google Colab (e.g. files from the google.colab package).

  4. The README.md describes the role of each notebook, but detailed instructions on which notebook has to be run in which order are absent from README.md; it would help reproducibility if that sequential process were described, for instance something along these lines:

1. Download external data
   a) Run download.ipynb for downloading LAION-400M
   b) Run cb_download.ipynb for downloading CountBench
2. Curate external data
   a) Run create_json.ipynb to create JSON files for both LAION-400M and CountBench
   b) Run merge.ipynb to merge both JSON files
   c) Run parse_faulty.ipynb to detect faulty data in the JSON files
   d) Run count_set_gen.ipynb to create training counting/non-counting dataset from LAION-400M
3. Run models
   a) Run model.ipynb for augmented CLIP (changing cell X for varying parameters)
   b) Run baseline.ipynb for baseline CLIP 
  5. In my opinion, code that is not used in the replication should probably be removed from the GitHub repository (e.g. the old folder).

  6. The code expects the user to have a wandb account; it might be helpful to mention this somewhere in README.md.

  7. Seeds (random/numpy/torch) don't seem to be fixed, which might be important for reproducibility (a minimal seeding sketch is given below).
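For reference, a minimal seeding sketch; this is an assumption of what fixing the seeds could look like, not code from the repository.

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 0) -> None:
    """Fix Python, NumPy and PyTorch seeds for reproducibility (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op when no GPU is available


set_seed(0)
```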

Attempt to run the code on Google Colab

  1. I've run only the models (augmented CLIPs and baseline CLIP) without manually changing the parameters.

Running jupyter notebook model.ipynb

  1. Cell 4: Maybe it's possible to set the wandb key as a secret in Google Colab, so that there is no need to change the code per se before running, and to indicate in README.md that users should add this secret. For instance:

    • I added my wandb key as a secret named wandbkey.
    • I added from google.colab import userdata.
    • I replaced wandb.login(key='<yourapikey>') with wandb.login(key=userdata.get('wandbkey')). Putting these steps together gives something like the sketch below.
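A sketch of what that cell could look like, assuming a Colab secret named wandbkey has been added via the Colab Secrets panel; this is not the authors' original cell.

```python
import wandb
from google.colab import userdata  # only available when running inside Google Colab

# Read the API key from the Colab secret instead of hard-coding it in the notebook.
wandb.login(key=userdata.get('wandbkey'))
```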
  2. Cell 6: In the current notebook, the user has to manually change the values in this cell. I'm not sure what the ReScience standard is, but in my opinion it might be helpful to indicate which parameters are used for each panel in Figure 3b-e, or even Figure 4b-l.

  3. After running without changing the parameters in the notebook, the numerical values match qualitatively but do not match precisely. What is the source of non-determinism? If I understand correctly, the comparison should be made with the configuration λ = 1 with no scheduler, i.e. the first line of Table 1, and with Figures 4b and 5b.

    • After 10 epochs (i.e. epoch 9): the code prints Validation Accuracy: 0.18467583497053044, but 21.81 was reported in Table 1.
    • Early stopping at the best validation accuracy (i.e. epoch 4): the code prints Validation Accuracy: 0.2455795677799607, but 25.15 was reported in Table 1.

Running jupyter notebook baseline.ipynb

  1. The validation accuracy for the baseline is not printed; print(f"Validation Accuracy: {val_acc}") is missing from get_preds or from cell 18.

  2. All numerical values match precisely.

    • Validation Accuracy: 0.275049115913556, reported as 27.5 in the submitted paper.

Attempt to run the code in local

Installing environment

  1. There are typos in the README.md:
    • cd CountCLIP/scripts -> cd CountCLIP/script
    • pip install requirements.txt -> pip install -r requirements.txt
      • Probably has to be done before the cd
    • What's the exact version of Python that was used to run the code? 3.10.x?

Running jupyter notebook model.ipynb

  1. The jupyter notebook package and its dependencies are not in requirements.txt.

  2. Cell 1: /bin/bash: gdown: command not found

    • gdown package missing from requirements.txt
  3. Cell 3: ModuleNotFoundError: No module named 'google.colab'

    • I'm not sure how to install the google.colab package locally.

Running jupyter notebook baseline.ipynb

  1. Same issues as in model.ipynb.

Running jupyter notebook count_set_gen.ipynb

  1. Cell 2: ModuleNotFoundError: No module named 'wget'

    • wget package is missing from requirements.txt
  2. Cell 4: ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.

    • A parquet engine (pyarrow or fastparquet) is missing from requirements.txt.

a-strock avatar Apr 19 '25 14:04 a-strock

@gdetor @a-strock Many thanks for your detailed reviews. @Harshvardhan-Mestha Can you address the reviews, and can you especially explain why you think you've replicated the original paper in light of @a-strock's review?

rougier avatar Apr 24 '25 08:04 rougier