[bug] conan upload errors out because the file is already in the download cache
Describe the bug
Environment:
- Conan version 1.66
- Ubuntu Linux build nodes in a CI pipeline running parallel builds
- We use a Conan download cache to optimize downloads from Artifactory; it is configured for all of our Conan operations (see the configuration sketch after this list)
- The Conan download cache is stored on EFS in AWS
- We clean our download cache on a weekly basis
- Our local Conan caches (not the download cache) are unique per parallel build (and across different builds), as each runs in its own environment
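For reference, pointing the download cache at the shared mount looks roughly like this (a sketch; the week-23 subfolder is our own weekly rotation scheme, and I believe the Conan 2 equivalent is the core.download:download_cache conf):

```sh
# Conan 1.x: point the shared download cache at the EFS mount
# (the week-NN subfolder is our own weekly rotation scheme)
conan config set storage.download_cache="/tmp/conan-download-cache/week-23"

# Conan 2 equivalent (as far as I know), in global.conf:
# core.download:download_cache = /tmp/conan-download-cache/week-23
```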
Description of issue:
The error signature that we saw in one of the parallel steps was:
ERROR: CompName/1.2.3@user/channel: Upload recipe to 'our_remote_name' failed: Error, the file to download already exists: '/tmp/conan-download-cache/week-23/35071e140698ab6dc04ef8767c91c61fdebe3a4161ff7104e688be335b512898'. [Remote: our_remote_name]
This file in the download cache is the conanmanifest.txt for the recipe.
Expected vs Actual Behavior
It seems like buggy behavior that an upload operation is checking against the download cache. I thought the download cache was only used for downloading packages - not uploading them.
Even if there is a specific reason why the download cache is cross checked when an upload happens, I would have thought that the locking mechanism in the conan download cache would have ensured that the recipe was either retrieved from the download cache or not - without erroring out.
How to reproduce it
This issue has only happened one time that I am aware of. It seems to be a rare race condition and as such it is not readily reproducible.
Hi @Glenn-Duffy-Bose
Thanks for your feedback
It seems like buggy behavior that an upload operation is checking against the download cache. I thought the download cache was only used for downloading packages - not uploading them.
This could be caused by the download of the package sources. When a package is about to be uploaded, Conan first checks that it is complete, that is, that it has the necessary sources (like the conan_sources.tgz file). If they are not in the local cache, because they were never downloaded, they will first be downloaded so they can be uploaded to the new remote.
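Very roughly, the idea is something like this (a simplified sketch for illustration, not Conan's actual code; the names are made up):

```python
import os

# Illustrative sketch of the completeness check on upload (made-up names,
# not Conan's actual code).
RECIPE_FILES = ["conanfile.py", "conanmanifest.txt", "conan_export.tgz",
                "conan_sources.tgz"]

def ensure_recipe_complete(recipe_folder, download_from_remote):
    """Before uploading, fetch any recipe file missing from the local cache."""
    for filename in RECIPE_FILES:
        path = os.path.join(recipe_folder, filename)
        if not os.path.exists(path):
            # conan_sources.tgz is fetched lazily, so it is typically the one
            # missing; the fetch goes through the shared download cache, which
            # is where a concurrent writer could trigger the "file to download
            # already exists" error
            download_from_remote(filename, path)
```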
Even if there is a specific reason why the download cache is cross checked when an upload happens, I would have thought that the locking mechanism in the conan download cache would have ensured that the recipe was either retrieved from the download cache or not - without erroring out.
Yes, indeed the locking mechanism is there to protect against race conditions. But it is also true that the locking (file locks) is an OS mechanism, so it is mostly guaranteed to work at the OS level, not in a distributed setup, especially if different OSs are accessing that resource concurrently. In practice it seems to work well, but I think it is not fully guaranteed, for example, for concurrent Linux and Windows access. Maybe this is the case?
Thanks @memsharded for your prompt reply.
This could be caused by the download of the package sources
That's good to know. But I don't think that's the case here because it's only trying to upload the recipe. So it shouldn't need to fetch any source code to do a build.
especially if different OSs are accessing that resource concurrently
That makes sense but does not apply here. All of the systems accessing that resource use the same OS (Ubuntu Linux). They are managed as a node group in AWS, so the configurations are in fact identical.
That's good to know. But I don't think that's the case here because it's only trying to upload the recipe. So it shouldn't need to fetch any source code to do a build.
Yes, conan_sources.tgz holds the "exported sources", and those are part of the recipe. The source is common to all the different binaries and is needed to create them, so it cannot be part of the binaries; it is a precondition to create them.
Conan version 1.66
In any case, I'd like to clarify that the CI for Conan 1.X has already been removed, and there are no further releases planned for Conan 1.X. I am still interested in this feedback, because the download cache mechanism is mostly the same in Conan 2, so it could happen there too. Still, if it happens so rarely, it can be very challenging: when these things happen sporadically they can sometimes be reproduced by just launching thousands of runs, but the way you describe this, it seems almost impossible to reproduce.
Hi @memsharded. Thanks for the reply. Out of curiosity, are you referring to conan_export.tgz? I don't see any conan_sources.tgz uploaded to Artifactory for this recipe. Otherwise, I thought Conan interacted with the source control system (in this case, we use GitHub) to fetch the source code.
Here's what I see in the recipe: it is very small. Besides the conanfile, there is only a conandata.yml file.
In any case, I'd like to clarify that the CI for Conan 1.X has already been removed, and there are no further releases planned for Conan 1.X. I am still interested in this feedback, because the download cache mechanism is mostly the same in Conan 2, so it could happen there too. Still, if it happens so rarely, it can be very challenging: when these things happen sporadically they can sometimes be reproduced by just launching thousands of runs, but the way you describe this, it seems almost impossible to reproduce.
Thanks for the information. I wonder if this could be reproduced "manually" by doing the following (see the command sketch after the list):
- Uploading a fresh new recipe
- Setting a debugger breakpoint in a particular place in the recipe upload
- In another environment, downloading the recipe so it gets cached in the download cache
- Letting the recipe upload continue
- Seeing if the recipe upload gets confused by the new contents in the download cache
I don't think I would have cycles right now to attempt to reproduce the issue - just something that came to mind in case someone wants to and has time to look into this.
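In commands, that might look roughly like this (the reference and remote name are the placeholders from my report above; the breakpoint location is hypothetical):

```sh
# Environment A: start the upload and pause it mid-operation,
# e.g. by running the conan entry script under a debugger
python -m pdb "$(which conan)" upload CompName/1.2.3@user/channel -r our_remote_name

# Environment B (sharing the same download cache): fetch the recipe so its
# files land in the shared download cache while the upload is paused
conan download CompName/1.2.3@user/channel -r our_remote_name --recipe

# Environment A: resume the upload and watch for
# "ERROR: ... Error, the file to download already exists"
```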
Thanks for the reply. Out of curiosity, are you referring to conan_export.tgz? I don't see any conan_sources.tgz uploaded to Artifactory for this recipe. Otherwise, I thought Conan interacted with the source control system (in this case, we use GitHub) to fetch the source code.
No, I really meant the conan_sources.tgz.
The conan_export.tgz is the exports part: the files that are needed to process the conanfile.py code, like the conandata.yml. The conan_export.tgz is always downloaded together with the conanfile.py.
The conan_sources.tgz contains any exports_sources exported source files. Many recipes use exports_sources to store a copy of the sources together with the recipe, not necessarily using the scm approach. This is in fact the standard mechanism to handle sources; for example, conan new cmake_lib and all the other templates use the exports_sources attribute to do such a copy, as the https://docs.conan.io/2/tutorial/creating_packages/create_your_first_package.html tutorial shows.
The management of sources via scm coordinates is not even in the tutorial, but in a dedicated example at https://docs.conan.io/2/examples/tools/scm/git/capture_scm/git_capture_scm.html. That is not the most common approach; it is used mostly by organizations that are concerned about source code being part of the package recipes (security, IP protection, etc.).
The conan_sources.tgz is not downloaded by default, only when necessary to build. But also, when a package is going to be uploaded to a different repository, Conan makes sure to download that conan_sources.tgz, because it is necessary for the recipe to be complete.
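For illustration, this is the kind of recipe that produces a conan_sources.tgz (trimmed down from what conan new cmake_lib generates):

```python
from conan import ConanFile

class PkgConan(ConanFile):
    name = "pkg"
    version = "1.0"
    # Everything matched here is packed into conan_sources.tgz and stored
    # next to the recipe on the server, but only downloaded on demand
    # (e.g. to build from source, or to complete an upload).
    exports_sources = "CMakeLists.txt", "src/*", "include/*"
```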
Thanks for the information @memsharded .
Closing as solved, if you have any other questions, please feel free to create a new issue, we're always happy to help :)
@AbrilRBS Sorry I wasn't clear about this. I don't believe this issue is solved. I see a lot of information from @memsharded but it seems like background information as opposed to the root cause of the issue I am seeing. Could this be re-opened?
Sure, I can re-open, but it would be necessary to have some reproducible steps with Conan 2 (as Conan 1 is no longer maintained), as it is not clear how to reproduce it. Looking forward to getting your feedback then to reproduce, many thanks!
Sure, I can re-open, but it would be necessary to have some reproducible steps with Conan 2 (as Conan 1 is no longer maintained), as it is not clear how to reproduce it. Looking forward to getting your feedback then to reproduce, many thanks!
Thanks @memsharded (FYI, I am the original creator of this issue, if you couldn't already tell; due to a configuration change, I need to use my personal account going forward)
We're in the process of migrating to Conan 2, so it would be quite a while before I could reproduce this behavior in Conan 2. I'm fine if you want to keep the issue open or close it, since we cannot reproduce it in the major version of Conan that is under active development.
We're in the process of migrating to Conan 2, so it would be quite a while before I could reproduce this behavior in Conan 2. I'm fine if you want to keep the issue open or close it, since we cannot reproduce it in the major version of Conan that is under active development.
If you are not planning to try to reproduce any time soon, then I think it can be closed. Then it can be re-opened later when there is further feedback we can check and investigate further.
Maybe it is not necessary to reproduce this with your project, or to wait until the migration? It sounds like, if there is an issue you can reproduce with clear steps in Conan 1, it should be pretty straightforward to adapt the steps for Conan 2. If you already have detailed steps to reproduce in Conan 1, feel free to share them, and I'll try to extrapolate them to Conan 2 to double-check.
I agree @memsharded that if we can reproduce this in Conan 1, the details can be extrapolated to how it functions in Conan 2. Unfortunately, this issue isn't readily reproducible even in Conan 1. It happens rarely and intermittently, so it must be a race condition with other pipelines using or clearing the download cache.
Are there any additional details or logs that we could provide that could help troubleshoot this issue further?
Oh, I see.
I thought it was more deterministic. If it is quite random and happens rarely, then I understand; it will be way more challenging to reproduce.
The Conan download cache is stored on EFS in AWS
Yes, I believe this could be some kind of race condition. The thing is that the download cache uses OS file-lock synchronization mechanisms. Unfortunately, these are not guaranteed to work on distributed storage, especially if different OSs are sharing that storage. To my understanding, the file-lock sync mechanisms only work at the OS level. We have seen some users do this with network shared drives, but maybe AWS EFS goes beyond the usual, and file locks are not fully safe there.
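For illustration, this is the kind of OS-level advisory locking meant here (a generic sketch, not Conan's actual implementation; whether concurrent clients honor it depends on the NFS/EFS server and mount options):

```python
import fcntl

# Generic illustration of OS-level advisory file locking (not Conan's
# actual implementation). On a local filesystem the kernel arbitrates
# these locks; over NFS/EFS the guarantees depend on the server.
lock_path = "/tmp/example-cache-entry.lock"  # hypothetical lock file
with open(lock_path, "w") as lock_file:
    fcntl.flock(lock_file, fcntl.LOCK_EX)  # block until exclusive lock held
    try:
        pass  # read or write the cached entry while holding the lock
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)
```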
Unfortunately, these are not guaranteed to work on distributed storage, especially if different OSs are sharing that storage.
@memsharded Thanks for the details. I'm not sure if such guarantees exist for EFS. Notably, we are using the same OS for all clients accessing the shared storage. It's a pool of workers that all have identical configuration.
Thanks for the details. I'm not sure if such guarantees exist for EFS. Notably, we are using the same OS for all clients accessing the shared storage. It's a pool of workers that all have identical configuration.
Using the same OS on all clients should definitely help in many cases. But my experience with AWS is that they often have to do things differently because of their scale, so this could still be very related to the issue; it is possible that their sync mechanisms interfere somehow with file locks.
@memsharded Thanks for the information. Any suggestions for how to design things to prevent this kind of issue?