onnx-mlir
Upgrade the LLVM Version
Hi,
There are three main frontends in the deep learning world: TensorFlow, PyTorch, and ONNX. All of them are currently embracing MLIR/LLVM. However, this raises a version problem:
- TensorFlow upgrades LLVM daily
- PyTorch and ONNX upgrade LLVM monthly

It is troublesome for those who use a unified dialect (MHLO, TOSA) when the LLVM versions among the three frontends don't match.
Could we work together to upgrade the LLVM version to the same one that torch-mlir is on? Currently it is 3580daa.
cc @sjarus, I think the TOSA side has the same needs.
Upgrading daily may be a resource problem, as we have a fairly small team. One of us (@gongsu832) is working on a daily build that uses the latest LLVM and, if it fails, finds the latest commit that works. As you have seen, getting to the latest often takes non-negligible work, sometimes because of name changes (the easy cases) and sometimes because of more fundamental changes.
Syncing the PyTorch and ONNX upgrades would work, but it can still take a few days to do the migration, so we would have to "decide" on a common commit first and then each work toward making our individual repos build against it. Maybe a label on the LLVM commit for monthly updates.
I will show my ignorance, but I would rather understand the process well. If we have ONNX-MLIR CIs failing because one project has updated but not the other... can we control this because mhlo is imported as a third_party? That is, would we upgrade our entire onnx-mlir repo to the agreed-upon LLVM commit, and once that PR is in, update the third-party mhlo once it has also reached the same LLVM commit?
Basically, I am asking the experts for a detailed protocol on how to advance each distinct project to a common LLVM commit, knowing that it may take each project between a day and a few days to reach that commit. The key is to make sure that during this transient stage we don't have CIs failing because of the temporary level differences while we are upgrading to the desired LLVM level.
Thanks for your help on this.
(There was some discussion of this in PR #1554; it should continue here.)
As it stands, `onnx-mlir` has a dependency on `llvm-project` both directly and indirectly (through `mhlo`). These two need to be in sync and should always be updated together to make sure that there are no weird and hard-to-diagnose issues or breaks.
This is what I did in the latest update #1554, and it was not trivial: even though `mhlo` had in theory moved to a newer version of `llvm-project`, there was still a break, since `mhlo` does not test shared library builds. We can safely assume that moving to a newer version of `llvm-project` will require updates to `onnx-mlir`, and that there may not be a working equivalent in `mhlo` unless they add more testing around shared library builds. Regardless, the updates are going to require actual dev time and cannot be automated the majority of the time, because `mlir` changes frequently and significantly.
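Because of that direct-plus-indirect dependency, a small pre-flight check that the two LLVM pins agree can save a lot of debugging before an upgrade starts. Below is a minimal sketch, assuming hypothetical locations and formats for where each repo records its pinned `llvm-project` commit; the real files may differ.

```python
# Sketch: sanity-check that onnx-mlir and its mhlo dependency pin the same
# llvm-project commit before starting an upgrade. The file paths and regex
# patterns below are assumptions for illustration, not verified locations.
import re
import sys
from pathlib import Path

def pinned_llvm_sha(path: Path, pattern: str) -> str:
    """Extract a 40-character commit SHA from a pin file using a regex."""
    match = re.search(pattern, path.read_text())
    if not match:
        sys.exit(f"could not find an LLVM commit SHA in {path}")
    return match.group(1)

# Hypothetical pin locations (assumptions):
onnx_mlir_sha = pinned_llvm_sha(
    Path("onnx-mlir/utils/clone-mlir.sh"), r"git checkout ([0-9a-f]{40})")
mhlo_sha = pinned_llvm_sha(
    Path("mlir-hlo/build_tools/llvm_version.txt"), r"([0-9a-f]{40})")

if onnx_mlir_sha != mhlo_sha:
    sys.exit(f"LLVM pins diverge: onnx-mlir={onnx_mlir_sha} mhlo={mhlo_sha}")
print(f"both pin llvm-project {onnx_mlir_sha}")
```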
Now for `torch-mlir`. `torch-mlir` has a direct dependency on `llvm-project` and, again, an indirect dependency on `llvm-project` through `mhlo`. `torch-mlir` updates much more frequently than `onnx-mlir` (closer to weekly, or sometimes even bi-weekly), but again, it is not trivial, especially with the newly added dependency on `mhlo`. Just the most recent change to `mlir` Python registration caused some flux and breaks and had to be reverted. Incidentally, because `torch-mlir` tried to update to the newest `mlir` so quickly, it was impacted by the revert of an `mlir` change in `llvm-project`. The updates to `llvm-project` in `torch-mlir` are generally much cheaper than in `onnx-mlir`, but they are still not trivial. `torch-mlir` also almost always picks up the latest daily build of `torch` and `torchvision`.
`mhlo` only has the direct dependency on `llvm-project`, and the majority of the time the update is just pushing the new `llvm-project` commit. However, that is partially because things like shared library builds are NOT maintained, so any newly added library dependencies are largely ignored unless someone like `onnx-mlir` needs the updates (see https://github.com/tensorflow/tensorflow/pull/56844).
So when you put all of this together, keeping up with `llvm-project` is expensive regardless of the frequency of the updates. More frequent updates mean fewer changes each time, but on more occasions, while waiting means fewer occasions, but much larger and more complicated changes. This is possibly my team's biggest pain point in working with these open source projects (we work with both `onnx-mlir` and `torch-mlir` and frequently do the `llvm-project` updates in one or both, because we internally update `llvm-project` weekly).
My recommendation is that rather than trying to keep `onnx-mlir` and `torch-mlir` in sync with each other's version of `llvm-project`, we should pick an update cadence for `onnx-mlir` to newer `llvm-project` and `mhlo` and stick to it. This could be once a month, twice a month, etc. (I would favor twice a month based on the amount of effort required). Then create an Issue or similar when that work starts, so that others know an update is coming. At the same time, contributors should feel empowered to do an update sooner if they need it, so that they can unblock themselves, and the presence or absence of an existing Issue will indicate whether an update is already in flight.
As a side note, we can improve how we pick which `llvm-project` version to update to, because mainline `llvm-project` builds are frequently broken in one piece or another, but that's a separate discussion.
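As one illustration of that side note, a quick filter against obviously broken mainline commits is to look at a candidate commit's reported CI state before pinning it. The sketch below queries GitHub's combined commit status API; it assumes the relevant LLVM CI results are reported back to GitHub as commit statuses, which may not cover every buildbot.

```python
# Sketch: check the combined GitHub commit status of a candidate llvm-project
# commit before deciding to pin it. Buildbots that do not report their results
# back to GitHub will not be visible to this check.
import json
import sys
import urllib.request

def combined_status(sha: str) -> str:
    url = f"https://api.github.com/repos/llvm/llvm-project/commits/{sha}/status"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["state"]  # "success", "failure", or "pending"

candidate = sys.argv[1]
state = combined_status(candidate)
print(f"{candidate}: {state}")
sys.exit(0 if state == "success" else 1)
```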
> I will show my ignorance, but I would rather understand the process well. If we have ONNX-MLIR CIs failing because one project has updated but not the other... can we control this because mhlo is imported as a third_party? That is, would we upgrade our entire onnx-mlir repo to the agreed-upon LLVM commit, and once that PR is in, update the third-party mhlo once it has also reached the same LLVM commit?
I suggest that the ONNX2MHLO conversion and the corresponding MHLO compilation target could be turned off by a CMake option. Then, if ONNX-MLIR as a whole fails because of the MHLO part, users who don't need MHLO could still use ONNX-MLIR, and we would fix the MHLO-related problems afterwards to make the whole build correct again, e.g. the shared library builds @sstamenova mentioned.
> Basically, I am asking the experts for a detailed protocol on how to advance each distinct project to a common LLVM commit, knowing that it may take each project between a day and a few days to reach that commit. The key is to make sure that during this transient stage we don't have CIs failing because of the temporary level differences while we are upgrading to the desired LLVM level.
I think this should be discussed with more developers. I have started a post on the MLIR forum.
> Hi,
> There are three main frontends in the deep learning world: TensorFlow, PyTorch, and ONNX. All of them are currently embracing MLIR/LLVM. However, this raises a version problem:
> - TensorFlow upgrades LLVM daily
> - PyTorch and ONNX upgrade LLVM monthly
>
> It is troublesome for those who use a unified dialect (MHLO, TOSA) when the LLVM versions among the three frontends don't match.
> Could we work together to upgrade the LLVM version to the same one that torch-mlir is on? Currently it is 3580daa.
> cc @sjarus, I think the TOSA side has the same needs.
Yes we do - I've posted in more detail in the MLIR discourse thread. Thanks for the call out here!
A lot has already been said, and I think we all generally agree that requiring all projects to be on the same LLVM commit would be very hard to manage in practice, given that every project has its own resource constraints, development pace, goals, etc. So I agree with @sstamenova that we should upgrade to newer LLVM commits at our own pace. I'm in the process of creating a Jenkins job that runs periodically during idle times to find the earliest LLVM commit, after the one we currently use, that breaks ONNX-MLIR, which will hopefully ease some of the pain of upgrading.
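To make the intent of that job concrete, here is a minimal sketch of the kind of scan it could perform: walk the `llvm-project` commits newer than the currently pinned one, oldest first, and stop at the first one that breaks ONNX-MLIR. The build/test script name is a placeholder, not the actual Jenkins pipeline.

```python
# Sketch of the scan such a job might perform: walk llvm-project commits newer
# than the currently pinned one (oldest first) and record the last commit for
# which onnx-mlir still builds and passes tests, plus the first one that breaks.
# "./build-and-test-onnx-mlir.sh" is a placeholder for the real build/test steps.
import subprocess
import sys

def git(repo: str, *args: str) -> str:
    out = subprocess.run(["git", "-C", repo, *args],
                         check=True, capture_output=True, text=True)
    return out.stdout.strip()

def builds_against(llvm_sha: str) -> bool:
    git("llvm-project", "checkout", llvm_sha)
    # Placeholder: rebuild LLVM/MLIR at this commit, then build and test onnx-mlir.
    result = subprocess.run(["./build-and-test-onnx-mlir.sh", llvm_sha])
    return result.returncode == 0

current_pin = sys.argv[1]  # llvm-project commit onnx-mlir is currently pinned to
newer = git("llvm-project", "rev-list", "--reverse",
            f"{current_pin}..origin/main").splitlines()

last_good, first_bad = current_pin, None
for sha in newer:
    if builds_against(sha):
        last_good = sha
    else:
        first_bad = sha
        break

print(f"last good llvm-project commit: {last_good}")
print(f"first breaking commit: {first_bad}")
```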
FYI, I'm still trying to figure out the best way to present the results of the Jenkins job, but here is what it found:
"recent": {
"failed": [
{
"sha1": "a1ec0d8bdccab1d28e009375209965017c872d3d",
"date": "2022-07-21T19:03:07Z"
},
"140"
],
"succeeded": [
{
"sha1": "6605187103a2369418d014a7f146fee4a04b11bf",
"date": "2022-07-21T19:00:29Z"
},
"138"
]
},
This means that after the latest LLVM commit update, the earliest commit that will break ONNX-MLIR is a1ec0d8bdccab1d28e009375209965017c872d3d. The one before it, 6605187103a2369418d014a7f146fee4a04b11bf, is the one we can update to without breaking ONNX-MLIR.
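Assuming the job keeps emitting reports shaped like the excerpt above (the file name and exact schema below are assumptions based on that excerpt), picking the commit to update to could be scripted along these lines:

```python
# Sketch: read a report shaped like the excerpt above and print the last
# llvm-project commit that is known to work, along with the first breaking one.
# The file name and exact schema are assumptions, not the job's actual output.
import json

with open("llvm-watch-report.json") as f:
    report = json.load(f)

good = report["recent"]["succeeded"][0]["sha1"]
bad = report["recent"]["failed"][0]["sha1"]
print(f"safe to update to {good}; earliest breaking commit is {bad}")
```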