Upgrade the LLVM Version

Open yaochengji opened this issue 2 years ago • 7 comments

Hi,

There are three main frontends in the deep learning world: TensorFlow, PyTorch, and ONNX. All of them are currently embracing MLIR/LLVM. However, this creates a version problem:

  1. TensorFlow upgrades LLVM daily
  2. PyTorch and ONNX upgrade LLVM monthly

It will be troublesome for those who use a unified dialect (MHLO, TOSA) if the LLVM versions among the three frontends don't match.

Could we work together to upgrade the LLVM version to the same one torch-mlir is on? Currently it is 3580daa.

cc @sjarus, I think the TOSA side also has the same needs.

yaochengji avatar Jul 21 '22 17:07 yaochengji

Upgrading daily may be a resource problem, as we have a fairly small team. One of us (@gongsu832) is working on making a daily build using the latest LLVM and, if it fails, finding the latest commit that works. As you have seen, getting to the latest is often non-negligible work, whether because of name changes (the easy cases) or more fundamental changes.

Syncing the PyTorch and ONNX upgrades would work, but it can still take a few days to do the migration work, so we would have to "decide" on a common branch first and then each work toward making our individual repos build against it. Maybe a label on the LLVM commit for monthly updates would help.

I will show my ignorance, but I would rather understand the process well. If we have ONNX-MLIR CIs failing because one project has updated but not the other... can we control this because mhlo is imported as a third_party? That is, we would upgrade our entire onnx-mlir repo to the agreed-upon LLVM commit, and once that PR is in, we would then update the third-party mhlo once it has also reached the same LLVM commit?

Basically, I am asking the experts to give a detailed protocol for advancing each distinct project to a common LLVM commit, knowing that it may take each project between a day and a few days to reach that commit. The key is to make sure that, during this transient stage, we don't have CIs failing because of the temporary level differences while we are in the process of upgrading to the desired LLVM level.

Thanks for your help on this.

AlexandreEichenberger avatar Jul 21 '22 22:07 AlexandreEichenberger

(Some of this was discussed in PR #1554; the discussion should continue here.)

AlexandreEichenberger avatar Jul 21 '22 22:07 AlexandreEichenberger

As it stands, onnx-mlir has a dependency on llvm-project both directly and indirectly (through mhlo). These two need to be in sync and should always be updated together, to make sure there are no weird and hard-to-diagnose issues or breaks.

This is what I did in the latest update, #1554, and it was not trivial: even though mhlo had in theory moved to a newer version of llvm-project, there was still a break, since mhlo does not test shared library builds. We can safely assume that moving to a newer version of llvm-project will require updates to onnx-mlir, and may not have a working equivalent in mhlo unless they add more testing around shared library builds. Regardless, the updates are going to require actual dev time and cannot be automated the majority of the time, because mlir changes significantly and frequently.
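To make the sync requirement concrete, here is a minimal sketch of a consistency check between the two pins; the file paths and the pin-file format are assumptions for illustration, not the actual onnx-mlir layout:

  # Sketch: check that onnx-mlir's direct llvm-project pin matches the llvm-project
  # commit used by the mhlo third_party checkout. The paths and the pin file format
  # below are hypothetical; adapt them to wherever each repo records its pin.
  import subprocess
  import sys

  def pinned_llvm_commit(path):
      # Assume the pin file stores the commit hash on its first line (hypothetical format).
      with open(path) as f:
          return f.readline().strip()

  def checked_out_llvm_commit(repo_dir):
      # Ask a git checkout (e.g. the llvm-project used by the mhlo third_party) for its HEAD.
      out = subprocess.run(
          ["git", "-C", repo_dir, "rev-parse", "HEAD"],
          check=True, capture_output=True, text=True,
      )
      return out.stdout.strip()

  if __name__ == "__main__":
      direct = pinned_llvm_commit("utils/llvm-commit.txt")                 # hypothetical path
      via_mhlo = checked_out_llvm_commit("third_party/mhlo/llvm-project")  # hypothetical path
      if direct != via_mhlo:
          print(f"llvm-project pins diverge: onnx-mlir={direct} mhlo={via_mhlo}")
          sys.exit(1)
      print(f"llvm-project pins agree: {direct}")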

Now for torch-mlir. torch-mlir has a direct dependency on llvm-project and, again, an indirect dependency on llvm-project through mhlo. torch-mlir updates much more frequently than onnx-mlir (closer to weekly or sometimes even bi-weekly), but again, it is not trivial, especially with the newly added dependency on mhlo. Just the most recent change to mlir Python registration caused some flux and breaks and had to be reverted. Incidentally, because torch-mlir tried to update to the newest mlir so quickly, it was impacted by the revert of an mlir change in llvm-project. The updates to llvm-project in torch-mlir are generally much cheaper than in onnx-mlir, but they are still not trivial. torch-mlir also almost always picks up the latest daily build of torch and torchvision.

mhlo only has the direct dependency on llvm-project, and the majority of the time the update is just pushing the new llvm-project commit. However, that is partially because things like shared library builds are NOT maintained, so any newly added library dependencies are largely ignored unless someone like onnx-mlir needs the updates (see https://github.com/tensorflow/tensorflow/pull/56844).

So when you put all of this together, keeping up with llvm-project is expensive regardless of the frequency of the updates. More frequent updates mean smaller changes each time, but on more occasions; waiting means fewer occasions, but much larger and more complicated changes each time. This is possibly my team's biggest pain point in working with these open source projects (we work with both onnx-mlir and torch-mlir and frequently do the llvm-project updates in one or both, because we internally update llvm-project weekly).

My recommendation is that, rather than trying to keep onnx-mlir and torch-mlir in sync with each other's version of llvm-project, we should pick a cadence for updating onnx-mlir to newer llvm-project and mhlo and stick to it. This could be once or twice a month, etc. (I would favor twice a month based on the amount of effort required). Then create an Issue or similar when that work starts, so that others know an update is coming. At the same time, contributors should feel empowered to do an update sooner if they need it, so that they can unblock themselves; the absence of an existing Issue will indicate that no update is already in flight.

As a side note, we can improve how we pick which llvm-project version to update to because mainline llvm-project builds are frequently broken in one piece or another, but that's a separate discussion.

sstamenova avatar Jul 21 '22 23:07 sstamenova

I will show my ignorance, but I would rather understand the process well. If we have ONNX-MLIR CIs failing because one project has updated but not the other... can we control this because mhlo is imported as a third_party? That is, we would upgrade our entire onnx-mlir repo to the agreed-upon LLVM commit, and once that PR is in, we would then update the third-party mhlo once it has also reached the same LLVM commit?

I suggest that the ONNX2MHLO conversion and the corresponding MHLO compilation target could be turned off by a CMake option. That way, if the whole of ONNX-MLIR fails because of the MHLO part, users who don't need MHLO could still use ONNX-MLIR. We would then fix the MHLO-related problems to make the whole of ONNX-MLIR correct, e.g. the shared library builds that @sstamenova mentioned.
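For illustration only, a small sketch of what building with such a switch off might look like; the option name ONNX_MLIR_ENABLE_MHLO is hypothetical and just stands in for whatever CMake option would gate the ONNX2MHLO conversion and the MHLO target:

  # Sketch: configure onnx-mlir with the proposed MHLO switch disabled.
  # ONNX_MLIR_ENABLE_MHLO is a hypothetical option name; the point is that a
  # single CMake cache variable would gate the ONNX2MHLO conversion and the
  # MHLO compilation target.
  import subprocess

  def configure(source_dir=".", build_dir="build", enable_mhlo=False):
      flag = "ON" if enable_mhlo else "OFF"
      subprocess.run(
          ["cmake", "-S", source_dir, "-B", build_dir,
           f"-DONNX_MLIR_ENABLE_MHLO={flag}"],   # hypothetical option
          check=True,
      )

  if __name__ == "__main__":
      # Users who don't need MHLO keep the rest of ONNX-MLIR usable.
      configure(enable_mhlo=False)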

Basically, I am asking the experts to give a detailed protocol for advancing each distinct project to a common LLVM commit, knowing that it may take each project between a day and a few days to reach that commit. The key is to make sure that, during this transient stage, we don't have CIs failing because of the temporary level differences while we are in the process of upgrading to the desired LLVM level.

I think this should be discussed with more developers. A post has been started on the MLIR forum.

yaochengji avatar Jul 22 '22 00:07 yaochengji

cc @sjarus, I think the TOSA side also has the same needs.

Yes we do - I've posted in more detail in the MLIR discourse thread. Thanks for the call out here!

sjarus avatar Jul 22 '22 14:07 sjarus

A lot has already been said, and I think we all generally agree that requiring all projects to be on the same LLVM commit would be very hard to manage in practice, given that every project has its own resource constraints, development pace, goals, etc. So I agree with @sstamenova that we should have our own pace of upgrading to newer LLVM commits. I'm in the process of creating a Jenkins job that runs periodically in idle times to find the earliest LLVM commit, after the one we currently use, that breaks ONNX-MLIR, which will hopefully ease some of the pain of upgrading.
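As a rough illustration of what such a job could do (the real Jenkins job may well bisect instead of scanning linearly, and the pin and build command below are placeholders):

  # Sketch: walk llvm-project commits newer than the current pin and report the
  # first one that breaks onnx-mlir. The pin and the build-and-test command are
  # placeholders; a real job could bisect rather than scan linearly.
  import subprocess

  LLVM_DIR = "llvm-project"                     # local clone of llvm-project (placeholder path)
  CURRENT_PIN = "<current llvm-project pin>"    # placeholder for the commit onnx-mlir uses today

  def commits_after(pin):
      # Commits on origin/main that are newer than `pin`, oldest first.
      out = subprocess.run(
          ["git", "-C", LLVM_DIR, "rev-list", "--reverse", f"{pin}..origin/main"],
          check=True, capture_output=True, text=True,
      )
      return out.stdout.split()

  def builds_ok(commit):
      # Check out `commit` and run a placeholder build-and-test script for onnx-mlir.
      subprocess.run(["git", "-C", LLVM_DIR, "checkout", commit], check=True)
      result = subprocess.run(["./build-and-test-onnx-mlir.sh"])   # placeholder script
      return result.returncode == 0

  if __name__ == "__main__":
      last_good = CURRENT_PIN
      for commit in commits_after(CURRENT_PIN):
          if not builds_ok(commit):
              print(f"first breaking commit: {commit}")
              print(f"latest safe commit:    {last_good}")
              break
          last_good = commit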

gongsu832 avatar Jul 22 '22 15:07 gongsu832

FYI, I'm still trying to figure out the best way to present the results of the Jenkins job, but here is what it found:

  "recent": {
    "failed": [
      {
        "sha1": "a1ec0d8bdccab1d28e009375209965017c872d3d",
        "date": "2022-07-21T19:03:07Z"
      },
      "140"
    ],
    "succeeded": [
      {
        "sha1": "6605187103a2369418d014a7f146fee4a04b11bf",
        "date": "2022-07-21T19:00:29Z"
      },
      "138"
    ]
  },

This means that after the latest LLVM commit update, the earliest commit that will break ONNX-MLIR is a1ec0d8bdccab1d28e009375209965017c872d3d. The one before it, 6605187103a2369418d014a7f146fee4a04b11bf, is the one we can update to without breaking ONNX-MLIR.
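For anyone consuming that report, a minimal sketch of pulling out the two commits, assuming the fragment above sits in a JSON file with a top-level "recent" object as shown; the file name is made up, and the trailing counts ("138", "140") are ignored because their meaning isn't spelled out in this thread:

  # Sketch: read the job's JSON report (assumed shape as in the fragment above)
  # and print the first breaking commit and the latest safe one. The report file
  # name is hypothetical; the trailing numeric entries ("138", "140") are ignored.
  import json

  with open("llvm-watch-report.json") as f:     # hypothetical file name
      report = json.load(f)

  succeeded = report["recent"]["succeeded"][0]  # the commit object, not the trailing count
  failed = report["recent"]["failed"][0]

  print(f"latest safe llvm-project commit: {succeeded['sha1']} ({succeeded['date']})")
  print(f"first breaking commit:           {failed['sha1']} ({failed['date']})")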

gongsu832 avatar Jul 23 '22 04:07 gongsu832