
Experiences from porting YOLOv8 to Axon

Open hansihe opened this issue 1 year ago • 10 comments

I recently ported the YOLOv8 object detection model to Axon, and just wanted to share my experiences with it.

https://github.com/hansihe/yolov8_elixir

  • The deployment story seems a lot better than in other frameworks from what I have seen so far, good job! It has really just worked most of the time, while with PyTorch and Python things are sometimes very fiddly to get running on a particular machine.
  • What is the recommended replacement of Module from PyTorch?
    • Axon.namespace looks somewhat like it, but there doesn’t seem to be a way of differentiating “this is a module that contains other layers, may depend on other layers outside of the module” vs “this is a subnetwork that is fully independent all the way to inputs”.
  • Related to the above, naming and identifying layers within a network for parameter loading is very fiddly sometimes. I wish there was a way to have “modules” that provided nesting in the parameter map. I ended up more or less doing this with dot separated paths in the parameter names (04.c2f.m.0.bottle_neck.cv2.conv.conv2d); a quick sketch of this follows after this list.
    • If not, would it be possible to introduce an abstraction where a new layer can be constructed as a combination of other layers? This could then be reflected in the parameter map by nesting.
      • Example: yolov8 has a C2f layer which contains many Bottleneck layers which contains other layers again.
  • Maybe there could be a utility layer in Axon for destructuring a container? Say one layer returns an %{"a" => _, "b" => _} container, having a way to destructure that in another layer without making many different Axon.layers that just pull out one of the inner values.
    • I might also have missed something obvious here.
  • A lot of the time when I get dimension mismatches from building my model, I get no stack trace in the “this layer was defined at” section of the error. It’s just empty. Should I do something special to get a stack trace?
  • The documentation of Axon.build could be a little bit clearer on what the init vs predict functions actually do.
    • Is there an expectation that predict can modify mutable internal state in XLA or other backends?
    • Or is it mainly to initialize/copy parameters to the backend representation?
  • It would be useful if predict had different stricter modes for stuff like:
    • Explicitly warn if a parameter is missing from the input parameter map and was instead initialized. Having a parameter initialized when loading a model for inference is unwanted and would indicate an error.
    • Warn if there is extra unused data in the parameter map. This would make it easier to track down parameter naming issues.
  • When working on the model in Livebook and printing my model as a table, the printed table got so large that it was truncated. I wish there was a way to prevent it from truncating, I had to save the text to a file and open it in an editor to read the full table.
  • Is there any way the unpickler and torch parameter loading stuff could be moved from bumblebee to another library? Right now I depend on bumblebee just for those parts.
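For reference, here is a rough sketch of the dot-separated naming approach mentioned above (the layer name, shapes, and the exact shape of the resulting parameter map are illustrative, based on the Axon version I'm using):

x = Axon.input("image", shape: {nil, 3, 640, 640})
x = Axon.conv(x, 64, kernel_size: 3, padding: :same, name: "04.c2f.m.0.bottle_neck.cv2.conv.conv2d")

{init_fn, _predict_fn} = Axon.build(x)
params = init_fn.(%{"image" => Nx.template({1, 3, 640, 640}, :f32)}, %{})

# params["04.c2f.m.0.bottle_neck.cv2.conv.conv2d"] should then hold the layer's
# "kernel"/"bias" tensors, so ported PyTorch weights can be keyed the same way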

hansihe avatar Feb 12 '23 12:02 hansihe

A lot of the time when I get dimension mismatches from building my model, I get no stack trace in the “this layer was defined at” section of the error. It’s just empty. Should I do something special to get a stack trace?

You need to pass debug: true. I improved the error message to say so.

josevalim avatar Feb 12 '23 14:02 josevalim

You need to pass debug: true. I improved the error message to say so.

Thanks!

hansihe avatar Feb 12 '23 14:02 hansihe

@hansihe Thank you for this very detailed write up! It's really helpful for improving the framework. Also, would you be interested in adding your YOLO implementation upstream to Bumblebee? We don't have any object detection models yet and I think it would be super useful to the community. cc @jonatanklosko for his take as well

What is the recommended replacement of Module from PyTorch?

This is a good question and something I have debated for quite some time. Relevant issue in #459. For most cases, modules can just be replaced with Elixir functions. This is the pattern we follow in Bumblebee. For example: https://github.com/elixir-nx/bumblebee/blob/main/lib/bumblebee/text/bert.ex#L554
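As a rough sketch of the pattern (the block and names below are hypothetical, not the actual BERT code), a "module" becomes a plain function from an Axon node plus options to a new Axon node:

defp bottleneck(x, units, opts) do
  name = opts[:name]

  shortcut = x

  x
  |> Axon.conv(units, kernel_size: 3, padding: :same, name: name <> ".cv1")
  |> Axon.conv(units, kernel_size: 3, padding: :same, name: name <> ".cv2")
  |> Axon.add(shortcut, name: name <> ".add")
end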

Axon.namespace's initial purpose was just to make it easier to block off subnetworks for fine-tuning. E.g. if you use a ResNet base you can namespace with "resnet" and then pass the pre-trained "resnet" params.
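For example, a minimal sketch of that use case (resnet_base, input_template, and pretrained_resnet_params are placeholders):

model =
  resnet_base
  |> Axon.namespace("resnet")
  |> Axon.flatten()
  |> Axon.dense(10, name: "classifier")

{init_fn, _predict_fn} = Axon.build(model)

# the pre-trained params are passed in under the "resnet" namespace key
params = init_fn.(input_template, %{"resnet" => pretrained_resnet_params})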

I agree though there is no easy way to explicitly group layers/models, and this is a serious drawback in the API right now. I have considered adding something like Axon.function(...) which wraps a group of layers and returns a function which is basically a re-usable "sub-network" as you describe, but I'm not sure that will really fix it either. One challenge of trying to duplicate the PyTorch module approach is that Axon networks are both immutable and stateless. As currently designed, Axon networks are completely distinct from their parameter map. Additionally, since layer names are generated lazily, the "shape" of subnetworks in the parameter map can change as a result of additional layer modifications.

I will continue thinking through the best way to integrate this in a functional way, without necessarily forcing the OOP/module style approach. In Bumblebee we often reference "blocks" as a group of subnetworks, so maybe we can introduce an Axon.block which represents a "block" of layers which is meant to have ownership of its sublayers.

Related to the above, naming and identifying layers within a network for parameter loading is very fiddly sometimes. I wish there was a way to have “modules” that provided nesting in the parameter map. I ended up more or less doing this with dot separated paths in the parameter names (04.c2f.m.0.bottle_neck.cv2.conv.conv2d).

Dot separated paths is also how we do it in Bumblebee. I agree that we again can probably do a better job of making this easier to do under the hood. Again maybe an Axon.block or Axon.function implementation would help. I will explore adding this and see if it makes these things easier.

Maybe there could be a utility layer in Axon for destructuring a container? Say one layer returns an %{"a" => _, "b" => _} container, having a way to destructure that in another layer without making many different Axon.layers that just pull out one of the inner values.

I believe we have something in Bumblebee as a utility that basically implements this, but I think this is a sign to upstream it. Perhaps Axon.destructure or Axon.unwrap? cc @josevalim @jonatanklosko
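For reference, I believe the ad-hoc version people end up writing today looks roughly like this (a sketch; the container keys are hypothetical, and it assumes the upstream layer's output is a map container):

combined = layer_returning_a_map(x)

a = Axon.nx(combined, fn %{"a" => a} -> a end)
b = Axon.nx(combined, fn %{"b" => b} -> b end)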

Is there an expectation that predict can modify mutable internal state in XLA or other backends?

I'm a bit confused by this one. build returns a numerical definition for the initialization and predict functions. All numerical definitions are pure functions, so at the very least they do not modify the input tensors. XLA is a black box though, so whatever happens internally is up to the compiler.

Or is it mainly to initialize/copy parameters to the backend representation?

Maybe the confusion is here? Axon networks don't come with any parameters, so when you're creating the network you're just building up an Elixir data structure (see Axon.Node: https://github.com/elixir-nx/axon/blob/main/lib/axon/node.ex).

When you call Axon.build the network gets compiled into 2 functions: init and predict. This is done by traversing the network and handling each layer here: https://github.com/elixir-nx/axon/blob/main/lib/axon/compiler.ex#L49
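Concretely, the flow is something like this (model, input_template, loaded_params, and input are placeholders):

{init_fn, predict_fn} = Axon.build(model)

# anything already present in loaded_params is reused; everything else is initialized
params = init_fn.(input_template, loaded_params)

# predict is then a pure function of the params and the inputs
output = predict_fn.(params, input)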

When working on the model in Livebook and printing my model as a table, the printed table got so large that it was truncated. I wish there was a way to prevent it from truncating, I had to save the text to a file and open it in an editor to read the full table.

I think this issue is with Livebook outputs and not with Axon? Larger models are difficult to inspect in general. PyTorch also has a nested "tree-view" of a model as their default representation, which may be helpful for us here.

It would be useful if predict had different stricter modes

Good point, I think in Bumblebee we have debug logs to indicate which parameters were missing in an initial state. I think we can add something similar here. I don't think we should raise though as we have to consider the training case where you partially initialize a model. I think debug here might be sufficient. In a debug mode we can also log unused parameters.
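In the meantime, something along these lines can approximate the stricter check by hand (not an Axon feature, just a sketch; template and loaded_params are placeholders):

{init_fn, _predict_fn} = Axon.build(model)
reference = init_fn.(template, %{})

# compare the top-level (layer-name) keys; a deeper diff could recurse into each entry
missing = Map.keys(reference) -- Map.keys(loaded_params)
extra = Map.keys(loaded_params) -- Map.keys(reference)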

Is there any way the unpickler and torch parameter loading stuff could be moved from bumblebee to another library? Right now I depend on bumblebee just for those parts.

Good question. The unpickler is a separate library, baking_soda, but the logic for loading PyTorch parameters and converting them to Axon parameters is specific to Bumblebee. Perhaps it does make sense, though the intent with Bumblebee is to make it the repository for using pre-trained models in Elixir, so I don't know if we should have it live separately.

seanmor5 avatar Feb 12 '23 17:02 seanmor5

@hansihe Thank you for this very detailed write up! It's really helpful for improving the framework. Also, would you be interested in adding your YOLO implementation upstream to Bumblebee? We don't have any object detection models yet and I think it would be super useful to the community. cc @jonatanklosko for his take as well

I would definitely be interested in getting the implementation into Bumblebee! I wasn't sure what the intention for Bumblebee was initially, but now that I hear it's meant to be a package of models it definitely makes sense to add it.

YOLOv8 also has object classification and segmentation detection heads; adding those should not require a whole lot of effort.

There also isn't a whole lot missing in order to actually train the models, mainly the DFL loss implementation plus some image augmentation system. It would be interesting to get those pieces working as well.

This is a good question and something I have debated for quite some time. Relevant issue in #459. For most cases, modules can just be replaced with Elixir functions. This is the pattern we follow in Bumblebee. For example: https://github.com/elixir-nx/bumblebee/blob/main/lib/bumblebee/text/bert.ex#L554

That makes a lot of sense, it's pretty close to what I ended up with as well.

I agree though there is no easy way to explicitly group layers/models, and this is a serious drawback in the API right now. I have considered adding something like Axon.function(...) which wraps a group of layers and returns a function which is basically a re-usable "sub-network" as you describe, but I'm not sure that will really fix it either. One challenge of trying to duplicate the PyTorch module approach is that Axon networks are both immutable and stateless. As currently designed, Axon networks are completely distinct from their parameter map. Additionally, since layer names are generated lazily, the "shape" of subnetworks in the parameter map can change as a result of additional layer modifications.

I will continue thinking through the best way to integrate this in a functional way, without necessarily forcing the OOP/module style approach. In Bumblebee we often reference "blocks" as a group of subnetworks, so maybe we can introduce an Axon.block which represents a "block" of layers which is meant to have ownership of its sublayers.

[...]

Dot separated paths is also how we do it in Bumblebee. I agree that we again can probably do a better job of making this easier to do under the hood. Again maybe an Axon.block or Axon.function implementation would help. I will explore adding this and see if it makes these things easier.

I think something like that would be really nice. For me, it would serve two main functions:

  • Easier-to-understand tables when printed. In my case I would get ~20 main layers instead of ~300 sublayers. There could then be a mechanism for drilling down, maybe a tree-view integration in Livebook? This would make things significantly easier to debug.
  • Providing nesting in the parameters. Not really a technical thing, but it makes it easier to debug when porting weights from other implementations.

Something like an Axon.block would be nice, possibly used as:

Axon.block({input1, input2}, fn {input1, input2} ->
  Axon.add(input1, input2)
end, name: "my_block")

Of course I am not familiar with the internals of the library, but something like that would be nice.

Is there an expectation that predict can modify mutable internal state in XLA or other backends?

I'm a bit confused by this one. build returns a numerical definition for the initialization and predict functions. All numerical definitions are pure functions, so at the very least they do not modify the input tensors. XLA is a black box though, so whatever happens internally is up to the compiler.

Or is it mainly to initialize/copy parameters to the backend representation?

Maybe the confusion is here? Axon networks don't come with any parameters, so when you're creating the network you're just building up an Elixir data structure (see Axon.Node: https://github.com/elixir-nx/axon/blob/main/lib/axon/node.ex).

When you call Axon.build the network gets compiled into 2 functions: init and predict. This is done by traversing the network and handling each layer here: https://github.com/elixir-nx/axon/blob/main/lib/axon/compiler.ex#L49

Thanks for the clarification, things make a lot more sense now.

From my perspective of not being familiar with the internals, I had no idea whether these functions called into a NIF to initialize some state on the backend, or something along those lines.

Maybe we could add a few sentences to the documentation for Axon.build on what init_fn does? Maybe something like:

init_fn: Given the shape of the inputs and a map of pre-initialized parameters, performs initialization for all parameters not present in the input and returns the updated parameter map. If passed an empty map, initialization is done on all parameters.

I think this issue is with Livebook outputs and not with Axon? Larger models are difficult to inspect in general. PyTorch also has a nested "tree-view" of a model as their default representation, which may be helpful for us here.

Yep, this is certainly more Livebook-related, but I thought I would bring it up here since they seem fairly adjacent in the ecosystem :)

It would be useful if predict had different stricter modes

Good point, I think in Bumblebee we have debug logs to indicate which parameters were missing in an initial state. I think we can add something similar here. I don't think we should raise though as we have to consider the training case where you partially initialize a model. I think debug here might be sufficient. In a debug mode we can also log unused parameters.

:+1:

Good question. The unpickler is a separate library, baking_soda, but the logic for loading PyTorch parameters and converting them to Axon parameters is specific to Bumblebee. Perhaps it does make sense, though the intent with Bumblebee is to make it the repository for using pre-trained models in Elixir, so I don't know if we should have it live separately.

That makes a lot of sense. Let's work towards getting the YOLO implementation into Bumblebee. It then also makes sense for me to refactor the implementation a bit to use as much of the utilities and structure of Bumblebee as possible.

hansihe avatar Feb 12 '23 18:02 hansihe

I would definitely be interested in getting the implementation into Bumblebee!

Yeah! Bumblebee is pretty much pre-trained models, plus param rewriting so we can import params from HuggingFace! More info here: https://huggingface.co/docs/transformers/model_doc/yolos

josevalim avatar Feb 12 '23 18:02 josevalim

I also have another point that came to mind:

YOLO is inherently a model that works pretty well with images of different dimensions. It would make sense for me to specify the shape of the input image as {1, 3, nil, nil}, so the model can be run on images of different sizes.

There are probably several places that make that difficult, but the first one I encountered was the Axon.resize layer. In order to get the functionality of scaling up by a factor of 2, I need to call get_output_shape and then scale up from there. That would not work with nil dimensions on the input.
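For illustration, my current workaround looks roughly like this (the input name and shapes are placeholders; it only works because the shape is fully known up front):

{_batch, _channels, h, w} =
  Axon.get_output_shape(x, %{"image" => Nx.template({1, 3, 640, 640}, :f32)})

upsampled = Axon.resize(x, {h * 2, w * 2}, method: :nearest)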

Is there any other way of doing this? The resize node in ONNX takes either a target size or a scale, which makes it possible to represent this operation on an unknown input size.

hansihe avatar Feb 12 '23 18:02 hansihe

We don't have any object detection models yet and I think it would be super useful to the community

@seanmor5 definitely, though I don't see any checkpoint for yolov8 on HF. There are files in this GitHub release, but what they store is a yolo-specific map and it has the whole PyTorch model serialized, not just the parameters.

I believe we have something in Bumblebee as a utility that basically implements this, but I think this is a sign to upstream it. Perhaps Axon.destructure or Axon.unwrap?

@seanmor5 we only have unwrap_tuple, which actually uses Axon nodes that do elem underneath. I'm not sure if we can destructure a node eagerly, because we don't know what container the layer returns without compiling the graph (?)

Good question. The unpickler is a separate library, baking_soda, but the logic for loading PyTorch parameters and converting them to Axon parameters is specific to Bumblebee. Perhaps it does make sense, though the intent with Bumblebee is to make it the repository for using pre-trained models in Elixir, so I don't know if we should have it live separately.

The only generic part would be loading the .pt files, and I'm not sure it makes sense to have a separate package for that, especially since we plan to migrate to .safetensors eventually. Converting the parameters to Axon (transposition, etc.) is sometimes model-specific and I think it inherently belongs to Bumblebee.

jonatanklosko avatar Feb 13 '23 11:02 jonatanklosko

Yeah! Bumblebee is pretty much pre-trained models, plus param rewriting so we can import params from HuggingFace! More info here: https://huggingface.co/docs/transformers/model_doc/yolos

YOLOS is a different model based on the vision transformer architecture, while the regular YOLO series is a more traditional CNN-based object detection model. As I understand it, YOLOS is mainly meant as an exploration in vision transformers, and is not necessarily meant to get state of the art performance. YOLOv8 gets quite a bit higher mAP than YOLOS.

definitely, though I don't see any checkpoint for yolov8 on HF. There are files in this GitHub release, but what they store is a yolo-specific map and it has the whole PyTorch model serialized, not just the parameters.

The YOLOv8 model and the HF community don't have much overlap, I believe. Is Bumblebee only meant for transformer models/models with a presence on HF?

Those are indeed the checkpoint files you tend to use for the main YOLOv8 implementation. My implementation supports loading parameters from those files.

YOLOv8 models are also not that expensive to train from scratch, ranging from <1 day for the smallest variant to around 8 days for the x variant (all on a V100). Training our own base checkpoints without importing them is very feasible.

hansihe avatar Feb 13 '23 16:02 hansihe

Is Bumblebee only meant for transformer models/models with a presence on HF?

We currently add models implemented in huggingface/transformers (not limited to transformer models), which are stored on HF Hub in a standardized format (in particular, only parameters are stored, not the whole model).

We could support other "providers", but I don't think it's applicable to cases like this, where both the format and location are highly model-specific.

One way to approach this would be to implement a model in bumblebee, then convert the parameters, as you did, and dump that into HF Hub repos. However, for this we need to figure out how we want to store all the configuration in a repo, similarly to the config files that huggingface/transformers save and load, and also integrate safetensors for storing the parameters.

jonatanklosko avatar Feb 13 '23 19:02 jonatanklosko

We currently add models implemented in huggingface/transformers (not limited to transformer models), which are stored on HF Hub in a standardized format (in particular, only parameters are stored, not the whole model).

The reason why I gravitate towards loading directly from the .pt model files is because it is highly compatible with the other implementations. I will need a save format too at some point if I want to do training in Axon; I imagine safetensors is a pretty good choice for that.

We could support other "providers", but I don't think it's applicable to cases like this, where both the format and location are highly model-specific.

One way to approach this would be to implement a model in bumblebee, then convert the parameters, as you did, and dump that into HF Hub repos. However, for this we need to figure out how we want to store all the configuration in a repo, similarly to the config files that huggingface/transformers save and load, and also integrate safetensors for storing the parameters.

When loading from the .pt file I load the configuration that's embedded in it. When starting from scratch, this is usually stored in a standalone .yaml file for the YOLO models.

In general the use case of YOLO is more often training/transfer learning it on your own data than using the off the shelf provided parameters directly.

I'm willing to donate the model to Bumblebee if it is wanted, but I'm pretty neutral on it. It does sound like it might be a better fit for its own repo, since the YOLO ecosystem doesn't seem to overlap with the HF community very much.


Anyways, I feel like this thread got a little bit out of hand as I mainly meant this to be a place for some feedback on Axon. Maybe we should continue this in an issue on either my repo or Bumblebee if we want to discuss it further?

hansihe avatar Feb 13 '23 21:02 hansihe