Halide Development Roadmap
This issue serves to collect high-level areas where we want to improve or extend Halide. Reading it will let you know what is on the minds of the core Halide developers. If there's something you think we're not considering that we should be, leave a comment. This document is a continual work in progress.
This document aims to address the following high-level questions:
- How should we organize development?
- How do we make Halide easier to use for new users?
- How do we make Halide easier to use for new contributors?
- How do we keep Halide maintainable over time?
- How do we make Halide easier to use for researchers wanting to cannibalize it, extend it, or compare to it?
- How do we make Halide more useful on current and upcoming hardware?
- How do we make Halide more useful for new types of application?
To the greatest extent possible we should attach actionable items to roadmap issues.
Documentation and education
The new user experience could use an audit (e.g. the README).
There are a large number of topics that are missing tutorials.
Some examples:
- The GPU memory model (e.g. dirty bits, implicit device copies, explicitly scheduled device copies)
- Using Func::compute_with
- Effectively picking a good TailStrategy
- Scheduling atomic reductions, including horizontal vector reductions
- Generators with multiple outputs (there's a trade-off between tuples, extra channels, compute_with)
- Using (unrolled) extra reduction dimensions for scattering to multiple sites (plus the scatter/gather intrinsics)
- Using extern funcs and extern stages in generators
- Calling other generators inside a generator
- Using a Generator class defined in the process directly via JIT (Generator::realize isn't discoverable)
- Overriding the runtime
- Automatic differentiation
- Integrating with OpenCV, tensorflow, pytorch, and other popular frameworks.
- lambda
- Buffer
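To give a flavor of the last two items, here is a minimal JIT sketch (nothing beyond standard Halide.h usage is assumed):

```cpp
#include "Halide.h"
#include <cstdio>
using namespace Halide;

int main() {
    Var x("x"), y("y");

    // lambda: build a small anonymous Func inline.
    Func ramp = lambda(x, y, x + 2 * y);

    // Buffer: a reference-counted wrapper around halide_buffer_t that works
    // both as a JIT input and as the result of realize().
    Buffer<int> out = ramp.realize({4, 4});
    printf("out(3, 3) = %d\n", out(3, 3));  // prints 9
    return 0;
}
```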
There is not enough educational material on the Halide expert-user development flow, looping between tweaking a schedule, benchmarking/profiling it, and examining the .stmt and assembly output.
One thing we have is this: https://www.youtube.com/watch?v=UeyWo42_PS8
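For reference, the core of that loop can already be driven from a few API calls; a rough sketch follows (halide_benchmark.h lives in the tools/ directory, and exact overloads may vary a bit between Halide versions):

```cpp
#include "Halide.h"
#include "halide_benchmark.h"  // from the tools/ directory
#include <cstdio>
using namespace Halide;

int main() {
    ImageParam in(UInt(8), 2);
    Var x("x"), y("y");

    Func blur("blur");
    Expr v = cast<uint16_t>(in(x, y)) + cast<uint16_t>(in(x + 1, y)) +
             cast<uint16_t>(in(x, y + 1));
    blur(x, y) = cast<uint8_t>(v / 3);

    // 1. Tweak the schedule.
    blur.vectorize(x, 16).parallel(y);

    // 2. Dump the lowered IR and the generated assembly for inspection.
    Target t = get_host_target();
    blur.compile_to_lowered_stmt("blur.stmt", {in}, Text, t);
    blur.compile_to_assembly("blur.s", {in}, "blur", t);

    // 3. Benchmark it.
    Buffer<uint8_t> input = lambda(x, y, cast<uint8_t>(x + y)).realize({1024, 1024});
    Buffer<uint8_t> output(1023, 1023);
    in.set(input);
    double best = Tools::benchmark(10, 10, [&]() { blur.realize(output); });
    printf("best: %g ms\n", best * 1e3);
    return 0;
}
```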
Documentation for the developers
- There should be a guide for how an external contributor should make their first pull request on Halide and what to expect. This is commonly kept in a top-level CONTRIBUTING.md document. There are also pull request templates we can create.
- There should be a more detailed document or talk describing the entire compilation pipeline from the front-end IR to backend code, to help new developers understand the entire project.
Support for extending or repurposing parts of Halide for other projects
Some things that could help:
- Robust serialization and deserialization of the front-end IR and the lowered IR
- Being able to compile libHalide without LLVM
- Being able to delegate compilation of parts of a Halide pipeline to an external sub-compiler. (e.g. see https://docs.google.com/presentation/d/1e3gsYkOrsM4XnI2IuMmFtIU6MAzTZ1zajcqWUDMDJUg/edit?usp=sharing )
Build system issues
We shouldn't assume companies have functioning build tools
Some companies build projects using a mix of duct tape and glue in a platform-varying way. Any configuration that goes into the build system is very painful for them (e.g. GeneratorParams for generator variants). Large numbers of binaries (e.g. one generator binary per generator) can also be painful (e.g. in Visual Studio). We should consider making GenGen.cpp friendlier to the build system (e.g. by implementing caching or depfiles) to help out these users.
Our buildbots aren't keeping up and require too much manual maintenance
Our buildbots are overloaded and have increasingly out-of-date hardware in them. Some can only be administered by employees at specific companies. We need to figure out how to increase capacity without requiring excessive manual management of them.
Runtime issues
- The runtime includes a lot of global state, which is great for sharing things between all the Halide pipelines in a process, but if there are multiple types of user of Halide in the same large process, things can get complicated quickly (e.g. if they want different custom allocators). One option would be removing all global state and passing the whole runtime in as a struct of function pointers (see the sketch after this list).
- While most of the important parts of the runtime can be overridden by setting function pointers, some parts can only be overridden using weak linkage or other linker tricks, which is problematic on some platforms in some build configurations.
- There needs to be more top-level documentation for the runtime, describing how one may want to customize it in various situations. Currently there's just a few paragraphs at the top of HalideRuntime.h, and then documentation on the individual functions.
- Runtime error handling is a contentious topic. The default behavior (abort on any error) is the wrong thing for production environments, and there isn't much guidance or consistency on how to handle errors in production.
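As a purely hypothetical sketch of the struct-of-function-pointers idea from the first bullet (none of these type or parameter names are real Halide API; the signatures just mirror the existing overridable hooks):

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical: instead of process-global hooks and weak symbols, an
// AOT-compiled pipeline could receive its entire runtime explicitly.
struct halide_runtime_interface {
    void *(*malloc_fn)(void *user_context, size_t size);
    void (*free_fn)(void *user_context, void *ptr);
    void (*error_fn)(void *user_context, const char *msg);
    int (*do_par_for_fn)(void *user_context,
                         int (*task)(void *, int, uint8_t *),
                         int min, int extent, uint8_t *closure);
};

// Each generated pipeline would take the interface as a parameter, so two
// independent users of Halide in one process could supply different
// allocators, error handlers, thread pools, etc.:
//
//   int my_pipeline(const halide_runtime_interface *rt, void *user_context,
//                   halide_buffer_t *in, halide_buffer_t *out);
```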
Lifecycle
Versioning
Since October 2020, Halide has used semantic versioning. The latest release is (or will soon be) v15.0.0. We should adopt some practice for keeping a changelog between versions for Halide users. Our current approach of labeling "important" PRs with release_notes has not scaled.
Packaging
Much work has been put into making Halide's CMake build amenable to third-party package maintainers. There is still more to do for cross-compiling our arm builds on x86.
We maintain a list of packaging partners here: https://github.com/halide/Halide/issues/4660
Code reuse, modularity
How do we reuse existing Halide code without recompiling it, especially in a fast-prototyping JIT environment? An extension of extern function calls or of generators could achieve this.
Building a Halide standard library
There should be a set of Halide functions people can just call or include in their programs (e.g., image resampling, FFT, Winograd convolution). The longstanding issue to solve is that it's hard to compose the schedules of library code and calling code.
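A minimal sketch of that composition problem, using a hypothetical library routine (box_blur_3x1 is invented for illustration): the library can express the algorithm, but either has to leave itself unscheduled or commit to a one-size-fits-all schedule, and the caller ends up reaching into the library's Funcs.

```cpp
#include "Halide.h"
using namespace Halide;

// Hypothetical "standard library" routine: it knows the algorithm but not
// how its result will be consumed, so it returns an unscheduled Func.
Func box_blur_3x1(Func in) {
    Var x("x"), y("y");
    Func blur("box_blur_3x1");
    blur(x, y) = (in(x - 1, y) + in(x, y) + in(x + 1, y)) / 3;
    return blur;
}

int main() {
    ImageParam input(Int(32), 2);
    Var x("x"), y("y");

    Func padded = BoundaryConditions::repeat_edge(input);
    Func blur = box_blur_3x1(padded);

    Func out("out");
    out(x, y) = blur(x, y) * 2;

    // The caller has to schedule the library's Func itself, which couples
    // the two and is exactly the composition problem described above.
    blur.compute_at(out, y).vectorize(x, 8);
    out.parallel(y).vectorize(x, 8);

    out.compile_jit();
    return 0;
}
```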
Fast prototyping
How can we make fast prototyping of algorithms in Halide easier? JIT is great for getting started, but not all platforms support it (e.g. iOS), and the step from JIT to AOT is large, in terms of what the code looks like syntactically, what the API is, and what the mental model is.
Consider typical deep learning/numerical computing workflows (PyTorch, NumPy, Matlab, etc.). A user fires up an interpreter, manipulates and visualizes their data, experiments with different computation models, prints out intermediate values of their program to understand the data and debug it, and reruns the program multiple times on different inputs as they iterate.
Unfortunately, the current Halide workflow does not fit this very well, even with the Python frontend.
- JIT caches are cleared every time the program instance is terminated. Even if the Halide program has not changed, if you rerun the program (for different parameters or inputs), Halide needs to recompile the whole program. This has become a major bottleneck for fast iteration of ideas.
- Printing intermediate values of Halide programs for debugging and visualization is painful. You either have to use the cumbersome print() (and recompile the program) or add the intermediate Halide function to the outputs (and recompile the program).
- Halide's metaprogramming interface makes it less usable in a (Jupyter) notebook environment.
Two immediate work items:
- Have an option for JIT compilation to save the result to disk and load it back automatically when it is cached. This is related to the serialization effort.
- Have an interpreter for Halide (or equivalently an "eager mode", cf. TensorFlow) that defaults to some slow schedule (e.g., compute_root everything with basic parallelization).
GPU features
We should be able to place Funcs in texture memory and use texture sampling units to access them.
This is particularly relevant on mobile GPUs where you can't otherwise get things to use the texture cache. It's also necessary to interop with other frameworks that use texture memory (e.g. coreML).
An API to perform filtered texture sampling is needed. Ideally this would work in a cross-platform way, even if not blazingly fast everywhere; being able to validate on CPUs is very useful. There are some design issues around the scope and cost of the sampler object allocations that many GPU APIs require.
Currently this has been low priority because we don't have examples where texture sampling matters a lot. Even for cases where it obviously should (e.g. bilateral guided upsample), it doesn't seem to matter much.
A good first step is supporting texture sampling on CUDA, because it doesn't require changing the way in which the original buffer is written to or allocated. An independent first step would be supporting texture memory on some GPU API without supporting filtered texture sampling. These two things can be done orthogonally.
Past issues on this topic: #1021, #1866
We should support tensor instructions.
We have support for dot product instructions on arm and ptx via within-vector reductions. The next task is nested vectorization #4873. After that we'll need to do some backend work to recognize the right set of multi-dimensional vector reductions that map to tensor cores. A relevant paper on the topic is: https://dl.acm.org/doi/10.1145/3378678.3391880
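For context, the existing within-vector reduction path looks roughly like the sketch below. This is a sketch only: the exact scheduling directives needed (e.g. whether atomic() is required, or how x should additionally be vectorized or unrolled) vary by target and Halide version, and whether the backend actually emits udot/dp4a depends on pattern matching.

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam a(UInt(8), 1), b(UInt(8), 1);
    Var x("x");
    RDom r(0, 4);

    // 4-wide u8*u8 dot products accumulated into 32 bits. Vectorizing the
    // reduction dimension produces a VectorReduce node, which the ARM and
    // PTX backends can pattern-match to dot-product instructions
    // (udot/sdot, dp4a) when the shapes line up.
    Func dot("dot");
    dot(x) = cast<uint32_t>(0);
    dot(x) += cast<uint32_t>(a(4 * x + r)) * cast<uint32_t>(b(4 * x + r));

    dot.update()
        .atomic()       // assert the reduction is associative
        .vectorize(r);  // within-vector reduction over r

    dot.compile_to_assembly("dot.s", {a, b}, "dot", Target("arm-64-linux"));
    return 0;
}
```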
New CPU features
ARM SVE support
Machine learning use-cases
We should be able to compile generators to tensorflow and coreML custom ops.
We can currently do this for pytorch (see apps/HelloPytorch), but it's not particularly discoverable.
We should have fully-scheduled examples of a few neural networks. We have resnet50, but it's still unscheduled.
Targeting MLIR is worth consideration as well.
This is likely a poor match, because most MLIR flavors operate at a higher level of abstraction than Halide (operations on tensors rather than loops around scalar computation).
Autoschedulers
There's lots of work to do before autoschedulers are truly useful. A list of tasks:
- We need to figure out how to provide stable autoschedulers that work with Halide master, to serve as baselines for academic work, while still being able to improve the autoschedulers over time.
- There needs to be a tutorial on using the standalone autoschedulers, including the autotuning modes for those that can autotune.
- We need to figure out how to include them in distributions.
- There should be a hello-world autoscheduler that serves as a guide for writing a custom one.
- There should be a one-click solution for all sorts of autoscheduling scenarios (pretraining, autotuning, heuristics-based, etc.).
- For several autoschedulers, the generated schedules may or may not work for image sizes smaller than the estimates provided. This is lousy, because autoschedulers should be usable by people who don't understand the scheduling language and don't know how to fix TailStrategies. (A sketch of how estimates are provided in a generator follows this list.)
- https://github.com/halide/Halide/issues/4271
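For reference, the estimates mentioned above are what a generator author provides via set_estimates; a minimal sketch (the Brighten generator itself is invented for illustration):

```cpp
#include "Halide.h"
using namespace Halide;

class Brighten : public Halide::Generator<Brighten> {
public:
    Input<Buffer<uint8_t>> input{"input", 3};
    Output<Buffer<uint8_t>> output{"output", 3};

    void generate() {
        Var x("x"), y("y"), c("c");
        output(x, y, c) =
            cast<uint8_t>(min(cast<uint16_t>(input(x, y, c)) + 32, 255));
    }

    void schedule() {
        // Autoschedulers key their decisions (tile sizes, TailStrategy,
        // parallelism) off these estimates. If the pipeline later runs on
        // much smaller images, the generated schedule may not hold up.
        input.set_estimates({{0, 3840}, {0, 2160}, {0, 3}});
        output.set_estimates({{0, 3840}, {0, 2160}, {0, 3}});
    }
};

HALIDE_REGISTER_GENERATOR(Brighten, brighten)
```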
Things we can deprecate
- arm-32 (probably still need this)
- x86 without sse4.1
Assigning a bunch of people who seem like they may want to contribute to top-level planning.
GPU support: I think there's a bigger, higher-level architectural issue: the memory management and runtime models for GPUs/accelerators feel broken and insufficient. We should consider rethinking them significantly to allow clearer and more explicit, predictable control (as we have on CPUs with a single unified memory space).
Modules / Libraries / reusable code / abstraction
Build system: Should we explicitly break apart build system issues for Halide development and build system issues for Halide users? I think these are mostly quite distinct and probably should be separate top-level headings.
Better accessibility and support for research within/on the Halide code base
Build Halide without LLVM: useful for GPU JIT and IR manipulation.
GPU support: I think there's a bigger, higher-level architectural issue: the memory management and runtime models for GPUs/accelerators feel broken and insufficient. We should consider rethinking them significantly to allow clearer and more explicit, predictable control (as we have on CPUs with a single unified memory space).
Have you been following the store_in stuff? You just explicitly place Funcs in the memory type you want now.
We shouldn't assume companies have functioning build tools Some companies build projects using a mix of duct tape and glue in a platform-varying way. Any configuration that goes into the build system is very painful for them (e.g. GeneratorParams for generator variants).
I agree with @jrk that we should distinguish build issues that affect Halide developers versus users. Generator aliases fix the GeneratorParams thing a bit, but they aren't very discoverable and aren't covered in the tutorials AFAIK. See #4054 and #3677
Large numbers of binaries (e.g. one generator binary per generator) can also be painful (e.g. in Visual Studio).
I'm not sure why this is painful in Visual Studio? Just because of how many times GenGen.cpp gets built? We could fix that by optimizing GenGen for build time. Windows users who wish to build Halide from source should use CMake. If they want to use binary releases without our CMake rules, then they're on their own. We shouldn't pay off their technical debt for nothing in return.
We should make GenGen.cpp capable of taking over some of the role of the build system (e.g. caching) to help out these users.
Properly caching Halide outputs is complicated. Outputs are a function of the Halide version (we don't currently version Halide), the autoscheduler version (if used; we also don't currently version our autoschedulers), the algorithm and schedule (can these be consistently hashed?), and the generator parameters. It's not clear to me how often this is a benefit in incremental build scenarios. If your source files changed, then typically so has your pipeline.
In CI scenarios, Halide versioning becomes more important since users would otherwise run into cache invalidation issues every time they update. Between builds, there could be some wins here, but they could also implement their own caching system by hashing the source files, Halide git commit hash, and generator parameters.
Our buildbots aren't keeping up and require too much manual maintenance Our buildbots are overloaded and have increasingly out-of-date hardware in them. We need to figure out how to increase capacity without requiring excessive manual management of them.
Maybe one of the companies that wants Halide to work around their build system should pay for hardware and hire a full-time DevOps specialist. They could get all our buildbots configured with Ansible and set up all the Docker/virtual machine images we'd need. Failing that, they could foot the bill for a cloud-based CI service that has GPU and Android support.
Versioning and Releasing Halide
We should start versioning Halide and getting on a steady (quarterly?) release schedule. We could start at v0.1.0 so we don't imply any API stability per semantic versioning -- only v1 and above implies API stability within a major version. This would allow us to publish Halide on vcpkg / pip / APT PPAs / etc.
@abadams:
Have you been following the store_in stuff? You just explicitly place Funcs in the memory type you want now.
I was partly thinking of the heavily dynamic, lazy runtime aspects. When does memory get allocated and freed? When do copies happen? Most things in Halide are pretty static and eager, and explicitly controlled via schedules; GPU runtime behavior inherently includes a bunch of dynamic and lazy behavior, which is not clearly controlled by the schedule.
Imagine now having multiple GPUs or different accelerators in a machine. I should be able to use schedules to decompose computation across multiple GPUs, reason about and control explicit data movement between them, etc.
We shouldn't assume companies have functioning build tools Some companies build projects using a mix of duct tape and glue in a platform-varying way. Any configuration that goes into the build system is very painful for them (e.g. GeneratorParams for generator variants).
I agree with @jrk that we should distinguish build issues that affect Halide developers versus users. Generator aliases fix the GeneratorParams thing a bit, but they aren't very discoverable and aren't covered in the tutorials AFAIK. See #4054 and #3677
Large numbers of binaries (e.g. one generator binary per generator) can also be painful (e.g. in Visual Studio).
I'm not sure why this is painful in Visual Studio? Just because of how many times GenGen.cpp gets built? We could fix that by optimizing GenGen for build time. Windows users who wish to build Halide from source should use CMake. If they want to use binary releases without our CMake rules, then they're on their own. We shouldn't pay off their technical debt for nothing in return.
It's painful in visual studio because the actual GUI stops working right once you have more than a certain number of binary targets. Shoaib says it just stops showing them so you have no way to access them.
"People should just use X build system" is a non-solution. Halide is being used in large products that already have build systems, and it must exist within them. Punting on solving this problem entirely means that the current experience of using Halide at a company other than Google is 80% build-system nightmare and 20% writing code. If we want Halide to be a useful tool, it must be able to integrate into existing build systems cleanly.
We should make GenGen.cpp capable of taking over some of the role of the build system (e.g. caching) to help out these users.
Properly caching Halide outputs is complicated. Outputs are a function of the Halide version (we don't currently version Halide), the autoscheduler version (if used; we also don't currently version our autoschedulers), the algorithm and schedule (can these be consistently hashed?), and the generator parameters. It's not clear to me how often this is a benefit in incremental build scenarios. If your source files changed, then typically so has your pipeline.
The particular problem I've seen is that people work around the number-of-binaries issue by packing all of their generators into a single binary, but then a naive dependency analysis thinks that editing any source file requires rerunning every generator (and there may be hundreds). We might be able to help. Telling people to just fix their damn build system is obviously an attractive attitude, but that's also asking them to pay down a large amount of technical debt before they can start using Halide. The outcome is that they don't use Halide.
In CI scenarios, Halide versioning becomes more important since users would otherwise run into cache invalidation issues every time they update. Between builds, there could be some wins here, but they could also implement their own caching system by hashing the source files, Halide git commit hash, and generator parameters.
Probably better to do the hashing correctly in one place upstream than have lots of incorrectly-implemented hashing schemes downstream. We're the ones who know how to hash an algorithm/schedule/Halide version correctly.
But the caching idea was just an example of how we can make life easier for people by making it possible to do some of the things that should really be happening in the build system in C++ instead/as well, so that people can use Halide without taking on the possibly-intractable task of fixing their build system first.
@abadams:
Have you been following the store_in stuff? You just explicitly place Funcs in the memory type you want now.
I was partly thinking of the heavily dynamic, lazy runtime aspects. When does memory get allocated and freed? When do copies happen? Most things in Halide are pretty static and eager, and explicitly controlled via schedules; GPU runtime behavior inherently includes a bunch of dynamic and lazy behavior, which is not clearly controlled by the schedule.
Imagine now having multiple GPUs or different accelerators in a machine. I should be able to use schedules to decompose computation across multiple GPUs, reason about and control explicit data movement between them, etc.
Generally agree, but wanted to add that you can explicitly schedule the copies using Func::copy_to_device and friends if you don't want them done lazily. If you use that often no dirty bits come into play. The input to the Func lives only on the CPU, and the output lives only on the GPU.
I edited the top post to add:
- A few high-level philosophical questions
- More features that need tutorials e.g., compute_with, autodiff, and nesting generators (is it called "stub"?)
- Documentation for the developers
- Packages, releases
- Code reuse/modularity
- Fast prototyping
It's painful in visual studio because the actual GUI stops working right once you have more than a certain number of binary targets. Shoaib says it just stops showing them so you have no way to access them.
I don't understand why this is our problem as opposed to Visual Studio's. Isn't this something their customers would complain about? I've seen this happen in Visual Studio myself, but Googling for the issue doesn't turn up much. I'll bet if Adobe, no doubt paying hundreds of thousands of dollars for Visual Studio licenses, complained loudly enough, it could get fixed.
"People should just use X build system" is a non-solution. Halide is being used in large products that already have build systems, and it must exist within them. Punting on solving this problem entirely means that the current experience of using Halide at a company other than Google is 80% build-system nightmare and 20% writing code. If we want Halide to be a useful tool, it must be able to integrate into existing build systems cleanly.
From offline discussion, it sounds like incremental building with a unified generator binary is our worst end-user story. Still, shouldering the maintenance burden for every hand-rolled, proprietary build system is also a non-solution. If we implement caching, even opt-in, we'll have to test it, keep its behavior stable, and deal with the whole can of worms that opens up.
Even our first-party CMake build doesn't get it perfect because it can't assume one-generator-per-file. It can't generically establish a mapping between source files and generator invocations. We should look into Ninja depfiles for more precise dependencies in CMake.
I don't understand why this is our problem as opposed to Visual Studio's. Isn't this something their customers would complain about? I've seen this happen in Visual Studio myself, but Googling for the issue doesn't turn up much. I'll bet if Adobe, no doubt paying hundreds of thousands of dollars for Visual Studio licenses, complained loudly enough, it could get fixed.
It may seem dumb, but it is a reality that many real users face today, and the alternatives are:
- Don't use Halide
- Somehow work around it
Even if Microsoft should fix it (and I think it is not likely that they could or would on any reasonable time scale), the only thing in our power to do is help support working around it. If we don't, we're effectively just shutting out some of our highest-impact potential users.
My grab-bag of thoughts:
- Figure out useful guidance for practical Halide-on-GPU usage and update the code/examples/docs accordingly. For instance: by far the most common question I get about GPU usage is "how do I use Halide for mobile GPUs?" This boils down to "on iOS, use Metal; on Android, ¯\_(ツ)_/¯ (because of hardware fragmentation, crappy driver support, OpenGL being mostly useless for Halide, etc.)". It may well be that we'd be better off advising Android developers to focus on non-GPU solutions, but we'd be better off communicating that more up-front than we do currently.
- There is an awful lot of useful/necessary information on how to code effectively in Halide that isn't written down anywhere and has been passed around by word of mouth (e.g., the use of .stmt files to iterate on schedule development). We have to fix this. Andrew's suggestion of recording a walkthrough is a good first step, but that will really need to get converted to text form as well.
- I'd call our buildbot setup a disaster, but that would be an insult to actual disasters. IMHO we should really try to move as much of our testing as possible into some cloud-based solution, with local hardware (e.g. for GPU testing) added as needed; the bar for improvement is low.
- On the topic of versioning and releases, I agree with what's been said before, but will go further and suggest that we consider planned release schedules (with bugfix updates), as most other projects do; some orgs may want to continue to just track the trunk branch (as Google has done), but others could stabilize on specific versions for the longer term, without having to worry as much about subtle API or behavior issues. (This would make documentation more tractable too, since we'd be able to say "this applies to Halide 3.x" or whatnot.) Of course, this means we might need a way to decide which features should be targeted at any given release branch, what release schedule we'd use (e.g. quarterly?), etc., but that seems like a good thing to me. (The implied stability of versioning would be especially beneficial for autoscheduler adoption, as it would be much more tractable to promise that autogenerated schedules would be stable within a particular release.)
- Autoscheduler infrastructure. For them to be well accepted, the autoschedulers need to not just produce good schedules; they have to be easy to integrate into an existing build setup. Currently that requires either a lot of manual work (e.g. copy-pasting text schedules) or a lot of trust in the stability of the autoscheduler (i.e. that Halide updates won't regress your schedule). I'm not sure what the right solution for improving this is.
- Lower the barrier for experimenting with Halide code. Currently, the quick way to try adding Halide code to an existing project is to use the JIT... unless you are running on a system where this won't work (e.g. iOS). Then, if it does prove profitable, you probably need to rework your code to use AOT compilation, which has a very different API and build surface (wrap it in a Generator, move it to a separate file, add some build rules as needed, make a completely different-looking set of calls). Is there a way we could make it easy to add code using JIT (or feels-like-JIT) that could be transitioned to AOT more easily? Would using the Python bindings (or a bespoke Halide 'language') instead of C++ make this any easier?
- Runtime code model. The current model is nice mainly in that it allows for bringing up simple builds easily, but it breaks down quickly for many real-world apps (that require lots of customizations) or on systems without weak linkage (e.g. Windows). It also makes it harder than necessary to realize what is a public bit of the API. IMHO we should really consider moving to a runtime model that avoids weak linkage entirely, and instead uses something like a customizable-via-struct-of-pointers approach; this would also allow us to rationalize the user_context stuff and to normalize the runtime API between AOT and JIT, but would (eventually) be a breaking API change. (Yes, I have spent some time thinking about such a design; hopefully I'll get the time to finish an actual proposal someday...)
- On the topic of versioning and releases, I agree with what's been said before, but will go further [...] but that seems like a good thing to me.
👍 Fully agree here. Having a version is basically a prerequisite for inclusion into package managers, too. Having a stable API also means that shared distributions of Halide can be upgraded independently of applications, which is important if we hope FOSS will adopt us.
- [...] try to move as much of our testing as possible into some cloud-based solution, with local hardware (eg for GPU testing) added as needed, but the bar for improvement is low.
👍 AppVeyor seems to have a reasonable set-up that allows for a mix of self-hosted (for special hardware) and cloud-hosted instances. Also, we should try to convince one or more of the multi-billion-dollar companies that employ our developers and benefit from our work to donate computing resources for this purpose.
- Runtime code model. The current model [...] breaks down [...] on systems without weak linkage (e.g Windows). IMHO we should really consider moving to a runtime model that avoids weak linkage entirely [...]
I agree. Weak linkage and dynamic lookup into executable exports are super cool... if you're writing Linux software. Unfortunately, since they aren't standard C/C++, they're inherently non-portable and aren't modeled by CMake, so they require hacks for the supported platforms and don't work on Windows. Plus, dynamic lookup breaks a fundamental assumption about static linkage, namely that other modules won't be affected by changes to statically linked libraries. This doesn't just affect the runtime, but the plugins/autoschedulers, too. We're already planning to refactor the autoschedulers out of apps. While we're at it, we should make the interface accept a pointer to a structure in the parent process that it can populate, rather than trying to find the structure via dynamic lookup.
It also makes it harder than necessary to realize what is a public bit of the API.
See also #4651 -- as we discuss versioning, we should also discuss symbol export, since they're inter-related. At the very least, we should investigate whether -fvisibility-inlines-hidden matters in terms of binary size.
- There is an awful lot of useful/necessary information on how to effectively code in Halide [...] that will really need to get converted to text form as well.
I think both @BachiLi and I have put some thought into writing Halide tutorials. I think it would be a good idea to merge our efforts 🙂
- Lower the barrier for experimenting with Halide code. [...] you probably need to rework your code to use AOT compilation, which has a very different API and build surface
Fortunately, this is now pretty easy to do if you're using our CMake build 😉
[...] Is there a way we could make it easy to add code using JIT (or feels-like-JIT) that could be transitioned to AOT more easily?
An export API that would generate C++ code representing a Halide pipeline and schedule would be cool... but I'm not sure it would be more useful than our existing compile_to_file API. The build story would still be pretty bad.
Would using the Python bindings (or a bespoke Halide 'language') instead of C++ make this any easier?
I'm torn on the idea of having an external Halide syntax. There are some clear benefits... it would become easier to write tests, to write analysis tools, to metaprogram (maybe), provide more helpful compiler diagnostics, integrate with the Compiler Explorer, etc. But on the other hand, maybe it would just be a high-maintenance dunsel.
Porting JIT code to AOT is much bigger than just build system issues. All of a sudden it's staged compilation. E.g. things that were constants like memory layout and image size are now unknown.
Is there a way we could make it easy to add code using JIT (or feels-like-JIT) that could be transitioned to AOT more easily?
That would be a nice addition. I've been experimenting with compile_jit() with Param placeholders, and then rebinding the Params later before calling realize(). It's cumbersome; I think we could come up with something more intuitive. For example, compile_jit() could return a "callable" Pipeline which provides a function-call operator() to pass the parameters directly.
In addition, it would be nice if we could just instantiate a Generator class and call compile_jit() on it.
These changes would bridge the gap between JIT and AoT workflows and would ease development a lot.
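For concreteness, the placeholder-Param workflow described above looks roughly like this (a minimal sketch of the status quo, not a proposed API):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    // Placeholders: compile once, rebind per call.
    ImageParam in(Int(32), 2);
    Param<int> offset("offset");

    Var x("x"), y("y");
    Func f("f");
    f(x, y) = in(x, y) + offset;

    f.compile_jit();  // pay the compilation cost up front

    Buffer<int> a = lambda(x, y, x + y).realize({128, 128});
    Buffer<int> b = lambda(x, y, x * y).realize({128, 128});

    // Rebind and rerun without recompiling.
    in.set(a);
    offset.set(10);
    Buffer<int> out1 = f.realize({128, 128});

    in.set(b);
    offset.set(20);
    Buffer<int> out2 = f.realize({128, 128});

    // The proposal above: have compile_jit() hand back something callable,
    // e.g. out = f_compiled(a, 10), instead of mutating Params in place.
    return 0;
}
```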
On a side note, speaking of compile_to_file, we need a better way to request a dump of generated shader code. Having to use state-of-the-"ark" environment variables like HL_DEBUG_CODEGEN to do that is just... aarghh.
You can just instantiate a generator and call it via JIT. Generator instances have a "realize" method you can call directly, or a get_pipeline() method that gives you a Pipeline object just like when you're jitting code.
On a side note, speaking of compile_to_file, we need a better way to request a dump of generated shader code. Having to use state-of-the-"ark" environment variables like HL_DEBUG_CODEGEN to do that is just... aarghh.
In general, I'd love to see a cleaner workflow for handling (inspecting, modifying, etc.) the "fat binaries" we produce.
Right now, we have a lot of targets that generate some code for a host target and some for an "offload" target. This includes GPUs, OpenCL, Hexagon, etc. Currently, these are designed to produce single object files with the offloaded code embedded in them somehow, which is great for convenience and dependency management.
However, inspecting these embedded objects, or even modifying them (e.g. signing Hexagon code), is hard and requires either inspecting object files or hooks/callbacks (mostly implemented with environment variables like HL_DEBUG_CODEGEN or HL_HEXAGON_CODE_SIGNER).
We took some small steps here, like Module, but that only partially solved the problem for the in-memory part of Halide. It would be great to think of some way to improve this. The things I have thought a lot about are:
- A tempting way to go is to just have good tooling for working with objects/shared objects as if they were more like archives/tarballs/zip files, such that this tooling is designed to be invoked during builds. If we had good, reliable tools to extract/update global variables in object files as if they were archives, that would help with this. Object files are hard to edit though (changing the size of an embedded object is a mess).
- Another option is compiling objects to folders. Building a fat binary pipeline would become a two-stage process: "compile_to_folder" followed by "link_folder", and inspection/modification steps could be added in between. This would actually be really tempting if it were possible to simply link a file as embedded data with a given symbol name in a standardized way. But folders introduce a lot of room for new headaches: tooling that doesn't understand folders, having to track down dependencies, and there really isn't a good way to link objects and data together in a standard way. We'd be creating a new build system challenge for Halide users to solve.
I would say that any approach that does not involve inspecting an object file is inherently better than any approach that does. It's simply not portable and we try to support a variety of compilers and platforms.
For non-hexagon backends, .stmt files should capture the IR, and assembly output should capture the generated machine code in human readable form. Hexagon is compiled earlier in lowering, so it's tricky. We should find some way to carry along the higher-level representations of it. For other shader backends it's an escaped string constant, so it's there, but it looks like:
\tld.param.u32 \t%r11, [kernel_output_s0_v1_v1___block_id_y_2_param_8];\n\tld.param.u32 \t%r12, [kernel_output_s0_v1_v1___block_id_y_2_param_14];\n\tadd.s32 \t%r13, %r12, -8;\n\tld.param.u32 \t%r14, [kernel_output_s0_v1_v1___block_id_y_2_param_9];\n\tld.param.u32 \t%r15, [kernel_output_s0_v1_v1___block_id_y_2_param_10];\n\tmin.s32 \t%r16, %r10, %r13;\n\tld.param.u32 \t%r17, [kernel_output_s0_v1_v1___block_id_y_2_param_11];\n\tshl.b32 \t%r18, %r5, 4;\n\tld.param.u32 \t%r19, [kernel_output_s0_v1_v1___block_id_y_2_param_12];\n\tld.param.u32 \t%r20, [kernel_output_s0_v1_v1___block_id_y_2_param_15];\n\tadd.s32 \t%r21, %r20, -16;\n\tld.param.u32 \t%r22, [kernel_output_s0_v1_v1___block_id_y_2_param_13];\n\tmin.s32 \t%r23, %r18, %r21;\n\tadd.s32 \t%r24, %r11, -1;\n\tld.param.u32 \t%r25, [kernel_output_s0_v1_v1___block_id_y_2_param_16];\n\tadd.s32 \t%r26, %r16, %r7;\n\tld.param.u32 \t%r27, [kernel_output_s0_v1_v1___block_id_y_2_param_17];\n\tadd.s32 \t%r28, %r26, %r19;\n\tsetp.lt.s32 \t%p1, %r28, %r11;\n\tselp.b32 \t%r29, %r28, %r24, %p1;\n\tmax.s32 \t%r30, %r29, %r27;\n\tadd.s32 \t%r31, %r23, %r9;\n\tadd.s32 \t%r32, %r31, %r22;\n\tadd.s32 \t%r33, %r8, -1;\n\tmin.s32 \t%r34, %r32, %r33;\n\tmax.s32 \t%r35, %r34, %r14;\n\tmad.lo.s32 \t%r36, %r26, %r20, %r31;\n\tmul.wide.s32 \t%rd7, %r36, 4;\n\tadd.s64 \t%rd8, %rd6, %rd7;\n\tld.global.nc.f32 \t%f1, [%rd8];\n\tmad.lo.s32 \t%r37, %r30, %r25, %r35;\n\tadd.s32 \t%r38, %r37, %r15;\n\tmul.wide.s32 \t%rd9, %r38, 2;\n\tadd.s64 \t%rd10, %rd3, %rd9;\n\tld.global.nc.u16 \t%rs1, [%rd10];\n\tcvt.rn.f32.u16 \t%f
Maybe we should add a shader_assembly generator output?
Meanwhile I believe standard practice is HL_DEBUG_CODEGEN=1.
For non-hexagon backends, .stmt files should capture the IR, and assembly output should capture the generated machine code in human readable form. [...] Maybe we should add a shader_assembly generator output?
I would argue that even if we were to output a separate .stmt file for the offloaded Hexagon part of the pipeline, it would be a significant benefit.
Autoscheduler
- For pipelines that already have (what the programmer thinks is) a tight, hand-optimized schedule, I wonder if it would be possible for the autoscheduler / autotuner to accept that as a starting point in the search space.
Debugging
- A --save-temps equivalent for examining the various levels of code generation (for instance, .stmt, .s, and .o when "-e o,h" is used).
We should start talking about converting this into actionable items and divvying up the work.
I think the commit messages are too short and lack detail. Commit messages with more detail help beginners look under the hood with ease; otherwise, the learning curve is too high.
The Halide issues are not maintained very well. It seems that the Halide team is short-handed. Maybe we should solve issues as fast as possible and not let them pile up. I'm very happy to organize the solution and commit it to the documentation as soon as an issue is solved.
The Halide issues are not maintained very well. [...] Maybe we should solve issues as fast as possible and not let them pile up.
Maybe we can start by closing all the issues that were opened more than, say, 6-12 months ago and never got a comment. That would take care of 169 issues.
We have many issues that are open and quite old.
Similar to this is the number of branches that are still on this repo that have been merged or are stale. See #4567.
Both of these issues make life harder for new collaborators ("Which issues/branches are important? Where do I get started?") and deter would-be new users ("This project's maintainers don't care / are overwhelmed. The project is buggy and/or unstable").