Support JIT Offline Cache for Taichi
Solution
Workflow (on llvm backends)
... → Python Code → Taichi AST → Translate the AST to a string as the key → Hash it (hashed key) → Try to find an offline cache file by the hashed key (a rough sketch of this lookup follows the list):
- Found: Load cache data from disk → Create the kernel from the cache → Run the kernel
- Not Found: (continue to compile) ... → Get `llvm::Module` + `offloaded_task_list` → Cache them → Run the kernel → ... → (before exiting) Dump cache data to disk
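As a rough illustration only (the real logic lives on the C++ side; `compile_kernel` below is a hypothetical stand-in for the actual codegen), the lookup amounts to something like:

```python
import hashlib

def compile_kernel(ast_string: str) -> str:
    # Hypothetical stand-in for the expensive codegen step (AST -> llvm::Module + tasks).
    return f"compiled({ast_string})"

def load_or_compile(ast_string: str, cache: dict) -> str:
    # Hash the stringified Taichi AST to get the cache key.
    hashed_key = hashlib.sha256(ast_string.encode()).hexdigest()
    if hashed_key in cache:
        # Found: load the cached data and build the kernel from it.
        return cache[hashed_key]
    # Not found: continue to compile, then cache the result.
    compiled = compile_kernel(ast_string)
    cache[hashed_key] = compiled  # dumped to disk before the program exits
    return compiled

cache: dict = {}
load_or_compile("kernel f() { ... }", cache)  # compiles
load_or_compile("kernel f() { ... }", cache)  # hits the cache
```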
Todo & Memo
- [x] Support for cpu
- [x] Support for cuda
- [x] Add `ASTKeyGenerator` to generate the key of the Taichi AST instead of `IRPrinter`, which will hold more information than `IRPrinter`.
- [x] Fix bugs where changes to some global variables do not trigger re-compiling (maybe let the results of `IRPrinter` and `Expression::serialize` hold more information).
- [x] Fix `IRPrinter` to generate the offline-cache key more correctly
- [x] Add tests
- [x] Consider changes to the compile config
- [x] Track useless cache files and delete them (the current implementation causes a "cache-file leak")
- [x] Implement a binary ticache file format
- [x] Support running in multiple threads/processes
- [x] Support on Vulkan
- [x] Support on OpenGL
- [x] Support on Metal
- [x] Refactor (~~see https://github.com/taichi-dev/taichi/issues/4401#issuecomment-1256846965~~; see https://github.com/taichi-dev/taichi/issues/7002)
~~- [ ] Support on dx11~~
~~- [ ] Support on dx12~~
~~- [ ] Handle hash collisions~~
~~- [ ] Allow setting/unsetting `offline_cache` per kernel (optional)~~
Usage
~~Just set `offline_cache=True`.~~ The feature is enabled by default.
```python
import taichi as ti

# ti.init(arch=ti.cpu, offline_cache=True)
ti.init(arch=ti.cpu)

@ti.kernel
def f():
    print("Hello ticache")

f()
```
Supported backends
See https://github.com/taichi-dev/taichi/issues/4401#issuecomment-1272283442
For more, see Offline Cache
Potential Bugs
- [ ] https://github.com/taichi-dev/taichi/issues/6877
- [ ] test_print.py with `offline_cache=True` (https://github.com/taichi-dev/taichi/actions/runs/6176139423/job/16764496768)
Requesting to extend this to all backends, considering the huge range of users on Mac or non-CUDA laptops.
> Requesting to extend this to all backends, considering the huge range of users on Mac or non-CUDA laptops.

Yes, implementing the feature for all backends is my goal.
https://github.com/taichi-dev/taichi/issues/4401#issue-1153600328
I argue strongly against this solution. We have profiles showing the LLVM codegen takes about 30% of the entire JIT codegen time; it would be much wiser to spend time figuring out AST→CHI-IR caching first.
A two-staged cache gives a major JIT time boost to all backends, and I'd argue it is a lot cleaner to implement compared to having a one-stage cache for each individual backend, which will cause maintenance problems down the line.
> A two-staged cache gives a major JIT time boost to all backends, and I'd argue it is a lot cleaner to implement compared to having a one-stage cache for each individual backend, which will cause maintenance problems down the line. I argue strongly against this solution. We have profiles showing the LLVM codegen takes about 30% of the entire JIT codegen time; it would be much wiser to spend time figuring out AST→CHI-IR caching first.
Step by step. Maybe it's a temporary solution. We don't have serialization of CHI IR now. After CHI IR serialization is implemented, maybe a 2-level cache is better, especially for multiple backends...
P.S. I think CHI IR serialization is very important for standardizing CHI IR, which needs a feasible, efficient, standard (more adjectives to show the importance, I think) solution, like LLVM IR, IL, Java bytecode, or Intel asm, which is not easy...
Is there a middle ground we can find? E.g., how easy is it for us to migrate the implementation from caching LLVM to caching CHI IR? If most users don't care about the internal implementation of the cache, I expect the following scenario to happen:
- At first, they can only benefit from the caching behavior for CUDA/CPU backends
- Then after release X, they find out the caching is working for all the backends automatically.
In addition, IMHO the complexity still comes from the cache-key part (considering all the involved global states). The cached contents can be adjusted fairly easily, provided that CHI IR serialization is implemented.
> Is there a middle ground we can find? E.g., how easy is it for us to migrate the implementation from caching LLVM to caching CHI IR? If most users don't care about the internal implementation of the cache, I expect the following scenario to happen:
>
> - At first, they can only benefit from the caching behavior for CUDA/CPU backends
> - Then after release X, they find out the caching is working for all the backends automatically.
The (new) implementation of the offline cache is transparent. All logic is on the C++ side. The frontend only sees the `offline_cache: bool` and `offline_cache_file_path: str` options. If we have serialization and deserialization of CHI IR, implementing CHI IR caching will be simple. Maybe doing it after standardizing CHI IR is better. After release X, users can also use it by simply setting options, without any migration cost. And maybe a multilevel cache is an optional (even better) solution; running the backend language directly is fastest.
> In addition, IMHO the complexity still comes from the cache-key part (considering all the involved global states). The cached contents can be adjusted fairly easily, provided that CHI IR serialization is implemented.
Couldn't agree more. Because Taichi's kernels depend on global variables/states, generating a key that uniquely identifies a kernel is difficult, and it is the key to implementing kernel caching. And, at present, before we have a standardized de/serializable CHI IR, dumping, loading, and running the backend language is simpler than CHI IR, because those have mature/standard solutions.
P.S. The overhead of generating the key is also something we should consider. Python → Taichi AST → CHI IR → Backend language. From left to right:
- overhead of generating the key ↑
- overhead of loading & running the offline-cache file ↓
- difficulty of generating a cache key that uniquely identifies a kernel ↓
Examples
- cornell_box (Run ticache_examples/cornell_box.py, which outputs the time spent on the 1st run of each `ti.kernel`. The 1st run of a `ti.kernel` triggers JIT, so it reflects the time spent compiling; a rough timing sketch follows the table below.)
| | JIT | JIT & Generate cache files | Load from cache | JIT | JIT & Generate cache files | Load from cache |
|---|---|---|---|---|---|---|
| `offline_cache` | False | True | True | False | True | True |
| `arch` | cpu | cpu | cpu | cuda | cuda | cuda |
| time spent in 1st run of render() | 10015.07ms | 10110.19ms | 1060.54ms | 10325.16ms | 11186.13ms | 865.11ms |
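The "time spent in 1st run of render()" row comes from timing the first kernel call, roughly as in the hypothetical sketch below (not the actual cornell_box.py; the render kernel is only stubbed out here):

```python
import time
import taichi as ti

ti.init(arch=ti.cpu, offline_cache=True)

@ti.kernel
def render():
    # Stub standing in for the real Cornell Box render kernel.
    pass

t0 = time.perf_counter()
render()  # the 1st call triggers JIT (or a cache load), so this measures compile/load time
t1 = time.perf_counter()
print(f"1st run render spent: {(t1 - t0) * 1000}ms")
```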
- Running with `arch=ti.cpu, offline_cache=False` (JIT)

```
# Output:
PS D:\C\t\taichi> python .\ticache_examples\cornell_box.py
[Taichi] version 1.0.1, llvm 10.0.0, commit 1c3619d9, win, python 3.7.0
[Taichi] Starting on arch=x64
1st run render spent: 10015.0731ms
1st run tonemap spent: 82.79639999999999ms
...
```

- 1st running with `arch=ti.cpu, offline_cache=True` (JIT & Generate cache files)

```
# Output:
PS D:\C\t\taichi> python .\ticache_examples\cornell_box.py
[Taichi] version 1.0.1, llvm 10.0.0, commit 1c3619d9, win, python 3.7.0
[Taichi] Starting on arch=x64
1st run render spent: 10110.189699999999ms
1st run tonemap spent: 57.540699999999994ms
...
```

- 2nd running with `arch=ti.cpu, offline_cache=True` (Load from cache)

```
# Output:
PS D:\C\t\taichi> python .\ticache_examples\cornell_box.py
[Taichi] version 1.0.1, llvm 10.0.0, commit 1c3619d9, win, python 3.7.0
[Taichi] Starting on arch=x64
1st run render spent: 1060.5397ms
1st run tonemap spent: 54.1962ms
...
```

- Running with `arch=ti.gpu, offline_cache=False` (JIT)

```
# Output:
PS D:\C\t\taichi> python .\ticache_examples\cornell_box.py
[Taichi] version 1.0.1, llvm 10.0.0, commit 1c3619d9, win, python 3.7.0
[Taichi] Starting on arch=cuda
1st run render spent: 10325.164999999999ms
1st run tonemap spent: 149.37699999999998ms
...
```

- 1st running with `arch=ti.gpu, offline_cache=True` (JIT & Generate cache files)

```
# Output:
PS D:\C\t\taichi> python .\ticache_examples\cornell_box.py
[Taichi] version 1.0.1, llvm 10.0.0, commit 1c3619d9, win, python 3.7.0
[Taichi] Starting on arch=cuda
1st run render spent: 11186.1319ms
1st run tonemap spent: 83.60759999999999ms
```

- 2nd running with `arch=ti.gpu, offline_cache=True` (Load from cache)

```
# Output:
PS D:\C\t\taichi> python .\ticache_examples\cornell_box.py
[Taichi] version 1.0.1, llvm 10.0.0, commit 1c3619d9, win, python 3.7.0
[Taichi] Starting on arch=cuda
1st run render spent: 865.1094999999999ms
1st run tonemap spent: 77.90379999999999ms
...
```
@PGZXB Hi, is it possible to use the offline cache and the offline-cache key online? I found that for the same kernel, passing a ScalarField/VectorField (sparse) with the same size but a different instance (address) causes recompiling. This also causes low performance. It looks like Taichi uses the pointer of the `ti.template()` argument as the hash, so if the address changes it recompiles the kernel. For example, I wrap some algorithms and ScalarField data in a class and create different instances of this class (https://github.com/taichi-dev/taichi/issues/5376). What I want is to avoid JIT (and also loading the cache from disk) when using a kernel with the same size/dtype.
Personally, I think this feature is also very important, as people want to use Taichi more flexibly. It's also essential for using Taichi in an OOP way (needing JIT every time a class is instantiated is too slow!). For me, it's necessary for implementing the submap feature in TaichiSLAM (https://github.com/taichi-dev/taichi/issues/5380).
I guess it's not very difficult to migrate from this offline cache to an online one. Am I right?
Sample code:
```python
import taichi as ti

ti.init()

@ti.kernel
def test(x: ti.template()):
    # Do something with x
    pass

x = ti.field(dtype=ti.i32)
y = ti.field(dtype=ti.i32)
B0 = ti.root.pointer(ti.ijk, (3, 1, 1)).dense(ti.ijk, (1, 2, 2))
B0.place(x)
B1 = ti.root.pointer(ti.ijk, (3, 1, 1)).dense(ti.ijk, (1, 2, 2))
B1.place(y)

test(x)
# When calling test(y), taichi will recompile the kernel; this should be avoided.
test(y)

@ti.data_oriented
class AClass:
    def __init__(self) -> None:
        x = ti.field(dtype=ti.i32)
        B0 = ti.root.pointer(ti.ijk, (3, 1, 1)).dense(ti.ijk, (1, 2, 2))
        B0.place(x)
        self.x = x
        self.B0 = B0

    @ti.kernel
    def work(self):
        # do sth with self.x
        self.x[0, 1, 2] = 3

a = AClass()
a.work()

b = AClass()
# When calling b.work(), taichi will recompile the kernel; this should be avoided.
b.work()
```
@xuhao1 Hi, sorry for my delayed reply. Thanks for your suggestion and idea. At present, kernels with different `ti.template` arguments are treated as different kernels by the online cache. I think that using the offline cache online can resolve your problem (after fixing some bugs). And maybe Taichi can unify the online cache and the offline cache in the future? cc: @strongoier
@PGZXB thanks for your reply. Is the offline cache currently not working properly in the nightly release?
And yes, I think a unified online and offline JIT is a good idea. Maybe a unified database could store all JIT kernels, reading from/storing to the file system and also saving online results.
> @PGZXB thanks for your reply. Is the offline cache currently not working properly in the nightly release?
>
> And yes, I think a unified online and offline JIT is a good idea. Maybe a unified database could store all JIT kernels, reading from/storing to the file system and also saving online results.
The offline cache has some bugs now, which we are fixing. Could you describe "offline cache not working properly" in more detail? Thanks.
> @PGZXB thanks for your reply. Is the offline cache currently not working properly in the nightly release? And yes, I think a unified online and offline JIT is a good idea. Maybe a unified database could store all JIT kernels, reading from/storing to the file system and also saving online results.

> The offline cache has some bugs now, which we are fixing. Could you describe "offline cache not working properly" in more detail? Thanks.
In my case, I found that with offline JIT enabled, it takes a few seconds to run when the kernel input changes (which may trigger a kernel reload/JIT). According to the debug prints, it looks like Taichi is recompiling the kernel (a lot of warnings). I am not sure whether it is loading the offline JIT cache slowly or recompiling everything. Sample output is in https://github.com/taichi-dev/taichi/issues/5376.
Hi @xuhao1. Sorry for my late reply. The problem itself is not related to the offline cache. The root cause of the re-compilation is that, for now, a field doesn't have a type, so each field is treated as a different thing. The solution has been proposed in https://github.com/taichi-dev/taichi/blob/master/docs/rfcs/20220413-aot-for-all-snode.md, but hasn't been implemented yet (cc: @jim19930609).
That said, if the offline cache works properly, it can partially solve your problem: you will only have to tolerate the long compilation time for the first run.
@strongoier Thanks for your reply. I am now using a hacked way to implement my feature and looking forward to your solution for online JIT. Do you have a timeline for this feature?
Use `CacheManager` to manage the compilation results of kernels
To unify compilation for JIT, offline cache, and AOT, I introduce the `CacheManager`, which can compile a kernel and automatically cache the result in memory and on disk.
The basic definition of `CacheManager`:
```cpp
/* using DeviceCaps = std::map<DeviceCapability, uint32_t>; */
class CacheManager {  // Defined for a specific backend, e.g. gfx::CacheManager, llvm::CacheManager and metal::CacheManager
  using CompiledKernelData = /* implementation defined */;

 public:
  using Metadata = /* implementation defined */;
  enum Mode { NotCache, MemCache, MemAndDiskCache };

  struct Params {
    // implementation defined
  };

  CacheManager(Params &&init_params);

  // Load from memory || Load from disk || (Compile && Cache the result in memory)
  CompiledKernelData load_or_compile(const CompileConfig *config,
                                     const DeviceCaps *caps,
                                     Kernel *kernel);

  // Dump the cached data in memory to disk
  void dump_with_merging() const;

  // Run offline cache cleaning
  void clean_offline_cache(offline_cache::CleanCachePolicy policy,
                           int max_bytes,
                           double cleaning_factor) const;

 private:
  /* implementation defined */;
};
```
Implement JIT, offline cache, and AOT based on `CacheManager`
Given a `cache_mgr: std::unique_ptr<CacheManager>`:
- JIT:

```cpp
// ...
auto compiled = cache_mgr->load_or_compile(config, compute_device_caps, kernel);
auto callable = /* Convert compiled data to a callable object */;
// Finally: callable(&ctx);
// ...
```
- Offline cache:

```cpp
// Load offline cached data && Run
/* Same as JIT */

// Dump cached data to disk
cache_mgr->dump_with_merging();

// Clean cache files
cache_mgr->clean_offline_cache(/*policy*/, /*max_bytes*/, /*factor*/);
```
- AOT (related: https://github.com/taichi-dev/taichi/issues/5979):

```cpp
// ...
auto compiled = cache_mgr->load_or_compile(config, /*specified caps*/, kernel);
std::unique_ptr<aot::Kernel> kernel = /* Construct an AOT kernel from compiled data */;
// ...
```
More aggressive ...
( ... )
TODO
- Complete the `gfx::CacheManager`
  - [ ] Consider device caps
  - [ ] Remove the dependency on AOT in `gfx::CacheManager` by extracting the common parts of cache and AOT, e.g., https://github.com/taichi-dev/taichi/issues/4565
  - [ ] Remove the dependency on the library target `gfx_runtime` in `gfx_cache`
  - [ ] Implement gfx AOT based on `gfx::CacheManager` (let AOT depend on `gfx::CacheManager`)
- [x] Implement `metal::CacheManager` and offline cache on Metal
- [ ] Refactor the implementation of the LLVM offline cache. Implement `llvm::CacheManager` and use it to re-implement JIT, AOT, and offline cache on LLVM.
Supported or not
| Backend | Supported or not | Overhead (running Cornell Box) |
|---|---|---|
| CPU | ✔ | 393.25ms |
| CUDA | ✔ | 882.426ms |
| Vulkan | ✔ | 218.030ms |
| OpenGL | ✔ | |
| Metal | ✔ | |
| AMDGPU | ✔ | |
| Microsoft DirectX 11 | ✔ | |
| Microsoft DirectX 12 | ❌ | N/A |
P.S.
- The "overhead" is the time spent on loading cached compiled data and converting it to a callable object.
- Testing environment:
  - `OS: Windows 11`, `CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz 1.61 GHz`, `RAM: 16GB` for CPU, CUDA, OpenGL and Vulkan
- ⏩: Work in progress