
Support JIT Offline Cache for Taichi

Open PGZXB opened this issue 2 years ago • 19 comments

Solution

Workflow (on llvm backends)

... → Python Code → Taichi AST → Translate the AST to a string as the key → Hash it (hashed key) → Try to find an offline cache file by the hashed key:

  • Found: Load cache data from disk → Create the kernel from the cache → Run the kernel
  • Not Found: ( Continue to compile ) ... → Get llvm::Module + offloaded_task_list → Cache them → Run the kernel → ... → ( Before exiting ) Dump cache data to disk
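
The lookup step can be pictured with a minimal Python sketch (the cache-file suffix, directory layout, and helper name below are hypothetical, not the actual Taichi internals):

import hashlib
import os

def lookup_cache(ast_string: str, cache_dir: str):
    # Hash the stringified Taichi AST to obtain the offline-cache key.
    key = hashlib.sha256(ast_string.encode()).hexdigest()
    cache_file = os.path.join(cache_dir, key + ".tic")  # ".tic" suffix is hypothetical
    if os.path.exists(cache_file):
        # Found: load the cache data from disk; the kernel is then created from it.
        with open(cache_file, "rb") as f:
            return f.read()
    # Not Found: the caller continues to compile, caches the result in memory,
    # and dumps it to cache_file before the process exits.
    return None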

Todo & Memo

  • [x] Support for cpu
  • [x] Support for cuda
  • [x] Add ASTKeyGenerator to generate the key of the Taichi AST instead of IRPrinter; it will hold more information than IRPrinter.
  • [x] Fix bugs where changes to some global vars do not trigger re-compiling (maybe let the results of IRPrinter and Expression::serialize hold more information).
  • [x] Fix IRPrinter to generate the offline-cache key more correctly
  • [x] Add tests
  • [x] Consider changes to the compile config
  • [x] Track useless cache files and delete them; the current implementation causes a "cache-file leak"
  • [x] Implement the binary ticache file format
  • [x] Run with multiple threads/processes
  • [x] Support on vulkan
  • [x] Support on opengl
  • [x] Support on metal
  • [x] Refactor (~~see https://github.com/taichi-dev/taichi/issues/4401#issuecomment-1256846965~~). See https://github.com/taichi-dev/taichi/issues/7002

~~- [ ] Support on dx11~~ ~~- [ ] Support on dx12~~ ~~- [ ] Handle hash collisions~~ ~~- [ ] Allow setting/unsetting offline_cache per kernel (optional)~~

Usage

~~Just set offline_cache=True~~ The feature is enabled by default.

import taichi as ti

# ti.init(arch=ti.cpu, offline_cache=True)
ti.init(arch=ti.cpu)

@ti.kernel
def f():
    print("Hello ticache")

f()
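
The cache can also be disabled or pointed at a custom directory through init options mentioned later in this thread (offline_cache and offline_cache_file_path); a small sketch, with the path below chosen arbitrarily:

import taichi as ti

# Disable the offline cache explicitly:
# ti.init(arch=ti.cpu, offline_cache=False)

# Or keep it enabled and store the cache files in a custom directory:
ti.init(arch=ti.cpu, offline_cache=True,
        offline_cache_file_path="/tmp/ticache")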

Supported backends

See https://github.com/taichi-dev/taichi/issues/4401#issuecomment-1272283442

For more, see Offline Cache

Potential Bugs

  • [ ] https://github.com/taichi-dev/taichi/issues/6877
  • [ ] test_print.py with offline_cache=True (https://github.com/taichi-dev/taichi/actions/runs/6176139423/job/16764496768)

PGZXB avatar Feb 28 '22 04:02 PGZXB

Requesting to extend this to all backends, considering the huge number of users on Mac or non-CUDA laptops.

bobcao3 avatar Mar 01 '22 00:03 bobcao3

Requesting to extend this to all backends, considering the huge number of users on Mac or non-CUDA laptops.

Yes, implementing the feature for all backends is my goal.

PGZXB avatar Mar 01 '22 01:03 PGZXB

https://github.com/taichi-dev/taichi/issues/4401#issue-1153600328

PGZXB avatar Mar 10 '22 01:03 PGZXB

I argue strongly against this solution. We have profiles showing that LLVM codegen takes about 30% of the entire JIT codegen time; it would be much wiser to spend time figuring out AST->CHI-IR caching first.

bobcao3 avatar Mar 14 '22 07:03 bobcao3

A two-staged caching approach gives a major JIT-time boost to all backends. I'd argue this is also a lot cleaner to implement compared to having one-stage caching for each individual backend, which will cause maintenance problems down the line.

bobcao3 avatar Mar 14 '22 07:03 bobcao3

A two-staged caching approach gives a major JIT-time boost to all backends. I'd argue this is also a lot cleaner to implement compared to having one-stage caching for each individual backend, which will cause maintenance problems down the line.

I argue strongly against this solution. We have profiles showing that LLVM codegen takes about 30% of the entire JIT codegen time; it would be much wiser to spend time figuring out AST->CHI-IR caching first.

Step by step. This may be a temporary solution. We don't have serialization of CHI IR yet. After CHI IR serialization is implemented, a 2-level cache may well be better, especially for multiple backends...

P.S. I think CHI IR serialization is very important for standardizing CHI IR, which needs a feasible, efficient, standard solution, like LLVM IR, IL, Java bytecode, or Intel asm, and that is not easy...

PGZXB avatar Mar 14 '22 07:03 PGZXB

Is there a middle ground we can find? E.g., how easy would it be to migrate the implementation from caching LLVM to caching CHI IR? If most users don't care about the internal implementation of the cache, I expect the following scenario to happen:

  1. At first, they can only benefit from the caching behavior for CUDA/CPU backends
  2. Then after release X, they find out the caching is working for all the backends automatically.

In addition, IMHO the complexity still comes from the cache key part (considering all the involved global states). The cached contents can be adjusted fairly easily, provided that CHI IR serialization is implemented.

k-ye avatar Mar 14 '22 09:03 k-ye

Is there a middle ground we can find? E.g., how easy would it be to migrate the implementation from caching LLVM to caching CHI IR? If most users don't care about the internal implementation of the cache, I expect the following scenario to happen:

  1. At first, they can only benefit from the caching behavior for CUDA/CPU backends
  2. Then after release X, they find out the caching is working for all the backends automatically.

The (new) implementation of the offline cache is transparent. All logic is on the C++ side; the frontend only sees the offline_cache: bool and offline_cache_file_path: str options. Once we have serialization and deserialization of CHI IR, implementing CHI IR caching will be simple, and doing it after standardizing CHI IR may be better. After release X, users can still use it simply by setting options, without any migration cost. And maybe a multilevel cache is an optional (even better) solution; running the backend language directly is fastest.

PGZXB avatar Mar 14 '22 09:03 PGZXB

In addition, IMHO the complexity still comes from the cache key part (considering all the involved global states). The cached contents can be adjusted fairly easily, provided that CHI IR serialization is implemented.

I couldn't agree more. Because Taichi kernels depend on global vars/states, generating a key that uniquely identifies a kernel is difficult, and it is the key to caching a kernel (a small example follows below). And at present, before we have a standardized, de/serializable CHI IR, dumping, loading, and running the backend language is simpler than doing so for CHI IR, because the former has mature/standard solutions.

P.S. The overhead of generating the key is something we should consider. Python -> Taichi AST -> CHI IR -> Backend lang. From left to right:

  • overhead of generating the key ↑ ,
  • overhead of loading & running the offline-cache file ↓ ,
  • difficulty of generating a cache key that uniquely identifies a kernel ↓
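
As a small, made-up illustration of the global-state problem mentioned above: a kernel that captures a Python-scope variable must get a different cache key when that variable changes, otherwise a stale cached binary would be reused.

import taichi as ti

ti.init(arch=ti.cpu)

N = 4  # Python-scope constant, captured when the kernel is compiled
x = ti.field(ti.i32, shape=8)

@ti.kernel
def fill():
    for i in range(N):  # N is baked into the compiled kernel
        x[i] = 1

fill()
# If N is later changed to 8 and the script is re-run, the offline-cache key
# must change too; otherwise the kernel cached with N == 4 would be loaded
# and only the first 4 elements would be filled.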

PGZXB avatar Mar 14 '22 09:03 PGZXB

Examples

  1. cornell_box (Run ticache_examples/cornell_box.py, which prints the time spent in the 1st run of each ti.kernel. The 1st run of a ti.kernel triggers JIT, so it reflects the time spent compiling. The results are summarized in the table below; the raw outputs and a timing sketch follow it.)
| arch | offline_cache | Mode | Time spent in 1st run of render() |
| --- | --- | --- | --- |
| cpu | False | JIT | 10015.07ms |
| cpu | True | JIT & Generate cache files | 10110.19ms |
| cpu | True | Load from cache | 1060.54ms |
| cuda | False | JIT | 10325.16ms |
| cuda | True | JIT & Generate cache files | 11186.13ms |
| cuda | True | Load from cache | 865.11ms |
  • Running with arch=ti.cpu, offline_cache=False (JIT)
# Output:
PS D:\C\t\taichi> python .\ticache_examples\cornell_box.py
[Taichi] version 1.0.1, llvm 10.0.0, commit 1c3619d9, win, python 3.7.0
[Taichi] Starting on arch=x64
1st run render spent: 10015.0731ms
1st run tonemap spent: 82.79639999999999ms
...
  • 1st running with arch=ti.cpu, offline_cache=True (JIT & Generate cache files)
# Output:
PS D:\C\t\taichi> python .\ticache_examples\cornell_box.py
[Taichi] version 1.0.1, llvm 10.0.0, commit 1c3619d9, win, python 3.7.0
[Taichi] Starting on arch=x64
1st run render spent: 10110.189699999999ms
1st run tonemap spent: 57.540699999999994ms
...
  • 2nd running with arch=ti.cpu, offline_cache=True (Load from cache)
# Output:
PS D:\C\t\taichi> python .\ticache_examples\cornell_box.py
[Taichi] version 1.0.1, llvm 10.0.0, commit 1c3619d9, win, python 3.7.0
[Taichi] Starting on arch=x64
1st run render spent: 1060.5397ms
1st run tonemap spent: 54.1962ms
...
  • Running with arch=ti.gpu, offline_cache=False (JIT)
# Output
PS D:\C\t\taichi> python .\ticache_examples\cornell_box.py
[Taichi] version 1.0.1, llvm 10.0.0, commit 1c3619d9, win, python 3.7.0
[Taichi] Starting on arch=cuda
1st run render spent: 10325.164999999999ms
1st run tonemap spent: 149.37699999999998ms
...
  • 1st running with arch=ti.gpu, offline_cache=True (JIT & Generate cache files)
# Output:
PS D:\C\t\taichi> python .\ticache_examples\cornell_box.py
[Taichi] version 1.0.1, llvm 10.0.0, commit 1c3619d9, win, python 3.7.0
[Taichi] Starting on arch=cuda
1st run render spent: 11186.1319ms
1st run tonemap spent: 83.60759999999999ms
  • 2nd running with arch=ti.gpu, offline_cache=True (Load from cache)
# Output:
PS D:\C\t\taichi> python .\ticache_examples\cornell_box.py
[Taichi] version 1.0.1, llvm 10.0.0, commit 1c3619d9, win, python 3.7.0
[Taichi] Starting on arch=cuda
1st run render spent: 865.1094999999999ms
1st run tonemap spent: 77.90379999999999ms
...
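
For reference, a minimal sketch of how the "time spent in 1st running" numbers above can be measured (hypothetical; this is not the actual ticache_examples/cornell_box.py, and the kernel below is a trivial stand-in for render()):

import time

import taichi as ti

ti.init(arch=ti.cpu, offline_cache=True)

pixels = ti.field(ti.f32, shape=(16, 16))

@ti.kernel
def render():
    for i, j in pixels:
        pixels[i, j] = 0.5

t0 = time.perf_counter()
render()   # 1st run triggers JIT, or loads from the offline cache if present
ti.sync()  # make sure the kernel has actually finished
t1 = time.perf_counter()
print(f"1st run render spent: {(t1 - t0) * 1000}ms")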

PGZXB avatar May 18 '22 11:05 PGZXB

@PGZXB Hi, is it possible to use the offline cache and the offline-cache key online? I found that for the same kernel, passing ScalarField/VectorField (sparse) arguments of the same size but different instances (addresses) causes recompilation, which also hurts performance. It looks like Taichi uses the pointer of the ti.template() argument as the hash, so if the address changes it recompiles the kernel. For example, I wrap some algorithms and ScalarField data in a class and create different instances of this class https://github.com/taichi-dev/taichi/issues/5376. What I want is to avoid JIT (and also loading the cache from disk) when using a kernel with arguments of the same size/datatype.

Personally, I think this feature is also very important because people want to use Taichi more flexibly. It's also essential for using Taichi in an OOP way (needing JIT every time a class is instantiated is too slow!). For me, it's necessary for implementing the submap feature in TaichiSLAM https://github.com/taichi-dev/taichi/issues/5380

I guess it's not very difficult to migrate from this offline cache to an online one. Am I right?

A sample code:

import taichi as ti
ti.init()

@ti.kernel
def test(x: ti.template()):
    #Do something with x
    pass

x = ti.field(dtype=ti.i32)
y = ti.field(dtype=ti.i32)
B0 = ti.root.pointer(ti.ijk, (3, 1, 1)).dense(ti.ijk, (1, 2, 2))
B0.place(x)

B1 = ti.root.pointer(ti.ijk, (3, 1, 1)).dense(ti.ijk, (1, 2, 2))
B1.place(y)

test(x)
#When calling test(y), taichi will recompile the kernel, this should be avoided.
test(y)

@ti.data_oriented
class AClass:
    def __init__(self) -> None:
        x = ti.field(dtype=ti.i32)
        B0 = ti.root.pointer(ti.ijk, (3, 1, 1)).dense(ti.ijk, (1, 2, 2))
        B0.place(x)
        self.x = x
        self.B0 = B0
    @ti.kernel
    def work(self):
        # do sth with self.x
        self.x[0, 1, 2] = 3

a = AClass()
a.work()

b = AClass()
#When calling b.work, taichi will recompile the kernel, this should be avoided.
b.work()

xuhao1 avatar Jul 10 '22 01:07 xuhao1

@PGZXB Hi, is it possible to use the offline cache and the offline-cache key online? I found that for the same kernel, passing ScalarField/VectorField (sparse) arguments of the same size but different instances (addresses) causes recompilation, which also hurts performance. It looks like Taichi uses the pointer of the ti.template() argument as the hash, so if the address changes it recompiles the kernel. For example, I wrap some algorithms and ScalarField data in a class and create different instances of this class #5376. What I want is to avoid JIT (and also loading the cache from disk) when using a kernel with arguments of the same size/datatype.

Personally, I think this feature is also very important because people want to use Taichi more flexibly. It's also essential for using Taichi in an OOP way (needing JIT every time a class is instantiated is too slow!). For me, it's necessary for implementing the submap feature in TaichiSLAM #5380

I guess it's not very difficult to migrate from this offline cache to an online one. Am I right?


@xuhao1 Hi, sorry for my delayed reply. Thanks for your suggestion and idea. At present, kernels with different ti.template arguments are treated as different kernels by the online cache. I think using the offline cache online can resolve your problem (after fixing some bugs). And maybe Taichi can unify the online cache and the offline cache in the future? cc: @strongoier

PGZXB avatar Jul 12 '22 07:07 PGZXB

@PGZXB thanks for your reply. Is the offline cache currently not working properly in the nightly release?

And yes, I think unifying online and offline JIT is a good idea. Maybe a unified database could store all JIT kernels, read from/write to the file system, and also save online results.

xuhao1 avatar Jul 12 '22 08:07 xuhao1

@PGZXB thanks for your reply. Is the offline cache currently not working properly in the nightly release?

And yes, I think unifying online and offline JIT is a good idea. Maybe a unified database could store all JIT kernels, read from/write to the file system, and also save online results.

The offline cache has some bugs now which we are fixing. Could you describe 'offline cache not working properly' in more detail? Thanks.

PGZXB avatar Jul 12 '22 08:07 PGZXB

@PGZXB thanks for your reply. Is the offline cache currently not working properly in the nightly release? And yes, I think unifying online and offline JIT is a good idea. Maybe a unified database could store all JIT kernels, read from/write to the file system, and also save online results.

The offline cache has some bugs now which we are fixing. Could you describe 'offline cache not working properly' in more detail? Thanks.

In my case, I found that with offline JIT enabled, it still takes a few seconds to run when the kernel input changes (which may trigger kernel reloading/JIT). According to debug prints, it looks like Taichi is recompiling the kernel (a lot of warnings). I am not sure whether it is loading the offline cache slowly or recompiling everything. Sample output is in https://github.com/taichi-dev/taichi/issues/5376.

xuhao1 avatar Jul 12 '22 08:07 xuhao1

Hi @xuhao1. Sorry for my late reply. The problem itself is not related to the offline cache. The root cause of the re-compilation is that, for now, a field doesn't have a type, so each field is treated as a different thing. The solution has been proposed in https://github.com/taichi-dev/taichi/blob/master/docs/rfcs/20220413-aot-for-all-snode.md, but hasn't been implemented yet (cc: @jim19930609).

That said, if the offline cache works properly, it can partially solve your problem - you will only have to tolerate the long compilation time on the first run.

strongoier avatar Jul 13 '22 07:07 strongoier

@strongoier Thanks for your reply. I am now using a hacked way to implement my feature and am looking forward to your solution for online JIT. Do you have a timeline for this feature?

xuhao1 avatar Jul 13 '22 14:07 xuhao1

Use CacheManager to manage compilation result of kernels

To unify compilation for JIT, the offline cache, and AOT, I introduce the CacheManager, which can compile a kernel with automatic caching in memory and on disk.

The basic definition of CacheManager

/* using DeviceCaps = std::map<DeviceCapability, uint32_t>; */

class CacheManager { // Defined for a specific backend, e.g., gfx::CacheManager, llvm::CacheManager and metal::CacheManager
  using CompiledKernelData = /* implementation defined */;
 public:
  using Metadata = /* implementation defined */;
  enum Mode { NotCache, MemCache, MemAndDiskCache };

  struct Params {
     // implementation defined
  };

  CacheManager(Params &&init_params);
  
  // Load from memory || Load from disk || (Compile && Cache the result in memory)
  CompiledKernelData load_or_compile(const CompileConfig *config, const DeviceCaps *caps, Kernel *kernel);

  // Dump the cached data in memory to disk
  void dump_with_merging() const;

  // Run offline cache cleaning
  void clean_offline_cache(offline_cache::CleanCachePolicy policy,
                           int max_bytes,
                           double cleaning_factor) const;

 private:
  /* implementation defined */;
};

Implement JIT, offline cache and AOT based on CacheManager

We have a cache_mgr: std::unique_ptr<CacheManager>, then:

  • JIT:
// ...
auto compiled = cache_mgr->load_or_compile(config, compute_device_caps, kernel);
auto callable = /* Convert compiled data to callable object */;
// Finally: callable(&ctx);
// ...
  • Offline cache:
// Load offline cached data && Run
/* Same as JIT */

// Dump cached data to disk
cache_mgr->dump_with_merging();

// Clean cache files
cache_mgr->clean_offline_cache(/*policy*/, /*max_bytes*/, /*factor*/);
  • AOT: (Related: https://github.com/taichi-dev/taichi/issues/5979)
// ...
auto compiled = cache_mgr->load_or_compile(config, /*specified caps*/, kernel);
std::unique_ptr<aot::Kernel> kernel = /* Construct aot kernel from compiled data */;
// ...
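
As an aside, the cache-cleaning parameters (policy, max_bytes, cleaning_factor) can be pictured with a small Python sketch; this is only one plausible interpretation of their semantics, not the actual C++ implementation:

import os

def clean_offline_cache(cache_dir: str, max_bytes: int, cleaning_factor: float):
    # One plausible policy: if the total size of the cache files exceeds
    # max_bytes, delete the least recently used files until only about
    # max_bytes * cleaning_factor bytes remain.
    files = [os.path.join(cache_dir, name) for name in os.listdir(cache_dir)]
    files = [f for f in files if os.path.isfile(f)]
    total = sum(os.path.getsize(f) for f in files)
    if total <= max_bytes:
        return
    target = max_bytes * cleaning_factor
    for f in sorted(files, key=os.path.getatime):  # oldest access time first (LRU-like)
        if total <= target:
            break
        total -= os.path.getsize(f)
        os.remove(f)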

More aggressive ...

( ... )

TODO

  • Complete the gfx::CacheManager
    • [ ] Consider device caps
    • [ ] Remove the dependency on AOT in gfx::CacheManager by extracting common parts of cache and AOT, e.g., https://github.com/taichi-dev/taichi/issues/4565
    • [ ] Remove the dependency on library target gfx_runtime in gfx_cache
    • [ ] Implement gfx AOT based on gfx::CacheManager (let AOT depend on gfx::CacheManager)
  • [x] Implement metal::CacheManager and offline cache on metal
  • [ ] Refactor the implementation of the LLVM offline cache. Implement llvm::CacheManager and use it to re-implement JIT, AOT, and the offline cache on LLVM.

PGZXB avatar Sep 24 '22 03:09 PGZXB

Supported or not

| Backend | Supported or not | Overhead (running Cornell Box) |
| --- | --- | --- |
| CPU | | 393.25ms |
| CUDA | | 882.426ms |
| Vulkan | | 218.030ms |
| OpenGL | | |
| Metal | | |
| AMDGPU | | |
| Microsoft DirectX 11 | | |
| Microsoft DirectX 12 | | N/A |

P.S.

  1. The "overhead" is the time spent on loading cached compiled data and converting it to a callable object.
  2. Testing environment:
    • OS: Windows 11, CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz 1.61 GHz, RAM: 16GB for CPU, CUDA, OpenGL and Vulkan
  3. ⏩: Work in progress

PGZXB avatar Oct 08 '22 10:10 PGZXB