FdyCN

Results: 16 comments by FdyCN

> In the jitify2 API (under development) you can do this: https://github.com/NVIDIA/jitify/blob/ca7f794/jitify2.hpp#L2153

@benbarsdell Thanks a lot, glad to hear that. I reviewed this branch. Is this tool necessary? https://github.com/NVIDIA/jitify/blob/jitify2/jitify2_preprocess.cpp...

I got the same results (same GPU: NVIDIA GeForce RTX 3070 Laptop GPU). Could you please check this? @mdoijade @Ru7w1k @AndyDick Thanks so much.

> Not so much for M2 Max, which always shows CPU and PCPU at 100%

I got the same issue on M2 Pro. ![image](https://user-images.githubusercontent.com/80800417/233528404-b47cd97c-32b5-44df-afec-c1f655dffdcf.png)

I know it can be added in this way:

```
jitify::Program program = kernel_cache.program(
    program1,                           // Code string specified above
    {example_headers_my_header1_cuh},   // Code string generated by stringify
    {"--use_fast_math", "-I" ${where...
```

> I think this should work, as long as the `-I` option has the correct path (e.g., "/usr/local/cuda/include"). If it's still not working for you, could you provide a full...
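For reference, here is a minimal sketch of what that call could look like with the `-I` path filled in, assuming the jitify v1 `JitCache` API; the program string and the stringified header below are made-up placeholders, not the actual sources from the issue:

```
#include "jitify.hpp"

// Hypothetical stand-ins for the real sources. With jitify, the first line of a
// code string is its name (the program name, or the header's #include path).
static const char* const program1 =
    "my_program\n"
    "#include \"example_headers/my_header1.cuh\"\n"
    "__global__ void my_kernel(float* data) { data[0] *= MY_SCALE; }\n";

static const char* const example_headers_my_header1_cuh =
    "example_headers/my_header1.cuh\n"
    "#define MY_SCALE 2.0f\n";

void build_program() {
  static jitify::JitCache kernel_cache;
  jitify::Program program = kernel_cache.program(
      program1,                           // Code string specified above
      {example_headers_my_header1_cuh},   // Code string generated by stringify
      {"--use_fast_math",
       "-I/usr/local/cuda/include"});     // Include path passed via -I
  // Kernels can then be instantiated and launched from `program` as usual.
}
```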

@Hzfengsy I'm a little bit confused, because TVM does have [Hexagon backend codegen](https://github.com/apache/tvm/blob/main/tests/python/codegen/test_target_codegen_hexagon.py), and mlc-llm is based on TVM Unity. So why can't mlc-llm lower to Hexagon target code? Is...

> I have tried to implement a 1.1B LLaMA on the Hexagon backend before and it was very slow, because I did not use CPU scheduling and only added HVX compilation instructions...

> @FdyCN Yes, there are currently some ways to support mlc running on the Hexagon backend, but in my tests it was very slow. Each token of 1.1B LLaMA takes more than 60 s...

> Shared BW/cycle is aggregate bandwidth from threadgroup memory. It's the number of bytes that can be shuffled around per core-cycle. On-core is a vendor-agnostic word for "L1 cache", on-GPU...
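To make the units concrete, here is one plausible way to turn a bytes-per-cycle figure into GB/s. It assumes the reported number is per core and scales with core count; the byte/cycle rate, core count, and clock below are illustrative assumptions, not measured values for any particular GPU:

```
#include <cstdio>

int main() {
  // Illustrative assumptions only, not measurements of any specific chip.
  const double bytes_per_core_cycle = 32.0;  // threadgroup/shared bytes moved per core per cycle
  const double num_cores = 16.0;             // number of GPU cores
  const double core_clock_hz = 1.3e9;        // core clock in Hz

  // Aggregate bandwidth = per-core bytes/cycle * cores * clock.
  const double aggregate_bw_gbs =
      bytes_per_core_cycle * num_cores * core_clock_hz / 1e9;
  std::printf("Aggregate shared-memory bandwidth: %.1f GB/s\n", aggregate_bw_gbs);
  return 0;
}
```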