Yishuo Wang

27 comments of Yishuo Wang

> @MeouSker77 maybe we could add an issue to openvino team jira as well? (will send link offline)

There is already an issue about this in the OpenVINO team's Jira.

> Do you have a ticket link for this?

I'll send the link offline.

Nano's init script will try to find `libtcmalloc.so` in two locations: `${NANO_DIR}/libs/libtcmalloc.so` and `${LIB_DIR}/libtcmalloc.so`.
- `${NANO_DIR}` is the directory where you installed Nano; in your case, it's `/home/yuwen/BigDL/python/nano/src/bigdl/nano`
- `${LIB_DIR}`...

My understanding is that, in theory, a vector can compute the element count up front by subtracting the begin iterator from the end iterator, so each thread's share of the elements can be assigned before the computation even starts. The only communication then happens at the very end, when the per-thread results are summed; the rest of the time the threads need no communication or synchronization at all, so it is plausible for the parallel version to beat the serial one. With a non-random-access iterator, however, such as a linked list's, the total element count cannot be known in advance, so the elements have to be traversed one by one and handed out to threads dynamically. The threads must then synchronize with atomic operations to guarantee that no element is processed twice, and if every step forward costs one atomic operation, that cost far exceeds the cost of testing whether a number is divisible by 2. In that case, no matter how many elements there are, the parallel version is bound to be slower than the serial one.

> With an AVX512 machine, you may want to look into using `_mm256_dpbssd_epi32` in `mul_sum_i8_pairs_float`, that could give another speed boost. (Preprocessor condition: `#if __AVXVNNIINT8__`)

Thank you very much for...

I have tried the original fp16 model, without quantization or other optimizations, with this input; it also repeats outputs, so I think it's probably an issue with the model itself.

Tried the latest Qwen 1.5 7B:

```python
# -*- coding: utf-8 -*-
import torch
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM, AutoModel
from transformers import AutoTokenizer

model_path = ""
model =...
```

> I encountered the same Native API failed problem when input token size > 2000.

Native API failed -999 or -5 means out of memory in most cases; we are...

This issue is caused by the lack of Visual Studio, and it has been solved. As for the wrong output of whisper-base, could you share your code to run whisper-base...

**This bug is caused by using the XMX kernel in a new thread**; it won't happen if the model runs in the current thread. And I think its root cause is a bug...