blueWatermelonFri
I encountered a strange bug while programming tensor cores with the **WMMA** API on an A800. I tried to print the size of an element in the fragment; normally **sizeof**(fp16) is 2,...
Hey, I was running the l2 cache test on my A800 80GB GPU, and when I tried to modify the parameter `N`, I got some strange results. By default, `N`=64, and the result...
I tried to test bandwidth with the cuda-stream benchmark; my device is a 4060 Ti, and the measured bandwidth is 288 GB/s. I changed the param `max_buffer_size` from `128l * 1024 * 1024 + 2` to `1024 * 1024` as...
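One likely explanation for a bandwidth number that jumps when the buffer shrinks: a STREAM-style measurement only reflects DRAM if the working set is much larger than the last-level cache. Below is a minimal sketch of the usual triad bandwidth accounting, assuming float32 elements and the 4060 Ti's 32 MB L2 cache (the cache size is my assumption about this card, not something stated in the post):

```python
def triad_bandwidth_gbs(n_elems, elem_bytes, elapsed_s):
    """Effective bandwidth in GB/s for one STREAM triad pass.

    The triad kernel a[i] = b[i] + s * c[i] touches three arrays
    per iteration: two reads plus one write."""
    bytes_moved = 3 * n_elems * elem_bytes
    return bytes_moved / elapsed_s / 1e9

# With only 1024 * 1024 float32 elements per array, all three
# arrays together are 12 MB -- smaller than a 32 MB L2 -- so the
# benchmark ends up measuring cache, not DRAM, bandwidth.
working_set = 3 * 1024 * 1024 * 4          # bytes
fits_in_l2 = working_set <= 32 * 1024 * 1024
print(fits_in_l2)
```

So after shrinking `max_buffer_size`, a figure well above the card's DRAM spec would be expected rather than surprising.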
When doing on-board debugging of a yolov5s model through a local Ubuntu server, fp16 accuracy dropped by 4 points. Is this expected? The rknn-toolkit2 version is: ```rknn-toolkit2 version: 2.0.0b0+9bab5682``` The rk3588 driver versions are: ``` D RKNNAPI: API: 2.0.0b0 (18eacd0 build@2024-03-22T06:07:59) D RKNNAPI: DRV: rknn_server: 2.0.0b0 (18eacd0 build@2024-03-22T14:07:19) D RKNNAPI: DRV: rknnrt: 2.0.0b0 (35a6907d79@2024-03-24T10:31:14) ``` The rknn.config parameters are: ``` rknn.config(mean_values=[[0,...
Hey, I noticed that in the gpu-cache test the blocksize is `256`; why is it not `1024`? When I changed the blocksize from `256` to `1024`, the measured L1 cache bandwidth has some...
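One common reason a benchmark pins the block size at 256 is occupancy: the block size must divide evenly into the SM's resident-thread limit, or part of the SM sits idle. A sketch, assuming a part whose limit is 1536 threads per SM (true of Ada-class GPUs; an A100/A800 allows 2048, where 1024-thread blocks would fit evenly):

```python
def occupancy(block_size, max_threads_per_sm=1536):
    """Fraction of an SM's thread slots filled by whole blocks."""
    resident_blocks = max_threads_per_sm // block_size
    return resident_blocks * block_size / max_threads_per_sm

# Six 256-thread blocks exactly fill a 1536-thread SM, but only one
# 1024-thread block fits, leaving a third of the slots empty.
print(occupancy(256))    # 1.0
print(round(occupancy(1024), 4))
```

Lower occupancy means fewer in-flight loads to hide latency, which would show up as a drop in measured L1 bandwidth; the exact numbers depend on which GPU the test is run on.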
Hi, I was reading the source code of AutoGPTQ, but I am confused by fasterquant(). What happens if there is a zero on the diagonal of the Hessian matrix? ```python dead = torch.diag(H) ==...
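For context on the question above, the GPTQ paper's handling of this case can be sketched in plain Python (this mirrors the torch idiom `dead = torch.diag(H) == 0; H[dead, dead] = 1; W[:, dead] = 0`, but is my own minimal reconstruction, not the AutoGPTQ source itself):

```python
def guard_dead_columns(H, W):
    """A zero on the diagonal of the Hessian H means that input
    channel never fired on the calibration data ("dead" column).
    Setting H[i][i] = 1 keeps the later Cholesky / inversion from
    hitting a singular pivot, and zeroing the matching weight column
    W[:, i] makes that choice harmless: a dead input contributes
    nothing to the layer output anyway."""
    n = len(H)
    for i in range(n):
        if H[i][i] == 0:
            H[i][i] = 1.0          # avoid a singular pivot
            for row in W:
                row[i] = 0.0       # weights on a dead input are irrelevant
    return H, W

H = [[2.0, 0.0], [0.0, 0.0]]       # second diagonal entry is zero
W = [[0.5, 0.7], [0.1, 0.3]]
H, W = guard_dead_columns(H, W)
print(H[1][1], W[0][1], W[1][1])   # 1.0 0.0 0.0
```

So nothing bad happens at quantization time: the dead column is neutralized before the inverse-Hessian machinery runs.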
## Problem description Current models, taking doubao-seed-1-6-250615 as an example, have **thinking mode** enabled by default, which makes translation take too long and consume too many tokens, since the thinking-phase tokens are also billed. For the same sentence, translating without **thinking mode** uses fewer than 100 tokens, while **thinking mode** needs around 300. A simple task like translation does not need thinking mode. As a result, I am forced to use doubao-1-5-pro-32k-250115, which has no **thinking mode**. ## Proposed solution Provide a switch to enable or disable **thinking mode**, or simply have it disabled by default for everyone.
Hello, first, I've been studying the source code to better understand the implementation of the sieving algorithms, and I have a quick question about a specific design choice. I noticed...