AmazDeng

Results 49 comments of AmazDeng

> I recommend using multiple threads to do batched inference with the `pipeline.stream_infer` API, with each thread handling one request; the engine will automatically batch the LLM part if the above conditions...

> In my opinion, using multithreading or a thread pool amounts to the same thing: each thread processes one request. > > It is worth noting...
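The per-request threading pattern discussed in the two comments above can be sketched as follows. This is a minimal illustration, not lmdeploy's actual code: `fake_stream_infer` is a hypothetical stand-in for a call like `pipeline.stream_infer`, and the real batching happens inside the serving engine, not in this client code.

```python
# Sketch: one thread per in-flight request; the serving engine is assumed
# to batch concurrent requests internally on the LLM side.
from concurrent.futures import ThreadPoolExecutor

def fake_stream_infer(prompt: str) -> str:
    # Hypothetical placeholder for pipeline.stream_infer(prompt).
    return f"response to: {prompt}"

def batch_infer(prompts, max_workers=8):
    # Each worker thread submits exactly one request, as the comment suggests.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_stream_infer, prompts))

results = batch_infer(["q1", "q2", "q3"])
```

Whether a raw thread per request or a bounded pool is used, the client-side structure is the same; the pool just caps concurrency.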

> > 我想使用 InternVL2-40B-AWQ+lmdeploy 进行离线视频批量推理,类似于 。根据您的对话上下文,我是否可以假设,如果不使用流式处理功能,使用多线程并行发送请求的速度接近本机批量推理的速度? > > 我想问一下, InternVL2-40B-AWQ启动大概需要多大的显存? 需要A100 80G版本,可以启动

@rajeevsrao @ttyio @pranavm-nvidia @aaronp24 @ilyasher Could you please take a look at this issue?

> The problem is that `trtexec` will use random scaling factors for `int8` mode. If you replace `--best` with `--fp16` (i.e. disable `--int8`), that should improve the accuracy. @pranavm-nvidia Thanks...
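The fix described above amounts to dropping `--int8` from the build flags. A hedged sketch of the invocation (the ONNX and engine file names are placeholders, not from the original thread):

```shell
# --best enables --int8, which uses random scaling factors when no
# calibration data is supplied; build with --fp16 only instead.
trtexec --onnx=model.onnx --fp16 --saveEngine=model_fp16.engine
```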

> Same issue; you can set `flash_attn` to false and use bf16 to compile, which works for me. @seanxcwang I followed the method you provided for testing. In the hf...
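The workaround in the comment above (disable flash attention, compile in bf16) can be sketched as a config patch. The key names `use_flash_attn` and `torch_dtype` are assumptions modeled on common Hugging Face config fields, not confirmed from the original thread.

```python
# Sketch: patch a model config dict to disable flash attention and
# switch the compute dtype to bfloat16 before export/compilation.
def patch_config(config: dict) -> dict:
    patched = dict(config)
    patched["use_flash_attn"] = False    # fall back to the non-flash attention path
    patched["torch_dtype"] = "bfloat16"  # compile in bf16 rather than fp16
    return patched

cfg = patch_config({"use_flash_attn": True, "torch_dtype": "float16"})
```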

@seanxcwang I found that the following section of code in the Hugging Face model caused my TRT engine export to be in float32 format, which ensures that the...

@flybird11111 Thanks for your answer. 1. So which models, specifically, does ColossalAI support? I haven't seen a list of supported models. 2. The Open-Sora project (https://github.com/hpcaitech/Open-Sora) uses the ColossalAI engine. I've noticed...

@flybird11111 I noticed that Open-Sora and ColossalAI come from the same team. Does this mean that if I were to adapt Flux myself to a training paradigm of LoRA +...