Nan Qin

Results 13 comments of Nan Qin

One use case I have is serving through Google Cloud Run, which has a request size limit of 32 MB for HTTP/1 and no limit for HTTP/2: https://cloud.google.com/run/quotas

Platform and device info:

```
================ Platform # 1 ================
Platform name : NVIDIA CUDA
OpenCL version : OpenCL 1.2 CUDA 10.2.120
Platform vendor : NVIDIA Corporation
OpenCL profile :...
```

According to the [NVIDIA CUDA download page](https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64l), CUDA 10 is the latest version.

> Hi, can you paste your vcjob yaml?

yeah something like this

```
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-vj-d18igu
spec:
  maxRetry: 3
  minAvailable: 1
  minSuccess: 1000
  plugins:
    env: []...
```

the workaround for me is to put the flyte entities in a different python module

not using websocket. Here is an example app

```
from pydantic import BaseModel
from fastapi import FastAPI, File, UploadFile
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.trace import...
```

@WoosukKwon @ywang96 @robertgshaw2-neuralmagic the failed tests with

```
ValueError: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting...
```

ready for review @ywang96 @WoosukKwon @robertgshaw2-neuralmagic

script I ran for testing:

```
# %%
ASYNC = True
USE_RAY = False

# %%
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cpu"
model_path = "/models/huggingface/mistralai/Mistral-7B-Instruct-v0.2"
#...
```