lmdeploy
[Feature] Load dataset from ModelScope during quantization
Motivation
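Quantization needs to load a calibration dataset, and the default is currently to load it from hf (Hugging Face). This can be a problem for users who cannot connect to hf.
So, could logic for loading the relevant datasets from ModelScope be added?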
Related resources
No response
Additional context
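Here are some rough ideas of my own:
For the file https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/lite/utils/calib_dataloader.py, taking the c4 dataset as an example, #L93-#L105 contain the logic for loading from hf. After the change it would look roughly like this: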
try:
    # load the dataset from hf
    is_loaded = True
except Exception:
    is_loaded = False
    # logger.warning: Cannot load xxx from hf
if not is_loaded:
    try:
        # load the dataset from ms (ModelScope)
        is_loaded = True
    except Exception:
        is_loaded = False
        # logger.warning: Cannot load xxx from ms
if not is_loaded:
    raise Exception
...
Hi, do we need to add this feature? I can work on it.
@yinfan98 It's greatly needed!
There are two ways to implement this feature. The first is the approach @fanqiNO1 described; the second is to simplify process_hf_dataset and process_ms_dataset from xtuner and migrate them to lmdeploy.
The first method is relatively simple to implement, but it only covers the few datasets that lmdeploy currently supports.
The second method is a bit more complicated, but it can more conveniently add new datasets and support custom ones.
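To illustrate why a more general design makes new and custom datasets easier to add, here is a rough sketch of a name-to-loader registry. It is an assumption about the general shape only: it does not reproduce the signatures of xtuner's process_hf_dataset or process_ms_dataset, and `DATASET_LOADERS`, `register_dataset`, and `get_calib_texts` are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical registry: dataset name -> function returning raw text samples.
DATASET_LOADERS: Dict[str, Callable[[], Iterable[str]]] = {}


def register_dataset(name: str):
    """Decorator that records a loader function under the given dataset name."""
    def decorator(fn: Callable[[], Iterable[str]]):
        DATASET_LOADERS[name] = fn
        return fn
    return decorator


@register_dataset("toy")
def load_toy() -> Iterable[str]:
    # A custom dataset only needs a function like this; no core code changes.
    return ["first sample", "second sample", "third sample"]


def get_calib_texts(name: str, nsamples: int) -> List[str]:
    """Look up a registered loader and return at most nsamples texts."""
    if name not in DATASET_LOADERS:
        raise KeyError(f"Unknown calibration dataset: {name}")
    return list(DATASET_LOADERS[name]())[:nsamples]


samples = get_calib_texts("toy", 2)
```

With this shape, supporting another dataset, or another hub, means registering one more function rather than adding a new branch to the loader.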