[Feature] Load dataset from ModelScope during quantization

Open fanqiNO1 opened this issue 1 year ago • 2 comments

Motivation

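Quantization needs to load a calibration dataset, and the default is currently to load it from Hugging Face (hf). This can be a problem for users who cannot connect to hf.

Could logic be added to load the relevant datasets from ModelScope as well?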

Related resources

No response

Additional context

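Here are some rough ideas of my own:

In https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/lite/utils/calib_dataloader.py, taking the c4 dataset as an example, #L93-#L105 hold the logic for loading from hf. After the change it would look roughly like this: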

try:
    # load dataset from hf
    is_loaded = True
except Exception:
    is_loaded = False
    # logger.warning: Cannot load xxx from hf

if not is_loaded:
    try:
        # load dataset from ms
        is_loaded = True
    except Exception:
        is_loaded = False
        # logger.warning: Cannot load xxx from ms

if not is_loaded:
    raise RuntimeError('Cannot load xxx from hf or ms')

...
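
For illustration, the fallback idea above can be written as a small runnable helper. This is only a sketch under assumptions: `load_with_fallback`, `load_from_hf`, and `load_from_ms` are hypothetical names, not part of lmdeploy, and the two loaders are stand-ins. In a real change they would presumably wrap the existing hf loading code and a ModelScope equivalent.

```python
import logging
from typing import Any, Callable, Sequence, Tuple

logger = logging.getLogger(__name__)


def load_with_fallback(
    name: str,
    loaders: Sequence[Tuple[str, Callable[[], Any]]],
) -> Any:
    """Try each (source, loader) pair in order; return the first success.

    Hypothetical helper, not an lmdeploy API.
    """
    errors = []
    for source, loader in loaders:
        try:
            return loader()
        except Exception as exc:  # any failure triggers the next source
            logger.warning('Cannot load %s from %s: %s', name, source, exc)
            errors.append(f'{source}: {exc}')
    details = '; '.join(errors)
    raise RuntimeError(f'Cannot load {name} from any source ({details})')


# Stand-in loaders (assumptions): the hf one simulates a network failure,
# the ms one returns placeholder calibration text.
def load_from_hf():
    raise ConnectionError('hf unreachable')


def load_from_ms():
    return ['sample calibration text']


data = load_with_fallback('c4', [('hf', load_from_hf), ('ms', load_from_ms)])
```

Passing the sources as an ordered list keeps the hf-first behaviour as the default while making it easy to reorder or add a source later.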

fanqiNO1 avatar Feb 12 '24 04:02 fanqiNO1

Hi, do we need to add this feature? I can implement it.

yinfan98 avatar Feb 28 '24 16:02 yinfan98

@yinfan98 It's greatly needed!

There are two ways to implement this feature. The first is the approach @fanqiNO1 described; the second is to simplify process_hf_dataset and process_ms_dataset from xtuner and migrate them to lmdeploy.

The first approach is relatively simple to implement, but it only covers the few datasets lmdeploy currently supports. The second is more involved, but it makes it easier to add new datasets and to support custom ones.
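
To make the trade-off concrete, here is a minimal sketch of why a more general design makes custom datasets easier to add. It is not xtuner's process_hf_dataset or process_ms_dataset; `DATASET_LOADERS`, `register_dataset`, and `get_calib_samples` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping a dataset name to a function that returns
# calibration samples. Not part of lmdeploy or xtuner.
DATASET_LOADERS: Dict[str, Callable[[], List[str]]] = {}


def register_dataset(name: str):
    """Decorator that records a loader function under the given name."""
    def decorator(fn: Callable[[], List[str]]) -> Callable[[], List[str]]:
        DATASET_LOADERS[name] = fn
        return fn
    return decorator


@register_dataset('my_custom')
def load_my_custom() -> List[str]:
    # A user-defined dataset: here just in-memory placeholder text.
    return ['first calibration sample', 'second calibration sample']


def get_calib_samples(name: str) -> List[str]:
    """Look up a registered loader and return its samples."""
    if name not in DATASET_LOADERS:
        raise KeyError(f'Unknown calibration dataset: {name}')
    return DATASET_LOADERS[name]()


samples = get_calib_samples('my_custom')
```

With this shape, supporting a new or custom dataset means registering one more loader rather than editing a hard-coded branch per dataset.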

pppppM avatar Feb 29 '24 03:02 pppppM