Tian, Feng issues

Results: 9 issues of Tian, Feng

[REQUEST] Add more device-agnostic compression algorithms

## **Summary** This is a design discussion RFC for contributing some device-agnostic compression algorithms, such as post-training quantization (QDQ quant format) and structural sparsity supported by [Intel(R) Neural Compressor](https://github.com/intel/neural-compressor), into...

enhancement

compression
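For readers unfamiliar with the QDQ (quantize-dequantize) format mentioned in the RFC above, the following is a minimal sketch of the idea, not code from Intel Neural Compressor or DeepSpeed. It assumes symmetric per-tensor quantization to signed 8-bit integers; the function names `quantize`, `dequantize`, and `qdq` are hypothetical and chosen only for illustration.

```python
def quantize(values, num_bits=8):
    """Symmetric per-tensor quantization (an assumed scheme for illustration).

    Maps floats to signed integers in [-qmax, qmax] using a single scale.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to (approximate) floats."""
    return [qi * scale for qi in q]


def qdq(values, num_bits=8):
    """Quantize then dequantize: the float-in, float-out pair that QDQ refers to."""
    q, scale = quantize(values, num_bits)
    return dequantize(q, scale)


if __name__ == "__main__":
    weights = [0.0, 0.5, -1.0, 0.25]
    print(qdq(weights))
```

The round trip loses at most half a quantization step per value, which is why a device-agnostic toolchain can carry quantization information as ordinary quantize/dequantize pairs and let each backend decide how to fuse them.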

[BUG] The latest master code doesn't work with pydantic 2.0a2

**Describe the bug** When building the latest code from source on Ubuntu 18.04.3 LTS and running the BERT pruning sparse example in DeepSpeedExample, you will see a crash. From the log, it's because the...

bug

training

Add snip_momentum structured pruning which supports higher sparse ratio

This PR contributes the `snip_momentum` pruning algorithm from [Intel Neural Compressor](https://github.com/intel/neural-compressor) to DeepSpeed compression, as proposed in the [RFC](https://github.com/microsoft/DeepSpeed/issues/2894). The snip_momentum algorithm is described [here](https://github.com/intel/neural-compressor/blob/master/neural_compressor/compression/pruner/README.md)....
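As a rough illustration of momentum-based SNIP scoring, here is a minimal sketch. It is not the code from the PR or from Intel Neural Compressor. It assumes the importance score is a running accumulation of `|weight * gradient|` with decay `beta` and step weight `alpha`, and it prunes element-wise for brevity, whereas a structured variant would score and remove whole blocks. The names `update_scores` and `prune_mask` and the default values are hypothetical.

```python
def update_scores(scores, weights, grads, beta=0.9, alpha=1.0):
    """Accumulate |w * g| into a momentum-smoothed importance score (assumed form)."""
    return [beta * s + alpha * abs(w * g) for s, w, g in zip(scores, weights, grads)]


def prune_mask(scores, sparsity):
    """Return a 0/1 mask that zeroes the lowest-scoring fraction of elements."""
    n_prune = int(len(scores) * sparsity)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(scores))]


if __name__ == "__main__":
    weights = [1.0, -2.0, 0.1, 0.5, 3.0]
    grads = [0.1, 0.2, 0.1, -0.4, 0.0]
    scores = update_scores([0.0] * len(weights), weights, grads)
    mask = prune_mask(scores, sparsity=0.8)
    print([w * m for w, m in zip(weights, mask)])
```

Smoothing the score over several steps makes the ranking less sensitive to a single noisy gradient, which is one plausible reason such a criterion can tolerate higher sparsity ratios.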

Add snip_momentum structured pruning example with 80% sparsity ratio

This PR demonstrates the snip_momentum structured pruning algorithm implemented [here](https://github.com/microsoft/DeepSpeed/pull/3300). Users can reproduce the result below by running `source ./bash_script/pruning_sparse_snip_momentum.sh` with the PR mentioned at...

support autoTP with weight only quantization in DS inference path

This PR makes weight-only quantization work with autoTP. Sample code:

```python
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
ds_model = deepspeed.init_inference(model, mp_size=world_size, dtype=torch.float16, replace_with_kernel_inject=False)
model...
```

Add RFC dir for submission tracking

## Type of Change Documentation for RFC submission ## Description This is a proposed RFC for DeepSpeed/INC integration

documentation

won't merge

[RFC] HuggingFace compatible yet flexible WeightOnlyQuantization format for IPEX and INC

This RFC proposes a Hugging Face-compatible yet flexible Weight Only Quantization (WOQ) format in INC, so that a model quantized by INC can be loaded by IPEX for...

[BUG] run wiki_all_88m on NV A100 with raft-ann-bench will crash

**Describe the bug** It raises the error below on an NVIDIA A100 GPU. raft_cagra.graph_degree32.intermediate_graph_degree32.graph_build_algoNN_DESCENT/process_time/real_time ERROR OCCURRED: 'Failed to create an algo: std::bad_alloc: out_of_memory: RMM failure at:/sparse/miniconda3/envs/py310/include/rmm/mr/device/pool_memory_resource.hpp:313: Maximum pool size exceeded' **Steps/Code...

bug

[BUG] Run raft-ann-bench with faiss_cpu_flat algo on Xeon cpu will fail

**Describe the bug** It raises the error below on a Xeon CPU. Error occurred running benchmark: Command '['/home/ubuntu/wwq/miniconda3/envs/neuralchat_rag/bin/ann/FAISS_CPU_FLAT_ANN_BENCH', '--build', '--data_prefix=./', '--benchmark_out_format=json', '--benchmark_counters_tabular=true', '--benchmark_out=./wiki_all_88M/result/build/faiss_cpu_flat,base.json.lock', '--raft_log_level=3', 'wiki_all_88M_faiss_cpu_flat,base,k10,bs10000_afc3d9c8-d53d-11ee-af72-0a7d5625b4dd.json']' died with . **Steps/Code to reproduce...

bug