feat: Add Offline DeepSeek Model
This PR implements an offline DeepSeek model loader and inference wrapper fulfilling all requirements in issue. It provides a lightweight, memory-efficient, dependency-minimal way to load and run DeepSeek models from HuggingFace which supports all official DeepSeek R1 & Distill variants using HuggingFace + safetensors. Dynamic model config parsing and deeper memory optimizations (e.g., Triton/offloading) can be addressed in a follow-up issue if needed.
NOTES TO REVIEWERS
Tested on DeepSeek-R1-Distill-Qwen-7B that works offline and follows low-level-only policy and here's o/p in local testing :
> python -m unittest intelli.test.integration.test_deepseek_wrapper
----------------------------------------------------------------------
Ran 1 test in 129.257s
OK
Downloading model.safetensors.index.json from deepseek-ai/DeepSeek-R1-Distill-Qwen-7B...
Index downloaded to /home/runner/.cache/deepseek/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-7B/snapshots/91[6](https://github.com/varshith257/Intelli/pull/1/checks#step:6:7)b56a44061fd5cd7d6a8fb63255[7](https://github.com/varshith257/Intelli/pull/1/checks#step:6:8)ed4f724f60/model.safetensors.index.json
Downloading config.json from deepseek-ai/DeepSeek-R1-Distill-Qwen-7B...
Config downloaded to /home/runner/.cache/deepseek/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-7B/snapshots/916b56a44061fd5cd7d6a[8](https://github.com/varshith257/Intelli/pull/1/checks#step:6:9)fb632557ed4f724f60/config.json
Downloading model.safetensors.index.json from deepseek-ai/DeepSeek-R1-Distill-Qwen-7B...
Index downloaded to /home/runner/.cache/deepseek/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-7B/snapshots/[9](https://github.com/varshith257/Intelli/pull/1/checks#step:6:10)16b56a44061fd5cd7d6a8fb632557ed4f724f60/model.safetensors.index.json
Downloading model-00002-of-000002.safetensors from deepseek-ai/DeepSeek-R1-Distill-Qwen-7B...
Model downloaded to /home/runner/.cache/deepseek/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-7B/snapshots/916b56a44061fd5cd7d6a8fb632557ed4f724f60/model-00002-of-000002.safetensors
Downloading model-00001-of-000002.safetensors from deepseek-ai/DeepSeek-R1-Distill-Qwen-7B...
Model downloaded to /home/runner/.cache/deepseek/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-7B/snapshots/916b56a44061fd5cd7d6a8fb632557ed4f724f60/model-00001-of-000002.safetensors
Model weights loaded successfully from split safetensors.
Inference successful, output shape: torch.Size([1, 16, 152064])
DETAILED SETUP AND TESTING IS DOCUMENTED IN README
/claim #82
@intelligentnode @Barqawiz Let me know if I am missing anything and it's ready for review
@intelligentnode Any reviews on this :)
Thanks for the attempt, it is good you used only low level dependencies.
Kindly answer the following questions to help with the review:
- There is no tokenizer implementation. How will text be converted to token IDs ?
- While the code tested using
Qwen-7B, can the same code run deepseek R1 ? - Does this implementation work on both CPU and GPU devices ? How to test in GPU and which device to use ?
@intelligentnode Thanks for taking time to review this PR.
- There is no tokenizer implementation. How will text be converted to token IDs ?
Yeah, thanks for reminding. I have added support for tokenization with minimal overhead and deps
- While the code tested using
Qwen-7B, can the same code run deepseek R1 ?
Yes, the implementation is architecture-agnostic, we rely on config.json + safetensors.index.json it can load any DeepSeek-R1 variant including the full 671B model when provided sufficient RAM/VRAM. R1 is very large (670B total / ~37B active parameters). So we need a lot of RAM + VRAM which most Macs also don’t have unless on a high-end M3 Ultra with unified memory
- Does this implementation work on both CPU and GPU devices ? How to test in GPU and which device to use ?
Yes it works both with the GPU and CPU and to test with GPU you can use
CUDA_VISIBLE_DEVICES=0 python -m unittest intelli.test.integration.test_deepseek_wrapper
For full DeepSeek-R1, the recommend testing could be on A100 80GB or A6000 48GB class GPUs. Quantized versions can fit in 35–40GB VRAM and are loadable via our implementation using safetensors.index.json split loading
( My desktop has access for RTX 3050 GPU, so tested Qwen-7B and other variations on both cpu and GPU env's
Let me know if anything to add or iterate on this MR
Deepseek use Byte-level BPE (BBPE), not whitespace word tokenizer.
Please adjust.
Feedback:
- There are no test cases to validate the output generate valid output. add test cases to generate test and validate the string readable.
- When I switch the model name to "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" and error printed. The code should support all deepseek models. Entry Not Found for url: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/resolve/main/model.safetensors.index.json.
When I print the output, it does not print string but the tokens!
output tokenize: tensor([[[-0.4666, 0.2092, -0.5391, ..., 0.5234, -0.0457, 0.3164], [-0.6572, 0.4009, -0.6992, ..., 0.4370, -0.1760, 0.3025], [-0.4568, 0.0883, -0.8530, ..., 0.4050, -0.0996, 0.5566], ..., [-0.6572, 0.4009, -0.6992, ..., 0.4370, -0.1760, 0.3025], [-0.5752, 0.4082, -0.5093, ..., 0.4324, -0.1279, 0.1831], [-0.3921, 0.2646, -0.7368, ..., 0.5825, -0.1241, 0.3960]]],
Kindly do full testing for multiple deepseek models, and make sure the output string printed and readable before next review iteration.
@intelligentnode Thanks for detailed review and yes some variants aren't using shared indices, I have fixed them now and as we discussed earlier i implemented the basic BPE tokenizer and also implemented decode fn to make it more redable now
And testing command has been updated and tested on listed models in the issue of R1 variants that supported CPU. Kindly use thie following cd to test them now it solves dynamically for diff variants
DEEPSEEK_MODEL= <model_name> python -m unittest intelli.test.integration.test_deepseek_wrapper
@intelligentnode @Barqawiz Can you review this, please?
There is conflicts and the outputs not working as expected.
@intelligentnode Can you paste the logs you are getting? I tested it and got fully decode roundtrip string
51s
Run python -m unittest intelli.test.integration.test_deepseek_wrapper
....
----------------------------------------------------------------------
Ran 4 tests in 49.457s
OK
--- Running test: test_bpe_tokenization ---
Downloading config.json from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B...
Config downloaded to /home/runner/.cache/deepseek/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-1.5B/snapshots/ad9f0ae0864d7fbcd1cd905e3c6c5b069cc8b562/config.json
Downloading model.safetensors from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B...
Model downloaded to /home/runner/.cache/deepseek/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-1.5B/snapshots/ad9f0ae0864d7fbcd1cd905e3c6c5b069cc8b562/model.safetensors
Loaded single safetensors file.
Loaded model from single safetensors file.
--- Running test: test_encode_decode_roundtrip ---
Downloading config.json from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B...
Config downloaded to /home/runner/.cache/deepseek/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-1.5B/snapshots/ad9f0ae0864d7fbcd1cd905e3c6c5b069cc8b562/config.json
Downloading model.safetensors from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B...
Model downloaded to /home/runner/.cache/deepseek/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-1.5B/snapshots/ad9f0ae0864d7fbcd1cd905e3c6c5b069cc8b562/model.safetensors
Loaded single safetensors file.
Loaded model from single safetensors file.
Decoding tokens: [1986, 128, 128244, 285, 128, 128244, 64, 128, 128244, 874, 52899, 128, 128244, 1944, 128, 128244, 917, 128, 128244, 1958, 128, 128244, 5839, 2022, 13]
Raw tokens: ['This', 'Ä', 'is', 'Ä', 'a', 'Ä', 'com', 'prehensive', 'Ä', 'test', 'Ä', 'string', 'Ä', 'for', 'Ä', 'token', 'ization', '.']
Round-trip decoded : This is a comprehensive test string for tokenization.
--- Running test: test_load_and_infer ---
Downloading config.json from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B...
Config downloaded to /home/runner/.cache/deepseek/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-1.5B/snapshots/ad9f0ae0864d7fbcd1cd905e3c6c5b069cc8b562/config.json
Downloading model.safetensors from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B...
Model downloaded to /home/runner/.cache/deepseek/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-1.5B/snapshots/ad9f0ae0864d7fbcd1cd905e3c6c5b069cc8b562/model.safetensors
Loaded single safetensors file.
Loaded model from single safetensors file.
Inference successful, output shape: torch.Size([1, 16, 151936])
--- Running test: test_tokenize_and_infer_from_text ---
Downloading config.json from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B...
Config downloaded to /home/runner/.cache/deepseek/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-1.5B/snapshots/ad9f0ae0864d7fbcd1cd905e3c6c5b069cc8b562/config.json
Downloading model.safetensors from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B...
Model downloaded to /home/runner/.cache/deepseek/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-1.5B/snapshots/ad9f0ae0864d7fbcd1cd905e3c6c5b069cc8b562/model.safetensors
Loaded single safetensors file.
Loaded model from single safetensors file.
Token IDs: [1986, 128, 128244, 285, 128, 128244, 64, 128, 128244, 5839, 2022, 128, 128244, 1944, 13]
Inference successful on text input, output shape: torch.Size([1, 15, 151936])
Hey @intelligentnode, the issue was closed citing time-sensitivity, but no clear deadline was mentioned upfront. The bounty was also reduced mid-way . Contributors put in genuine effort based on the original scope and that should be acknowledged and fairly rewarded. Could get some clarity on this?
The issue is that there was no testing conducted during this contribution, and there seemed to be limited understanding of DeepSeek’s capabilities and how tokenizers work! The process of testing and providing feedback became quite lengthy. Since DeepSeek is a time-sensitive topic, this delay impacted our ability to proceed efficiently.
Look to the history I gave fair chance for this change. Please also note that the bounty platform does not support setting explicit deadlines and this create to catch event which is passed without ready code.
For the effort you did I can send you small reward contact me here: https://www.intellinode.ai/contact