llama-node
Code only using 4 CPUs when I have 16 CPUs
This is the code that I am using:
```js
import { RetrievalQAChain } from 'langchain/chains';
import { HNSWLib } from "langchain/vectorstores";
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { LLamaEmbeddings } from "llama-node/dist/extensions/langchain.js";
import { LLM } from "llama-node";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";
import * as fs from 'fs';
import * as path from 'path';

const txtFilename = "TrainData";
const txtPath = `./${txtFilename}.txt`;
const VECTOR_STORE_PATH = `${txtFilename}.index`;

const model = path.resolve(process.cwd(), './h2ogptq-oasst1-512-30B.ggml.q5_1.bin');
const llama = new LLM(LLamaCpp);

const config = {
  path: model,
  enableLogging: true,
  nCtx: 1024,
  nParts: -1,
  seed: 0,
  f16Kv: false,
  logitsAll: false,
  vocabOnly: false,
  useMlock: false,
  embedding: true,
  useMmap: true,
};

var vectorStore;

const run = async () => {
  await llama.load(config);

  if (fs.existsSync(VECTOR_STORE_PATH)) {
    // Reuse the previously saved index.
    console.log('Vector Exists..');
    vectorStore = await HNSWLib.fromExistingIndex(VECTOR_STORE_PATH, new LLamaEmbeddings({ maxConcurrency: 1 }, llama));
  } else {
    // Split the text file into chunks and embed them into a new index.
    console.log('Creating Documents');
    const text = fs.readFileSync(txtPath, 'utf8');
    const textSplitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000 });
    const docs = await textSplitter.createDocuments([text]);
    console.log('Creating Vector');
    vectorStore = await HNSWLib.fromDocuments(docs, new LLamaEmbeddings({ maxConcurrency: 1 }, llama));
    await vectorStore.save(VECTOR_STORE_PATH);
  }

  console.log('Testing Vector via Similarity Search');
  const resultOne = await vectorStore.similaritySearch("what is a template", 1);
  console.log(resultOne);

  console.log('Testing Vector via RetrievalQAChain');
  const chain = RetrievalQAChain.fromLLM(llama, vectorStore.asRetriever());
  const res = await chain.call({
    query: "what is a template",
  });
  console.log({ res });
};

run();
```
It only uses 4 CPUs during the `vectorStore = await HNSWLib.fromDocuments(docs, new LLamaEmbeddings({maxConcurrency: 1}, llama));` step.
Can we change anything so that it uses more than 4 CPUs?
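For context, the only thread-related knob that shows up in the llama-node examples is an `nThreads` field on the per-call invocation params. Below is a minimal sketch, assuming `getEmbedding` accepts the same invocation params as `createCompletion` (including `nThreads`); it is not confirmed here whether the `LLamaEmbeddings` wrapper used above passes such a setting through.

```js
// Sketch only: nThreads is assumed to control how many CPU threads a single
// llama.cpp call uses; it is a different knob from langchain's maxConcurrency.
const embeddingParams = {
  nThreads: 16,        // assumed per-call thread count
  nTokPredict: 1024,
  topK: 40,
  topP: 0.1,
  temp: 0.2,
  repeatPenalty: 1,
  prompt: "what is a template",
};

// Assumes the loaded instance exposes getEmbedding(params); the LLamaEmbeddings
// extension may use its own default params instead of these.
llama.getEmbedding(embeddingParams).then((embedding) => {
  console.log(embedding.length);
});
```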
Not yet. llama.cpp does not seem to support parallel inference at the moment. I may look into other ways to do this (e.g. implementing a round robin at the Rust level).
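For illustration, a rough JS-level sketch of that round-robin idea (the real thing would live at the Rust level): keep several independently loaded model instances and rotate requests across them. This is a sketch only; it assumes each instance can serve calls on its own, ignores the cost of holding several copies of the model in memory, and the `embedOne` callback is a placeholder for whatever per-instance embedding call would actually be used.

```js
import { LLM } from "llama-node";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";

// Hypothetical pool size: each slot is a separately loaded model instance.
const POOL_SIZE = 4;

const createPool = async (config) => {
  const instances = [];
  for (let i = 0; i < POOL_SIZE; i++) {
    const instance = new LLM(LLamaCpp);
    await instance.load(config); // each instance holds its own copy of the model
    instances.push(instance);
  }

  let next = 0;
  return {
    // Round robin: each call is handed to the next instance in the pool.
    acquire() {
      const instance = instances[next];
      next = (next + 1) % instances.length;
      return instance;
    },
  };
};

// embedOne(instance, text) is a placeholder for a per-instance embedding call;
// with a pool, up to POOL_SIZE texts could be embedded at the same time.
const embedAll = (pool, texts, embedOne) =>
  Promise.all(texts.map((text) => embedOne(pool.acquire(), text)));
```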
@hlhr202 any updates on that?
@hlhr202 u gotta hire us when you make it big :) LGTM