
Conversational Agent with Llama-2-70b-chat-hf

Open · limcheekin opened this issue 1 year ago • 4 comments

Hi there,

Has anyone here managed to get the conversational agent published in the following article working with Petals? https://www.pinecone.io/learn/llama-2/

I created the following notebook based on the one from the article above and ran it on my Ubuntu Linux workstation without a GPU, but the conversational agent failed to work. I'm not sure what the discrepancies are between the Petals runtime and the runtime of the original notebook, so kindly let me know if you spot anything missing here: https://github.com/limcheekin/langchain-playground/blob/main/llama-2-chat-agent.ipynb (You can click the "Open in Colab" or "Open nbviewer" link at the top of the notebook to check out the original notebook.)

Feel free to download the notebook and the custom Petals class and try it out yourself; I'd appreciate it if you could share your outcome here and the changes you made.
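For readers who don't want to open the repo, a custom LangChain LLM wrapper around Petals might look roughly like the following. This is a minimal, hypothetical sketch, not the exact class from the repo; `PetalsLLM` and its fields are illustrative names:

```python
from typing import Any, List, Optional

from langchain.llms.base import LLM


class PetalsLLM(LLM):
    """Hypothetical minimal LangChain wrapper around a Petals model."""

    model: Any        # e.g. a petals.AutoDistributedModelForCausalLM instance
    tokenizer: Any    # the matching transformers tokenizer
    max_new_tokens: int = 256

    @property
    def _llm_type(self) -> str:
        return "petals"

    def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any) -> str:
        input_ids = self.tokenizer(prompt, return_tensors="pt")["input_ids"]
        output_ids = self.model.generate(input_ids, max_new_tokens=self.max_new_tokens)
        # Return only the newly generated text, not the echoed prompt.
        return self.tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
```

It would then be constructed as `llm = PetalsLLM(model=model, tokenizer=tokenizer)` and passed to LangChain chains or agents like any other LLM.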

I'd love to hear from you! :)

limcheekin avatar Jul 26 '23 08:07 limcheekin

Yes, I did. I ran it on a 45 GiB GPU on an A6k machine. It is pretty slow to respond. A few things that would help if you are seeing issues:

  • Run it on Python 3.10+.
  • Use a machine with a higher-end GPU and utilize the GPUs in parallel.
  • If you are seeing a parsing error, add handle_parsing_errors=True to the agent initialization (see the sketch after this list).
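For example, a minimal sketch of that last point, assuming `llm`, `tools`, and `memory` already exist (e.g. from the notebook above):

```python
from langchain.agents import AgentType, initialize_agent

agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,
    memory=memory,
    verbose=True,
    handle_parsing_errors=True,  # retry instead of raising OutputParserException
)
```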

Though I was able to test the model on its pre-existing knowledge, I had no luck running it on an indexed dataset.

amit4aws avatar Aug 06 '23 12:08 amit4aws

Hi @limcheekin @amit4aws,

Sorry, we haven't had time to take a closer look yet.

Are you sure that the code keeps an open Petals inference session and doesn't rerun the dialogue history from scratch every time (that would be slow)?
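For reference, keeping a session open across turns looks roughly like this (a sketch based on the pattern in the Petals README; the model name and dialogue turns are illustrative):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "meta-llama/Llama-2-70b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# One session for the whole dialogue: earlier tokens stay cached on the
# servers instead of being re-processed from scratch on every call.
with model.inference_session(max_length=512) as session:
    for turn in ["Hey, how are you today?", "Explain nuclear fusion briefly."]:
        input_ids = tokenizer(turn, return_tensors="pt")["input_ids"]
        output_ids = model.generate(input_ids, session=session, max_new_tokens=64)
        print(tokenizer.decode(output_ids[0, input_ids.shape[1]:]))
```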

borzunov avatar Aug 06 '23 12:08 borzunov

@borzunov yes. The specific issue I am seeing is with the conversation chain and memory. I am using a LangChain retriever-based approach for a Q&A scenario. I can start a conversational agent and get a successful response back for a query; however, when I attach a memory, the agent always refers to its existing knowledge instead of looking at the index. I was able to get rid of the output parsing error after reading the troubleshooting suggestion on the LangChain site, but no luck yet working with memory. The model is slow to respond, and I think the machine I am using is good enough to run the 13B and 70B variants.
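For context, the rough shape of what I'm trying is below. This is not my exact code; `llm` and `vectorstore` stand in for my actual objects:

```python
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

# Memory that stores the chat history between turns.
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Chain that should consult the index (via the retriever) on every turn.
qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    memory=memory,
)

result = qa({"question": "What does the indexed dataset say about this topic?"})
print(result["answer"])
```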

amit4aws avatar Aug 06 '23 16:08 amit4aws

> Hi @limcheekin @amit4aws,
>
> Sorry, we haven't had time to take a closer look yet.
>
> Are you sure that the code keeps an open Petals inference session and doesn't rerun the dialogue history from scratch every time (that would be slow)?

Thanks for the response. I found info about open_inference_session, but I'm not sure how it is relevant to the petals Python package.

Slowness was not a problem when I opened the issue 2 weeks ago, but it is an issue now that I've re-run the notebook with the latest version of the petals Python package, v2.0.1.post1.

The following code took almost 15 minutes to execute successfully:

llm(prompt=get_prompt("Explain to me the difference between nuclear fission and fusion."))

I see the following warning message for the code above:

[WARN] [petals.client.routing.sequence_manager.rpc_info:443] Caught exception when gathering information from peer 12D3KooWPKbfMFfxRbDvyWWaq1sepGeLiJ5o9KAGebUio1a5kcNX (retry in 0 sec): P2PDaemonError('failed to dial 12D3KooWPKbfMFfxRbDvyWWaq1sepGeLiJ5o9KAGebUio1a5kcNX:
 * [/ip6/::1/tcp/31334] dial backoff
 * [/ip4/127.0.0.1/tcp/31334] dial backoff
 * [/ip4/147.189.197.32/tcp/31330/p2p/12D3KooWLuGiv8BZp1R2ocuULd92wWcJbeZ9janxn6j56Uq1iZW6/p2p-circuit] concurrent active dial succeeded
 * [/ip4/193.106.95.184/tcp/35899/p2p/12D3KooWC7vsMrCies8tJN8W9AmCta2vXeFgBmKbYkt5B2PXUiiZ/p2p-circuit] concurrent active dial succeeded
 * [/ip4/172.22.125.77/tcp/31334] dial tcp4 172.22.125.77:31334: i/o timeout')

The following code took more than 3 hours to execute; I interrupted it manually before it completed:

agent("hey how are you today?")

I saw warning messages similar to the following:

[WARN] [petals.client.inference_session.step:319] Caught exception when running inference via RemoteSpanInfo(peer_id=<libp2p.peer.id.ID (12D3KooW9tRCNUkmzUWE9bSVq4YxoZV58J3G2N3osEWvaDkDJVjh)>, start=20, end=46, server_info=ServerInfo(state=<ServerState.ONLINE: 2>, throughput=8107.200786715217, public_name='https://jobs.trelent.com/', version='2.0.1', network_rps=8107.200786715217, forward_rps=318975.12267155404, inference_rps=409.66460911542833, adapters=(), torch_dtype='float16', quant_type='nf4', using_relay=True, cache_tokens_left=1597440, next_pings={'12D3KooWJKWRP1P2FGgBYWsGUpewf3B8aar4h8QGLWTwQyMWaAxr': 0.12194999209990658})) (retry in 60 sec):

I executed the code in a CPU-only environment. Kindly let me know if you need more information.

limcheekin avatar Aug 09 '23 07:08 limcheekin