
[Feature Request] Summarization toolkit and examples

Open ProGM opened this issue 2 years ago • 11 comments

One feature I would love to have in Langchain.rb that may be super-useful is summarization:

https://python.langchain.com/en/latest/modules/chains/index_examples/summarize.html

I don't think it's super hard to implement (at least a base version of it).

ProGM avatar May 25 '23 09:05 ProGM

@ProGM Hmm... we can definitely add a def summarize(text:) method to every LLM class.

To list them out:

  1. Cohere already has a Summarize endpoint: https://github.com/andreibondarev/cohere-ruby#summarize
  2. OpenAI has a few prompt-driven examples (search for "summary/summari..."): https://platform.openai.com/examples. I personally kind of like the "TL;DR" method. Do you have a preference?
  3. Google PaLM summarization would be prompt-driven as well.
  4. Hugging Face has a ton of different models focusing on summarization. Have you tried any of these by chance? https://huggingface.co/models?pipeline_tag=summarization&sort=downloads
  5. Same with Replicate -- most likely prompt-driven summarization would be needed.

When I say prompt-driven, I mean that we'd build something like the following prompt:

Write a concise summary of the following:

#{text_to_be_summarized}

CONCISE SUMMARY:

... and pass this to the LLM. Btw -- this prompt was taken from here.
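
Roughly, the method could look like this (just a sketch, assuming the gem's LLM classes expose a complete(prompt:) call -- the class reopening here is illustrative, not final API):

class LLM::OpenAI
  # Prompt-driven summarization: wrap the text in a summary prompt
  # and send it through the existing completion call.
  def summarize(text:)
    prompt = <<~PROMPT
      Write a concise summary of the following:

      #{text}

      CONCISE SUMMARY:
    PROMPT

    complete(prompt: prompt)
  end
end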

@ProGM What are your thoughts?

andreibondarev avatar May 25 '23 15:05 andreibondarev

@andreibondarev Not sure if this should be just a method of the LLM classes.

When I say toolkit, I mean the full set of things: a) a summarize method on the LLM classes, b) a set of strategies (stuff, map_reduce, refine), and c) a way to use it in combination with other tools.

The cool feature in Python LangChain is that you can configure the ready-to-use summarize chain, declare it as a tool, and use it in a chain-of-thought agent.

Something like (pseudo-ruby-code):

summarization_tool = Langchain::Tool.new(
  name: 'summarizer tool',
  function: Langchain::Summarizer.new(strategy: :map_reduce),
  description: 'This tool can be used to summarize a long text'
)

agent = Agent::ChainOfThoughtAgent.new(
  llm: :openai,
  llm_api_key: ENV["OPENAI_API_KEY"],
  tools: ['search', 'calculator', summarization_tool]
)

ProGM avatar May 26 '23 10:05 ProGM

@ProGM Just a quick iteration on top of your pseudo-ish code:

cohere = LLM::Cohere.new(...) # Let's say you want to use Cohere's summarize endpoint

summarization_tool = Langchain::Tool.new(
  name: "summarization_tool",
  function: ->(text) { cohere.summarize(text: text) },
  description: "This tool can be used to summarize a long text."
)

agent = Agent::ChainOfThoughtAgent.new(
  llm: :openai,
  llm_api_key: ENV["OPENAI_API_KEY"],
  tools: ['search', 'calculator', summarization_tool]
)

What're your thoughts?

andreibondarev avatar May 26 '23 21:05 andreibondarev

@ProGM This PR would address the first part of this.

andreibondarev avatar May 26 '23 23:05 andreibondarev

> Source: https://docs.langchain.com/docs/use-cases/summarization
>
> A common use case is wanting to summarize long documents. This naturally runs into the context window limitations. Unlike in question-answering, you can't just do some semantic search hacks to only select the chunks of text most relevant to the question (because, in this case, there is no particular question - you want to summarize everything). So what do you do then?
>
> The most common way around this is to split the documents into chunks and then do summarization in a recursive manner. By this we mean you first summarize each chunk by itself, then you group the summaries into chunks and summarize each chunk of summaries, and continue doing that until only one is left.

To tackle summarizing documents that exceed the context window, I think we could enhance the summarize() methods to check the length of the text being passed in and, if it's too long, recursively split -> summarize -> combine -> summarize.
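
For illustration, something like this (a rough sketch only -- chunk_text and token_length are hypothetical helpers, not existing gem methods, and the token limit is arbitrary):

# Recursively split -> summarize -> combine -> summarize until the
# text fits within the model's context window.
def recursive_summarize(llm, text, max_tokens: 3000)
  return llm.summarize(text: text) if token_length(text) <= max_tokens

  # Summarize each chunk independently...
  chunk_summaries = chunk_text(text, max_tokens: max_tokens).map do |chunk|
    llm.summarize(text: chunk)
  end

  # ...then recurse on the combined summaries.
  recursive_summarize(llm, chunk_summaries.join("\n"), max_tokens: max_tokens)
end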

andreibondarev avatar May 27 '23 01:05 andreibondarev

> @ProGM Just a quick iteration on top of your pseudo-ish code: [...] What're your thoughts?

That's exactly what I meant. It would be great! 🎉

> @ProGM This PR would address the first part of this.

Cool!

> Source: https://docs.langchain.com/docs/use-cases/summarization [...] I think what we could do is to enhance the summarize() methods to [...] recursively split -> summarize -> combine -> summarize.

Yup, I think this concept is implemented as the refine strategy in Python LangChain: https://github.com/hwchase17/langchain/blob/9c0cb90997db9eb2e2a736df458d39fd7bec8ffb/langchain/chains/summarize/refine_prompts.py
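
Conceptually, refine folds over the chunks, feeding the running summary into each subsequent prompt. A minimal sketch (the prompt wording and the complete call are illustrative, not the actual LangChain prompts):

def refine_summarize(llm, chunks)
  chunks.reduce(nil) do |summary, chunk|
    if summary.nil?
      # First chunk: plain summarization.
      llm.summarize(text: chunk)
    else
      # Subsequent chunks: ask the model to update the existing
      # summary with the new context.
      prompt = "Existing summary:\n#{summary}\n\n" \
               "Refine the summary given this additional context:\n#{chunk}"
      llm.complete(prompt: prompt)
    end
  end
end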

And we may need a tokenizer library to count tokens, like this or this.

ProGM avatar May 27 '23 11:05 ProGM

@ProGM I think an incremental next step would be adding tiktoken_ruby to wrap the OpenAI API calls, ensuring that token limits are not exceeded when the completion endpoint is hit.
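
Something along these lines (as far as I know this is tiktoken_ruby's API, but the model name and the 4096 limit here are just examples):

require "tiktoken_ruby"

prompt = "Write a concise summary of the following: ..."

# Count tokens before hitting the completion endpoint so we can fail
# fast (or trigger chunking) instead of getting an API error back.
encoder = Tiktoken.encoding_for_model("gpt-3.5-turbo")
token_count = encoder.encode(prompt).length

raise "Prompt is #{token_count} tokens, over the limit" if token_count > 4096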

andreibondarev avatar May 27 '23 14:05 andreibondarev

@ProGM Would something like this work as a good starting point? https://github.com/andreibondarev/langchainrb/pull/71

I think the next step in that summarization workflow would be to recursively check the token length as the passed-in text is being summarized. BUT I think it has to wait until the chunking work is done!

andreibondarev avatar May 28 '23 00:05 andreibondarev

@andreibondarev Thanks for keeping me up to date! It's a good start for sure.

I think the token limit is not something exclusive to OpenAI. PaLM should have an 8,000-token limit. Anthropic has 100k (which is a lot, but still a limit).

ProGM avatar May 28 '23 12:05 ProGM

Yeah, I just meant that the Tiktoken library only supports OpenAI models, not other providers'.

andreibondarev avatar May 28 '23 13:05 andreibondarev

Oh, I didn't know about it! D:

ProGM avatar May 29 '23 07:05 ProGM