
Running batch predictions using DSPy compiled prompts on a large dataset

Open sarora-roivant opened this issue 1 year ago • 4 comments

Hi,

I've implemented a clinical entity extraction pipeline using DSPy for processing patient notes. The pipeline extracts various entities (drugs, diseases, procedures, lab tests) and performs condition assessments. Currently, I'm facing challenges in scaling this pipeline to process a large dataset of approximately 500,000 notes efficiently.

Current Implementation:

  1. Data loading
  2. Note aggregation and preprocessing
  3. Entity extraction using DSPy signatures and predictors
  4. Condition assessment using custom DSPy modules
  5. Result processing and export

Challenges:

  1. Processing time: Currently, it takes about 5-6 minutes to process a single note.
  2. Lack of native batch processing: Each note is processed individually, leading to inefficient use of API calls and resources.
  3. Scaling difficulties: The current approach is not feasible for processing 500,000 notes in a reasonable timeframe.

Questions for Scaling:

  1. What is the recommended approach for using DSPy with large datasets (~500,000 notes)?
  2. Are there any best practices for compiling DSPy prompts for batch processing?
  3. How can we optimize the use of compiled DSPy prompts in a distributed computing environment?

Any guidance on efficiently scaling DSPy for large-scale entity extraction tasks would be greatly appreciated. I'm open to restructuring my pipeline or adopting new approaches to achieve better performance.

sarora-roivant avatar Jun 25 '24 20:06 sarora-roivant

Use dspy.evaluate.Evaluate and pass num_threads. For the metric, just pass a metric that always returns True or False

okhat avatar Jun 27 '24 02:06 okhat

@okhat Why would I need Evaluate for an inference pipeline?

Are there any updates on this, or ideas on where to look?

mancunian1792 avatar Mar 03 '25 12:03 mancunian1792

@mancunian1792 You can now do:

program.batch([dspy.Example(key1=value, ...).with_inputs('key1', ...)])

okhat avatar Mar 28 '25 22:03 okhat

Where is the source code for the batch method? I see it tested in test_parallel.py but nothing beyond that. Is it a LiteLLM utility?

I am hoping for something that somehow conforms to OpenAI's batch inference: https://platform.openai.com/docs/guides/batch
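Worth noting: DSPy's `batch` is thread-level parallelism over ordinary synchronous API calls, not a wrapper for OpenAI's asynchronous Batch API. If you want the latter (50% cheaper, 24-hour turnaround), one stdlib-only approach is to render your prompts yourself and write them in the JSONL request format that endpoint expects. The helper below is hypothetical, not part of DSPy or the OpenAI SDK.

```python
import json

def build_batch_file(prompts, model="gpt-4o-mini", path="batch_input.jsonl"):
    """Write one JSONL line per prompt in the OpenAI Batch API request format.

    Hypothetical helper: each line carries a custom_id for matching results
    back to the source note after the batch completes.
    """
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            line = {
                "custom_id": f"note-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
            f.write(json.dumps(line) + "\n")
    return path
```

The resulting file is then uploaded via the Files API with `purpose="batch"` and submitted with a `completion_window` of 24h; results come back as a JSONL file keyed by `custom_id`.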

chrico-bu-uab avatar May 16 '25 20:05 chrico-bu-uab