haystack icon indicating copy to clipboard operation
haystack copied to clipboard

fix(JsonSchemaValidator): fix recursive loop and general LLM (claude, mistral...) compatibility

Open lambda-science opened this issue 10 months ago • 8 comments

Related Issues

  • fixes https://github.com/deepset-ai/haystack/issues/7457 https://github.com/deepset-ai/haystack/issues/7455

Proposed Changes:

  • Claude Compatibility: modified the behaviour so that (i) error template is now a single message with generated json, error and schema (ii) make it so that validated messages are always "Assistant" chatmessage (for next pipeline step) and validation_errors are always "User" chatmessage (for LLM loops)
  • Recursive Loop in type conversion: used Claude OPUS to automatically generate a fix based on the written issue.

How did you test it?

Tested on my personal use-case and it solved my issues.

Notes for the reviewer

The behaviour is modified to only include the last messages from the conversation and not the whole history of messages (less cost for long pipeline loops, not necessary to have previous messages).

For the auto-generated fix for recursive, maybe the bug comes from the fact that sometimes json.loads(value) output a string and needs to be called twice to get the actual dict/list in the string. This is weird, but I've seen it happen. I'm not sure about the fundamental difference to be honest. Maybe it doesn't work for nested json.

Checklist

lambda-science avatar Apr 17 '24 14:04 lambda-science

Looks good @lambda-science , would you please add a short reno note (see https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md) and resolve these black issues :-)

vblagoje avatar Apr 22 '24 07:04 vblagoje

Is this still relevant? Let's merge it or close.

masci avatar May 15 '24 19:05 masci

It's missing reno note and unit tests. It's an important addition and would love to see @lambda-science push it towards the finish line 🏁

vblagoje avatar May 16 '24 08:05 vblagoje

Sorry, it went out of my mind I will do it :)

lambda-science avatar May 16 '24 08:05 lambda-science

I know why I stopped, because I had issue setting up the env (on windows). Now that all is set, I can see there was test failling (on top of black/reno missing) so I will work on it

lambda-science avatar May 16 '24 09:05 lambda-science

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar May 16 '24 09:05 CLAassistant

Should be good now. I had to change the test a bit because as I explained I suggested to only validate latest message (and include only latest message for validation) to optimize cost of long loops ! Tell me if you agree or not. (So validation of multi-message history only return a list of 1 message)

lambda-science avatar May 16 '24 10:05 lambda-science

Pull Request Test Coverage Report for Build 9610907174

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 52 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.2%) to 89.953%

Files with Coverage Reduction New Missed Lines %
core/component/component.py 2 98.28%
components/validators/json_schema.py 50 0.0%
<!-- Total: 52
Totals Coverage Status
Change from base Build 9600865720: -0.2%
Covered Lines: 6912
Relevant Lines: 7684

💛 - Coveralls

coveralls avatar May 16 '24 14:05 coveralls

Verified manually to work for OpenAI, Anthropic, and Cohere. The tests were:

OpenAI:

   import json
   from typing import List
    
   from haystack import Pipeline
   from haystack.components.converters import OutputAdapter
   from haystack.components.generators.chat import OpenAIChatGenerator
   from haystack.components.joiners import BranchJoiner
   from haystack.components.validators import JsonSchemaValidator
   from haystack.dataclasses import ChatMessage
    
   person_schema = {
       "type": "object",
       "properties": {
           "first_name": {"type": "string", "pattern": "^[A-Z][a-z]+$"},
           "last_name": {"type": "string", "pattern": "^[A-Z][a-z]+$"},
           "nationality": {"type": "string", "enum": ["Italian", "Portuguese", "American"]},
       },
       "required": ["first_name", "last_name", "nationality"]
   }
    
   # Initialize a pipeline
   pipe = Pipeline()
    
   # Add components to the pipeline
   pipe.add_component('joiner', BranchJoiner(List[ChatMessage]))
   pipe.add_component('fc_llm', OpenAIChatGenerator(model="gpt-4o"))
   pipe.add_component('validator', JsonSchemaValidator(json_schema=person_schema))
   pipe.add_component('adapter', OutputAdapter("{{chat_message}}", List[ChatMessage])),
   # And connect them
   pipe.connect("adapter", "joiner")
   pipe.connect("joiner", "fc_llm")
   pipe.connect("fc_llm.replies", "validator.messages")
   pipe.connect("validator.validation_error", "joiner")
    
   result = pipe.run(data={"adapter": {"chat_message": [ChatMessage.from_user("Create json from Peter Parker")]}})
    
   print(json.loads(result["validator"]["validated"][0].content))

The output was:

{'first_name': 'Peter', 'last_name': 'Parker', 'nationality': 'American', 'alias': 'Spider-Man', 'occupation': 'Photographer', 'affiliations': ['Daily Bugle', 'Avengers'], 'abilities': ['Superhuman strength', 'Enhanced agility', 'Spider-sense', 'Ability to cling to surfaces', 'Web-shooting'], 'personal_info': {'age': 25, 'gender': 'Male', 'height': '5\'10"', 'weight': '167 lbs', 'eye_color': 'Hazel', 'hair_color': 'Brown'}, 'biography': {'origin': 'Bitten by a radioactive spider, high school student Peter Parker gained the speed, strength and powers of a spider.', 'uncle_ben_quote': 'With great power comes great responsibility.'}, 'relationships': {'aunt': 'May Parker', 'girlfriend': 'Mary Jane Watson', 'best_friend': 'Harry Osborn'}}

Anthropic:

   import json  
   from typing import List
    
   from haystack import Pipeline
   from haystack.components.converters import OutputAdapter
   from haystack_integrations.components.generators.anthropic import AnthropicChatGenerator
   from haystack.components.joiners import BranchJoiner
   from haystack.components.validators import JsonSchemaValidator
   from haystack.dataclasses import ChatMessage
    
   person_schema = {
       "type": "object",
       "properties": {
           "first_name": {"type": "string", "pattern": "^[A-Z][a-z]+$"},
           "last_name": {"type": "string", "pattern": "^[A-Z][a-z]+$"},
           "nationality": {"type": "string", "enum": ["Italian", "Portuguese", "American"]},
       },
       "required": ["first_name", "last_name", "nationality"]
   }
    
   # Initialize a pipeline
   pipe = Pipeline()
    
   # Add components to the pipeline
   pipe.add_component('joiner', BranchJoiner(List[ChatMessage]))
   pipe.add_component('fc_llm', AnthropicChatGenerator(model="claude-3-5-sonnet-20240620"))
   pipe.add_component('validator', JsonSchemaValidator(json_schema=person_schema))
   pipe.add_component('adapter', OutputAdapter("{{chat_message}}", List[ChatMessage])),
   # And connect them
   pipe.connect("adapter", "joiner")
   pipe.connect("joiner", "fc_llm")
   pipe.connect("fc_llm.replies", "validator.messages")
   pipe.connect("validator.validation_error", "joiner")
    
   result = pipe.run(data={
                           "adapter": {"chat_message": [ChatMessage.from_user("Create json from Peter Parker")]}})
    
   print(json.loads(result["validator"]["validated"][0].content))

The output was:

{'first_name': 'Peter', 'last_name': 'Parker', 'nationality': 'American'}

And finally Cohere:

   import json
   from typing import List
    
   from haystack import Pipeline
   from haystack.components.converters import OutputAdapter
   from haystack_integrations.components.generators.cohere import CohereChatGenerator
   from haystack.components.joiners import BranchJoiner
   from haystack.components.validators import JsonSchemaValidator
   from haystack.dataclasses import ChatMessage
    
   person_schema = {
       "type": "object",
       "properties": {
           "first_name": {"type": "string", "pattern": "^[A-Z][a-z]+$"},
           "last_name": {"type": "string", "pattern": "^[A-Z][a-z]+$"},
           "nationality": {"type": "string", "enum": ["Italian", "Portuguese", "American"]},
       },
       "required": ["first_name", "last_name", "nationality"]
   }
    
   # Initialize a pipeline
   pipe = Pipeline()
    
   # Add components to the pipeline
   pipe.add_component('joiner', BranchJoiner(List[ChatMessage]))
   pipe.add_component('fc_llm', CohereChatGenerator(model="command-r"))
   pipe.add_component('validator', JsonSchemaValidator(json_schema=person_schema))
   pipe.add_component('adapter', OutputAdapter("{{chat_message}}", List[ChatMessage])),
   # And connect them
   pipe.connect("adapter", "joiner")
   pipe.connect("joiner", "fc_llm")
   pipe.connect("fc_llm.replies", "validator.messages")
   pipe.connect("validator.validation_error", "joiner")
    
   result = pipe.run(data={"adapter": {"chat_message": [ChatMessage.from_user("Create json from Peter Parker")]}})
    
   print(json.loads(result["validator"]["validated"][0].content))

The output was:

{'first_name': 'Peter', 'last_name': 'Parker', 'nationality': 'American'}

vblagoje avatar Jun 21 '24 08:06 vblagoje