Add integration and/or system tests
I think we need the ability to add and run some 'integration' tests that exercise interactions between high-level components and use actual APIs and keys. They would be run only on request and could be run before each release.
Start with a simple question to ChainOfThought with OpenAI like in the README, with the expectation that the result should be similar but not exactly equal to the result given in the README, since I assume the AI can respond slightly differently each time the test is called.
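To make that concrete, here's a rough sketch of what such a spec could look like. The agent constructor arguments are assumptions based on the README and may need adjusting, and the expectation is deliberately loose since the wording of the answer can vary:

# spec/integration/chain_of_thought_spec.rb -- sketch only
require "langchain"

RSpec.describe "ChainOfThought against the live OpenAI API" do
  it "answers the soccer-fields question with something plausible" do
    # Constructor arguments below are assumptions taken from the README.
    agent = Langchain::Agent::ChainOfThoughtAgent.new(
      llm: :openai,
      llm_api_key: ENV["OPENAI_API_KEY"],
      tools: ["calculator", "search"]
    )

    answer = agent.run(
      question: "How many full soccer fields would be needed to cover " \
                "the distance between NYC and DC in a straight line?"
    )

    # The exact number and phrasing can differ between runs, so only check
    # that the answer mentions soccer fields and contains some figure.
    expect(answer).to match(/soccer field/i)
    expect(answer).to match(/\d/)
  end
end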
I'm going to try implementing a simple https://cucumber.io/ test. It might work well here, but if it doesn't add value we don't have to use it:
Feature: Chain Of Thought
  Decompose multi-step problems into intermediate steps

  Scenario: Multistep with distance calculation
    Given I want to know a difficult distance calculation
    When I ask "How many full soccer fields would be needed to cover the distance between NYC and DC in a straight line?"
    Then I should be told something like "Approximately 2,945 soccer fields"
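If I go the Cucumber route, the step definitions would be plain Ruby underneath. Something like this sketch (again, the agent constructor arguments are assumptions based on the README, and the final assertion is intentionally loose):

# features/step_definitions/chain_of_thought_steps.rb -- sketch only
Given("I want to know a difficult distance calculation") do
  # Constructor arguments are assumptions based on the README.
  @agent = Langchain::Agent::ChainOfThoughtAgent.new(
    llm: :openai,
    llm_api_key: ENV["OPENAI_API_KEY"],
    tools: ["calculator", "search"]
  )
end

When("I ask {string}") do |question|
  @answer = @agent.run(question: question)
end

Then("I should be told something like {string}") do |expected|
  # Compare loosely rather than checking equality with `expected`,
  # since the model's wording varies between runs.
  expect(@answer).to match(/soccer field/i)
end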
@mattlindsey Do you envision that this would actually run in CI?
I'm also struggling a bit to figure out what value these feature tests would bring to this library.
If you run them in CI I think you'd catch errors sooner; for example, I think there's a gem dependency error right now (though I might be wrong). Also, the agents are fairly high level, so testing their interaction with other components via 'integration' tests is certainly needed somewhere, I think.
I hope @technicalpickles doesn't mind that I pull him in. There was a mention of executing Jupyter notebooks or README code snippets in Discord. Would you happen to have any thoughts here?
Also see where I implemented a couple of tests here to give a better idea: https://github.com/andreibondarev/langchainrb/pull/145
And for a wider range of testing it would be good if someone implemented Langchain::LLM::HuggingFace#complete.
Start with a simple question to ChainOfThought with OpenAI like in the README, with the expectation that the result should be similar but not exactly equal to the result given in the README, since I assume the AI can respond slightly differently each time the test is called.
I was doing a course on deeplearning.ai, and it talked about how, if you set temperature=0, you should get the same results. The course was taught using Jupyter notebooks, and the results they got doing the exercises matched what the AI was returning when I did them in the notebooks. I think it can be considered relatively stable?
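For reference, pinning the temperature looks roughly like this with the ruby-openai client (which, as far as I know, is what langchainrb uses under the hood); the model name and prompt are just examples:

# temperature_check.rb -- sketch using the ruby-openai client directly
require "openai"

client = OpenAI::Client.new(access_token: ENV["OPENAI_API_KEY"])

response = client.chat(
  parameters: {
    model: "gpt-3.5-turbo",
    messages: [{ role: "user", content: "How many soccer fields between NYC and DC?" }],
    temperature: 0 # repeated runs should return (nearly) identical text
  }
)

puts response.dig("choices", 0, "message", "content")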
There was a mention of executing Jupyter notebooks or README code snippets in Discord. Would you happen to have any thoughts here?
Yep! Here is what I suggested:
I've been thinking about getting the code in the README and in examples to run as part of CI. I did something like that for openfeature-sdk (https://github.com/open-feature/ruby-sdk/pull/40). I think the challenge for the README is making sure each fragment is complete enough to run, as well as having the right environment variables to make the call.
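As a very rough sketch of the README part (not the approach from that PR verbatim), you could pull out the fenced Ruby blocks and eval them:

# script/run_readme_snippets.rb -- rough sketch only.
# Each snippet would still need the right ENV vars (e.g. OPENAI_API_KEY) set.
readme   = File.read("README.md")
snippets = readme.scan(/```ruby\n(.*?)```/m).flatten

snippets.each_with_index do |code, index|
  puts "== README snippet #{index + 1} =="
  eval(code) # acceptable here; we're only smoke-testing our own docs
end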
In both cases, I'm starting to think we could get pretty far by stubbing the response from the LLM. That could help cover everything leading up to the request. The most common way I've done this is with VCR and/or WebMock. The main downside there is that it doesn't capture changes that happen on the remote end, obviously. If we are using existing libraries to do those interactions though, it's probably a pretty good tradeoff.
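For the stubbing side, the VCR setup I usually reach for looks roughly like this (the file paths are only a suggestion):

# spec/support/vcr.rb -- minimal VCR + WebMock setup
require "vcr"

VCR.configure do |config|
  config.cassette_library_dir = "spec/fixtures/vcr_cassettes"
  config.hook_into :webmock
  # Don't let real API keys end up in recorded cassettes.
  config.filter_sensitive_data("<OPENAI_API_KEY>") { ENV["OPENAI_API_KEY"] }
end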
Thanks @technicalpickles. I'm going to try the method you used in open-feature to run our README examples with temperature=0. It will still have to be an optional script or spec, since it would require env variables - like you said.
When you say stubbing the response from the LLM, do you mean like below? Or recording responses with VCR for every example? Because the idea was to run everything against live services. https://github.com/andreibondarev/langchainrb/blob/9dd8add0703c8cc9f5d250ee7a3559f45053d7e3/spec/langchain/llm/openai_spec.rb#L68
@mattlindsey
I'm going to try implementing a simple https://cucumber.io/ test.
I don't see much value in using Cucumber here. In the case of web apps it brings a lot of value by abstracting the engineer away from "clicking" through the UI. It's also useful when QA engineers are primarily writing these tests, because it provides them a nice DSL.
We need to figure out whether we'd like these tests to run against real (non-mocked) services, with actual API keys/creds.
If yes -- then let's go with Jupyter notebooks. These would need to be run locally by a dev, we can't run these in CI because it costs $$$ to run.
If not -- then these tests/scripts should be in RSpec.
We have a pretty large testing matrix: think "num of vectorsearch DBs X num of LLMs", i.e. we're saying that any LLM in the project (that supports embed()) can generate embeddings for any vectorsearch DB (see the sketch below).
@mattlindsey @technicalpickles Thoughts?
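To give a feel for the size of that matrix, here is a sketch of how it could be enumerated in RSpec; the LLM and DB lists are illustrative only, and the examples are left pending rather than wired to real clients:

# spec/integration/matrix_spec.rb -- sketch only; the lists below are illustrative
LLMS   = %i[openai cohere huggingface].freeze
STORES = %i[weaviate qdrant pinecone].freeze

RSpec.describe "LLM x vectorsearch matrix", :integration => true do
  LLMS.product(STORES).each do |llm, store|
    it "can embed with #{llm} and index into #{store}" do
      skip "wire up the real #{llm} and #{store} clients here"
    end
  end
end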
When you say stubbing the response from the LLM, do you mean like below? Or recording responses with VCR for every example? Because the idea was to run everything against live services.
That is what I meant, yeah. I think we can still get some value out of having everything but the LLM response, since there are plenty of other moving parts.
If yes -- then let's go with Jupyter notebooks. These would need to be run locally by a dev, we can't run these in CI because it costs $$$ to run.
If that is going to require providing an API key anyway, we may as well do it in plain Ruby. We could even have an RSpec tag to indicate that something uses the API, and have it automatically included/excluded depending on whether ENV['OPENAI_API_KEY'] is present.
describe Whatever, :openai_integration => true do
  it "works" do
    # ...
  end
end
Then run:
$ rspec --tag openai_integration
To exclude by default, we can add --tag ~openai_integration to the .rspec file, which holds the default arguments.
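The automatic include/exclude based on whether ENV['OPENAI_API_KEY'] is present could alternatively live in spec_helper, roughly like this:

# spec/spec_helper.rb -- only run API-hitting examples when a key is available
RSpec.configure do |config|
  unless ENV["OPENAI_API_KEY"]
    config.filter_run_excluding :openai_integration => true
  end
end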
These would need to be run locally by a dev, we can't run these in CI because it costs $$$ to run.
That said, it makes me wonder if OpenAI has any policies for open source development?
OpenAI is also available on Azure, and Azure has open source credits we could apply for: https://opensource.microsoft.com/azure-credits/
@andreibondarev Can Jupyter notebooks run Ruby? I'm thinking RSpec in a separate 'integration' directory with the tags described by Josh sounds good.
Looks like Azure takes 3-4 weeks to reply if you request access to their Azure OpenAI service (https://learn.microsoft.com/en-us/azure/cognitive-services/openai/overview). But would that mean a new LLM class in langchainrb? I don't see any Ruby examples in the documentation, so I'm not sure.
Can Jupyter notebooks run Ruby?
I saw it in the boxcars gem, which is in the same space as this gem: https://github.com/BoxcarsAI/boxcars/blob/main/notebooks/boxcars_examples.ipyn
@technicalpickles I added a similar 'getting started' Jupyter notebook in #185, but it was somewhat difficult to get working and seems to give errors sometimes. Take a look if you want, but I don't want to waste your time!
I did get a notebook working, but it's very picky and may not be worth the effort to maintain. I'll post it here just in case: https://gist.github.com/mattlindsey/5f6388d6ff76c2decdccb723bb4ed4c5#file-getting_started-ipynb