Quick Introduction
DeepEval is an open-source evaluation framework for LLMs. DeepEval makes it extremely easy to build and iterate on LLM applications, and was built with the following principles in mind:
- Easily "unit test" LLM outputs in a similar way to Pytest.
- Plug-and-use 14+ LLM-evaluated metrics, most with research backing.
- Custom metrics are simple to personalize and create.
- Define evaluation datasets in Python code.
- Real-time evaluations in production (available on Confident AI).
Set Up A Python Environment
Go to the root directory of your project and create a virtual environment (if you don't already have one). In the CLI, run:
python3 -m venv venv
source venv/bin/activate
Installation
In your newly created virtual environment, run:
pip install -U deepeval
You can also keep track of all evaluation results by logging into Confident AI, an all-in-one evaluation platform:
deepeval login
Contact us if you're dealing with sensitive data that has to reside in your private VPCs.
Create Your First Test Case
Run `touch test_example.py` to create a test file in your root directory. Open `test_example.py` and paste in your first test case:
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
def test_answer_relevancy():
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace this with the actual output of your LLM application
        actual_output="We offer a 30-day full refund at no extra cost."
    )
    assert_test(test_case, [answer_relevancy_metric])
Run `deepeval test run` from the root directory of your project:
deepeval test run test_example.py
Congratulations! Your test case should have passed ✅ Let's break down what happened.
- The variable `input` mimics a user input, and `actual_output` is a placeholder for what your LLM application is supposed to output based on this input.
- The variable `retrieval_context` contains the retrieved context from your knowledge base, and `AnswerRelevancyMetric(threshold=0.5)` is a default metric provided by `deepeval` for you to evaluate your LLM output's relevancy based on the provided retrieval context.
- All metric scores range from 0 - 1, and the `threshold=0.5` threshold ultimately determines whether your test has passed or not.
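Note that the test case above doesn't include a `retrieval_context`. If your application retrieves context from a knowledge base, you could pass it in as well; the snippet below is just an illustrative sketch with made-up values:

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    # The chunks your retriever returned for this input (illustrative values)
    retrieval_context=["All customers are eligible for a 30-day full refund at no extra cost."]
)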
You'll need to set your `OPENAI_API_KEY` as an environment variable before running the `AnswerRelevancyMetric`, since the `AnswerRelevancyMetric` is an LLM-evaluated metric.
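For example, on macOS or Linux you could set it in your shell before running the test (replace the placeholder with your own key):

export OPENAI_API_KEY="<your-openai-api-key>"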
To use ANY custom LLM of your choice, check out this part of the docs.
You can also save test results locally for each test run. Simply set the `DEEPEVAL_RESULTS_FOLDER` environment variable to your relative path of choice:
export DEEPEVAL_RESULTS_FOLDER="./data"
Create Your First Metric
`deepeval` provides two types of LLM evaluation metrics to evaluate LLM outputs: plug-and-use default metrics, and custom metrics for any evaluation criteria.
Default Metrics
`deepeval` offers 14+ research-backed default metrics covering a wide range of use cases (such as RAG and fine-tuning). To create a metric, simply import from the `deepeval.metrics` module:
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
test_case = LLMTestCase(input="...", actual_output="...")
relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
relevancy_metric.measure(test_case)
print(relevancy_metric.score, relevancy_metric.reason)
Note that you can run a metric as a standalone or as part of a test run as shown in previous sections.
All default metrics are evaluated using LLMs, and you can use ANY LLM of your choice. For more information, visit the metrics introduction section.
Custom Metrics
`deepeval` provides G-Eval, a state-of-the-art LLM evaluation framework for anyone to create a custom LLM-evaluated metric using natural language. Here's an example:
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
test_case = LLMTestCase(input="...", actual_output="...", expected_output="...")
correctness_metric = GEval(
    name="Correctness",
    criteria="Correctness - determine if the actual output is correct according to the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    strict=True
)
correctness_metric.measure(test_case)
print(correctness_metric.score, correctness_metric.reason)
Under the hood, `deepeval` first generates a series of evaluation steps, before using these steps in conjunction with information in an `LLMTestCase` for evaluation. For more information, visit the G-Eval documentation page.
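If you already know which steps you want the judge to follow, G-Eval can also be given them directly. The sketch below assumes an `evaluation_steps` parameter can be supplied in place of `criteria` (check the G-Eval documentation page for the exact signature); the steps themselves are illustrative:

from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import GEval

# Assumed alternative to `criteria`: spell out the evaluation steps yourself
correctness_metric = GEval(
    name="Correctness",
    evaluation_steps=[
        "Check whether facts in the actual output contradict the expected output.",
        "Heavily penalize omission of important details from the expected output.",
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)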
If you wish to customize your metrics a bit more, you can choose to implement your own metric. You can find a full tutorial here, but here's a quick example of how you can create a metric that is NOT evaluated using LLMs:
from deepeval.scorer import Scorer
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class RougeMetric(BaseMetric):
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.scorer = Scorer()

    def measure(self, test_case: LLMTestCase):
        self.score = self.scorer.rouge_score(
            prediction=test_case.actual_output,
            target=test_case.expected_output,
            score_type="rouge1"
        )
        self.success = self.score >= self.threshold
        return self.score

    # Async implementation of measure(). If an async version of the
    # scoring method does not exist, just reuse the sync measure method.
    async def a_measure(self, test_case: LLMTestCase):
        return self.measure(test_case)

    def is_successful(self):
        return self.success

    @property
    def __name__(self):
        return "Rouge Metric"

#####################
### Example Usage ###
#####################
test_case = LLMTestCase(input="...", actual_output="...", expected_output="...")

metric = RougeMetric()
metric.measure(test_case)
print(metric.is_successful())
You'll notice that although not documented, `deepeval` additionally offers a `scorer` module for more traditional NLP scoring methods, which can be found here.
You can also create a custom metric that combines several different metrics into one, for example combining the `AnswerRelevancyMetric` and `FaithfulnessMetric` to test whether an LLM output is both relevant and faithful (i.e. not hallucinating), as sketched below.
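Here's a minimal sketch of what such a composite metric could look like. The class name and the min-score aggregation are illustrative choices, not part of `deepeval`'s API, and the underlying metrics still require an LLM (plus a `retrieval_context` on the test case for faithfulness):

from deepeval.test_case import LLMTestCase
from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric

class RelevancyAndFaithfulnessMetric(BaseMetric):
    # Hypothetical composite metric: passes only if both underlying metrics score well
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.relevancy = AnswerRelevancyMetric(threshold=threshold)
        self.faithfulness = FaithfulnessMetric(threshold=threshold)

    def measure(self, test_case: LLMTestCase):
        self.relevancy.measure(test_case)
        self.faithfulness.measure(test_case)
        # Take the weaker of the two scores so both criteria must hold
        self.score = min(self.relevancy.score, self.faithfulness.score)
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase):
        return self.measure(test_case)

    def is_successful(self):
        return self.success

    @property
    def __name__(self):
        return "Relevancy and Faithfulness"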
Measure Multiple Metrics At Once
To avoid redundant code, `deepeval` offers an easy way to apply as many metrics as you wish for a single test case.
...
def test_everything():
    assert_test(test_case, [answer_relevancy_metric, correctness_metric])
In this scenario, `test_everything` only passes if all metrics are passing. Run `deepeval test run` again to see the results:
deepeval test run test_example.py
`deepeval` optimizes evaluation speed by running all metrics for each test case concurrently.
Create Your First Dataset
A dataset in `deepeval`, or more specifically an evaluation dataset, is simply a collection of `LLMTestCase`s and/or `Golden`s.
We're not going to dive into what a `Golden` is here, but it is an important concept if you're looking to generate LLM outputs at evaluation time. To learn more about `Golden`s, click here.
To create a dataset, simply initialize an `EvaluationDataset` with a list of `LLMTestCase`s or `Golden`s:
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
first_test_case = LLMTestCase(input="...", actual_output="...")
second_test_case = LLMTestCase(input="...", actual_output="...")
dataset = EvaluationDataset(
    # Optional 'alias', but highly recommended IF you're logged into Confident AI
    alias="My first dataset",
    test_cases=[first_test_case, second_test_case]
)
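If you only have inputs and want your LLM outputs to be generated at evaluation time, you could initialize the dataset with `Golden`s instead. The snippet below is a rough sketch; refer to the `Golden`s documentation linked above for the full picture:

from deepeval.dataset import EvaluationDataset, Golden

# A Golden carries the input (and optionally an expected output), but no actual_output yet
goldens = [Golden(input="What if these shoes don't fit?")]
dataset = EvaluationDataset(goldens=goldens)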
Then, using `deepeval`'s Pytest integration, you can utilize the `@pytest.mark.parametrize` decorator to loop through and evaluate your dataset.
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
...
# Loop through test cases using Pytest
@pytest.mark.parametrize(
"test_case",
dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
assert_test(test_case, [answer_relevancy_metric])
You can also evaluate entire datasets without going through the CLI (if you're in a notebook environment):
from deepeval import evaluate
...
evaluate(dataset, [answer_relevancy_metric])
Additionally, you can avoid re-evaluating test cases by reading from `deepeval`'s local cache using the optional `-c` flag:
deepeval test run test_dataset.py -c
Or run test cases in parallel by using the optional `-n` flag followed by a number (that determines the number of processes that will be used) when executing `deepeval test run`:
deepeval test run test_dataset.py -n 2
Visit the evaluation introduction section to learn about the different types of flags you can use with the `deepeval test run` command.
Using Confident AI
If you have reached this point, you've likely run `deepeval test run` multiple times. To keep track of all future evaluation results created by `deepeval`, log in to Confident AI by running the following command in the CLI:
deepeval login
Confident AI is the platform that unlocks `deepeval`'s full potential, and allows anyone to easily:
- keep track of and debug historical test run results
- discover optimal hyperparameters, such as the best models and prompt templates to use
- centralize and synthesize evaluation datasets on the cloud
- safeguard against breaking changes in CI/CD pipelines
- run real-time evaluations in production, with custom metrics
Click here for the full documentation on using Confident AI with `deepeval`.
Follow the instructions displayed on the CLI to create an account, get your Confident API key, and paste it in the CLI. You should see a message congratulating your successful login.
Once logged in, you'll be able to view test run results on Confident AI each time you execute a test run:
deepeval test run test_example.py
You should now see a link being returned upon test completion. Paste it in your browser to view results.
Optimizing Hyperparameters
Confident AI helps you easily discover the optimal set of hyperparameters, which in `deepeval` refers to properties such as the models, prompt templates, etc. used when generating the `actual_output`s for each `LLMTestCase`.
To discover which set of hyperparameters gives you the best evaluation metric results, use the `@deepeval.log_hyperparameters` decorator:
import deepeval
...
# You should aim to make these values dynamic
@deepeval.log_hyperparameters(model="gpt-4", prompt_template="...")
def hyperparameters():
    # Return a dict to log additional hyperparameters.
    # You can also return an empty dict {} if there are no additional hyperparameters to log
    return {
        "temperature": 1,
        "chunk size": 500
    }
The `hyperparameters()` function DOESN'T necessarily have to be named 'hyperparameters'. All you need in order to log hyperparameters on Confident AI is to decorate a function that returns a valid dictionary.
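For instance, instead of hard-coding the values, you might pass in the same variables your application actually uses, so the logged hyperparameters always match what generated the `actual_output`s. The variable names below are purely illustrative:

import deepeval

# Hypothetical variables reused from your own application code
MODEL_NAME = "gpt-4"
PROMPT_TEMPLATE = "You are a helpful assistant. Answer the question: {question}"

@deepeval.log_hyperparameters(model=MODEL_NAME, prompt_template=PROMPT_TEMPLATE)
def hyperparameters():
    # No additional hyperparameters to log in this sketch
    return {}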
Once you've added this decorator, execute `test_example.py` once more:
deepeval test run test_example.py
The `@deepeval.log_hyperparameters` decorator helps Confident AI keep track of the hyperparameters used when generating the `actual_output`s for a particular test run. This allows you to identify which combination of hyperparameters gives the best evaluation metric results over time.
Monitoring LLM Responses
Confident AI allows anyone to monitor and evaluate LLM responses in real-time. A single API request is all that's required, and this would typically happen on your servers right before returning an LLM response to your users:
import deepeval
# At the end of your LLM call
deepeval.track(
    event_name="Chatbot",
    model="gpt-4",
    input="Example input.",
    response="Example response.",
    retrieval_context=["..."]
)
Confident AI will automatically run evaluations for each incoming LLM response on metrics you have turned on. Head over to the 'Project Details' page on Confident AI to turn on metrics.
You can find more information on running real-time evaluations here.
Full Example
You can find the full example here on our GitHub.