Introduction
Quick Summary
Evaluation refers to the process of testing your LLM application outputs, and requires the following components:
- Test cases
- Metrics
- Evaluation dataset
Here's a diagram of what an ideal evaluation workflow looks like using deepeval
:
Your test cases will typically be in a single python file, and executing them will be as easy as running deepeval test run
:
deepeval test run test_example.py
We understand preparing a comprehensive evaluation dataset can be a challenging task, especially if you're doing it for the first time. Contact us if you want a custom evaluation dataset prepared for you.
Metrics
deepeval
offers 14+ evaluation metrics, most of which are evaluated using LLMs (visit the metrics section to learn why).
from deepeval.metrics import AnswerRelevancyMetric
answer_relevancy_metric = AnswerRelevancyMetric()
You'll need to create a test case to run deepeval
's metrics.
Test Cases
In deepeval
, a test case allows you to use evaluation metrics you have defined to unit test LLM applications.
from deepeval.test_case import LLMTestCase
test_case = LLMTestCase(
input="Who is the current president of the United States of America?",
actual_output="Joe Biden",
retrieval_context=["Joe Biden serves as the current president of America."]
)
In this example, input
mimics an user interaction with a RAG-based LLM application, where actual_output
is the output of your LLM application and retrieval_context
is the retrieved nodes in your RAG pipeline. Creating a test case allows you to evaluate using deepeval
's default metrics:
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
answer_relevancy_metric = AnswerRelevancyMetric()
test_case = LLMTestCase(
input="Who is the current president of the United States of America?",
actual_output="Joe Biden",
retrieval_context=["Joe Biden serves as the current president of America."]
)
answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
Datasets
Datasets in deepeval
is a collection of test cases. It provides a centralized interface for you to evaluate a collection of test cases using one or multiple metrics.
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
answer_relevancy_metric = AnswerRelevancyMetric()
test_case = LLMTestCase(
input="Who is the current president of the United States of America?",
actual_output="Joe Biden",
retrieval_context=["Joe Biden serves as the current president of America."]
)
dataset = EvaluationDataset(test_cases=[test_case])
dataset.evaluate([answer_relevancy_metric])
You don't need to create an evaluation dataset to evaluate individual test cases. Visit the test cases section to learn how to assert inidividual test cases.
Evaluating With Pytest
Although deepeval
integrates with Pytest, we highly recommend you to AVOID executing LLMTestCase
s directly via the pytest
command to avoid any unexpected errors.
deepeval
allows you to run evaluations as if you're using Pytest via our Pytest integration. Simply create a test file:
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
dataset = EvaluationDataset(test_cases=[...])
@pytest.mark.parametrize(
"test_case",
dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
answer_relevancy_metric = AnswerRelevancyMetric()
assert_test(test_case, [answer_relevancy_metric])
And run the test file in the CLI:
deepeval test run test_example.py
There are two mandatory and one optional parameter when calling the assert_test()
function:
test_case
: anLLMTestCase
metrics
: a list of metrics of typeBaseMetric
- [Optional]
run_async
: a boolean which when set toTrue
, enables concurrent evaluation of all metrics. Defaulted toTrue
.
@pytest.mark.parametrize
is a decorator offered by Pytest. It simply loops through your EvaluationDataset
to evaluate each test case individually.
Parallelization
Evaluate each test case in parallel by providing a number to the -n
flag to specify how many processes to use.
deepeval test run test_example.py -n 4
Cache
Provide the -c
flag (with no arguments) to read from the local deepeval
cache instead of re-evaluating test cases on the same metrics.
deepeval test run test_example.py -c
This is extremely useful if you're running large amounts of test cases. For example, lets say you're running 1000 test cases using deepeval test run
, but you encounter an error on the 999th test case. The cache functionality would allow you to skip all the previously evaluated 999 test cases, and just evaluate the remaining one.
Repeats
Repeat each test case by providing a number to the -r
flag to specify how many times to rerun each test case.
deepeval test run test_example.py -r 2
You can combine the -r
and -n
flag to rerun test cases in parallel:
deepeval test run test_example.py -r 2 -n 2
Hooks
deepeval
's Pytest integration allosw you to run custom code at the end of each evaluation via the @deepeval.on_test_run_end
decorator:
...
@deepeval.on_test_run_end
def function_to_be_called_after_test_run():
print("Test finished!")
Evaluating Without Pytest
Alternately, you can use deepeval
's evaluate
function. This approach avoids the CLI (if you're in a notebook environment), but does not allow for parallel test execution.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset(test_cases=[...])
answer_relevancy_metric = AnswerRelevancyMetric()
evaluate(dataset, [answer_relevancy_metric])
There are two mandatory and three optional arguments when calling the evaluate()
function:
test_case
: anLLMTestCase
metrics
: a list of metrics of typeBaseMetric
- [Optional]
run_async
: a boolean which when set toTrue
, enables concurrent evaluation of all metrics. Defaulted toTrue
. - [Optional]
show_indicator
: a boolean which when set toTrue
, shows the progress indicator for each individual metric. Defaulted toTrue
. - [Optional]
print_results
: a boolean which when set toTrue
, prints the result of each evaluation. Defaulted toTrue
.
You can also replace dataset
with a list of test cases, as shown in the test cases section.