# G-Eval
G-Eval is a custom, LLM-evaluated metric, meaning its score is calculated using an LLM. G-Eval is the most versatile type of metric `deepeval` has to offer, and is capable of evaluating almost any use case with human-like accuracy.
## Required Arguments
To use `GEval`, you'll have to provide the following arguments when creating an `LLMTestCase`:

- `input`
- `actual_output`

You'll also need to supply any additional arguments, such as `expected_output` and `context`, if your evaluation criteria depend on these parameters.
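For instance, here is a minimal sketch of a test case for criteria that depend on an expected output (the strings are illustrative placeholders):

```python
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    # Only needed if your criteria reference the expected output
    expected_output="Paris",
)
```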
## Example
To create a custom metric that uses LLMs for evaluation, simply instantiate a `GEval` class and define an evaluation criteria in everyday language:
```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence - determine if the actual output is coherent with the input.",
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)
```
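If you'd rather spell out the evaluation procedure yourself, supply `evaluation_steps` in place of `criteria`. Here is a sketch of the same metric defined via explicit steps (the step wording is illustrative):

```python
coherence_metric = GEval(
    name="Coherence",
    # evaluation_steps takes the place of criteria; per the note above,
    # provide one or the other, never both
    evaluation_steps=[
        "Check whether the sentences in 'actual output' align with those in 'input'"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)
```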
There are three mandatory and five optional parameters required when instantiating a `GEval` class:
- `name`: name of the metric.
- `criteria`: a description outlining the specific evaluation aspects for each test case.
- `evaluation_params`: a list of type `LLMTestCaseParams`. Include only the parameters that are relevant for evaluation.
- [Optional] `evaluation_steps`: a list of strings outlining the exact steps the LLM should take for evaluation. You can only provide either `evaluation_steps` OR `criteria`, and not both.
- [Optional] `threshold`: the passing threshold, defaulted to 0.5.
- [Optional] `model`: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type `DeepEvalBaseLLM`. Defaulted to 'gpt-4-0125-preview'.
- [Optional] `strict_mode`: a boolean which, when set to `True`, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to `False`.
- [Optional] `async_mode`: a boolean which, when set to `True`, enables concurrent execution within the `measure()` method. Defaulted to `True`.
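Putting the optional parameters together, here is a sketch of a stricter configuration (the values are illustrative, not recommendations):

```python
coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence - determine if the actual output is coherent with the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,               # raise the passing bar above the 0.5 default
    model="gpt-4-0125-preview",  # the default model, stated explicitly
    strict_mode=False,           # True would force a binary 1/0 score
    async_mode=True,             # concurrent execution inside measure()
)
```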
For accurate and valid results, only the parameters that are mentioned in `criteria` should be included as members of `evaluation_params`.
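For example, a correctness-style criteria that compares the actual output against an expected output should list both parameters (a hedged sketch; the criteria wording is illustrative):

```python
correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    # Both parameters appear in the criteria, so both are included here
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)
```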
As mentioned in the metrics introduction section, all of `deepeval`'s metrics return a score between 0 and 1. A metric is only successful if the evaluation score is equal to or greater than `threshold`, and `GEval` is no exception. You can access the `score` and `reason` for each individual `GEval` metric:
```python
from deepeval.test_case import LLMTestCase
...

test_case = LLMTestCase(
    input="The sun is shining bright today",
    actual_output="The weather's getting really hot."
)

coherence_metric.measure(test_case)
print(coherence_metric.score)
print(coherence_metric.reason)
```
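You can also run the metric over many test cases at once with deepeval's `evaluate` function (a sketch, reusing the metric and test case defined above):

```python
from deepeval import evaluate

evaluate([test_case], [coherence_metric])
```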
Did you know that in the original G-Eval paper, the authors used the probabilities of the LLM output tokens to normalize the score by calculating a weighted summation? This step was introduced in the paper because it minimizes bias in LLM scoring. This normalization is handled by `deepeval` by default (unless you're using a custom model).
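Concretely, instead of taking the single score the LLM emits, each candidate score is weighted by the probability of the token that produced it. A rough sketch of that weighted summation (the probabilities are made up for illustration):

```python
# Hypothetical token probabilities for candidate scores 1-5, e.g. as
# exposed through an LLM's logprobs; the values are illustrative only.
score_probs = {1: 0.05, 2: 0.10, 3: 0.20, 4: 0.40, 5: 0.25}

# Weighted summation: score = sum(p(s_i) * s_i)
weighted_score = sum(score * prob for score, prob in score_probs.items())
print(weighted_score)  # 3.7 on the 1-5 scale, before scaling to 0-1
```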