JudgementalGPT
JudgementalGPT is an LLM agent developed in-house by Confident AI that's dedicated to evaluation and is superior to GEval. While it operates similarly to GEval by utilizing LLMs for scoring, it:
- offers enhanced accuracy and reliability
- is capable of generating justifications in different languages
- has the ability to conditionally execute code that helps detect logical fallacies during evaluations
Required Arguments
To use JudgementalGPT, you'll have to provide the following arguments when creating an LLMTestCase:
- input
- actual_output
Similar to GEval, you'll also need to supply any additional arguments such as expected_output and context if your evaluation criteria depend on these parameters.
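For instance, a test case supplying both required arguments plus the optional ones might look like the sketch below (the expected_output and context values are illustrative assumptions, not prescribed by JudgementalGPT):

from deepeval.test_case import LLMTestCase

# Required arguments: input and actual_output.
# Optional arguments: expected_output and context, only needed if your
# criteria references them (values here are illustrative).
test_case = LLMTestCase(
    input="Show me a valid json",
    actual_output="{'valid': 'json'}",
    expected_output='{"valid": "json"}',
    context=["The user asked for a JSON object."]
)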
Example
To use JudgementalGPT, start by logging into Confident AI:
deepeval login
Then paste in the following code to define a metric powered by JudgementalGPT:
from deepeval.types import Languages
from deepeval.metrics import JudgementalGPT
from deepeval.test_case import LLMTestCaseParams
code_correctness_metric = JudgementalGPT(
    name="Code Correctness",
    criteria="Code Correctness - determine whether the code in the 'actual output' produces a valid JSON.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    language=Languages.SPANISH,
    threshold=0.5,
)
Under the hood, JudgementalGPT sends a request to Confident AI's servers, which host JudgementalGPT. JudgementalGPT accepts the following arguments:
- name: the name of the metric.
- criteria: a description outlining the specific evaluation aspects for each test case.
- evaluation_params: a list of type LLMTestCaseParams. Include only the parameters that are relevant for evaluation.
- [Optional] language: of type Languages, specifies what language to return the reasoning in.
- [Optional] threshold: the passing threshold, defaulted to 0.5.
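To make the relationship between criteria and evaluation_params concrete, here is a sketch of a second metric (the metric name and criteria are hypothetical) whose criteria also references the expected output, so both parameters are included:

from deepeval.metrics import JudgementalGPT
from deepeval.test_case import LLMTestCaseParams

# Hypothetical metric: its criteria mentions both the actual and expected
# output, so both are listed in evaluation_params.
answer_match_metric = JudgementalGPT(
    name="Answer Match",
    criteria="Determine whether the 'actual output' conveys the same information as the 'expected output'.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.5,
)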
Similar to GEval, you can access the score and reason for JudgementalGPT:
from deepeval.test_case import LLMTestCase
...
test_case = LLMTestCase(
    input="Show me a valid json",
    actual_output="{'valid': 'json'}"
)
code_correctness_metric.measure(test_case)
print(code_correctness_metric.score)
print(code_correctness_metric.reason)
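If you prefer a simple pass/fail signal, you can compare the returned score against the threshold you configured; the snippet below assumes the 0.5 threshold from the metric defined above:

# The metric passes when the judged score meets or exceeds its threshold.
passed = code_correctness_metric.score >= 0.5
print(passed)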