JudgementalGPT

JudgementalGPT is an LLM agent developed in-house by Confident AI that's dedicated to evaluation and is superior to GEval. While it operates similarly to GEval by utilizing LLMs for scoring, it:

  • offers enhanced accuracy and reliability
  • is capable of generating justifications in different languages
  • has the ability to conditionally execute code that helps detect logical fallacies during evaluations

Required Arguments

To use JudgementalGPT, you'll have to provide the following arguments when creating an LLMTestCase:

  • input
  • actual_output

Similar to GEval, you'll also need to supply any additional arguments, such as expected_output and context, if your evaluation criteria depend on these parameters.
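
For example, if your criteria compare the response against a reference answer, the test case must also carry an expected_output. A minimal sketch; the field values below are illustrative placeholders, not from the official docs:

from deepeval.test_case import LLMTestCase

# Hypothetical test case for criteria that reference the expected output;
# the strings are placeholders chosen for illustration.
test_case = LLMTestCase(
    input="Show me a valid json",
    actual_output='{"valid": "json"}',
    expected_output='{"valid": "json"}',
)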

Example

To use JudgementalGPT, start by logging into Confident AI:

deepeval login

Then paste in the following code to define a metric powered by JudgementalGPT:

from deepeval.types import Languages
from deepeval.metrics import JudgementalGPT
from deepeval.test_case import LLMTestCaseParams

code_correctness_metric = JudgementalGPT(
    name="Code Correctness",
    criteria="Code Correctness - determine whether the code in the 'actual output' produces a valid JSON.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    language=Languages.SPANISH,
    threshold=0.5,
)

Under the hood, JudgementalGPT sends a request to Confident AI's servers, which host JudgementalGPT. It accepts three mandatory and two optional arguments (see the sketch after this list):

  • name: the name of the metric.
  • criteria: a description outlining the specific evaluation aspects for each test case.
  • evaluation_params: a list of type LLMTestCaseParams. Include only the parameters that are relevant for evaluation.
  • [Optional] language: of type Languages, specifies the language to return the reasoning in.
  • [Optional] threshold: the passing threshold, defaulted to 0.5.
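
If your criteria reference the expected output, include LLMTestCaseParams.EXPECTED_OUTPUT in evaluation_params. A minimal sketch, assuming the same JudgementalGPT constructor shown above; the metric name and criteria wording here are illustrative placeholders:

from deepeval.metrics import JudgementalGPT
from deepeval.test_case import LLMTestCaseParams

# Illustrative configuration: the criteria wording is a placeholder.
# language and threshold are omitted, so the reasoning uses the default
# language and the passing threshold defaults to 0.5.
answer_match_metric = JudgementalGPT(
    name="Answer Match",
    criteria="Determine whether the 'actual output' conveys the same facts as the 'expected output'.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)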

Similar to GEval, you can access the score and reason computed by JudgementalGPT:

from deepeval.test_case import LLMTestCase
...

test_case = LLMTestCase(
    input="Show me a valid json",
    actual_output="{'valid': 'json'}"
)

code_correctness_metric.measure(test_case)
print(code_correctness_metric.score)
print(code_correctness_metric.reason)
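
To act on the score rather than just print it, compare it against the threshold. The sketch below assumes JudgementalGPT exposes the same is_successful() helper as other deepeval metrics; if it does not, comparing score to the 0.5 threshold directly achieves the same effect:

# Assumption: is_successful() is available on JudgementalGPT, as it is on
# other deepeval metrics; it reports whether the score met the threshold.
if code_correctness_metric.is_successful():
    print("Code Correctness passed")
else:
    print(f"Code Correctness failed: {code_correctness_metric.reason}")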