
BIG-Bench Hard

The BIG-Bench Hard (BBH) benchmark comprises 23 challenging BIG-Bench tasks on which prior language models did not outperform the average human rater. BBH evaluates models using both few-shot and chain-of-thought (CoT) prompting. For more details, you can visit the BIG-Bench Hard GitHub page.

Arguments

There are three optional arguments when using the BigBenchHard benchmark (see the sketch after this list):

  • [Optional] tasks: a list of BigBenchHardTask enums that specifies which subject areas to evaluate the model on. By default, this is set to all tasks. The full list of BigBenchHardTask enums is provided under BIG-Bench Hard Tasks below.
  • [Optional] n_shots: the number of "shots" (examples) to use for few-shot learning. This value must be between 0 and 3, and is set to 3 by default.
  • [Optional] enable_cot: a boolean that determines whether CoT prompting is used for evaluation. This is set to True by default.
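All three arguments can be omitted to run every task with the default 3-shot CoT configuration, or overridden individually. A minimal sketch:

from deepeval.benchmarks import BigBenchHard
from deepeval.benchmarks.tasks import BigBenchHardTask

# Default configuration: all tasks, n_shots=3, enable_cot=True
benchmark_default = BigBenchHard()

# Override individual arguments, e.g. zero-shot without CoT on a single task
benchmark_custom = BigBenchHard(
    tasks=[BigBenchHardTask.WORD_SORTING],
    n_shots=0,
    enable_cot=False,
)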
info

Chain-of-Thought (CoT) prompting is a technique where the model is prompted to articulate its reasoning process before arriving at an answer, while few-shot prompting provides the model with a few examples (or "shots") to learn from before it makes predictions. When combined, few-shot prompting and CoT can significantly improve performance. You can learn more about CoT here.
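As a rough illustration (not the exact prompt template deepeval constructs internally), a 1-shot CoT prompt for the Boolean Expressions task might look like this:

# Schematic 1-shot CoT prompt -- illustrative only, not deepeval's internal template
example_prompt = """Q: not ( True ) and ( True ) is
A: Let's think step by step. not ( True ) is False. False and True is False. So the answer is False.

Q: True and not not ( not False ) is
A: Let's think step by step."""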

Example

The code below assesses a custom mistral_7b model (click here to learn how to use ANY custom LLM) on Boolean Expressions and Causal Judgement in BigBenchHard using 3-shot CoT prompting.

from deepeval.benchmarks import BigBenchHard
from deepeval.benchmarks.tasks import BigBenchHardTask

# Define benchmark with specific tasks and shots
benchmark = BigBenchHard(
    tasks=[BigBenchHardTask.BOOLEAN_EXPRESSIONS, BigBenchHardTask.CAUSAL_JUDGEMENT],
    n_shots=3,
    enable_cot=True
)

# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)

The overall_score for this benchmark ranges from 0 to 1 and represents the proportion of predictions that exactly match the target labels across all evaluated tasks. BIG-Bench Hard uses the exact match scorer.
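Conceptually, the exact match scorer compares each prediction string against its target label, and overall_score is the fraction of matches. A simplified sketch with hypothetical predictions and targets, not deepeval's internal implementation:

# Illustration of exact-match scoring over hypothetical predictions/targets
predictions = ["False", "True", "(A)"]
targets = ["False", "False", "(A)"]

overall_score = sum(
    prediction == target for prediction, target in zip(predictions, targets)
) / len(targets)
print(overall_score)  # 2 out of 3 exact matches -> 0.666...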

BBH answers exhibit greater variety than those of multiple-choice benchmarks, since different BBH tasks require different types of output (for example, boolean values for boolean expressions versus numbers for multi-step arithmetic). CoT prompting is therefore especially helpful for improving benchmark performance.

tip

Using more few-shot examples (n_shots) improves the model's ability to generate answers in the exact expected format, which can significantly boost the overall score.

BIG-Bench Hard Tasks

The BigBenchHardTask enum covers the full range of tasks in the BIG-Bench Hard benchmark.

from deepeval.benchmarks.tasks import BigBenchHardTask

big_tasks = [BigBenchHardTask.BOOLEAN_EXPRESSIONS]

Below is the comprehensive list of available tasks:

  • BOOLEAN_EXPRESSIONS
  • CAUSAL_JUDGEMENT
  • DATE_UNDERSTANDING
  • DISAMBIGUATION_QA
  • DYCK_LANGUAGES
  • FORMAL_FALLACIES
  • GEOMETRIC_SHAPES
  • HYPERBATON
  • LOGICAL_DEDUCTION_FIVE_OBJECTS
  • LOGICAL_DEDUCTION_SEVEN_OBJECTS
  • LOGICAL_DEDUCTION_THREE_OBJECTS
  • MOVIE_RECOMMENDATION
  • MULTISTEP_ARITHMETIC_TWO
  • NAVIGATE
  • OBJECT_COUNTING
  • PENGUINS_IN_A_TABLE
  • REASONING_ABOUT_COLORED_OBJECTS
  • RUIN_NAMES
  • SALIENT_TRANSLATION_ERROR_DETECTION
  • SNARKS
  • SPORTS_UNDERSTANDING
  • TEMPORAL_SEQUENCES
  • TRACKING_SHUFFLED_OBJECTS_FIVE_OBJECTS
  • TRACKING_SHUFFLED_OBJECTS_SEVEN_OBJECTS
  • TRACKING_SHUFFLED_OBJECTS_THREE_OBJECTS
  • WEB_OF_LIES
  • WORD_SORTING
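Any subset of these tasks can be passed to the benchmark, and per-task results can be inspected after evaluation. A sketch, assuming the task_scores attribute exposed by deepeval benchmarks and a previously defined mistral_7b custom model:

from deepeval.benchmarks import BigBenchHard
from deepeval.benchmarks.tasks import BigBenchHardTask

# Evaluate a small subset of BBH tasks
benchmark = BigBenchHard(
    tasks=[
        BigBenchHardTask.WORD_SORTING,
        BigBenchHardTask.OBJECT_COUNTING,
        BigBenchHardTask.DATE_UNDERSTANDING,
    ]
)

# 'mistral_7b' is a custom deepeval model, as in the example above
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)  # aggregate exact-match accuracy
print(benchmark.task_scores)    # per-task breakdown (assumed attribute)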