MMLU

MMLU (Massive Multitask Language Understanding) is a benchmark for evaluating LLMs through multiple-choice questions. These questions cover 57 subjects such as math, history, law, and ethics. For more information, visit the MMLU GitHub page.

tip

MMLU covers a broad range of subjects in considerable depth, making it effective at detecting topics where a model's understanding is weak.

Arguments

There are two optional arguments when using the MMLU benchmark:

  • [Optional] tasks: a list of MMLUTask enums specifying which of the 57 subject areas to evaluate the language model on. By default, this is set to all tasks. Detailed descriptions of the MMLUTask enum can be found here.
  • [Optional] n_shots: the number of examples ("shots") to include in each prompt for few-shot learning. This defaults to 5 and cannot exceed 5.
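
With both arguments omitted, the benchmark runs on all 57 tasks with 5-shot prompting. A minimal sketch of that default usage (mistral_7b stands in for your own custom model object, as in the example below):

from deepeval.benchmarks import MMLU

# Default configuration: all 57 tasks, 5-shot prompting
benchmark = MMLU()
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)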

Example

The code below evaluates a custom mistral_7b model (click here to learn how to use ANY custom LLM) on the High School Computer Science and Astronomy tasks using 3-shot prompting.

from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask

# Define benchmark with specific tasks and shots
benchmark = MMLU(
    tasks=[MMLUTask.HIGH_SCHOOL_COMPUTER_SCIENCE, MMLUTask.ASTRONOMY],
    n_shots=3
)

# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)
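
Note that mistral_7b above must be a custom model object implementing deepeval's model interface. As a rough sketch (assuming deepeval's DeepEvalBaseLLM base class and a Hugging Face transformers checkpoint; the exact import path and checkpoint name may differ in your setup), such a wrapper might look like:

from transformers import AutoModelForCausalLM, AutoTokenizer
from deepeval.models.base_model import DeepEvalBaseLLM  # import path may vary by deepeval version

class Mistral7B(DeepEvalBaseLLM):
    """Thin wrapper exposing a Hugging Face causal LM to deepeval."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        model = self.load_model()
        inputs = self.tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=100)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return "Mistral 7B"

# Hypothetical instantiation; swap in whichever checkpoint you actually use
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
mistral_7b = Mistral7B(model=model, tokenizer=tokenizer)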

The overall_score for this benchmark ranges from 0 to 1, where 1 signifies perfect performance and 0 indicates no correct answers. Scoring uses exact matching: the overall score is the proportion of multiple-choice questions for which the model produces exactly the correct letter answer (e.g. 'A'), out of the total number of questions.
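
To make the exact-match calculation concrete, here is a purely illustrative sketch (not deepeval's internal code) of how such a proportion is computed from predicted and gold letter answers:

# Illustrative only: exact-match accuracy over hypothetical letter answers
predictions = ["A", "C", "B", "D", "A"]  # model outputs
gold = ["A", "B", "B", "D", "C"]         # correct letters

correct = sum(p == g for p, g in zip(predictions, gold))
overall_score = correct / len(gold)
print(overall_score)  # 0.6 -- 3 of 5 answers matched exactly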

As a result, using more few-shot examples (a higher n_shots) can greatly improve the model's reliability at producing answers in exactly the expected format, and therefore boost the overall score.

MMLU Tasks

The MMLUTask enum identifies the subject areas covered by the MMLU benchmark.

from deepeval.benchmarks.tasks import MMLUTask

mm_tasks = [MMLUTask.HIGH_SCHOOL_EUROPEAN_HISTORY]

Below is the full list of available tasks:

  • HIGH_SCHOOL_EUROPEAN_HISTORY
  • BUSINESS_ETHICS
  • CLINICAL_KNOWLEDGE
  • MEDICAL_GENETICS
  • HIGH_SCHOOL_US_HISTORY
  • HIGH_SCHOOL_PHYSICS
  • HIGH_SCHOOL_WORLD_HISTORY
  • VIROLOGY
  • HIGH_SCHOOL_MICROECONOMICS
  • ECONOMETRICS
  • COLLEGE_COMPUTER_SCIENCE
  • HIGH_SCHOOL_BIOLOGY
  • ABSTRACT_ALGEBRA
  • PROFESSIONAL_ACCOUNTING
  • PHILOSOPHY
  • PROFESSIONAL_MEDICINE
  • NUTRITION
  • GLOBAL_FACTS
  • MACHINE_LEARNING
  • SECURITY_STUDIES
  • PUBLIC_RELATIONS
  • PROFESSIONAL_PSYCHOLOGY
  • PREHISTORY
  • ANATOMY
  • HUMAN_SEXUALITY
  • COLLEGE_MEDICINE
  • HIGH_SCHOOL_GOVERNMENT_AND_POLITICS
  • COLLEGE_CHEMISTRY
  • LOGICAL_FALLACIES
  • HIGH_SCHOOL_GEOGRAPHY
  • ELEMENTARY_MATHEMATICS
  • HUMAN_AGING
  • COLLEGE_MATHEMATICS
  • HIGH_SCHOOL_PSYCHOLOGY
  • FORMAL_LOGIC
  • HIGH_SCHOOL_STATISTICS
  • INTERNATIONAL_LAW
  • HIGH_SCHOOL_MATHEMATICS
  • HIGH_SCHOOL_COMPUTER_SCIENCE
  • CONCEPTUAL_PHYSICS
  • MISCELLANEOUS
  • HIGH_SCHOOL_CHEMISTRY
  • MARKETING
  • PROFESSIONAL_LAW
  • MANAGEMENT
  • COLLEGE_PHYSICS
  • JURISPRUDENCE
  • WORLD_RELIGIONS
  • SOCIOLOGY
  • US_FOREIGN_POLICY
  • HIGH_SCHOOL_MACROECONOMICS
  • COMPUTER_SECURITY
  • MORAL_SCENARIOS
  • MORAL_DISPUTES
  • ELECTRICAL_ENGINEERING
  • ASTRONOMY
  • COLLEGE_BIOLOGY