HellaSwag
HellaSwag is a benchmark designed to evaluate language models' commonsense reasoning through sentence completion tasks. It provides 10,000 challenges spanning various subject areas. For more details, see the HellaSwag GitHub page.
HellaSwag emphasizes commonsense reasoning and depth of understanding in real-world situations, which makes it useful for pinpointing where models struggle with nuanced or complex contexts.
Arguments
There are two optional arguments when using the HellaSwag benchmark:
- [Optional] tasks: a list of tasks (HellaSwagTask enums), which specifies the subject areas for sentence completion evaluation. By default, this is set to all tasks. The full list of HellaSwagTask enums is given in the HellaSwag Tasks section.
- [Optional] n_shots: the number of "shots" to use for few-shot learning. This is set to 10 by default and cannot exceed 15.
Note that, unlike BIGBenchHard, the HellaSwag benchmark does not use CoT prompting.
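To make the n_shots argument concrete, here is a minimal sketch of what n-shot prompting means: up to n_shots solved examples are placed in front of the question being asked. The function name build_few_shot_prompt, the MAX_SHOTS constant, and the "Answer:" layout are all hypothetical and chosen for illustration; this is not DeepEval's actual prompt template.

```python
from typing import List, Tuple

MAX_SHOTS = 15  # upper bound on n_shots stated in the Arguments section


def build_few_shot_prompt(
    examples: List[Tuple[str, str]], question: str, n_shots: int = 10
) -> str:
    """Prepend up to n_shots solved (question, answer) pairs to the new question.

    Hypothetical helper: the prompt layout is an assumption, not DeepEval's.
    """
    if not 0 <= n_shots <= MAX_SHOTS:
        raise ValueError(f"n_shots must be between 0 and {MAX_SHOTS}")
    # Each shot shows a question together with its correct letter answer.
    shots = [f"{q}\nAnswer: {a}" for q, a in examples[:n_shots]]
    # The final block leaves the answer open for the model to fill in.
    return "\n\n".join(shots + [f"{question}\nAnswer:"])
```

Under these assumptions, calling build_few_shot_prompt(examples, question, n_shots=5) yields five worked examples followed by the open question, which shows the model the expected single-letter answer format before it responds.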
Example
The code below evaluates a custom mistral_7b model (see the documentation on custom LLMs to learn how to use any custom LLM) and its ability to complete sentences related to the 'Trimming Branches or Hedges' and 'Baton Twirling' subjects using 5-shot learning.
from deepeval.benchmarks import HellaSwag
from deepeval.benchmarks.tasks import HellaSwagTask
# Define benchmark with specific tasks and shots
benchmark = HellaSwag(
    tasks=[HellaSwagTask.TRIMMING_BRANCHES_OR_HEDGES, HellaSwagTask.BATON_TWIRLING],
    n_shots=5
)
# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)
The overall_score for this benchmark ranges from 0 to 1, where 1 signifies perfect performance and 0 indicates no correct answers. The score is based on exact matching: it is the proportion of multiple-choice sentence-completion questions for which the model produces exactly the correct letter answer (e.g. 'A'), out of the total number of questions.
As a result, using more few-shot examples (n_shots) can make the model more reliable at producing answers in the exact expected format, which in turn can raise the overall score.
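The exact-match scoring described above can be sketched in a few lines. This is an illustration under stated assumptions, not DeepEval's internal code: the function name exact_match_score is hypothetical, and the comparison is a plain string equality with no normalization.

```python
from typing import List


def exact_match_score(predictions: List[str], answers: List[str]) -> float:
    """Return the fraction of predictions that equal the expected letter exactly.

    Hypothetical helper for illustration; strict equality, no normalization.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    # A prediction counts only if it is exactly the expected letter.
    correct = sum(pred == gold for pred, gold in zip(predictions, answers))
    return correct / len(answers)


# "The answer is C" names the right option but is not an exact match,
# so only the first two of the four predictions count as correct.
score = exact_match_score(["A", "B", "The answer is C", "D"], ["A", "B", "C", "A"])
print(score)  # 0.5
```

The third prediction shows why answer format matters under this kind of scoring: a response that identifies the right option but wraps it in extra words is counted as wrong.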
HellaSwag Tasks
The HellaSwagTask enum classifies the diverse range of categories covered in the HellaSwag benchmark.
from deepeval.benchmarks.tasks import HellaSwagTask
hella_tasks = [HellaSwagTask.APPLYING_SUNSCREEN]
Below is the comprehensive list of available tasks:
APPLYING_SUNSCREEN
TRIMMING_BRANCHES_OR_HEDGES
DISC_DOG
WAKEBOARDING
SKATEBOARDING
WATERSKIING
WASHING_HANDS
SAILING
PLAYING_CONGAS
BALLET
ROOF_SHINGLE_REMOVAL
HAND_CAR_WASH
KITE_FLYING
PLAYING_POOL
PLAYING_LACROSSE
LAYUP_DRILL_IN_BASKETBALL
HOME_AND_GARDEN
PLAYING_BEACH_VOLLEYBALL
CALF_ROPING
SCUBA_DIVING
MIXING_DRINKS
PUTTING_ON_SHOES
MAKING_A_LEMONADE
UNCATEGORIZED
ZUMBA
PLAYING_BADMINTON
PLAYING_BAGPIPES
FOOD_AND_ENTERTAINING
PERSONAL_CARE_AND_STYLE
CRICKET
SHOVELING_SNOW
PING_PONG
HOLIDAYS_AND_TRADITIONS
ICE_FISHING
BEACH_SOCCER
TABLE_SOCCER
SWIMMING
BATON_TWIRLING
JAVELIN_THROW
SHOT_PUT
DOING_CRUNCHES
POLISHING_SHOES
TRAVEL
USING_UNEVEN_BARS
PLAYING_HARMONICA
RELATIONSHIPS
HIGH_JUMP
MAKING_A_SANDWICH
POWERBOCKING
REMOVING_ICE_FROM_CAR
SHAVING
SHARPENING_KNIVES
WELDING
USING_PARALLEL_BARS
HOME_CATEGORIES
ROCK_CLIMBING
SNOW_TUBING
WASHING_FACE
ASSEMBLING_BICYCLE
TENNIS_SERVE_WITH_BALL_BOUNCING
SHUFFLEBOARD
DODGEBALL
CAPOEIRA
PAINTBALL
DOING_A_POWERBOMB
DOING_MOTOCROSS
PLAYING_ICE_HOCKEY
PHILOSOPHY_AND_RELIGION
ARCHERY
CARS_AND_OTHER_VEHICLES
RUNNING_A_MARATHON
THROWING_DARTS
PAINTING_FURNITURE
HAVING_AN_ICE_CREAM
SLACKLINING
CAMEL_RIDE
ARM_WRESTLING
HULA_HOOP
SURFING
PLAYING_PIANO
GARGLING_MOUTHWASH
PLAYING_ACCORDION
HORSEBACK_RIDING
PUTTING_IN_CONTACT_LENSES
PLAYING_SAXOPHONE
FUTSAL
LONG_JUMP
LONGBOARDING
POLE_VAULT
BUILDING_SANDCASTLES
PLATFORM_DIVING
PAINTING
SPINNING
CARVING_JACK_O_LANTERNS
BRAIDING_HAIR
YOUTH
PLAYING_VIOLIN
CANOEING
CHEERLEADING
PETS_AND_ANIMALS
KAYAKING
CLEANING_SHOES
KNITTING
BAKING_COOKIES
DOING_FENCING
PLAYING_GUITARRA
USING_THE_ROWING_MACHINE
GETTING_A_HAIRCUT
MOOPING_FLOOR
RIVER_TUBING
CLEANING_SINK
GROOMING_DOG
DISCUS_THROW
CLEANING_WINDOWS
FINANCE_AND_BUSINESS
HANGING_WALLPAPER
ROPE_SKIPPING
WINDSURFING
KNEELING
GETTING_A_PIERCING
ROCK_PAPER_SCISSORS
SPORTS_AND_FITNESS
BREAKDANCING
WALKING_THE_DOG
PLAYING_DRUMS
PLAYING_WATER_POLO
BMX
SMOKING_A_CIGARETTE
BLOWING_LEAVES
BULLFIGHTING
DRINKING_COFFEE
BATHING_DOG
TANGO
WRAPPING_PRESENTS
PLASTERING
PLAYING_BLACKJACK
FUN_SLIDING_DOWN
WORK_WORLD
TRIPLE_JUMP
TUMBLING
SKIING
DOING_KICKBOXING
BLOW_DRYING_HAIR
DRUM_CORPS
SMOKING_HOOKAH
MOWING_THE_LAWN
VOLLEYBALL
LAYING_TILE
STARTING_A_CAMPFIRE
SUMO
HURLING
PLAYING_KICKBALL
MAKING_A_CAKE
FIXING_THE_ROOF
PLAYING_POLO
REMOVING_CURLERS
ELLIPTICAL_TRAINER
HEALTH
SPREAD_MULCH
CHOPPING_WOOD
BRUSHING_TEETH
USING_THE_POMMEL_HORSE
SNATCH
CLIPPING_CAT_CLAWS
PUTTING_ON_MAKEUP
HAND_WASHING_CLOTHES
HITTING_A_PINATA
TAI_CHI
GETTING_A_TATTOO
DRINKING_BEER
SHAVING_LEGS
DOING_KARATE
PLAYING_RUBIK_CUBE
FAMILY_LIFE
ROLLERBLADING
EDUCATION_AND_COMMUNICATIONS
FIXING_BICYCLE
BEER_PONG
IRONING_CLOTHES
CUTTING_THE_GRASS
RAKING_LEAVES
PLAYING_SQUASH
HOPSCOTCH
INSTALLING_CARPET
POLISHING_FURNITURE
DECORATING_THE_CHRISTMAS_TREE
PREPARING_SALAD
PREPARING_PASTA
VACUUMING_FLOOR
CLEAN_AND_JERK
COMPUTERS_AND_ELECTRONICS
CROQUET