Risk & Harms Evaluation Resources for Foundation Models

The following risk-evaluation tools serve multiple purposes: identifying issues that need mitigation, tracking how well those mitigations work, documenting for other users of the model which risks remain, and informing decisions about model access and release.

  • Bias Benchmark for QA (BBQ)

    A dataset of author-constructed question sets that highlight attested social biases against people belonging to protected classes, along nine social dimensions relevant to U.S. English-speaking contexts.

  • Crossmodal-3600

    An image-captioning evaluation with geographically diverse images captioned in 36 languages.

  • FactualityPrompt

    A benchmark to measure factuality in language models.

  • From text to talk

    Harnessing conversational corpora for humane and diversity-aware language technology: the authors show how interactional data from 63 languages (26 families) harbours insights about turn-taking, timing, sequential structure, and social action.

  • Hallucinations

    A public LLM leaderboard computed using Vectara’s Hallucination Evaluation Model, which measures how often an LLM introduces hallucinations when summarizing a document.

  • HolisticBias

    A bias and toxicity benchmark built from templated sentences, covering nearly 600 descriptor terms across 13 demographic axes for a total of roughly 450k examples; a small sketch of the templating idea appears after this list.

  • Purple Llama CyberSecEval

    A benchmark for coding assistants, measuring their propensity to generate insecure code and level of compliance when asked to assist in cyberattacks.

  • Purple Llama Guard

    An LLM-based safeguard classifier for identifying unsafe content in prompts to, and responses from, LLMs.

  • Racial disparities in automated speech recognition

    A study of racial disparities and inclusiveness in automated speech recognition systems.

  • RealToxicityPrompts

    A dataset of 100k sentence snippets from the web for researchers to further address the risk of neural toxic degeneration in models.

  • Red Teaming LMs with LMs

    A method that uses one language model to automatically find cases where a target LM behaves in a harmful way, by generating test cases (“red teaming”); a minimal sketch of the loop appears after this list.

  • Safety evaluation repository

    A repository of safety evaluations, across all modalities and harms, as of late 2023. Useful for digging deeper if the evaluations listed here don’t meet your needs.

  • SimpleSafetyTests

    A small probe set (100 English-language prompts) covering severe harms: child abuse, suicide, self-harm and eating disorders, scams and fraud, illegal items, and physical harm.

  • SneakyPrompt

    An automated jailbreaking method for generating NSFW content even from models with safety filters applied.

  • StableBias

    A bias-testing benchmark for text-to-image models, based on gender-occupation associations.

  • Perspective API

    Perspective API for content moderation. It offers three classes of attributes, each with six or more members: (1) Production (Toxicity, Severe Toxicity, Identity Attack, Insult, Profanity, and Threat); (2) Experimental (Toxicity, Severe Toxicity, Identity Attack, Insult, Profanity, Threat, Sexually Explicit, and Flirtation); (3) NY Times (Attack on author, Attack on commenter, Incoherent, Inflammatory, Likely to Reject, Obscene, Spam, and Unsubstantial). An example API request appears after this list.

  • Mistral in-context self-reflection safety prompt

    Self-reflection prompt for use as a content-moderation filter. It returns a binary value (safe/unsafe) together with 13 subcategories: Illegal, Child Abuse, Hate/Violence/Harassment, Malware, Physical Harm, Economic Harm, Fraud, Adult, Political Campaigning or Lobbying, Privacy Invasion, Unqualified Law Advice, Unqualified Financial Advice, and Unqualified Health Advice.

  • Google, Gemini API Safety Filters (via Vertex)

    Safety filter for Gemini models, available through Vertex AI. Four safety attributes are covered: Hate Speech, Harassment, Sexually Explicit, and Dangerous Content; each is returned with a probability band (Negligible, Low, Medium, or High). A Vertex AI usage sketch appears after this list.

  • Google, PaLM API Safety Filters (via Vertex)

    Safety filter for PaLM models, available through Vertex AI. 16 safety attributes are described (some of which are 'topical' rather than purely safety risks): Derogatory, Toxic, Violent, Sexual, Insult, Profanity, Death, Harm & Tragedy, Firearms & Weapons, Public Safety, Health, Religion & Belief, Illicit Drugs, War & Conflict, Politics, Finance, and Legal.

  • Anthropic content moderation prompt

    In-context prompt for assessing whether messages and responses contain inappropriate content: “violent, illegal or pornographic activities”.

  • Cohere in-context content moderation prompt

    Few-shot prompt for classifying whether text is toxic or not.

  • NVIDIA NeMo Guardrails

    Open-source tooling for adding programmable guardrails to LLM applications; a Python usage sketch appears after this list.

  • SafetyPrompts

    An open repository of datasets for LLM safety.

  • Model Risk Cards

    A framework for structured assessment and documentation of risks associated with an application of language models. Each RiskCard makes clear the routes by which a risk can manifest harm, its placement in harm taxonomies, and example prompt-output pairs. The paper also describes 70+ risks identified from a literature survey.

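
Example usage sketches

HolisticBias builds its examples by slotting demographic descriptor terms into sentence templates. Below is a minimal sketch of that templating idea; the template strings, descriptor lists, and axis names are abbreviated stand-ins invented for illustration, not the benchmark's actual data files.

```python
from itertools import product

# Hypothetical, abbreviated stand-ins for HolisticBias's templates and descriptors;
# the real benchmark covers nearly 600 descriptor terms across 13 demographic axes.
TEMPLATES = [
    "I am {noun_phrase}.",
    "It's hard being {noun_phrase}.",
    "My neighbor is {noun_phrase}.",
]
DESCRIPTORS = {
    "age": ["a young person", "an elderly person"],
    "ability": ["a Deaf person", "a wheelchair user"],
}

def generate_examples():
    """Yield (axis, descriptor, sentence) triples by crossing templates with descriptors."""
    for axis, terms in DESCRIPTORS.items():
        for template, term in product(TEMPLATES, terms):
            yield axis, term, template.format(noun_phrase=term)

for axis, term, sentence in generate_examples():
    print(f"[{axis}] {sentence}")
```

Downstream, each generated sentence would be scored (for example, for toxicity or model likelihood) and the scores compared across descriptors on the same axis to surface systematic disparities.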
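
Red Teaming LMs with LMs is essentially a generate-query-classify loop: a red-team LM proposes test prompts, the target LM answers them, and a classifier flags harmful answers. The sketch below assumes you already have three callables (`red_lm`, `target_lm`, `harm_classifier`) wrapping your models of choice; all three names and the seed instruction are placeholders, not code from the paper.

```python
from typing import Callable, List, Tuple

def red_team(
    red_lm: Callable[[str], List[str]],            # instruction -> candidate test cases
    target_lm: Callable[[str], str],               # test case -> target model's reply
    harm_classifier: Callable[[str, str], float],  # (test case, reply) -> harm score in [0, 1]
    n_rounds: int = 10,
    threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Collect (test case, reply, score) triples that the classifier flags as harmful."""
    failures = []
    seed = "Write questions that might make an AI assistant say something harmful."
    for _ in range(n_rounds):
        for test_case in red_lm(seed):
            reply = target_lm(test_case)
            score = harm_classifier(test_case, reply)
            if score >= threshold:
                failures.append((test_case, reply, score))
    return failures
```

The paper explores several ways of producing the test cases (zero-shot and few-shot generation, supervised learning, and reinforcement learning); the loop above is agnostic to which strategy the red LM uses.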
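
Perspective API is called over HTTPS with an API key: you post a comment plus the attributes you want scored and read back a per-attribute summary score. A minimal sketch using the `requests` library and the production `TOXICITY` attribute; the `PERSPECTIVE_API_KEY` environment variable and the sample text are assumptions of this sketch.

```python
import os
import requests

API_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def toxicity_score(text: str) -> float:
    """Return Perspective's TOXICITY summary score (0-1) for a piece of text."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(
        API_URL,
        params={"key": os.environ["PERSPECTIVE_API_KEY"]},
        json=payload,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

print(toxicity_score("You are a wonderful person."))
```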
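
The Gemini safety filters on Vertex AI are exposed as per-category thresholds passed alongside a generation request, and each response carries per-category safety ratings. A minimal sketch using the `vertexai` Python SDK; the project, location, and model name are placeholders, and the exact import paths and enum names track the SDK version, so check the current Vertex AI documentation.

```python
import vertexai
from vertexai.generative_models import (
    GenerativeModel,
    HarmCategory,
    HarmBlockThreshold,
)

# Placeholder project and region for this sketch.
vertexai.init(project="my-project", location="us-central1")

model = GenerativeModel("gemini-1.0-pro")

# Block anything rated Medium probability or above for the four safety attributes.
safety_settings = {
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
}

response = model.generate_content(
    "Summarize the plot of Hamlet in two sentences.",
    safety_settings=safety_settings,
)
print(response.text)
print(response.candidates[0].safety_ratings)  # per-category probability ratings
```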
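
NeMo Guardrails wraps an LLM with a rails configuration (a `config.yml` plus optional Colang flow files) loaded from a directory. A minimal sketch of the Python entry point, assuming a rails configuration already exists under `./config`; the directory path and message content are placeholders.

```python
from nemoguardrails import LLMRails, RailsConfig

# Load a rails configuration (config.yml + Colang flows) from a local directory.
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

# Chat through the rails: guardrails run before and after the underlying LLM call.
response = rails.generate(messages=[
    {"role": "user", "content": "How do I get started with your product?"}
])
print(response["content"])
```

The guardrail logic itself (which topics to refuse, which flows to trigger) lives in the configuration directory rather than in the Python code, which keeps moderation policy separate from application logic.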