26 Risks & Harms Taxonomy Resources for Foundation Models

Taxonomies provide a way of categorising, defining and understanding risks and hazards created through the use and deployment of AI systems. The following taxonomies focus on the types of interactions and uses that create a risk of harm as well as the negative effects that they lead to.

Risks & Harms Taxonomies

  • A Holistic Approach to Undesired Content Detection in the Real World

    Description of five primary categories (Sexual, Hateful, Violent, Self-harm, Harassment) with sub-categories (e.g. Sexual / sexual content involving minors). Also describes a moderation filter (the OpenAI moderation endpoint) and releases a dataset labelled for the categories; a minimal sketch of querying the endpoint follows this entry.

    Text Speech Vision
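
    As a minimal illustration of how these categories surface in practice, the sketch below queries the OpenAI moderation endpoint mentioned above using the openai Python client. The input string is a placeholder, and the exact category labels returned may vary with the endpoint version.

      from openai import OpenAI

      # Sketch only: assumes the `openai` package is installed and the
      # OPENAI_API_KEY environment variable is set.
      client = OpenAI()

      response = client.moderations.create(
          input="Placeholder text to score against the moderation categories."
      )

      result = response.results[0]
      print("Flagged:", result.flagged)
      # Per-category booleans and scores (e.g. sexual, hate, violence,
      # self-harm, harassment), mirroring the primary categories above.
      print(result.categories)
      print(result.category_scores)
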
  • ActiveFence's LLM Safety Review: Benchmarks and Analysis

    Description of 4 risk categories, as part of a benchmark review of LLM safety: (1) Hate, (2) Misinformation, (3) Self-harm & Suicide, (4) Child abuse & exploitation.

    Text Speech Vision
  • Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    Description of 20 risk areas, as part of red teaming Anthropic’s models: Discrimination & justice, Hate speech & offensive language, Violence & incitement, Non-violent unethical behaviour (e.g. lying, cheating), Bullying & harassment, Theft, Soliciting personally identifiable information, Conspiracy theories & misinformation, Substance abuse & banned substances, Fraud & deception, Weapons, Adult content, Property crime & vandalism, Animal abuse, Terrorism & organized crime, Sexual exploitation & human trafficking, Self-harm, and Child abuse, plus two tags that are not interpretable as risk areas (“Other” and “N/A - Invalid attempt”).

    Text Speech Vision
  • BEAVERTAILS: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

    Description of 14 risk areas, as part of a QA dataset for aligning models and evaluating their safety: Hate Speech, Offensive Language; Discrimination, Stereotype, Injustice; Violence, Aiding and Abetting, Incitement; Financial Crime, Property Crime, Theft; Privacy Violation; Drug Abuse, Weapons, Banned Substance; Non-Violent Unethical Behavior; Sexually Explicit, Adult Content; Controversial Topics, Politics; Misinformation Regarding Ethics, Laws and Safety; Terrorism, Organized Crime; Self-Harm; Animal Abuse; Child Abuse. A sketch for loading the accompanying dataset follows this entry.

    Text Speech Vision
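
    The sketch below loads the accompanying dataset with the Hugging Face datasets library; the Hub ID "PKU-Alignment/BeaverTails" is an assumption here, so check the paper's repository for the authoritative location and split names.

      from datasets import load_dataset

      # Hub ID assumed for illustration; see the BeaverTails paper/repository
      # for the authoritative dataset location.
      ds = load_dataset("PKU-Alignment/BeaverTails")

      print(ds)  # inspect the available splits
      first_split = next(iter(ds.values()))
      print(first_split.features)  # inspect the per-category harm label schema
      print(first_split[0])        # one QA example with its safety annotations
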
  • Safety Assessment of Chinese Large Language Models

    Description of 8 risk areas (called “safety scenarios”): Insult, Unfairness and Discrimination, Crimes and Illegal Activities, Sensitive Topics, Physical Harm, Mental Health, Privacy and Property, Ethics and Morality. Six “instruction attacks” are also described: Goal hijacking, Prompt leaking, RolePlay Instruction, Unsafe Instruction Topic, Inquiry with Unsafe Opinion, Reverse Exposure.

    Text Speech Vision
  • DECODINGTRUST: A Comprehensive Assessment of Trustworthiness in GPT Models

    Description of 8 evaluation areas: toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness against adversarial demonstrations, privacy, machine ethics, fairness.

    Text Speech Vision
  • A Unified Typology of Harmful Content

    Taxonomy of harmful online content. There are 4 primary categories, each with subcategories: (1) Hate and harassment (Doxxing, Identity attack, Identity misrepresentation, Insult, Sexual aggression, Threat of violence), (2) Self-inflicted harm (Eating disorder promotion, Self-harm), (3) Ideological harm (Extremism, Terrorism & Organized crime, Misinformation), (4) Exploitation (Adult sexual services, Child sexual abuse material, Scams). A sketch encoding this two-tier structure as a nested mapping follows this entry.

    Text Speech Vision
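
    Purely as an illustration of how such a two-tier taxonomy can be encoded for labelling or filtering pipelines, the sketch below represents the four primary categories and their subcategories as a nested Python mapping; the category names follow the paper, but the data structure itself is not an artefact of it.

      from __future__ import annotations

      # Illustrative encoding of the two-tier typology described above.
      HARMFUL_CONTENT_TYPOLOGY: dict[str, list[str]] = {
          "Hate and harassment": [
              "Doxxing", "Identity attack", "Identity misrepresentation",
              "Insult", "Sexual aggression", "Threat of violence",
          ],
          "Self-inflicted harm": ["Eating disorder promotion", "Self-harm"],
          "Ideological harm": [
              "Extremism, Terrorism & Organized crime", "Misinformation",
          ],
          "Exploitation": [
              "Adult sexual services", "Child sexual abuse material", "Scams",
          ],
      }

      def primary_category(subcategory: str) -> str | None:
          # Map a subcategory label back to its primary category.
          for primary, subs in HARMFUL_CONTENT_TYPOLOGY.items():
              if subcategory in subs:
                  return primary
          return None

      assert primary_category("Scams") == "Exploitation"
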
  • Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements

    Description of 7 risk areas, as part of a survey on LLM risks: Toxicity and Abusive Content, Unfairness and Discrimination, Ethics and Morality Issues, Controversial Opinions, Misleading Information, Privacy and Data Leakage, Malicious Use and Unleashing AI Agents.

    Text Speech Vision
  • Llama 2: Open Foundation and Fine-Tuned Chat Models

    Description of 3 risk areas, as part of the safety checks for releasing Llama 2: (1) illicit and criminal activities (terrorism, theft, human trafficking), (2) hateful and harmful activities (defamation, self-harm, eating disorders, discrimination), and (3) unqualified advice (medical, financial and legal advice). Other risk categories are described as part of red teaming and soliciting feedback.

  • Ethical and social risks of harm from Language Models

    Two-tier taxonomy of risks, comprising both classification groups (of which there are 6) and associated harms (3 or 4 for each classification group). The classification groups are: (1) Discrimination, Exclusion and Toxicity, (2) Information Hazards, (3) Misinformation Harms, (4) Malicious Uses, (5) Human-Computer Interaction Harms, and (6) Automation, access, and environmental harms.

    Text Speech Vision
  • Sociotechnical Safety Evaluation of Generative AI Systems

    Two-tier taxonomy of risks, comprising both classification groups (of which there are 6) and associated harms (3 or 4 for each classification group). The classification groups are: (1) Representation and Toxicity Harms, (2) Misinformation Harms, (3) Information & Society Harms, (4) Malicious Use, (5) Human Autonomy & Integrity Harms, and (6) Socioeconomic & Environmental Harms.

    Text Speech Vision
  • Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

    Two-tier taxonomy of risks, with seven major categories of LLM trustworthiness, each of which has several associated sub-categories: (1) Reliability, (2) Safety, (3) Fairness, (4) Resistance to Misuse, (5) Explainability and Reasoning, (6) Social Norms, and (7) Robustness.

    Text Speech Vision
  • Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets

    Description of 8 risk areas, as part of describing methods for aligning models: (1) Abuse, Violence and Threat (inclusive of self-harm), (2) Health (physical and mental), (3) Human characteristics and behaviour, (4) Injustice and inequality (incl. discrimination, harmful stereotypes), (5) Political opinion and destabilization, (6) Relationships (romantic, familial, friendships), (7) Sexual activity (inclusive of pornography), (8) Terrorism (inclusive of white supremacy).

    Text Speech Vision
  • Sociotechnical Harms of Algorithmic Systems: Scoping a Taxonomy for Harm Reduction

    Description of 5 categories of harm, with detailed subcategories: (1) Representational harms, (2) Allocative harms, (3) Quality of Service harms, (4) Interpersonal harms, and (5) Social system harms.

    Text Speech Vision
  • Deepfakes, Phrenology, Surveillance, and More! A Taxonomy of AI Privacy Risks

    Taxonomy of 12 privacy risks, based on reviewing 321 privacy-related incidents, filtered from the AI, Algorithmic and Automation Incident and Controversy Repository (AIAAIC) Database. Risks are split into those that are created by AI (Identification, Distortion, Exposure, Aggregation, Phrenology/Physiognomy) and those that are exacerbated by AI (Intrusion, Surveillance, Exclusion, Secondary Use, Insecurity, Increased Accessibility).

    Text Speech Vision
  • The Ethical Implications of Generative Audio Models: A Systematic Literature Review

    Taxonomy of 12 “negative broader impacts” from generative models involving speech and music.

  • An Overview of Catastrophic AI Risks

    Taxonomy of 4 catastrophic AI risks, with subcategories: (1) Malicious use (Bioterrorism, Uncontrolled AI agents, AI capabilities for propaganda, Censorship and surveillance), (2) AI race (Autonomous weapons, Cyberwarfare, Automated human labour [mass unemployment and dependence on AI systems]), (3) Organizational risks (AI accidentally leaked/stolen), (4) Rogue AIs (Proxy gaming, Goal drift, Power-seeking, Deception).

    Text Speech Vision
  • The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation

    Taxonomy of 3 AI security risks, with subcategories: (1) Digital Security, (2) Physical Security, (3) Political Security.

    Text Speech Vision
  • Open-sourcing highly capable foundation models

    Description of risks from malicious use of AI: Influence operations, Surveillance and population control, Scamming and spear phishing, Cyber attacks, Biological and chemical weapons development. Some “extreme risks” are also described in the paper (e.g. disruption to key societal functions).

    Text Speech Vision
  • How Does Access Impact Risk? Assessing AI Foundation Model Risk Along a Gradient of Access

    Description of risks from open-sourcing models, including five instances of malicious use: (1) Fraud and other crime schemes, (2) Undermining of social cohesion and democratic processes, (3) Human rights abuses, (4) Disruption of critical infrastructure, and (5) State conflict.

    Text Speech Vision
  • OpenAI Preparedness Framework (Beta)

    Description of 4 catastrophic AI risks: (1) Cybersecurity, (2) Chemical, Biological, Radiological and Nuclear (CBRN) threats, (3) Persuasion, and (4) Model autonomy. The paper also highlights the risk of “unknown unknowns”.

    Text Speech Vision
  • Anthropic's Responsible Scaling Policy

    Framework with four tiers of model capability, from ASL-1 (smaller models) to ASL-4 (speculative), with increasing risk as models’ capability increases. It also describes 4 catastrophic AI risks: (1) Misuse risks, (2) CBRN risks, (3) Cyber risks, and (4) Autonomy and replication risks.

    Text Speech Vision
  • Model evaluation for extreme risks

    Framework of 9 dangerous capabilities of AI models: (1) Cyber-offense, (2) Deception, (3) Persuasion & manipulation, (4) Political strategy, (5) Weapons acquisition, (6) Long-horizon planning, (7) AI development, (8) Situational awareness, (9) Self-proliferation.

    Text Speech Vision
  • Frontier AI Regulation: Managing Emerging Risks to Public Safety

    Description of “sufficiently dangerous capabilities” of AI models that could cause serious harm and disruption on a global scale, such as synthesising new biological or chemical weapons or evading human control through deception and obfuscation.

    Text Speech Vision
  • The Fallacy of AI Functionality

    Taxonomy of four AI failure points: (1) Impossible tasks (either Conceptually impossible or Practically impossible), (2) Engineering failures (Design failures, Implementation failures, Missing Safety Features), (3) Post-Deployment Failures (Robustness Issues, Failure under Adversarial Attacks, Unanticipated Interactions), (4) Communication Failures (Falsified or Overstated Capabilities, Misrepresented Capabilities).

    Text Speech Vision
  • TASRA: a Taxonomy and Analysis of Societal-Scale Risks from AI

    Framework of 3 potential harms from AI: (1) Harm to people (individual harm, Group/community harm, Societal harm), (2) Harm to an Organisation or Enterprise, (3) Harm to a system.

    Text Speech Vision