OpenAI safety capability thresholds – 2026
This market contains three independent events, each resolving to the earliest month in which its condition is met:
- AI R&D (High): OpenAI publicly reports (in a System Card, blog post, or Preparedness Framework update) that one of its models meets or exceeds the “High” capability threshold for “Autonomous AI Research & Development” (previously “AI Self-Improvement”). The “High” threshold is defined in the Preparedness Framework as equivalent to a “performant mid-career ML research engineer”, evaluated via benchmarks including Monorepo-Bench, OpenAI-Proof Q&A, and MLE-Bench.
- Critical Bio/Chem: OpenAI publicly classifies one of its models as “Critical” in the Biological/Chemical capability category in a System Card, safety report, or Preparedness Framework assessment.
- Critical Cyber: OpenAI publicly classifies one of its models as “Critical” in the Cybersecurity capability category in a System Card, safety report, or Preparedness Framework assessment.
Each event resolves independently. “Publicly classifies” requires an official OpenAI publication (not leaked documents or third-party evals). If OpenAI discontinues the Preparedness Framework or renames its capability levels, a substantially equivalent assessment under the successor framework counts.
The starting probabilities of this market were calculated using logistic CDF models fitted to anchor points calibrated via cross-lab safety-framework history, METR capability-trend data, and published System Card evaluations. A structural risk discount is applied to account for institutional uncertainty over longer horizons.
Events
This market tracks three milestones from OpenAI’s Preparedness Framework:
- AI R&D (High): A GPT model meets the “High” capability threshold for Autonomous AI Research & Development, defined as equivalent to a “performant mid-career ML research engineer.” Evaluated via Monorepo-Bench, OpenAI-Proof Q&A, and MLE-Bench. As of early 2025, o3 reportedly scores “medium” on this evaluation. The gap from “medium” to “high” requires reliable multi-hour to multi-day autonomous performance with long-horizon planning, debugging, and iteration across messy research environments — a meaningfully harder bar than exam or coding excellence.
- Critical Bio/Chem: OpenAI publicly classifies a model as “Critical” in Biological/Chemical capability — the most severe level in the framework. GPT-4o was rated “low” (2024 System Card); o3 reportedly near “medium.” “Critical” requires the model to provide significant uplift for creating biological or chemical weapons beyond freely available information. This is the hardest category: scarcer domain training signal, strongest deployment inhibition, and strongest reputational reluctance to publicly acknowledge a crossing. Notably, Anthropic’s activation of ASL-3 protections in May 2025 was driven specifically by CBRN concerns, confirming that bio/chem capabilities are advancing — but ASL-3 is roughly equivalent to OpenAI’s “high” level, not “critical.”
- Critical Cyber: OpenAI publicly classifies a model as “Critical” in Cybersecurity capability. GPT-4o was rated “low” and o3 reportedly “medium.” “Critical” requires autonomous identification and exploitation of novel zero-day vulnerabilities in hardened systems, or end-to-end offensive cyber operations against state-level targets. More digital training signal is available than for bio/chem, and there is stronger overlap with general software capability, but “critical” remains an extreme threshold.
Cross-lab threshold history
A key calibration input is the rate at which comparable labs have publicly declared that a model crossed capability thresholds in their safety frameworks. The pattern is: intermediate crossings have occurred, but no lab has publicly reported a top-level crossing.
Anthropic Responsible Scaling Policy (RSP)
| Date | Event |
|---|---|
| Sep 2023 | RSP launched with ASL-1 through ASL-4+. All current models (including Claude) assessed as ASL-2. |
| Oct 2024 | Major RSP update. All current models still ASL-2. Stronger safeguards defined for Autonomous AI R&D and CBRN thresholds. |
| Mar 2025 | RSP v2.1: disaggregated AI R&D thresholds into two levels; added further CBRN threshold detail. |
| May 2025 | ASL-3 activated for Claude Opus 4, driven by CBRN concerns (steadily increasing performance on biological/chemical weapons evaluations). Enhanced deployment safeguards (Constitutional Classifiers, bug bounty) and security controls deployed. Claude Opus 4.5 and Sonnet 4.5 subsequently also deployed under ASL-3. |
| Feb 2026 | Anthropic states that Claude Opus 4.6 does not cross the AI R&D-4 threshold, while noting that ruling out such crossings is becoming harder. RSP v3.0 rewrite published. |
Key takeaway: Anthropic crossed from ASL-2 to ASL-3 (an intermediate threshold, roughly comparable to OpenAI’s “high”) 20 months after the framework launched. But nearly 2.5 years after launch, no model has crossed the top AI R&D threshold (ASL-4 equivalent). This confirms capability is advancing while underscoring that no top-level public classification has yet occurred.
Google DeepMind Frontier Safety Framework (FSF)
| Date | Event |
|---|---|
| May 2024 | FSF launched with Critical Capability Levels (CCLs) for autonomy, biosecurity, cybersecurity, and ML R&D. Described as exploratory, targeting full implementation by early 2025. |
| Feb 2025 | FSF updated with stronger security-level recommendations, more consistent deployment mitigation, and an explicit deceptive-alignment approach. Used in governance for Gemini 2.0. |
| Apr 2025 | Google publishes AGI safety governance discussion. Reaffirms ongoing dangerous-capability evaluations. No public severe-CCL crossing announced. |
Key takeaway: 11 months of framework operation, multiple updates, no public severe-threshold crossing.
What the cross-lab pattern means
The market resolves on OpenAI publicly classifying a model at the most severe level. The cross-lab evidence shows that:
- Labs do publicly classify when intermediate thresholds are crossed (Anthropic ASL-3 in May 2025)
- But top-level crossings have not been publicly reported by any lab
- Framework definitions are frequently revised, adding structural uncertainty
This means the probability of a publicly reported critical-level crossing is materially lower than the probability that latent capability reaches that level — a gap we call “public reporting drag.”
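Schematically (our framing, not a formula any lab publishes): P(public report by month t) = P(latent capability crossed by t) × P(public classification given a crossing). The second factor sits below 1, and the wedge it opens between the two curves is the reporting drag.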
METR capability trend
METR’s March 2025 analysis of frontier-agent task performance provides the strongest quantitative outside-view anchor for the AI R&D event:
- Frontier-agent task horizon has been doubling roughly every 7 months
- Current systems (e.g. Claude 3.7 Sonnet) achieve ~1-hour task horizon at 50% reliability
- Success rate drops to <10% on tasks taking humans more than ~4 hours
- If the trend continues, week-long tasks become plausible in 2–4 years; month-long tasks by end of decade
The “High” AI R&D threshold (mid-career ML research engineer) likely requires reliable multi-hour to multi-day performance with long-horizon planning, debugging, context maintenance, and iteration. This pushes against very high near-term crossing probability for the AI R&D event, even though the trend is steep.
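To make the extrapolation concrete, here is a minimal sketch of the doubling-trend arithmetic. The 1-hour baseline and 7-month doubling time come from the METR figures above; the 40-hour working week, 167-hour working month, and the treatment of the baseline as a fixed starting point are illustrative assumptions of ours.

```python
from math import log2

# Illustrative extrapolation of the METR task-horizon trend. The 1-hour,
# 50%-reliability baseline and 7-month doubling time are from the figures
# above; the work-week/work-month conversions are our assumptions.
BASELINE_HOURS = 1.0   # ~1-hour horizon at 50% reliability (early 2025)
DOUBLING_MONTHS = 7.0  # METR's estimated doubling time

def horizon_hours(months_after_baseline: float) -> float:
    """Projected 50%-reliability task horizon, in hours."""
    return BASELINE_HOURS * 2 ** (months_after_baseline / DOUBLING_MONTHS)

def months_to_reach(target_hours: float) -> float:
    """Months after the baseline until the projected horizon reaches target_hours."""
    return DOUBLING_MONTHS * log2(target_hours / BASELINE_HOURS)

print(f"week-long tasks (~40 h):   {months_to_reach(40) / 12:.1f} years")
print(f"month-long tasks (~167 h): {months_to_reach(167) / 12:.1f} years")
```

On these assumptions the trend line reaches week-long tasks roughly three years after the baseline and month-long tasks a bit over four years out, consistent with the ranges quoted above.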
OpenAI-specific evidence
- GPT-4o System Card (2024): rated “low” in both bio and cyber capability categories
- o3 (early 2025): reportedly “medium” in cyber and AI R&D, and closer to “medium” in bio than prior models
- The “High” AI R&D bar is explicitly benchmarked via Monorepo-Bench, OpenAI-Proof Q&A, and MLE-Bench — a materially harder standard than general coding or exam performance
Methodology
For each event, we set anchor points (month, cumulative probability) balancing four inputs:
- Current capability levels from published System Cards (GPT-4o rated “low”; o3 reportedly “medium” in several categories)
- METR task-horizon doubling trend (~7 months), which anchors the rate of AI R&D capability progress
- Cross-lab threshold-classification history: Anthropic took 20 months from RSP launch to the first intermediate crossing (ASL-3); no top-level crossing has occurred at any lab. Google’s FSF has produced no public severe CCL crossing in 11 months.
- Public reporting drag: labs classify capabilities conservatively in public — the event of a public severe-threshold crossing lags underlying capability. This is modeled as a separate factor reducing probabilities versus raw capability estimates.
A logistic CDF is fitted to each event’s anchors via least squares. A structural/institutional risk discount is then applied multiplicatively to the fitted output: a ~1.3%/year constant hazard rate for “framework becomes unresolvable” (discontinuation, restructuring beyond recognition, or refusal to report). This is calibrated so that cumulative structural risk reaches ~7% at the 5.5-year horizon (Dec 2030), compressing later-year probabilities — for example, AI R&D (High) at Dec 2030 drops from 97% raw to ~91% after the discount. Near-term effects are negligible. The discount level is consistent with the superforecaster calibration literature, which finds that well-calibrated forecasters rarely assign >95% probability to events more than a few years out.
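As a minimal sketch of this two-step pipeline (logistic least-squares fit, then hazard discount), the snippet below uses hypothetical anchor points; only the ~1.3%/year hazard rate is taken from the text, and `scipy.optimize.curve_fit` stands in for whatever fitting routine the market actually uses.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic_cdf(t, mu, s):
    """Logistic CDF: cumulative probability that the event has occurred by month t."""
    return 1.0 / (1.0 + np.exp(-(t - mu) / s))

# Hypothetical anchors for one event: (months from now, cumulative probability).
anchor_months = np.array([12.0, 36.0, 66.0])
anchor_probs = np.array([0.10, 0.60, 0.97])

# Least-squares fit of location (mu) and scale (s) to the anchors.
(mu, s), _ = curve_fit(logistic_cdf, anchor_months, anchor_probs, p0=[36.0, 12.0])

# Multiplicative structural discount: constant ~1.3%/year hazard that the
# framework becomes unresolvable, so the survival factor is exp(-hazard * years).
HAZARD_PER_YEAR = 0.013

def discounted_prob(t_months: float) -> float:
    raw = logistic_cdf(t_months, mu, s)
    survival = np.exp(-HAZARD_PER_YEAR * t_months / 12.0)
    return float(raw * survival)

# At the 5.5-year horizon, cumulative structural risk is ~7%, so a raw ~97%
# compresses to roughly 90-91%, matching the example in the text.
t = 66.0  # 5.5 years, in months
print(f"raw: {logistic_cdf(t, mu, s):.3f}  discounted: {discounted_prob(t):.3f}")
```

The discount is a plain survival factor, exp(−hazard × years), applied to the fitted curve, which is why it barely moves near-term probabilities but compresses the late-decade tail.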
Key assumptions
- OpenAI continues publishing capability assessments (System Cards or equivalent)
- The Preparedness Framework categories and thresholds remain substantially similar (the structural discount accounts for risk they do not)
- AI R&D capability progresses fastest (most training signal, direct commercial incentive), but the “High” bar is hard
- Cyber sits below AI R&D (extreme threshold, no top-level cross-lab classification history)
- Bio/Chem progresses slowest (scarce domain training signal, strongest deployment inhibition, strongest reluctance to publicise) — though Anthropic’s ASL-3 activation confirms underlying bio/chem capability is advancing
- Public reporting of top-level thresholds consistently lags underlying capability