OpenAI safety capability thresholds – 2026
This market contains three independent events, each resolving to the earliest month in which its condition is met:
- AI R&D (High): OpenAI publicly reports (in a System Card, blog post, or Preparedness Framework update) that one of its models meets or exceeds the “High” capability threshold for “Autonomous AI Research & Development” (previously “AI Self-Improvement”). The “High” threshold is defined in the Preparedness Framework as equivalent to a “performant mid-career ML research engineer”, evaluated via benchmarks including Monorepo-Bench, OpenAI-Proof Q&A, and MLE-Bench.
- Critical Bio/Chem: OpenAI publicly classifies one of its models as “Critical” in the Biological/Chemical capability category in a System Card, safety report, or Preparedness Framework assessment.
- Critical Cyber: OpenAI publicly classifies one of its models as “Critical” in the Cybersecurity capability category in a System Card, safety report, or Preparedness Framework assessment.
Each event resolves independently. “Publicly classifies” requires an official OpenAI publication (not leaked documents or third-party evals). If OpenAI discontinues the Preparedness Framework or renames its capability levels, a substantially equivalent assessment under the successor framework counts.
The starting probabilities of this market were calculated using logistic CDF models fitted to anchor points calibrated via cross-lab safety-framework history, METR capability-trend data, and published System Card evaluations. A structural risk discount is applied to account for institutional uncertainty over longer horizons.
Events
This market tracks three milestones from OpenAI’s Preparedness Framework:
- AI R&D (High): A GPT model meets the “High” capability threshold for Autonomous AI Research & Development, defined as equivalent to a “performant mid-career ML research engineer.” Evaluated via Monorepo-Bench, OpenAI-Proof Q&A, and MLE-Bench. As of early 2025, o3 reportedly scores “medium” on this evaluation. The gap from “medium” to “high” requires reliable multi-hour to multi-day autonomous performance with long-horizon planning, debugging, and iteration across messy research environments — a meaningfully harder bar than exam or coding excellence.
- Critical Bio/Chem: OpenAI publicly classifies a model as “Critical” in Biological/Chemical capability — the most severe level in the framework. GPT-4o was rated “low” (2024 System Card); o3 reportedly near “medium.” “Critical” requires the model to provide significant uplift for creating biological or chemical weapons beyond freely available information. This is the hardest category: scarcer domain training signal, strongest deployment inhibition, and strongest reputational reluctance to publicly acknowledge a crossing. Notably, Anthropic’s activation of ASL-3 protections in May 2025 was driven specifically by CBRN concerns, confirming that bio/chem capabilities are advancing — but ASL-3 is roughly equivalent to OpenAI’s “high” level, not “critical.”
- Critical Cyber: OpenAI publicly classifies a model as “Critical” in Cybersecurity capability. GPT-4o was rated “low” and o3 reportedly “medium.” “Critical” requires autonomous identification and exploitation of novel zero-day vulnerabilities in hardened systems, or end-to-end offensive cyber operations against state-level targets. More digital training signal is available than for bio/chem, and there is stronger overlap with general software capability, but “critical” remains an extreme threshold.
Cross-lab threshold history
A key calibration input is the rate at which comparable labs have publicly declared that a model crossed capability thresholds in their safety frameworks. The pattern is: intermediate crossings have occurred, but no lab has publicly reported a top-level crossing.
Anthropic Responsible Scaling Policy (RSP)
| Date | Event |
|---|---|
| Sep 2023 | RSP launched with ASL-1 through ASL-4+. All current models (including Claude) assessed as ASL-2. |
| Oct 2024 | Major RSP update. All current models still ASL-2. Stronger safeguards defined for Autonomous AI R&D and CBRN thresholds. |
| Mar 2025 | RSP v2.1: disaggregated AI R&D thresholds into two levels; added further CBRN threshold detail. |
| May 2025 | ASL-3 activated for Claude Opus 4, driven by CBRN concerns (steadily increasing performance on biological/chemical weapons evaluations). Enhanced deployment safeguards (Constitutional Classifiers, bug bounty) and security controls deployed. Claude Opus 4.5 and Sonnet 4.5 subsequently also deployed under ASL-3. |
| Feb 2026 | Anthropic states that Claude Opus 4.6 does not cross the AI R&D-4 threshold, while noting that ruling out such crossings is becoming harder. RSP v3.0 rewrite published. |
Key takeaway: Anthropic crossed from ASL-2 to ASL-3 (an intermediate threshold, roughly comparable to OpenAI’s “high”) 20 months after the framework launched. But nearly 2.5 years after launch, no model has crossed the top AI R&D threshold (ASL-4 equivalent). This confirms capability is advancing while underscoring that no top-level public classification has yet occurred.
Google DeepMind Frontier Safety Framework (FSF)
| Date | Event |
|---|---|
| May 2024 | FSF launched with Critical Capability Levels (CCLs) for autonomy, biosecurity, cybersecurity, and ML R&D. Described as exploratory, targeting full implementation by early 2025. |
| Feb 2025 | FSF updated with stronger security-level recommendations, more consistent deployment mitigation, and an explicit deceptive-alignment approach. Used in governance for Gemini 2.0. |
| Apr 2025 | Google publishes AGI safety governance discussion. Reaffirms ongoing dangerous-capability evaluations. No public severe-CCL crossing announced. |
Key takeaway: 11 months of framework operation, multiple updates, no public severe-threshold crossing.
What the cross-lab pattern means
The market resolves on OpenAI publicly classifying a model at the most severe level. The cross-lab evidence shows that:
- Labs do publicly classify when intermediate thresholds are crossed (Anthropic ASL-3 in May 2025)
- But top-level crossings have not been publicly reported by any lab
- Framework definitions are frequently revised, adding structural uncertainty
This means the probability of a publicly reported critical-level crossing is materially lower than the probability that latent capability reaches that level — a gap we call “public reporting drag.”
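Schematically (our framing, not a formula any lab publishes): P(public report by month t) = P(latent capability crossed by t) × P(public classification given a crossing). The second factor sits below 1, and the wedge it opens between the two curves is the reporting drag.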
METR capability trend
METR’s March 2025 analysis of frontier-agent task performance provides the strongest quantitative outside-view anchor for the AI R&D event:
- Frontier-agent task horizon has been doubling roughly every 7 months
- Current systems (e.g. Claude 3.7 Sonnet) achieve ~1-hour task horizon at 50% reliability
- Success rate drops to <10% on tasks taking humans more than ~4 hours
- If the trend continues, week-long tasks become plausible in 2–4 years; month-long tasks by end of decade
The “High” AI R&D threshold (mid-career ML research engineer) likely requires reliable multi-hour to multi-day performance with long-horizon planning, debugging, context maintenance, and iteration. This pushes against very high near-term crossing probability for the AI R&D event, even though the trend is steep.
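To make the extrapolation concrete, here is a minimal sketch of the doubling-trend arithmetic. The 1-hour baseline and 7-month doubling time come from the METR figures above; the 40-hour working week, 167-hour working month, and the treatment of the baseline as a fixed starting point are illustrative assumptions of ours.

```python
from math import log2

# Illustrative extrapolation of the METR task-horizon trend. The 1-hour,
# 50%-reliability baseline and 7-month doubling time are from the figures
# above; the work-week/work-month conversions are our assumptions.
BASELINE_HOURS = 1.0   # ~1-hour horizon at 50% reliability (early 2025)
DOUBLING_MONTHS = 7.0  # METR's estimated doubling time

def horizon_hours(months_after_baseline: float) -> float:
    """Projected 50%-reliability task horizon, in hours."""
    return BASELINE_HOURS * 2 ** (months_after_baseline / DOUBLING_MONTHS)

def months_to_reach(target_hours: float) -> float:
    """Months after the baseline until the projected horizon reaches target_hours."""
    return DOUBLING_MONTHS * log2(target_hours / BASELINE_HOURS)

print(f"week-long tasks (~40 h):   {months_to_reach(40) / 12:.1f} years")
print(f"month-long tasks (~167 h): {months_to_reach(167) / 12:.1f} years")
```

On these assumptions the trend line reaches week-long tasks roughly three years after the baseline and month-long tasks a bit over four years out, consistent with the ranges quoted above.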
OpenAI-specific evidence
- GPT-4o System Card (2024): rated “low” in both bio and cyber capability categories
- o3 (early 2025): reportedly “medium” in cyber and AI R&D, and closer to “medium” in bio than prior models
- The “High” AI R&D bar is explicitly benchmarked via Monorepo-Bench, OpenAI-Proof Q&A, and MLE-Bench — a materially harder standard than general coding or exam performance
Methodology
For each event, we set anchor points (month, cumulative probability) balancing four inputs:
- Current capability levels from published System Cards (GPT-4o rated “low”; o3 reportedly “medium” in several categories)
- METR task-horizon doubling trend (~7 months), which anchors the rate of AI R&D capability progress
- Cross-lab threshold-classification history: Anthropic took 20 months from RSP launch to the first intermediate crossing (ASL-3); no top-level crossing has occurred at any lab. Google’s FSF has produced no public severe CCL crossing in 11 months.
- Public reporting drag: labs classify capabilities conservatively in public — the event of a public severe-threshold crossing lags underlying capability. This is modeled as a separate factor reducing probabilities versus raw capability estimates.
A logistic CDF is fitted to each event’s anchors via least squares. A structural/institutional risk discount is then applied multiplicatively to the fitted output: a ~1.3%/year constant hazard rate for “framework becomes unresolvable” (discontinuation, restructuring beyond recognition, or refusal to report). This is calibrated so that cumulative structural risk reaches ~7% at the 5.5-year horizon (Dec 2030), compressing later-year probabilities — for example, AI R&D (High) at Dec 2030 drops from 97% raw to ~91% after the discount. Near-term effects are negligible. The discount level is consistent with the superforecaster calibration literature, which finds that well-calibrated forecasters rarely assign >95% probability to events more than a few years out.
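As a minimal sketch of this two-step pipeline (logistic least-squares fit, then hazard discount), the snippet below uses hypothetical anchor points; only the ~1.3%/year hazard rate is taken from the text, and `scipy.optimize.curve_fit` stands in for whatever fitting routine the market actually uses.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic_cdf(t, mu, s):
    """Logistic CDF: cumulative probability that the event has occurred by month t."""
    return 1.0 / (1.0 + np.exp(-(t - mu) / s))

# Hypothetical anchors for one event: (months from now, cumulative probability).
anchor_months = np.array([12.0, 36.0, 66.0])
anchor_probs = np.array([0.10, 0.60, 0.97])

# Least-squares fit of location (mu) and scale (s) to the anchors.
(mu, s), _ = curve_fit(logistic_cdf, anchor_months, anchor_probs, p0=[36.0, 12.0])

# Multiplicative structural discount: constant ~1.3%/year hazard that the
# framework becomes unresolvable, so the survival factor is exp(-hazard * years).
HAZARD_PER_YEAR = 0.013

def discounted_prob(t_months: float) -> float:
    raw = logistic_cdf(t_months, mu, s)
    survival = np.exp(-HAZARD_PER_YEAR * t_months / 12.0)
    return float(raw * survival)

# At the 5.5-year horizon, cumulative structural risk is ~7%, so a raw ~97%
# compresses to roughly 90-91%, matching the example in the text.
t = 66.0  # 5.5 years, in months
print(f"raw: {logistic_cdf(t, mu, s):.3f}  discounted: {discounted_prob(t):.3f}")
```

The discount is a plain survival factor, exp(−hazard × years), applied to the fitted curve, which is why it barely moves near-term probabilities but compresses the late-decade tail.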
Key assumptions
- OpenAI continues publishing capability assessments (System Cards or equivalent)
- The Preparedness Framework categories and thresholds remain substantially similar (the structural discount accounts for risk they do not)
- AI R&D capability progresses fastest (most training signal, direct commercial incentive), but the “High” bar is hard
- Cyber sits below AI R&D (extreme threshold, no top-level cross-lab classification history)
- Bio/Chem progresses slowest (scarce domain training signal, strongest deployment inhibition, strongest reluctance to publicise) — though Anthropic’s ASL-3 activation confirms underlying bio/chem capability is advancing
- Public reporting of top-level thresholds consistently lags underlying capability