Google DeepMind safety capability thresholds – 2027
This market contains four independent events, each resolving to the earliest month in which the condition is met:
- CBRN Uplift 1: Google publicly reports (in an FSF report, model card, or blog post) that a Gemini model meets Critical Capability Level CBRN Uplift Level 1. Under FSF v3.0, this is ‘provides low to medium resourced actors uplift in reference scenarios resulting in additional expected harm at severe scale.’
- Cyber Uplift 1: Google publicly reports a Gemini model meets CCL Cyber Uplift Level 1. Under FSF v3.0, this is ‘provides sufficient uplift with high impact cyber attacks for additional expected harm at severe scale.’
- ML R&D Automation 1: Google publicly reports a Gemini model meets CCL ML R&D Automation Level 1. Under FSF v3.0, this is ‘can fully automate the work of any team of researchers at Google focused on improving AI capabilities, with approximately comparable all-inclusive costs.’
- ML R&D Acceleration 1: Google publicly reports a Gemini model meets CCL ML R&D Acceleration Level 1. Under FSF v3.0, this is ‘has been used to accelerate AI development, resulting in AI progress substantially accelerating from historical rates.’
Each event resolves independently. Resolution requires an official Google DeepMind publication (FSF report, model card, safety blog post) — not leaked documents or third-party evals. If Google discontinues the FSF or renames CCLs, a substantially equivalent assessment under the successor framework counts. An ‘alert threshold’ being met does NOT count as the CCL being met.
The starting probabilities of this market were calculated using logistic CDF models fitted to anchor points calibrated via Google DeepMind’s own FSF assessment history, cross-lab safety-framework patterns, and METR capability-trend data. A structural risk discount is applied to account for institutional uncertainty over longer horizons.
Events
This market tracks four Critical Capability Level (CCL) milestones from Google DeepMind’s Frontier Safety Framework (FSF):
- CBRN Uplift 1: Under FSF v3.0, this is defined as “provides low to medium resourced actors uplift in reference scenarios resulting in additional expected harm at severe scale.” Earlier versions describe this as the model significantly assisting a low-resourced actor with dual-use scientific protocols, resulting in a substantial increase in ability to cause a mass casualty event. Gemini 2.5 Deep Think (Aug 2025) reached the alert threshold, and Google could not initially rule out that the CCL itself had been met. Subsequent analysis under FSF v3.0 confirmed that it had not. Gemini 3 Pro (Nov 2025) and 3.1 Pro (Feb 2026) remain below the CCL.
- Cyber Uplift 1: “Provides sufficient uplift with high impact cyber attacks for additional expected harm at severe scale.” Earlier versions: “can be used to significantly assist with high impact cyber attacks, resulting in overall cost/resource reductions of an order of magnitude or more.” The alert threshold has been persistently met since Gemini 2.5 Pro (mid-2025), but the CCL itself has not been reached through Gemini 3.1 Pro.
- ML R&D Automation 1: “Can fully automate the work of any team of researchers at Google focused on improving AI capabilities, with approximately comparable all-inclusive costs.” This is the hardest threshold in this market because it combines “any team” with “fully automate” at competitive cost. FSF v3.0 recommends Security Level 4 (the highest) for this CCL. Gemini 3 Pro performance on RE-Bench “continues to fall substantially below the capability required.”
- ML R&D Acceleration 1: “Has been used to accelerate AI development, resulting in AI progress substantially accelerating from historical rates.” Earlier versions specify ~2x acceleration from 2020–2024 rates. This is an observational threshold: it requires demonstrated real-world impact, not just capability. It is easier to meet than full automation, but it requires that AI tools have actually been deployed and have measurably accelerated AI progress.
Google FSF assessment history
The FSF has gone through three major versions in 16 months, showing rapid framework evolution:
| Date | Event |
|---|---|
| May 2024 | FSF v1.0 launched with CCLs for autonomy, biosecurity, cybersecurity, and ML R&D. Described as exploratory. |
| Feb 2025 | FSF v2.0: stronger security-level recommendations, explicit deceptive-alignment approach. Used in governance for Gemini 2.0. |
| Aug 2025 | Gemini 2.5 Deep Think model card: CBRN Uplift 1 alert threshold reached. Google could not initially rule out CCL had been met; deployed with precautionary mitigations. Cyber Uplift 1 alert threshold also met. |
| Sep 2025 | FSF v3.0: major restructuring, with thresholds revised to be “more appropriately calibrated to real world harm.” Under the updated framework, the CBRN alert threshold is no longer considered met for the latest models. |
| Nov 2025 | Gemini 3 Pro FSF Report: all CCLs confirmed not met. ML R&D Automation “substantially below.” Cyber alert threshold continues to be met. |
| Feb 2026 | Gemini 3.1 Pro: remains below all CCLs. Cyber alert threshold still met. |
Key pattern: alert thresholds serve as early warnings and have been triggered for CBRN and Cyber, but actual CCL crossings remain elusive. The FSF 3.0 recalibration also illustrates framework churn — thresholds that seemed close to being met were retroactively assessed as further away under a revised framework.
Cross-lab reference points
- Anthropic: ASL-3 activated May 2025 (CBRN-driven). Claude Opus 4.6 does not cross AI R&D-4 (Feb 2026). This confirms that intermediate thresholds are being crossed while top-level thresholds remain out of reach.
- OpenAI: o3 reportedly “medium” on AI R&D and cyber. GPT-4o rated “low” in bio/cyber. No “critical” classification.
METR capability trend
METR’s March 2025 analysis: frontier-agent task horizon doubling every ~7 months. Current systems at ~1-hour task horizon. ML R&D Automation 1 (“fully automate any team”) likely requires sustained multi-day reliable performance, placing it furthest out. CBRN and Cyber CCLs depend more on domain-specific knowledge uplift than raw task duration.
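The doubling arithmetic above can be made concrete with a short sketch. It is a naive extrapolation that assumes the ~7-month doubling holds indefinitely and that current systems sit at a ~1-hour horizon; the function name `months_to_reach` and the target horizons are illustrative assumptions, not values taken from METR or from this market.

```python
import math

def months_to_reach(target_hours, current_hours=1.0, doubling_months=7.0):
    """Months until the task horizon reaches target_hours.

    Assumes pure exponential growth: h(t) = current_hours * 2 ** (t / doubling_months).
    Solving for t gives t = doubling_months * log2(target_hours / current_hours).
    """
    if target_hours <= 0 or current_hours <= 0:
        raise ValueError("horizons must be positive")
    return doubling_months * math.log2(target_hours / current_hours)

# Illustrative targets only (assumptions, not thresholds defined by the FSF).
one_working_day = months_to_reach(8.0)    # three doublings from 1 hour
two_full_days = months_to_reach(48.0)     # a rough stand-in for "multi-day"
print(one_working_day, round(two_full_days, 1))
```

Under these assumptions an 8-hour horizon is three doublings (21 months) away, and a multi-day horizon is further still, which is consistent with placing ML R&D Automation 1 furthest out. The sketch says nothing about reliability at that horizon, which the CCL would also require.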
Methodology
For each event, we set anchor points (month, cumulative probability) balancing:
- Google’s own assessments: model cards and FSF reports provide direct evidence on proximity to each CCL (alert thresholds reached, CCLs not met, “substantially below”)
- Alert-threshold-to-CCL lag: CBRN and Cyber alert thresholds have been met since mid-2025 without the CCLs being crossed, suggesting meaningful distance between alert and CCL
- METR task-horizon trend (~7-month doubling), primarily relevant for ML R&D thresholds
- Cross-lab evidence: no lab has reported top-level crossings, and an intermediate crossing (Anthropic ASL-3) took ~20 months from framework launch
- FSF framework churn: three versions in 16 months, including recalibration that moved models further from thresholds
A logistic CDF is fitted to each event’s anchors by least squares. A structural (institutional) risk discount is then applied multiplicatively, using a constant hazard rate of ~1.3%/year for the risk that the “framework becomes unresolvable” (FSF restructuring, CCL redefinition, reporting changes). The Google FSF’s rapid evolution (v1 → v3 in 16 months) makes this structural risk particularly relevant.
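The fit-then-discount procedure can be sketched as follows. This is a minimal illustration, not the market's actual implementation: it assumes the least-squares fit is done in logit space (where the logistic CDF is linear in time), assumes the discount takes the form exp(-hazard × years), and uses made-up anchor points. The names `fit_logistic_cdf`, `logistic_cdf`, and `discounted_cdf` are hypothetical.

```python
import math

def fit_logistic_cdf(anchors):
    """Fit F(t) = 1 / (1 + exp(-(t - mu) / s)) to (month, probability) anchors.

    Assumption: ordinary least squares in logit space, where
    logit(p) = (t - mu) / s = a + b * t is a straight line.
    """
    ts = [t for t, _ in anchors]
    ys = [math.log(p / (1.0 - p)) for _, p in anchors]
    n = len(anchors)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    b = sxy / sxx              # slope = 1 / s
    a = y_mean - b * t_mean    # intercept = -mu / s
    return -a / b, 1.0 / b     # (mu, s)

def logistic_cdf(t, mu, s):
    """Cumulative probability that the event has occurred by month t."""
    return 1.0 / (1.0 + math.exp(-(t - mu) / s))

def discounted_cdf(t_months, mu, s, hazard_per_year=0.013):
    """Multiply the fitted CDF by the probability the framework is still resolvable.

    Assumption: a constant hazard gives survival exp(-hazard * years).
    """
    survival = math.exp(-hazard_per_year * t_months / 12.0)
    return logistic_cdf(t_months, mu, s) * survival

# Made-up anchors (months from market start, cumulative probability).
example_anchors = [(12, 0.05), (24, 0.15), (36, 0.30)]
mu, s = fit_logistic_cdf(example_anchors)
for month in (12, 24, 36):
    print(month, round(logistic_cdf(month, mu, s), 3), round(discounted_cdf(month, mu, s), 3))
```

Fitting in logit space keeps the example closed-form and dependency-free. A fit in probability space would weight the anchors differently and could give somewhat different parameters, and the source does not say which was used.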
Key assumptions
- Google continues publishing capability assessments (FSF reports, model cards, or equivalent)
- The CCL concepts remain substantially similar even as FSF versions evolve
- Cyber Uplift 1 is slightly closer to being met than CBRN Uplift 1 (its alert threshold has been persistently met)
- ML R&D Acceleration 1 is easier to meet than ML R&D Automation 1 (partial acceleration vs full automation)
- ML R&D Automation 1 is the hardest threshold (highest recommended security level, “substantially below” assessment)
- Alert-threshold-to-CCL distance is meaningful (months to years, not weeks)