The starting probabilities of this market were calculated using logistic CDF
models fitted to anchor points calibrated via Google DeepMind’s own FSF
assessment history, cross-lab safety-framework patterns, and METR capability-trend
data. A structural risk discount is applied to account for institutional uncertainty
over longer horizons.
Events
This market tracks four Critical Capability Level (CCL) milestones from Google
DeepMind’s
Frontier
Safety Framework (FSF):
- CBRN Uplift 1: Under
FSF
v3.0, this is defined as “provides low to medium resourced actors uplift
in reference scenarios resulting in additional expected harm at severe scale.”
Earlier versions describe this as the model significantly assisting a low-resourced
actor with dual-use scientific protocols, resulting in a substantial increase in
ability to cause a mass casualty event. The alert threshold was reached by Gemini 2.5
Deep Think (Aug 2025), but Google could not initially rule out the CCL had been met.
Subsequent FSF 3.0 analysis confirmed it was not. Gemini 3 Pro (Nov 2025) and 3.1 Pro
(Feb 2026) remain below the CCL.
- Cyber Uplift 1: “Provides sufficient uplift with high
impact cyber attacks for additional expected harm at severe scale.” Earlier
versions: “can be used to significantly assist with high impact cyber attacks,
resulting in overall cost/resource reductions of an order of magnitude or
more.” The alert threshold has been persistently met since Gemini 2.5 Pro
(mid-2025), but the CCL itself has not been reached through Gemini 3.1 Pro.
- ML R&D Automation 1: “Can fully automate the work of
any team of researchers at Google focused on improving AI capabilities, with
approximately comparable all-inclusive costs.” This is the hardest threshold in
this market — “any team” + “fully automate” at
competitive cost. FSF v3.0 recommends Security Level 4 (the highest). Gemini 3 Pro
performance on RE-Bench “continues to fall substantially below the capability
required.”
- ML R&D Acceleration 1: “Has been used to accelerate AI
development, resulting in AI progress substantially accelerating from historical
rates.” Earlier versions specify ~2x acceleration from 2020–2024 rates.
This is an observational threshold — it requires demonstrated real-world impact,
not just capability. Easier than full automation but requires that AI tools have
actually been deployed and measurably accelerated AI progress.
Google FSF assessment history
The FSF has gone through three major versions in 16 months, showing rapid framework
evolution:
| Date | Event |
| May 2024 | FSF v1.0 launched with CCLs for autonomy, biosecurity,
cybersecurity, and ML R&D. Described as exploratory. |
| Feb 2025 | FSF v2.0: stronger security-level recommendations, explicit
deceptive-alignment approach. Used in governance for Gemini 2.0. |
| Aug 2025 | Gemini 2.5 Deep Think model card: CBRN Uplift 1 alert
threshold reached. Google could not initially rule out CCL had been met; deployed with
precautionary mitigations. Cyber Uplift 1 alert threshold also met. |
| Sep 2025 | FSF v3.0: major restructuring, recalibrated to be
“more appropriately calibrated to real world harm.” Under updated
framework, CBRN alert threshold no longer considered met for latest models. |
| Nov 2025 | Gemini 3 Pro FSF Report: all CCLs confirmed not met. ML
R&D Automation “substantially below.” Cyber alert threshold
continues to be met. |
| Feb 2026 | Gemini 3.1 Pro: remains below all CCLs. Cyber alert threshold
still met. |
Key pattern: alert thresholds serve as early warnings and have been triggered for
CBRN and Cyber, but actual CCL crossings remain elusive. The FSF 3.0 recalibration
also illustrates framework churn — thresholds that seemed close to being met
were retroactively assessed as further away under a revised framework.
Cross-lab reference points
- Anthropic: ASL-3 activated May 2025 (CBRN-driven). Claude Opus
4.6 does not cross AI R&D-4 (Feb 2026). Confirms intermediate thresholds being
crossed but top-level thresholds remaining out of reach.
- OpenAI: o3 reportedly “medium” on AI R&D and
cyber. GPT-4o rated “low” in bio/cyber. No “critical”
classification.
METR capability trend
METR’s
March 2025 analysis: frontier-agent task horizon doubling every ~7 months.
Current systems at ~1-hour task horizon. ML R&D Automation 1 (“fully
automate any team”) likely requires sustained multi-day reliable performance,
placing it furthest out. CBRN and Cyber CCLs depend more on domain-specific
knowledge uplift than raw task duration.
Methodology
For each event, we set anchor points (month, cumulative probability) balancing:
- Google’s own assessments: model cards and FSF reports
provide direct evidence on proximity to each CCL (alert thresholds reached, CCLs
not met, “substantially below”)
- Alert-threshold-to-CCL lag: CBRN and Cyber alert thresholds have
been met since mid-2025 without the CCLs being crossed, suggesting meaningful
distance between alert and CCL
- METR task-horizon trend (~7-month doubling), primarily relevant
for ML R&D thresholds
- Cross-lab evidence: no lab has reported top-level crossings;
intermediate crossings (Anthropic ASL-3) taking ~20 months from framework
launch
- FSF framework churn: three versions in 16 months, including
recalibration that moved models further from thresholds
A logistic CDF is fitted to each event’s anchors via least-squares. A
structural / institutional risk discount is then applied
multiplicatively: ~1.3%/year constant hazard rate for “framework becomes
unresolvable” (FSF restructuring, CCL redefinition, reporting changes). The
Google FSF’s rapid evolution (v1 → v3 in 16 months) makes this structural
risk particularly relevant.
Key assumptions
- Google continues publishing capability assessments (FSF reports, model cards, or
equivalent)
- The CCL concepts remain substantially similar even as FSF versions evolve
- Cyber Uplift 1 slightly ahead of CBRN Uplift 1 (persistent alert threshold)
- ML R&D Acceleration 1 easier than ML R&D Automation 1 (partial
acceleration vs full automation)
- ML R&D Automation 1 is the hardest threshold (highest recommended security
level, “substantially below” assessment)
- Alert-threshold-to-CCL distance is meaningful (months to years, not weeks)
Sources