As with the Anthropic question, requiring an official public report that explicitly states a capability threshold has been met is reasonable for the sake of a clear resolution, but it introduces noise with respect to forecasts about the underlying reality. So far, Anthropic, DeepMind, and OpenAI do not seem to have ever clearly stated that some high threshold has definitely been crossed, beyond precautionary language, alert thresholds, "cannot rule out" phrasing, or the case of OpenAI treating models as High capability in some domains without saying unequivocally that the thresholds have been crossed.
In general, I expect a substantial delay between when models actually have a capability (perhaps requiring some elicitation and workflows) and when the capability threshold is explicitly declared met. There is also the risk that the capability is never explicitly announced as met, especially since safety frameworks are usually rewritten to focus on thresholds that have not yet been reached. DeepMind does have a more explicit two-tier mechanism of alert thresholds followed by a Critical Capability Level being met, so there is more room to simply stay at the alert level, which already serves most of the prosocial, common-knowledge-creating function (and DeepMind has not promised to make a public announcement when a CCL is reached; they might just share the information with appropriate authorities).
Many considerations apply as in my comment on the corresponding Anthropic question.
The thresholds are quite vague (which creates a large part of the uncertainty), but:
DeepMind's CBRN Uplift 1 seems roughly analogous to Anthropic's CBRN-3 and to OpenAI's High biological/chemical capability. Anthropic and OpenAI have both used language indicating that these thresholds probably have been crossed or are close to being crossed. Were it not for the uncertainty around the requirement of an official DeepMind publication, my forecasts here would be considerably higher.
Regarding Cyber Uplift 1, it seems that something like Claude Mythos could easily reach this threshold, so I would be surprised if DeepMind could not reach this level (in reality, if not in official CCL announcements) this year.
Current models could arguably count as reaching ML R&D Acceleration 1, but the "substantially accelerating" wording leaves a lot of room for ambiguity. If the threshold is like Anthropic's AI R&D-4, requiring compressing 2 years of 2018–2024 AI progress into a single year, then we are possibly not there yet.
ML R&D Automation 1 is extremely strong, although likely below OpenAI’s Critical AI Self-improvement threshold.
A nitpick: restricting the resolution criteria to a "Gemini" model is probably not intended (I don't expect the idea is to exclude DeepMind models that for some reason are not called "Gemini").
Some disconnected, rough thoughts:
As far as I can tell from some research, Anthropic hasn't yet even clearly stated that they have crossed the CBRN-3 threshold ("the ability to significantly assist individuals or groups with basic STEM backgrounds in obtaining, producing, or deploying CBRN weapons"), although they did activate ASL-3 protections. In general, it seems that the standards of evidence they are using for the "rule in" capability thresholds are high, and I also expect they have few incentives to perform strict experiments just to clearly confirm dangerous capabilities (relatively little upside in doing so, as long as they internally act precautionarily when not "ruling out" dangerous capability levels).
It is also not clear how this question handles "AI R&D-4". The standards for automated R&D in RSP v3.0 seem substantially higher than AI R&D-4 from the previous RSP, particularly when reading things such as the Claude Mythos Preview System Card, which among other things suggests that a productivity uplift on individual tasks on the order of 40× (an order of magnitude above the current geometric mean of ~4× from surveys) would be required to yield an overall progress multiplier of 2×.
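To make the 40×-to-2× relationship concrete, here is a rough Amdahl's-law-style sketch. This is my framing, not necessarily the system card's exact model: assume a fraction p of total R&D work gets a speedup s on individual tasks, while the rest proceeds at normal speed.

```python
# Rough Amdahl's-law-style sketch (my framing, not the system card's exact
# model): a fraction p of R&D work is sped up by factor s on individual tasks.
def overall_multiplier(p: float, s: float) -> float:
    return 1.0 / ((1.0 - p) + p / s)

# For a 2x overall multiplier with s = 40, solve 1/((1-p) + p/40) = 2:
# p = 0.5 / (1 - 1/40) ~= 0.513, i.e. over half of all work must be sped up.
print(overall_multiplier(0.513, 40))  # ~2.0
print(overall_multiplier(0.513, 4))   # ~1.6 with today's ~4x task uplift
```

Under this framing, even a 40× task-level uplift only doubles overall progress if it covers more than half of all R&D work, which helps explain why the v3.0 standard reads as much higher.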
My reading is that Anthropic will officially need to acknowledge that any of the definitions has been met.
Now that AI R&D-4 has been deprecated, this lower threshold may never be officially tested again.
In general, one fundamental issue with these questions is that there will likely be a substantial gap (months at least) between the point where the actual underlying capability exists (particularly under refined scaffolds, with people prepared to smartly leverage the model's strengths and with good incentives for successful elicitation, which probably does not describe the CBRN tests from Anthropic or similar tests from other companies) and the point where the capability is officially announced.
Reading things like Biorisk \ red.anthropic.com, info about Mythos, and METR's time-horizon progress in general (even though that is purely about software engineering) makes me expect that (actual) CBRN-4 is much closer than the baseline starting probabilities suggest, if not already here. However, related to a previous observation, there seems to be little benefit in the current environment to clearly stating that their models pose high CBRN risk: the current administration or adversaries could use an honest assessment against Anthropic, and the upside of showing others a very clear lower bound on dangerous capabilities seems low at the moment, both on commercial grounds and even, e.g., for clear signaling in the hope of generating common knowledge and promoting urgent safety regulations. The political/regulatory considerations might change in the coming years as AI and its associated risks become more mainstream, and I think there could be substantial changes in policy after the 2028 presidential elections. Importantly, the tradeoffs for Anthropic officially declaring some dangerous capability thresholds met could also change if another company faces a public incident that makes it clear some capability level has been reached, or if another AI company declares equivalent capability thresholds as met (this consideration made me adjust my initial probabilities up).
AI R&D is likely much more dangerous overall in the medium and long term than CBRN, but this is less immediately clear, and it also has more obvious commercial upside and thus more potential for attracting investment. Also, the people surveyed to determine current AI R&D acceleration within Anthropic are much better at eliciting the underlying capabilities of the models than the people surveyed to determine CBRN uplift. So, in general, I would typically expect a smaller gap between actual capability and official announcement; on the other hand, the substantial increase in difficulty (and widening of definitional wiggle room) from RSP v2.1 to v3.0 makes me think that Anthropic might prefer to publicize the AI R&D gains while delaying actually stating that the thresholds have been crossed.
Finally, I would expect that at some point AI capabilities will be so strong that Anthropic could very well just say what everyone already knows about some thresholds having been crossed. But there seems to be little benefit (from the perspective of providing public knowledge, and of signalling trustworthiness and a safety-conscious approach) in doing so once we are at that point. And the CBRN-3 example already points in the direction of earlier thresholds being treated as surpassed without ever being explicitly recognized as such.
I'm assuming that extinction or an equivalent outcome would translate to an LFPR of 0. Since I assign this possibility a high probability even in the Slow scenario (>15% chance of it happening during this period), it is dragging down many of the values.
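As a quick illustration of the drag (hypothetical numbers throughout; only the >15% extinction-equivalent probability is mine from above):

```python
# Hypothetical illustration: mixing an extinction-equivalent branch
# (LFPR = 0) into an otherwise normal forecast pulls the expected value down.
p_extinct = 0.15          # my lower bound for the Slow scenario, from above
lfpr_if_survival = 60.0   # hypothetical LFPR (%) conditional on no extinction
expected_lfpr = p_extinct * 0.0 + (1 - p_extinct) * lfpr_if_survival
print(expected_lfpr)      # 51.0, well below the conditional value
```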
My general distribution at the moment is extremely rough, and I would likely adjust it considerably with further time and reflection.
A very large part of the question seems to depend on sociopolitical dynamics about unemployment insurance, UBI-like programs, creation of fake jobs, etc.
Reaching only the Slow Progress scenario by the end of 2030 seems very unlikely on capability or economic considerations alone; it would more likely result from very strong/restrictive policies, which I think may be correlated with trying to control the impact of AI on labor markets. Nonetheless, even in the Slow scenario, the provided baseline forecast values for ≥2035 seem excessively confident, and do not seem to take the possibility and consequences of AGI or ASI seriously at all.
After 2040, even conditional on 2030 ending in the Slow Progress scenario, and even without existential catastrophes, most of my probability is on "weird" scenarios where normal notions of (valuable) human work do not make much sense; reaching 2040 in a still somewhat "normal" world would require very unexpected barriers to further AI progress and diffusion. Still, it makes sense to leave some (diminishing) probability for scenarios with very strong and long-lasting pauses on AI progress or use (though note that purely national pauses are not stable), as well as some scenarios where human labor still exists despite not being economically necessary.
A slight nitpick about phrasing: a strict reading of the conjunctive conditionals inside each scenario could potentially lead to no scenario being applicable, such as if all capabilities are low but we have level-5 self-driving (due to the clause "Self-driving improves but true level-5 systems do not yet exist"). I'm assuming that the scenario descriptions serve as fuzzy general guidelines rather than strict requirements.
Slight nitpick of resolution criteria, largely mattering only for 2030 and perhaps the very few next years: Since the scenarios are based on the state at the end of 2030, it might have been clearer to use the U.S. labor force participation rate for December of each year, rather than for January of that same year.
I don't think the PLA purges mean much of anything. The PLA has a longstanding modernization program dedicated to producing an army that can take Taiwan if the political leadership decides it needs to. A few sackings/eliminations, even of elite generals, are not going to change the PLA's trajectory nor IMO do they tell us much about how the political leadership perceives PLA readiness: there are a LOT of plausible reasons to purge someone, ranging from genuine corruption and incompetence to political disloyalty to simply wanting/needing to flex party control over the army (something that has not typically been entirely guaranteed in modern China).
More significant IMO are the 2028 Taiwan elections. If the KMT wins and manages to put together enough internal coherence that it is perceived as a credible negotiating partner in Beijing, this IMO lowers the risk of invasion very substantially out to at least 2032. The CCP leadership is relatively risk averse and I think would prefer to settle the Taiwan question through political means rather than war. I think everyone in the West thinks this is impossible (especially after Hong Kong) and maybe it is, but
a) the CCP may well not agree
b) although Hong Kong is perhaps something of a failure as far as the "one country, two systems" model goes, Macao is still right there!
The KMT is a broad coalition with pro-US and more Beijing-friendly factions, and has historically struggled to stick to a coherent unified party platform enforced by the party leadership: everything is worked out by negotiation between factions instead. That said, the pro-Beijing faction seems to be in the ascendancy, having won the party chairmanship with Cheng Li-wun, who is currently visiting China and is expected to meet Xi (the first trip to China by a senior KMT figure in at least 10 years). The governing DPP is unpopular and widely perceived as dubiously competent, and only won the last election due to vote-splitting between the KMT and TPP. I do expect the DPP to lose power in 2028, so most of my risk is concentrated in the period after 2032: if there is no move towards a more pro-China politics in Taiwan by then, and no progress at all towards reunification talks, invasion risk rises considerably.
A useful thread for the cyber threshold markets:
A few reasons why we can be a bit more confident than the starting probabilities here:
Ofgem RIIO-ET3 Final Determinations (December 2025):
- Total price control revenue reduced 5.3% (from £48.8bn to £46.2bn over the ET3 period)
- Revenue explicitly reduced in 2026/27 and 2027/28, with increases in later years to smooth the consumer bill impact
Plus, NESO's tariffs for 2026/27 came in substantially below the reference-class-based forecasts (£7.61bn, 14.7% below the September 2025 forecast of £8.918bn). This was driven by lower Ofgem allowances, changes to how offshore wind expenses are handled, and updated forecasts for demand and embedded generation.
Got Claude Code to generate my forecasts via the CLI, using lognormal distributions and the anticipated ~8% reductions in the coming years (a rough sketch of the setup follows the table):
| Year | NESO Sep-25 forecast | My median | My sigma | Rationale |
|---|---|---|---|---|
| 2027/28 | £10.278bn | £9.5bn | 0.13 | RIIO-ET3 reprofiling reduces early years; ~8% downward revision; NESO initial forecast due April 2026 |
| 2028/29 | £11.657bn | £11.0bn | 0.17 | 5.3% envelope cut partially offset by reprofiling towards later years |
| 2029/30 | £12.685bn | £12.5bn | 0.20 | Reprofiling adds revenue to later years; roughly in line with September forecast |
| 2030/31 | £13.629bn | £13.5bn | 0.22 | Strongest reprofiling benefit; widest uncertainty at 4–5 year lead |
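For what it's worth, a minimal sketch of the kind of setup I had in mind, with medians and sigmas taken from the table above. It assumes scipy's lognorm parameterization (scale = median = exp(mu), shape s = sigma of the log), which is my choice of encoding rather than anything from the CLI session itself:

```python
# Minimal sketch of the lognormal setup (medians/sigmas from the table above).
# scipy parameterization: scale = median = exp(mu), s = sigma of log(X).
from scipy.stats import lognorm

forecasts = {
    # year: (NESO Sep-25 forecast in £bn, my median in £bn, my sigma)
    "2027/28": (10.278, 9.5, 0.13),
    "2028/29": (11.657, 11.0, 0.17),
    "2029/30": (12.685, 12.5, 0.20),
    "2030/31": (13.629, 13.5, 0.22),
}

for year, (neso, median, sigma) in forecasts.items():
    dist = lognorm(s=sigma, scale=median)
    lo, hi = dist.ppf(0.10), dist.ppf(0.90)
    print(f"{year}: 80% interval £{lo:.2f}bn-£{hi:.2f}bn, "
          f"P(outturn < NESO Sep-25 forecast) = {dist.cdf(neso):.0%}")
```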
Resolved the first week:
| Date | Day | n_total |
|---|---|---|
| March 16 | Mon | 4 |
| March 17 | Tue | 2 |
| March 18 | Wed | 1 |
| March 19 | Thu | 3 |
| March 20 | Fri | 6 |
| March 21 | Sat | 0 |
| March 22 | Sun | 3 |
| Weekly total | | 19 |
Sentinel forecasters in their latest newsletter said the following:
Will the US place 1,000 or more boots on the ground in Iran before July 2026? Our forecasters think there’s a 37% chance (14% to 55%).
Will the US place >10,000 boots on the ground before July? Our forecasters think there’s a 13.5% probability (4% to 27%).
I'm currently at 50% and 11% on these thresholds for June.
From إيران بالعربية | عاجل (@IranArabic_ir), translated from Arabic: Contrary to what Trump claims, there are no ongoing negotiations between Iran and the United States.
Indirect messages have reached Iran, but they do not include any proposals of value or worthy of consideration. Moreover, the channel for sending these one-sided messages has been ongoing since the beginning of the war.
Iran will continue the war until it achieves its objectives, and there is a conviction that the five-day deadline announced by Trump is nothing but a trick to lower Iran's level of alertness, with the possibility of any strike by the adversary occurring during this period.
Iran views a response to the energy infrastructure in the region and Israel's power grid as inevitable, and Trump's timings hold no value from its perspective.
The Strait of Hormuz will remain closed.
USS Tripoli (LHA-7), USS New Orleans (LPD-18) and elements of the embarked 31st Marine Expeditionary Unit are getting closer to the Middle East after transiting the Strait of Malacca this week. USS San Diego (LPD-22), which was operating with Tripoli and New Orleans earlier this month, is now in port in Sasebo, Japan, according to a U.S. Pacific Fleet spokesperson.
A Marine Expeditionary Unit (MEU) is a ~2,200–2,400-Marine rapid-deployment force sized to fit a standard three-ship Amphibious Ready Group: one large-deck assault ship (LHA/LHD) carrying most of the Marines and aircraft, plus two transport docks (LPDs) carrying additional troops, landing craft, and vehicles. With USS San Diego back in Sasebo, only Tripoli (the large-deck) and New Orleans are heading to the Middle East at the moment: a degraded MEU of around 1,500–2,000 Marines, I'd guess. The 11th MEU (Boxer ARG) is also en route from San Diego (same USNI article). The number sent could easily increase further, since the Pentagon has submitted a ~$200 billion supplemental budget request framed around a sustained campaign.
Kharg Island is Iranian sovereign territory, so any troops ashore there would count toward this market.
An MEU could plausibly assault Kharg, but holding it is the harder problem: Kharg is ~25km off the Iranian coast, well within range of shore-based missiles, artillery, and drones. Suppressing those threats continuously while occupying the island would stretch a force of 1,500–2,000 Marines. A scenario where the US raids Kharg, destroys the oil terminal infrastructure, and withdraws (never reaching large sustained troop numbers on the island) seems at least as plausible as a prolonged occupation. Trump said this in 1998 to Polly Toynbee in the Guardian, so there seems to be some fixation on it:
They’ve been beating us psychologically, making us look a bunch of fools. One bullet shot at one of our men or ships and I’d do a number on Kharg Island. I’d go in and take it. Iran can’t even beat Iraq, yet they push the United States around. It’d be good for the world to take them on.
Lindsey Graham is up for it too, though I really don't get the logic (controlling it isn't going to help open up the Strait of Hormuz, is it?):
- Week ending 2026-03-08: 34 (4+5+3+4+5+7+6)
- Week ending 2026-03-15: 41 (9+4+5+8+3+3+9)
Upping my near-term median to reflect the increasing number of vessels being allowed through by Iran (e.g. ships bound for India, carrying Iranian cargo, etc.). Not expecting a rapid return to normal here, though; it seems that mines have been planted:
Although the strait is 21 miles wide at its narrowest point, there are only two 1.86-mile stretches where water is deep enough for large oil tankers to pass, given how low these ships sit in the water. “This creates a two-lane highway for large vessels, one lane in and one out. The bottleneck here could last a few weeks at the very least,” he said.
— WSJ
Very unlikely in the next couple of years given the extensive purges in the PLA.