Some disconnected, rough thoughts:

As far as I can tell from some research, Anthropic hasn’t yet clearly stated that they have crossed the CBRN-3 threshold (“the ability to significantly assist individuals or groups with basic STEM backgrounds in obtaining, producing, or deploying CBRN weapons”), although they did activate ASL-3 protections. In general, the standards of evidence they use for “ruling in” capability thresholds seem high, and I also expect they have few incentives to run strict experiments just to clearly confirm dangerous capabilities: there is relatively little upside in doing so, as long as they act precautionarily internally whenever they cannot “rule out” dangerous capability levels.

It is also not clear how this question is handling “AI R&D-4”. The standard for automated AI R&D in RSP v3.0 seems substantially higher than the AI R&D-4 threshold from the previous RSP, particularly when reading things such as the Claude Mythos Preview System Card, which among other things suggests that a productivity uplift on individual tasks on the order of 40× (an order of magnitude above the current geometric mean of 4× from surveys) would be required to yield an overall progress multiplier of 2×.
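
To make that arithmetic concrete, here is a minimal sketch assuming an Amdahl’s-law-style aggregation (my illustrative assumption, not necessarily the exact model in the system card): if only a fraction of R&D work can be accelerated, even a very large per-task uplift translates into a much smaller overall progress multiplier. The ~51% automatable fraction below is a hypothetical value chosen so that 40× maps to roughly 2×.

```python
# Sketch under an Amdahl's-law-style assumption (illustrative, not
# necessarily the system card's model): if a fraction f of R&D work
# is sped up by a factor s and the rest is unchanged, the overall
# progress multiplier is M = 1 / ((1 - f) + f / s).

def overall_multiplier(f: float, s: float) -> float:
    """Overall speedup when a fraction f of the work is accelerated s-fold."""
    return 1.0 / ((1.0 - f) + f / s)

# f = 0.51 is a hypothetical automatable fraction chosen for illustration.
# A 40x per-task uplift then yields only about a 2x overall multiplier...
print(overall_multiplier(f=0.51, s=40))  # ~2.0

# ...while the ~4x per-task uplift from current surveys yields far less.
print(overall_multiplier(f=0.51, s=4))   # ~1.6
```

One notable property of this toy model: with f = 0.51, even an infinite per-task speedup caps the overall multiplier at about 2.04×, so pushing much past 2× requires raising the automatable fraction, not just the per-task uplift.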

My reading is that, for this question to resolve, Anthropic would need to officially acknowledge that at least one of the definitions has been met.

Now that AI R&D-4 has been deprecated, this lower threshold may never be officially tested again.

In general, one fundamental issue with these questions is that there will likely be a substantial gap (of at least months) between the point where the actual underlying capability exists (particularly under refined scaffolds, with people prepared to smartly leverage the model’s strengths and well incentivized to elicit them successfully, which probably does not describe the CBRN tests at Anthropic or similar tests at other companies) and the point where the capability is officially announced.

Reading things like the “Biorisk” post at red.anthropic.com, the information about Mythos, and METR’s time-horizon progress in general (even though that is purely about software engineering) makes me expect that (actual) CBRN-4 is much closer than the baseline starting probabilities suggest, if not already here. However, related to the previous observation, there seems to be little benefit in the current environment to clearly stating that their models carry high CBRN risk: the current administration or adversaries could use an honest assessment against Anthropic, and the upside of showing others a very clear lower bound on dangerous capabilities seems low at the moment, both on commercial grounds and even, e.g., for clear signaling in the hopes of generating common knowledge and promoting urgent safety regulation.

The political and regulatory considerations might change in the coming years as AI and its associated risks become more mainstream, and I think there could be substantial policy changes after the 2028 presidential elections. Importantly, the tradeoffs for Anthropic in officially declaring some dangerous capability threshold met could also change if another company faces a public incident that makes it clear some capability level has been reached, or if another AI company declares equivalent thresholds as met (this consideration made me adjust my initial probabilities upward).

AI R&D is likely much more dangerous overall in the medium and long term than CBRN, but this is less obvious, and it also has more obvious commercial upside and thus more potential to attract investment. Also, the people surveyed to determine the current AI R&D acceleration within Anthropic are much better at eliciting the models’ underlying capabilities than the people surveyed to determine CBRN uplift. So, in general, I would expect a smaller gap between actual capability and official announcement; on the other hand, the substantial increase in difficulty (and widening of definitional wiggle room) from RSP v2.1 to v3.0 makes me think that Anthropic might prefer to publicize its AI R&D gains while delaying any actual indication that the thresholds have been crossed.

Finally, I would expect that at some point AI capabilities will be so strong that Anthropic could very well just state what everyone already knows about some thresholds having been crossed. But there seems to be little benefit (from the perspective of providing public knowledge, or of signaling trustworthiness and a safety-conscious approach) once we are at that point. And the CBRN-3 example already points in the direction of thresholds being treated as surpassed without ever being explicitly recognized as such.
