Date open-weight model reaches 80% on CyberGym Level 1

This market resolves on the first credible public report that a publicly downloadable open-weight AI model achieves CyberGym Level 1 Success Rate (%) ≥80.0 with Trials = 1 on the full official 1,507-task benchmark.

A model counts as open-weight if the weights needed to run the evaluated model are publicly downloadable before or at the time of the report. API-only, cloud-only, private-weight, or hosted-only models do not count. Public adapters or fine-tunes count only if the full runnable model, including the base model, is publicly downloadable.

The result must be for CyberGym Level 1: the agent receives the vulnerability description and unpatched codebase, then must generate a working proof-of-concept that reproduces the vulnerability. "Trials = 1" means one attempt per task; best-of-N runs, multiple independent restarts, reranking across attempts, or multiple-trial scores do not count.

The run may use an agent harness, local code search, compilation, tests, fuzzing tools, and execution feedback, but it must not use web browsing, hidden ground-truth PoCs, patched-code diffs, private vulnerability databases, human assistance during evaluation, or closed/API models as helpers. All AI model calls in the run must be to the qualifying open-weight model or publicly downloadable variants of it.

A model fine-tuned for general coding, cybersecurity, or agentic tasks may count. A model trained, selected, or optimized specifically on CyberGym benchmark tasks, target PoCs, labels, or evaluation results does not count.

Credible reports include the official CyberGym leaderboard, a model-card or technical report from the model developer, a peer-reviewed paper or arXiv preprint, or a recognized independent evaluation report. The report must identify the model, confirm public weight availability, report the Level 1 Trials = 1 score, and describe the protocol well enough to rule out a materially easier setup.
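As an illustration only, the non-score requirements above can be read as a checklist that a candidate report must pass in full. The sketch below encodes that reading in Python; the `Report` class, its field names, and `failed_criteria` are hypothetical labels invented for this example, not part of CyberGym or of any official resolution tooling. The score threshold itself is not checked here.

```python
from dataclasses import dataclass

FULL_BENCHMARK_TASKS = 1507  # full official benchmark size stated in the question


@dataclass(frozen=True)
class Report:
    # Hypothetical field names; each mirrors one requirement in the text.
    credible_source: bool                # leaderboard, model card, paper/preprint, independent eval
    identifies_model: bool
    weights_publicly_downloadable: bool  # full runnable model, including any base model
    level: int                           # must be CyberGym Level 1
    trials: int                          # must be exactly 1 (no best-of-N)
    tasks_evaluated: int                 # must cover the full benchmark
    protocol_described: bool             # enough detail to rule out an easier setup
    used_disallowed_help: bool           # web, ground-truth PoCs, patch diffs, humans, closed models
    tuned_on_cybergym: bool              # trained/selected/optimized on benchmark data


def failed_criteria(r: Report) -> list[str]:
    """Return the names of the requirements the report fails (empty list = passes)."""
    checks = {
        "credible_source": r.credible_source,
        "identifies_model": r.identifies_model,
        "weights_publicly_downloadable": r.weights_publicly_downloadable,
        "level": r.level == 1,
        "trials": r.trials == 1,
        "tasks_evaluated": r.tasks_evaluated == FULL_BENCHMARK_TASKS,
        "protocol_described": r.protocol_described,
        "used_disallowed_help": not r.used_disallowed_help,
        "tuned_on_cybergym": not r.tuned_on_cybergym,
    }
    return [name for name, ok in checks.items() if not ok]
```

Returning the list of failed items, rather than a single boolean, makes it easy to say why a given report does not qualify.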

If the report gives task counts, ≥80.0 means at least 1,206/1,507 successful tasks. If only a percentage is reported, a reported Success Rate (%) of 80.0 or higher counts.
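The 1,206 figure follows from rounding 0.8 × 1,507 = 1,205.6 up to the next whole task. A minimal sketch of that arithmetic, using integer and decimal math to avoid floating-point edge cases, is shown below; the function names are hypothetical and chosen only for this example.

```python
from decimal import Decimal

TOTAL_TASKS = 1507
THRESHOLD_TENTHS = 800  # 80.0%, expressed in tenths of a percent


def min_successes(total: int = TOTAL_TASKS) -> int:
    """Smallest task count whose success rate is at least 80.0% (integer ceiling)."""
    return -(-THRESHOLD_TENTHS * total // 1000)


def meets_by_count(successes: int, total: int = TOTAL_TASKS) -> bool:
    """Compare successes/total against 80.0% without dividing."""
    return successes * 1000 >= THRESHOLD_TENTHS * total


def meets_by_percentage(reported_pct: str) -> bool:
    """Use the percentage exactly as reported (passed as a string) when no counts are given."""
    return Decimal(reported_pct) >= Decimal("80.0")


print(min_successes())               # 1206
print(meets_by_count(1205))          # False (1205/1507 is about 79.96%)
print(meets_by_count(1206))          # True  (1206/1507 is about 80.03%)
print(meets_by_percentage("80.0"))   # True
```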

The resolution date is the UTC publication date of the first qualifying report. Each monthly threshold resolves YES if that date is on or before the threshold date, otherwise NO. If no qualifying report appears by 2029-12-31 23:59:59 UTC, all remaining thresholds resolve NO.
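The date rule can be sketched as follows. This is an illustrative reading, not official resolution code: the function names and the example threshold dates are hypothetical, and the sketch assumes the publication timestamp is timezone-aware so that it can be converted to UTC before comparison.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Optional

# Final cutoff stated in the question.
DEADLINE = datetime(2029, 12, 31, 23, 59, 59, tzinfo=timezone.utc)


def utc_publication_date(published_at: datetime) -> date:
    """Convert a timezone-aware publication timestamp to its UTC calendar date."""
    return published_at.astimezone(timezone.utc).date()


def resolve(published_at: Optional[datetime], thresholds: list[date]) -> dict[date, str]:
    """Map each threshold date to YES or NO; None means no qualifying report exists."""
    if published_at is None or published_at.astimezone(timezone.utc) > DEADLINE:
        return {t: "NO" for t in thresholds}
    pub = utc_publication_date(published_at)
    return {t: "YES" if pub <= t else "NO" for t in thresholds}


# Hypothetical example: a report published at 20:00 on 2027-03-31 in UTC-8
# falls on 2027-04-01 in UTC, so it misses a 2027-03-31 threshold.
pacific = timezone(timedelta(hours=-8))
published = datetime(2027, 3, 31, 20, 0, tzinfo=pacific)
print(resolve(published, [date(2027, 3, 31), date(2027, 4, 30)]))
```

The example shows why the UTC conversion matters: a report published late in the day in a western time zone can land on the next UTC calendar date.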
