Leaderboard
Models ranked by Adversarial Gap. Lower gap = better cognitive understanding beyond keyword-stuffing.
| Rank | Model | Params | Std CSR | Adv CSR | Gap ↓ | Constraint % | Prompt % |
|---|---|---|---|---|---|---|---|
Click any model name for detailed analytics. Std CSR = Standard Constraint Satisfaction Rate (prompt-level strict). Adv CSR = Adversarial mode. Gap = Std − Adv (lower is better). Constraint % = Individual constraint pass rate. Prompt % = % of prompts where ALL constraints pass.
Verifiable Constraints
28 deterministic rules across two tiers. Every constraint returns pass/fail with no subjectivity.
Tier 1 — Structural
String matching, word counting, keyword detection. Zero external dependencies.
- Is output a question? (ends with "?")
- Word count within the 5–150 word range
- Contains level-appropriate vocabulary
- Answer length and format checks
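Tier 1 checks reduce to plain string logic. A minimal Python sketch of three universal checks, under the rules stated above (function names and exact tokenization are illustrative, not CogBench's actual implementation):

```python
import re
from collections import Counter

def is_question(text: str) -> bool:
    """U1 sketch: output ends with '?' (the interrogative/task-structure
    fallback described in the constraint map is omitted here)."""
    return text.strip().endswith("?")

def word_count_ok(text: str, lo: int = 5, hi: int = 150) -> bool:
    """U2 sketch: word count within the allowed range."""
    return lo <= len(text.split()) <= hi

def not_degenerate(text: str) -> bool:
    """U4 sketch: non-empty, and no word repeated more than 3 times."""
    words = re.findall(r"[a-z']+", text.lower())
    return bool(words) and max(Counter(words).values()) <= 3
```

Because each check is a pure function of the string, two runs over the same output always agree.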
Tier 2 — Semantic
NLI-based entailment verification using DeBERTa-v3 (fixed threshold, not a judge).
- D3: Answer is supported by passage (>65% entailment)
- P2: Apply scenario is novel (not in passage)
- C2: Create output is original (not extractable)
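The NLI model produces a probability, but the constraint itself is still a deterministic threshold comparison. A sketch of the three Tier 2 decisions, using the thresholds listed above (the interface, and the reading of "does NOT entail (>X%)" as non-entailment probability, are assumptions):

```python
def d3_supported(entailment_prob: float, threshold: float = 0.65) -> bool:
    """D3 sketch: answer is supported by the passage (entailment > 65%)."""
    return entailment_prob > threshold

def p2_new_scenario(entailment_prob: float, threshold: float = 0.55) -> bool:
    """P2 sketch: passage does NOT entail the scenario (non-entailment > 55%)."""
    return (1.0 - entailment_prob) > threshold

def c2_novel(entailment_prob: float, threshold: float = 0.60) -> bool:
    """C2 sketch: passage does NOT entail the answer (non-entailment > 60%)."""
    return (1.0 - entailment_prob) > threshold
```

With the DeBERTa-v3 weights and thresholds fixed, the same (passage, output) pair always yields the same pass/fail result.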
Complete Constraint Map
| Level | ID | Constraint | What It Checks | Tier |
|---|---|---|---|---|
| Universal | U1 | IsQuestion | Output is a question or task prompt (ends with "?" or uses interrogative/task structure) | Tier 1 |
| | U2 | WordCount | Question is 5–150 words (5 for Remember, 10 for others) | Tier 1 |
| | U3 | PassageRelevance | ≥1 key term (Remember) or ≥2 (others) from passage in Q+A | Tier 1 |
| | U4 | NoDegenerateOutput | No word repeated >3×, not empty | Tier 1 |
| L1 Remember | R1 | Vocabulary | Uses recall verbs (define, list, identify…) or natural recall patterns (What is, How many, When did…) | Tier 1 |
| | R2 | SingleConcept | Targets ≤2 key concepts (deduplicated: cell/cell cycle = 1) | Tier 1 |
| | R3 | ShortAnswer | Reference answer ≤20 words | Tier 1 |
| | R4 | Extractable | ≥60% of answer words in passage | Tier 1 |
| L2 Understand | D1 | Vocabulary | Uses comprehension verbs (explain, describe…) or natural patterns (Why is, How does, What is the purpose…) | Tier 1 |
| | D2 | NotCopied | <70% trigram overlap between answer and passage | Tier 1 |
| | D3 | Supported | NLI: answer not contradicted by passage (contradiction <50%) | Tier 2 |
| | D4 | AsksMeaning | Contains meaning markers (how, why, explain, describe…) | Tier 1 |
| L3 Apply | P1 | Vocabulary | Uses apply verbs (calculate, solve…) or scenario framing (If a…, Given that…, A researcher…) | Tier 1 |
| | P2 | NewScenario | NLI: passage does NOT entail scenario (>55%) | Tier 2 |
| | P3 | MethodReference | Q+A references ≥1 key concept or method from passage | Tier 1 |
| | P4 | SpecificResult | Answer contains concrete reasoning (numbers, causal logic, outcomes, method application) | Tier 1 |
| L4 Analyze | A1 | Vocabulary | Uses analysis verbs (compare, contrast…) or analytical patterns (How do X differ, What is the relationship…) | Tier 1 |
| | A2 | MultipleConcepts | ≥2 key concepts in Q+A (with acronym matching) | Tier 1 |
| | A3 | Relationship | Contains relationship markers (between, differ, compare…) | Tier 1 |
| | A4 | AnswerCoverage | Answer addresses all concepts from question | Tier 1 |
| L5 Evaluate | E1 | Vocabulary | Uses judgment verbs (assess, critique, justify…) | Tier 1 |
| | E2 | Claim | Presents something to evaluate (should, justify, claim, position, strengths/weaknesses…) | Tier 1 |
| | E3 | EvidenceRequest | Asks for evidence-based reasoning | Tier 1 |
| | E4 | Argumentation | Answer contains argumentation (contrastive, causal, evaluative, or structured reasoning) | Tier 1 |
| L6 Create | C1 | Vocabulary | Uses creation verbs (design, propose, develop…) | Tier 1 |
| | C2 | Novel | NLI: passage does NOT entail answer (>60%) | Tier 2 |
| | C3 | Specifications | ≥2 specification markers (must, should, include…) | Tier 1 |
| | C4 | SubstantialAnswer | Answer is >50 words | Tier 1 |
Adversarial Mode
The headline metric: can LLMs satisfy cognitive constraints when forced to use misleading vocabulary?
What Is Adversarial Mode?
In standard mode, an LLM generates an "Analyze" question and naturally uses words like "compare" and "contrast". Easy.
In adversarial mode, the LLM must generate an "Analyze" question while restricted to Remember vocabulary ("identify", "name", "list"). The structural constraints for Analyze (A2, A3, A4) must still pass.
This tests whether the model truly understands cognitive complexity or just keyword-stuffs.
The Adversarial Gap
If a model scores 85% standard but 25% adversarial, its 60pp gap means most "success" came from vocabulary cues, not genuine understanding.
Ideal model: Gap ≈ 0 (cognitive control is vocabulary-independent).
Weak model: Large gap (relies on keywords for apparent competence).
Adversarial Pairings
| Generate This Level | Using Vocabulary From | What It Tests |
|---|---|---|
| L2 Understand | L1 Remember verbs | Can you explain using "identify" and "list"? |
| L3 Apply | L2 Understand verbs | Can you apply using "explain" and "describe"? |
| L4 Analyze | L1 Remember verbs | Can you analyze using "name" and "what is"? |
| L5 Evaluate | L2 Understand verbs | Can you judge using "summarize" and "discuss"? |
| L6 Create | L3 Apply verbs | Can you create using "solve" and "calculate"? |
| L1 Remember | L4 Analyze verbs | Can you recall facts using "compare" and "examine"? |
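In code form, the pairing table is just a fixed mapping from target level to vocabulary level. A sketch (the level labels and lookup helper are illustrative, not CogBench identifiers):

```python
# Target cognitive level -> level whose vocabulary must be used instead.
ADVERSARIAL_PAIRINGS = {
    "L2 Understand": "L1 Remember",
    "L3 Apply": "L2 Understand",
    "L4 Analyze": "L1 Remember",
    "L5 Evaluate": "L2 Understand",
    "L6 Create": "L3 Apply",
    "L1 Remember": "L4 Analyze",
}

def vocabulary_for(target_level: str) -> str:
    """Return the mismatched vocabulary level for an adversarial prompt."""
    return ADVERSARIAL_PAIRINGS[target_level]
```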
Analysis
Visual breakdown of standard vs. adversarial performance across models and Bloom's levels.
Methodology
How CogBench evaluates cognitive-level control with deterministic, verifiable constraints.
IFEval-Style Evaluation
Inspired by IFEval, CogBench makes every constraint a deterministic rule. No trained classifier, no LLM-as-judge.
This eliminates inter-rater disagreement and makes results perfectly reproducible.
Bloom's Taxonomy Levels
- L1 Remember — Recall facts directly from text
- L2 Understand — Explain ideas in own words
- L3 Apply — Use knowledge in new situations
- L4 Analyze — Break apart and find relationships
- L5 Evaluate — Make evidence-based judgments
- L6 Create — Produce something original
Metrics
- Prompt-level strict — % of questions where ALL constraints pass
- Prompt-level loose — % of questions whose constraint score meets a ≥0.5 threshold
- Constraint-level strict — % of individual constraints that pass
- Adversarial gap — Standard − Adversarial CSR (headline metric)
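The aggregation behind these metrics is straightforward. A sketch, representing each prompt's results as a list of per-constraint booleans (function names and the exact loose-score definition are assumptions, not CogBench's code):

```python
def prompt_level_strict(per_prompt):
    """% of prompts where ALL constraints pass."""
    return 100.0 * sum(all(flags) for flags in per_prompt) / len(per_prompt)

def prompt_level_loose(per_prompt, threshold=0.5):
    """% of prompts whose fraction of passing constraints is >= threshold."""
    return 100.0 * sum(
        sum(flags) / len(flags) >= threshold for flags in per_prompt
    ) / len(per_prompt)

def constraint_level_strict(per_prompt):
    """% of individual constraint checks that pass."""
    total = sum(len(flags) for flags in per_prompt)
    return 100.0 * sum(sum(flags) for flags in per_prompt) / total

def adversarial_gap(std_csr, adv_csr):
    """Headline metric: Standard minus Adversarial CSR, in points."""
    return std_csr - adv_csr
```

For example, the 85% standard / 25% adversarial model discussed earlier yields `adversarial_gap(85.0, 25.0) == 60.0`.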
Source Passages
120 passages scraped from OpenStax (CC-BY licensed textbooks). 15 passages per subject across 8 disciplines:
Biology, Chemistry, Physics, Mathematics, Psychology, Economics, History, Computer Science
Each passage is 300-800 words with extracted key concepts and method/principle terms.
Why Verifiable Constraints?
Previous approaches use trained classifiers or LLM-as-judge to evaluate question quality. These have fundamental problems:
- Circular reasoning — A model judging a model. If the judge is wrong, all scores are wrong.
- Non-reproducible — Different runs of the same judge give different scores.
- Low agreement — Inter-annotator agreement for Bloom's classification is often <50%.
CogBench defines cognitive complexity through checkable structural properties. A Remember question must have an extractable answer (R4). An Analyze question must reference ≥2 concepts (A2). These are verifiable facts, not subjective judgments.
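For instance, R4's extractability check reduces to set membership over tokens. A minimal sketch (the tokenization and exact rule are assumptions; only the ≥60% threshold comes from the constraint map):

```python
import re

def _tokenize(text: str) -> list:
    """Lowercase word tokens; punctuation stripped."""
    return re.findall(r"[a-z0-9']+", text.lower())

def r4_extractable(answer: str, passage: str, threshold: float = 0.6) -> bool:
    """R4 sketch: at least 60% of answer words must appear in the passage."""
    ans = _tokenize(answer)
    vocab = set(_tokenize(passage))
    return bool(ans) and sum(w in vocab for w in ans) / len(ans) >= threshold
```

Any disagreement about whether a question is "really" a Remember question is replaced by a check anyone can rerun.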
Submit Your Model
Evaluate your LLM and get ranked on the leaderboard.
1. Install CogBench
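A typical install, assuming the package is published on PyPI under the name `cogbench` (check the CogBench repository for the authoritative instructions):

```shell
pip install cogbench
```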
Verify the installation:
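For example (the exact flag is an assumption; most CLIs expose a version check):

```shell
cogbench --version
```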
Or install from source:
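A sketch of a source install; substitute the repository URL from the CogBench project page:

```shell
# Replace <cogbench-repo-url> with the CogBench repository URL.
git clone <cogbench-repo-url>
cd cogbench
pip install -e .
```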
2. Run the Benchmark
Generate questions from your model across all 120 passages and 6 Bloom's levels in both standard and adversarial modes:
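By analogy with the `cogbench submit` form shown later in this section, a run might look like the following (the subcommand and flags are assumptions, not documented CogBench options):

```shell
# Runs all 120 passages x 6 levels in standard and adversarial modes.
cogbench run --model qwen2.5:14b
```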
This produces 1,440 questions (720 standard + 720 adversarial) and evaluates all 28 constraints automatically. Models are served via Ollama.
3. Submit Results
Submit your evaluated results to appear on the leaderboard. Requires the GitHub CLI (gh):
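Presumably something like the following, based on the fuller form shown at the end of this section (whether a bare `submit` picks up the most recent results is an assumption):

```shell
cogbench submit --name "Your Name"
```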
This packages your results, computes metrics, and creates a GitHub issue on the CogBench repo. The maintainers will review and add your model to the leaderboard.
To submit a specific model: `cogbench submit --model qwen2.5:14b --name "Your Name"`
Citation
If you use CogBench in your research, please cite: