Leaderboard
Models ranked by Adversarial Gap. Lower gap = better cognitive understanding beyond keyword-stuffing.
| Rank | Model | Params | Std CSR | Adv CSR | Gap ↓ | Constraint % | Prompt % |
|---|---|---|---|---|---|---|---|
Click any model name for detailed analytics. Std CSR = Standard Constraint Satisfaction Rate (prompt-level strict). Adv CSR = Adversarial mode. Gap = Std − Adv (lower is better). Constraint % = Individual constraint pass rate. Prompt % = % of prompts where ALL constraints pass.
Verifiable Constraints
28 deterministic rules across two tiers. Every constraint returns pass/fail with no subjectivity.
Tier 1 — Structural
String matching, word counting, keyword detection. Zero external dependencies.
- Is output a question? (ends with "?")
- Word count within the 5–150 word range
- Contains level-appropriate vocabulary
- Answer length and format checks
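Tier 1 checks reduce to plain string logic. A minimal Python sketch of three universal checks, under the rules stated above (function names and exact tokenization are illustrative, not CogBench's actual implementation):

```python
import re
from collections import Counter

def is_question(text: str) -> bool:
    """U1 sketch: output ends with '?' (the interrogative/task-structure
    fallback described in the constraint map is omitted here)."""
    return text.strip().endswith("?")

def word_count_ok(text: str, lo: int = 5, hi: int = 150) -> bool:
    """U2 sketch: word count within the allowed range."""
    return lo <= len(text.split()) <= hi

def not_degenerate(text: str) -> bool:
    """U4 sketch: non-empty, and no word repeated more than 3 times."""
    words = re.findall(r"[a-z']+", text.lower())
    return bool(words) and max(Counter(words).values()) <= 3
```

Because each check is a pure function of the string, two runs over the same output always agree.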
Tier 2 — Semantic
NLI-based entailment verification using DeBERTa-v3 (fixed threshold, not a judge).
- D3: Answer is supported by passage (>65% entailment)
- P2: Apply scenario is novel (not in passage)
- C2: Create output is original (not extractable)
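The NLI model produces a probability, but the constraint itself is still a deterministic threshold comparison. A sketch of the three Tier 2 decisions, using the thresholds listed above (the interface, and the reading of "does NOT entail (>X%)" as non-entailment probability, are assumptions):

```python
def d3_supported(entailment_prob: float, threshold: float = 0.65) -> bool:
    """D3 sketch: answer is supported by the passage (entailment > 65%)."""
    return entailment_prob > threshold

def p2_new_scenario(entailment_prob: float, threshold: float = 0.55) -> bool:
    """P2 sketch: passage does NOT entail the scenario (non-entailment > 55%)."""
    return (1.0 - entailment_prob) > threshold

def c2_novel(entailment_prob: float, threshold: float = 0.60) -> bool:
    """C2 sketch: passage does NOT entail the answer (non-entailment > 60%)."""
    return (1.0 - entailment_prob) > threshold
```

With the DeBERTa-v3 weights and thresholds fixed, the same (passage, output) pair always yields the same pass/fail result.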
Complete Constraint Map
| Level | ID | Constraint | What It Checks | Tier |
|---|---|---|---|---|
| Universal | U1 | IsQuestion | Output is a question or task prompt (ends with "?" or uses interrogative/task structure) | Tier 1 |
| | U2 | WordCount | Question is 5–150 words (5 for Remember, 10 for others) | Tier 1 |
| | U3 | PassageRelevance | ≥1 key term (Remember) or ≥2 (others) from passage in Q+A | Tier 1 |
| | U4 | NoDegenerateOutput | No word repeated >3×, not empty | Tier 1 |
| L1 Remember | R1 | Vocabulary | Uses recall verbs (define, list, identify…) or natural recall patterns (What is, How many, When did…) | Tier 1 |
| | R2 | SingleConcept | Targets ≤2 key concepts (deduplicated: cell/cell cycle = 1) | Tier 1 |
| | R3 | ShortAnswer | Reference answer ≤20 words | Tier 1 |
| | R4 | Extractable | ≥60% of answer words in passage | Tier 1 |
| L2 Understand | D1 | Vocabulary | Uses comprehension verbs (explain, describe…) or natural patterns (Why is, How does, What is the purpose…) | Tier 1 |
| | D2 | NotCopied | <70% trigram overlap between answer and passage | Tier 1 |
| | D3 | Supported | NLI: answer not contradicted by passage (contradiction <50%) | Tier 2 |
| | D4 | AsksMeaning | Contains meaning markers (how, why, explain, describe…) | Tier 1 |
| L3 Apply | P1 | Vocabulary | Uses apply verbs (calculate, solve…) or scenario framing (If a…, Given that…, A researcher…) | Tier 1 |
| | P2 | NewScenario | NLI: passage does NOT entail scenario (>55%) | Tier 2 |
| | P3 | MethodReference | Q+A references ≥1 key concept or method from passage | Tier 1 |
| | P4 | SpecificResult | Answer contains concrete reasoning (numbers, causal logic, outcomes, method application) | Tier 1 |
| L4 Analyze | A1 | Vocabulary | Uses analysis verbs (compare, contrast…) or analytical patterns (How do X differ, What is the relationship…) | Tier 1 |
| | A2 | MultipleConcepts | ≥2 key concepts in Q+A (with acronym matching) | Tier 1 |
| | A3 | Relationship | Contains relationship markers (between, differ, compare…) | Tier 1 |
| | A4 | AnswerCoverage | Answer addresses all concepts from question | Tier 1 |
| L5 Evaluate | E1 | Vocabulary | Uses judgment verbs (assess, critique, justify…) | Tier 1 |
| | E2 | Claim | Presents something to evaluate (should, justify, claim, position, strengths/weaknesses…) | Tier 1 |
| | E3 | EvidenceRequest | Asks for evidence-based reasoning | Tier 1 |
| | E4 | Argumentation | Answer contains argumentation (contrastive, causal, evaluative, or structured reasoning) | Tier 1 |
| L6 Create | C1 | Vocabulary | Uses creation verbs (design, propose, develop…) | Tier 1 |
| | C2 | Novel | NLI: passage does NOT entail answer (>60%) | Tier 2 |
| | C3 | Specifications | ≥2 specification markers (must, should, include…) | Tier 1 |
| | C4 | SubstantialAnswer | Answer is >50 words | Tier 1 |
Adversarial Mode
The headline metric: can LLMs satisfy cognitive constraints when forced to use misleading vocabulary?
What Is Adversarial Mode?
In standard mode, an LLM generates an "Analyze" question and naturally uses words like "compare" and "contrast". Easy.
In adversarial mode, the LLM must generate an "Analyze" question while restricted to Remember vocabulary ("identify", "name", "list"). The structural constraints for Analyze (A2, A3, A4) must still pass.
This tests whether the model truly understands cognitive complexity or just keyword-stuffs.
The Adversarial Gap
If a model scores 85% standard but 25% adversarial, its 60pp gap means most "success" came from vocabulary cues, not genuine understanding.
Ideal model: Gap ≈ 0 (cognitive control is vocabulary-independent).
Weak model: Large gap (relies on keywords for apparent competence).
Adversarial Pairings
| Generate This Level | Using Vocabulary From | What It Tests |
|---|---|---|
| L2 Understand | L1 Remember verbs | Can you explain using "identify" and "list"? |
| L3 Apply | L2 Understand verbs | Can you apply using "explain" and "describe"? |
| L4 Analyze | L1 Remember verbs | Can you analyze using "name" and "what is"? |
| L5 Evaluate | L2 Understand verbs | Can you judge using "summarize" and "discuss"? |
| L6 Create | L3 Apply verbs | Can you create using "solve" and "calculate"? |
| L1 Remember | L4 Analyze verbs | Can you recall facts using "compare" and "examine"? |
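In code form, the pairing table is just a fixed mapping from target level to vocabulary level. A sketch (the level labels and lookup helper are illustrative, not CogBench identifiers):

```python
# Target cognitive level -> level whose vocabulary must be used instead.
ADVERSARIAL_PAIRINGS = {
    "L2 Understand": "L1 Remember",
    "L3 Apply": "L2 Understand",
    "L4 Analyze": "L1 Remember",
    "L5 Evaluate": "L2 Understand",
    "L6 Create": "L3 Apply",
    "L1 Remember": "L4 Analyze",
}

def vocabulary_for(target_level: str) -> str:
    """Return the mismatched vocabulary level for an adversarial prompt."""
    return ADVERSARIAL_PAIRINGS[target_level]
```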
Analysis
Visual breakdown of standard vs. adversarial performance across models and Bloom's levels.
Methodology
How CogBench evaluates cognitive-level control with deterministic, verifiable constraints.
IFEval-Style Evaluation
Inspired by IFEval, CogBench makes every constraint a deterministic rule. No trained classifier, no LLM-as-judge.
This eliminates inter-rater disagreement and makes results perfectly reproducible.
Bloom's Taxonomy Levels
- L1 Remember — Recall facts directly from text
- L2 Understand — Explain ideas in own words
- L3 Apply — Use knowledge in new situations
- L4 Analyze — Break apart and find relationships
- L5 Evaluate — Make evidence-based judgments
- L6 Create — Produce something original
Metrics
- Prompt-level strict — % of questions where ALL constraints pass
- Prompt-level loose — % of questions whose constraint score meets a ≥0.5 threshold
- Constraint-level strict — % of individual constraints that pass
- Adversarial gap — Standard − Adversarial CSR (headline metric)
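The aggregation behind these metrics is straightforward. A sketch, representing each prompt's results as a list of per-constraint booleans (function names and the exact loose-score definition are assumptions, not CogBench's code):

```python
def prompt_level_strict(per_prompt):
    """% of prompts where ALL constraints pass."""
    return 100.0 * sum(all(flags) for flags in per_prompt) / len(per_prompt)

def prompt_level_loose(per_prompt, threshold=0.5):
    """% of prompts whose fraction of passing constraints is >= threshold."""
    return 100.0 * sum(
        sum(flags) / len(flags) >= threshold for flags in per_prompt
    ) / len(per_prompt)

def constraint_level_strict(per_prompt):
    """% of individual constraint checks that pass."""
    total = sum(len(flags) for flags in per_prompt)
    return 100.0 * sum(sum(flags) for flags in per_prompt) / total

def adversarial_gap(std_csr, adv_csr):
    """Headline metric: Standard minus Adversarial CSR, in points."""
    return std_csr - adv_csr
```

For example, the 85% standard / 25% adversarial model discussed earlier yields `adversarial_gap(85.0, 25.0) == 60.0`.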
Source Passages
120 passages scraped from OpenStax (CC-BY licensed textbooks). 15 passages per subject across 8 disciplines:
Biology, Chemistry, Physics, Mathematics, Psychology, Economics, History, Computer Science
Each passage is 300-800 words with extracted key concepts and method/principle terms.
Why Verifiable Constraints?
Previous approaches use trained classifiers or LLM-as-judge to evaluate question quality. These have fundamental problems:
- Circular reasoning — A model judging a model. If the judge is wrong, all scores are wrong.
- Non-reproducible — Different runs of the same judge give different scores.
- Low agreement — Inter-annotator agreement for Bloom's classification is often <50%.
CogBench defines cognitive complexity through checkable structural properties. A Remember question must have an extractable answer (R4). An Analyze question must reference ≥2 concepts (A2). These are verifiable facts, not subjective judgments.
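For instance, R4's extractability check reduces to set membership over tokens. A minimal sketch (the tokenization and exact rule are assumptions; only the ≥60% threshold comes from the constraint map):

```python
import re

def _tokenize(text: str) -> list:
    """Lowercase word tokens; punctuation stripped."""
    return re.findall(r"[a-z0-9']+", text.lower())

def r4_extractable(answer: str, passage: str, threshold: float = 0.6) -> bool:
    """R4 sketch: at least 60% of answer words must appear in the passage."""
    ans = _tokenize(answer)
    vocab = set(_tokenize(passage))
    return bool(ans) and sum(w in vocab for w in ans) / len(ans) >= threshold
```

Any disagreement about whether a question is "really" a Remember question is replaced by a check anyone can rerun.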
Submit Your Model
Evaluate your LLM and get ranked on the leaderboard.
1. Install CogBench
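A typical install, assuming the package is published on PyPI under the name `cogbench` (check the CogBench repository for the authoritative instructions):

```shell
pip install cogbench
```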
Verify the installation:
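For example (the exact flag is an assumption; most CLIs expose a version check):

```shell
cogbench --version
```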
Or install from source:
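A sketch of a source install; substitute the repository URL from the CogBench project page:

```shell
# Replace <cogbench-repo-url> with the CogBench repository URL.
git clone <cogbench-repo-url>
cd cogbench
pip install -e .
```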
2. Run the Benchmark
Generate questions from your model across all 120 passages and 6 Bloom's levels in both standard and adversarial modes:
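By analogy with the `cogbench submit` form shown later in this section, a run might look like the following (the subcommand and flags are assumptions, not documented CogBench options):

```shell
# Runs all 120 passages x 6 levels in standard and adversarial modes.
cogbench run --model qwen2.5:14b
```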
This produces 1,440 questions (720 standard + 720 adversarial) and evaluates all 28 constraints automatically. Models are served via Ollama.
3. Submit Results
Submit your evaluated results to appear on the leaderboard. Requires the GitHub CLI (gh):
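Presumably something like the following, based on the fuller form shown at the end of this section (whether a bare `submit` picks up the most recent results is an assumption):

```shell
cogbench submit --name "Your Name"
```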
This packages your results, computes metrics, and creates a GitHub issue on the CogBench repo. The maintainers will review and add your model to the leaderboard.
To submit a specific model: `cogbench submit --model qwen2.5:14b --name "Your Name"`
Citation
If you use CogBench in your research, please cite: