CogBench

A verifiable constraint benchmark for evaluating cognitive-level control in LLM question generation using Bloom's Taxonomy

pip install cogbench
Deterministic evaluation — no LLM judge, no trained classifier, same input → same score

Leaderboard

Models ranked by Adversarial Gap. Lower gap = better cognitive understanding beyond keyword-stuffing.

Rank Model Params Std CSR Adv CSR Gap ↓ Constraint % Prompt %

Std CSR = Standard Constraint Satisfaction Rate (prompt-level strict). Adv CSR = Adversarial mode. Gap = Std − Adv (lower is better). Constraint % = Individual constraint pass rate. Prompt % = % of prompts where ALL constraints pass.

Verifiable Constraints

28 deterministic rules across 3 tiers. Every constraint returns pass/fail with no subjectivity.

Tier 1 — Structural

String matching, word counting, keyword detection. Zero external dependencies.

  • Is output a question? (ends with "?")
  • Word count within 5–150 range (minimum varies by level)
  • Contains level-appropriate vocabulary
  • Answer length and format checks
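A minimal sketch of what Tier 1 checks look like, assuming the simplest possible rules for U1, U2, and U4 (function names and exact regexes here are illustrative, not the actual cogbench implementation):

```python
import re

def is_question(text: str) -> bool:
    # U1-style check: output ends with "?" (a fuller implementation
    # would also accept interrogative/task-prompt phrasings).
    return text.strip().endswith("?")

def word_count_ok(text: str, lo: int = 10, hi: int = 150) -> bool:
    # U2-style check: whitespace-token count within a fixed range.
    n = len(text.split())
    return lo <= n <= hi

def not_degenerate(text: str, max_repeats: int = 3) -> bool:
    # U4-style check: non-empty, and no single word repeated more
    # than max_repeats times.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return max(counts.values()) <= max_repeats
```

Each check is pure string logic, so the same input always yields the same verdict.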

Tier 2 — Semantic

NLI-based entailment verification using DeBERTa-v3 (fixed threshold, not a judge).

  • D3: Answer is supported by passage (>65% entailment)
  • P2: Apply scenario is novel (not in passage)
  • C2: Create output is original (not extractable)
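The Tier 2 rules reduce to fixed thresholds applied to NLI probabilities. A sketch of that decision logic, taking the entailment probability as an input (the actual pipeline would obtain it from a DeBERTa-v3 NLI model; treating P2/C2 as "non-entailment mass exceeds the threshold" is our reading of the rule, not a confirmed detail):

```python
def d3_supported(entail_prob: float, threshold: float = 0.65) -> bool:
    # D3: pass when the passage entails the answer above a fixed threshold.
    return entail_prob > threshold

def p2_new_scenario(entail_prob: float, threshold: float = 0.55) -> bool:
    # P2: pass when the passage does NOT entail the scenario,
    # i.e. the non-entailment probability exceeds the threshold.
    return (1.0 - entail_prob) > threshold

def c2_novel(entail_prob: float, threshold: float = 0.60) -> bool:
    # C2: pass when the passage does NOT entail the answer.
    return (1.0 - entail_prob) > threshold
```

Because the thresholds are fixed constants, the model's probability output maps to a deterministic pass/fail rather than a graded judgment.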

Complete Constraint Map

Level | ID | Constraint | What It Checks | Tier
Universal | U1 | IsQuestion | Output is a question or task prompt (ends with "?" or uses interrogative/task structure) | Tier 1
Universal | U2 | WordCount | Question is 5–150 words (minimum 5 for Remember, 10 for others) | Tier 1
Universal | U3 | PassageRelevance | ≥1 key term (Remember) or ≥2 (others) from passage in Q+A | Tier 1
Universal | U4 | NoDegenerateOutput | No word repeated >3×, not empty | Tier 1
L1 Remember | R1 | Vocabulary | Uses recall verbs (define, list, identify…) or natural recall patterns (What is, How many, When did…) | Tier 1
L1 Remember | R2 | SingleConcept | Targets ≤2 key concepts (deduplicated: cell/cell cycle = 1) | Tier 1
L1 Remember | R3 | ShortAnswer | Reference answer ≤20 words | Tier 1
L1 Remember | R4 | Extractable | ≥60% of answer words in passage | Tier 1
L2 Understand | D1 | Vocabulary | Uses comprehension verbs (explain, describe…) or natural patterns (Why is, How does, What is the purpose…) | Tier 1
L2 Understand | D2 | NotCopied | <70% trigram overlap between answer and passage | Tier 1
L2 Understand | D3 | Supported | NLI: answer not contradicted by passage (contradiction <50%) | Tier 2
L2 Understand | D4 | AsksMeaning | Contains meaning markers (how, why, explain, describe…) | Tier 1
L3 Apply | P1 | Vocabulary | Uses apply verbs (calculate, solve…) or scenario framing (If a…, Given that…, A researcher…) | Tier 1
L3 Apply | P2 | NewScenario | NLI: passage does NOT entail scenario (>55%) | Tier 2
L3 Apply | P3 | MethodReference | Q+A references ≥1 key concept or method from passage | Tier 1
L3 Apply | P4 | SpecificResult | Answer contains concrete reasoning (numbers, causal logic, outcomes, method application) | Tier 1
L4 Analyze | A1 | Vocabulary | Uses analysis verbs (compare, contrast…) or analytical patterns (How do X differ, What is the relationship…) | Tier 1
L4 Analyze | A2 | MultipleConcepts | ≥2 key concepts in Q+A (with acronym matching) | Tier 1
L4 Analyze | A3 | Relationship | Contains relationship markers (between, differ, compare…) | Tier 1
L4 Analyze | A4 | AnswerCoverage | Answer addresses all concepts from question | Tier 1
L5 Evaluate | E1 | Vocabulary | Uses judgment verbs (assess, critique, justify…) | Tier 1
L5 Evaluate | E2 | Claim | Presents something to evaluate (should, justify, claim, position, strengths/weaknesses…) | Tier 1
L5 Evaluate | E3 | EvidenceRequest | Asks for evidence-based reasoning | Tier 1
L5 Evaluate | E4 | Argumentation | Answer contains argumentation (contrastive, causal, evaluative, or structured reasoning) | Tier 1
L6 Create | C1 | Vocabulary | Uses creation verbs (design, propose, develop…) | Tier 1
L6 Create | C2 | Novel | NLI: passage does NOT entail answer (>60%) | Tier 2
L6 Create | C3 | Specifications | ≥2 specification markers (must, should, include…) | Tier 1
L6 Create | C4 | SubstantialAnswer | Answer is >50 words | Tier 1
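As one concrete instance, the R4 extractability rule can be checked with plain set arithmetic. A sketch, assuming simple lowercase word tokenization (the real tokenizer may differ):

```python
import re

def extractable(answer: str, passage: str, min_overlap: float = 0.6) -> bool:
    # R4-style check: at least min_overlap (60%) of the answer's words
    # must appear verbatim somewhere in the passage.
    def tokenize(s: str) -> list:
        return re.findall(r"[a-z0-9']+", s.lower())
    answer_words = tokenize(answer)
    if not answer_words:
        return False
    passage_words = set(tokenize(passage))
    hits = sum(1 for w in answer_words if w in passage_words)
    return hits / len(answer_words) >= min_overlap
```

An answer copied from the passage passes trivially; a paraphrase that introduces mostly new words fails, which is exactly what distinguishes Remember from Understand here.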

Adversarial Mode

The headline metric: can LLMs satisfy cognitive constraints when forced to use misleading vocabulary?

What Is Adversarial Mode?

In standard mode, an LLM generates an "Analyze" question and naturally uses words like "compare" and "contrast". Easy.

In adversarial mode, the LLM must generate an "Analyze" question while using Remember vocabulary ("identify", "name", "list"). The structural constraints for Analyze (A2, A3, A4) must still pass.

This tests whether the model truly understands cognitive complexity or just keyword-stuffs.

The Adversarial Gap

Gap = Standard CSR − Adversarial CSR

If a model scores 85% standard but 25% adversarial, its 60pp gap means most "success" came from vocabulary cues, not genuine understanding.

Ideal model: Gap ≈ 0 (cognitive control is vocabulary-independent).

Weak model: Large gap (relies on keywords for apparent competence).

Adversarial Pairings

Generate This Level | Using Vocabulary From | What It Tests
L2 Understand | L1 Remember verbs | Can you explain using "identify" and "list"?
L3 Apply | L2 Understand verbs | Can you apply using "explain" and "describe"?
L4 Analyze | L1 Remember verbs | Can you analyze using "name" and "what is"?
L5 Evaluate | L2 Understand verbs | Can you judge using "summarize" and "discuss"?
L6 Create | L3 Apply verbs | Can you create using "solve" and "calculate"?
L1 Remember | L4 Analyze verbs | Can you recall facts using "compare" and "examine"?

Analysis

Visual breakdown of standard vs. adversarial performance across models and Bloom's levels.

Fig 1. Adversarial Gap per model. Standard CSR (blue) vs. Adversarial CSR (orange). The gap between them reveals keyword reliance.
Fig 2. Standard mode CSR by Bloom's level. Higher levels (Evaluate, Create) are typically harder.
Fig 3. Adversarial mode CSR by Bloom's level. The drop from standard reveals which levels rely most on vocabulary cues.
Fig 4. Per-constraint pass rate across all models (standard mode). Identifies which constraints are hardest to satisfy.

Methodology

How CogBench evaluates cognitive-level control with deterministic, verifiable constraints.

IFEval-Style Evaluation

Inspired by IFEval, every constraint is a deterministic rule. No trained classifier, no LLM-as-judge.

Same input → same score, every time

This eliminates inter-rater disagreement and makes results perfectly reproducible.

Bloom's Taxonomy Levels

  • L1 Remember — Recall facts directly from text
  • L2 Understand — Explain ideas in own words
  • L3 Apply — Use knowledge in new situations
  • L4 Analyze — Break apart and find relationships
  • L5 Evaluate — Make evidence-based judgments
  • L6 Create — Produce something original

Metrics

  • Prompt-level strict — % of questions where ALL constraints pass
  • Prompt-level loose — Same with score ≥0.5 threshold
  • Constraint-level strict — % of individual constraints that pass
  • Adversarial gap — Standard − Adversarial CSR (headline metric)
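A sketch of how these four metrics aggregate, assuming each prompt's results arrive as a list of per-constraint booleans (this shape is illustrative, not the actual cogbench output format):

```python
def compute_metrics(results):
    # results: list of prompts, each a list of per-constraint pass/fail booleans.
    total_constraints = sum(len(r) for r in results)
    passed_constraints = sum(sum(r) for r in results)
    return {
        # Prompt-level strict: every constraint on the prompt passes.
        "prompt_strict": sum(all(r) for r in results) / len(results),
        # Prompt-level loose: the prompt's pass ratio is >= 0.5.
        "prompt_loose": sum(sum(r) / len(r) >= 0.5 for r in results) / len(results),
        # Constraint-level strict: fraction of individual constraints passed.
        "constraint_strict": passed_constraints / total_constraints,
    }

def adversarial_gap(std_csr: float, adv_csr: float) -> float:
    # Headline metric: standard CSR minus adversarial CSR (lower is better).
    return std_csr - adv_csr
```

For a model with two prompts scoring [pass, pass] and [pass, fail], prompt-level strict is 0.5 while constraint-level strict is 0.75, which is why the two numbers can diverge on the leaderboard.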

Source Passages

120 passages scraped from OpenStax (CC-BY licensed textbooks). 15 passages per subject across 8 disciplines:

Biology, Chemistry, Physics, Mathematics, Psychology, Economics, History, Computer Science

Each passage is 300-800 words with extracted key concepts and method/principle terms.
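The per-passage record described above might look like the following dataclass sketch (field names are assumptions for illustration, not the actual cogbench schema):

```python
from dataclasses import dataclass, field

@dataclass
class Passage:
    # Illustrative schema for one benchmark passage.
    subject: str                 # one of the 8 disciplines
    text: str                    # 300-800 words of CC-BY OpenStax content
    key_concepts: list = field(default_factory=list)  # extracted key concepts
    methods: list = field(default_factory=list)       # method/principle terms

    def word_count(self) -> int:
        return len(self.text.split())
```

Constraints like U3 (passage relevance) and P3 (method reference) read from `key_concepts` and `methods`, so each passage carries its own checkable vocabulary.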

Why Verifiable Constraints?

Previous approaches use trained classifiers or LLM-as-judge to evaluate question quality. These have fundamental problems:

  • Circular reasoning — A model judging a model. If the judge is wrong, all scores are wrong.
  • Non-reproducible — Different runs of the same judge give different scores.
  • Low agreement — Inter-annotator agreement for Bloom's classification is often <50%.

CogBench defines cognitive complexity through checkable structural properties. A Remember question must have an extractable answer (R4). An Analyze question must reference ≥2 concepts (A2). These are verifiable facts, not subjective judgments.

Submit Your Model

Evaluate your LLM and get ranked on the leaderboard.

1. Install CogBench

pip install cogbench

Verify the installation:

cogbench info

Or install from source:

git clone https://github.com/cogbench/cogbench
cd cogbench && pip install -e .

2. Run the Benchmark

Generate questions from your model across all 120 passages and 6 Bloom's levels in both standard and adversarial modes:

cogbench run --model your-model:size --mode both

This produces 1,440 questions (720 standard + 720 adversarial) and evaluates all 28 constraints automatically. Models are served via Ollama.

3. Submit Results

Submit your evaluated results to appear on the leaderboard. Requires the GitHub CLI (gh):

cogbench submit --name "Your Name"

This packages your results, computes metrics, and creates a GitHub issue on the CogBench repo. The maintainers will review and add your model to the leaderboard.

To submit a specific model: cogbench submit --model qwen2.5:14b --name "Your Name"

Citation

If you use CogBench in your research, please cite:

@misc{kunuku2026cogbench,
  title  = {CogBench: A Verifiable Constraint Benchmark for Cognitive-Level Control in LLM Question Generation},
  author = {Kunuku, Mourya Teja and Nasrin, Dena},
  year   = {2026}
}