CogBench

A benchmark for evaluating cognitive-level control in LLM question generation using Bloom's Taxonomy.

Leaderboard

Models are ranked by CCS-Control score. For each model the leaderboard reports: Rank, Model, Params, CCS-Control, Exact Acc, Adjacent Acc, MAE, and Best Template.

Analysis

Visual breakdown of model performance across metrics, cognitive levels, and prompt strategies.

Fig 1. Multi-metric profile (top 10 models). CCS-Control (composite) balances exact accuracy, adjacent accuracy, and normalized MAE.
Fig 2. Per-level accuracy heatmap (top 10 models, best template). L2 Understand and L3 Apply form the "hard zone" where most models struggle.

Methodology

How CogBench evaluates cognitive-level control in LLM question generation.

CCS-Control Score

The primary ranking metric is a weighted composite:

CCS-Control = 0.5 × ExactAcc + 0.3 × AdjAcc + 0.2 × (1 − MAE/5)

The composite balances precise level matching (50%), near-miss tolerance (30%), and an ordinal distance penalty (20%). MAE is measured on the 1–6 level scale, so the maximum possible error is 5 and MAE/5 normalizes it to [0, 1].
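As a sketch, the composite can be computed directly from the formula above. This is a minimal Python implementation; the function name and input conventions (accuracies as fractions in [0, 1], MAE on the 1–6 level scale) are our own:

```python
def ccs_control(exact_acc: float, adj_acc: float, mae: float) -> float:
    """Weighted composite: 50% exact accuracy, 30% adjacent accuracy,
    20% normalized MAE penalty.

    exact_acc, adj_acc are fractions in [0, 1]; mae is the mean absolute
    error between target and classified Bloom levels (range 0 to 5).
    """
    return 0.5 * exact_acc + 0.3 * adj_acc + 0.2 * (1 - mae / 5)

# Illustrative numbers: 82% exact, 89.6% adjacent, mean level error 0.3.
score = ccs_control(0.82, 0.896, 0.3)
```

A perfect model (exact accuracy 1.0, MAE 0.0) scores exactly 1.0, so CCS-Control is itself bounded in [0, 1].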

Bloom's Taxonomy Levels

  • L1 Remember — Recall facts and basic concepts
  • L2 Understand — Explain ideas or concepts
  • L3 Apply — Use information in new situations
  • L4 Analyze — Draw connections among ideas
  • L5 Evaluate — Justify a stand or decision
  • L6 Create — Produce new or original work

Prompt Templates

  • Name Only — Just the level name (e.g., "Remember")
  • With Definition — Level name + Bloom's definition
  • With Exemplar — Definition + example question at that level
  • Chain of Thought — Step-by-step reasoning about the level

CCS Classifier

A fine-tuned BERT-base classifier trained on 739 expert-labeled questions and evaluated with 5-fold cross-validation.

Performance: 82.0% exact accuracy and 89.6% adjacent accuracy, outperforming all LLM baselines (best LLM: 74.0% exact accuracy).

Submit Your Model

Evaluate your LLM's cognitive-level control and get ranked on the leaderboard.

1. Get the Evaluation Kit

Fetch prompt templates, Bloom's level definitions, and submission format:

curl https://api.cogbench.us/api/evaluation-kit

2. Generate Questions

Use your LLM to generate questions at each of the 6 Bloom's Taxonomy levels. Include at least 1 question per level (recommended: 12+ total across multiple subjects).

Tip: Use the with_exemplar template for best results.
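The generation step can be sketched as a loop over levels and subjects. In this hedged example, `generate_question` is a hypothetical placeholder for a call to your own LLM, and the subject list is illustrative:

```python
# Bloom's Taxonomy levels as used by CogBench.
BLOOM_LEVELS = {
    1: "Remember", 2: "Understand", 3: "Apply",
    4: "Analyze", 5: "Evaluate", 6: "Create",
}

def generate_question(level_name: str, subject: str) -> str:
    # Hypothetical placeholder: replace with a call to your LLM using
    # one of the CogBench prompt templates (e.g. with_exemplar).
    return f"[{level_name}-level question about {subject}]"

def build_questions(subjects=("biology", "chemistry")):
    """Produce at least one question per Bloom level, across subjects."""
    questions = []
    for level, name in BLOOM_LEVELS.items():
        for subject in subjects:
            questions.append({
                "question_text": generate_question(name, subject),
                "target_level": level,
                "subject": subject,
            })
    return questions

questions = build_questions()  # 6 levels x 2 subjects = 12 questions
```

Covering every level across two subjects yields 12 questions, matching the recommended minimum.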

3. Submit for Scoring

POST your generated questions to the evaluation API. The CCS classifier scores each question and returns your CCS-Control score:

curl -X POST https://api.cogbench.us/api/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "Your-Model 7B",
    "model_params": "7B",
    "model_type": "external",
    "template": "with_exemplar",
    "questions": [
      {"question_text": "What is the chemical formula for water?", "target_level": 1, "subject": "chemistry"},
      {"question_text": "Explain how osmosis works.", "target_level": 2, "subject": "biology"},
      {"question_text": "Calculate the pH of 0.1M HCl.", "target_level": 3, "subject": "chemistry"},
      {"question_text": "Compare mitosis and meiosis.", "target_level": 4, "subject": "biology"},
      {"question_text": "Evaluate whether the controls are adequate.", "target_level": 5, "subject": "biology"},
      {"question_text": "Design an experiment to test growth factors.", "target_level": 6, "subject": "biology"}
    ]
  }'
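The same request can be issued from Python. This is a minimal standard-library sketch assuming the endpoint and JSON format shown in the curl example; the truncated question list here is illustrative only:

```python
import json
import urllib.request

API_URL = "https://api.cogbench.us/api/evaluate"

# Submission payload in the format expected by the evaluation API
# (shortened to two questions for illustration; a real submission
# covers all six Bloom levels).
payload = {
    "model_name": "Your-Model 7B",
    "model_params": "7B",
    "model_type": "external",
    "template": "with_exemplar",
    "questions": [
        {"question_text": "What is the chemical formula for water?",
         "target_level": 1, "subject": "chemistry"},
        {"question_text": "Explain how osmosis works.",
         "target_level": 2, "subject": "biology"},
    ],
}

def submit(payload: dict) -> dict:
    """POST the submission and return the parsed JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `submit(payload)` sends the request and returns the classifier's scoring response as a dict.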

Full API docs: api.cogbench.us/docs

Citation

If you use CogBench in your research, please cite:

@inproceedings{kunuku2026cogbench,
  title     = {CogBench: A Benchmark for Evaluating Cognitive-Level Control in LLM Question Generation},
  author    = {Kunuku, Mourya Teja and Nasrin, Dena},
  booktitle = {NeurIPS 2026 Datasets and Benchmarks Track},
  year      = {2026}
}