# CogBench

A benchmark for evaluating cognitive-level control in LLM question generation using Bloom's Taxonomy.
## Leaderboard

Models are ranked by CCS-Control score; click any column header to sort.
| Rank | Model | Params | CCS-Control | Exact Acc | Adj Acc | MAE | Best Template |
|---|---|---|---|---|---|---|---|
## Analysis
Visual breakdown of model performance across metrics, cognitive levels, and prompt strategies.
## Methodology
How CogBench evaluates cognitive-level control in LLM question generation.
### CCS-Control Score
The primary ranking metric is a weighted composite of three components: exact accuracy (50%), which rewards precise level matching; adjacent accuracy (30%), which tolerates near misses of one level; and an ordinal distance penalty (20%) derived from the mean absolute error (MAE) between intended and classified levels.
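The composite can be sketched as follows. This is an illustrative implementation under one assumption not stated above: that the distance penalty normalizes MAE by the maximum possible gap of five levels.

```python
def ccs_control(preds, targets):
    """Illustrative CCS-Control composite over predicted vs. intended
    Bloom levels (integers 1-6). Assumes the ordinal distance penalty
    is 1 - MAE/5, i.e., MAE normalized by the maximum possible gap."""
    n = len(preds)
    # Fraction of questions classified at exactly the intended level.
    exact = sum(p == t for p, t in zip(preds, targets)) / n
    # Fraction within one level of the intended level (near misses).
    adjacent = sum(abs(p - t) <= 1 for p, t in zip(preds, targets)) / n
    # Mean absolute error on the ordinal 1-6 scale.
    mae = sum(abs(p - t) for p, t in zip(preds, targets)) / n
    # Weighted composite: 50% exact, 30% adjacent, 20% distance penalty.
    return 0.5 * exact + 0.3 * adjacent + 0.2 * (1 - mae / 5)
```

A model whose questions always land on the intended level scores 1.0; systematic misses by several levels drag down all three terms.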
### Bloom's Taxonomy Levels
- L1 Remember — Recall facts and basic concepts
- L2 Understand — Explain ideas or concepts
- L3 Apply — Use information in new situations
- L4 Analyze — Draw connections among ideas
- L5 Evaluate — Justify a stand or decision
- L6 Create — Produce new or original work
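Because the levels form an ordered scale, the distance between them is meaningful, which is what the adjacent-accuracy and MAE metrics rely on. A minimal sketch of that ordinal structure:

```python
from enum import IntEnum

# The six Bloom levels as ordinals (names follow the list above), so
# that distances between levels are well defined.
class BloomLevel(IntEnum):
    REMEMBER = 1
    UNDERSTAND = 2
    APPLY = 3
    ANALYZE = 4
    EVALUATE = 5
    CREATE = 6

def ordinal_distance(a: BloomLevel, b: BloomLevel) -> int:
    """Absolute distance on the taxonomy's ordinal scale (0-5)."""
    return abs(int(a) - int(b))
```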
### Prompt Templates
- Name Only — Just the level name (e.g., "Remember")
- With Definition — Level name + Bloom's definition
- With Exemplar — Definition + example question at that level
- Chain of Thought — Step-by-step reasoning about the level
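The four strategies differ only in how much scaffolding the prompt gives the model. The strings below are hypothetical illustrations of that progression, not the official templates, which ship with the evaluation kit:

```python
# Hypothetical template strings illustrating the four strategies; the
# real templates come from the CogBench evaluation kit.
TEMPLATES = {
    "name_only": "Write a question at the Bloom's level: {level}.",
    "with_definition": (
        "Write a question at the Bloom's level: {level}. "
        "Definition: {definition}"
    ),
    "with_exemplar": (
        "Write a question at the Bloom's level: {level}. "
        "Definition: {definition} Example question: {exemplar}"
    ),
    "chain_of_thought": (
        "Think step by step about what the Bloom's level '{level}' "
        "requires of a question, then write a question at that level."
    ),
}

prompt = TEMPLATES["name_only"].format(level="Remember")
```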
### CCS Classifier
A BERT-base model fine-tuned on 739 expert-labeled questions and evaluated with 5-fold cross-validation.
Performance: 82.0% exact accuracy and 89.6% adjacent accuracy, outperforming all LLM baselines (best LLM: 74.0% exact accuracy).
## Submit Your Model
Evaluate your LLM's cognitive-level control and get ranked on the leaderboard.
### 1. Get the Evaluation Kit
Fetch the prompt templates, Bloom's level definitions, and the required submission format from the evaluation API.
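A sketch of the fetch in Python's standard library. The endpoint path `/v1/kit` is an assumption for illustration; the actual path is listed in the API docs at api.cogbench.us/docs.

```python
import urllib.request

# Hypothetical endpoint path -- check api.cogbench.us/docs for the real
# one. The request object is built here but not sent; uncomment the
# last line to fetch the kit over the network.
req = urllib.request.Request(
    "https://api.cogbench.us/v1/kit",  # assumed path
    headers={"Accept": "application/json"},
    method="GET",
)
# kit = json.load(urllib.request.urlopen(req))  # templates, definitions, format
```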
### 2. Generate Questions
Use your LLM to generate questions at each of the six Bloom's Taxonomy levels. Include at least one question per level (12+ total across multiple subjects is recommended).
Tip: use the `with_exemplar` template for best results.
### 3. Submit for Scoring
POST your generated questions to the evaluation API; the CCS classifier scores each question and returns your CCS-Control score.
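A sketch of the submission in Python's standard library. Both the endpoint path `/v1/submit` and the payload shape are assumptions for illustration; the authoritative submission format comes with the evaluation kit.

```python
import json
import urllib.request

# Hypothetical payload shape -- the evaluation kit defines the real
# submission format. One entry per generated question.
submission = {
    "model": "my-llm-v1",  # your model's name on the leaderboard
    "questions": [
        {"target_level": 1, "template": "with_exemplar",
         "question": "What year did the French Revolution begin?"},
        # ... at least one question per level, 12+ recommended
    ],
}

# Hypothetical endpoint path -- check api.cogbench.us/docs for the real
# one. Built but not sent; uncomment the last line to submit.
req = urllib.request.Request(
    "https://api.cogbench.us/v1/submit",  # assumed path
    data=json.dumps(submission).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# result = json.load(urllib.request.urlopen(req))  # includes your CCS-Control score
```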
Full API docs: api.cogbench.us/docs
## Citation
If you use CogBench in your research, please cite: