Get Started
Evaluate your LLM's cognitive-level control using our standardized evaluation kit. Your model generates questions at specified Bloom's Taxonomy levels, and our CCS classifier scores how well it controlled the cognitive complexity. Every leaderboard entry runs the same protocol.
How It Works
CogBench measures whether your LLM can generate questions at the right cognitive level. Ask it to generate an "Analyze-level" question — does it actually produce one?
A single Python file that handles everything: generating questions with your LLM using standardized prompts, and submitting them to the CCS API for scoring. Its only dependencies are requests and, optionally, tqdm.
The script includes all 4 prompt templates, 6 Bloom's level definitions, 8 subjects, and exemplar questions — everything needed to run a standardized evaluation.
Point the script at your model. It generates 144 questions (6 Bloom's levels × 8 subjects × 3 samples per combination) using a standardized prompt template. Takes ~15-40 minutes depending on model speed.
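The 6 × 8 × 3 prompt grid can be sketched as a simple cross product. The level and subject lists and the template string below are illustrative stand-ins; the real script ships its own definitions and four templates.

```python
from itertools import product

# Illustrative lists -- the actual script defines its own levels and subjects.
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]
SUBJECTS = ["biology", "chemistry", "physics", "mathematics",
            "history", "literature", "economics", "geography"]
SAMPLES_PER_COMBO = 3

# Stand-in for one of the script's prompt templates (name_only style).
TEMPLATE = "Generate a {level}-level question about {subject}."

prompts = [
    {"level": level, "subject": subject, "sample": i,
     "prompt": TEMPLATE.format(level=level, subject=subject)}
    for level, subject in product(BLOOM_LEVELS, SUBJECTS)
    for i in range(SAMPLES_PER_COMBO)
]
assert len(prompts) == 144  # 6 levels x 8 subjects x 3 samples
```

Each prompt is then sent to your model, and the 144 responses are collected for scoring.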
Your model is prompted to generate a question at each target level, and it responds with a generated question. Every submission uses the same 144 prompts, ensuring fair comparison.
When generation finishes, the script automatically submits your questions to the CogBench API. Our CCS classifier (BERT-base, 82% accuracy) predicts the actual Bloom's level of each generated question. The score measures how well your model controlled the cognitive level — did a question prompted at "Analyze" actually come out as Analyze?
Use --no-submit to generate without submitting. Questions are saved locally as JSON — you can submit later via the API.
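The local-save and later-submit flow looks roughly like the sketch below. The API URL and payload shape are assumptions for illustration, not the documented endpoint; the downloadable script contains the real submission logic.

```python
import json

API_URL = "https://cogbench.example/api/submit"  # hypothetical endpoint

def save_locally(questions, path="cogbench_questions.json"):
    """Mirror of the script's --no-submit behavior: keep questions as JSON."""
    with open(path, "w") as f:
        json.dump(questions, f, indent=2)

def submit(questions):
    """Submit saved questions later; payload shape is illustrative."""
    import requests  # the script's one required dependency
    resp = requests.post(API_URL, json={"questions": questions})
    resp.raise_for_status()
    return resp.json()  # per-question CCS predictions and the overall score

questions = [{"level": "Analyze", "subject": "biology",
              "question": "Compare aerobic and anaerobic respiration..."}]
save_locally(questions)
```

Saving locally first also gives you a record of exactly what was scored, which is useful when comparing runs across templates.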
Each generated question is scored by the CCS classifier — a BERT-base model fine-tuned on 739 expert-labeled questions achieving 82% accuracy. The classifier predicts the actual Bloom's level of your generated question, and the score measures the match between target and predicted levels.
CCS-Control = 0.5 × ExactAcc + 0.3 × AdjAcc + 0.2 × (1 − MAE/5)
- Exact Accuracy (50%) — How often the CCS-predicted level exactly matches the level you asked for
- Adjacent Accuracy (30%) — How often it's within ±1 level. Rewards near-misses (e.g., generating L3 when asked for L4)
- Normalized MAE (20%) — One minus the mean absolute error between target and predicted levels, divided by 5 to fit a 0-1 scale, so smaller average errors yield higher scores
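The formula above combines the three components directly; a minimal computation, assuming levels are encoded as integers 1-6 (Remember=1 through Create=6):

```python
def ccs_control(targets, predictions):
    """CCS-Control = 0.5*ExactAcc + 0.3*AdjAcc + 0.2*(1 - MAE/5)."""
    n = len(targets)
    exact = sum(t == p for t, p in zip(targets, predictions)) / n
    adjacent = sum(abs(t - p) <= 1 for t, p in zip(targets, predictions)) / n
    mae = sum(abs(t - p) for t, p in zip(targets, predictions)) / n
    return 0.5 * exact + 0.3 * adjacent + 0.2 * (1 - mae / 5)

# Perfect control scores 1.0. A model that is off by exactly one level on
# every question scores 0.5*0 + 0.3*1 + 0.2*(1 - 1/5) = 0.46.
```

Note the floor effect: because adjacent accuracy and the MAE term reward near-misses, a consistently off-by-one model still scores 0.46, so leaderboard differences above that range reflect genuine exact-level control.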
The per-level breakdown reveals where your model excels and where it struggles. Common patterns: most LLMs score well on L1 (Remember) and L4 (Analyze), but struggle to generate true L3 (Apply) questions, often producing L4 (Analyze) instead. Confusion between L5 (Evaluate) and L6 (Create) is also very common.
Templates dramatically affect scores. Our benchmark runs show that with_exemplar consistently produces the best CCS-Control scores across all models (+15% over name_only on average). The leaderboard shows each model's best template result. See the Templates page for details.
FAQ
Common questions about the CogBench evaluation process.
Each question is submitted as [subject] question_text and classified into levels 1-6.

A pip install cogbench package is planned, with direct HuggingFace transformers support (cogbench run --model meta-llama/Llama-3.1-8B). For now, the downloadable script provides the same standardized evaluation.