How It Works

CogBench measures whether your LLM can generate questions at the right cognitive level. Ask it to generate an "Analyze-level" question — does it actually produce one?

1. Download: Get the evaluation script (single Python file, no dataset needed)
2. Generate: Your LLM generates 144 questions across 6 levels × 8 subjects
3. Score: The CCS classifier predicts the actual Bloom's level of each question
4. Leaderboard: Your CCS-Control score appears on the public leaderboard
1 Download the Evaluation Script

A single Python file that handles everything — generating questions with your LLM using standardized prompts, and submitting them to the CCS API for scoring. The only dependencies are requests and, optionally, tqdm (for the progress bar).

curl -O https://cogbench.us/cogbench_eval.py
pip install requests tqdm

The script includes all 4 prompt templates, 6 Bloom's level definitions, 8 subjects, and exemplar questions — everything needed to run a standardized evaluation.

2 Generate Questions

Point the script at your model. It generates 144 questions (6 Bloom's levels × 8 subjects × 3 samples per combination) using a standardized prompt template. Takes ~15-40 minutes depending on model speed.

Ollama (local model)
python cogbench_eval.py --model llama3.1:8b --backend ollama \
  --display-name "Llama 3.1 8B" --params "8B"
OpenAI API
python cogbench_eval.py --model gpt-4o --backend openai --api-key sk-...
Any OpenAI-compatible endpoint (vLLM, LMStudio, Together)
python cogbench_eval.py --model my-model --backend openai \
  --api-base http://localhost:8000/v1 --api-key token
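Under the hood, any OpenAI-compatible backend reduces to one HTTP call per prompt. A minimal sketch of such a call — the function names here are illustrative, not the script's actual internals, and the script may set its own sampling parameters:

```python
import json
import urllib.request

def build_chat_request(api_base: str, model: str, prompt: str) -> dict:
    """Assemble the URL and body an OpenAI-compatible /chat/completions
    endpoint expects. Illustrative; cogbench_eval.py's internals may differ."""
    return {
        "url": f"{api_base.rstrip('/')}/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }

def generate(api_base: str, api_key: str, model: str, prompt: str) -> str:
    """Send one prompt and return the model's text reply."""
    req = build_chat_request(api_base, model, prompt)
    http_req = urllib.request.Request(
        req["url"],
        data=json.dumps(req["body"]).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(http_req, timeout=120) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"].strip()
```

Anything that speaks this wire format — vLLM, LMStudio, Together, OpenAI itself — works with the same two flags, --api-base and --api-key.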
Choose a template (default: with_exemplar)
Name Only · With Definition · With Exemplar · Chain of Thought
# Use a specific template
python cogbench_eval.py --model llama3.1:8b --backend ollama --template with_definition

# Run all 4 templates (576 questions total, comprehensive evaluation)
python cogbench_eval.py --model llama3.1:8b --backend ollama --all-templates
What the prompts look like (with_exemplar example)
Generate a question about biology at Bloom's Taxonomy Analyze level.

At the Analyze level, students should be able to break material into its constituent parts and determine how the parts relate to one another and to an overall structure or purpose. Typical action verbs: compare, contrast, examine, differentiate, categorize, organize.

Here is an example of an Analyze-level question:
"Compare and contrast the processes of mitosis and meiosis in terms of their outcomes and biological significance."

Generate a DIFFERENT question at the same cognitive level about biology. Respond with ONLY the question. No explanation, no preamble.

The model responds with a single generated question. Every submission uses the exact same 144 prompts, ensuring a fair comparison across models.
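A with_exemplar prompt like the one above is plain string composition. A sketch of the assembly — the template string is a reconstruction from the example, and the script ships the authoritative wording for all 6 levels and 8 subjects:

```python
# Illustrative reconstruction of the with_exemplar template.
WITH_EXEMPLAR = (
    "Generate a question about {subject} at Bloom's Taxonomy {level} level.\n\n"
    "{definition}\n\n"
    "Here is an example of an {level}-level question:\n\"{exemplar}\"\n\n"
    "Generate a DIFFERENT question at the same cognitive level about {subject}. "
    "Respond with ONLY the question. No explanation, no preamble."
)

prompt = WITH_EXEMPLAR.format(
    subject="biology",
    level="Analyze",
    definition=(
        "At the Analyze level, students should be able to break material into "
        "its constituent parts and determine how the parts relate to one another "
        "and to an overall structure or purpose. Typical action verbs: compare, "
        "contrast, examine, differentiate, categorize, organize."
    ),
    exemplar=(
        "Compare and contrast the processes of mitosis and meiosis in terms of "
        "their outcomes and biological significance."
    ),
)
```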

8 subjects
Biology · Chemistry · Physics · Mathematics · Psychology · Economics · History · Computer Science
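The 144 prompts are simply the cross product of levels, subjects, and sample indices. A sketch of the enumeration (variable names are illustrative):

```python
from itertools import product

BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]
SUBJECTS = ["biology", "chemistry", "physics", "mathematics",
            "psychology", "economics", "history", "computer science"]
SAMPLES_PER_CELL = 3

# Every submission runs this same fixed grid: 6 levels x 8 subjects x 3 samples = 144.
jobs = [
    {"level": level, "subject": subject, "sample": i}
    for level, subject, i in product(BLOOM_LEVELS, SUBJECTS, range(SAMPLES_PER_CELL))
]
```

The fixed grid is what makes the per-level breakdown meaningful: every level gets exactly 24 questions, spread evenly across all 8 subjects.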
3 Automatic CCS Scoring

When generation finishes, the script automatically submits your questions to the CogBench API. Our CCS classifier (BERT-base, 82% accuracy) predicts the actual Bloom's level of each generated question. The score measures how well your model controlled the cognitive level — did a question prompted at "Analyze" actually come out as Analyze?

Example Output
============================================================
RESULTS --- Llama 3.1 8B (with_exemplar)
------------------------------------------------------------
CCS-Control Score:  0.8163
Exact Accuracy:     76.3%
Adjacent Accuracy:  84.6%
MAE:                0.4744
Questions Scored:   144
------------------------------------------------------------
Per-Level Breakdown:
  L1: 91.7% exact  95.8% adj (n= 24) ██████████████████
  L2: 79.2% exact  91.7% adj (n= 24) ███████████████
  L3: 62.5% exact  75.0% adj (n= 24) ████████████
  L4: 83.3% exact  87.5% adj (n= 24) ████████████████
  L5: 70.8% exact  83.3% adj (n= 24) ██████████████
  L6: 70.8% exact  75.0% adj (n= 24) ██████████████
============================================================
View leaderboard: https://cogbench.us
Dry Run (no submission)

Use --no-submit to generate without submitting. Questions are saved locally as JSON — you can submit later via the API.
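Manual submission is a plain JSON POST. A minimal Python equivalent of the curl call — the saved file's schema is defined by the script and not reproduced here, and the filename glob assumes the script's naming convention:

```python
import glob
import json
import urllib.request

API_URL = "https://api.cogbench.us/api/evaluate"

def build_request(payload: bytes) -> urllib.request.Request:
    """Wrap a saved results file as the JSON POST the API expects."""
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def submit(path: str) -> dict:
    """Submit one saved results file and return the API's response."""
    with open(path, "rb") as f:
        req = build_request(f.read())
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Matches files like cogbench_llama3_1_8b_with_exemplar_*.json
    for path in glob.glob("cogbench_*_with_exemplar_*.json"):
        print(submit(path))
```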

python cogbench_eval.py --model llama3.1:8b --backend ollama --no-submit

# Submit later manually
curl -X POST https://api.cogbench.us/api/evaluate \
  -H 'Content-Type: application/json' \
  -d @cogbench_llama3_1_8b_with_exemplar_*.json
4 Understand the Scoring

Each generated question is scored by the CCS classifier — a BERT-base model fine-tuned on 739 expert-labeled questions achieving 82% accuracy. The classifier predicts the actual Bloom's level of your generated question, and the score measures the match between target and predicted levels.

CCS-Control Formula

CCS-Control = 0.5 × ExactAcc + 0.3 × AdjAcc + 0.2 × (1 − MAE/5)

  • Exact Accuracy (50%) — How often the CCS-predicted level exactly matches the level you asked for
  • Adjacent Accuracy (30%) — How often it's within ±1 level. Rewards near-misses (e.g., generating L3 when asked for L4)
  • Normalized MAE (20%) — Mean absolute distance between target and predicted levels, divided by 5 (the maximum possible distance); lower MAE raises the score via the 1 − MAE/5 term
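In code the formula is a one-liner, and plugging in the numbers from the example output above reproduces its headline score:

```python
def ccs_control(exact_acc: float, adj_acc: float, mae: float) -> float:
    """CCS-Control = 0.5*ExactAcc + 0.3*AdjAcc + 0.2*(1 - MAE/5).

    exact_acc and adj_acc are fractions in [0, 1]; mae is the mean
    absolute level error, between 0 and 5.
    """
    return 0.5 * exact_acc + 0.3 * adj_acc + 0.2 * (1 - mae / 5)

# Example run: 76.3% exact, 84.6% adjacent, MAE 0.4744
score = ccs_control(0.763, 0.846, 0.4744)  # ≈ 0.8163
```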
What's a good score?
  • > 0.93 — Excellent: top-tier, precise cognitive-level control
  • 0.85 – 0.93 — Strong: good control, minor confusion at adjacent levels
  • 0.75 – 0.85 — Moderate: struggles with the name_only template, better with exemplars
  • < 0.75 — Needs work: significant level confusion, often defaults to L4

The per-level breakdown reveals where your model excels and struggles. Common patterns: most LLMs score well on L1 (Remember) and L4 (Analyze), but struggle to generate true L3 (Apply) questions — they often produce L4 (Analyze) instead. L5 (Evaluate) vs L6 (Create) confusion is also very common.

Template impact

Templates dramatically affect scores. Our benchmark shows with_exemplar consistently produces the best CCS-Control scores across all models (+15% vs name_only on average). The leaderboard shows each model's best template result. See the Templates page for details.

FAQ

Common questions about the CogBench evaluation process.

What is cognitive-level control?
It measures whether an LLM can generate content at a specific cognitive complexity level. If you ask it to generate a "Remember-level" question, does it produce a simple recall question? Or does it accidentally generate an "Analyze-level" comparison question instead? CogBench quantifies this ability using Bloom's Taxonomy.
Why 144 questions per template?
6 Bloom's levels × 8 subjects × 3 samples = 144 questions. This gives 24 questions per level (enough for statistical significance per-level), across diverse subjects to test generalization. Running all 4 templates gives 576 questions for a comprehensive evaluation. The leaderboard shows your best template result.
Is the evaluation fair for everyone?
Yes. Every submission generates from the exact same prompts — same templates, same Bloom's definitions, same subjects, same number of samples. The CCS classifier scores every submission identically. No cherry-picking or custom prompting.
Can I use any LLM?
Yes. Any model that generates text — local models (Ollama, vLLM, llama.cpp), API models (OpenAI, Anthropic, Together, Gemini), or custom fine-tuned models. The script supports Ollama and any OpenAI-compatible endpoint out of the box.
Do I need a GPU?
Only for local models. If you're running Ollama or vLLM locally, you need a GPU. If calling an API (OpenAI, Together, etc.), no GPU needed — the script just makes HTTP requests. CCS scoring happens on our server.
How long does it take?
~15-40 minutes per template on a local model via Ollama (~1-3 seconds per generation). API models are typically faster (5-15 minutes). All 4 templates take ~1-2 hours total. The script shows a real-time progress bar.
How does the CCS classifier work?
The CCS (Cognitive Control Scorer) is a BERT-base model fine-tuned on 739 expert-labeled questions to predict Bloom's Taxonomy levels. It achieves 82% exact accuracy and 89.6% adjacent accuracy (MAE = 0.356) under 5-fold cross-validation. Questions are formatted as [subject] question_text and classified into levels 1-6.
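The input the classifier sees is just the question prefixed with its subject. A sketch of that formatting step (the function name is illustrative):

```python
def format_ccs_input(subject: str, question_text: str) -> str:
    """Prefix a question with its subject, in the [subject] question_text
    form the CCS classifier is trained on."""
    return f"[{subject}] {question_text}"

example = format_ccs_input(
    "biology",
    "Compare and contrast the processes of mitosis and meiosis.",
)
```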
Will there be a pip package?
Yes — pip install cogbench is planned with direct HuggingFace transformers support: cogbench run --model meta-llama/Llama-3.1-8B. For now, the downloadable script provides the same standardized evaluation.