Prompt Templates
CogBench evaluates models using four prompt templates of increasing specificity. Each template gives the LLM a different level of guidance about the target Bloom's Taxonomy level, testing how well the model responds to various instruction formats.
The Four Templates
Ordered from least to most guidance. Each template is described below, along with its typical performance.
The minimal baseline. The LLM receives only the Bloom's level name and the subject. No definitions, no examples, no scaffolding. Tests whether the model has internalized Bloom's Taxonomy from pretraining alone.
Typical performance: Models average ~55% exact accuracy. Without guidance, most LLMs default to higher cognitive levels (L4-L5) regardless of the target.
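As a rough illustration (CogBench's exact prompt wording is not shown here, so the function name and phrasing below are hypothetical), a name-only template might be assembled like this:

```python
def build_minimal_prompt(level_name: str, subject: str) -> str:
    """Hypothetical sketch of the minimal baseline: only the Bloom's
    level name and the subject, with no definition, verbs, or example."""
    return (
        f"Write one {subject} assessment question at the "
        f"Bloom's Taxonomy level '{level_name}'."
    )

prompt = build_minimal_prompt("Analyze", "cell biology")
```

Because the model receives no definition of the level, all of the cognitive-level knowledge has to come from pretraining, which is exactly what this baseline probes.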
Adds the Bloom's level definition and action verbs. The LLM now knows what cognitive process the level requires and which verbs characterize it. Tests whether explicit pedagogical scaffolding improves level targeting.
Typical performance: Significant jump to ~73% exact accuracy. The definitions help models distinguish between adjacent levels (especially L2 vs L3 and L4 vs L5).
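A definition-augmented template could extend the minimal prompt with the level's definition and action verbs. Again a hedged sketch with hypothetical names, not CogBench's actual prompt text:

```python
def build_definition_prompt(
    level_name: str, definition: str, action_verbs: list[str], subject: str
) -> str:
    """Hypothetical sketch: adds the Bloom's level definition and its
    characteristic action verbs on top of the name-only baseline."""
    return (
        f"Bloom's level: {level_name}\n"
        f"Definition: {definition}\n"
        f"Characteristic verbs: {', '.join(action_verbs)}\n\n"
        f"Write one {subject} assessment question at this level."
    )

prompt = build_definition_prompt(
    "Apply",
    "Use acquired knowledge and procedures in new situations.",
    ["execute", "implement", "solve"],
    "physics",
)
```

Spelling out the definition gives the model an explicit boundary between adjacent levels, which is consistent with the accuracy jump on L2 vs L3 and L4 vs L5 noted above.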
Adds a subject-specific example question at the target level. The LLM gets the definition, verbs, and a concrete demonstration of what a correct question looks like. This is the best-performing template for most models.
Typical performance: Best overall at ~87% exact accuracy. The exemplar anchors the model's understanding, dramatically reducing overshooting at lower levels (L1-L2).
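The exemplar template adds one more field: a subject-specific example question at the target level. A minimal sketch under the same assumptions (hypothetical names and phrasing):

```python
def build_exemplar_prompt(
    level_name: str,
    definition: str,
    action_verbs: list[str],
    subject: str,
    exemplar: str,
) -> str:
    """Hypothetical sketch: definition, verbs, plus a concrete example
    question at the target level to anchor the model's output."""
    return (
        f"Bloom's level: {level_name}\n"
        f"Definition: {definition}\n"
        f"Characteristic verbs: {', '.join(action_verbs)}\n"
        f"Example question at this level: {exemplar}\n\n"
        f"Write one new {subject} assessment question at this level."
    )

prompt = build_exemplar_prompt(
    "Remember",
    "Retrieve relevant knowledge from long-term memory.",
    ["recall", "list", "define"],
    "chemistry",
    "List the first five elements of the periodic table.",
)
```

Asking for a "new" question alongside the exemplar is the key design choice: the example anchors the cognitive level without being copied verbatim.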
Asks the model to reason step-by-step about what makes a question at the target level before generating one. The LLM must first explain the cognitive process, then produce a question. Tests whether explicit reasoning improves level precision.
Typical performance: ~77% exact accuracy. Reasoning helps with harder levels (L5-L6) but can introduce overthinking on simpler ones (L1-L2). Higher latency due to two-step output.
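A chain-of-thought template replaces the exemplar with an explicit two-step instruction: reason about the level first, then generate. As before, this is an illustrative sketch, not CogBench's actual wording:

```python
def build_cot_prompt(level_name: str, subject: str) -> str:
    """Hypothetical sketch of a chain-of-thought template: the model
    must explain the target cognitive process before writing a question."""
    return (
        f"Target Bloom's level: {level_name}\n"
        f"Subject: {subject}\n\n"
        "Step 1: Explain what cognitive process a question at this level "
        "must elicit from a student.\n"
        "Step 2: Then write exactly one question at this level."
    )

prompt = build_cot_prompt("Evaluate", "world history")
```

The two-step structure is what drives both the gain on hard levels and the extra latency: the model emits a reasoning passage before the question itself.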
Template Comparison
Average CCS-Control score across all models for each prompt template.
Recommendation
Use the with_exemplar template for the best CCS-Control scores. It provides the optimal balance of guidance — enough context to anchor the model's understanding of each Bloom's level, without the overthinking risk of chain-of-thought. When submitting to the CogBench leaderboard, results from any template are accepted, but we recommend evaluating with at least with_exemplar for comparability.