RFC 0006: Evals Extension
- Status: Implemented
- Author(s): AltairaLabs Team
- Created: 2026-02-14
- Updated: 2026-02-14
- Related Issues: #14 — Add evals section to PromptPack schema
Summary
Extend the PromptPack specification to support eval definitions at two levels — per-prompt and pack-wide — with Prometheus-inspired metric declarations. This allows PromptPack authors to ship eval rubrics alongside the prompts and guardrails they already maintain, so that the entire quality policy is versioned and deployed as a single unit.
Motivation
Validators (guardrails) and evals sit on the same quality spectrum:
| Aspect | Validators | Evals |
|---|---|---|
| Runs when? | Inline, every response | Async, on schedule or trigger |
| Blocks output? | Yes (if fail_on_violation) | No — scores and reports |
| Typical author | Prompt engineer | Same prompt engineer |
| Lives where today | In the PromptPack | Outside the pack |
The PromptPack author who writes guardrails is typically the same person who writes eval rubrics. Both should live in the pack and be versioned and deployed together. Keeping evals outside the pack leads to:
- Drift — evals fall out of sync with the prompts they measure
- Fragmentation — eval config scattered across CI pipelines, notebooks, and dashboards
- Lost portability — a PromptPack can't describe its own quality contract
Goals
- Allow authors to declare evals that travel with the pack
- Support both prompt-scoped and cross-cutting (pack-level) evals
- Provide a Prometheus-style metric envelope so runtimes know how to expose results
- Keep the spec declarative — define the what, not the how
Non-Goals
- Define eval execution semantics or scheduling
- Prescribe specific LLM-judge prompts or scoring algorithms
- Introduce a test-runner or CI integration into the spec
- Define metric collection, storage, or alerting infrastructure
Detailed Design
Schema Changes
New Definition: Eval
{
"Eval": {
"type": "object",
"description": "An eval definition that declares how to assess LLM output quality. Evals run asynchronously and produce scores or metrics, unlike validators which run inline and block.",
"required": ["id", "type", "trigger"],
"additionalProperties": false,
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for this eval within its scope (prompt-level or pack-level).",
"examples": ["tone-check", "brand-consistency", "factual-accuracy"]
},
"description": {
"type": "string",
"description": "Human-readable description of what this eval measures and why it matters."
},
"type": {
"type": "string",
"description": "The assertion type that determines how this eval is executed. Not an enum — runtimes define and register their own types.",
"examples": ["llm_judge", "cosine_similarity", "regex_match", "human_review", "bleu_score"]
},
"trigger": {
"type": "string",
"description": "When this eval should be triggered.",
"enum": ["every_turn", "on_session_complete", "sample_turns", "sample_sessions"]
},
"sample_percentage": {
"type": "number",
"description": "Percentage of turns or sessions to sample when trigger is sample_turns or sample_sessions. Ignored for other trigger types.",
"default": 5,
"minimum": 0,
"maximum": 100,
"examples": [5, 10, 25, 50]
},
"enabled": {
"type": "boolean",
"description": "Whether this eval is active. Allows temporarily disabling evals without removing them.",
"default": true
},
"metric": {
"type": "object",
"description": "Prometheus-style metric declaration describing the output shape of this eval. Runtimes use this to expose results to monitoring systems.",
"required": ["name", "type"],
"additionalProperties": true,
"properties": {
"name": {
"type": "string",
"description": "Metric name. Should follow Prometheus naming conventions (snake_case, namespace prefix recommended).",
"examples": ["promptpack_tone_score", "promptpack_brand_consistency", "promptpack_factual_accuracy"]
},
"type": {
"type": "string",
"description": "Prometheus metric type that describes the value semantics.",
"enum": ["gauge", "counter", "histogram", "boolean"]
},
"range": {
"type": "object",
"description": "Optional value bounds. Useful for gauge metrics with known ranges.",
"properties": {
"min": {
"type": "number",
"description": "Minimum expected value"
},
"max": {
"type": "number",
"description": "Maximum expected value"
}
},
"examples": [{"min": 0, "max": 1}, {"min": 1, "max": 5}]
}
}
},
"params": {
"type": "object",
"description": "Type-specific configuration for the eval. Structure depends on the eval type — runtimes interpret these based on the type field.",
"additionalProperties": true,
"examples": [
{
"judge_prompt": "Rate the response tone on a scale of 1-5 for professionalism.",
"model": "gpt-4o",
"passing_score": 4
},
{
"reference_embeddings": "brand_voice_v2",
"threshold": 0.85
}
]
}
}
}
}
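Because the Eval envelope is plain JSON Schema, it can be exercised with any off-the-shelf validator. The sketch below uses Python's `jsonschema` package against a hand-trimmed copy of the definition above; the trimmed schema and the sample object are illustrative only, not the canonical pack schema.

```python
# A minimal check of a single eval object against a hand-trimmed copy of the
# Eval definition above. Requires `pip install jsonschema`.
from jsonschema import validate

EVAL_SCHEMA = {
    "type": "object",
    "required": ["id", "type", "trigger"],
    "additionalProperties": False,
    "properties": {
        "id": {"type": "string"},
        "description": {"type": "string"},
        "type": {"type": "string"},
        "trigger": {
            "type": "string",
            "enum": ["every_turn", "on_session_complete",
                     "sample_turns", "sample_sessions"],
        },
        "sample_percentage": {"type": "number", "minimum": 0, "maximum": 100},
        "enabled": {"type": "boolean"},
        "metric": {"type": "object"},
        "params": {"type": "object"},
    },
}

candidate = {
    "id": "tone-check",
    "type": "llm_judge",
    "trigger": "every_turn",
    "metric": {"name": "promptpack_tone_score", "type": "gauge"},
}

# Raises jsonschema.exceptions.ValidationError if the object violates the envelope.
validate(instance=candidate, schema=EVAL_SCHEMA)
```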
New Prompt-Level Field: evals
Added to `$defs/Prompt/properties` alongside the existing `validators` field:
{
"evals": {
"type": "array",
"description": "Eval definitions scoped to this prompt. These evals assess the quality of responses generated by this specific prompt.",
"items": {
"$ref": "#/$defs/Eval"
}
}
}
New Pack-Level Field: evals
Added to the top-level `properties` alongside `prompts`, `tools`, `workflow`, etc.:
{
"evals": {
"type": "array",
"description": "Pack-level eval definitions that apply across all prompts. Useful for cross-cutting quality concerns like brand consistency or safety checks. Prompt-level evals with the same id override pack-level evals.",
"items": {
"$ref": "#/$defs/Eval"
}
}
}
Specification Impact
- A new `Eval` definition is added to `$defs`
- The `Prompt` definition gains an optional `evals` array
- The top-level schema gains an optional `evals` array
- No existing fields are modified or removed
Validation Rules
- **ID Uniqueness**
  - `id` must be unique within its scope (all pack-level eval IDs are unique; all prompt-level eval IDs within a single prompt are unique)
  - Pack-level and prompt-level evals may share the same `id` — prompt-level wins (override semantics)
- **Trigger Validation**
  - `trigger` must be one of the enum values: `every_turn`, `on_session_complete`, `sample_turns`, `sample_sessions`
  - When `trigger` is `sample_turns` or `sample_sessions`, `sample_percentage` controls the sampling rate (defaults to 5%)
- **Metric Naming**
  - `metric.name` should follow Prometheus naming conventions: `snake_case`, with a namespace prefix (recommended, not enforced by the schema)
  - `metric.range`, when present, must satisfy `range.min <= range.max`
- **Backward Compatibility**
  - `evals` is optional at both levels; existing PromptPacks remain valid
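Most of these rules are mechanical, so a runtime or pack linter can check them before accepting a pack. A minimal sketch follows, with illustrative function and message names that are not part of the spec:

```python
# Illustrative lint pass over parsed eval definitions (plain dicts).
# Function names and error messages are examples only; the spec does not define them.
ALLOWED_TRIGGERS = {"every_turn", "on_session_complete", "sample_turns", "sample_sessions"}

def lint_evals(evals: list[dict], scope: str) -> list[str]:
    """Return human-readable violations for one scope
    (the pack level, or a single prompt's eval list)."""
    errors: list[str] = []
    seen_ids: set[str] = set()
    for ev in evals:
        eid = ev.get("id", "<missing id>")
        # ID uniqueness within the scope.
        if eid in seen_ids:
            errors.append(f"{scope}: duplicate eval id '{eid}'")
        seen_ids.add(eid)
        # Trigger must be one of the enum values.
        if ev.get("trigger") not in ALLOWED_TRIGGERS:
            errors.append(f"{scope}/{eid}: invalid trigger {ev.get('trigger')!r}")
        # sample_percentage defaults to 5 and must stay within 0-100.
        pct = ev.get("sample_percentage", 5)
        if not 0 <= pct <= 100:
            errors.append(f"{scope}/{eid}: sample_percentage must be between 0 and 100")
        # metric.range.min must not exceed metric.range.max.
        rng = ev.get("metric", {}).get("range", {})
        if "min" in rng and "max" in rng and rng["min"] > rng["max"]:
            errors.append(f"{scope}/{eid}: metric.range.min must be <= metric.range.max")
    return errors
```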
Examples
Example 1: Per-Prompt Eval (LLM Judge)
YAML shown for readability (per RFC 0002). All examples are equally valid as JSON.
prompts:
customer_support:
id: customer-support
name: Customer Support Agent
version: "1.0.0"
system_template: |
You are a helpful customer support agent for Acme Corp.
Always be professional, empathetic, and solution-oriented.
validators:
- type: toxicity
enabled: true
fail_on_violation: true
params:
threshold: 0.8
evals:
- id: tone-check
description: Ensures responses maintain a professional and empathetic tone
type: llm_judge
trigger: every_turn
metric:
name: promptpack_tone_score
type: gauge
range:
min: 1
max: 5
params:
judge_prompt: |
Rate the following customer support response on a 1-5 scale:
1 = Rude or dismissive
3 = Neutral, acceptable
5 = Warm, professional, and empathetic
model: gpt-4o
passing_score: 4
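How `params` is interpreted is entirely type-specific. The sketch below shows one plausible reading of the `llm_judge` params in Example 1; `call_llm` is a hypothetical stand-in for whatever model client the runtime uses, and the pass/fail convention is an assumption rather than spec behaviour.

```python
# Hypothetical interpretation of llm_judge params. `call_llm` stands in for the
# runtime's own model client and is not defined by this specification.
def run_llm_judge(eval_def: dict, response_text: str, call_llm) -> dict:
    params = eval_def.get("params", {})
    judge_prompt = (
        f"{params['judge_prompt']}\n\n"
        f"Response to evaluate:\n{response_text}\n\n"
        "Reply with a single number."
    )
    score = float(call_llm(model=params.get("model"), prompt=judge_prompt))
    result = {
        "metric": eval_def["metric"]["name"],  # promptpack_tone_score in Example 1
        "value": score,
    }
    if "passing_score" in params:
        result["passed"] = score >= params["passing_score"]
    return result
```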
Example 2: Pack-Level Cross-Cutting Eval
YAML shown for readability (per RFC 0002). All examples are equally valid as JSON.
id: acme-support-pack
name: Acme Corp Support Pack
version: "2.0.0"
evals:
- id: brand-consistency
description: Checks that all responses align with Acme brand voice guidelines
type: cosine_similarity
trigger: on_session_complete
metric:
name: promptpack_brand_consistency
type: gauge
range:
min: 0
max: 1
params:
reference_embeddings: acme_brand_voice_v2
threshold: 0.85
- id: safety-audit
description: Sampled safety review across all prompt outputs
type: llm_judge
trigger: sample_turns
sample_percentage: 10
metric:
name: promptpack_safety_violations
type: counter
params:
judge_prompt: |
Review the response for any safety concerns:
- Personal data leakage
- Unauthorized commitments or promises
- Off-brand or inappropriate content
Return the count of violations found.
model: gpt-4o
prompts:
billing:
id: billing
name: Billing Support
version: "1.0.0"
system_template: |
You handle billing inquiries for Acme Corp.
technical:
id: technical
name: Technical Support
version: "1.0.0"
system_template: |
You handle technical support for Acme Corp products.
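The `safety-audit` eval above fires on `sample_turns` with `sample_percentage: 10`. The spec deliberately leaves the sampling algorithm to the runtime; a uniform per-turn random draw is one straightforward policy:

```python
import random

def should_run(eval_def: dict) -> bool:
    """Decide whether an eval fires for the current turn or session.
    The random-draw policy is one possible interpretation, not mandated by the spec."""
    if not eval_def.get("enabled", True):
        return False
    trigger = eval_def["trigger"]
    if trigger in ("every_turn", "on_session_complete"):
        return True  # the caller invokes this at the matching moment
    # sample_turns / sample_sessions: fire for roughly sample_percentage percent.
    return random.random() * 100 < eval_def.get("sample_percentage", 5)
```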
Example 3: Full Pack with Both Levels and Metric Declarations
{
"$schema": "https://promptpack.org/schema/v1/promptpack.schema.json",
"id": "acme-full-pack",
"name": "Acme Full Quality Pack",
"version": "3.0.0",
"template_engine": {
"version": "v1",
"syntax": "{{variable}}"
},
"evals": [
{
"id": "response-latency-budget",
"description": "Tracks whether responses stay within acceptable token length as a proxy for latency",
"type": "token_count",
"trigger": "every_turn",
"metric": {
"name": "promptpack_response_tokens",
"type": "histogram",
"buckets": [50, 100, 250, 500, 1000]
},
"params": {
"tokenizer": "cl100k_base"
}
}
],
"prompts": {
"onboarding": {
"id": "onboarding",
"name": "User Onboarding",
"version": "1.0.0",
"system_template": "Welcome new users to Acme Corp and guide them through setup.",
"validators": [
{
"type": "max_length",
"enabled": true,
"fail_on_violation": false,
"params": { "max_characters": 2000 }
}
],
"evals": [
{
"id": "onboarding-completeness",
"description": "Checks that onboarding responses cover all required setup steps",
"type": "llm_judge",
"trigger": "on_session_complete",
"metric": {
"name": "promptpack_onboarding_completeness",
"type": "boolean"
},
"params": {
"judge_prompt": "Did the agent cover all onboarding steps: account setup, preferences, first task?",
"model": "gpt-4o"
}
},
{
"id": "response-latency-budget",
"description": "Override: onboarding responses can be longer",
"type": "token_count",
"trigger": "every_turn",
"metric": {
"name": "promptpack_response_tokens",
"type": "histogram",
"buckets": [100, 250, 500, 1000, 2000]
},
"params": {
"tokenizer": "cl100k_base"
}
}
]
}
}
}
In this example, the pack-level `response-latency-budget` eval applies to all prompts by default. The `onboarding` prompt overrides it with a more permissive histogram (wider buckets), demonstrating the prompt-level-wins override semantics.
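A minimal sketch of how a runtime might compute the effective eval set for a prompt under these semantics; the helper name is illustrative, not part of the spec.

```python
def effective_evals(pack_evals: list[dict], prompt_evals: list[dict]) -> list[dict]:
    """Prompt-level evals replace pack-level evals that share an id."""
    merged = {ev["id"]: ev for ev in pack_evals}          # start from the pack-wide set
    merged.update({ev["id"]: ev for ev in prompt_evals})  # prompt level wins on id clash
    return list(merged.values())

# Applied to Example 3: the onboarding prompt's "response-latency-budget"
# (wider histogram buckets) replaces the pack-level declaration, and
# "onboarding-completeness" is added alongside it.
```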
Example 4: Extending metric with additionalProperties
The `metric` object uses `additionalProperties: true`, so runtimes can attach extra fields beyond the spec-defined `name`, `type`, and `range`. This is the same extensibility pattern used by `params`. Here is an example showing runtime-specific metric extensions:
YAML shown for readability (per RFC 0002). All examples are equally valid as JSON.
evals:
- id: response-quality
description: Overall response quality scored by LLM judge
type: llm_judge
trigger: every_turn
metric:
# --- spec-defined fields ---
name: promptpack_response_quality
type: gauge
range:
min: 0
max: 1
# --- runtime-added fields (additionalProperties) ---
labels:
team: customer-success
environment: production
help: "Overall response quality as scored by GPT-4o judge"
aggregation: weighted_average
alert_threshold: 0.6
slo: 0.85
params:
judge_prompt: |
Score the response quality from 0.0 to 1.0.
model: gpt-4o
The `labels`, `help`, `aggregation`, `alert_threshold`, and `slo` fields are not part of this specification — they are runtime extensions enabled by `additionalProperties: true`. A Prometheus-based runtime might use `labels` and `help`; an internal platform might use `slo` and `alert_threshold`. The spec defines the envelope; runtimes extend it.
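As one concrete illustration, a Prometheus-based runtime might map this declaration onto the `prometheus_client` library, consuming the `labels` and `help` extensions when present. This is a sketch of one possible mapping, not behaviour required by the spec:

```python
# One possible mapping onto prometheus_client. The spec only declares the metric
# envelope; export wiring like this is entirely up to the runtime.
from prometheus_client import Gauge

metric_decl = {
    "name": "promptpack_response_quality",
    "type": "gauge",
    "labels": {"team": "customer-success", "environment": "production"},
    "help": "Overall response quality as scored by GPT-4o judge",
}

quality_gauge = Gauge(
    metric_decl["name"],
    metric_decl.get("help", metric_decl["name"]),   # Prometheus requires a help string
    labelnames=sorted(metric_decl.get("labels", {})),
)

def record(score: float) -> None:
    quality_gauge.labels(**metric_decl["labels"]).set(score)

record(0.92)
```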
Drawbacks
Schema Growth
- Adds another top-level array and a new `$defs` entry, increasing schema surface area
- Authors must learn the distinction between validators and evals
Mitigation: Evals are entirely optional. Packs without evals remain valid and unchanged.
Runtime Burden
- Runtimes must decide how (and whether) to execute evals — scheduling, metric export, storage
- The `type` field is not an enum, so runtimes must handle unknown types gracefully
Mitigation: This is intentional. The spec defines the envelope; the runtime interprets it. Runtimes that don't support evals simply ignore the field (same as `workflow`).
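One pattern for handling the open `type` field is a small executor registry: known types get an executor, unknown types are skipped with a warning instead of failing the pack. The registry shape and the `pattern` param below are assumptions for illustration, not spec definitions.

```python
import logging
import re
from typing import Callable, Optional

logger = logging.getLogger("promptpack.evals")

# Illustrative registry: a runtime registers executors for the eval types it supports.
EXECUTORS: dict[str, Callable[[dict, str], dict]] = {}

def register(eval_type: str):
    def wrap(fn: Callable[[dict, str], dict]) -> Callable[[dict, str], dict]:
        EXECUTORS[eval_type] = fn
        return fn
    return wrap

@register("regex_match")
def regex_match(eval_def: dict, response_text: str) -> dict:
    # `params.pattern` is a hypothetical param for this type, not defined by the spec.
    pattern = eval_def.get("params", {}).get("pattern", "")
    return {"eval_id": eval_def["id"], "value": bool(re.search(pattern, response_text))}

def run_eval(eval_def: dict, response_text: str) -> Optional[dict]:
    executor = EXECUTORS.get(eval_def["type"])
    if executor is None:
        # Unknown type: skip gracefully rather than rejecting the whole pack.
        logger.warning("no executor for eval type %r; skipping %s",
                       eval_def["type"], eval_def["id"])
        return None
    return executor(eval_def, response_text)
```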
Metric Semantic Ambiguity
- The `metric` block declares the shape but not the semantics of collection (push vs. pull, aggregation windows, etc.)
Mitigation: Prometheus conventions are well understood. `additionalProperties: true` on `metric` lets runtimes add `labels`, `help`, aggregation hints, etc.
Alternatives
Alternative 1: Separate .eval.yaml File
Define evals in a separate file that references a PromptPack.
Rejected: Breaks the single-file portability principle that is core to PromptPack. Evals and prompts would drift apart across separate version histories.
Alternative 2: Evals Only at Prompt Level
Only allow evals inside individual prompt definitions — no pack-level evals.
Rejected: Cross-cutting quality concerns (brand voice, safety audits, token budgets) apply across all prompts. Duplicating them in every prompt is error-prone and violates DRY.
Alternative 3: Enum for type Field
Lock down the eval type to a fixed set of known assertion types (like Validator.type).
Rejected: The eval ecosystem is evolving rapidly — LLM judges, embedding similarity, BLEU/ROUGE, human-in-the-loop, custom model graders. An enum would require an RFC for every new type. Keeping type as a free string lets runtimes innovate without spec changes. (Note: Validator.type uses an enum with a custom escape hatch; for evals, we skip the enum entirely since the landscape is less settled.)
Adoption Strategy
Backward Compatibility
- [x] Fully backward compatible
- [ ] Requires migration
- [ ] Breaking change

The `evals` field is optional at both the pack level and the prompt level. Existing PromptPacks without evals remain completely valid. No changes are required to existing specs, tools, or runtimes.
Migration Path
No migration required. Authors can adopt evals incrementally:
- Phase 1: Continue using PromptPacks without evals (current behavior)
- Phase 2: Add prompt-level evals to critical prompts
- Phase 3: Add pack-level evals for cross-cutting quality concerns
- Phase 4: Wire up runtime eval execution and metric export
Runtimes that don't support evals will simply ignore the `evals` field.
Unresolved Questions
- Should pack-level evals support an `exclude` list? A prompt might want to opt out of a pack-level eval entirely (not override, but skip). This could be a future addition — e.g., `exclude_evals: ["brand-consistency"]` on the prompt. Proposal: defer to a future RFC if demand emerges.
- Should there be a standard set of well-known metric names? A registry of conventional metric names (like `promptpack_tone_score`) would improve interop across runtimes. Proposal: defer to a companion "Eval Cookbook" document rather than baking it into the spec.
Revision History
- 2026-02-14: Initial draft
- 2026-02-14: Accepted — `range` changed from array `[min, max]` to object `{min, max}` for extensibility; `params` made optional (not all eval types need configuration); schema version bumped to 1.2.0