Arena Testing

Status

Available Now - PromptArena is in active use for LLM testing

Overview

PromptArena (promptarena) is a CLI tool for running multi-turn conversation simulations across multiple LLM providers, validating conversation flows, and generating comprehensive test reports.

Arena enables systematic testing of conversational AI systems with support for:

  • Multi-Provider Testing - Run the same tests across OpenAI, Anthropic, Google, and more
  • Multi-Turn Conversations - Test complex conversation flows and state management
  • Self-Play Mode - Simulate realistic user interactions with configurable roles
  • Multimodal Testing - Test with images, audio, and video content
  • Comprehensive Reporting - HTML, JSON, JUnit XML, and Markdown reports
  • Mock Testing - Fast, cost-free testing with mock providers
  • CI/CD Integration - Built for automated testing pipelines

Installation

PromptArena is available as part of the PromptKit toolkit. See the PromptKit repository for installation instructions.

Quick Start

Basic Usage

# Run all tests with default configuration
promptarena run

# Specify configuration file
promptarena run --config my-arena.yaml

# Run specific providers only
promptarena run --provider openai,anthropic

# Run specific scenarios
promptarena run --scenario basic-qa,edge-cases

Configuration File

Create an arena.yaml configuration file:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Arena
metadata:
  name: my-arena
spec:
  prompt_configs:
    - id: assistant
      file: prompts/assistant.yaml

  providers:
    - file: providers/openai.yaml
    - file: providers/anthropic.yaml

  scenarios:
    - file: scenarios/test.yaml

  defaults:
    output:
      dir: out
      formats: ["json", "html"]
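
Assuming the file references resolve relative to arena.yaml (the example's paths suggest this, though the resolution rule isn't stated here), a matching project layout would be:

.
├── arena.yaml
├── prompts/
│   └── assistant.yaml
├── providers/
│   ├── openai.yaml
│   └── anthropic.yaml
├── scenarios/
│   └── test.yaml
└── out/                # generated reports (dir: out)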

Key Features

Multi-Provider Testing

Test the same scenarios across multiple LLM providers:

# Compare OpenAI, Anthropic, and Gemini
promptarena run --provider openai,anthropic,gemini --format html

Supported providers include:

  • OpenAI (GPT-4, GPT-3.5, etc.)
  • Anthropic (Claude 3 Opus, Sonnet, Haiku)
  • Google (Gemini Pro)
  • Azure OpenAI
  • And more
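
The provider files referenced from arena.yaml are not shown on this page. As a sketch only, assuming they follow the same apiVersion/kind convention as the other resources, a provider definition might look like the following; the kind name and every spec field here are assumptions, not the documented schema:

# providers/openai.yaml (hypothetical schema)
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Provider                    # assumed kind
metadata:
  name: openai
spec:
  model: gpt-4                    # assumed field
  api_key_env: OPENAI_API_KEY     # assumed field

See the PromptKit repository for the actual provider file format.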

Multi-Turn Conversations

Define complex conversation flows in scenario files:

apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
  name: customer-support
spec:
  task_type: support
  turns:
    - role: user
      parts:
        - type: text
          text: "I need help with my account"
    - role: assistant
      # Expected assistant response
    - role: user
      parts:
        - type: text
          text: "Can you reset my password?"

Multimodal Content

Test with images, audio, and video:

turns:
  - role: user
    parts:
      - type: text
        text: "What's in this image?"
      - type: image
        media:
          file_path: test-data/sample.jpg
          detail: high

Supported media types:

  • Images: JPEG, PNG, GIF, WebP
  • Audio: MP3, WAV, OGG, M4A
  • Video: MP4, WebM, MOV

Self-Play Mode

Simulate realistic conversations with configurable roles:

# Enable self-play testing
promptarena run --selfplay

# Self-play with specific roles
promptarena run --selfplay --roles frustrated-customer,tech-support

Mock Testing

Fast, cost-free testing during development:

# Use mock provider instead of real APIs
promptarena run --mock-provider

# Use custom mock configuration
promptarena run --mock-config mock-responses.yaml
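
The mock configuration format is not documented in this section. As a rough sketch only, a canned-response file might map user messages to fixed replies; every key below is an assumption, not the actual schema:

# mock-responses.yaml (hypothetical schema)
responses:
  - match: "I need help with my account"     # assumed matching key
    reply: "Sure, I can help with your account."
  - match: "Can you reset my password?"
    reply: "I've sent a reset link to your email."

Consult the PromptKit repository for the real mock configuration schema.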

Output Formats

Arena generates comprehensive reports in multiple formats:

HTML Reports

Interactive HTML reports with:

  • Side-by-side provider comparison
  • Response times and token usage
  • Cost analysis
  • Media content visualization
  • Test assertions pass/fail status

promptarena run --format html
open out/report-[timestamp].html

JUnit XML

Standard JUnit XML for CI/CD integration:

promptarena run --format junit --junit-file out/junit.xml

Properties include:

  • media.images.total - Count of images tested
  • media.loaded.success - Successfully loaded media
  • media.loaded.errors - Failed media loads
  • Test pass/fail status
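
In standard JUnit XML these appear as <property> entries on the test suite. A minimal sketch of how the media counters above might surface in out/junit.xml (the element layout follows the JUnit convention; names and values are illustrative):

<testsuite name="basic-qa" tests="3" failures="1">
  <properties>
    <property name="media.images.total" value="2"/>
    <property name="media.loaded.success" value="2"/>
    <property name="media.loaded.errors" value="0"/>
  </properties>
  <testcase name="turn-1" classname="openai"/>
</testsuite>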

JSON Reports

Machine-readable JSON for programmatic analysis:

promptarena run --format json
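
The JSON schema isn't specified here, but the report can be post-processed with standard tools. For example, counting failures with jq; the .results and .status field names are assumptions about the schema, not documented keys:

# Count failed results (field names are assumed; adjust to the actual schema)
jq '[.results[] | select(.status == "fail")] | length' out/index.json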

Markdown Reports

Human-readable markdown summaries:

promptarena run --format markdown --markdown-file out/results.md

Test Assertions

Arena supports multiple assertion types:

Text Assertions

assertions:
  - type: content_includes
    patterns: ["expected text", "another phrase"]

  - type: content_excludes
    patterns: ["unwanted text"]

Media Assertions

Validate media outputs from generative models:

# Image validation
assertions:
  - type: image_format
    params:
      formats: [png, jpeg]

  - type: image_dimensions
    params:
      width: 1920
      height: 1080

# Audio validation
assertions:
  - type: audio_format
    params:
      formats: [mp3, wav]

  - type: audio_duration
    params:
      min_seconds: 29
      max_seconds: 31

# Video validation
assertions:
  - type: video_resolution
    params:
      presets: [4k, uhd]

  - type: video_duration
    params:
      min_seconds: 59
      max_seconds: 61

CI/CD Integration

GitHub Actions

# .github/workflows/arena-tests.yml
name: Arena Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Arena Tests
        run: promptarena run --ci --format junit
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

      - name: Publish Test Results
        uses: dorny/test-reporter@v1
        if: always()
        with:
          name: Arena Test Results
          path: out/junit.xml
          reporter: java-junit
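
Note that this workflow assumes promptarena is already available on the runner's PATH; in practice, an installation step (following the PromptKit repository instructions) would precede the test step.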

CI Mode

Run in headless mode optimized for CI pipelines:

# Headless mode for CI pipelines
promptarena run --ci --format junit,json

Exit codes:

  • 0 - Success, all tests passed
  • 1 - Failure, one or more tests failed
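
These exit codes make it straightforward to gate a shell pipeline on the test outcome:

# Fail the build if any arena test fails
if ! promptarena run --ci --format junit,json; then
  echo "Arena tests failed" >&2
  exit 1
fi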

Commands

promptarena run

Run conversation simulations across multiple LLM providers.

Flags:

  • -c, --config - Configuration file path (default: arena.yaml)
  • -j, --concurrency - Number of concurrent workers (default: 6)
  • --provider - Providers to use (comma-separated)
  • --scenario - Scenarios to run (comma-separated)
  • --temperature - Override temperature for all scenarios
  • --max-tokens - Override max tokens for all scenarios
  • --selfplay - Enable self-play mode
  • --mock-provider - Replace all providers with MockProvider
  • -o, --out - Output directory (default: out)
  • --format - Output formats: json, junit, html, markdown
  • -v, --verbose - Enable verbose debug logging
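
These flags compose. For example, a run that pins the configuration file, lowers concurrency, restricts providers and scenarios, and writes two report formats to a custom directory (file and scenario names are illustrative):

# Combine flags in a single run
promptarena run -c my-arena.yaml -j 4 \
  --provider openai,anthropic \
  --scenario basic-qa \
  --format html,junit \
  -o results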

promptarena config-inspect

Inspect and validate arena configuration:

# Inspect default configuration
promptarena config-inspect

# Verbose output with details
promptarena config-inspect --verbose

# JSON output for programmatic use
promptarena config-inspect --format json

promptarena prompt-debug

Test prompt generation with specific contexts:

# Test prompt generation for task type
promptarena prompt-debug --task-type support

# Test with region
promptarena prompt-debug --task-type support --region us

# Test with scenario file
promptarena prompt-debug --scenario scenarios/customer-support.yaml

promptarena render

Generate HTML report from existing test results:

# Render from default location
promptarena render out/index.json

# Custom output path
promptarena render out/index.json --output custom-report.html

Best Practices

Performance

# Increase concurrency for faster execution
promptarena run --concurrency 10

# Reduce concurrency for stability
promptarena run --concurrency 1

Cost Control

# Use mock provider during development
promptarena run --mock-provider

# Test with cheaper models first
promptarena run --provider gpt-3.5-turbo

Reproducibility

# Use the same seed across runs for reproducible results
promptarena run --seed 42

Debugging

# Always start with config validation
promptarena config-inspect --verbose

# Use verbose mode to see API calls
promptarena run --verbose --scenario problematic-test

# Test prompt generation separately
promptarena prompt-debug --scenario scenarios/test.yaml

Support

For questions, issues, or feature requests, see the PromptKit repository.