Arena Testing
✅ Available Now - PromptArena is actively being used for LLM testing
Overview
PromptArena (promptarena) is a CLI tool for running multi-turn conversation simulations across multiple LLM providers, validating conversation flows, and generating comprehensive test reports.
Arena enables systematic testing of conversational AI systems with support for:
- Multi-Provider Testing - Run the same tests across OpenAI, Anthropic, Google, and more
- Multi-Turn Conversations - Test complex conversation flows and state management
- Self-Play Mode - Simulate realistic user interactions with configurable roles
- Multimodal Testing - Test with images, audio, and video content
- Comprehensive Reporting - HTML, JSON, JUnit XML, and Markdown reports
- Mock Testing - Fast, cost-free testing with mock providers
- CI/CD Integration - Built for automated testing pipelines
Installation
PromptArena is available as part of the PromptKit toolkit. See the PromptKit repository for installation instructions.
Quick Start
Basic Usage
# Run all tests with default configuration
promptarena run
# Specify configuration file
promptarena run --config my-arena.yaml
# Run specific providers only
promptarena run --provider openai,anthropic
# Run specific scenarios
promptarena run --scenario basic-qa,edge-cases
Configuration File
Create an arena.yaml configuration file:
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Arena
metadata:
name: my-arena
spec:
prompt_configs:
- id: assistant
file: prompts/assistant.yaml
providers:
- file: providers/openai.yaml
- file: providers/anthropic.yaml
scenarios:
- file: scenarios/test.yaml
defaults:
output:
dir: out
formats: ["json", "html"]
Key Features
Multi-Provider Testing
Test the same scenarios across multiple LLM providers:
# Compare OpenAI, Anthropic, and Gemini
promptarena run --provider openai,anthropic,gemini --format html
Supported providers include:
- OpenAI (GPT-4, GPT-3.5, etc.)
- Anthropic (Claude 3 Opus, Sonnet, Haiku)
- Google (Gemini Pro)
- Azure OpenAI
- And more
Multi-Turn Conversations
Define complex conversation flows in scenario files:
apiVersion: promptkit.altairalabs.ai/v1alpha1
kind: Scenario
metadata:
name: customer-support
spec:
task_type: support
turns:
- role: user
parts:
- type: text
text: "I need help with my account"
- role: assistant
# Expected assistant response
- role: user
parts:
- type: text
text: "Can you reset my password?"
Multimodal Content
Test with images, audio, and video:
turns:
- role: user
parts:
- type: text
text: "What's in this image?"
- type: image
media:
file_path: test-data/sample.jpg
detail: high
Supported media types:
- Images: JPEG, PNG, GIF, WebP
- Audio: MP3, WAV, OGG, M4A
- Video: MP4, WebM, MOV
Self-Play Mode
Simulate realistic conversations with configurable roles:
# Enable self-play testing
promptarena run --selfplay
# Self-play with specific roles
promptarena run --selfplay --roles frustrated-customer,tech-support
Mock Testing
Fast, cost-free testing during development:
# Use mock provider instead of real APIs
promptarena run --mock-provider
# Use custom mock configuration
promptarena run --mock-config mock-responses.yaml
Output Formats
Arena generates comprehensive reports in multiple formats:
HTML Reports
Interactive HTML reports with:
- Side-by-side provider comparison
- Response times and token usage
- Cost analysis
- Media content visualization
- Test assertions pass/fail status
promptarena run --format html
open out/report-[timestamp].html
JUnit XML
Standard JUnit XML for CI/CD integration:
promptarena run --format junit --junit-file out/junit.xml
Properties include:
- media.images.total - Count of images tested
- media.loaded.success - Successfully loaded media
- media.loaded.errors - Failed media loads
- Test pass/fail status
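If you want to pull these properties into a dashboard or a custom check, you can read them straight out of the JUnit XML. The sketch below is a minimal example, not the documented schema: where Arena places the property elements (suite level vs. test case level) is an assumption here, so adjust the lookup to match the report your version actually emits.

```python
# Minimal sketch: list Arena's media.* properties from the JUnit report.
# Assumption: they appear as standard JUnit <property name=... value=...>
# elements somewhere under the root; adjust if your report nests them differently.
import xml.etree.ElementTree as ET

tree = ET.parse("out/junit.xml")
for prop in tree.getroot().iter("property"):
    name = prop.get("name", "")
    if name.startswith("media."):
        print(f"{name} = {prop.get('value')}")
```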
JSON Reports
Machine-readable JSON for programmatic analysis:
promptarena run --format json
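As one way to consume the JSON output, the sketch below tallies pass/fail counts per provider. The report path out/index.json is the one the render command uses later in this page; the field names used here (results, provider, passed) are assumptions rather than the documented schema, so inspect a real report and rename the keys to match.

```python
# Minimal sketch: summarize an Arena JSON report per provider.
# Assumption: the report holds a list of per-test results with
# "provider" and "passed" fields; adjust keys to your actual report.
import json
from collections import Counter

with open("out/index.json") as f:
    report = json.load(f)

passed = Counter()
failed = Counter()
for result in report.get("results", []):
    provider = result.get("provider", "unknown")
    (passed if result.get("passed") else failed)[provider] += 1

for provider in sorted(set(passed) | set(failed)):
    print(f"{provider}: {passed[provider]} passed, {failed[provider]} failed")
```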
Markdown Reports
Human-readable markdown summaries:
promptarena run --format markdown --markdown-file out/results.md
Test Assertions
Arena supports multiple assertion types:
Text Assertions
assertions:
- type: content_includes
patterns: ["expected text", "another phrase"]
- type: content_excludes
patterns: ["unwanted text"]
Media Assertions
Validate media outputs from generative models:
# Image validation
assertions:
- type: image_format
params:
formats: [png, jpeg]
- type: image_dimensions
params:
width: 1920
height: 1080
# Audio validation
assertions:
- type: audio_format
params:
formats: [mp3, wav]
- type: audio_duration
params:
min_seconds: 29
max_seconds: 31
# Video validation
assertions:
- type: video_resolution
params:
presets: [4k, uhd]
- type: video_duration
params:
min_seconds: 59
max_seconds: 61
CI/CD Integration
GitHub Actions
# .github/workflows/arena-tests.yml
name: Arena Tests
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run Arena Tests
run: promptarena run --ci --format junit
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
- name: Publish Test Results
uses: dorny/test-reporter@v1
if: always()
with:
name: Arena Test Results
path: out/junit.xml
reporter: java-junit
CI Mode
Run in headless mode optimized for CI pipelines:
# Headless mode for CI pipelines
promptarena run --ci --format junit,json
Exit codes:
- 0 - Success, all tests passed
- 1 - Failure, one or more tests failed
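If your pipeline wraps Arena in a script rather than calling it directly, the exit code is all you need to gate the build. A small sketch, using only the flags documented above:

```python
# Gate a CI step on Arena's exit code: 0 means every test passed,
# non-zero means at least one failure.
import subprocess
import sys

result = subprocess.run(["promptarena", "run", "--ci", "--format", "junit,json"])
if result.returncode != 0:
    print("Arena tests failed; see out/junit.xml for details", file=sys.stderr)
sys.exit(result.returncode)
```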
Commands
promptarena run
Run conversation simulations across multiple LLM providers.
Flags:
- -c, --config - Configuration file path (default: arena.yaml)
- -j, --concurrency - Number of concurrent workers (default: 6)
- --provider - Providers to use (comma-separated)
- --scenario - Scenarios to run (comma-separated)
- --temperature - Override temperature for all scenarios
- --max-tokens - Override max tokens for all scenarios
- --selfplay - Enable self-play mode
- --mock-provider - Replace all providers with MockProvider
- -o, --out - Output directory (default: out)
- --format - Output formats: json, junit, html, markdown
- -v, --verbose - Enable verbose debug logging
promptarena config-inspect
Inspect and validate arena configuration:
# Inspect default configuration
promptarena config-inspect
# Verbose output with details
promptarena config-inspect --verbose
# JSON output for programmatic use
promptarena config-inspect --format json
promptarena prompt-debug
Test prompt generation with specific contexts:
# Test prompt generation for task type
promptarena prompt-debug --task-type support
# Test with region
promptarena prompt-debug --task-type support --region us
# Test with scenario file
promptarena prompt-debug --scenario scenarios/customer-support.yaml
promptarena render
Generate HTML report from existing test results:
# Render from default location
promptarena render out/index.json
# Custom output path
promptarena render out/index.json --output custom-report.html
Best Practices
Performance
# Increase concurrency for faster execution
promptarena run --concurrency 10
# Reduce concurrency for stability
promptarena run --concurrency 1
Cost Control
# Use mock provider during development
promptarena run --mock-provider
# Test with cheaper models first
promptarena run --provider gpt-3.5-turbo
Reproducibility
# Always use same seed for consistent results
promptarena run --seed 42
Debugging
# Always start with config validation
promptarena config-inspect --verbose
# Use verbose mode to see API calls
promptarena run --verbose --scenario problematic-test
# Test prompt generation separately
promptarena prompt-debug --scenario scenarios/test.yaml
Learn More
- Complete CLI Reference: Arena User Guide
- Configuration Reference: Config Documentation
- Getting Started: First Project Walkthrough
- CI/CD Integration: Pipeline Setup Guide
- GitHub Repository: AltairaLabs/PromptKit
Support
For questions, issues, or feature requests:
- Issues: GitHub Issues
- Discussions: GitHub Discussions