Test Configuration

Prompt Being Tested: System prompts, character descriptions, etc.

Test Date:

Evaluator: If collecting data from more than one person.

Evaluation Criteria

Rate each response on every criterion using a scale of 1 to 5: 1 = Poor, 2 = Below Average, 3 = Average, 4 = Good, 5 = Excellent.

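Run | Model Name | Sampling Preset | Relevance | Accuracy | Clarity | Completeness | Follows Instructions | Consistency | Overall Quality | Notes
1 | | | | | | | | | |
2 | | | | | | | | | |
3 | | | | | | | | | |
4 | | | | | | | | | |
5 | | | | | | | | | |
6 | | | | | | | | | |
7 | | | | | | | | | |
8 | | | | | | | | | |
9 | | | | | | | | | |
10 | | | | | | | | | |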

Summary Statistics

Metric | Average Score | Standard Deviation | Min Score | Max Score | Range
Relevance | | | | |
Accuracy | | | | |
Clarity | | | | |
Completeness | | | | |
Follows Instructions | | | | |
Consistency | | | | |
Overall Quality | | | | |
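
The summary columns above can be filled in by hand or computed. The following is a minimal Python sketch, not a prescribed method: the `scores` dictionary and its ratings are made-up placeholder data, `summarize` is a hypothetical helper name, and the sample standard deviation is an assumption, since the template does not say whether sample or population deviation is intended.

```python
from statistics import mean, stdev

# Placeholder ratings (1-5) for three runs; replace with the scores
# recorded in the run table. These values are illustrative only.
scores = {
    "Relevance": [4, 5, 3],
    "Accuracy": [3, 4, 4],
}


def summarize(values):
    """Return the summary-table columns for one metric's list of ratings."""
    return {
        "average": mean(values),
        # Sample standard deviation (assumed); it needs at least two
        # ratings, so fall back to 0.0 for a single run.
        "std_dev": stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


summary = {metric: summarize(values) for metric, values in scores.items()}

for metric, stats in summary.items():
    print(metric, stats)
```

If the population standard deviation is preferred, `statistics.pstdev` can be substituted for `stdev`.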

Evaluation Criteria Definitions

Relevance (1-5): How well does the response address the specific question or task?

Accuracy (1-5): How factually correct and truthful is the response?