Prompt Being Tested: (system prompt, character description, etc.)
Test Date:
Evaluator: (fill in if more than one person is collecting data)
Rate each response on a scale of 1-5: 1 = Poor, 2 = Below Average, 3 = Average, 4 = Good, 5 = Excellent.
| Run | Model Name | Sampling Preset | Relevance | Accuracy | Clarity | Completeness | Follows Instructions | Consistency | Overall Quality | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
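| 1 | | | | | | | | | | |
| 2 | | | | | | | | | | |
| 3 | | | | | | | | | | |
| 4 | | | | | | | | | | |
| 5 | | | | | | | | | | |
| 6 | | | | | | | | | | |
| 7 | | | | | | | | | | |
| 8 | | | | | | | | | | |
| 9 | | | | | | | | | | |
| 10 | | | | | | | | | | |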
| Metric | Average Score | Standard Deviation | Min Score | Max Score | Range |
|---|---|---|---|---|---|
| Relevance | | | | | |
| Accuracy | | | | | |
| Clarity | | | | | |
| Completeness | | | | | |
| Follows Instructions | | | | | |
| Consistency | | | | | |
| Overall Quality | | | | | |
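
The summary columns above can be filled in by hand or computed from the per-run ratings. The following is a minimal Python sketch of one way to do that. It assumes the sample standard deviation (n - 1 denominator), since the template does not say which variant is intended. The function name `summarize` and the example ratings are hypothetical, for illustration only.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return the summary-table statistics for one metric's 1-5 ratings."""
    if len(scores) < 2:
        raise ValueError("need at least two runs for a sample standard deviation")
    if any(not 1 <= s <= 5 for s in scores):
        raise ValueError("ratings must be on the 1-5 scale")
    lo, hi = min(scores), max(scores)
    return {
        "average": round(mean(scores), 2),
        # Assumption: sample standard deviation (n - 1 denominator).
        # Swap in statistics.pstdev for the population version.
        "std_dev": round(stdev(scores), 2),
        "min": lo,
        "max": hi,
        "range": hi - lo,
    }


# Hypothetical Relevance ratings for ten runs (not real data).
relevance_scores = [4, 5, 3, 4, 4, 5, 2, 4, 3, 4]
print(summarize(relevance_scores))
```

Calling `summarize` once per metric column yields one row of the summary table. A low standard deviation and a narrow range suggest the prompt produces stable results across runs.
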
Relevance (1-5): How well does the response address the specific question or task?
Accuracy (1-5): How factually correct and truthful is the response?