Prompt Being Tested: (system prompt, character description, etc.)
Test Date:
Evaluator: (fill in if more than one person is collecting data)
Rate each response on a scale of 1-5, where 1 = Poor, 2 = Below Average, 3 = Average, 4 = Good, and 5 = Excellent.
Run | Model Name | Sampling Preset | Relevance | Accuracy | Clarity | Completeness | Follows Instructions | Consistency | Overall Quality | Notes |
---|---|---|---|---|---|---|---|---|---|---|
1 | ||||||||||
2 | ||||||||||
3 | ||||||||||
4 | ||||||||||
5 | ||||||||||
6 | ||||||||||
7 | ||||||||||
8 | ||||||||||
9 | ||||||||||
10 | ||||||||||
Metric | Average Score | Standard Deviation | Min Score | Max Score | Range |
---|---|---|---|---|---|
Relevance | |||||
Accuracy | |||||
Clarity | |||||
Completeness | |||||
Follows Instructions | |||||
Consistency | |||||
Overall Quality | |||||
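
The summary columns above can be filled in by hand or computed from the per-run scores. The following is a minimal sketch of one way to do that in Python. The function name `summarize` and the sample scores are hypothetical and only for illustration. It also assumes the sample standard deviation (dividing by n - 1), since the template does not say which variant is intended.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return the summary-row values for one metric's 1-5 scores."""
    if not scores:
        raise ValueError("need at least one score")
    if any(not 1 <= s <= 5 for s in scores):
        raise ValueError("scores must be on the 1-5 scale")
    lowest, highest = min(scores), max(scores)
    return {
        "average": mean(scores),
        # Sample standard deviation (n - 1). This is an assumption; use
        # statistics.pstdev instead if the population form is wanted.
        "std_dev": stdev(scores) if len(scores) > 1 else 0.0,
        "min": lowest,
        "max": highest,
        "range": highest - lowest,
    }


# Made-up Relevance scores for runs 1-10, not real test data.
relevance = [4, 5, 3, 4, 4, 5, 2, 4, 3, 4]
for name, value in summarize(relevance).items():
    print(f"{name}: {value:.2f}")
```

Calling `summarize` once per metric column gives one row of the summary table. A large range or standard deviation for a metric suggests the prompt behaves inconsistently across runs, models, or sampling presets.
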
Relevance (1-5): How well does the response address the specific question or task?
Accuracy (1-5): How factually correct and truthful is the response?