Evaluate Your Output

Self-critique: ask the model to review its own answer and revise it.

Create a prompt, then test it with different inputs.

Create a dataset of inputs, then you can test each prompt against it.

Use both easy and hard examples.
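A minimal sketch of the dataset idea above: run every prompt over every input and tally pass rates separately for easy and hard cases. The dataset, the prompt templates, and `fake_model` are all hypothetical stand-ins; swap `fake_model` for whatever function calls your real model.

```python
# Hypothetical dataset: each case has an input, an expected answer, and a difficulty tag.
DATASET = [
    {"input": "2 + 2", "expected": "4", "difficulty": "easy"},
    {"input": "17 * 23", "expected": "391", "difficulty": "hard"},
]

# Hypothetical prompt variants to compare.
PROMPTS = {
    "terse": "Compute: {input}",
    "stepwise": "Think step by step, then give only the final number for: {input}",
}


def evaluate_prompts(model, prompts, dataset):
    """Return {prompt_name: {difficulty: (passed, total)}} using exact-match scoring."""
    scores = {}
    for name, template in prompts.items():
        per_difficulty = {}
        for case in dataset:
            answer = model(template.format(input=case["input"]))
            passed, total = per_difficulty.get(case["difficulty"], (0, 0))
            correct = int(answer.strip() == case["expected"])
            per_difficulty[case["difficulty"]] = (passed + correct, total + 1)
        scores[name] = per_difficulty
    return scores


def fake_model(prompt):
    # Stand-in for a real LLM call: it only "solves" the easy case.
    return "4" if "2 + 2" in prompt else "unknown"
```

Calling `evaluate_prompts(fake_model, PROMPTS, DATASET)` gives one pass/total pair per prompt and difficulty, which makes it easy to see a prompt that only works on easy inputs.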

Use a model to generate varied test examples.

Same prompt, different models: compare the answers with an LLM based on the criteria you want, or ask the model to grade each output and return JSON or a single value.
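A rough sketch of the same-prompt, different-models comparison: send one prompt to each model, score each answer with a function that returns a single value, and rank the results. The model names, the lambdas, and `toy_score` are hypothetical; in practice `score_fn` would wrap an LLM judge or your own criteria.

```python
def compare_models(models, prompt, score_fn):
    """Run one prompt through every model and rank the answers by score (highest first)."""
    results = []
    for name, model in models.items():
        answer = model(prompt)
        results.append((name, answer, score_fn(prompt, answer)))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Hypothetical stand-ins for real model calls.
MODELS = {
    "model_a": lambda prompt: "The capital of France is, I believe, probably Paris.",
    "model_b": lambda prompt: "Paris",
}


def toy_score(prompt, answer):
    # Placeholder criterion: an exact answer beats one that merely contains it.
    if answer.strip() == "Paris":
        return 1.0
    return 0.5 if "Paris" in answer else 0.0
```

`compare_models(MODELS, "What is the capital of France?", toy_score)` returns a list of `(name, answer, score)` tuples with the best-scoring model first.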

Have GPT-4 evaluate the responses against specific requirements.
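One way to sketch the grading step: put the requirements and the response into a grader prompt, ask for JSON only, and parse the reply defensively since a judge model can return malformed output. The template wording, the 1 to 5 scale, and `stub_judge` are assumptions for illustration; `judge` is any function that sends a prompt to your grader model and returns its text.

```python
import json

# Hypothetical grader prompt; the 1-5 scale is an arbitrary choice.
GRADER_TEMPLATE = (
    "Grade the response against each requirement.\n"
    "Requirements: {requirements}\n"
    "Response: {response}\n"
    'Reply with JSON only: {{"score": <integer 1-5>, "reason": "<short text>"}}'
)


def grade(judge, response, requirements):
    """Ask the judge for a JSON grade; return score=None if the reply is unusable."""
    prompt = GRADER_TEMPLATE.format(
        requirements="; ".join(requirements), response=response
    )
    raw = judge(prompt)
    try:
        result = json.loads(raw)
        score = int(result["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return {"score": None, "reason": "unparseable grader output"}
    if not 1 <= score <= 5:
        return {"score": None, "reason": "score out of range"}
    return {"score": score, "reason": str(result.get("reason", ""))}


def stub_judge(prompt):
    # Stand-in for a real grader model call.
    return '{"score": 4, "reason": "meets both requirements"}'
```

Returning `None` for bad replies keeps broken grades out of your averages instead of silently counting them as zero.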

Sample multiple responses and ensemble them manually or with an LLM.
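A simple way to ensemble without an LLM is a majority vote over several samples; this sketch assumes short answers that can be compared after lowercasing and trimming. `fake_sampler` and its canned answers are hypothetical and stand in for a model called with a nonzero temperature.

```python
import itertools
from collections import Counter


def majority_vote(model, prompt, n=5):
    """Sample n answers and return (most common answer, share of samples agreeing)."""
    answers = [model(prompt).strip().lower() for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n


# Hypothetical sampler that cycles through canned answers to mimic sampling noise.
_canned = itertools.cycle(["Paris", "paris ", "Lyon", "Paris", "Paris"])


def fake_sampler(prompt):
    return next(_canned)
```

The agreement share doubles as a rough confidence signal: a low share suggests the question is ambiguous or the prompt needs work. For long free-form answers a vote will not work, and merging the samples with an LLM is the more natural choice.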