Create a prompt, then test it with different inputs.
Create a dataset of inputs, then you can test each prompt against every input in it.
Use both easy and hard examples.
Set up a model that generates varied test examples.
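A minimal sketch of the dataset-and-prompts loop described above. The dataset entries, the prompt templates, and `echo_model` are all hypothetical placeholders; in practice `echo_model` would be replaced by a call to whatever model is being tested.

```python
from typing import Callable, Dict, List

# Hypothetical test set: a mix of easy and hard inputs.
DATASET: List[Dict[str, str]] = [
    {"input": "2 + 2", "difficulty": "easy"},
    {"input": "17 * 23 - 4", "difficulty": "hard"},
]

# Hypothetical prompt variants to compare against each other.
PROMPTS: Dict[str, str] = {
    "plain": "Compute: {input}",
    "stepwise": "Think step by step, then compute: {input}",
}


def run_grid(
    prompts: Dict[str, str],
    dataset: List[Dict[str, str]],
    model: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Run every prompt template against every input and collect the outputs."""
    rows = []
    for name, template in prompts.items():
        for example in dataset:
            filled = template.format(input=example["input"])
            rows.append(
                {
                    "prompt": name,
                    "difficulty": example["difficulty"],
                    "input": example["input"],
                    "output": model(filled),
                }
            )
    return rows


def echo_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption); it only echoes the prompt."""
    return f"[model output for: {prompt}]"


results = run_grid(PROMPTS, DATASET, echo_model)
for row in results:
    print(row["prompt"], row["difficulty"], row["output"])
```

Keeping the difficulty label on each row makes it easy to see whether a prompt only works on the easy cases.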
Same prompts, different models: compare the answers with an LLM against the criteria you care about, or ask a model to grade each output and return either JSON or a single value.
Have GPT-4 evaluate the responses against specific requirements.
Sample multiple responses and ensemble them, either manually or with an LLM.
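One possible shape for the grading step, as a sketch only. The grader is passed in as a plain callable so that any model (GPT-4 or otherwise) can sit behind it; `GRADER_TEMPLATE`, the 1 to 5 scale, `fake_grader`, and the two lambda "models" are assumptions made for illustration, not part of any real API.

```python
import json
from typing import Callable, Dict

# Hypothetical grading prompt; the 1-5 scale is an arbitrary choice.
GRADER_TEMPLATE = (
    "You are grading a model response.\n"
    "Criteria: {criteria}\n"
    "Response: {response}\n"
    'Reply with JSON only, for example {{"score": 4, "reason": "..."}}. '
    "The score is an integer from 1 to 5."
)


def grade(response: str, criteria: str, grader: Callable[[str], str]) -> dict:
    """Ask the grader model for a JSON verdict and parse it defensively."""
    raw = grader(GRADER_TEMPLATE.format(criteria=criteria, response=response))
    try:
        parsed = json.loads(raw)
        score = int(parsed["score"])
    except (ValueError, KeyError, TypeError):
        # Graders do not always return valid JSON; keep the raw text for review.
        return {"score": None, "reason": "unparseable grader output", "raw": raw}
    return {"score": score, "reason": str(parsed.get("reason", "")), "raw": raw}


def compare_models(
    prompt: str,
    models: Dict[str, Callable[[str], str]],
    criteria: str,
    grader: Callable[[str], str],
) -> Dict[str, dict]:
    """Send the same prompt to each model and grade every answer."""
    return {
        name: grade(model(prompt), criteria, grader)
        for name, model in models.items()
    }


def fake_grader(prompt: str) -> str:
    """Stand-in for a real grader call (assumption); always returns the same verdict."""
    return '{"score": 4, "reason": "accurate but verbose"}'


models = {"model_a": lambda p: "Answer A", "model_b": lambda p: "Answer B"}
report = compare_models("Explain recursion.", models, "accuracy and brevity", fake_grader)
print(report)
```

Returning `None` for an unparseable verdict, rather than raising, lets a long evaluation run finish and leaves the bad cases to be inspected afterwards.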
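A small sketch of the sampling-and-ensembling idea, using majority vote as the ensembling rule. Majority vote is a simple stand-in chosen here for illustration; the note itself leaves the method open (manual or LLM-based). `fake_sampler` and its canned answers are hypothetical, and a real sampler would call a model with a nonzero temperature.

```python
from collections import Counter
from typing import Callable, List, Tuple


def sample_responses(prompt: str, sampler: Callable[[str], str], n: int = 5) -> List[str]:
    """Draw n responses for the same prompt."""
    return [sampler(prompt) for _ in range(n)]


def majority_vote(responses: List[str]) -> Tuple[str, float]:
    """Return the most common normalized answer and the share of samples that agree."""
    normalized = [r.strip().lower() for r in responses]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)


# Canned answers standing in for real model samples (assumption).
_canned = iter(["Paris", "paris ", "Lyon", "Paris", "Marseille"])


def fake_sampler(prompt: str) -> str:
    """Stand-in for a sampled model call; returns the next canned answer."""
    return next(_canned)


responses = sample_responses("Capital of France?", fake_sampler, n=5)
answer, agreement = majority_vote(responses)
print(answer, agreement)
```

Majority vote only works when answers are short and comparable after normalization; for free-form text, passing all samples to an LLM and asking it to merge or pick the best one is the alternative the note mentions.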