Benchmarking LLMs Highly Sensitive to Prompt Template
The prompt template used significantly impacts LLM performance evaluation. There is no universally optimal template, and the best-performing templates do not transfer well across models, datasets, or methods. This makes benchmarking LLMs very challenging.
Summary
- The prompt template affects LLM benchmark performance as much as, or more than, architectural changes such as sparse experts or scale.
- The study evaluated 19 LLMs for sensitivity to the prompt template; no single template performed best across all models and tasks.
- The best-performing template for a given model and task does not transfer well to other models, datasets, or prediction methods.
- Proposes "Template Ensembles" to average predictions over multiple templates to improve robustness. But very computationally expensive.
- Main contributions: a comprehensive analysis showing that gains from proper template selection are comparable to gains from architectural improvements, and the proposal of Template Ensembles as a baseline for improving robustness.
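
To make the Template Ensembles idea concrete, here is a minimal sketch of averaging per-choice predictions over several prompt templates for a multiple-choice task. The `score_choices` helper and the example templates are assumptions for illustration, not the paper's actual implementation or released code.

```python
from statistics import fmean

# Hypothetical helper: returns one probability per answer choice for a fully
# rendered prompt. In practice this would wrap an LLM call (e.g. scoring each
# choice's log-likelihood and normalizing); it is assumed here for the sketch.
def score_choices(prompt: str, choices: list[str]) -> list[float]:
    raise NotImplementedError("wrap your model's per-choice scoring here")

def template_ensemble_predict(question: str, choices: list[str],
                              templates: list[str]) -> str:
    """Average per-choice probabilities across templates, then pick the
    choice with the highest mean probability."""
    per_template_scores = [
        score_choices(template.format(question=question), choices)
        for template in templates
    ]
    # Mean probability for each choice across all templates.
    avg = [fmean(scores[i] for scores in per_template_scores)
           for i in range(len(choices))]
    return choices[avg.index(max(avg))]

# Example templates: each renders the same question differently.
templates = [
    "Question: {question}\nAnswer:",
    "{question}\nThe answer is",
    "Q: {question}\nA:",
]
```

Note that each prediction requires one model call per template, which is why the summary flags the approach as computationally expensive.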