Benchmarking LLMs Highly Sensitive to Prompt Template

The prompt template used for evaluation significantly impacts measured LLM performance. No universally optimal template exists, and the best-performing templates do not transfer well across models, datasets, or prediction methods, which makes benchmarking LLMs very challenging.


  • The prompt template affects LLM benchmark performance as much as, or more than, architectural changes such as sparse experts or model scale.
  • The study evaluated 19 LLMs for sensitivity to the prompt template; no single template performed best across all models and tasks.
  • The best-performing template for one model and task does not transfer well to other models, datasets, or prediction methods.
  • The paper proposes "Template Ensembles," which average predictions over multiple templates to improve robustness, though at a high computational cost (one inference pass per template).
  • Main contributions: a comprehensive analysis showing that gains from proper template selection are comparable to those from architectural improvements, and Template Ensembles as a baseline method for improving robustness.
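The Template Ensembles idea can be sketched in a few lines: score the same input under several prompt templates, average the per-label probabilities, and take the argmax. The sketch below is an illustration only, not the paper's implementation; `toy_model` is a hypothetical stand-in for a real LLM scoring function, and its deliberate template sensitivity mimics the finding being summarized.

```python
# Sketch of "Template Ensembles": average a model's label probabilities
# over several prompt templates instead of trusting any single one.

def toy_model(prompt: str) -> dict:
    """Hypothetical stand-in for an LLM scorer: probability per label.

    Deliberately template-sensitive, to mimic the benchmark finding.
    """
    bias = 0.25 if "Q:" in prompt else -0.05
    pos = min(max(0.5 + bias, 0.0), 1.0)
    return {"positive": pos, "negative": 1.0 - pos}

# Illustrative templates for a sentiment task (assumed, not from the paper).
TEMPLATES = [
    "Q: Is the sentiment of '{x}' positive or negative?\nA:",
    "Review: {x}\nSentiment:",
    "'{x}' - the sentiment is",
]

def template_ensemble_predict(x: str, templates=TEMPLATES) -> str:
    """Sum label probabilities across templates, then take the argmax.

    Dividing each total by len(templates) would give the ensemble
    probability, but the argmax is unchanged, so we skip it.
    """
    totals = {}
    for t in templates:
        for label, p in toy_model(t.format(x=x)).items():
            totals[label] = totals.get(label, 0.0) + p
    return max(totals, key=totals.get)

print(template_ensemble_predict("I loved this movie"))
```

Note the cost the article flags: each input requires one model call per template, so an ensemble of N templates multiplies inference cost by N.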

