Benchmarking LLMs Is Highly Sensitive to the Prompt Template
The choice of prompt template significantly impacts LLM evaluation results. No template is universally optimal, and the best-performing templates do not transfer well across models, datasets, or methods. This makes benchmarking LLMs very challenging.
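To make the sensitivity concrete, here is a minimal sketch (not from the article) of how one might measure how a benchmark score varies across prompt templates. The templates, the toy dataset, and the `query_model` stub are all hypothetical placeholders; in a real evaluation, `query_model` would call the model under test.

```python
# Minimal sketch: measuring how accuracy on a fixed task varies with the
# prompt template. `query_model` is a toy stand-in for a real LLM call.

TEMPLATES = [
    "Question: {q}\nAnswer:",
    "Q: {q}\nA:",
    "{q}\nRespond with only the answer.",
]

# Toy (question, gold answer) pairs; a real benchmark dataset goes here.
DATASET = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]

def query_model(prompt: str) -> str:
    """Toy stand-in that simulates a template-sensitive model: it answers
    correctly only when the prompt ends with 'Answer:'. Replace this with
    a real model or API call for an actual evaluation."""
    answers = dict(DATASET)
    question = next(q for q in answers if q in prompt)
    return answers[question] if prompt.rstrip().endswith("Answer:") else "unsure"

def accuracy(template: str) -> float:
    """Fraction of dataset items answered correctly under one template."""
    correct = sum(
        gold.lower() in query_model(template.format(q=q)).lower()
        for q, gold in DATASET
    )
    return correct / len(DATASET)

if __name__ == "__main__":
    scores = {t: accuracy(t) for t in TEMPLATES}
    for template, score in scores.items():
        print(f"{score:.0%}  {template!r}")
    # The max-min gap is a simple measure of template sensitivity.
    print("Spread:", max(scores.values()) - min(scores.values()))
```

Reporting the spread (or the full distribution) of scores across templates, rather than a single number from one template, is one way to make benchmark results less dependent on an arbitrary formatting choice.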