With the rapid advancement of large language models (LLMs), evaluating their capabilities efficiently and accurately has become essential for both developers and users. Unfortunately, most benchmarks evaluate ...