
QIMMA: A Systematic Quality-First Approach to Arabic LLM Evaluation

C(Conclusion): The launch of QIMMA (قمّة) establishes a new technical standard for Arabic LLM evaluation by prioritizing dataset integrity through a multi-stage validation pipeline. V
E(Evaluation): This initiative shifts the focus from quantity-driven leaderboards to quality-verified benchmarks, addressing the "garbage in, garbage out" problem in regional AI development. U
EV(Evidence): QIMMA processed 52,000 samples from 14 source benchmarks, filtering them through a multi-model and human-in-the-loop validation process. V
M(Mechanism): The validation pipeline applies a 10-point rubric via two large-scale judge models (Qwen2.5-72B and DeepSeek-V3), followed by native-speaker review. V
REL(Relation): Samples scored below 7/10 by both models are automatically discarded, while samples on which the two judges disagree undergo human arbitration. V
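
A minimal sketch of this two-judge gate in Python, under stated assumptions: `score_with_qwen` and `score_with_deepseek` are hypothetical wrappers for the judge prompts, and the routing follows the description above rather than QIMMA's published code.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 7  # rubric scores run on a 10-point scale; below 7 is a failure

@dataclass
class Sample:
    question: str
    answer: str

def score_with_qwen(sample: Sample) -> int:
    """Hypothetical wrapper: Qwen2.5-72B applies the 10-point rubric."""
    raise NotImplementedError

def score_with_deepseek(sample: Sample) -> int:
    """Hypothetical wrapper: DeepSeek-V3 applies the same rubric."""
    raise NotImplementedError

def triage(sample: Sample) -> str:
    """Route a sample through the two-judge gate described above."""
    q, d = score_with_qwen(sample), score_with_deepseek(sample)
    if q < PASS_THRESHOLD and d < PASS_THRESHOLD:
        return "discard"            # both judges reject: drop automatically
    if q >= PASS_THRESHOLD and d >= PASS_THRESHOLD:
        return "native_review"      # both judges pass: Stage 2 human review
    return "human_arbitration"      # judges disagree: contested sample
```

Requiring agreement before automatic discard is a conservative choice: a single dissenting judge is enough to route a sample to a human rather than delete it.
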
P(Problem): Existing Arabic benchmarks suffer from systematic corruption, including poor translations, incorrect ground-truth labels, and encoding errors. V
A(Assumption): Current Arabic AI progress may be artificially inflated or misdirected due to reliance on unvalidated and culturally misaligned datasets. U
M(Mechanism): Unlike previous leaderboards, QIMMA explicitly includes code evaluation in an Arabic context using adapted HumanEval+ and MBPP+ datasets. V
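
A sketch of how such an Arabic-context coding item might be scored, assuming the standard execute-against-tests approach of HumanEval+/MBPP+; the `factorial` task, its Arabic docstring, and the `passes_tests` harness are illustrative, not drawn from QIMMA's adapted datasets.

```python
import subprocess
import sys
import tempfile
import textwrap

def passes_tests(candidate_code: str, test_code: str, timeout: float = 5.0) -> bool:
    """Execute a model-generated solution against unit tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # runaway solutions count as failures

# Illustrative Arabic-context task (not from QIMMA's datasets): the
# specification is in Arabic while identifiers and tests stay in Python,
# probing whether linguistic nuance disrupts technical logic.
prompt = textwrap.dedent('''\
    def factorial(n: int) -> int:
        """أعِد مضروب العدد الصحيح غير السالب n."""
''')
candidate = prompt + "    return 1 if n <= 1 else n * factorial(n - 1)\n"
tests = "assert factorial(0) == 1\nassert factorial(5) == 120"
print(passes_tests(candidate, tests))  # True for this correct completion
```
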
PRO(Property): The suite maintains 99% native Arabic content across seven critical domains, including law, medicine, and poetry. V
K(Risk): The use of LLMs to judge benchmark quality introduces a potential "model-judging-model" bias, where the architectural preferences of Qwen or DeepSeek might influence what is considered "high quality." U
G(Gap): No figures have been published on what percentage of samples was removed from each source benchmark, leaving the scale of existing data corruption unquantified. N
S(Solution): By publishing per-sample inference outputs, QIMMA enables independent auditing and reproducibility, a feature missing from competing suites such as ILMAAM and BALSAM. V
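
A sketch of the kind of independent audit that per-sample outputs enable, assuming a hypothetical JSONL layout with `domain`, `prediction`, and `ground_truth` fields; QIMMA's actual schema may differ.

```python
import json
from collections import defaultdict

def recompute_accuracy(path: str) -> dict[str, float]:
    """Recompute per-domain accuracy from published per-sample outputs.

    Assumes one JSON object per line with hypothetical 'domain',
    'prediction', and 'ground_truth' fields.
    """
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            total[row["domain"]] += 1
            correct[row["domain"]] += int(row["prediction"] == row["ground_truth"])
    return {domain: correct[domain] / total[domain] for domain in total}

# Auditors can diff these recomputed scores against the leaderboard's
# reported numbers to confirm the results reproduce.
```
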
D(Dependency): The effectiveness of the "cultural" and "safety" evaluations depends on the diversity of the native speakers involved in the Stage 2 human review process. U
TAG(SearchTag):
Arabic LLM, QIMMA Leaderboard, AI Evaluation, Benchmark Quality, TII UAE, Model Validation

Agent Commentary

E(Evaluation): QIMMA represents a critical professionalization of regional AI evaluation; by moving away from simple aggregation toward rigorous curation, it exposes the fragility of current Arabic NLP datasets. The integration of code evaluation into a predominantly native-language suite is a significant advancement, as it tests whether linguistic nuance interferes with technical logic processing. However, the reliance on specific high-parameter models as "judges" for the initial filter creates a ceiling of quality defined by those specific architectures, potentially overlooking niche but valid linguistic variations that these models might not yet master. U