
Evaluating and Measuring General Intelligence

August 11, 2024

1 Introduction

The field of artificial intelligence has seen remarkable progress in recent years, with foundation models demonstrating increasing capabilities across a wide variety of tasks [2]. This rapid advancement has led to quickly saturating benchmarks [18], increasing the need for a paradigm shift in how we evaluate and benchmark more capable AI systems. Amidst these advancements, the very definition of intelligence itself is a subject of ongoing debate. Legg and Hutter summarized 70 literature definitions of intelligence as: "Intelligence measures an agent's general ability to achieve goals in a wide range of environments." [20] More recently, Chollet has emphasized intelligence as a general learning ability, defining it as: "The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience and generalization difficulty." [4] These definitions highlight the complexity of capturing intelligence in AI systems, as terms like "general ability" and "wide range of environments" are themselves not well-defined, complicating the creation of comprehensive evaluation methods. Translating these definitions into practical benchmarks is challenging, as concepts like "achieving goals" can be implemented using simple task-completion metrics, while "skill-acquisition efficiency" requires more complex evaluations.

The evolution of AI benchmarks mirrors the field's progress and growing ambitions. The introduction of the Common Task Framework (CTF) in the mid-1980s marked a significant shift, enabling quantitative comparisons of algorithms on fixed tasks. In the 1990s, specialized initiatives like FERET for facial recognition and various NLP benchmarks emerged, expanding the scope of AI evaluation [31]. Comprehensive evaluations such as OpenLLM [7], HELM [23] and BIG-Bench [35] have since expanded in complexity and breadth, yet they fundamentally continue the CTF tradition of measuring performance across an increasing number of tasks and domains. In response to these developments, Chollet introduced the ARC (Abstraction and Reasoning Corpus) benchmark [4], representing a paradigm shift by focusing on skill acquisition rather than task performance.

Despite these advancements, the rapid progress in AI capabilities has exposed fundamental challenges in evaluating AGI. These include difficulties in defining the scope and validity of general intelligence, constraints and biases inherent in benchmark design, limitations of current evaluation methodologies (such as a narrow focus on task-specific skills and the risk of benchmark data contamination), and practical considerations related to resource-intensive and time-consuming evaluation processes. This review will discuss how to address these challenges in the context of benchmarking AGI and propose future directions for holistic evaluation approaches.

In the Related Work chapter, an overview of various evaluation approaches is provided, categorizing them into evaluating large language models, evaluating multimodal models, evaluating general assistants, evaluating models holistically, evaluating continuously and evaluating abstract reasoning and generalization. The Discussion will provide a critical analysis of the reviewed methods by investigating trends and future directions in AGI evaluation, including dynamic benchmarks, multimodal assessments and approaches focused on measuring generalization and adaptability. Finally, the Conclusion summarizes the main findings, highlights the limitations of existing approaches and proposes potential solutions for more comprehensive assessments.

The evolution of AI evaluation methods has seen a shift from task-specific benchmarks to more comprehensive assessments [31]. This transformation is driven by the rapid advancements in AI, particularly with the emergence of foundation models, which demonstrate remarkable potential across diverse tasks and domains [2]. This section provides an overview of recent developments in AI evaluation, focusing primarily on benchmarks designed for foundation models, as the most relevant and challenging benchmarks in the field are now tailored towards assessing the broad capabilities of these types of systems.


Figure 1: This figure is best viewed in color. It illustrates the systematic evolution of AI evaluation methods, progressing from narrow, task-specific benchmarks to increasingly comprehensive assessments. Stages 1-5 illustrate increasing task diversity: (1) language model evaluation, focusing on linguistic tasks; (2) multimodal evaluation, incorporating diverse modalities; (3) general assistant evaluation, which introduces sequential task solving and context handling; (4) holistic evaluation, assessing performance across a broad spectrum of tasks and modalities; and (5) continuous evaluation, which employs dynamic benchmarks and human evaluation to provide ongoing assessment of AI systems by constantly introducing new tasks and challenges. Stage (6), abstract reasoning and generalization, represents a paradigm shift by focusing on an AI system's ability to acquire core knowledge and abstract reasoning capabilities from one task and apply them to new, unseen tasks (hatched symbols).

This chapter examines evaluating large language models, evaluating multimodal models, evaluating general assistants, evaluating models holistically, evaluating continuously and evaluating abstract reasoning and generalization, as illustrated in Figure 1.

2.1 Evaluating Large Language Models

Evaluating Large Language Models (LLMs) has become increasingly challenging as their capabilities rapidly progress. The General Language Understanding Evaluation (GLUE) benchmark [39], introduced in 2018, marked a significant milestone in multitask evaluation, combining diverse natural language understanding tasks such as question answering, sentiment analysis and textual entailment. However, models surpassed human performance on it within a year, leading to the development of SuperGLUE [38], which was also surpassed in less than 18 months.

As these benchmarks primarily focus on simpler textual understanding rather than complex reasoning, there was a need for broader and more challenging benchmarks. The Massive Multitask Language Understanding (MMLU) benchmark [9] addressed this by expanding the evaluation to over 15,000 multiple-choice questions spanning 57 subjects across various fields, representing a shift towards more human-centric and knowledge-intensive assessment. Nevertheless, recent models have surpassed human performance on this benchmark as well. MMLU-Pro [40] extends MMLU by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options.
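Benchmarks like MMLU reduce evaluation to exact-match accuracy over option letters. A minimal sketch of such a scoring loop is shown below; both sample items and the `ask_model` stub are invented placeholders, as a real harness would prompt a model and parse the option letter it returns:

```python
# Minimal sketch of an MMLU-style multiple-choice scoring loop.
# The items and the ask_model stub are hypothetical stand-ins.

items = [
    {"question": "Which planet is known as the Red Planet?",
     "options": {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Saturn"},
     "answer": "B"},
    {"question": "What is the derivative of x**2?",
     "options": {"A": "x", "B": "2", "C": "2*x", "D": "x**3 / 3"},
     "answer": "C"},
]

def ask_model(question, options):
    """Placeholder model call; this fake model always answers 'B'."""
    return "B"

def accuracy(items):
    correct = sum(ask_model(it["question"], it["options"]) == it["answer"]
                  for it in items)
    return correct / len(items)

print(accuracy(items))  # the always-'B' model gets 1 of 2 right: 0.5
```

Exact-match scoring of this kind is what makes these benchmarks cheap to run at scale, but it is also why they say little about how an answer was produced.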

Seeking even greater task diversity, BIG-Bench (Beyond the Imitation Game Benchmark) [35] was collaboratively created on GitHub with contributors submitting tasks via pull requests and peer review conducted through public discussions. It features over 200 diverse tasks from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias and beyond. BIG-Bench Hard [36], a subset of BIG-Bench, focuses on particularly challenging tasks that even advanced language models struggle with, providing a higher ceiling for evaluation.

AGIEval [47] offers an alternative evaluation approach by utilizing standardized exams such as the SAT, GMAT and Gaokao to assess model performance on human-centric tasks. These exams are designed to assess a broad range of skills and knowledge considered crucial for success in academic and professional settings. Taking a step further into specialized knowledge, GPQA (Graduate-Level Google-Proof Q&A) [33] consists of 448 multiple-choice questions written by domain experts to assess deep conceptual understanding and problem-solving skills at an advanced academic level. While the previously mentioned benchmarks also cover various domains, GPQA emphasizes the depth of understanding and reasoning required to solve complex, graduate-level problems. The benchmark is designed to be "Google-proof": even highly skilled non-experts achieve only 34% accuracy despite unlimited web access [33]. The BASIS Superintelligence benchmark [37] pushes the boundaries even further, aiming to evaluate AI at the highest levels of human intelligence. It consists of new, unique and offline questions designed to test for the capabilities associated with the 99.999995th percentile of human intelligence.

2.2 Evaluating Multimodal Models

As AI systems advance in their ability to handle multimodal inputs and outputs [2], comprehensive evaluation benchmarks have emerged to track the progress of models in performing a wide range of tasks across different domains, assessing their multimodal capabilities.

Expanding beyond text-only evaluations, MathVista [25] combines mathematical reasoning with visual information processing. This benchmark tests the model's ability to solve problems that require both numerical and visual comprehension, such as interpreting diagrams, graphs and visual representations of mathematical concepts.

MMMU [45] increases the diversity of tasks by presenting a diverse set of multimodal questions from college exams, quizzes and textbooks, spanning subjects from art and design to engineering and sciences. It aims to evaluate the model's capacity to understand and reason about complex, domain-specific visual information in an academic context.

MMT-Bench [44] broadens the scope beyond scientific domains, incorporating visual multiple-choice questions across diverse areas such as vehicle driving, GUI navigation and temporal understanding. For example, tasks might include locating the privacy setting button in a screenshot of the Android GUI, detecting the color of a traffic light in a given image or ordering pictures in the most likely temporal sequence.

The Perception Test [30] introduces a temporal dimension for all multimodal evaluation tasks it assesses, evaluating a model's perception using sequential data. It focuses on skills (memory, abstraction, physics, semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio and text modalities by asking multiple-choice questions about real-world videos. For instance, it tests abstraction skills by asking models to count the distinct objects that a person showed to the camera or to determine the order of written letters if they had been written in reverse. An example task for predictive reasoning is predicting whether a configuration of objects is likely to be stable after placing the last object, such as placing a book and some cards on a candle.

2.3 Evaluating General Assistants

As foundational models evolve towards more general-purpose assistants, new benchmarks have emerged to evaluate their capabilities in complex, real-world scenarios, specifically focusing on the ability of AI agents to operate autonomously within interactive environments. This contrasts with multimodal benchmarks, which primarily evaluate a model's ability to process and integrate information from diverse input modalities.

AgentBench [24] provides eight closed-box environments to evaluate AI assistants, including web shopping, databases and digital card games. OpenAGI [8] is a platform and benchmark that tests the ability of LLMs to effectively use external models, tools, plugins or APIs to solve multi-step, real-world tasks. GAIA assesses AI assistants on diverse, real-world tasks without specifying APIs. It uses three levels of increasing difficulty, with level-3 questions described as solvable only by a "near perfect general assistant" [27].

Some more recently introduced agent benchmarks focused on specific, complex environments: SWE-Bench [16], WebArena [48] and OSWorld [41] offer realistic software engineering, web and computer environments respectively. These benchmarks allow for the evaluation of AI agents in an incredibly rich, ever-growing and human-relevant task domain.

ARA (Autonomous Replication and Adaptation) [19] focuses on evaluating AI agents' potential for dangerous autonomy. This benchmark assesses an agent's ability to independently acquire resources, self-improve and adapt to new challenges. By testing agents on tasks ranging from simple file operations to complex scenarios like phishing, ARA aims to measure progress towards AI systems that could autonomously replicate and adapt. While current agents struggle with most tasks, the benchmark provides a framework for monitoring advancements in AI autonomy and associated safety risks.

2.4 Evaluating Models Holistically

Holistic evaluation approaches offer a standardized way to compare results on multiple benchmarks and metrics across various models. This standardization allows for a more comprehensive and consistent view of model performance, providing a more complete picture of their capabilities and limitations.

The Open LLM Leaderboard [7] uses a compilation approach that provides a ranking for open-source language models across multiple challenging benchmarks, including MMLU-Pro, GPQA and BIG-Bench Hard. In contrast to the Open LLM Leaderboard, which focuses solely on aggregating performance across various benchmarks, the Holistic Evaluation of Language Models (HELM) [23] takes a more systematic approach by providing a framework to evaluate models across a wide array of scenarios and metrics such as accuracy, robustness, fairness and bias. This enables a deeper understanding of models' capabilities, limitations and risks. HEMM (Holistic Evaluation of Multimodal Foundation Models) [22] and MMAU (Massive Multitask Agent Understanding) [43] both extend comprehensive evaluation to include multimodal models and agents. While HEMM assesses models based on basic multimodal skills, information flow and real-world use cases across 30 diverse image-text datasets, MMAU focuses on evaluating LLM agents' performance in five domains (including tool-use and mathematics), based on five essential capabilities (such as understanding, reasoning and self-correction).

2.5 Evaluating Models Continuously

Continuous model evaluation extends beyond traditional holistic approaches that use a fixed set of benchmarks. It leverages dynamically updated benchmarks or ongoing human evaluations, providing an ongoing assessment of the ability of models to generalize to unseen challenges.

DynaBench [17] is a dynamic benchmark that leverages human annotators to craft adversarial examples that specifically target weaknesses in current models, ensuring the benchmark remains a moving target. LatestEval [21] constructs reading comprehension tasks using recently published texts from platforms like arXiv, BBC and GitHub, providing a continuous source of novel information to test models on. Similarly, LiveCodeBench [14] focuses on evaluating the coding-related capabilities of LLMs by continuously incorporating new competitive programming problems from platforms like LeetCode, AtCoder and Codeforces.

Furthermore, human evaluation methods, exemplified by Chatbot Arena [3], offer continuous and direct comparisons based on human judgments. These human preferences can also be incorporated into automated evaluation through model-based approaches like LLM-as-judge [46], where a classifier is trained to predict which answer a human evaluator would prefer.
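Rankings in arena-style evaluation are derived from pairwise votes. As a rough illustration only (not the leaderboard's actual procedure), an Elo-style update over hypothetical vote data could be sketched as:

```python
# Simplified Elo-style rating update over pairwise human preference votes,
# in the spirit of arena-style evaluation. The vote data and K-factor are
# illustrative, not any leaderboard's actual procedure.

def expected_score(r_a, r_b):
    """Probability that the first model wins under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, winner, loser, k=32.0):
    """Shift rating mass from the loser to the winner."""
    e_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - e_win)
    ratings[winner] += delta
    ratings[loser] -= delta

ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
for winner, loser in votes:
    update(ratings, winner, loser)

print(sorted(ratings, key=ratings.get, reverse=True))  # model_a ranks first
```

Because each vote only moves rating mass between the two compared models, the ranking emerges from many noisy human judgments rather than from any fixed ground-truth answer key.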

2.6 Evaluating Abstract Reasoning and Generalization

The preceding sections have introduced benchmarks that prioritize achieving high performance on a broad spectrum of tasks, often with minimal constraints on the methods used to achieve that skill. Evaluating abstract reasoning and generalization explores a different facet of intelligence, focusing on the efficiency of skill acquisition. This approach prioritizes learning and solving entirely novel problems based on limited experience and a core set of fundamental priors.

The Abstraction and Reasoning Corpus (ARC) [4] aims to evaluate fluid intelligence and broad cognitive abilities in AI systems. ARC contains manually generated abstract visual patterns that are also unique and have a limited overlap between tasks. It requires models to infer abstract rules from a small set of examples and apply them to novel situations, mirroring human-like cognitive abilities. ARC tasks are designed to be solvable using core knowledge priors assumed to be innate to humans, such as object cohesion, object persistence, object influence via contact, goal-directedness, number and counting and basic geometry priors. Since ARC accounts for these priors, it aims to test skill acquisition instead of crystallized skills learned from unlimited training data [4].
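The ARC task format can be illustrated with a toy example: a few input-output grid pairs demonstrate a hidden rule, and a solver must find a transformation consistent with all demonstrations and apply it to a new input. The grids and the tiny candidate-rule search below are invented for illustration; real ARC tasks are far more varied:

```python
# Toy illustration of the ARC task format. Grids are lists of rows of
# color indices; the hidden rule here is "mirror each row".

train = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
    ([[0, 2], [3, 0]], [[2, 0], [0, 3]]),
]
test_input = [[4, 0], [0, 5]]

def mirror_rows(grid):
    return [row[::-1] for row in grid]

def rotate_180(grid):
    return [row[::-1] for row in grid[::-1]]

# A trivial "solver": pick the first candidate transformation that is
# consistent with every demonstration pair, then apply it to the test input.
candidates = [rotate_180, mirror_rows]
rule = next(f for f in candidates
            if all(f(x) == y for x, y in train))

print(rule(test_input))  # [[0, 4], [5, 0]]
```

The hard part of ARC is of course that the space of plausible rules is open-ended rather than a two-element list; the sketch only shows how few-shot demonstrations constrain the rule.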

Expanding upon this concept, MARVEL [15] introduces a multidimensional approach to abstract visual reasoning. It incorporates a broader range of visual patterns, geometric and abstract shapes and task configurations. It also utilizes a hierarchical evaluation framework, incorporating both abstract reasoning questions and perception-focused questions to pinpoint whether errors stem from perceptual limitations or flawed reasoning.

3 Discussion

The systematic overview in chapter 2 reveals the evolution of AGI evaluation approaches, progressing from narrow, task-specific benchmarks to more comprehensive assessments. However, the rapid advancement of AI has exposed several key issues that challenge the ability of current benchmarks to effectively measure progress towards AGI. These challenges, discussed in section 3.1, include defining the scope and validity of a general intelligence benchmark, addressing constraints and biases in their design and managing resource-intensive and time-consuming evaluation processes. In section 3.2 potential solutions to these challenges are explored, such as methods for measuring broad generalization, scaling benchmark complexity, dynamic benchmarks, multimodality and the incorporation of safety and autonomy considerations in AGI evaluation.

3.1 Challenges in Evaluating AGI

3.1.1 Scope, Validity and Defining AGI

  • Focus on task-specific performance: Current AI benchmarks often prioritize narrow, task-specific performance, which may not adequately capture the broad cognitive abilities required for AGI. This emphasis on isolated tasks, often rooted in the Common Task Framework (CTF), leads to an overemphasis on specialized expertise at the expense of general intelligence [31]. Furthermore, this focus on narrow skills is compounded by the rapid saturation of many benchmarks such as GLUE, SuperGLUE or MMLU [18]. It's important to note that benchmarks primarily test how well models can reproduce ground truth data points. While we assume that certain fundamental abilities (reasoning, understanding, logic, etc.) are necessary to find this ground truth, we cannot be entirely certain that AI systems use these abilities in the same way humans do. As Hernández-Orallo observed, focusing only on the model's performance can lead to AI systems that "solve these tasks without featuring intelligence" [10], achieving high scores by leveraging superficial patterns or structural features (e.g., API familiarity in AI assistant evaluations [27]) rather than demonstrating genuine understanding. This highlights the need for benchmarks that assess a broader range of cognitive abilities and adaptability across diverse tasks, which is crucial for evaluating progress towards AGI.
  • Human knowledge boundary: Humans' limited capabilities make it difficult to create and verify sufficiently challenging test instances, and we may be unable to assess AI performance beyond human knowledge [11]. For example, in protein folding, AlphaFold has demonstrated superhuman performance, but validating its novel predictions requires extensive laboratory work [28]. This example illustrates that as AI systems potentially surpass human expertise in various domains, traditional evaluation methods become insufficient. It raises fundamental questions about how to design and validate benchmarks for AGI systems that may operate beyond human cognitive limits.
  • Ambiguity of AGI Definition: Building an AGI benchmark is inherently challenging due to the lack of a universally accepted and precisely defined notion of "general intelligence." Varying definitions, such as Legg and Hutter's emphasis on achieving goals in a wide range of environments [20] or Chollet's focus on skill-acquisition efficiency [4], highlight this ambiguity. Furthermore, defining a "wide range of environments" that comprehensively captures the diversity and complexity of real-world scenarios is a significant undertaking. Similarly, specifying the appropriate "priors" for an AGI system, which encompass the innate knowledge and biases it should possess, is a complex and open-ended question [4]. These challenges make it difficult to translate abstract definitions of AGI into concrete, measurable criteria for benchmark design. This ambiguity not only complicates benchmark design but also raises questions about the validity of any single evaluation approach, suggesting the need for a multi-faceted evaluation strategy.

3.1.2 Constraints and Biases

  • Bias: Benchmarks often reflect a narrow range of perspectives, incorporating cultural biases or prioritizing specific values. This can compromise the objectivity and reliability of evaluations, as AI systems might achieve inflated performance due to their alignment with the specific cultural context embedded in the benchmark data [26]. Additionally, evaluation methods themselves can introduce biases, such as positional or verbosity bias in LLM-as-a-judge evaluations [46]. Addressing bias is particularly crucial in the pursuit of AGI, as any system aspiring to general intelligence must demonstrate the ability to operate effectively across diverse contexts and value systems.
  • Resource-Intensive Evaluation: Setting up benchmarks often poses technical challenges, requiring considerable engineering effort [26]. This problem is amplified for AGI evaluation, which likely demands broader, more complex evaluations, potentially requiring substantial resource allocation and infrastructure adjustments. These factors drive costs, raising questions about accessibility and fairness in AGI research, potentially favoring well-funded institutions and exacerbating existing inequalities in the field. This highlights the need for more efficient and scalable evaluation methods for AGI [26].
  • Time-Intensive Evaluation: Current benchmark evaluation cycles often lag behind the rapid pace of AI advancements. The HELM benchmark, for example, can take months to evaluate new models due to resource constraints and coordination complexities [6]. This lag is particularly concerning for AGI, as more general and capable systems will likely require even more complex and time-consuming evaluations. The time lag in evaluation could lead to a situation where models' capabilities outpace our ability to assess them accurately, potentially compromising safety and governance efforts.

3.2 Towards More Effective AGI Benchmarks

The challenges discussed in Section 3.1 collectively highlight the complexity of evaluating AGI and underscore the need for innovative approaches that can address these limitations. This section explores potential solutions and strategies for developing more effective AGI benchmarks:

3.2.1 Measuring Broad Generalization

To address the limitations of focusing solely on task-specific performance, AGI benchmarks should prioritize evaluating general capabilities that can transfer across diverse tasks.

Schlangen proposes a curriculum of tasks, organized in a complexity and inclusion hierarchy, where what is benchmarked is the developmental trajectory rather than performance on a single task [34]. This approach aligns with the goal of evaluating an AI system's ability to acquire and apply skills flexibly.

François Chollet's Abstraction and Reasoning Corpus (ARC) benchmark [4] represents a significant attempt to measure broad generalization and fluid intelligence. The ARC benchmark has several advantages, such as a focus on abstract reasoning and skill acquisition, which requires models to learn from limited examples. It has great task diversity compared to typical psychometric intelligence tests and finds its strength in hundreds of unique tasks with limited overlap. This diversity, coupled with manual rather than programmatic task generation, significantly reduces the risk of developers finding shortcuts through task-specific solutions or by reverse engineering a single generative program.

However, ARC also has notable limits that need to be addressed. Its limited dataset size may leave it vulnerable to specialized optimization strategies: systems might excel at the benchmark without developing genuine general intelligence. Moreover, the benchmark only features a binary evaluation format, which fails to capture the nuances of problem-solving and the varying levels of success an AI system may achieve on a problem. Additionally, ARC's reliance on innate human priors is particularly problematic, as these priors are poorly understood in cognitive science "and whether these priors are correctly captured in ARC is unclear" [4]. This uncertainty connects to Moravec's paradox, which states that "it is comparatively easy to make computers exhibit adult-level performance on intelligence tests or playing checkers and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility" [29]. This paradox highlights a significant challenge in AGI evaluation: how do we assess capabilities that are fundamental to human intelligence but difficult to quantify or replicate in artificial systems?

MARVEL [15] expands upon ARC's concepts, incorporating a broader range of visual patterns and task configurations and introducing a hierarchical evaluation framework that separates perceptual limitations from reasoning errors. This directly addresses Moravec's paradox by allowing researchers to pinpoint whether a model's failure stems from a lack of understanding of the visual input (potentially due to poorly represented priors) or a flawed reasoning process.

Moving beyond static visual reasoning, the Perception Test [30] introduces a time dimension aiming to assess a model's understanding of dynamic scenes. By evaluating perception using sequential data, it tries to examine how well models reason about changes and interactions within real-world videos, adding a layer of complexity to generalization not present in ARC and MARVEL. Its fine-grained analysis across cognitive skills (memory, abstraction, etc.), reasoning types (descriptive, explanatory, etc.) and modalities (video, audio, text) provides a more comprehensive evaluation tool to assess a model's perceptual strengths and weaknesses.

It is important to note that, unlike ARC, both MARVEL and the Perception Test utilize multiple-choice questions for evaluation. This shift towards classification, while providing more nuanced insights into core capabilities, might inadvertently simplify the task, as correctly identifying the answer among a limited set of choices is arguably less demanding than generating a novel solution from scratch. Finding the right balance between structured evaluation and open-ended problem-solving remains crucial for effectively capturing the multifaceted nature of general intelligence.

3.2.2 Scaling Benchmark Complexity

Since AI benchmarks are saturating rapidly, there is a need to include samples with increasing difficulty to effectively measure progress and to pinpoint the stage at which a model begins to struggle. Agent benchmarks are particularly well-suited for this challenge due to their ability to construct complex, real-world relevant problems across an arbitrary range of environments. This aligns with Legg and Hutter's definition of AGI as a system capable of achieving goals across a wide range of environments [20].

The GAIA benchmark exemplifies the increasing-difficulty approach with its three-level structure, starting with basic tasks and progressing to more complex ones described as only solvable by a "near-perfect general assistant" [27]. Building on the concept of increasing complexity, dynamic benchmarking via complexity classes [5] is a promising direction. This approach involves classifying algorithmic questions into complexity classes from P to NP-hard problems, each with ten granular difficulty levels.
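A benchmark built around graded difficulty levels needs a generator that maps a level to a concretely harder instance. The sketch below illustrates the idea with subset sum, an NP-hard problem; the level-to-size mapping is an invented example, not the scheme of [5]:

```python
import random

# Sketch of a difficulty-parameterized task generator: each of ten levels
# maps to a larger, harder subset-sum instance (subset sum is NP-hard).
# The level-to-size mapping is an invented example, not a published scheme.

def make_subset_sum_instance(level, seed=0):
    """Return (numbers, target) for a solvable instance; level is 1-10."""
    assert 1 <= level <= 10
    rng = random.Random(seed)
    n = 4 + 2 * level                       # more numbers at higher levels
    numbers = [rng.randint(1, 10 ** level) for _ in range(n)]
    subset = rng.sample(numbers, k=n // 2)  # sum over a real subset, so a
    return numbers, sum(subset)             # solution is guaranteed to exist

numbers, target = make_subset_sum_instance(level=3)
print(len(numbers))  # level 3 yields a 10-number instance
```

Fixing the random seed makes each generated instance reproducible, so a graded benchmark of this kind can still be shared and re-run exactly, while raising the level gives a principled knob for tracking where a model starts to fail.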

Beyond dynamic benchmarking via complexity classes, an automated curriculum generation approach, where a teacher program draws inspiration and grounding from the real world to programmatically generate more diverse, complex and novel tasks, could ensure scalability and real-world relevance for AI benchmarks [4]. This approach even has the potential to create a curriculum of tasks that are open-ended and capable of scaling to task environments much larger than what is presently possible, potentially crossing the human knowledge boundary [12]. For AGI evaluation, this could be particularly valuable as it could continuously generate new challenges that test the system's ability to adapt and learn in novel situations.

However, as benchmarks become more complex and dynamic, the evaluation process itself could become more resource- and time-intensive, potentially causing a lag between the development of new AI capabilities and our ability to assess them adequately. This also presents a challenge for researchers with limited resources, as creating and maintaining these benchmarks can be computationally demanding. Moreover, while increasing difficulty in benchmarks is crucial for tracking progress, we must be cautious in interpreting results since they rely on result-oriented metrics which can be insufficient when assessing complex reasoning. Currently, we primarily focus on whether a model achieves the correct answer, rather than how it arrives at that answer. As noted in "Levels of AGI" [28], advancements in mechanistic interpretability [32] may enable process-oriented metrics instead of result-oriented metrics in the future, allowing us to gain a deeper understanding of the underlying reasoning abilities of AI systems. This shift towards process-oriented metrics would provide a more comprehensive evaluation of complex reasoning, moving beyond simply checking the correctness of the final output. Therefore, it's crucial to continually evaluate whether our benchmarks genuinely assess intended general problem-solving capabilities rather than rewarding narrow, benchmark-specific optimization strategies.

3.2.3 Dynamic vs. Static Benchmarks

To reduce variability in standardized evaluations, static benchmarks can be utilized. They provide consistency and are easy to implement and reproduce. However, they face the risk of benchmark data contamination (BDC). BDC occurs when language models are exposed to information related to evaluation benchmarks in their training data, leading to skewed performance measurements and undermining the validity of evaluations [42].
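A common heuristic for detecting BDC is checking whether long word n-grams from a benchmark item also occur in the training corpus. The sketch below uses 8-grams as an illustrative window size, not a standard threshold:

```python
# Minimal sketch of a benchmark-data-contamination (BDC) check: flag a
# benchmark item if any of its word 8-grams also appears in the training
# corpus. The 8-gram window is an illustrative heuristic, not a standard.

def ngrams(text, n=8):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item, training_corpus, n=8):
    return bool(ngrams(benchmark_item, n) & ngrams(training_corpus, n))

corpus = "the quick brown fox jumps over the lazy dog near the river bank"
item_seen = "quick brown fox jumps over the lazy dog near the river"
item_new = "a completely different question about graph coloring and chromatic numbers"

print(is_contaminated(item_seen, corpus))  # True: an 8-gram overlaps
print(is_contaminated(item_new, corpus))   # False: no shared 8-gram
```

Real contamination audits must of course scan corpora far too large to hold in memory and tolerate paraphrase, which exact n-gram matching misses entirely; this is one reason dynamic benchmarks remain attractive.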

In contrast, dynamic benchmarks such as LatestEval [21] and LiveCodeBench [14] continuously update their problem sets, ensuring that models are consistently evaluated on unseen data. They also offer the advantage of adjusting their difficulty and scope to remain relevant and challenging, addressing the problem of benchmark saturation [18].

Some dynamic benchmarks, such as DynaBench [17], take this a step further by employing human annotators who adversarially try to create examples that the target model will misclassify, a process that can also help mitigate biases. The BIG-Bench [35] approach of using GitHub, with contributors submitting tasks via pull requests and peer review conducted through public discussions, likewise helps keep a benchmark live. Moreover, these processes can potentially be automated using the automated curriculum generation approach discussed earlier.

However, the implementation of dynamic benchmarks is not without challenges. Since these benchmarks are continuously changing, it is difficult to reproduce exact results, which complicates scientific comparison and validation. Ensuring that newly generated tasks maintain consistent difficulty and relevance over time requires careful design and ongoing monitoring. Additionally, continuously updating benchmarks and generating new tasks can be resource-intensive and may require sustained human oversight.

3.2.4 Language-Only vs. Multimodal Benchmarks

When designing an AGI benchmark, a crucial question arises: should we focus on language-only tasks or include multiple modalities? The Platonic Representation Hypothesis [13] suggests that different AI models, regardless of their input modalities, may converge towards a shared internal representation of reality.

Language-only benchmarks could therefore provide insights into progress towards this shared representation. They offer the advantage of standardization, as language can describe a wide range of concepts and tasks across different domains. Furthermore, frameworks like HELM [23] already provide a foundation for measuring crucial metrics such as accuracy, calibration, robustness, fairness, bias, toxicity and efficiency in language-based tasks, demonstrating the potential for rigorous and multifaceted evaluation even within the confines of language-only benchmarks. However, relying solely on language as the evaluation interface has limitations: it privileges AI systems designed primarily for language processing and may not accurately assess the capabilities of models that operate with different modalities or lack a strong language component.
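
Calibration, one of the HELM metrics listed above, can be quantified with the expected calibration error (ECE). The following is a minimal binned sketch of that standard metric, not HELM's implementation:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: average |accuracy - confidence| over equal-width
    confidence bins, weighted by the number of samples per bin."""
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # half-open bins (lo, hi]; put confidence 0.0 into the first bin
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / total * abs(acc - conf)
    return ece
```

A well-calibrated model that claims 95% confidence should be right about 95% of the time; the ECE measures how far the model's stated confidences deviate from its empirical accuracy.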

Multimodal benchmarks, on the other hand, provide a more direct test of whether models are truly converging on a modality-agnostic representation of reality, since far fewer representations remain competent across a broad, diverse set of tasks than across a narrow, constrained one. By incorporating diverse sensory inputs such as images, videos and sounds, these benchmarks can evaluate an AI system's ability to integrate information across different sensory modalities and act in a more holistic and realistic manner. This is crucial for AGI, as it needs to understand and interact with the world in a way that is not limited to language alone.

3.2.5 Incorporating Safety and Autonomy

When designing benchmarks for evaluating advanced AI systems, it is crucial to consider safety and autonomy. While these considerations might not directly measure general intelligence, they are essential for understanding the broader implications of increasingly capable AI systems.

Amodei et al. [1] conducted a comprehensive review of 170 articles to identify and categorize key challenges in AI safety. Their analysis provides a valuable framework for incorporating safety considerations into AGI benchmarks. The authors distilled five core problem areas: avoiding negative side effects, scalable oversight, safe exploration, robustness to distributional shift and avoiding reward hacking. These categories offer concrete, measurable aspects of safety that can be integrated into AGI evaluation criteria. For instance, an AGI benchmark could assess a system's ability to avoid unintended consequences (negative side effects), such as evaluating if a code completion AI suggests efficient code without introducing security vulnerabilities. Similarly, robustness to distributional shift could be evaluated by assessing how well a medical diagnosis AI trained on adult patients adapts to cases involving children.
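
Robustness to distributional shift, for instance, can be operationalized as the accuracy drop between an in-distribution test set and a shifted one. A minimal sketch, where the `model` callable and `(input, label)` dataset format are placeholder assumptions:

```python
def accuracy(model, dataset):
    """Fraction of (input, label) pairs the model predicts correctly."""
    return sum(model(x) == y for x, y in dataset) / len(dataset)

def robustness_gap(model, in_dist, shifted):
    """Accuracy lost when moving from in-distribution data to a
    shifted distribution; smaller is more robust."""
    return accuracy(model, in_dist) - accuracy(model, shifted)
```

In the medical-diagnosis example above, `in_dist` would hold adult cases and `shifted` pediatric ones; a large gap signals that benchmark scores on the original distribution overstate real-world reliability.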

A crucial dimension of AI safety is the system's capacity for autonomous operation. As AI systems become more capable of independent action, understanding and evaluating their autonomy becomes increasingly important. The "Levels of Autonomy" framework for AI systems proposed by Morris et al. [28] provides a useful perspective on this issue: it lists example risks for each level of autonomy, ranging from de-skilling or over-reliance for "AI as a Tool" to concentration of power for "AI as an Agent". Evaluating an AGI's level of autonomy within this framework helps us understand its capacity to act independently in complex environments and adapt to unforeseen circumstances, allowing us to design appropriate benchmarks and safety measures.

By incorporating these dimensions of safety and autonomy, we ensure that evaluations also capture a system's capacity to operate safely and reliably. This approach provides a more holistic assessment of AGI capabilities, addressing both performance and potential risks.

4 Conclusion

This literature review has explored the current landscape of AGI evaluation, highlighting the limitations of existing approaches and proposing potential solutions for more comprehensive assessments. The need for innovative evaluation methodologies is underscored by several critical challenges, including defining the scope and validity of general intelligence, addressing constraints and biases in benchmark design, managing resource-intensive and time-consuming evaluation processes and mitigating the risks of benchmark saturation and data contamination.

Promising future research directions include dynamic benchmarks that continuously evolve, such as Dynabench [17], LatestEval [21] and LiveCodeBench [14]. Incorporating complexity classes [5] and automated curriculum generation [4, 12] can also provide a more granular and realistic assessment of AI progress. Additionally, agent benchmarks such as OpenAGI [8] and GAIA [27] aim to evaluate AI systems in complex, real-world scenarios, assessing their ability to understand, plan and execute tasks across different domains. This is valuable since it establishes stronger links between benchmark performance and real-world applicability. Future AGI benchmarks should incorporate safety and autonomy considerations to address potential risks associated with highly capable AI systems [19, 28]. Furthermore, measuring broad generalization remains a central challenge in AGI evaluation. The Abstraction and Reasoning Corpus (ARC) benchmark [4] represents a notable attempt to capture fluid intelligence and skill acquisition, but it has limitations that need to be addressed. Complementing abstraction and reasoning benchmarks with a diverse range of tasks and incorporating fine-grained analysis across cognitive skills and reasoning types, as exemplified by the Perception Test [30], will be crucial for developing more robust and comprehensive benchmarks that can accurately assess progress towards AGI.

In conclusion, the evaluation of AGI is a multifaceted and evolving challenge that requires ongoing research and innovation. By addressing the limitations of current benchmarks, exploring novel evaluation methodologies and incorporating safety and autonomy considerations, the field can work towards more meaningful and comprehensive assessments of artificial general intelligence. As Raji et al. [31] emphasize, "The effective development of benchmarks is critical to progress in machine learning, but what makes a benchmark effective is not the strength of its arbitrary and false claim to 'generality' but its effectiveness in how it helps us understand as researchers how certain systems work— and how they don't." As the quest for AGI continues, it is crucial to maintain a critical and reflective approach to evaluation, constantly questioning assumptions and pushing the boundaries of what is possible.

Bibliography

  1. Amodei, D., Olah, C., Steinhardt, J., Christiano, P. F., Schulman, J., and Mané, D. “Concrete Problems in AI Safety”. In: CoRR abs/1606.06565 (2016). arXiv: 1606.06565. URL: http://arxiv.org/abs/1606.06565.
  2. Bommasani, R., Hudson, D. A., Adeli, E., Altman, R. B., Arora, S., Arx, S. von, Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N. S., Chen, A. S., Creel, K., Davis, J. Q., Demszky, D., Donahue, C., Doumbouya, M., Durmus, E., Ermon, S., Etchemendy, J., Ethayarajh, K., Fei-Fei, L., Finn, C., Gale, T., Gillespie, L., Goel, K., Goodman, N. D., Grossman, S., Guha, N., Hashimoto, T., Henderson, P., Hewitt, J., Ho, D. E., Hong, J., Hsu, K., Huang, J., Icard, T., Jain, S., Jurafsky, D., Kalluri, P., Karamcheti, S., Keeling, G., Khani, F., Khattab, O., Koh, P. W., Krass, M. S., Krishna, R., Kuditipudi, R., et al. “On the Opportunities and Risks of Foundation Models". In: CoRR abs/2108.07258 (2021). arXiv: 2108.07258. URL: https://arxiv.org/abs/2108.07258.
  3. Chiang, W., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhang, H., Zhu, B., Jordan, M. I., Gonzalez, J. E., and Stoica, I. “Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference". In: CoRR abs/2403.04132 (2024). DOI: 10.48550/ARXIV.2403.04132. arXiv: 2403.04132. URL: https://doi.org/10.48550/arXiv.2403.04132.
  4. Chollet, F. “On the Measure of Intelligence”. In: CoRR abs/1911.01547 (2019). arXiv: 1911.01547. URL: http://arxiv.org/abs/1911.01547.
  5. Fan, L., Hua, W., Li, L., Ling, H., and Zhang, Y. “NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes". In: CoRR abs/2312.14890 (2023). DOI: 10.48550/ARXIV.2312.14890. arXiv: 2312.14890. URL: https://doi.org/10.48550/arXiv.2312.14890.
  6. Ganguli, D., Schiefer, N., Favaro, M., and Clark, J. Challenges in evaluating AI systems. Oct. 4, 2023. URL: https://www.anthropic.com/index/evaluating-ai-systems.
  7. Gao, L., Tow, J., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., McDonell, K., Muennighoff, N., Phang, J., Reynolds, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation. Sept. 2021. DOI: 10.5281/zenodo.5371628. URL: https://doi.org/10.5281/zenodo.5371628.
  8. Ge, Y., Hua, W., Mei, K., Ji, J., Tan, J., Xu, S., Li, Z., and Zhang, Y. “OpenAGI: When LLM Meets Domain Experts". In: Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. Ed. by Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. 2023. URL: http://papers.nips.cc/paper%5C_files/paper/2023/hash/1190733f217404edc8a7f4e15a57f301-Abstract-Datasets%5C_and%5C_Benchmarks.html.
  9. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. “Measuring Massive Multitask Language Understanding”. In: CoRR abs/2009.03300 (2020). arXiv: 2009.03300. URL: https://arxiv.org/abs/2009.03300.
  10. Hernández-Orallo, J. “Evaluation in artificial intelligence: from task-oriented to ability-oriented measurement”. In: Artif. Intell. Rev. 48.3 (2017), pp. 397–447. DOI: 10.1007/S10462-016-9505-7. URL: https://doi.org/10.1007/s10462-016-9505-7.
  11. Hernández-Orallo, J. “Twenty Years Beyond the Turing Test: Moving Beyond the Human Judges Too”. In: Minds Mach. 30.4 (2020), pp. 533–562. DOI: 10.1007/S11023-020-09549-0. URL: https://doi.org/10.1007/s11023-020-09549-0.
  12. Hughes, E., Dennis, M., Parker-Holder, J., Behbahani, F., Mavalankar, A., Shi, Y., Schaul, T., and Rocktaschel, T. Open-Endedness is Essential for Artificial Superhuman Intelligence. 2024. arXiv: 2406.04268 [cs.LG]. URL: https://doi.org/10.48550/arXiv.2406.04268.
  13. Huh, M., Cheung, B., Wang, T., and Isola, P. “The Platonic Representation Hypothesis". In: CoRR abs/2405.07987 (2024). DOI: 10.48550/ARXIV.2405.07987. arXiv: 2405.07987. URL: https://doi.org/10.48550/arXiv.2405.07987.
  14. Jain, N., Han, K., Gu, A., Li, W., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. “LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code". In: CoRR abs/2403.07974 (2024). DOI: 10.48550/ARXIV. 2403.07974. arXiv: 2403.07974. URL: https://doi.org/10.48550/arXiv.2403.07974.
  15. Jiang, Y., Zhang, J., Sun, K., Sourati, Z., Ahrabian, K., Ma, K., Ilievski, F., and Pujara, J. “MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning". In: CoRR abs/2404.13591 (2024). DOI: 10.48550/ARXIV.2404.13591. arXiv: 2404.13591. URL: https://doi.org/10.48550/arXiv.2404.13591.
  16. Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” In: CoRR abs/2310.06770 (2023). DOI: 10.48550/ARXIV.2310.06770. arXiv: 2310.06770. URL: https://doi.org/10.48550/arXiv.2310.06770.
  17. Kiela, D., Bartolo, M., Nie, Y., Kaushik, D., Geiger, A., Wu, Z., Vidgen, B., Prasad, G., Singh, A., Ringshia, P., Ma, Z., Thrush, T., Riedel, S., Waseem, Z., Stenetorp, P., Jia, R., Bansal, M., Potts, C., and Williams, A. “Dynabench: Rethinking Benchmarking in NLP”. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021. Ed. by Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tür, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., and Zhou, Y. Association for Computational Linguistics, 2021, pp. 4110–4124. DOI: 10.18653/V1/2021.NAACL-MAIN.324. URL: https://doi.org/10.18653/v1/2021.naacl-main.324.
  18. Kiela, D., Thrush, T., Ethayarajh, K., and Singh, A. "Plotting Progress in AI". In: Contextual AI Blog (2023). URL: https://contextual.ai/blog/plotting-progress.
  19. Kinniment, M., Sato, L. J. K., Du, H., Goodrich, B., Hasin, M., Chan, L., Miles, L. H., Lin, T. R., Wijk, H., Burget, J., Ho, A., Barnes, E., and Christiano, P. “Evaluating Language-Model Agents on Realistic Autonomous Tasks”. In: CoRR abs/2312.11671 (2023). DOI: 10.48550/ARXIV.2312.11671. arXiv: 2312.11671. URL: https://doi.org/10.48550/arXiv.2312.11671.
  20. Legg, S. and Hutter, M. “A Universal Measure of Intelligence for Artificial Agents". In: IJCAI-05, Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, UK, July 30 - August 5, 2005. Ed. by Kaelbling, L. P. and Saffiotti, A. Professional Book Center, 2005, pp. 1509–1510. URL: http://ijcai.org/Proceedings/05/Papers/post-0042.pdf.
  21. Li, Y., Guerin, F., and Lin, C. “LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction”. In: Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February 20-27, 2024, Vancouver, Canada. Ed. by Wooldridge, M. J., Dy, J. G., and Natarajan, S. AAAI Press, 2024, pp. 18600–18607. DOI: 10.1609/AAAI.V38I17.29822. URL: https://doi.org/10.1609/aaai.v38i17.29822.
  22. Liang, P. P., Goindani, A., Chafekar, T., Mathur, L., Yu, H., Salakhutdinov, R., and Morency, L.-P. HEMM: Holistic Evaluation of Multimodal Foundation Models. 2024. arXiv: 2407.03418 [cs.LG]. URL: https://doi.org/10.48550/arXiv.2407.03418.
  23. Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C., Manning, C. D., Ré, C., Acosta-Navas, D., Hudson, D. A., Zelikman, E., Durmus, E., Ladhak, F., Rong, F., Ren, H., Yao, H., Wang, J., Santhanam, K., Orr, L. J., Zheng, L., Yüksekgönül, M., Suzgun, M., Kim, N., Guha, N., Chatterji, N. S., Khattab, O., Henderson, P., Huang, Q., Chi, R., Xie, S. M., Santurkar, S., Ganguli, S., Hashimoto, T., Icard, T., Zhang, T., Chaudhary, V., Wang, W., Li, X., Mai, Y., Zhang, Y., and Koreeda, Y. “Holistic Evaluation of Language Models”. In: CoRR abs/2211.09110 (2022). DOI: 10.48550/ARXIV.2211.09110. arXiv: 2211.09110. URL: https://doi.org/10.48550/arXiv.2211.09110.
  24. Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., Zhang, S., Deng, X., Zeng, A., Du, Z., Zhang, C., Shen, S., Zhang, T., Su, Y., Sun, H., Huang, M., Dong, Y., and Tang, J. “AgentBench: Evaluating LLMs as Agents”. In: CoRR abs/2308.03688 (2023). DOI: 10.48550/ARXIV.2308.03688. arXiv: 2308.03688. URL: https://doi.org/10.48550/arXiv.2308.03688.
  25. Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K., Galley, M., and Gao, J. “MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models”. In: CoRR abs/2310.02255 (2023). DOI: 10.48550/ARXIV.2310.02255. arXiv: 2310.02255. URL: https://doi.org/10.48550/arXiv.2310.02255.
  26. McIntosh, T. R., Susnjak, T., Liu, T., Watters, P. A., and Halgamuge, M. N. “Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence". In: CoRR abs/2402.09880 (2024). DOI: 10.48550/ARXIV.2402.09880. arXiv: 2402.09880. URL: https://doi.org/10.48550/arXiv.2402.09880.
  27. Mialon, G., Fourrier, C., Swift, C., Wolf, T., LeCun, Y., and Scialom, T. “GAIA: a benchmark for General AI Assistants". In: CoRR abs/2311.12983 (2023). DOI: 10.48550/ARXIV.2311.12983. arXiv: 2311.12983. URL: https://doi.org/10.48550/arXiv.2311.12983.
  28. Morris, M. R., Sohl-Dickstein, J., Fiedel, N., Warkentin, T., Dafoe, A., Faust, A., Farabet, C., and Legg, S. “Levels of AGI: Operationalizing Progress on the Path to AGI". In: CoRR abs/2311.02462 (2023). DOI: 10.48550/ARXIV.2311.02462. arXiv: 2311.02462. URL: https://doi.org/10.48550/arXiv.2311.02462.
  29. Owen, V. M. “Mind Children-The Future of Robot and Human Intelligence by Hans Moravec Harvard University Press, Cambridge, Massachusetts, 1988, 205 PP + index (£14.95)". In: Robotica 7.4 (1989), pp. 366–367. DOI: 10.1017/S026357470000686X. URL: https://doi.org/10.1017/S026357470000686X.
  30. Patraucean, V., Smaira, L., Gupta, A., Recasens, A., Markeeva, L., Banarse, D., Koppula, S., Heyward, J., Malinowski, M., Yang, Y., Doersch, C., Matejovicova, T., Sulsky, Y., Miech, A., Fréchette, A., Klimczak, H., Koster, R., Zhang, J., Winkler, S., Aytar, Y., Osindero, S., Damen, D., Zisserman, A., and Carreira, J. “Perception Test: A Diagnostic Benchmark for Multimodal Video Models". In: Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. Ed. by Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. 2023. URL: http://papers.nips.cc/paper%5C_files/paper/2023/hash/8540fba4abdc7f9f7a7b1cc6cd60e409-Abstract-Datasets%5C_and%5C_Benchmarks.html.
  31. Raji, I. D., Bender, E. M., Paullada, A., Denton, E., and Hanna, A. “AI and the Everything in the Whole Wide World Benchmark". In: CoRR abs/2111.15366 (2021). arXiv: 2111.15366. URL: https://arxiv.org/abs/2111.15366.
  32. Räuker, T., Ho, A., Casper, S., and Hadfield-Menell, D. “Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks". In: 2023 IEEE Conference on Secure and Trustworthy Machine Learning, SaTML 2023, Raleigh, NC, USA, February 8-10, 2023. IEEE, 2023, pp. 464–483. DOI: 10.1109/SATML54575.2023.00039. URL: https://doi.org/10.1109/SaTML54575.2023.00039.
  33. Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. “GPQA: A Graduate-Level Google-Proof Q&A Benchmark”. In: CoRR abs/2311.12022 (2023). DOI: 10.48550/ARXIV.2311.12022. arXiv: 2311.12022. URL: https://doi.org/10.48550/arXiv.2311.12022.
  34. Schlangen, D. “Targeting the Benchmark: On Methodology in Current Natural Language Processing Research”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 2: Short Papers), Virtual Event, August 1-6, 2021. Ed. by Zong, C., Xia, F., Li, W., and Navigli, R. Association for Computational Linguistics, 2021, pp. 670–674. DOI: 10.18653/V1/2021.ACL-SHORT.85. URL: https://doi.org/10.18653/v1/2021.acl-short.85.
  35. Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., Kluska, A., Lewkowycz, A., Agarwal, A., Power, A., Ray, A., Warstadt, A., Kocurek, A. W., Safaya, A., Tazarv, A., Xiang, A., Parrish, A., Nie, A., Hussain, A., Askell, A., Dsouza, A., Rahane, A., Iyer, A. S., Andreassen, A., Santilli, A., Stuhlmüller, A., Dai, A. M., La, A., Lampinen, A. K., Zou, A., Jiang, A., Chen, A., Vuong, A., Gupta, A., Gottardi, A., Norelli, A., Venkatesh, A., Gholami-davoodi, A., Tabassum, A., Menezes, A., Kirubarajan, A., Mullokandov, A., Sabharwal, A., Herrick, A., Efrat, A., Erdem, A., Karakas, A., et al. “Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models". In: CoRR abs/2206.04615 (2022). DOI: 10.48550/ARXIV.2206.04615. arXiv: 2206.04615. URL: https://doi.org/10.48550/arXiv.2206.04615.
  36. Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E. H., Zhou, D., and Wei, J. “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them”. In: Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023. Ed. by Rogers, A., Boyd-Graber, J. L., and Okazaki, N. Association for Computational Linguistics, 2023, pp. 13003–13051. DOI: 10.18653/V1/2023.FINDINGS-ACL.824. URL: https://doi.org/10.18653/v1/2023.findings-acl.824.
  37. Thompson, A. D. and Betts, J. BASIS: Betts Artificial Superintelligence Suite. Accessed: 2024-06-27. Oct. 2023. URL: https://lifearchitect.ai/basis/.
  38. Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems". In: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada. Ed. by Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Garnett, R. 2019, pp. 3261–3275. URL: https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6-Abstract.html.
  39. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". In: CoRR abs/1804.07461 (2018). arXiv: 1804.07461. URL: http://arxiv.org/abs/1804.07461.
  40. Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., Li, T., Ku, M., Wang, K., Zhuang, A., Fan, R., Yue, X., and Chen, W. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. 2024. arXiv: 2406.01574 [cs.CL]. URL: https://doi.org/10.48550/arXiv.2406.01574.
  41. Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T. J., Cheng, Z., Shin, D., Lei, F., Liu, Y., Xu, Y., Zhou, S., Savarese, S., Xiong, C., Zhong, V., and Yu, T. “OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments". In: CoRR abs/2404.07972 (2024). DOI: 10.48550/ARXIV.2404.07972. arXiv: 2404.07972. URL: https://doi.org/10.48550/arXiv.2404.07972.
  42. Xu, C., Guan, S., Greene, D., and Kechadi, M.-T. Benchmark Data Contamination of Large Language Models: A Survey. 2024. arXiv: 2406.04244 [cs.CL]. URL: https://doi.org/10.48550/arXiv.2406.04244.
  43. Yin, G., Bai, H., Ma, S., Nan, F., Sun, Y., Xu, Z., Ma, S., Lu, J., Kong, X., Zhang, A., Yap, D. A., Zhang, Y., Ahnert, K., Kamath, V., Berglund, M., Walsh, D., Gindele, T., Wiest, J., Lai, Z., Wang, X., Shan, J., Cao, M., Pang, R., and Wang, Z. MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains. 2024. arXiv: 2407.18961 [cs.AI]. URL: https://doi.org/10.48550/arXiv.2407.18961.
  44. Ying, K., Meng, F., Wang, J., Li, Z., Lin, H., Yang, Y., Zhang, H., Zhang, W., Lin, Y., Liu, S., Lei, J., Lu, Q., Chen, R., Xu, P., Zhang, R., Zhang, H., Gao, P., Wang, Y., Qiao, Y., Luo, P., Zhang, K., and Shao, W. “MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI”. In: CoRR abs/2404.16006 (2024). DOI: 10.48550/ARXIV.2404.16006. arXiv: 2404.16006. URL: https://doi.org/10.48550/arXiv.2404.16006.
  45. Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., and Chen, W. “MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI”. In: CoRR abs/2311.16502 (2023). DOI: 10.48550/ARXIV.2311.16502. arXiv: 2311.16502. URL: https://doi.org/10.48550/arXiv.2311.16502.
  46. Zheng, L., Chiang, W., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”. In: Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. Ed. by Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. 2023. URL: http://papers.nips.cc/paper%5C_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets%5C_and%5C_Benchmarks.html.
  47. Zhong, W., Cui, R., Guo, Y., Liang, Y., Lu, S., Wang, Y., Saied, A., Chen, W., and Duan, N. “AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models". In: CoRR abs/2304.06364 (2023). DOI: 10.48550/ARXIV.2304.06364. arXiv: 2304.06364. URL: https://doi.org/10.48550/arXiv.2304.06364.
  48. Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Bisk, Y., Fried, D., Alon, U., and Neubig, G. “WebArena: A Realistic Web Environment for Building Autonomous Agents”. In: CoRR abs/2307.13854 (2023). DOI: 10.48550/ARXIV.2307.13854. arXiv: 2307.13854. URL: https://doi.org/10.48550/arXiv.2307.13854.