Google AI Research Introduces Vantage: A Novel Approach to Assessing Collaboration and Creativity Based on Large Language Models

14.Apr.2026 09:10 · 3 min read

Google’s research team has introduced the Vantage method, which leverages large language models to simulate authentic team interactions for assessing “durable skills” such as collaboration, creativity, and critical thinking. The study shows that its AI-generated scores align closely with those of human experts, offering a novel technological pathway for educational assessment.


In education, traditional standardized tests can assess whether students have mastered calculus or can comprehend a text, but they struggle to measure abilities such as resolving disagreements within a team, generating innovative ideas under pressure, or critically analyzing arguments. These so-called “durable skills”—collaboration, creativity, and critical thinking—have long lacked effective, scalable measurement tools.

Google Research recently introduced a new approach called Vantage, which leverages large language models (LLMs) to simulate authentic group interactions and evaluate participant performance, aiming to build a more ecologically valid assessment framework for these capabilities.


Why Are “Durable Skills” So Difficult to Measure?

The research team notes that the core challenge in assessing durable skills lies in the tension between ecological validity and psychometric rigor. On one hand, assessments should take place in contexts that resemble real-world situations; on the other, they must ensure comparability and repeatability.

For example, the collaborative problem-solving assessment in PISA 2015 relied on multiple-choice questions and scripted interactions with simulated teammates. While this approach allowed for tight variable control, it sacrificed the complexity and dynamism of genuine human interaction.

According to the Google Research team, large language models offer the potential to strike a balance between these competing demands: they can create realistic conversational scenarios while enabling controlled generation and standardized scoring through a unified model.

The Core of Vantage: The Orchestrator LLM Architecture

At the heart of Vantage is the so-called “orchestrator LLM” architecture. This design uses a single LLM to generate responses for all AI participants, enabling coordinated management of the overall dialogue flow.

The advantages of this approach include:

  • Unified control over the behavioral logic of multiple AI roles;

  • Proactive guidance of conversation development based on predefined educational standards;

  • Intentional triggering of specific scenarios at key moments to test participant responses.

For instance, when evaluating conflict-resolution skills, the orchestrator LLM can deliberately introduce disagreements through the AI characters and observe how the human participant responds. The study found that, compared with uncoordinated independent agents, the orchestrator LLM performed better on two collaboration sub-skills, eliciting evidence of key behaviors at a significantly higher rate.
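
To make the design concrete, the following is a minimal Python sketch of what such an orchestrator loop might look like. Everything in it (the `call_llm` stand-in, the prompt wording, the `PROBE_TURN` trigger) is an illustrative assumption, not Google's actual implementation:

```python
# Illustrative sketch of an orchestrator-LLM turn. Not Google's code:
# `call_llm`, the prompt text, and the probe logic are all assumptions.
import json

AI_CHARACTERS = ["Riley", "Sam"]  # hypothetical simulated teammates
PROBE_TURN = 4                    # turn at which a disagreement is injected

SYSTEM_PROMPT = (
    "You control ALL simulated teammates in a group task. Given the "
    "transcript so far, return a JSON object mapping each character's "
    "name to their next utterance, keeping every character consistent. "
    "If a probe instruction is present, weave it into one character's turn."
)

def call_llm(system: str, user: str) -> str:
    """Stand-in for any chat-completion API call (assumption)."""
    # A real implementation would send `system` and `user` to an LLM endpoint.
    return json.dumps({name: "..." for name in AI_CHARACTERS})

def orchestrator_turn(transcript: list[dict], turn_index: int) -> dict:
    """Generate every AI character's next reply in a single model call."""
    probe = ""
    if turn_index == PROBE_TURN:
        # Deliberately trigger a scenario that tests conflict resolution,
        # mirroring the disagreement injection described above.
        probe = ("PROBE: have one character firmly disagree with the "
                 "participant's latest idea.")
    payload = json.dumps({
        "transcript": transcript,
        "characters": AI_CHARACTERS,
        "instruction": probe,
    })
    # One model, one call: all characters' replies are produced together,
    # which is what lets the orchestrator steer the dialogue globally.
    return json.loads(call_llm(SYSTEM_PROMPT, payload))

replies = orchestrator_turn(
    [{"speaker": "participant", "text": "Let's brainstorm ideas."}],
    turn_index=4,
)
```

The design choice mirrored here is the one the paper emphasizes: a single model produces every simulated teammate's turn, rather than each teammate running as an independent agent, which is what allows scripted probes to land at planned moments.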

Experimental Design and Evaluation Results

In the experiment, the research team recruited 188 participants between the ages of 18 and 25 to complete 30-minute collaborative tasks with AI characters, collecting a total of 373 dialogue transcripts.

Dialogue scoring was conducted jointly by two human raters from New York University and an AI-based evaluation tool. The results showed that:

  • AI-generated scores demonstrated strong agreement with expert human ratings (one way such agreement can be quantified is sketched after this list);

  • On measures of creativity and critical thinking, the orchestrator LLM outperformed independent agents;

  • The overall assessment framework shows promising scalability.
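
The article does not say which agreement statistic was used. For ordinal rubric scores, two common choices are quadratic-weighted Cohen's kappa and Spearman rank correlation; the sketch below, using invented scores, shows how either could be computed:

```python
# Illustrative only: the scores below are made up, not the study's data.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 rubric scores for ten transcripts.
human_scores = [3, 4, 2, 5, 4, 3, 1, 4, 5, 2]
ai_scores    = [3, 4, 3, 5, 4, 3, 2, 4, 4, 2]

# Quadratic weighting penalizes large disagreements more than small ones,
# which suits ordinal rating scales.
kappa = cohen_kappa_score(human_scores, ai_scores, weights="quadratic")

# Rank correlation is another common summary of rater agreement.
rho, _ = spearmanr(human_scores, ai_scores)

print(f"weighted kappa = {kappa:.2f}, Spearman rho = {rho:.2f}")
```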

A New Direction for Educational Assessment

Overall, Vantage leverages large language models to create controlled yet realistic collaborative scenarios, offering a new tool for the quantitative assessment of “durable skills.” Its orchestrator LLM architecture not only improves the identification of key behaviors but also achieves a high level of consistency with expert human scoring.

At a time when traditional tests struggle to capture abilities such as collaboration and creativity, Vantage highlights the expanding potential of AI in the field of educational assessment.