Tencent and UCLA Open-Source OpenSearch-VL, a Multimodal Deep Search Agent Framework

AI Models | 07.May.2026 10:17 | 3 min read

Tencent Hunyuan, in collaboration with UCLA and CUHK, has released OpenSearch-VL, an open-source framework for training multimodal large language model agents capable of active, multi-step search and reasoning. The project aims to close key reproducibility gaps in data pipelines, tool integration, and reinforcement learning recipes for deep search agents.

As multimodal large language models (MLLMs) rapidly evolve, the next frontier is enabling them to move beyond passive image understanding toward active evidence gathering and multi-step reasoning. This shift—from describing images to autonomously searching, verifying, and reasoning across modalities—has proven difficult to reproduce in open research due to gaps in high-quality training data, trajectory synthesis methods, and reinforcement learning (RL) recipes.

Tencent Hunyuan, working with researchers from the University of California, Los Angeles (UCLA) and The Chinese University of Hong Kong (CUHK), is attempting to close that gap with the release of OpenSearch-VL, an open-source multimodal deep search agent framework. The team has published a technical paper and plans to release datasets, code, and model weights to support reproducibility and further research.

From Passive Vision to Active Search

Traditional vision-language models excel at recognizing and describing visual content. However, real-world problem solving often requires multi-hop reasoning: identifying visual clues, issuing search queries, filtering results, and synthesizing external knowledge. According to the researchers, a key bottleneck has been the lack of structured, high-quality training trajectories that teach models how to perform these steps sequentially.

To address this, OpenSearch-VL introduces a data production pipeline built on Wikipedia’s hyperlink graph. The system samples relational paths between entities and converts them into multi-hop question-answering tasks. To prevent shortcut learning, the team applies entity rewriting that obscures direct answers and adds visual grounding based on source-code anchors, forcing the model to identify the relevant visual cues before invoking external tools.
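
In rough pseudocode, the path-sampling idea can be pictured as follows. This is a minimal sketch assuming the hyperlink graph is available as an adjacency map; the function names, question template, and toy entities are illustrative stand-ins, not the released pipeline:

```python
# Minimal sketch of hyperlink-graph path sampling for multi-hop QA.
# Graph format, question template, and entities are illustrative,
# not the authors' released pipeline.
import random

def sample_relation_path(graph: dict, start: str, hops: int = 2) -> list:
    """Random-walk a chain of linked entities, e.g. painting -> artist -> birthplace."""
    path = [start]
    for _ in range(hops):
        neighbors = graph.get(path[-1], [])
        if not neighbors:
            break
        path.append(random.choice(neighbors))
    return path

def to_multihop_question(path: list) -> dict:
    """Turn an entity chain into a QA pair whose answer is the final entity.
    A real pipeline would additionally rewrite intermediate entities into
    indirect descriptions so the answer cannot be found with one lookup."""
    head, *bridges, answer = path
    hops = " -> ".join(bridges) if bridges else "(direct link)"
    question = (f"The image shows {head}. Following the chain of linked "
                f"entities {hops}, which entity do you reach?")
    return {"question": question, "answer": answer, "evidence_path": path}

toy_graph = {
    "Mona Lisa": ["Leonardo da Vinci"],
    "Leonardo da Vinci": ["Vinci, Tuscany"],
}
print(to_multihop_question(sample_relation_path(toy_graph, "Mona Lisa")))
```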

The project includes two primary datasets:

  • SearchVL-SFT, with 36,000 supervised fine-tuning trajectories.
  • SearchVL-RL, with 8,000 reinforcement learning training samples.
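
The released schema is not reproduced here, but a supervised trajectory of this kind is typically stored as an alternating sequence of model reasoning, tool calls, and tool observations. A hypothetical record might look like the following (field names, tool names, and the example task are illustrative):

```python
# Hypothetical shape of a single SearchVL-SFT-style trajectory. Field names,
# tool names, and the example task are illustrative, not the released schema.
trajectory = {
    "image": "query_image.jpg",
    "question": "Which museum currently houses the artwork shown in the photo?",
    "steps": [
        {"thought": "The photo is skewed; rectify it before reading the plaque.",
         "tool": "perspective_correction", "args": {"image": "query_image.jpg"}},
        {"observation": "rectified.jpg"},
        {"thought": "Read the title from the plaque.",
         "tool": "ocr", "args": {"image": "rectified.jpg"}},
        {"observation": "Girl with a Pearl Earring"},
        {"thought": "Search for the holding institution.",
         "tool": "web_search", "args": {"query": "Girl with a Pearl Earring current museum"}},
        {"observation": "Mauritshuis, The Hague ..."},
    ],
    "answer": "Mauritshuis",
}
```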

A Tool-Rich Multimodal Environment

Unlike text-only search agents, OpenSearch-VL integrates a broader tool ecosystem tailored to multimodal inputs. In practical scenarios, user-submitted images may be blurry, skewed, or low resolution, limiting downstream retrieval performance.

To compensate, the framework equips the agent with multiple preprocessing and retrieval tools, including:

  • Web search and reverse image search
  • Optical character recognition (OCR)
  • Image cropping and sharpening
  • Super-resolution reconstruction
  • Perspective correction

This design encourages “active perception,” in which the agent first enhances or repairs its visual input before initiating knowledge retrieval. According to the team, this improves robustness and search accuracy under messy, real-world conditions.
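
To make the tool-calling flow concrete, the sketch below shows how such an agent loop might be wired. The tool names mirror the list above, but the registry, the model.next_action interface, and the placeholder implementations are assumptions for illustration, not the released framework:

```python
# Simplified sketch of a tool-calling loop with "active perception": the agent
# can enhance or repair the image before any retrieval call. Implementations
# and the model.next_action interface are placeholders.
from typing import Callable, Dict

TOOLS: Dict[str, Callable[..., str]] = {
    "ocr": lambda image: "extracted text ...",
    "crop": lambda image, box: "cropped.jpg",
    "sharpen": lambda image: "sharpened.jpg",
    "super_resolution": lambda image: "upscaled.jpg",
    "perspective_correction": lambda image: "rectified.jpg",
    "web_search": lambda query: "top search snippets ...",
    "reverse_image_search": lambda image: "visually similar pages ...",
}

def run_agent(model, image: str, question: str, max_steps: int = 8) -> str:
    """Let the model pick tools step by step until it emits a final answer."""
    context = [{"image": image, "question": question}]
    for _ in range(max_steps):
        # Expected to return e.g. {"tool": "ocr", "args": {"image": ...}}
        # or {"final_answer": "..."} once enough evidence has been gathered.
        action = model.next_action(context)
        if "final_answer" in action:
            return action["final_answer"]
        observation = TOOLS[action["tool"]](**action["args"])
        context.append({"tool": action["tool"], "observation": observation})
    return "no answer within the step budget"
```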

Learning From Failure: Multi-Round Fault-Aware GRPO

Long-horizon tool use introduces cascading failure risks: a timeout or incorrect call can derail the entire reasoning chain. Conventional RL methods often discard failed trajectories, wasting potentially useful intermediate reasoning steps.

OpenSearch-VL proposes a “multi-round fault-aware GRPO” algorithm to address this inefficiency. The approach identifies failure points in tool calls, masks invalid post-failure signals, and applies one-sided advantage clamping to preserve useful reasoning steps that occurred before the error. This enables the model to learn effective exploration strategies even when tasks do not fully succeed.
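
The exact formulation is given in the paper; the sketch below only illustrates the two mechanisms described above, under stated assumptions: a failure is reduced to a single token index, post-failure advantages are zeroed out, and the one-sided clamp is read as a floor applied to pre-failure advantages.

```python
# Illustrative sketch of the failure-aware advantage treatment described above.
# How failures are detected, where the mask starts, and the clamp threshold are
# assumptions for illustration, not the paper's exact algorithm.
from typing import Optional
import numpy as np

def fault_aware_advantages(token_advantages: np.ndarray,
                           failure_step: Optional[int] = None,
                           clamp_floor: float = 0.0) -> np.ndarray:
    """token_advantages: per-token advantages for one rollout (GRPO-style,
    e.g. an outcome reward standardized within a group of rollouts and
    broadcast over tokens). failure_step: index of the first token emitted
    after a failed tool call (timeout, malformed call), or None if no failure."""
    adv = token_advantages.copy()
    if failure_step is not None:
        # Mask: tokens after the failure point carry no learning signal.
        adv[failure_step:] = 0.0
        # One-sided clamp: pre-failure reasoning is not pushed down merely
        # because the rollout later failed; positive advantages pass through.
        adv[:failure_step] = np.maximum(adv[:failure_step], clamp_floor)
    return adv

# A failed rollout (below group average) with a tool timeout at step 3:
print(fault_aware_advantages(np.full(5, -0.4), failure_step=3))  # -> all zeros
# A successful rollout is left untouched:
print(fault_aware_advantages(np.full(5, 0.6)))                   # -> all 0.6
```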

Benchmark Performance and Open Research Implications

In evaluations across seven mainstream multimodal deep search benchmarks, OpenSearch-VL reportedly improves average performance by more than 10 percentage points. On selected tasks, its results approach those of leading proprietary commercial systems, according to the research team.

If validated by the broader community, the release could help standardize training practices for multimodal search agents and reduce reliance on closed ecosystems. By open-sourcing data pipelines, training recipes, and tool integration frameworks, the collaborators aim to provide a reproducible foundation for researchers building next-generation multimodal agents.

The project underscores a broader trend in AI research: shifting from static perception models to interactive systems capable of structured reasoning, tool use, and adaptive recovery from failure—key ingredients for more capable autonomous agents.