Seeking and Updating with Live Visual Knowledge

1 Huazhong University of Science and Technology
2 University of Washington
3 Zhejiang University
4 University of Illinois, Chicago

Overview

In this work, we introduce LIVEVQA, a new dataset and benchmark for evaluating Multimodal Large Language Models (MLLMs) on their ability to understand and reason about up-to-date visual information. Our contributions are three-fold:
  1. A Novel Dataset. We introduce LIVEVQA, the first dataset of its kind, featuring 107,143 samples across 12 categories, specifically designed to test how MLLMs handle visual information beyond their training data cutoff and how they can be updated with new knowledge.
  2. Comprehensive Benchmarking and Analysis. We conducted extensive benchmarking of 17 state-of-the-art MLLMs, revealing significant performance gaps on content beyond their knowledge cutoff. Our findings show that tool-use or agentic visual seeking frameworks can drastically improve performance by an average of 327%.
  3. Efficient Knowledge Updating Insights. We explored parameter-efficient fine-tuning (PEFT) methods, demonstrating that MLLMs can be efficiently updated with new visual knowledge within a single epoch. While this can impact visual perception, it can enhance knowledge-intensive capabilities, and we provide insights into balancing adapter capacity and model capability.

Abstract

The visual world around us constantly evolves, from real-time news and social media trends to global infrastructure changes visible through satellite imagery and augmented reality enhancements. However, Multimodal Large Language Models (MLLMs), which automate many tasks, struggle to stay current, limited by the cutoff dates of their fixed training datasets. To quantify this stagnation, we introduce LiveVQA, a first-of-its-kind dataset featuring 107,143 samples across 12 categories, specifically designed to support research on both seeking and updating live visual knowledge. Drawing from recent news articles, video platforms, and academic publications spanning April 2024 to May 2025, LiveVQA enables evaluation of how models handle the latest visual information beyond their knowledge boundaries and how well current methods can update them. Our comprehensive benchmarking of 17 state-of-the-art MLLMs reveals significant performance gaps on content beyond the knowledge cutoff, while tool-use and agentic visual seeking frameworks deliver an average improvement of 327%. Furthermore, we explore parameter-efficient fine-tuning (PEFT) methods to update MLLMs with new visual knowledge, and we dive deeply into the critical balance between adapter capacity and model capability when doing so.

LIVEVQA Dataset Construction

We introduce LIVEVQA, a first-of-its-kind dataset containing fresh visual content and corresponding question-answer pairs, aimed at benchmarking and advancing Multimodal Large Language Models (MLLMs) in seeking and updating live visual knowledge. The visual content is sourced from recent international news articles, YouTube videos, and academic papers spanning April 2024 to early May 2025. The construction of LIVEVQA follows a multi-stage LLM/MLLM-in-the-loop pipeline with rigorous filtering and human validation:
  1. Raw Data Collection from Diverse Sources: This stage collects recent visual and textual data. For news articles, it includes URL and headline filtering, image selection based on size and relevance (with GPT-4.1 used to ensure strong correlation with events), and semantic deduplication. For videos (from YouTube), it involves preprocessing (restricting to English videos of at most 10 minutes with subtitles), subtitle-based segmentation using an LLM, initial keyframe identification (using UVD and perceptual hashing for deduplication; a minimal deduplication sketch follows this list), and LLM-driven selection of the top-K relevant keyframes. For academic papers (from arXiv), it includes extracting titles, abstracts, authors, images, and captions, followed by key-image selection that prioritizes architectural diagrams and key findings while avoiding common visualizations.
  2. Visual Question Answering (VQA) Generation and Filtering: This stage constructs two levels of questions. Level 1 questions target basic visual entity recognition (e.g., locations, persons, time) based on filtered images and metadata, with GPT-4.1 used to filter out unqualified QAs (e.g., those with overly brief answers or simple labels). Level 2 questions are more complex, requiring multi-hop cross-modal reasoning using the full image context and related textual information, covering seven types (location, person, organization, time, event, count, reason); these are also generated and filtered by GPT-4.1 to ensure answer verifiability. All LLM/MLLM-assisted processes undergo human validation with a high pass rate.
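To make the keyframe deduplication step above concrete, here is a minimal sketch using perceptual hashing. It assumes the `imagehash` and Pillow packages and an illustrative Hamming-distance threshold; it is a sketch under these assumptions, not the exact pipeline code.

```python
# Minimal sketch: perceptual-hash deduplication of candidate keyframes,
# in the spirit of the keyframe-filtering step described above.
# The Hamming-distance threshold is an illustrative assumption.
from PIL import Image
import imagehash

def dedup_keyframes(paths, max_hamming=6):
    """Keep only frames whose perceptual hash differs enough from all kept frames."""
    kept, kept_hashes = [], []
    for path in paths:
        h = imagehash.phash(Image.open(path))          # 64-bit perceptual hash
        if all(h - prev > max_hamming for prev in kept_hashes):
            kept.append(path)                          # visually novel frame
            kept_hashes.append(h)
    return kept

# Example usage on frames extracted from one video segment:
# unique_frames = dedup_keyframes(["frame_000.jpg", "frame_001.jpg"])
```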

Data Filtering

LIVEVQA Filtering Process

The diagram above illustrates the comprehensive filtering process employed in the construction of the LIVEVQA dataset. It details how raw images and synthesized question-answer pairs are systematically refined across three distinct data source pipelines: YouTube videos, arXiv academic papers, and news articles (from sources such as Forbes, Variety, CNN, BBC, and Associated Press). The pipeline begins with a large corpus of "Raw Images" (e.g., 829K from YouTube, 180K from arXiv, 19K from News). These are then subjected to a series of stringent filtering stages. Key steps include "Key Frame Filters" for video content, "Irrelevant Image Filters" to remove non-pertinent visuals, and "Choose the Most Representative" to select the most informative images. Further refinement occurs through "Level-1 QAs Filters" and "Level-2 QAs Filters", followed by an "AI Judge & Filter QAs" step. This meticulous process significantly reduces the volume of data, ensuring that only high-quality and relevant "Meta Images" and their associated reasoning questions (e.g., culminating in 12K images from YouTube, 9K from arXiv, and 8K from News) are included in the final LIVEVQA dataset. This multi-layered filtering strategy is essential for maintaining the integrity and utility of the benchmark.

Benchmark Display

Example Image 1

Example 1: News

Source: CNN Sport

Level 1 Question:

Based on the provided image, what is the specific location where this celebration is taking place?

  • A. Augusta National Golf Club
  • B. Pebble Beach Golf Links
  • C. St Andrews Links
  • D. Torrey Pines Golf Course
  • E. Pinehurst No. 2
Ground Truth: A (Augusta National Golf Club)
Level 2 Question:

What is the reason the camera operator visible in the green tower above the crowd was able to capture the critical moment shown in the image?

  • A. replay producer signaled to hold the shot
  • B. technical director delayed switching as ordered
  • C. director anticipated the shot outcome
  • D. head coach instructed to stay on main feed
Ground Truth: B (technical director delayed switching as ordered)
Example Image 2

Example 2: News

Source: BBC

Level 1 Question:

Based on the provided image, what is the specific location shown?

  • A. Trafalgar Square
  • B. Hyde Park Corner
  • C. Parliament Square
  • D. Leicester Square
  • E. Piccadilly Circus
Ground Truth: C (Parliament Square)
Level 2 Question:

Why did the Home Secretary announce the extension of criminal protection to the monument prominently shown in the image?

  • A. public demand for more statues
  • B. banner slogan matching protest motto
  • C. country celebrating VE Day
  • D. Parliament Square redevelopment approval
Ground Truth: C (country celebrating VE Day)
Example Image 3

Example 3: Video

Source: YouTube

Level 1 Question:

Based on the provided image, what event is taking place?

  • A. Mumbai Tech Startup Expo 2024
  • B. MasterSoft Hosts Higher Education Leaders Conclave
  • C. Maharashtra Digital Learning Conference
  • D. National Policy on Education Review Summit
  • E. Nagpur Academic Technology Symposium
Ground Truth: B (MasterSoft Hosts Higher Education Leaders Conclave)
Level 2 Question:

How many years was the association mentioned by the principal who took over in 2009 before adopting the solution described by the man at the blue podium?

  • A. 2 years
  • B. 3 years
  • C. 14 years
  • D. 4 years
Ground Truth: A (2 years)
Example Image 4

Example 4: Video

Source: YouTube

Level 1 Question:

Based on the provided image, what event is taking place?

  • A. Venice Film Festival
  • B. Cannes Film Festival
  • C. Sundance Film Festival
  • D. Berlin International Film Festival
  • E. Toronto International Film Festival
Ground Truth: A (Venice Film Festival)
Level 2 Question:

What is the name of the event at which the audience shown in the image is present?

  • A. Academy Awards Ceremony
  • B. Venice Film Festival
  • C. Cannes Film Festival
  • D. Berlin International Film Festival
Ground Truth: B (Venice Film Festival)
Example Image 5

Example 5: Academic Paper

Source: arXiv

Level 1 Question:

Who is the primary author of the paper shown here?

  • A. R. J. Smethurst
  • B. Hugh Dickinson
  • C. L. F. Fortson
  • D. Tobias Géron
  • E. Izzy L. Garland
Ground Truth: D (Tobias Géron)
Level 2 Question:

In this paper, for the sample of 6,640 galaxies that remained after the deduplication process, and based on the described classification scheme (where a galaxy is unbarred if p_strong_bar + p_weak_bar < 0.5), how many galaxies were ultimately classified as unbarred?

  • A. 311
  • B. 161
  • C. 6640
  • D. 6479
  • E. 398
Ground Truth: D (6479)
Example Image 6

Example 6: Academic Paper

Source: arXiv

Level 1 Question:

Who conducted the research presented in this image?

  • A. Yupeng Zhang
  • B. Mridul Sharma
  • C. Prajwal Thapa
  • D. Jinu Nyachhyon
  • E. Yagya Raj Pandeya
Ground Truth: C (Prajwal Thapa)
Level 2 Question:

In this paper, what is the precise count of distinct, pre-trained architectural frameworks that the researchers explicitly selected, then uniformly adapted at their terminal processing stage for the 60-class herb identification problem, and subsequently benchmarked against one another?

  • A. 1
  • B. 5
  • C. 6
  • D. 60
  • E. 121
Ground Truth: C (6)

LIVEVQA Dataset Statistics

Below we present an overview of the main statistics of LIVEVQA, showcasing its composition across different data sources and splits. LIVEVQA contains a total of 28,488 unique images and 107,143 questions.
| Category | Images | #Questions | Level 1 | Level 2 | Avg. Len. | Purpose |
| --- | --- | --- | --- | --- | --- | --- |
| News Article | 7,579 | 38,809 | 7,579 | 31,230 | 749 | - |
| YouTube Videos | 11,948 | 43,168 | 11,948 | 31,220 | 311 | - |
| Academic Paper | 8,961 | 25,166 | 9,456 | 16,205 | 597 | - |
| Avg. per Sample | 1 | 3.86 | 1 | 2.86 | 517 | - |
| Test Split | 1,500 | 3,000 | 1,500 | 1,500 | 544 | Exp. 1 |
| Training Split | 26,988 | 104,143 | 26,988 | 77,150 | 496 | Exp. 2 |
Dataset Statistics Distributions

Figure 4: (Left) Image size distribution in YouTube image filtering pipeline. (Right) Textual context length distribution for each question.

Benchmark Results for LIVEVQA

We conducted a comprehensive benchmark of 17 state-of-the-art Multimodal Large Language Models (MLLMs) to evaluate their capabilities in seeking and updating live visual knowledge. The evaluation was performed on the LIVEVQA dataset, which includes content from recent news articles, YouTube videos, and academic papers. Performance was measured for Level 1 (visual entity recognition) and Level 2 (deeper visual knowledge reasoning) questions, with and without various search augmentation methods. Key findings indicate that current MLLMs struggle significantly with visual knowledge beyond their training cutoff, but performance is drastically improved with the use of multimodal search tools.
Accuracy (%) on the visual factuality seeking benchmark in open-ended format (Avg. = average).

| Model | Cutoff | News (L1) | Video (L1) | Arxiv (L1) | Avg. (L1) | News (L2) | Video (L2) | Arxiv (L2) | Avg. (L2) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| w.o. Search | | | | | | | | | |
| GPT-4.1 | Jun. 2024 | 27.0 | 22.0 | 0.4 | 16.5 | 5.2 | 7.2 | 0.2 | 3.0 |
| GPT-4.1-mini | Jun. 2024 | 24.6 | 19.6 | 0.2 | 14.8 | 4.0 | 7.8 | 0.4 | 4.0 |
| GPT-4.1-nano | Jun. 2024 | 13.0 | 13.0 | 0.0 | 8.6 | 2.2 | 6.0 | 0.4 | 2.9 |
| Gemini-2.5-Flash | Jan. 2025 | 25.8 | 18.4 | 0.8 | 15.0 | 4.6 | 4.4 | 4.0 | 4.3 |
| Gemini-2.5-Pro | Jan. 2025 | 28.0 | 17.4 | 0.6 | 15.3 | 4.4 | 2.4 | 1.2 | 2.7 |
| Gemma-3-27B-It | Aug. 2024 | 21.0 | 16.4 | 1.0 | 12.8 | 3.8 | 4.6 | 6.2 | 4.9 |
| Claude-3.7-Sonnet | Oct. 2024 | 26.2 | 16.4 | 0.6 | 14.3 | 2.2 | 4.4 | 4.4 | 3.7 |
| Qwen-2.5-VL-7B-Instruct | Unknown | 20.2 | 13.4 | 0.2 | 11.3 | 3.8 | 5.4 | 2.0 | 3.7 |
| Qwen-2.5-VL-32B-Instruct | Unknown | 25.2 | 16.4 | 0.4 | 14.0 | 4.2 | 5.6 | 1.2 | 3.7 |
| Qwen-2.5-VL-72B-Instruct | Unknown | 12.4 | 9.4 | 0.0 | 7.3 | 1.4 | 3.6 | 3.6 | 2.9 |
| Llama-4-Scout | Aug. 2024 | 20.6 | 16.4 | 0.0 | 12.1 | 4.0 | 5.0 | 2.8 | 3.9 |
| Llama-4-Maverick | Aug. 2024 | 20.2 | 19.0 | 0.6 | 13.3 | 5.8 | 6.0 | 5.2 | 5.7 |
| w. Text Search | | | | | | | | | |
| GPT-4.1 | Jun. 2024 | 25.0 | 21.4 | 0.6 | 15.6 | 3.6 | 5.6 | 3.8 | 4.3 |
| Gemini-2.5-Pro | Jan. 2025 | 17.6 | 9.2 | 0.2 | 9.0 | 2.0 | 1.6 | 1.0 | 1.5 |
| Claude-3.7-Sonnet | Oct. 2024 | 24.6 | 16.6 | 0.0 | 13.7 | 2.0 | 3.6 | 4.8 | 3.5 |
| w. Native Image Search | | | | | | | | | |
| o3 | Jun. 2024 | 33.6 | 33.6 | 2.6 | 23.3 | 14.6 | 14.9 | 17.8 | 15.8 |
| w. MM-Search [Jiang et al., 2024] | | | | | | | | | |
| GPT-4.1 | Jun. 2024 | 42.0 | 36.1 | 22.0 | 33.4 | 27.2 | 15.2 | 48.8 | 30.4 |
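For reference, here is a minimal sketch of how per-source and average accuracies like those above can be aggregated from per-question judgments. The record schema and the unweighted averaging over sources are assumptions for illustration, not the paper's exact scoring script.

```python
# Minimal sketch: aggregate per-question judgments into per-source and average
# accuracies for Level 1 and Level 2. Each record is assumed to look like
# {"source": "news" | "video" | "arxiv", "level": 1 | 2, "correct": bool}.
from collections import defaultdict

def summarize(records):
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["level"], r["source"])
        totals[key] += 1
        hits[key] += int(r["correct"])
    table = {}
    for level in (1, 2):
        per_source = {s: 100.0 * hits[(level, s)] / max(totals[(level, s)], 1)
                      for s in ("news", "video", "arxiv")}
        per_source["avg"] = sum(per_source.values()) / 3  # assumed unweighted mean
        table[f"level{level}"] = per_source
    return table
```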

Empirical Results from LIVEVQA

MLLMs Face Challenges with "Live" Visual Knowledge; Multimodal Search is Key

Our comprehensive benchmarking of 17 state-of-the-art Multimodal Large Language Models (MLLMs) on the LIVEVQA dataset revealed significant difficulties in handling visual information beyond their knowledge cutoff dates. For instance, even top-performing models showed low accuracy on recent visual content when operating without external tools.

However, the integration of multimodal search capabilities leads to dramatic improvements.

  1. Models augmented with multimodal search tools (e.g., GPT-4.1 with MM-Search) demonstrated an average accuracy increase of 327% in seeking live visual knowledge. Specifically, GPT-4.1's average accuracy more than doubled from 16.5% to 33.4% when using MM-Search, with particularly striking gains on challenging Level 2 questions (e.g., accuracy on News subset Level 2 rose from 5.2% to 27.2%).
  2. Native image search capabilities, as seen in models such as o3, also provided substantial gains (e.g., from 3.0% to 15.8% on Level 2 questions). In contrast, simple text-based online searching did not yield significant improvements, underscoring the necessity of multimodal retrieval for dynamic visual information (a generic sketch of a search-augmented answering loop follows this list).
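The sketch below shows the general shape of such a search-augmented (agentic visual seeking) loop. It is not the MM-Search implementation: the three helper functions are hypothetical stubs standing in for a real reverse-image-search API, a web scraper, and an MLLM call.

```python
# Generic sketch of a search-augmented visual answering loop. All helpers are
# hypothetical placeholders, not real library APIs.

def reverse_image_search(image_path: str, top_k: int = 3) -> list[str]:
    """Hypothetical: return URLs of pages containing visually similar images."""
    raise NotImplementedError("plug in a real reverse-image-search service")

def fetch_page_text(url: str) -> str:
    """Hypothetical: download and clean the main article text of a page."""
    raise NotImplementedError("plug in a real scraper / reader")

def ask_mllm(image_path: str, prompt: str) -> str:
    """Hypothetical: query a multimodal model with an image plus a text prompt."""
    raise NotImplementedError("plug in a real MLLM API call")

def answer_with_visual_search(image_path: str, question: str) -> str:
    # 1) Seek: locate web pages describing the (possibly post-cutoff) image.
    urls = reverse_image_search(image_path)
    # 2) Retrieve: gather the surrounding article text as fresh context.
    context = "\n\n".join(fetch_page_text(u) for u in urls)
    # 3) Ground: answer with both the image and the retrieved context.
    prompt = (f"Use the retrieved context to answer.\n\nContext:\n{context}\n\n"
              f"Question: {question}")
    return ask_mllm(image_path, prompt)
```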

Efficiently Updating MLLMs with New Visual Knowledge via PEFT

We explored updating MLLMs with new visual knowledge using Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA and DoRA; a minimal adapter-configuration sketch follows the findings below.

  1. Rapid Adaptation: Visual information can be efficiently updated through fine-tuning within only one epoch. Models using direct multiple-choice questions with concise answers (MCQA format) yielded faster and more effective learning during the visual knowledge acquisition phase compared to other formats like QA (Question + Ground Truth) or QAR (Question + Ground Truth + Reasoning).
  2. LoRA Rank Impact: Higher rank LoRA configurations consistently enhanced visual knowledge capabilities, particularly in assimilating recent visual entities. Models with higher ranks outperformed lower-rank counterparts by an average of 5.4% on the validation subset.
  3. Benefit to General Reasoning: Training on the visually knowledge-intensive LIVEVQA dataset—particularly with straightforward answers and multiple-choice questions—led to a notable 4.2% improvement on the general multimodal reasoning benchmark MMMU.
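As a rough illustration of such an adapter setup, the sketch below assumes the Hugging Face `peft` library; the rank, target-module names, and MCQA field names are illustrative assumptions rather than the exact training configuration used in the paper.

```python
# Minimal sketch (not the authors' training code): attaching a LoRA/DoRA adapter
# to an already-loaded MLLM backbone with Hugging Face `peft`, plus an
# illustrative MCQA-format training sample with a concise answer.
from peft import LoraConfig, get_peft_model

def wrap_with_adapter(model, rank: int = 64, use_dora: bool = False):
    """Attach a LoRA (or DoRA, if use_dora=True) adapter to the language backbone."""
    cfg = LoraConfig(
        r=rank,                      # higher rank = more adapter capacity
        lora_alpha=2 * rank,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module names
        use_dora=use_dora,           # DoRA variant; requires a recent peft release
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, cfg)

# Illustrative MCQA-format sample (question + options + concise answer), the
# format found to be learned fastest; field names are assumed for illustration.
mcqa_sample = {
    "image": "news/example.jpg",
    "question": "Based on the provided image, what is the specific location shown?",
    "choices": ["A. Trafalgar Square", "B. Hyde Park Corner",
                "C. Parliament Square", "D. Leicester Square"],
    "answer": "C",
}
```

A larger `rank` enlarges the adapter and, per the findings above, helps assimilate recent visual entities, at the cost of more trainable parameters.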

Knowledge Updating Presents Trade-offs: Enhanced Reasoning vs. Degraded Perception

While PEFT methods allow for efficient incorporation of new visual facts, this process is not without its challenges and trade-offs.

A consistent observation was the degradation in the model's foundational visual perception capabilities (as measured by the MMStar benchmark) after undergoing intensive visual knowledge updates, regardless of rank, training steps, or data formats. For example, models trained using the simple QA format exhibited a performance drop on MMStar from 65.80% to 58.16%. This suggests an inherent conflict between enhancing specific visual knowledge through intensive updates and preserving the model's broader visual perception abilities.

Model Scale Correlates with Performance, but Calibration Remains a Challenge

The benchmark results highlighted several aspects regarding model characteristics:

  1. Larger Models Tend to Perform Better: For models sharing the same knowledge cutoff (e.g., the GPT-4.1 family), increased model size generally correlated with improved accuracy on LIVEVQA tasks across all difficulty levels. Proprietary models also typically maintained an advantage over open-source counterparts.
  2. Overconfidence and Calibration Issues: Stated confidence correlated positively with accuracy across models, but calibration remained poor: all evaluated MLLMs showed a consistent pattern of overconfidence in their visual factuality assessments, with performance falling well below the ideal calibration line. While larger models such as GPT-4.1 were comparatively better calibrated than their smaller variants, substantial room remains for improving MLLM calibration when models encounter unknown visual knowledge (a calibration-measurement sketch follows this list).
  3. Level 2 Questions Prove More Difficult: As anticipated, Level 2 questions, which require deeper cross-modal reasoning, generally resulted in significantly lower performance for models compared to Level 1 (visual entity recognition) questions across most data subsets (News, Video).
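To make the notion of calibration concrete, here is a minimal sketch of the standard expected-calibration-error computation over (stated confidence, correctness) pairs; the binning scheme follows the usual reliability-diagram recipe and is not necessarily the paper's exact protocol.

```python
# Minimal sketch: expected calibration error (ECE) from a model's stated
# confidences and per-question correctness. Overconfidence shows up as
# average confidence exceeding average accuracy within a bin.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()   # where the model thinks it is
        avg_acc = correct[mask].mean()        # where it actually is
        ece += mask.mean() * abs(avg_conf - avg_acc)
    return ece  # 0 = perfectly calibrated

# Example: a model that claims 0.9 confidence but is right only half the time.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))  # ~0.4
```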

BibTeX

@article{fu2025livevqa,
  title={LiveVQA: Live Visual Knowledge Seeking},
  author={Fu, Mingyang and Peng, Yuyang and Liu, Benlin and Wan, Yao and Chen, Dongping},
  journal={arXiv preprint arXiv:2504.05288},
  year={2025}
}

LIVEVQA Team