Abstract
Real-world visual diagnosis is a multi-step process. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. Yet current vision-language models are evaluated almost exclusively on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain-of-Inquiry (CoI) framework that models diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,964 expert-curated plant images and 138,078 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations of top-tier Multimodal Large Language Models (MLLMs) reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency.
At a Glance
Key Idea
Unlike static QA datasets that ask generic questions regardless of disease status, our framework aligns the epistemic intent of the inquiry with visual severity. As disease progresses—from mild to severe—the cognitive task transitions from Diagnosis (ambiguity resolution) to Prognosis (future forecasting) and Management (action planning), ensuring questions are always contextually relevant to the visual evidence.
Figure 1. The questioning focus evolves with disease progression in Sunflower (Alternaria Leaf Spot). The cognitive task transitions from Diagnosis in early stages to Prognosis and Management in advanced stages.
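The severity-to-intent alignment above can be sketched as a simple lookup. The one-to-one mapping below is an illustrative assumption (the framework lets intents overlap as disease advances), and the function names are hypothetical, not the benchmark's implementation.

```python
# Hypothetical sketch: severity-conditioned selection of epistemic intent.
# Severity and intent labels come from the paper; the exact mapping is assumed.

SEVERITY_TO_INTENT = {
    "Mild": "Diagnosis",      # ambiguity resolution: what is this?
    "Moderate": "Prognosis",  # future forecasting: how will it progress?
    "Severe": "Management",   # action planning: what should be done?
}

def select_intent(severity: str) -> str:
    """Pick the primary epistemic intent for a given severity label."""
    return SEVERITY_TO_INTENT[severity]

print(select_intent("Mild"))    # -> Diagnosis
print(select_intent("Severe"))  # -> Management
```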
Contributions
- The PlantInquiryVQA Benchmark: A large-scale dataset of 25K manually curated images across diverse crop species, annotated with expert-verified visual cue descriptions and domain-specific knowledge bases.
- The Chain-of-Inquiry (CoI) Framework: A novel reasoning taxonomy that classifies 12 unique reasoning templates into 7 distinct cognitive categories (including Etiological Reasoning, Differential Diagnosis, and Counterfactual Analysis).
- Diagnostic Reasoning Evaluation: A comprehensive evaluation of 18 closed- and open-source MLLMs showing that question-guided protocols significantly reduce hallucination and improve diagnostic correctness.
Methodology
Our pipeline is divided into three phases: (1) extracting grounded visual cues using a VLM guided by expert schemas, (2) structuring botanical knowledge to map disease severity to diagnostic intent, and (3) dynamically generating QA pairs with an LLM that injects specific reasoning modules based on intent and visual evidence.
Figure 2. Overall methodology pipeline for PlantInquiryVQA CoI dataset generation. Visual Cue Extraction → CoI Classification → Automated QA Generation.
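The three phases can be sketched as a chain of stages. Every function name, data shape, and returned value below is an illustrative assumption, not the actual generation code.

```python
# Hypothetical sketch of the three-phase dataset-generation pipeline.

def extract_visual_cues(image, expert_schema):
    """Phase 1: a VLM, guided by an expert schema, returns grounded cues.
    Assumed output shape: list of (cue, image region) pairs."""
    return [("brown concentric lesion", "upper-left leaf margin")]

def map_to_intent(cues, knowledge_base):
    """Phase 2: structured botanical knowledge maps severity to intent.
    The severity-to-intent dictionary here is an assumed simplification."""
    severity = knowledge_base.get("severity", "Moderate")
    return {"Mild": "Diagnosis", "Moderate": "Prognosis",
            "Severe": "Management"}[severity]

def generate_qa(cues, intent):
    """Phase 3: an LLM instantiates reasoning templates for the intent."""
    return [{"q": f"Given the {cues[0][0]}, what {intent.lower()} applies?",
             "a": "..."}]

cues = extract_visual_cues(image=None, expert_schema={})
intent = map_to_intent(cues, {"severity": "Severe"})
qa_pairs = generate_qa(cues, intent)
print(intent)  # -> Management
```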
Chain-of-Inquiry Trajectories
We define 12 distinct CoI trajectories across four axes: Health Status (Healthy, Diseased, Senescence, Pest Damaged), Disease Severity (Mild, Moderate, Severe), Instance Variety (Multi-disease, Cross-species), and Epistemic Intent (Diagnosis, Prognosis, Management).
Figure 3. Qualitative examples of 12 distinct CoI trajectories. The framework adapts questioning strategies across health status, disease severity, instance variety, and epistemic intent.
Main Results
We evaluate 17 leading MLLMs on PlantInquiryVQA using a Cumulative Context Test, where each successive question is conditioned on the full history of preceding questions and generated answers.
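The Cumulative Context Test can be sketched as the loop below, where each question sees every earlier turn. The `model.answer` interface and the `EchoModel` stand-in are hypothetical, not the benchmark's actual harness.

```python
# Hypothetical sketch of the Cumulative Context Test protocol.

def cumulative_context_test(model, image, questions):
    """Ask each question conditioned on the full history of prior Q/A turns."""
    history = []  # accumulated (question, answer) pairs
    for q in questions:
        # The prompt carries the image plus every preceding turn.
        context = "\n".join(f"Q: {pq}\nA: {pa}" for pq, pa in history)
        answer = model.answer(image, context=context, question=q)
        history.append((q, answer))
    return history

class EchoModel:  # stand-in model for illustration only
    def answer(self, image, context, question):
        return f"answer conditioned on {len(context)} chars of history"

turns = cumulative_context_test(
    EchoModel(), image=None,
    questions=["What lesions are visible?", "Which disease fits these cues?"])
```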
| Model | F1 | BLEU-4 | R-L | Dis. | Clin. | Safe. | VG | Len. |
|---|---|---|---|---|---|---|---|---|
| Gemini-3-Flash | 0.255 | 0.033 | 0.196 | 0.444 | 0.188 | 0.147 | 0.259 | 85.8 |
| Gemini-2.5-Pro | 0.225 | 0.016 | 0.132 | 0.357 | 0.112 | 0.040 | 0.408 | 142.9 |
| Qwen3-VL-235B | 0.210 | 0.013 | 0.120 | 0.348 | 0.111 | 0.035 | 0.489 | 143.9 |
| Seed-1.6-Flash | 0.226 | 0.022 | 0.139 | 0.344 | 0.120 | 0.075 | 0.394 | 99.1 |
| Llama-3.2-90B-Vision | 0.212 | 0.014 | 0.105 | 0.340 | 0.185 | 0.214 | 0.372 | 134.9 |
| Llama-4-Maverick | 0.212 | 0.013 | 0.103 | 0.329 | 0.175 | 0.202 | 0.397 | 144.5 |
| Gemini-2.5-Flash | 0.226 | 0.018 | 0.145 | 0.299 | 0.098 | 0.046 | 0.392 | 163.5 |
| Qwen3-VL-32B | 0.182 | 0.011 | 0.096 | 0.288 | 0.096 | 0.035 | 0.475 | 227.8 |
| Gemma-3-27B | 0.192 | 0.011 | 0.103 | 0.272 | 0.086 | 0.032 | 0.353 | 156.9 |
| Pixtral-12B | 0.225 | 0.016 | 0.122 | 0.272 | 0.145 | 0.159 | 0.368 | 98.0 |
| Qwen2.5-VL-32B | 0.177 | 0.009 | 0.076 | 0.254 | 0.078 | 0.017 | 0.463 | 260.4 |
| Phi-4-Multimodal | 0.177 | 0.010 | 0.097 | 0.254 | 0.087 | 0.040 | 0.358 | 167.2 |
| Qwen2.5-VL-72B | 0.236 | 0.016 | 0.123 | 0.247 | 0.080 | 0.040 | 0.375 | 106.2 |
| Grok-4.1-Fast | 0.203 | 0.016 | 0.132 | 0.224 | 0.067 | 0.009 | 0.498 | 100.7 |
| Mistral-Medium-3.1 | 0.211 | 0.015 | 0.119 | 0.205 | 0.062 | 0.023 | 0.360 | 110.7 |
| Ministral-8B | 0.180 | 0.010 | 0.094 | 0.197 | 0.060 | 0.020 | 0.394 | 151.8 |
| Ministral-3B | 0.166 | 0.007 | 0.083 | 0.189 | 0.059 | 0.020 | 0.372 | 163.0 |
Table 1. Main results on the test set. F1, BLEU-4, and ROUGE-L (R-L) are lexical metrics; Dis. (diagnosis), Clin. (clinical reasoning), Safe. (safety), VG (visual grounding), and Len. (average response length) assess domain alignment and quality. Gemini-3-Flash leads across lexical and domain metrics. Grok-4.1-Fast achieves the highest Visual Grounding (VG = 0.498), highlighting a trade-off between description accuracy and clinical reasoning.
Key Findings
Does structured inquiry improve diagnosis?
Question-guided inquiry yields significantly higher diagnosis correctness across all severity levels compared to direct diagnosis. Specific questions force models to attend to fine-grained features (lesion margins, halo presence), constraining the search space and reducing hallucination.
Figure 4. Protocol Structure Benefit Test. Both Qwen2.5-VL-32B and Ministral-8B achieve higher diagnosis correctness with question-guided context (~48% improvement).
Figure 5. Reasoning efficiency comparison. Guided (sequential context) improves explainability efficiency for nearly all models.
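The guided vs. direct comparison amounts to two prompt-construction modes, sketched below. The prompt wording and probe questions are illustrative assumptions, not the benchmark's exact templates.

```python
# Hypothetical sketch of the two diagnosis protocols compared in Figure 4.

def direct_prompt(image_desc: str) -> str:
    """Direct diagnosis: a single open-ended question."""
    return f"Image: {image_desc}\nWhat disease does this plant have?"

def guided_prompt(image_desc: str, probes: list) -> str:
    """Question-guided: fine-grained probes precede the diagnosis question,
    steering attention to features like lesion margins and halo presence."""
    steps = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(probes))
    return (f"Image: {image_desc}\nAnswer in order:\n{steps}\n"
            f"{len(probes) + 1}. Given the above, what disease does this plant have?")

probes = ["Describe the lesion margins.", "Is a chlorotic halo present?"]
print(guided_prompt("sunflower leaf with dark concentric spots", probes))
```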
Does CoI promote reasoning efficiency?
In the Guided setting (with dialogue history), capable models stop hedging or repeating basic observations and focus on new, specific visual evidence. For example, Gemini-2.5-Flash improves efficiency from 2.60 to 3.67 (+41%). Smaller models like Nemotron-Nano-12B experience "context distraction," where longer dialogue history interferes with immediate visual processing.
Dataset Distribution
The dataset spans 34 crop species and 116 disease types with fine-grained severity annotations. The distribution mirrors real-world plant pathology: Moderate cases dominate (55.6%), with Mild (20.3%) and Severe (24.1%) capturing early and critical stages respectively.
Figure 6. Hierarchical breakdown of diseased samples from disease category to severity level and crop species.