Abstract
Real-world visual diagnosis is a multi-step process. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. Yet current vision-language models are evaluated almost exclusively on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain-of-Inquiry (CoI) framework that models diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,964 expert-curated plant images and 138,078 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations of top-tier Multimodal Large Language Models (MLLMs) reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency.
At a Glance
Key Idea
Unlike static QA datasets that ask generic questions regardless of disease status, our framework aligns the epistemic intent of the inquiry with visual severity. As disease progresses—from mild to severe—the cognitive task transitions from Diagnosis (ambiguity resolution) to Prognosis (future forecasting) and Management (action planning), ensuring questions are always contextually relevant to the visual evidence.
Figure 1. The questioning focus evolves with disease progression in Sunflower (Alternaria Leaf Spot). The cognitive task transitions from Diagnosis in early stages to Prognosis and Management in advanced stages.
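The severity-to-intent alignment above can be sketched as a simple lookup. The one-to-one mapping below is an illustrative assumption (the framework lets intents overlap as disease advances), and the function names are hypothetical, not the benchmark's implementation.

```python
# Hypothetical sketch: severity-conditioned selection of epistemic intent.
# Severity and intent labels come from the paper; the exact mapping is assumed.

SEVERITY_TO_INTENT = {
    "Mild": "Diagnosis",      # ambiguity resolution: what is this?
    "Moderate": "Prognosis",  # future forecasting: how will it progress?
    "Severe": "Management",   # action planning: what should be done?
}

def select_intent(severity: str) -> str:
    """Pick the primary epistemic intent for a given severity label."""
    return SEVERITY_TO_INTENT[severity]

print(select_intent("Mild"))    # -> Diagnosis
print(select_intent("Severe"))  # -> Management
```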
Contributions
- The PlantInquiryVQA Benchmark: A large-scale dataset of 25K manually curated images across diverse crop species, annotated with expert-verified visual cue descriptions and domain-specific knowledge bases.
- The Chain-of-Inquiry (CoI) Framework: A novel reasoning taxonomy that classifies 12 unique reasoning templates into 7 distinct cognitive categories (including Etiological Reasoning, Differential Diagnosis, and Counterfactual Analysis).
- Diagnostic Reasoning Evaluation: A comprehensive evaluation of 18 closed- and open-source MLLMs showing that question-guided protocols significantly reduce hallucination and improve diagnostic correctness.
Methodology
Our pipeline is divided into three phases: (1) extracting grounded visual cues using a VLM guided by expert schemas, (2) structuring botanical knowledge to map disease severity to diagnostic intent, and (3) dynamically generating QA pairs with an LLM that injects specific reasoning modules based on intent and visual evidence.
Figure 2. Overall methodology pipeline for PlantInquiryVQA CoI dataset generation. Visual Cue Extraction → CoI Classification → Automated QA Generation.
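The three phases can be sketched as a chain of stages. Every function name, data shape, and returned value below is an illustrative assumption, not the actual generation code.

```python
# Hypothetical sketch of the three-phase dataset-generation pipeline.

def extract_visual_cues(image, expert_schema):
    """Phase 1: a VLM, guided by an expert schema, returns grounded cues.
    Assumed output shape: list of (cue, image region) pairs."""
    return [("brown concentric lesion", "upper-left leaf margin")]

def map_to_intent(cues, knowledge_base):
    """Phase 2: structured botanical knowledge maps severity to intent.
    The severity-to-intent dictionary here is an assumed simplification."""
    severity = knowledge_base.get("severity", "Moderate")
    return {"Mild": "Diagnosis", "Moderate": "Prognosis",
            "Severe": "Management"}[severity]

def generate_qa(cues, intent):
    """Phase 3: an LLM instantiates reasoning templates for the intent."""
    return [{"q": f"Given the {cues[0][0]}, what {intent.lower()} applies?",
             "a": "..."}]

cues = extract_visual_cues(image=None, expert_schema={})
intent = map_to_intent(cues, {"severity": "Severe"})
qa_pairs = generate_qa(cues, intent)
print(intent)  # -> Management
```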
Chain-of-Inquiry Trajectories
We define 12 distinct CoI trajectories across four axes: Health Status (Healthy, Diseased, Senescence, Pest Damaged), Disease Severity (Mild, Moderate, Severe), Instance Variety (Multi-disease, Cross-species), and Epistemic Intent (Diagnosis, Prognosis, Management).
Figure 3. Qualitative examples of 12 distinct CoI trajectories. The framework adapts questioning strategies across health status, disease severity, instance variety, and epistemic intent.
Main Results
We evaluate 17 leading MLLMs on PlantInquiryVQA using a Cumulative Context Test, where each successive question is conditioned on the full history of preceding questions and generated answers.
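The Cumulative Context Test can be sketched as the loop below, where each question sees every earlier turn. The `model.answer` interface and the `EchoModel` stand-in are hypothetical, not the benchmark's actual harness.

```python
# Hypothetical sketch of the Cumulative Context Test protocol.

def cumulative_context_test(model, image, questions):
    """Ask each question conditioned on the full history of prior Q/A turns."""
    history = []  # accumulated (question, answer) pairs
    for q in questions:
        # The prompt carries the image plus every preceding turn.
        context = "\n".join(f"Q: {pq}\nA: {pa}" for pq, pa in history)
        answer = model.answer(image, context=context, question=q)
        history.append((q, answer))
    return history

class EchoModel:  # stand-in model for illustration only
    def answer(self, image, context, question):
        return f"answer conditioned on {len(context)} chars of history"

turns = cumulative_context_test(
    EchoModel(), image=None,
    questions=["What lesions are visible?", "Which disease fits these cues?"])
```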
| Model | F1 | BLEU-4 | R-L | Dis. | Clin. | Safe. | VG | Len. |
|---|---|---|---|---|---|---|---|---|
| Gemini-3-Flash | 0.255 | 0.033 | 0.196 | 0.444 | 0.188 | 0.147 | 0.259 | 85.8 |
| Gemini-2.5-Pro | 0.225 | 0.016 | 0.132 | 0.357 | 0.112 | 0.040 | 0.408 | 142.9 |
| Qwen3-VL-235B | 0.210 | 0.013 | 0.120 | 0.348 | 0.111 | 0.035 | 0.489 | 143.9 |
| Seed-1.6-Flash | 0.226 | 0.022 | 0.139 | 0.344 | 0.120 | 0.075 | 0.394 | 99.1 |
| Llama-3.2-90B-Vision | 0.212 | 0.014 | 0.105 | 0.340 | 0.185 | 0.214 | 0.372 | 134.9 |
| Llama-4-Maverick | 0.212 | 0.013 | 0.103 | 0.329 | 0.175 | 0.202 | 0.397 | 144.5 |
| Gemini-2.5-Flash | 0.226 | 0.018 | 0.145 | 0.299 | 0.098 | 0.046 | 0.392 | 163.5 |
| Qwen3-VL-32B | 0.182 | 0.011 | 0.096 | 0.288 | 0.096 | 0.035 | 0.475 | 227.8 |
| Gemma-3-27B | 0.192 | 0.011 | 0.103 | 0.272 | 0.086 | 0.032 | 0.353 | 156.9 |
| Pixtral-12B | 0.225 | 0.016 | 0.122 | 0.272 | 0.145 | 0.159 | 0.368 | 98.0 |
| Qwen2.5-VL-32B | 0.177 | 0.009 | 0.076 | 0.254 | 0.078 | 0.017 | 0.463 | 260.4 |
| Phi-4-Multimodal | 0.177 | 0.010 | 0.097 | 0.254 | 0.087 | 0.040 | 0.358 | 167.2 |
| Qwen2.5-VL-72B | 0.236 | 0.016 | 0.123 | 0.247 | 0.080 | 0.040 | 0.375 | 106.2 |
| Grok-4.1-Fast | 0.203 | 0.016 | 0.132 | 0.224 | 0.067 | 0.009 | 0.498 | 100.7 |
| Mistral-Medium-3.1 | 0.211 | 0.015 | 0.119 | 0.205 | 0.062 | 0.023 | 0.360 | 110.7 |
| Ministral-8B | 0.180 | 0.010 | 0.094 | 0.197 | 0.060 | 0.020 | 0.394 | 151.8 |
| Ministral-3B | 0.166 | 0.007 | 0.083 | 0.189 | 0.059 | 0.020 | 0.372 | 163.0 |
Table 1. Main results on the test set. F1, BLEU-4, and ROUGE-L (R-L) are lexical metrics; Dis. (diagnosis), Clin. (clinical reasoning), Safe. (safety), VG (visual grounding), and Len. (average response length) assess domain alignment and quality. Gemini-3-Flash leads across lexical and domain metrics. Grok-4.1-Fast achieves the highest Visual Grounding (VG = 0.498), highlighting a trade-off between description accuracy and clinical reasoning.
Key Findings
Does structured inquiry improve diagnosis?
Question-guided inquiry yields significantly higher diagnosis correctness across all severity levels compared to direct diagnosis. Specific questions force models to attend to fine-grained features (lesion margins, halo presence), constraining the search space and reducing hallucination.
Figure 4. Protocol Structure Benefit Test. Both Qwen2.5-VL-32B and Ministral-8B achieve higher diagnosis correctness with question-guided context (~48% improvement).
Figure 5. Reasoning efficiency comparison. Guided (sequential context) improves explainability efficiency for nearly all models.
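The guided vs. direct comparison amounts to two prompt-construction modes, sketched below. The prompt wording and probe questions are illustrative assumptions, not the benchmark's exact templates.

```python
# Hypothetical sketch of the two diagnosis protocols compared in Figure 4.

def direct_prompt(image_desc: str) -> str:
    """Direct diagnosis: a single open-ended question."""
    return f"Image: {image_desc}\nWhat disease does this plant have?"

def guided_prompt(image_desc: str, probes: list) -> str:
    """Question-guided: fine-grained probes precede the diagnosis question,
    steering attention to features like lesion margins and halo presence."""
    steps = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(probes))
    return (f"Image: {image_desc}\nAnswer in order:\n{steps}\n"
            f"{len(probes) + 1}. Given the above, what disease does this plant have?")

probes = ["Describe the lesion margins.", "Is a chlorotic halo present?"]
print(guided_prompt("sunflower leaf with dark concentric spots", probes))
```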
Does CoI promote reasoning efficiency?
In the Guided setting (with dialogue history), capable models stop hedging or repeating basic observations and focus on new, specific visual evidence. For example, Gemini-2.5-Flash improves efficiency from 2.60 to 3.67 (+41%). Smaller models like Nemotron-Nano-12B experience "context distraction," where longer dialogue history interferes with immediate visual processing.
Dataset Distribution
The dataset spans 34 crop species and 116 disease types with fine-grained severity annotations. The distribution mirrors real-world plant pathology: Moderate cases dominate (55.6%), with Mild (20.3%) and Severe (24.1%) capturing early and critical stages respectively.
Figure 6. Hierarchical breakdown of diseased samples from disease category to severity level and crop species.