Thinking Like a Botanist:
Challenging Multimodal Language Models
with Intent-Driven Chain-of-Inquiry

Syed Nazmus Sakib1,*    Nafiul Haque1,*    Shahrear Bin Amin1    Hasan Muhammad Abdullah2,
Md Mehedi Hasan1    Mohammad Zabed Hossain3    Shifat E. Arman1,†

1Dept. of Robotics & Mechatronics Engineering, University of Dhaka    2Dept. of Agronomy, BSMRAU    3Dept. of Botany, University of Dhaka
*Equal Contribution    †Corresponding Author

Accepted at ACL 2026 Findings

arXiv preprint, code, and dataset coming soon.


Abstract

Expert visual assessment rarely happens in a single step. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain-of-Inquiry (CoI) framework that models diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,964 expert-curated plant images and 138,078 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations of leading Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency.

At a Glance

24,964 Expert-Curated Images
138K Question-Answer Pairs
34 Crop Species
116 Disease Types
12 CoI Trajectories
18 MLLMs Benchmarked

Key Idea

Unlike static QA datasets that ask generic questions regardless of disease status, our framework aligns the epistemic intent of the inquiry with visual severity. As disease progresses—from mild to severe—the cognitive task transitions from Diagnosis (ambiguity resolution) to Prognosis (future forecasting) and Management (action planning), ensuring questions are always contextually relevant to the visual evidence.
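A minimal sketch of this severity-to-intent alignment (labels come from the text; the exact discretization below is our illustrative reading, not the dataset's precise assignment rule):

```python
# Illustrative severity-to-intent alignment: one plausible discretization of
# the progression described above, not the dataset's exact assignment rule.
SEVERITY_TO_INTENT = {
    "mild":     "diagnosis",   # ambiguity resolution: what is this?
    "moderate": "prognosis",   # future forecasting: how will it progress?
    "severe":   "management",  # action planning: what should be done now?
}

def epistemic_intent(severity: str) -> str:
    """Map a visual severity label to the dominant epistemic intent."""
    return SEVERITY_TO_INTENT[severity.lower()]
```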


Figure 1. The questioning focus evolves with disease progression in Sunflower (Alternaria Leaf Spot). The cognitive task transitions from Diagnosis in early stages to Prognosis and Management in advanced stages.


Contributions

- A Chain-of-Inquiry (CoI) framework that models diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent.
- PlantInquiryVQA, a dataset of 24,964 expert-curated plant images and 138,078 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates.
- A benchmark of leading MLLMs showing that structured, question-guided inquiry improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency.

Methodology

Our pipeline is divided into three phases: (1) extracting grounded visual cues with a VLM guided by expert schemas, (2) structuring botanical knowledge to map disease severity to diagnostic intent, and (3) dynamically generating question-answer pairs with an LLM that injects specific reasoning modules based on intent and visual evidence.
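The sketch below mirrors these three phases; all names and signatures are hypothetical stand-ins, not the released implementation:

```python
# Hypothetical sketch of the three-phase generation pipeline; names and
# signatures are illustrative, not the authors' released code.
from dataclasses import dataclass
from typing import Callable

@dataclass
class VisualCues:
    species: str
    symptoms: list[str]   # e.g. ["concentric lesions", "chlorotic halo"]
    severity: str         # "mild" | "moderate" | "severe"

def run_pipeline(
    image: bytes,
    expert_schema: dict,
    vlm_extract: Callable[[bytes, dict], VisualCues],       # Phase 1
    intent_of: Callable[[VisualCues], str],                 # Phase 2
    llm_generate: Callable[[VisualCues, str], list[dict]],  # Phase 3
) -> list[dict]:
    """Chain the phases: grounded cues -> epistemic intent -> ordered QA pairs."""
    cues = vlm_extract(image, expert_schema)   # schema-guided VLM extraction
    intent = intent_of(cues)                   # severity -> diagnostic intent
    return llm_generate(cues, intent)          # intent-conditioned QA chain
```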


Figure 2. Overall methodology pipeline for PlantInquiryVQA CoI dataset generation. Visual Cue Extraction → CoI Classification → Automated QA Generation.


Chain-of-Inquiry Trajectories

We define 12 distinct CoI trajectories across four axes: Health Status (Healthy, Diseased, Senescence, Pest Damaged), Disease Severity (Mild, Moderate, Severe), Instance Variety (Multi-disease, Cross-species), and Epistemic Intent (Diagnosis, Prognosis, Management).
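Read as data, the four axes are simple enumerations (values taken from the text); note that 12 trajectories is a curated subset of the full 4 × 3 × 2 × 3 product space, and the paper's selection rule is not reproduced in this sketch:

```python
# The four trajectory axes, with values taken from the text. The 12 CoI
# trajectories are a curated subset of combinations, not the full 72-way
# Cartesian product (e.g., severity applies only to diseased leaves).
HEALTH_STATUS    = ("healthy", "diseased", "senescence", "pest_damaged")
DISEASE_SEVERITY = ("mild", "moderate", "severe")
INSTANCE_VARIETY = ("multi_disease", "cross_species")
EPISTEMIC_INTENT = ("diagnosis", "prognosis", "management")
```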


Figure 3. Qualitative examples of 12 distinct CoI trajectories. The framework adapts questioning strategies across health status, disease severity, instance variety, and epistemic intent.


Main Results

We evaluate 17 leading MLLMs on PlantInquiryVQA using a Cumulative Context Test, where each successive question is conditioned on the full history of preceding questions and generated answers.
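A minimal sketch of this protocol, assuming a generic chat-style `model` callable (the message format below is an assumption, not a specific vendor API):

```python
# Minimal sketch of the Cumulative Context Test: question k is asked alongside
# the full transcript of questions 1..k-1 and the model's own earlier answers.
# `model` is an assumed chat-style callable, not a specific vendor API.
def cumulative_context_test(model, image, questions):
    history = []
    answers = []
    for question in questions:
        turn = {"role": "user", "content": {"image": image, "text": question}}
        answer = model(history + [turn])   # conditioned on all prior turns
        history += [turn, {"role": "assistant", "content": answer}]
        answers.append(answer)
    return answers
```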

Lexical metrics: F1, BLEU-4, and ROUGE-L (R-L). Domain alignment & quality: diagnosis (Dis.), clinical reasoning (Clin.), safety (Safe.), visual grounding (VG), and response length (Len.).

| Model                | F1    | BLEU-4 | R-L   | Dis.  | Clin. | Safe. | VG    | Len.  |
|----------------------|-------|--------|-------|-------|-------|-------|-------|-------|
| Gemini-3-Flash       | 0.255 | 0.033  | 0.196 | 0.444 | 0.188 | 0.147 | 0.259 | 85.8  |
| Gemini-2.5-Pro       | 0.225 | 0.016  | 0.132 | 0.357 | 0.112 | 0.040 | 0.408 | 142.9 |
| Qwen3-VL-235B        | 0.210 | 0.013  | 0.120 | 0.348 | 0.111 | 0.035 | 0.489 | 143.9 |
| Seed-1.6-Flash       | 0.226 | 0.022  | 0.139 | 0.344 | 0.120 | 0.075 | 0.394 | 99.1  |
| Llama-3.2-90B-Vision | 0.212 | 0.014  | 0.105 | 0.340 | 0.185 | 0.214 | 0.372 | 134.9 |
| Llama-4-Maverick     | 0.212 | 0.013  | 0.103 | 0.329 | 0.175 | 0.202 | 0.397 | 144.5 |
| Gemini-2.5-Flash     | 0.226 | 0.018  | 0.145 | 0.299 | 0.098 | 0.046 | 0.392 | 163.5 |
| Qwen3-VL-32B         | 0.182 | 0.011  | 0.096 | 0.288 | 0.096 | 0.035 | 0.475 | 227.8 |
| Gemma-3-27B          | 0.192 | 0.011  | 0.103 | 0.272 | 0.086 | 0.032 | 0.353 | 156.9 |
| Pixtral-12B          | 0.225 | 0.016  | 0.122 | 0.272 | 0.145 | 0.159 | 0.368 | 98.0  |
| Qwen2.5-VL-32B       | 0.177 | 0.009  | 0.076 | 0.254 | 0.078 | 0.017 | 0.463 | 260.4 |
| Phi-4-Multimodal     | 0.177 | 0.010  | 0.097 | 0.254 | 0.087 | 0.040 | 0.358 | 167.2 |
| Qwen2.5-VL-72B       | 0.236 | 0.016  | 0.123 | 0.247 | 0.080 | 0.040 | 0.375 | 106.2 |
| Grok-4.1-Fast        | 0.203 | 0.016  | 0.132 | 0.224 | 0.067 | 0.009 | 0.498 | 100.7 |
| Mistral-Medium-3.1   | 0.211 | 0.015  | 0.119 | 0.205 | 0.062 | 0.023 | 0.360 | 110.7 |
| Ministral-8B         | 0.180 | 0.010  | 0.094 | 0.197 | 0.060 | 0.020 | 0.394 | 151.8 |
| Ministral-3B         | 0.166 | 0.007  | 0.083 | 0.189 | 0.059 | 0.020 | 0.372 | 163.0 |

Table 1. Main results on the test set. Gemini-3-Flash leads across lexical and domain metrics. Grok-4.1-Fast achieves the highest Visual Grounding (VG = 0.498), highlighting a trade-off between description accuracy and clinical reasoning.


Key Findings

Does structured inquiry improve diagnosis?

Question-guided inquiry yields significantly higher diagnosis correctness across all severity levels compared to direct diagnosis. Specific questions force models to attend to fine-grained features (lesion margins, halo presence), constraining the search space and reducing hallucination.
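The two protocols being compared can be sketched as follows (the prompts and the `model(image, text)` interface are illustrative assumptions):

```python
# Sketch of the Protocol Structure Benefit Test conditions; the prompts and
# the `model(image, text)` interface are illustrative assumptions.
def direct_diagnosis(model, image):
    """Baseline: ask for the diagnosis in a single turn."""
    return model(image, "What disease is affecting this leaf?")

def question_guided_diagnosis(model, image, probes):
    """Guided: targeted probes (lesion margins, halo presence, ...) come first;
    the accumulated answers constrain the final diagnosis."""
    transcript = [(q, model(image, q)) for q in probes]
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in transcript)
    return model(image, context + "\nGiven these observations, name the disease.")
```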


Figure 4. Protocol Structure Benefit Test. Both Qwen2.5-VL-32B and Ministral-8B achieve higher diagnosis correctness with question-guided context (~48% improvement).


Figure 5. Reasoning efficiency comparison. Guided (sequential context) improves explainability efficiency for nearly all models.

Does CoI promote reasoning efficiency?

In the Guided setting (with dialogue history), capable models stop hedging or repeating basic observations and focus on new, specific visual evidence. For example, Gemini-2.5-Flash improves efficiency from 2.60 to 3.67 (+41%). Smaller models like Nemotron-Nano-12B experience "context distraction," where longer dialogue history interferes with immediate visual processing.


Dataset Distribution

The dataset spans 34 crop species and 116 disease types with fine-grained severity annotations. The distribution mirrors real-world plant pathology: Moderate cases dominate (55.6%), with Mild (20.3%) and Severe (24.1%) capturing early and critical stages respectively.


Figure 6. Hierarchical breakdown of diseased samples from disease category to severity level and crop species.


Citation

@inproceedings{sakib2026plantinquiryvqa,
  title     = {Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry},
  author    = {Sakib, Syed Nazmus and Haque, Nafiul and Amin, Shahrear Bin and Abdullah, Hasan Muhammad and Hasan, Md Mehedi and Hossain, Mohammad Zabed and Arman, Shifat E.},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
  year      = {2026}
}