Multimodal CNN–Transformer Framework for Explainable Pathogen Identification and Infection Severity Scoring from Microscopy Images

V. Kavitha; M. Nagseshudu; P. V. Kusuma; G. Veera Sankara Reddy; S. Prateep Kumar; D. Mahendra Reddy

doi:10.63001/tbs.2024.v19.i03.pp264-278

Keywords:

Pathogen Identification, Microscopy Image Analysis, Multimodal Deep Learning, CNN–Transformer Hybrid, Infection Severity Scoring, Explainable AI (XAI), Grad-CAM Visualization, Calibration, Domain Generalization, Medical Image Classification

Abstract

Microscopic examination remains a cornerstone of infectious disease diagnosis, yet it is constrained by inter-observer variability, limited scalability, and subjective interpretation. To address these limitations, we propose a multimodal CNN–Transformer framework for automated pathogen species classification and infection severity scoring from stained microscopy images. The framework combines local texture extraction (CNN), global contextual reasoning (Vision Transformer), and metadata-aware feature fusion. Metadata conditioning based on feature-wise linear modulation (FiLM) improves cross-domain generalization, and multi-task learning jointly optimizes the categorical (species) and ordinal (severity) objectives. A calibration module based on temperature scaling improves prediction reliability, while Grad-CAM visualizations provide transparent, clinically interpretable localization of infected regions. Evaluated on 23,700 images from bacterial, fungal, and parasitic datasets collected across four laboratories, the proposed model achieved 96.2% accuracy, a macro-F1 of 0.937, and a quadratic weighted kappa (QWK) of 0.84, surpassing both the CNN-only and Transformer-only baselines. Cross-site experiments show an accuracy drop of less than 2.5%, indicating robust generalization, and explainability analysis shows more than 92% overlap with expert annotations. These results demonstrate the feasibility of explainable, calibration-aware AI for reliable point-of-care pathogen diagnostics in resource-constrained clinical environments.
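
The abstract names three standard building blocks without giving their formulas: FiLM conditioning, temperature scaling, and the quadratic weighted kappa (QWK) used to score ordinal severity predictions. The sketch below illustrates the textbook definitions of each in plain Python. It is not the authors' implementation: the function names, argument shapes, and the choice of plain lists are assumptions made for illustration, and in the paper the FiLM parameters and the temperature would be learned rather than passed in by hand.

```python
import math
from typing import List, Sequence


def film(features: Sequence[float], gamma: Sequence[float],
         beta: Sequence[float]) -> List[float]:
    """FiLM: per-channel affine modulation, gamma * x + beta.

    In a metadata-conditioned model, gamma and beta would be predicted
    from the metadata; here they are given directly (assumption).
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma and beta must have equal length")
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def softmax_with_temperature(logits: Sequence[float],
                             temperature: float = 1.0) -> List[float]:
    """Temperature scaling: divide logits by T > 0 before the softmax.

    T > 1 softens the distribution, T < 1 sharpens it; the predicted
    class (argmax) is unchanged.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def quadratic_weighted_kappa(y_true: Sequence[int], y_pred: Sequence[int],
                             num_grades: int) -> float:
    """QWK for ordinal grades encoded as integers 0 .. num_grades - 1."""
    n = len(y_true)
    if n == 0 or n != len(y_pred) or num_grades < 2:
        raise ValueError("need equal-length, non-empty inputs and >= 2 grades")

    # Observed confusion matrix: rows are true grades, columns predictions.
    observed = [[0] * num_grades for _ in range(num_grades)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(num_grades))
           for j in range(num_grades)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_grades):
        for j in range(num_grades):
            # Quadratic penalty grows with the distance between grades.
            w = (i - j) ** 2 / (num_grades - 1) ** 2
            weighted_observed += w * observed[i][j]
            # Expected count under chance agreement with the same marginals.
            weighted_expected += w * row[i] * col[j] / n
    if weighted_expected == 0:
        raise ValueError("QWK is undefined when all labels share one grade")
    return 1.0 - weighted_observed / weighted_expected
```

QWK equals 1 for perfect agreement and falls toward 0 as agreement approaches chance level; because the penalty is quadratic in the grade distance, a prediction two severity grades away from the reference costs four times as much as one that is a single grade away.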