Multimodal CNN–Transformer Framework for Explainable Pathogen Identification and Infection Severity Scoring from Microscopy Images
DOI:
https://doi.org/10.63001/tbs.2024.v19.i03.pp264-278

Keywords:
Pathogen Identification, Microscopy Image Analysis, Multimodal Deep Learning, CNN–Transformer Hybrid, Infection Severity Scoring, Explainable AI (XAI), Grad-CAM Visualization, Calibration, Domain Generalization, Medical Image Classification

Abstract
Microscopic examination remains a cornerstone of infectious disease diagnosis, yet it is constrained by inter-observer variability, limited scalability, and subjective interpretation. To overcome these challenges, we propose a multimodal CNN–Transformer framework that integrates local texture extraction (CNN), global contextual reasoning (Vision Transformer), and metadata-aware feature fusion for automated pathogen species classification and infection severity scoring from stained microscopy images. The framework employs FiLM-based metadata conditioning to improve cross-domain generalization and multi-task learning to jointly optimize the categorical (species) and ordinal (severity) objectives. A calibration module based on temperature scaling improves the reliability of predicted probabilities, while Grad-CAM visualizations provide transparent, clinically interpretable localization of infected regions. Evaluated on 23,700 images from bacterial, fungal, and parasitic datasets collected across four laboratories, the proposed model achieved 96.2% accuracy, a macro-F1 of 0.937, and a quadratic weighted kappa (QWK) of 0.84, surpassing both CNN-only and Transformer-only baselines. Cross-site experiments confirm robust generalization with an accuracy drop of under 2.5%, and explainability analysis shows over 92% overlap with expert annotations. These results demonstrate the feasibility of explainable, calibration-aware AI for reliable, point-of-care pathogen diagnostics in resource-constrained clinical environments.
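The abstract's calibration module uses temperature scaling, which divides a model's logits by a single scalar T (fit on a held-out validation set, typically by minimizing negative log-likelihood) before the softmax. A minimal sketch of that step is below; the logit values and the fitted temperature T = 2.0 are hypothetical illustrations, not values from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature: T > 1 softens overconfident probabilities,
    T < 1 sharpens them. T = 1 is the uncalibrated model output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-class pathogen prediction (e.g. bacterial /
# fungal / parasitic); T = 2.0 stands in for a temperature fit on validation NLL.
logits = [4.0, 1.0, 0.5]
p_raw = softmax(logits)                   # uncalibrated, likely overconfident
p_cal = softmax(logits, temperature=2.0)  # calibrated: same argmax, lower peak
```

Note that scaling all logits by the same T never changes the predicted class, only the confidence assigned to it, which is why temperature scaling leaves accuracy untouched while improving reliability metrics such as expected calibration error.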