Abstract:
Recognition using Deep Convolutional Neural Networks (DCNN) performs well in plant disease diagnosis. However, the recognition accuracy and robustness of a model depend mainly on the scale of the training dataset and the quality of annotation. In particular, DCNN recognition models use only image-modality data, so it is necessary to build a large-scale disease dataset with high-quality annotation in open environments. Nevertheless, the huge economic and technical costs have limited the adoption of DCNN recognition in practical applications. Inspired by multimodal learning, a crop disease recognition model for open environments, called bimodalNet, was proposed using flexible image-text bimodal joint representation learning. Text information associated with the images was brought into the disease recognition task. The input of the network was an image-text pair composed of a disease image and its description text. The disease images were cropped and resized to 224×224 pixels. The disease description text was in Chinese. A series of operations was performed for the text embedding, including normalization, word segmentation, word-list construction, and text vectorization. The network structure consisted of two branches: an image branch and a text branch. A CNN was used in the image branch to extract disease features from images, whereas a recurrent neural network was used in the text branch to learn disease features from the description text. The correlation and complementarity between the two modalities of disease information were utilized, thereby realizing the joint representation learning of disease features in bimodalNet. The outputs of the image and text branches were added to form the fused output of the network. Different image and text classifiers were used to meet the needs of various tasks, and the best combination for disease feature extraction was selected from six networks in the experiments.
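The text-embedding steps named above (normalization, word segmentation, word-list construction, and text vectorization) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: character-level segmentation stands in for a real Chinese word segmenter, and all function names, reserved tokens, and the `max_len` parameter are assumptions.

```python
def normalize(text):
    """Normalization step: strip surrounding whitespace and lowercase."""
    return text.strip().lower()

def segment(text):
    """Word segmentation step; character-level split as a stand-in for a
    Chinese tokenizer (the actual segmenter used is not specified)."""
    return list(text)

def build_vocab(corpus, pad="<pad>", unk="<unk>"):
    """Word-list construction: map each token to an integer id,
    reserving ids 0 and 1 for padding and unknown tokens."""
    vocab = {pad: 0, unk: 1}
    for text in corpus:
        for tok in segment(normalize(text)):
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(text, vocab, max_len=8):
    """Text vectorization: convert a description into a fixed-length
    sequence of token ids, truncating or padding to max_len."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in segment(normalize(text))]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

corpus = ["leaf spot", "rust"]
vocab = build_vocab(corpus)
vec = vectorize("rust", vocab, max_len=6)
print(vec)
```

The resulting id sequences would then be fed through an embedding layer into the recurrent text branch.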
The experimental dataset consisted of 1 834 disease image-text pairs. The disease images were all taken in the field environment, and each image was given a text description. The dataset was divided into training, validation, and test sets according to a ratio of 7:2:1. The bimodalNet achieved better performance than training with either the image modality or the text modality alone. The accuracy, sensitivity, specificity, and F1 value of the optimal model on the test dataset were 99.47%, 98.51%, 98.61%, 99.68%, and 98.51%, respectively. Experimental results showed that bimodalNet significantly improved the accuracy of crop disease recognition in an open environment, while reducing the technical and economic costs of acquiring a large number of disease images. The reason was that the complementary features between the image and text modalities were fully utilized: the image modality was used to correct erroneous text descriptions, whereas the text modality was used to correct fuzzy features in the image modality. Since the image and text branches can be replaced with any excellent CNN and recurrent neural network structures, bimodalNet can be expected to serve as a general framework in actual use, indicating high flexibility of the network structure. The findings can demonstrate the feasibility of disease identification using multimodal learning and few samples in an open environment.
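The late-fusion step described in the abstract (adding the outputs of the image and text branches to form the network output) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-class scores, class count, and helper names are invented, and real branches would be a CNN and an RNN rather than fixed lists.

```python
import math

def softmax(scores):
    """Convert raw per-class scores into probabilities."""
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(image_logits, text_logits):
    """Element-wise addition of the two branches' outputs,
    as the abstract describes for bimodalNet's fusion."""
    return [i + t for i, t in zip(image_logits, text_logits)]

# Hypothetical scores: the image branch is ambiguous between classes 0 and 1,
# while the text branch disambiguates in favour of class 1, illustrating the
# complementarity between the two modalities.
image_logits = [2.0, 2.1, 0.1]
text_logits = [0.2, 1.5, 0.3]
probs = softmax(fuse(image_logits, text_logits))
predicted = probs.index(max(probs))
print(predicted)
```

The same addition-then-classify pattern applies regardless of which CNN or recurrent architecture is plugged into each branch, which is what gives the framework its flexibility.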