Abstract:
Recognition using Deep Convolutional Neural Networks (DCNN) performs well in plant disease diagnosis. However, the recognition accuracy and robustness of a model depend mainly on the scale of the training dataset and the quality of annotation. In particular, DCNN recognition models use only image-modality data, so it is necessary to build a large-scale disease dataset with high-quality annotation in open environments. Nevertheless, the huge economic and technical costs have limited the adoption of DCNN recognition in practical applications. Inspired by multimodal learning, a crop disease recognition model for open environments, called bimodalNet, was proposed using flexible image-text bimodal joint representation learning. Text information associated with the images was brought into the disease recognition task. The input of the network was an image-text pair composed of a disease image and its description text. The disease images were cropped and resized to 224×224 pixels. The disease description text was in Chinese. A series of operations was performed for the text embedding, including normalization, word segmentation, word-list construction, and text vectorization. The network structure consisted of two branches: an image branch and a text branch. A CNN was used in the image branch to extract disease features from images, whereas a recurrent neural network was used in the text branch to learn disease features from the description text. The correlation and complementarity between the two modalities of disease information were utilized, thereby realizing the joint representation learning of disease features in bimodalNet. The outputs of the image and text branches were added to form the fused output of the network. Different image and text classifiers were used to meet the needs of various tasks, and the best combination for disease feature extraction was selected from six networks in the experiments.
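The text-embedding steps named above (normalization, word segmentation, word-list construction, and text vectorization) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: character-level segmentation stands in for a real Chinese word segmenter, and all function names, reserved tokens, and the `max_len` parameter are assumptions.

```python
def normalize(text):
    """Normalization step: strip surrounding whitespace and lowercase."""
    return text.strip().lower()

def segment(text):
    """Word segmentation step; character-level split as a stand-in for a
    Chinese tokenizer (the actual segmenter used is not specified)."""
    return list(text)

def build_vocab(corpus, pad="<pad>", unk="<unk>"):
    """Word-list construction: map each token to an integer id,
    reserving ids 0 and 1 for padding and unknown tokens."""
    vocab = {pad: 0, unk: 1}
    for text in corpus:
        for tok in segment(normalize(text)):
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(text, vocab, max_len=8):
    """Text vectorization: convert a description into a fixed-length
    sequence of token ids, truncating or padding to max_len."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in segment(normalize(text))]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

corpus = ["leaf spot", "rust"]
vocab = build_vocab(corpus)
vec = vectorize("rust", vocab, max_len=6)
print(vec)
```

The resulting id sequences would then be fed through an embedding layer into the recurrent text branch.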
The experimental dataset consisted of 1 834 disease image-text pairs. The disease images were all taken in the field environment, and each image was given a text description. The dataset was divided into training, validation, and test sets according to a ratio of 7:2:1. The bimodalNet achieved better performance than training with either the image modality or the text modality alone. The accuracy, sensitivity, specificity, and F1 value of the optimal model on the test dataset were 99.47%, 98.51%, 98.61%, 99.68%, and 98.51%, respectively. Experimental results showed that bimodalNet significantly improved the accuracy of crop disease recognition in an open environment, while reducing the technical and economic costs of acquiring a large number of disease images. The reason was that the complementary features between the image and text modalities were fully utilized: the image modality was used to correct erroneous text descriptions, whereas the text modality was used to correct fuzzy features in the image modality. Since the image and text branches can be replaced with any excellent CNN and recurrent neural network structures, bimodalNet can be expected to serve as a general framework in actual use, indicating high flexibility of the network structure. The findings can demonstrate the feasibility of disease identification using multimodal learning and few samples in an open environment.
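The late-fusion step described in the abstract (adding the outputs of the image and text branches to form the network output) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-class scores, class count, and helper names are invented, and real branches would be a CNN and an RNN rather than fixed lists.

```python
import math

def softmax(scores):
    """Convert raw per-class scores into probabilities."""
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(image_logits, text_logits):
    """Element-wise addition of the two branches' outputs,
    as the abstract describes for bimodalNet's fusion."""
    return [i + t for i, t in zip(image_logits, text_logits)]

# Hypothetical scores: the image branch is ambiguous between classes 0 and 1,
# while the text branch disambiguates in favour of class 1, illustrating the
# complementarity between the two modalities.
image_logits = [2.0, 2.1, 0.1]
text_logits = [0.2, 1.5, 0.3]
probs = softmax(fuse(image_logits, text_logits))
predicted = probs.index(max(probs))
print(predicted)
```

The same addition-then-classify pattern applies regardless of which CNN or recurrent architecture is plugged into each branch, which is what gives the framework its flexibility.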