
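Agricultural named entity recognition based on entity-level masking BERT and BiLSTM-CRF
Cite this article: Wei Zijun, Song Ling, Hu Xiaochun, Chen Ningjiang. Agricultural named entity recognition based on entity-level masking BERT and BiLSTM-CRF[J]. Transactions of the Chinese Society of Agricultural Engineering, 2022, 38(15): 195-203.
Authors: Wei Zijun, Song Ling, Hu Xiaochun, Chen Ningjiang
Affiliations: 1. School of Computer and Electronics Information, Guangxi University, Nanning 530004, China; 2. College of Information Engineering, Nanning University, Nanning 530200, China; 3. Guangxi Key Laboratory of Multimedia Communications and Networks Technology, Nanning 530004, China; 4. School of Information and Statistics, Guangxi University of Finance and Economics, Nanning 530007, China
Funding: National Key Research and Development Program of China (2018YFB1404404); Guangxi Key Research and Development Program (桂科AB19110050); Nanning Major Science and Technology Project (20211005)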
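Abstract: The position and semantic information of characters is crucial for recognizing Chinese agricultural entities, which have complex naming patterns and long names. To address the unsatisfactory recognition results caused by insufficient capture of character position information, contextual semantic features, and long-distance dependency information during named entity recognition, this study proposes a Chinese agricultural named entity recognition method based on the EmBERT-BiLSTM-CRF model. The method uses the Transformer-based deep bidirectional pre-trained language model (Bidirectional Encoder Representation from Transformers, BERT) as the embedding layer to extract deep bidirectional representations of character vectors, and uses an entity-level masking strategy so that the model better represents Chinese semantics. A bidirectional long short-term memory network (Bidirectional Long Short-Term Memory, BiLSTM) then learns the long-sequence semantic features of the text. Finally, a conditional random field (Conditional Random Field, CRF) learns labelling constraint rules from the training data and uses the information between adjacent labels to output the globally optimal label sequence. A focal loss function is used during training to alleviate the imbalanced sample distribution. Experiments on the constructed corpus recognize four types of agricultural entities: crop varieties, diseases, pests, and pesticides. The results show that the EmBERT-BiLSTM-CRF model clearly improves recognition performance on the four types of agricultural entities compared with other models, with an accuracy of 94.97% and an F1 score of 95.93%.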
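The focal loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the hyperparameters or say how the loss is combined with the CRF layer, so the per-token form below, the default `gamma=2.0` and `alpha=1.0`, and the function names are all assumptions for illustration. It uses the commonly cited form FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the probability the model assigns to the true label.

```python
import math


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one token, given the predicted probability of its true label.

    gamma and alpha are illustrative defaults, not values from the paper.
    """
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must be in (0, 1]")
    # (1 - p_true) ** gamma shrinks the loss of confidently correct tokens,
    # so frequent, easy labels (such as "O") contribute less to the total.
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


def mean_focal_loss(probs, gamma=2.0, alpha=1.0):
    """Average focal loss over the true-label probabilities of a sequence."""
    return sum(focal_loss(p, gamma, alpha) for p in probs) / len(probs)
```

With `gamma=0.0` the expression reduces to ordinary cross-entropy. With `gamma=2.0`, a token predicted at probability 0.9 keeps only about 1% of its cross-entropy loss, which is how easy majority-class tokens are down-weighted relative to rare entity tokens.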

Keywords: agriculture; named entity recognition; entity-level masking; BERT; BiLSTM; CRF
Received: 2021/12/20
Revised: 2022/6/29

Named entity recognition of agricultural based entity-level masking BERT and BiLSTM-CRF
Wei Zijun, Song Ling, Hu Xiaochun, Chen Ningjiang. Named entity recognition of agricultural based entity-level masking BERT and BiLSTM-CRF[J]. Transactions of the Chinese Society of Agricultural Engineering, 2022, 38(15): 195-203.
Authors: Wei Zijun, Song Ling, Hu Xiaochun, Chen Ningjiang
Institution: 1. School of Computer and Electronics Information, Guangxi University, Nanning 530004, China; 2. College of Information Engineering, Nanning University, Nanning 530200, China; 3. Guangxi Key Laboratory of Multimedia Communications and Networks Technology, Nanning 530004, China; 4. School of Information and Statistics, Guangxi University of Finance and Economics, Nanning 530007, China
Abstract: Intelligent question answering over agricultural knowledge is an important part of information-based agriculture. Named entity recognition is a key technology for intelligent question answering and knowledge graph construction in the agricultural domain, which places high demands on the accurate identification of named entities. Because Chinese agricultural entities tend to have long names and complex naming conventions, the position and semantic information of characters is critical to recognizing them. Recognition performance suffers when character position information, contextual semantic features, and long-distance dependency information are not captured sufficiently. To address this, this study proposes a Chinese agricultural named entity recognition method based on an EmBERT-BiLSTM-CRF model. Firstly, the Bidirectional Encoder Representation from Transformers (BERT) pre-trained language model was used as the embedding layer to extract deep bidirectional representations of character vectors, which improved the contextual semantic representation of the model and alleviated polysemy. Secondly, the masked language modelling of BERT was adapted to the characteristics of Chinese: an entity-level masking strategy masked each Chinese entity in a sentence completely, as a run of consecutive tokens. This allowed Chinese semantics to be better represented and alleviated the bias caused by incomplete semantics. Thirdly, a bidirectional long short-term memory network (BiLSTM) was adopted to learn the semantic features of long sequences using two LSTM networks (forward and backward), taking contextual information in both directions into account at the same time and thereby capturing the long-distance dependencies of the text.
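As a rough illustration of entity-level masking, the sketch below masks every character of a selected entity together, instead of masking characters independently. The abstract does not give the masking rate or any replacement scheme, so `mask_prob`, the `[MASK]` string, the span format (start inclusive, end exclusive), and the function name are assumptions made for this example only.

```python
import random

MASK = "[MASK]"


def entity_level_mask(tokens, entity_spans, mask_prob=0.5, rng=None):
    """Mask whole entities in a character-level token list.

    tokens: list of single-character tokens.
    entity_spans: list of (start, end) index pairs, end exclusive.
    Returns (masked_tokens, targets), where targets holds the original
    character at each masked position and None elsewhere.
    """
    if rng is None:
        rng = random.Random()
    masked = list(tokens)
    targets = [None] * len(tokens)
    for start, end in entity_spans:
        # One random draw per entity: the span is masked as a whole or not at all.
        if rng.random() < mask_prob:
            for i in range(start, end):
                targets[i] = tokens[i]
                masked[i] = MASK
    return masked, targets


# Hypothetical sentence: "水稻" (a crop) and "纹枯病" (a disease) are the entities.
tokens = list("水稻纹枯病危害严重")
spans = [(0, 2), (2, 5)]
masked, targets = entity_level_mask(tokens, spans, mask_prob=1.0)
```

Because the draw is made once per entity, the model never sees a half-masked name such as "纹[MASK]病" and has to predict the complete entity from its context, which is the intuition behind reducing the bias from incomplete semantics.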
Finally, a conditional random field (CRF) was used to learn labelling constraints from the training data. The learned constraint rules were used during prediction to check whether a label sequence was legal, and the CRF used the information between adjacent labels to output the globally optimal label sequence. The output of the model was therefore not a sequence of independently chosen labels but an optimal sequence that respected the rules and the label order. A focal loss function was also used during training to alleviate the unbalanced sample distribution. Experiments were carried out on a self-constructed named entity recognition corpus. After BIO labelling, the corpus contained a total of 29 790 agricultural entities, comprising 11 057 crop, 8 121 pesticide, 4 505 disease, and 6 107 pest entities, and was divided into training, validation, and test sets at a ratio of 7:2:1. Four types of agricultural entities were identified and labelled in the text: crop varieties, pesticides, diseases, and insect pests. The experimental results show that the EmBERT-BiLSTM-CRF model achieved a recognition accuracy of 94.97% and an F1 score of 95.93% on the four types of entities. Compared with models based on BiLSTM-CRF and BERT-BiLSTM-CRF, the recognition performance of EmBERT-BiLSTM-CRF was significantly improved. This indicates that using a pre-trained language model as the embedding layer represents the characteristics of characters well, and that the entity-level masking strategy alleviates the bias caused by incomplete semantics, thereby enhancing the Chinese semantic representation ability of the model and enabling it to identify Chinese agricultural named entities more accurately.
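The role of the CRF layer, choosing a globally optimal and legal label sequence instead of the best label at each position, can be sketched with a small Viterbi decoder. This is a simplified illustration and not the model from the paper: a trained CRF learns real-valued transition scores, whereas here the BIO rule is hard-coded (allowed transitions score 0, forbidden ones are excluded), and the tag set `TAGS` and the emission scores are made up for the example.

```python
NEG_INF = float("-inf")
TAGS = ["O", "B-DIS", "I-DIS"]  # hypothetical tag set with one entity type


def allowed(prev, curr):
    """BIO rule: I-X may only follow B-X or I-X of the same type."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def greedy(emissions, tags=TAGS):
    """Pick the highest-scoring tag at each position independently."""
    return [tags[max(range(len(tags)), key=lambda j: e[j])] for e in emissions]


def viterbi(emissions, tags=TAGS):
    """Return the highest-scoring tag sequence that obeys the BIO rule."""
    n = len(tags)
    # A sequence may not start with an I- tag.
    score = [NEG_INF if t.startswith("I-") else emissions[0][j]
             for j, t in enumerate(tags)]
    back = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j, curr in enumerate(tags):
            best_i, best = 0, NEG_INF
            for i, prev in enumerate(tags):
                if allowed(prev, curr) and score[i] > best:
                    best_i, best = i, score[i]
            new_score.append(best + emit[j])
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final tag.
    j = max(range(n), key=lambda k: score[k])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [tags[k] for k in path]


# Made-up emission scores for a three-character input, columns ordered as TAGS.
emissions = [[2.0, 1.5, 0.0], [0.0, 0.5, 2.0], [0.0, 0.0, 2.0]]
```

On these scores, `greedy` yields `O, I-DIS, I-DIS`, which is illegal because an I- tag follows O, while `viterbi` yields `B-DIS, I-DIS, I-DIS`. The difference shows why decoding over the whole sequence matters for long entity names.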
This research can not only provide relatively high entity recognition accuracy for tasks such as agricultural intelligent question answering, but also offer new ideas for the identification of Chinese named entities in the fishery, animal husbandry, Chinese medical, and biological fields.
Keywords: agriculture; named entity recognition; entity-level masking; BERT; BiLSTM; CRF