A Distributed Crawler Framework Based on the Coroutine Model

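Cite this article:	YANG Ji-yun, LIU Jian-xun, JIANG Lei, PENG Tao, WEN Yi-ping, LU Ting. A Distributed Crawler Framework Based on Coroutine Model[J]. Journal of Hunan Agricultural University (Natural Sciences), 2014(3): 126-133.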

Authors:	YANG Ji-yun, LIU Jian-xun, JIANG Lei, PENG Tao, WEN Yi-ping, LU Ting

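Affiliation:	(Hunan Provincial Key Laboratory of Knowledge Processing and Networked Manufacturing for Universities, School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, Hunan 411201)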

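Abstract:	A web crawler is limited mainly by network latency and by how efficiently it runs locally. Traditional multi-threaded crawler architectures are designed mainly to hide network latency and do not take local running efficiency into account. Under high concurrency, a multi-threaded crawler runs less efficiently because context-switching overhead grows, and network utilization drops as a result. How to reduce local system overhead while making the fullest use of network resources is therefore a problem that needs study. To address it, this paper proposes a coroutine-based distributed web crawler framework, compares the coroutine framework with the multi-threaded framework in terms of overhead, resource utilization, and network utilization, and implements a distributed web crawler based on coroutines. Experiments show that the framework has a fairly clear advantage over the multi-threaded framework in overhead, resource utilization, and network utilization.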
Keywords:	coroutine; distributed; high performance; crawler
A Distributed Crawler Framework Based on Coroutine Model

YANG Ji-yun, LIU Jian-xun, JIANG Lei, PENG Tao, WEN Yi-ping, LU Ting. A Distributed Crawler Framework Based on Coroutine Model[J]. Journal of Hunan Agricultural University, 2014(3): 126-133.

Authors:	YANG Ji-yun, LIU Jian-xun, JIANG Lei, PENG Tao, WEN Yi-ping, LU Ting

Abstract:	Web crawlers are limited mainly by network latency and local resources. The traditional crawler framework, which is based on multiple threads, is designed mainly to hide network latency and fails to take local resource limits into account. Under high concurrency, a multi-threaded architecture runs inefficiently because of the growing cost of context switching. How to make maximum use of network resources while respecting local resource limits therefore needs to be studied. To address this problem, this paper proposes a distributed crawler framework based on coroutines. We first compare coroutines and threads in terms of overhead, resource utilization, and network utilization, and then implement a web crawler based on coroutines. Experiments show that the coroutine-based distributed crawler architecture outperforms a thread-based web crawler.

Keywords:	coroutine; distribution; high-performance; web crawler
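
The contrast the abstract draws, many in-flight requests served without one operating-system thread per request, can be illustrated with a minimal sketch. This is not the authors' framework: it assumes Python's standard `asyncio` library, covers a single node only (no distribution), and replaces real HTTP requests with a hypothetical `fetch` function that merely sleeps. The names `fetch`, `worker`, `crawl`, and `run_demo`, and the `example.invalid` URLs, are all illustrative.

```python
import asyncio
import time


async def fetch(url: str, latency: float = 0.05) -> str:
    """Hypothetical stand-in for an HTTP request: it only waits, no network I/O."""
    await asyncio.sleep(latency)  # the coroutine yields here so others can run
    return f"<html>{url}</html>"


async def worker(queue: asyncio.Queue, results: dict) -> None:
    """Pull URLs from the shared queue until the task is cancelled."""
    while True:
        url = await queue.get()
        try:
            results[url] = await fetch(url)
        finally:
            queue.task_done()


async def crawl(urls, concurrency: int = 100) -> dict:
    """Run `concurrency` coroutines on one thread over a shared URL queue."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = {}
    workers = [asyncio.create_task(worker(queue, results))
               for _ in range(concurrency)]
    await queue.join()            # wait until every queued URL is processed
    for task in workers:
        task.cancel()             # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


def run_demo(n: int = 200):
    """Crawl n placeholder URLs and return the pages and the elapsed time."""
    urls = [f"http://example.invalid/page/{i}" for i in range(n)]
    start = time.perf_counter()
    pages = asyncio.run(crawl(urls))
    return pages, time.perf_counter() - start
```

Calling `run_demo()` should finish far sooner than the sum of the simulated latencies, because all waiting coroutines share one thread and switch only at `await` points rather than through the operating system's scheduler. A real crawler would swap the placeholder `fetch` for an asynchronous HTTP client and add URL de-duplication, politeness limits, and the cross-node coordination that the paper describes.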
