
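A Distributed Crawler Framework Based on the Coroutine Model
Cite this article: YANG Ji-yun, LIU Jian-xun, JIANG Lei, PENG Tao, WEN Yi-ping, LU Ting. A Distributed Crawler Framework Based on the Coroutine Model[J]. Journal of Hunan Agricultural University (Natural Science Edition), 2014(3): 126-133.
Authors: YANG Ji-yun  LIU Jian-xun  JIANG Lei  PENG Tao  WEN Yi-ping  LU Ting
Affiliation: (Key Laboratory of Knowledge Processing and Networked Manufacturing (Hunan Provincial Universities), School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, Hunan 411201)
Abstract: Web crawlers are limited mainly by network latency and local execution efficiency. Traditional multithreaded crawler architectures aim chiefly to hide network latency and do not take local execution efficiency into account. Under high concurrency, a multithreaded crawler's local efficiency drops as context-switching overhead grows, which in turn lowers network utilization. How to reduce local system overhead while making maximal use of network resources is therefore a problem worth studying. To address it, this paper proposes a coroutine-based distributed web crawler framework, analyzes the coroutine and multithreaded frameworks in terms of overhead, resource utilization, and network utilization, and implements a distributed web crawler based on coroutines. Experiments show that the framework has a fairly clear advantage over the multithreaded framework in overhead, resource utilization, and network utilization.

Keywords: coroutine  distributed  high performance  web crawler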

A Distributed Crawler Framework Based on Coroutine Model
YANG Ji-yun,LIU Jian-xun,JIANG Lei,PENG Tao,WEN Yi-ping,LU Ting.A Distributed Crawler Framework Based on Coroutine Model[J].Journal of Hunan Agricultural University,2014(3):126-133.
Authors:YANG Ji-yun  LIU Jian-xun  JIANG Lei  PENG Tao  WEN Yi-ping  LU Ting
Abstract:Web crawlers are limited mainly by network latency and local resources. Traditional crawler frameworks, which are based on multithreading, are designed mainly to hide network latency and fail to take local resource limits into account. Under high concurrency, a multithreaded architecture runs inefficiently because of increased context switching. How to make maximum use of network resources while respecting local resource limits therefore needs study. To solve these problems, this paper proposes a distributed crawler framework based on coroutines. We analyze coroutines and threads in terms of overhead, resource utilization, and network utilization, and implement a web crawler based on coroutines. Experiments show that our coroutine-based distributed crawler architecture outperforms a thread-based web crawler.
Keywords:coroutine  distribution  high-performance  web crawler
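
The abstract's central idea, many in-flight requests multiplexed on one thread instead of one OS thread per request, can be illustrated with a minimal sketch. This is not the authors' framework: it uses Python's standard `asyncio`, and `fake_fetch`, `crawl`, `run_demo`, the `example.invalid` URLs, and the simulated delay are all hypothetical stand-ins chosen for illustration (no real network access, no distribution across machines).

```python
import asyncio
import threading
import time


async def fake_fetch(url: str, delay: float) -> tuple[str, int]:
    """Hypothetical stand-in for an HTTP request.

    asyncio.sleep suspends only this coroutine, so the event loop can
    run other coroutines while this one "waits on the network".
    """
    await asyncio.sleep(delay)
    return url, threading.get_ident()


async def crawl(urls: list[str], max_concurrency: int = 100,
                delay: float = 0.05) -> list[tuple[str, int]]:
    """Fetch all URLs concurrently with at most max_concurrency in flight."""
    limit = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> tuple[str, int]:
        async with limit:
            return await fake_fetch(url, delay)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


def run_demo(n: int = 200, delay: float = 0.05):
    """Run n simulated fetches; return results, elapsed time, thread ids seen."""
    urls = [f"http://example.invalid/page/{i}" for i in range(n)]
    start = time.perf_counter()
    results = asyncio.run(crawl(urls, delay=delay))
    elapsed = time.perf_counter() - start
    thread_ids = {tid for _, tid in results}
    return results, elapsed, thread_ids


if __name__ == "__main__":
    results, elapsed, thread_ids = run_demo()
    print(f"{len(results)} fetches in {elapsed:.2f}s "
          f"on {len(thread_ids)} thread(s)")
```

Because every coroutine yields at its `await`, switching between requests happens in user space on a single thread, which is the source of the lower context-switch overhead the abstract attributes to coroutines. The semaphore is an assumed design choice for capping in-flight requests; the paper's abstract does not say how the authors bound concurrency.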