How to integrate Selenium into Scrapy to crawl web pages

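1. Background

  • Three libraries come up most often when crawling web pages: requests, Scrapy, and Selenium. requests suits small crawlers, Scrapy is for building large crawler projects, and Selenium is mainly for dealing with complex pages (pages rendered by heavy JavaScript, whose requests are very hard to construct or whose construction changes often).
  • For a large crawler project, Scrapy is the natural first choice, but parsing pages rendered by complex JavaScript with it is painful. Rendering such pages in a Selenium-driven browser is convenient: we do not need to care which requests happen behind the page or analyze how it is rendered, only what the page finally looks like. If you can see it, you can crawl it. The drawback is that Selenium is very slow.
  • So if Selenium can be integrated into Scrapy, with Selenium handling only the complex pages, the resulting crawler can cope with nearly any site.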

2. Environment

  • Python 3.6.1
  • OS: Windows 7
  • IDE: PyCharm
  • Chrome browser installed
  • chromedriver configured (environment variable set up)
  • Selenium 3.7.0
  • Scrapy 1.4.0

3. How it works

3.1. The request flow

First, look at Scrapy's latest architecture diagram (figure not included here).

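Part of the flow:

First: the engine takes the Requests generated by the spider and sends them to the scheduler, where they wait in a queue to be scheduled.

Second: the scheduler dequeues these Requests and hands them back to the engine.

Third: the engine passes the Requests through the downloader middlewares (there can be several: adding headers, proxies, custom ones, and so on).

Fourth: once processed, the Requests go to the downloader to be fetched. Looking at this flow, the downloader middleware is the place to hook in: let Selenium handle the Request there directly.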
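The four steps above can be modelled in a few lines of plain Python. This is a toy, synchronous sketch and not Scrapy's implementation: requests and responses are plain dicts, and the names `run_pipeline`, `add_header`, `browser_render` and `plain_download` are hypothetical. It only illustrates that a middleware which returns a response takes the downloader's place for that request.

```python
from collections import deque


def add_header(request):
    """Ordinary middleware: tweaks the request and returns None."""
    request.setdefault("headers", {})["User-Agent"] = "demo-agent"
    return None


def browser_render(request):
    """Short-circuiting middleware: answers flagged requests itself."""
    if request.get("meta", {}).get("usedselenium"):
        return {"url": request["url"], "via": "browser"}
    return None


def plain_download(request):
    """Stand-in for the downloader module."""
    return {"url": request["url"], "via": "downloader"}


def run_pipeline(requests, middlewares, download):
    """Queue the requests, then pass each one through the middleware chain."""
    scheduler = deque(requests)               # steps 1-2: enqueue, then dequeue
    responses = []
    while scheduler:
        request = scheduler.popleft()
        for process_request in middlewares:   # step 3: middleware chain
            response = process_request(request)
            if response is not None:          # a middleware produced the response
                break
        else:
            response = download(request)      # step 4: normal download
        responses.append(response)
    return responses


results = run_pipeline(
    [
        {"url": "https://example.com/static"},
        {"url": "https://example.com/js-heavy", "meta": {"usedselenium": True}},
    ],
    [add_header, browser_render],
    plain_download,
)
print([r["via"] for r in results])
```

Only the flagged request is answered by the browser stand-in; the other one still reaches the downloader stand-in.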

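3.2. Source analysis of the request and response middleware handling

Location of the relevant code: scrapy/core/downloader/middleware.py in the installed scrapy package.

Annotated source: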

# File: e:\miniconda\lib\site-packages\scrapy\core\downloader\middleware.py
"""
Downloader Middleware manager

See documentation in docs/topics/downloader-middleware.rst
"""
import six

from twisted.internet import defer

from scrapy.http import Request, Response
from scrapy.middleware import MiddlewareManager
from scrapy.utils.defer import mustbe_deferred
from scrapy.utils.conf import build_component_list


class DownloaderMiddlewareManager(MiddlewareManager):

  component_name = 'downloader middleware'

  @classmethod
  def _get_mwlist_from_settings(cls, settings):
    # Pick up the custom middlewares configured in settings.py or custom_settings
    '''
    'DOWNLOADER_MIDDLEWARES': {
      'myspider.middlewares.proxiesmiddleware': 400,
      # seleniummiddleware
      'myspider.middlewares.seleniummiddleware': 543,
      'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    },
    '''
    return build_component_list(
      settings.getwithbase('DOWNLOADER_MIDDLEWARES'))

  # Register each middleware's handler methods in the matching methods list
  def _add_middleware(self, mw):
    if hasattr(mw, 'process_request'):
      self.methods['process_request'].append(mw.process_request)
    if hasattr(mw, 'process_response'):
      self.methods['process_response'].insert(0, mw.process_response)
    if hasattr(mw, 'process_exception'):
      self.methods['process_exception'].insert(0, mw.process_exception)

  # The whole download flow
  def download(self, download_func, request, spider):
    @defer.inlineCallbacks
    def process_request(request):
      # Run the request through the process_request method of each middleware,
      # in the order they were added to the list above
      for method in self.methods['process_request']:
        response = yield method(request=request, spider=spider)
        assert response is None or isinstance(response, (Response, Request)), \
            'Middleware %s.process_request must return None, Response or Request, got %s' % \
            (six.get_method_self(method).__class__.__name__, response.__class__.__name__)
        # This is the key point.
        # If a middleware's process_request produces a Response object,
        # that response is returned immediately: the loop ends and the remaining
        # process_request methods are skipped.
        # Our earlier header and proxy middlewares only add a user-agent or a proxy
        # and return nothing.
        # Note that the returned value must be a Response (or Request) object;
        # the HtmlResponse we construct later is a subclass of Response.
        if response:
          defer.returnValue(response)
      # If none of the process_request methods above returned a response,
      # the processed request is finally passed to download_func, which downloads it
      # and returns a Response object.
      # That response then goes through each middleware's process_response, shown below.
      defer.returnValue((yield download_func(request=request,spider=spider)))

    @defer.inlineCallbacks
    def process_response(response):
      assert response is not None, 'Received None in process_response'
      if isinstance(response, Request):
        defer.returnValue(response)

      for method in self.methods['process_response']:
        response = yield method(request=request, response=response,
                    spider=spider)
        assert isinstance(response, (Response, Request)), \
          'Middleware %s.process_response must return Response or Request, got %s' % \
          (six.get_method_self(method).__class__.__name__, type(response))
        if isinstance(response, Request):
          defer.returnValue(response)
      defer.returnValue(response)

    @defer.inlineCallbacks
    def process_exception(_failure):
      exception = _failure.value
      for method in self.methods['process_exception']:
        response = yield method(request=request, exception=exception,
                    spider=spider)
        assert response is None or isinstance(response, (Response, Request)), \
          'Middleware %s.process_exception must return None, Response or Request, got %s' % \
          (six.get_method_self(method).__class__.__name__, type(response))
        if response:
          defer.returnValue(response)
      defer.returnValue(_failure)

    deferred = mustbe_deferred(process_request, request)
    deferred.addErrback(process_exception)
    deferred.addCallback(process_response)
    return deferred
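One detail in `_add_middleware` is easy to miss: request hooks are appended, while response hooks are inserted at index 0. The sketch below mirrors only that bookkeeping with plain Python objects. `Recorder` and `build_methods` are hypothetical names, and the hooks are simplified to take a single argument.

```python
from collections import defaultdict


class Recorder:
    """Hypothetical middleware that records when its hooks run."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def process_request(self, request):
        self.log.append(f"request:{self.name}")

    def process_response(self, response):
        self.log.append(f"response:{self.name}")
        return response


def build_methods(middlewares):
    """Mirror _add_middleware: append request hooks, prepend response hooks."""
    methods = defaultdict(list)
    for mw in middlewares:
        methods["process_request"].append(mw.process_request)
        methods["process_response"].insert(0, mw.process_response)
    return methods


log = []
# Middlewares listed in ascending priority order, e.g. 400 and then 543.
methods = build_methods([Recorder("proxy", log), Recorder("selenium", log)])

for hook in methods["process_request"]:
    hook({"url": "https://example.com"})
for hook in methods["process_response"]:
    hook({"status": 200})

print(log)
```

Request hooks run in priority order and response hooks in the reverse order. Because `process_response` is attached as a callback on the whole deferred, a response returned early by some `process_request` still passes through the response hooks.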

4. Code

Configure the Selenium parameters in settings.py:

# In settings.py

# ----------- Selenium settings -------------
SELENIUM_TIMEOUT = 25      # Selenium browser timeout, in seconds
LOAD_IMAGE = True        # whether to load images
WINDOW_HEIGHT = 900       # browser window size
WINDOW_WIDTH = 900

In the spider, when generating a Request, mark which requests should be downloaded through Selenium:

# In myspider.py
from scrapy import Request
from scrapy.spiders import CrawlSpider

class myspider(CrawlSpider):
  name = "myspideramazon"
  allowed_domains = ['amazon.com']

  custom_settings = {
    'LOG_LEVEL': 'INFO',
    'DOWNLOAD_DELAY': 0,
    'COOKIES_ENABLED': False, # enabled by default
    'DOWNLOADER_MIDDLEWARES': {
      # proxy middleware
      'myspider.middlewares.proxiesmiddleware': 400,
      # seleniummiddleware
      'myspider.middlewares.seleniummiddleware': 543,
      # disable Scrapy's default user-agent middleware
      'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    },
  }

#..................... divider .......................
# When generating a Request, put the flag that selects Selenium downloading into meta
yield Request(
  url = "https://www.amazon.com/",
  meta = {'usedselenium': True, 'dont_redirect': True},
  callback = self.parseindexpage,
  errback = self.error
)

In the downloader middleware middlewares.py, fetch the page with Selenium (the core part):

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.common.keys import Keys
from scrapy.http import HtmlResponse
from logging import getLogger
import time

class seleniummiddleware():
  # Pipelines and middlewares often need values from settings;
  # they are available through the scrapy.crawler.Crawler.settings attribute
  @classmethod
  def from_crawler(cls, crawler):
    # Read the Selenium parameters from settings.py and initialize the class
    return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'),
          isloadimage=crawler.settings.get('LOAD_IMAGE'),
          windowheight=crawler.settings.get('WINDOW_HEIGHT'),
          windowwidth=crawler.settings.get('WINDOW_WIDTH')
          )

  def __init__(self, timeout=30, isloadimage=True, windowheight=None, windowwidth=None):
    self.logger = getLogger(__name__)
    self.timeout = timeout
    self.isloadimage = isloadimage
    # Keep one browser on the instance so that a new Chrome is not opened for every request;
    # every request handled by this class reuses the same browser
    self.browser = webdriver.Chrome()
    if windowheight and windowwidth:
      self.browser.set_window_size(windowwidth, windowheight)
    self.browser.set_page_load_timeout(self.timeout)    # page load timeout
    self.wait = WebDriverWait(self.browser, self.timeout)       # timeout when waiting for an element

  def process_request(self, request, spider):
    '''
    Fetch the page with Chrome
    :param request: the Request object
    :param spider: the Spider object
    :return: an HtmlResponse
    '''
    # self.logger.debug('chrome is getting page')
    print("chrome is getting page")
    # The flag in meta decides whether this request is fetched with Selenium;
    # requests without it return None and go on to the regular downloader
    usedselenium = request.meta.get('usedselenium', False)
    if usedselenium:
      try:
        self.browser.get(request.url)
        # Wait for the search box to appear
        search_box = self.wait.until(
          ec.presence_of_element_located((By.XPATH, "//div[@class='nav-search-field ']/input"))
        )
        time.sleep(2)
        search_box.clear()
        search_box.send_keys("iphone 7s")
        # Press Enter to run the search
        search_box.send_keys(Keys.RETURN)
        # Wait for the search results to appear
        searchres = self.wait.until(
          ec.presence_of_element_located((By.XPATH, "//div[@id='resultscol']"))
        )
      except Exception as e:
        # self.logger.debug(f'chrome getting page error, exception = {e}')
        print(f"chrome getting page error, exception = {e}")
        return HtmlResponse(url=request.url, status=500, request=request)
      else:
        time.sleep(3)
        return HtmlResponse(url=request.url,
                  body=self.browser.page_source,
                  request=request,
                  # ideally chosen to match the page's actual encoding
                  encoding='utf-8',
                  status=200)
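The middleware above reads the image-loading setting into `isloadimage` but never applies it. One commonly used way to act on such a flag is a Chrome preference that blocks images. The sketch below only builds the preference dict, so it runs without a browser. The helper name `build_chrome_prefs` is hypothetical, and the wiring shown in the trailing comments is an assumption about how it could be plugged in, not part of the original code.

```python
def build_chrome_prefs(load_image=True):
    """Return a Chrome "prefs" dict; the value 2 blocks image loading."""
    if load_image:
        return {}  # leave Chrome's default behaviour untouched
    return {"profile.managed_default_content_settings.images": 2}


prefs = build_chrome_prefs(load_image=False)
print(prefs)

# Assumed wiring inside the middleware's __init__ (needs selenium and chromedriver):
#   options = webdriver.ChromeOptions()
#   options.add_experimental_option("prefs", build_chrome_prefs(self.isloadimage))
#   self.browser = webdriver.Chrome(chrome_options=options)  # `options=` in newer Selenium releases
```

Skipping images mainly saves bandwidth and page-load time; pages whose layout depends on image dimensions may render differently.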

5. Results

 

6. Remaining problems

6.1. The spider has closed, but Chrome has not exited.

2018-04-04 09:26:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 2092766,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 4, 4, 1, 26, 16, 763602),
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2018, 4, 4, 1, 25, 48, 301602)}
2018-04-04 09:26:18 [scrapy.core.engine] INFO: Spider closed (finished)

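Above, the browser object lives in the middleware, which only implements process_request and process_response; nothing in it hooks into Scrapy's shutdown, so nothing ever closes the browser.

Solution: use a signal. When the spider_closed signal is received, call browser.quit().

6.2. When one project starts several spiders at the same time, they share the Selenium browser in the middleware, which hurts concurrency.

With the Scrapy + Selenium approach, only some pages, often just a small fraction, need Chrome. Since keeping Chrome in the middleware brings these limitations, why not put Chrome in the spider instead? The benefit is that each spider has its own Chrome: starting several spiders starts several Chrome instances instead of all spiders sharing one, which helps concurrency.

Solution: move the Chrome initialization into the spider, so that each spider has its own Chrome.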
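The signal-based cleanup can be sketched as follows. To keep the example runnable without Scrapy or a browser, `FakeSignalManager`, `FakeCrawler`, `FakeBrowser` and `SPIDER_CLOSED` are hypothetical stand-ins; in a real middleware the connection would be made with `crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)` inside `from_crawler`.

```python
SPIDER_CLOSED = object()  # stand-in for scrapy.signals.spider_closed


class FakeSignalManager:
    """Stand-in for crawler.signals: stores receivers per signal."""

    def __init__(self):
        self._receivers = {}

    def connect(self, receiver, signal):
        self._receivers.setdefault(signal, []).append(receiver)

    def send(self, signal, **kwargs):
        for receiver in self._receivers.get(signal, []):
            receiver(**kwargs)


class FakeCrawler:
    def __init__(self):
        self.signals = FakeSignalManager()


class FakeBrowser:
    """Stand-in for webdriver.Chrome()."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


class SeleniumMiddlewareSketch:
    def __init__(self, browser):
        self.browser = browser

    @classmethod
    def from_crawler(cls, crawler):
        mw = cls(browser=FakeBrowser())
        # Register the cleanup handler while the middleware is being built.
        crawler.signals.connect(mw.spider_closed, signal=SPIDER_CLOSED)
        return mw

    def spider_closed(self, spider):
        # Runs once when the spider finishes; releases the browser process.
        self.browser.quit()


crawler = FakeCrawler()
middleware = SeleniumMiddlewareSketch.from_crawler(crawler)
crawler.signals.send(SPIDER_CLOSED, spider="demo-spider")  # the engine closes the spider
print(middleware.browser.closed)
```

Connecting the handler in `from_crawler` keeps creation and cleanup of the browser in the same class, so the object that opened Chrome is also the one that quits it.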
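A small model of this design, using stand-in objects instead of real spiders and drivers (`FakeBrowser`, `FakeSpider` and the dict-based request are hypothetical): the middleware function looks the browser up on whichever spider issued the request, and falls back to the normal downloader when a spider has no browser attribute. That fallback is an addition for illustration.

```python
class FakeBrowser:
    """Stand-in for a Selenium driver; records the URLs it was asked to load."""

    def __init__(self, label):
        self.label = label
        self.visited = []

    def get(self, url):
        self.visited.append(url)

    @property
    def page_source(self):
        return f"<html>rendered by {self.label}</html>"


class FakeSpider:
    def __init__(self, name, browser=None):
        self.name = name
        if browser is not None:
            self.browser = browser


def process_request(request, spider):
    """Use the spider's own browser when the request is flagged and one exists."""
    browser = getattr(spider, "browser", None)
    if not request.get("meta", {}).get("usedselenium") or browser is None:
        return None  # fall through to the normal downloader
    browser.get(request["url"])
    return browser.page_source


spider_a = FakeSpider("a", FakeBrowser("chrome-a"))
spider_b = FakeSpider("b", FakeBrowser("chrome-b"))
spider_c = FakeSpider("c")  # never created a browser

flagged = {"url": "https://example.com", "meta": {"usedselenium": True}}
print(process_request(flagged, spider_a))
print(process_request(flagged, spider_b))
print(process_request(flagged, spider_c))
```

Each spider's requests touch only that spider's browser, so the middleware itself holds no browser state.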

7. Improved code

Configure the Selenium parameters in settings.py:

# In settings.py

# ----------- Selenium settings -------------
SELENIUM_TIMEOUT = 25      # Selenium browser timeout, in seconds
LOAD_IMAGE = True        # whether to load images
WINDOW_HEIGHT = 900       # browser window size
WINDOW_WIDTH = 900

In the spider, when generating a Request, mark which requests should be downloaded through Selenium:

# In myspider.py
# Selenium imports
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

# Scrapy imports
from scrapy import Request
from scrapy.spiders import CrawlSpider
# Scrapy settings and signal imports
from scrapy.utils.project import get_project_settings
# The import below is about to be deprecated, so it is not used
# from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
# The approach Scrapy now uses
from pydispatch import dispatcher

class myspider(CrawlSpider):
  name = "myspideramazon"
  allowed_domains = ['amazon.com']

  custom_settings = {
    'LOG_LEVEL': 'INFO',
    'DOWNLOAD_DELAY': 0,
    'COOKIES_ENABLED': False, # enabled by default
    'DOWNLOADER_MIDDLEWARES': {
      # proxy middleware
      'myspider.middlewares.proxiesmiddleware': 400,
      # seleniummiddleware
      'myspider.middlewares.seleniummiddleware': 543,
      # disable Scrapy's default user-agent middleware
      'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    },
  }

  # Initialize Chrome in the spider, so that it becomes an attribute of the spider
  def __init__(self, timeout=30, isloadimage=True, windowheight=None, windowwidth=None):
    # Read the parameters from settings.py
    self.mysetting = get_project_settings()
    self.timeout = self.mysetting['SELENIUM_TIMEOUT']
    self.isloadimage = self.mysetting['LOAD_IMAGE']
    self.windowheight = self.mysetting['WINDOW_HEIGHT']
    self.windowwidth = self.mysetting['WINDOW_WIDTH']
    # Initialize the Chrome object
    self.browser = webdriver.Chrome()
    if self.windowheight and self.windowwidth:
      self.browser.set_window_size(self.windowwidth, self.windowheight)
    self.browser.set_page_load_timeout(self.timeout)    # page load timeout
    self.wait = WebDriverWait(self.browser, self.timeout)       # timeout when waiting for an element
    super(myspider, self).__init__()
    # Connect the signal: when spider_closed is received, myspiderclosehandle is called to close Chrome
    dispatcher.connect(receiver = self.myspiderclosehandle,
              signal = signals.spider_closed
              )

  # Signal handler: close the Chrome browser
  def myspiderclosehandle(self, spider):
    print("myspiderclosehandle: enter ")
    self.browser.quit()

#..................... divider .......................
# When generating a Request, put the flag that selects Selenium downloading into meta
yield Request(
  url = "https://www.amazon.com/",
  meta = {'usedselenium': True, 'dont_redirect': True},
  callback = self.parseindexpage,
  errback = self.error
)

In the downloader middleware middlewares.py, fetch the page with Selenium:

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.common.keys import Keys
from scrapy.http import HtmlResponse
from logging import getLogger
import time

class seleniummiddleware():
  # The middleware receives the spider as an argument; this is our spider object,
  # and the Chrome-related attributes created in its __init__ can be read from it
  def process_request(self, request, spider):
    '''
    Fetch the page with Chrome
    :param request: the Request object
    :param spider: the Spider object
    :return: an HtmlResponse
    '''
    print("chrome is getting page")
    # The flag in meta decides whether this request is fetched with Selenium;
    # requests without it return None and go on to the regular downloader
    usedselenium = request.meta.get('usedselenium', False)
    if usedselenium:
      try:
        spider.browser.get(request.url)
        # Wait for the search box to appear
        search_box = spider.wait.until(
          ec.presence_of_element_located((By.XPATH, "//div[@class='nav-search-field ']/input"))
        )
        time.sleep(2)
        search_box.clear()
        search_box.send_keys("iphone 7s")
        # Press Enter to run the search
        search_box.send_keys(Keys.RETURN)
        # Wait for the search results to appear
        searchres = spider.wait.until(
          ec.presence_of_element_located((By.XPATH, "//div[@id='resultscol']"))
        )
      except Exception as e:
        print(f"chrome getting page error, exception = {e}")
        return HtmlResponse(url=request.url, status=500, request=request)
      else:
        time.sleep(3)
        # The page was fetched successfully: build a successful response object
        # (HtmlResponse is a subclass of Response)
        return HtmlResponse(url=request.url,
                  body=spider.browser.page_source,
                  request=request,
                  # ideally chosen to match the page's actual encoding
                  encoding='utf-8',
                  status=200)
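The code above mixes explicit waits (`wait.until(...)`) with fixed `time.sleep` pauses. An explicit wait polls a condition and returns as soon as it holds, whereas a sleep always costs its full duration. The standard-library sketch below models that polling loop; `wait_until` and `results_present` are hypothetical names, and the built-in `TimeoutError` stands in for Selenium's own timeout exception.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.1):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical condition: becomes truthy on the third poll.
polls = {"count": 0}


def results_present():
    polls["count"] += 1
    return "results" if polls["count"] >= 3 else None


print(wait_until(results_present, timeout=2.0, poll_interval=0.01))
```

Where the page offers an element or state that signals readiness, waiting on it is usually both faster and more reliable than a fixed pause.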

Run result (when the spider finishes, myspiderclosehandle runs and closes the Chrome browser):

['categoryselectoramazon1.pipelines.mongopipeline']
2018-04-04 11:56:21 [scrapy.core.engine] INFO: Spider opened
2018-04-04 11:56:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
chrome is getting page
parseproductdetail url = https://www.amazon.com/, status = 200, meta = {'usedselenium': True, 'dont_redirect': True, 'download_timeout': 25.0, 'proxy': 'http://h37xpsb6v57vu96d:cab31daeb9313ce5@proxy.abuyun.com:9020', 'depth': 0}
chrome is getting page
2018-04-04 11:56:54 [scrapy.core.engine] INFO: Closing spider (finished)
myspiderclosehandle: enter
2018-04-04 11:56:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 1938619,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 4, 4, 3, 56, 54, 301602),
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2018, 4, 4, 3, 56, 21, 642602)}
2018-04-04 11:56:59 [scrapy.core.engine] INFO: Spider closed (finished)
