A Beginner's Guide to Scrapy: Creating a Spider to Scrape Data

2023-01-15 | Web Scraping | Scrapy

Creating a Spider to Scrape Data with Scrapy

Goal:

Install Scrapy and create a spider that scrapes the content of the website http://quotes.toscrape.com/page/1/.

Installing Scrapy

pip install Scrapy

Creating the first project

scrapy startproject tutorial

Running this command generates a project directory with the following structure:

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

Creating the first spider

Spider files live in the tutorial/spiders directory. Create a file there named quotes_spider.py with the following content:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

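The name = "quotes" attribute matters: it is the name you pass on the command line later to start the spider.

As you can see, our spider subclasses scrapy.Spider and defines some attributes and methods:

name: identifies the spider. It must be unique within a project, that is, you cannot set the same name for different spiders.
start_requests(): must return an iterable of requests (you can return a list of requests or write a generator function) from which the spider begins to crawl. Subsequent requests are generated successively from these initial requests.
parse(): the method called to handle the response downloaded for each request. The response parameter holds the page content and has further helpful methods for handling it.
The parse() method usually parses the response, extracts the scraped data as dicts, and also finds new URLs to follow and creates new requests (Request) from them.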
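
As a small aside, the way parse() derives the output filename can be checked without running a crawl. The sketch below repeats the same string operations on the two start URLs; the helper name page_filename is hypothetical and is not part of Scrapy.

```python
def page_filename(url: str) -> str:
    """Build the filename the way parse() does, from the second-to-last URL segment."""
    # 'http://quotes.toscrape.com/page/1/'.split('/') ends with ['page', '1', ''],
    # so index -2 is the page number.
    page = url.split("/")[-2]
    return f"quotes-{page}.html"


filenames = [
    page_filename("http://quotes.toscrape.com/page/1/"),
    page_filename("http://quotes.toscrape.com/page/2/"),
]
print(filenames)
```

Note that this relies on the trailing slash in the URL: without it, index -2 would be the segment "page" rather than the page number.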

Running the spider

scrapy crawl quotes

Output

... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)

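At this point you will find two new files, quotes-1.html and quotes-2.html, in the project root directory.

Content analysis

On the target site http://quotes.toscrape.com, the part of the page we want has the following HTML structure: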

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

We need to extract the content of the elements with class text and author, and of the tag links inside the tags element.

Scraping the data

Implementing the spider's extraction rules

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

Now run the spider with scrapy crawl quotes:

2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

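Each scraped item appears in the log as a Python dict.

Saving the data

Start the spider with the -O option, which writes the scraped items to quotes.json and overwrites the file if it already exists: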

scrapy crawl quotes -O quotes.json

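To save the items in JSON Lines format instead, use the command below. The lowercase -o option appends to an existing file rather than overwriting it: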

scrapy crawl quotes -o quotes.jl
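
The difference between the two formats matters once output is appended across runs. The following sketch uses only the standard json module and two made-up runs of items to show why appending to a JSON array file breaks it while JSON Lines stays readable; it approximates the file contents and does not call Scrapy's exporters.

```python
import json

# Items from two separate (hypothetical) crawl runs.
items_run1 = [{"author": "André Gide", "tags": ["life", "love"]}]
items_run2 = [{"author": "Thomas A. Edison", "tags": ["edison", "failure"]}]

# Appending a second JSON array to the same file yields "[...][...]",
# which is no longer a single valid JSON document.
appended_json = json.dumps(items_run1) + json.dumps(items_run2)
try:
    json.loads(appended_json)
    json_still_valid = True
except json.JSONDecodeError:
    json_still_valid = False

# JSON Lines stores one independent JSON object per line, so appending is safe.
appended_jl = "".join(json.dumps(item) + "\n" for item in items_run1 + items_run2)
parsed = [json.loads(line) for line in appended_jl.splitlines()]

print(json_still_valid, len(parsed))
```

Because each line is parsed on its own, a JSON Lines file can also be processed as a stream without loading everything into memory.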

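If you want to perform more complex operations on the scraped items, you can write an Item Pipeline. A placeholder file for item pipelines was set up when the project was created, at tutorial/pipelines.py. However, if you only want to store the scraped items, you do not need to implement any item pipeline.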
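
For illustration, a pipeline is a plain Python class with a process_item method. The sketch below is a hypothetical example that is not part of the original tutorial: the class name StripQuotesPipeline and its behavior (removing the curly quotation marks around each quote) are assumptions, and it uses the process_item(item, spider) signature commonly shown in the Scrapy documentation.

```python
class StripQuotesPipeline:
    """Hypothetical pipeline: remove the curly quotation marks around each quote text."""

    def process_item(self, item, spider):
        text = item.get("text")
        if text:
            # strip() removes any leading/trailing characters from the given set.
            item["text"] = text.strip("“”")
        # Returning the item passes it on to the next pipeline or the exporter.
        return item


# Outside a crawl, the method can be exercised directly with a plain dict.
item = {"text": "“I have not failed.”", "author": "Thomas A. Edison"}
cleaned = StripQuotesPipeline().process_item(item, spider=None)
print(cleaned["text"])
```

To take effect in a real crawl, such a class would be placed in tutorial/pipelines.py and enabled through the ITEM_PIPELINES setting in settings.py, for example {"tutorial.pipelines.StripQuotesPipeline": 300}.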

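Handling pagination

So far the spider only crawls the two pages we listed, /page/1/ and /page/2/. To follow every page, we first need to find the pagination markup in the page source: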

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

Then modify the quotes_spider.py spider slightly:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

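Now, after extracting the data, the parse() method looks for the link to the next page, builds an absolute URL with the urljoin() method (since the link can be relative), and yields a new request for the next page. It registers itself as the callback, so the same extraction runs on the next page and the crawl continues through all pages.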
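
To see what the URL joining step does, the sketch below uses urllib.parse.urljoin from the standard library. It assumes that response.urljoin() resolves a link against the response's own URL in the same way; the example URLs are taken from the pages above.

```python
from urllib.parse import urljoin

current_page = "http://quotes.toscrape.com/page/1/"

# A relative href, like the one in the pager markup, is resolved against the page URL.
from_relative = urljoin(current_page, "/page/2/")

# An href that is already absolute is returned unchanged.
from_absolute = urljoin(current_page, "http://quotes.toscrape.com/page/3/")

print(from_relative)
print(from_absolute)
```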

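What you see here is Scrapy's mechanism for following links: when you yield a request in a callback method, Scrapy schedules that request to be sent and registers a callback method to be executed when the request finishes.

Using this, you can build complex crawlers that follow links according to rules you define and extract different kinds of data depending on the page being visited.

In our example, it creates a kind of loop, following the link to the next page until it no longer finds one, which is handy for crawling blogs, forums, and other sites with pagination.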
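
The scheduling loop described above can be imitated in a few lines of plain Python. This is a toy model only, not how Scrapy is implemented: FAKE_SITE is a made-up three-page site, a "request" is just a (url, callback) pair, and run() stands in for the engine.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical stand-in for the site: each page URL maps to the href of its "Next" link.
FAKE_SITE = {
    "http://quotes.toscrape.com/page/1/": "/page/2/",
    "http://quotes.toscrape.com/page/2/": "/page/3/",
    "http://quotes.toscrape.com/page/3/": None,  # last page has no "Next" link
}


def parse(url):
    """Callback: yield one item, then a follow-up request if a next link exists."""
    yield {"page": url}
    next_href = FAKE_SITE[url]
    if next_href is not None:
        yield (urljoin(url, next_href), parse)  # a "request" with its callback


def run(start_url):
    """Toy engine: take a queued request, call its callback, queue whatever it yields."""
    items = []
    queue = deque([(start_url, parse)])
    while queue:
        url, callback = queue.popleft()
        for result in callback(url):
            if isinstance(result, dict):
                items.append(result)   # scraped item
            else:
                queue.append(result)   # new request to schedule
    return items


items = run("http://quotes.toscrape.com/page/1/")
print(len(items))
```

The loop ends on its own once a page yields no further request, which mirrors the spider stopping when no next link is found.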

A shortcut for creating requests

You can use response.follow as a shortcut for creating Request objects:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

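Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin().

Note that response.follow only returns a Request instance; you still have to yield this request.

You can also pass a selector to response.follow instead of a string; this selector should extract the necessary attribute: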

for href in response.css('ul.pager a::attr(href)'):
    yield response.follow(href, callback=self.parse)

For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can be shortened further:

for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse)

To create multiple requests from an iterable, you can use response.follow_all instead:

anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)

Or, shortening it further:

yield from response.follow_all(css='ul.pager a', callback=self.parse)

Spider arguments

You can pass command-line arguments to your spiders by using the -a option when running them:

scrapy crawl quotes -O quotes-humor.json -a tag=humor

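These arguments are passed to the spider's __init__ method and become spider attributes by default.

In this example, the value provided for the tag argument is available via self.tag. You can use it to make your spider fetch only quotes with a specific tag, building the URL based on the argument: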

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

If you pass tag=humor to this spider, you will notice that it only visits URLs from the humor tag, such as http://quotes.toscrape.com/tag/humor.
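
The URL construction in start_requests() can be isolated and checked on its own. The helper below is a hypothetical extraction of that logic (build_start_url is not a Scrapy function); inside the spider, the tag value would come from getattr(self, 'tag', None).

```python
def build_start_url(tag=None):
    """Return the start URL, narrowed to one tag when a tag is given."""
    url = "http://quotes.toscrape.com/"
    if tag is not None:
        url = url + "tag/" + tag
    return url


with_tag = build_start_url("humor")   # as if run with: -a tag=humor
without_tag = build_start_url()       # no -a option given

print(with_tag)
print(without_tag)
```

Using getattr with a default of None is what lets the spider run both with and without the -a option.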

The example code in this article comes from the official Scrapy tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html
