Python Crawler in Practice (3): Collecting Anjuke Real-Estate Agent Information

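1. Introduction

When the open-source Python web crawler project was launched, we divided web crawlers into two categories: real-time crawlers and harvesting crawlers. To fit a range of application scenarios, the project's crawler product line includes four kinds of products, as shown in the figure below:

This walkthrough is an example of the "standalone Python crawler" in the figure above. Using the collection of Anjuke real-estate agent information as the example, it records the whole collection process, including the installation of Python and the dependent libraries, so that even a Python beginner can follow along and get it running.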

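2. Installing Python and the required libraries

Runtime environment: Windows 10

2.1. Install Python 3.5.2

This version installs pip and setuptools automatically, which makes it easy to install other libraries.

2.2. Lxml 3.6.0

Lxml official site: http://lxml.de/

For 32-bit Python 3.5 on Windows, the matching installation file is lxml-3.6.0-cp35-cp35m-win32.whl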
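Choosing the right .whl file depends on reading the tags in its name. The following is a minimal sketch, assuming the standard wheel naming scheme (name, version, optional build tag, Python tag, ABI tag, platform tag); the helper names parse_wheel_filename and interpreter_bits are hypothetical and only serve to compare a wheel file with the interpreter that is actually installed.

```python
import struct
import sys


def parse_wheel_filename(filename):
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected wheel filename: " + filename)
    return {
        "name": name,
        "version": version,
        "build_tag": build_tag,
        "python_tag": python_tag,      # cp35 means CPython 3.5
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,  # win32 means a 32-bit Windows build
    }


def interpreter_bits():
    """Return 32 or 64 depending on the running Python build."""
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    info = parse_wheel_filename("lxml-3.6.0-cp35-cp35m-win32.whl")
    print(info["python_tag"], info["platform_tag"])
    print("running Python %d.%d, %d-bit"
          % (sys.version_info[0], sys.version_info[1], interpreter_bits()))
```

If the tags and the interpreter agree, the file can be installed from the download folder with pip install followed by the file name.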

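The web content extractor is a class that GooSeeker released for the open-source Python real-time web crawler project. Using this class greatly reduces the time spent debugging information-collection rules; for details see "Python Real-Time Web Crawler Project: Definition of the Content Extractor".

Save gooseeker.py in the project directory.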
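gooseeker.py itself is not reproduced in this article. To give a feel for what an XSLT-based content extractor does, here is a rough sketch that uses only lxml's own XSLT support. It is an assumption about the general technique, not the actual GsExtractor code; the sample HTML, the div class "agent", and the output element names are invented for illustration.

```python
from lxml import etree

# Hypothetical extraction rule: collect the <h3> text of every
# <div class="agent"> into an <agents> result document.
XSLT_RULE = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8"/>
  <xsl:template match="/">
    <agents>
      <xsl:for-each select="//div[@class='agent']">
        <agent>
          <name><xsl:value-of select="normalize-space(h3)"/></name>
        </agent>
      </xsl:for-each>
    </agents>
  </xsl:template>
</xsl:stylesheet>"""

# Invented sample page standing in for a downloaded listing page
SAMPLE_HTML = """<html><body>
<div class="agent"><h3> Alice </h3></div>
<div class="agent"><h3>Bob</h3></div>
</body></html>"""

# Parse the HTML leniently, compile the rule, and apply it
doc = etree.HTML(SAMPLE_HTML)
transform = etree.XSLT(etree.XML(XSLT_RULE))
result = transform(doc)

names = [node.text for node in result.getroot().iter("name")]
print(names)
```

Keeping the rule in a separate XSLT document means that a change in page layout only requires a new rule, while the Python code stays the same.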

3. Web crawler source code

# -*- coding: utf-8 -*-
# anjuke.py
# Crawl Anjuke real-estate agents

from urllib import request
from lxml import etree
from gooseeker import GsExtractor


class Spider:
    def getContent(self, url):
        # Download the page and parse it into an lxml element tree
        conn = request.urlopen(url)
        output = etree.HTML(conn.read())
        return output

    def saveContent(self, filepath, content):
        # Write the extracted result to a UTF-8 text file
        with open(filepath, 'w', encoding='UTF-8') as file_obj:
            file_obj.write(content)


bbsExtra = GsExtractor()
# The next statement calls the GooSeeker API to set the XSLT extraction rule.
# The first argument is the app key; apply for one in the GooSeeker member center.
# The second argument is the rule name, generated with GooSeeker's graphical tool 謀數台MS.
bbsExtra.setXsltFromAPI("31d24931e043e2d5364d03b8ff9cc77e", "安居客房產經紀人")

url = "http://shenzhen.anjuke.com/tycoon/nanshan/p"
totalpages = 50
anjukeSpider = Spider()
print("Crawl started")

# range() excludes its upper bound, so add 1 to include the last page
for pagenumber in range(1, totalpages + 1):
    currenturl = url + str(pagenumber)
    print("Crawling", currenturl)
    content = anjukeSpider.getContent(currenturl)
    outputxml = bbsExtra.extract(content)
    outputfile = "result" + str(pagenumber) + ".xml"
    anjukeSpider.saveContent(outputfile, str(outputxml))

print("Crawl finished")
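The listing above stops at the first network error and requests pages back to back. As a possible refinement, the sketch below builds the page URLs explicitly and retries a failed download after a short pause. The names page_urls and fetch_with_retry, the retry count, the delay, and the 10-second timeout are all illustrative assumptions, not part of the original program.

```python
import time
from urllib import request


def page_urls(base_url, total_pages):
    """Yield (page number, URL) for pages 1..total_pages inclusive."""
    for page in range(1, total_pages + 1):
        yield page, base_url + str(page)


def fetch_with_retry(url, retries=3, delay=1.0, opener=request.urlopen):
    """Return the response body as bytes, retrying on network errors.

    opener is injectable so the function can be exercised without
    touching the network.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with opener(url, timeout=10) as conn:
                return conn.read()
        except OSError as err:  # urllib.error.URLError is a subclass of OSError
            last_error = err
            if attempt < retries:
                time.sleep(delay)  # pause before the next attempt
    raise last_error
```

In the crawler loop, etree.HTML(fetch_with_retry(currenturl)) could then take the place of the direct request.urlopen call, and a time.sleep between pages would keep the request rate modest.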

The run procedure is as follows:

Open a Windows CMD window and change the current directory to the folder that contains anjuke.py (cd \xxxx\xxx)

Run python anjuke.py

4. Crawl results

Several result**.xml files appear in the project directory; their content is shown in the figure below:
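To check the output quickly without opening each file, something like the following sketch could be used. It assumes only the resultN.xml naming used by the crawler; the structure inside each file depends on the extraction rule, so the code reports just the root tag and the number of its direct children. The function names list_result_files and summarize are hypothetical.

```python
import glob
import os
import re
import xml.etree.ElementTree as ET


def list_result_files(directory="."):
    """Return result<N>.xml paths sorted by page number N, not alphabetically."""
    pattern = re.compile(r"^result(\d+)\.xml$")
    found = []
    for path in glob.glob(os.path.join(directory, "result*.xml")):
        match = pattern.match(os.path.basename(path))
        if match:
            found.append((int(match.group(1)), path))
    return [path for _, path in sorted(found)]


def summarize(path):
    """Return (root tag, number of direct children), or None if unparsable."""
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        return None  # for example, an empty file from a page with no matches
    return root.tag, len(root)


if __name__ == "__main__":
    for path in list_result_files():
        print(path, summarize(path))
```

Sorting by the numeric page number keeps result2.xml ahead of result10.xml, which a plain alphabetical sort would not.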

5. Summary

1. GooSeeker open-source Python web crawler GitHub source

End.

Author: fullerhua (specially invited certified author of 中國統計網)

http://www.itongji.cn