Scraping Logdown Blog Post Data with Scrapy

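Preface
I have been learning Python in my spare time, and starting from something I find interesting seemed like a good approach. Web crawling is one of those interests. Following younghz's Scrapy tutorial step by step, I tried scraping post data from a Logdown blog as practice, and this post records the process.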

For now the scraper collects four fields:

Post title: article_name
Post URL: article_url
Post date: article_time
Post tags: article_tags

1. Create the project

From the command line, cd into a directory of your choice and run:

scrapy startproject LogdownBlog

2. Writing items.py

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class LogdownblogspiderItem(scrapy.Item):
    # One field per value scraped from each blog post
    article_name = scrapy.Field()  # post title
    article_url = scrapy.Field()   # post URL
    article_time = scrapy.Field()  # publication time
    article_tags = scrapy.Field()  # list of tag names

3. Writing pipelines.py

pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
import codecs

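class LogdownblogspiderPipeline(object):
    def __init__(self):
        self.file = codecs.open('LogdownBlogArticles.json', mode='w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text readable instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes; close the output file
        self.file.close()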

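The pipeline writes each item to the file LogdownBlogArticles.json as one JSON object per line. The file is opened in mode w, so every run overwrites the previous contents, and it is encoded as UTF-8.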
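
To make the one-JSON-object-per-line format concrete, here is a minimal sketch that serializes two items and parses them back. It assumes Python 3 and uses only the standard library; the sample items and the `to_json_line` helper are hypothetical and not part of the project, and the lines are kept in memory rather than written to disk.

```python
import json


def to_json_line(item):
    """Serialize one item as a single JSON line, keeping non-ASCII text readable."""
    return json.dumps(dict(item), ensure_ascii=False) + "\n"


# Hypothetical items standing in for what the spider would yield.
items = [
    {"article_name": "使用说明", "article_tags": ["quick-x"]},
    {"article_name": "Hello", "article_tags": []},
]

lines = [to_json_line(item) for item in items]

# Each line is an independent JSON document, so the output can be parsed line by line.
parsed = [json.loads(line) for line in lines]
```

Because every line parses on its own, a consumer can stream a large output file without loading it all at once.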

4. Writing settings.py

settings.py

BOT_NAME = 'LogdownBlog'

SPIDER_MODULES = ['LogdownBlog.spiders']
NEWSPIDER_MODULE = 'LogdownBlog.spiders'

COOKIES_ENABLED = False

ITEM_PIPELINES = {
	'LogdownBlog.pipelines.LogdownblogspiderPipeline':300
}

5. Writing LogdownSpider.py: the spider and page analysis

Create a file named LogdownSpider.py in the spiders folder. The directory layout now looks like this:

Project directory layout

+ LogdownBlog
|  + LogdownBlog
|  |  + spiders
|  |  |  - __init__.py
|  |  |  - LogdownSpider.py
|  |  - __init__.py
|  |  - items.py
|  |  - pipelines.py
|  |  - settings.py
|  - scrapy.cfg

LogdownSpider.py spider code

# -*- coding: utf-8 -*-

from scrapy.spiders import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from LogdownBlog.items import LogdownblogspiderItem

class LogdownSpider(Spider):
    '''Crawls a Logdown blog one post at a time by following each "next" link.'''

    name = 'LogdownSpider'

    download_delay = 1
    allowed_domains = ["childhood.logdown.com"]

    # Prompt once for the starting URL when the spider class is loaded
    first_blog_url = input("Enter the URL of your first blog post: ")
    start_urls = [
        first_blog_url
    ]

    def parse(self, response):
        sel = Selector(response)
        item = LogdownblogspiderItem()

        article_name = sel.xpath('//article[@class="post"]/h2/a/text()').extract()[0]
        article_url = sel.xpath('//article[@class="post"]/h2/a/@href').extract()[0]
        article_time = sel.xpath('//article[@class="post"]/div[@class="meta"]/div[@class="date"]/time/@datetime').extract()[0]
        article_tags = sel.xpath('//article[@class="post"]/div[@class="meta"]/div[@class="tags"]/a/text()').extract()

        # Selectors already return text strings, so no manual encoding is needed
        item['article_name'] = article_name
        item['article_url'] = article_url
        item['article_time'] = article_time
        item['article_tags'] = article_tags

        yield item

        # Follow the "next" link; a page without one ends the crawl
        nextUrls = sel.xpath('//nav[@id="pagenavi"]/a[@class="next"]/@href').extract()
        if nextUrls:
            nextUrl = nextUrls[0]
            print(nextUrl)
            yield Request(nextUrl, callback=self.parse)

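A few points to note:

Getting the XPath expressions right matters, because they determine directly which data is extracted from the page's HTML.

Working out an XPath naturally means analyzing the HTML. I used Chrome here: right-click and choose Inspect Element to see where the data you want sits in the page hierarchy. Take the link to the next post as an example.

From that structure, the XPath '//nav[@id="pagenavi"]/a[@class="next"]/@href' gives the URL of the next post.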
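
An XPath like this can be tried out against a small HTML snippet before running the whole spider. The sketch below is only an approximation: it uses the standard library's ElementTree, which supports a subset of XPath and cannot select attributes with /@href, so it locates the element first and then reads its href. The markup and the `next_post_url` helper are hypothetical, modeled on the expression above rather than copied from a real Logdown page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modeled on the XPath used in the spider.
PAGE = """
<html>
  <body>
    <nav id="pagenavi">
      <a class="prev" href="http://example.com/posts/1">Prev</a>
      <a class="next" href="http://example.com/posts/3">Next</a>
    </nav>
  </body>
</html>
"""


def next_post_url(markup):
    """Return the href of the "next" link, or None when the page has none."""
    root = ET.fromstring(markup.strip())
    # ElementTree has no /@href step, so find the <a> element and read the attribute.
    link = root.find(".//nav[@id='pagenavi']/a[@class='next']")
    return link.get("href") if link is not None else None


print(next_post_url(PAGE))
```

ElementTree only accepts well-formed XML, so for real pages Scrapy's own selectors remain the practical choice; the point here is just to show how the path walks from nav to a to the href attribute.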

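Set download_delay to reduce the load on the server and lower the risk of being banned.
yield Request(nextUrl, callback=self.parse) hands the URL of each page's "next post" link back to the engine, so the crawl proceeds from one page to the next.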
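
The "follow the next link" pattern can be pictured without Scrapy at all. The following is a plain-Python analogy under stated assumptions, not how the Scrapy engine is implemented: the `PAGES` dictionary is a hypothetical in-memory site, and the `crawl` generator plays the role of parse being called again for each next URL.

```python
# Hypothetical in-memory "site": each URL maps to (title, next URL or None).
PAGES = {
    "/posts/1": ("First post", "/posts/2"),
    "/posts/2": ("Second post", "/posts/3"),
    "/posts/3": ("Third post", None),
}


def crawl(start_url, pages):
    """Yield (url, title) for each post, following "next" links until none is left."""
    url = start_url
    seen = set()
    while url is not None and url not in seen:
        seen.add(url)  # guard against link cycles
        title, next_url = pages[url]
        yield url, title
        url = next_url


visited = list(crawl("/posts/1", PAGES))
```

The loop stops when a page has no next link, which corresponds to the spider yielding no further Request.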

6. Run

scrapy crawl LogdownSpider

A screenshot of the run is shown in the image at the top of the post. The output has the following format:

...
...
{"article_name": "  PhysicsEditorExporter for QuickCocos2dx 使用说明", "article_tags": ["exporter", "Chipmunk", "physicseditor", "quick-x"], "article_url": "http://childhood.logdown.com/posts/196165/physicseditorexporter-for-quickcocos2dx-instructions-for-use", "article_time": "2014-04-28 14:08:00 UTC"}
...
...
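
Once the file exists, each line can be loaded back for further processing. The sketch below, assuming Python 3, decodes one line in the format shown above and converts article_time into a timezone-aware datetime. The sample values and the `parse_article` helper are hypothetical; only the field names and the time format are taken from the output above.

```python
import json
from datetime import datetime, timezone

# One line in the output format shown above, with shortened, hypothetical values.
line = (
    '{"article_name": "Example post", "article_tags": ["quick-x"], '
    '"article_url": "http://example.com/posts/1", '
    '"article_time": "2014-04-28 14:08:00 UTC"}'
)


def parse_article(json_line):
    """Decode one output line and convert article_time to an aware datetime."""
    article = json.loads(json_line)
    # The trailing "UTC" is matched literally, then attached as the time zone.
    stamp = datetime.strptime(article["article_time"], "%Y-%m-%d %H:%M:%S UTC")
    article["article_time"] = stamp.replace(tzinfo=timezone.utc)
    return article


article = parse_article(line)
```

With real datetime values, the posts can then be sorted or grouped by date.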
