Scraping Logdown Blog Post Data with Scrapy

## Preface
I've been dabbling in Python in my spare time, and starting from something that interests you is a good way in; web crawling happens to be one of those interests for me. Following younghz's Scrapy tutorial step by step, I tried scraping Logdown blog post data as a practice exercise, and this post is a record of the process.

Tool: Scrapy

## Logdown blog post data

For now, four pieces of data are collected for each post:

  • Post title article_name
  • Post URL article_url
  • Post date article_time
  • Post tags article_tags

Let’s Go!

## 1. Create the project

From the command line, cd into a directory of your choice and run:

```
scrapy startproject LogdownBlog
```

## 2. Writing items.py

```python items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class LogdownblogspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    article_name = scrapy.Field()
    article_url = scrapy.Field()
    article_time = scrapy.Field()
    article_tags = scrapy.Field()
```
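In case it helps anyone new to Scrapy: once its fields are declared, an Item behaves much like a dict. A tiny illustration of my own (not part of the project code):

```python
# Illustration only: Items support dict-style access for their declared fields.
from LogdownBlog.items import LogdownblogspiderItem

item = LogdownblogspiderItem()
item['article_name'] = 'Hello Logdown'
item['article_tags'] = ['scrapy', 'python']
print dict(item)  # a regular dict holding the values set above
# item['whatever'] = 1  # would raise KeyError: only declared fields are allowed
```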
## 3. Writing pipelines.py

```python pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
import codecs

class LogdownblogspiderPipeline(object):
    def __init__(self):
        self.file = codecs.open('LogdownBlogArticles.json', mode='w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line.decode('unicode_escape'))
        return item
```

Each item is piped out to the file LogdownBlogArticles.json, opened in mode 'w' (so the file is overwritten on each run) with UTF-8 encoding, one JSON object per line. Since json.dumps escapes non-ASCII characters by default, the line is decoded with unicode_escape before being written, so the Chinese text stays readable in the file.
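As a side note of my own (not from the original tutorial), an equivalent way to get readable UTF-8 output without the unicode_escape round trip is to pass ensure_ascii=False to json.dumps, and the standard close_spider hook of item pipelines can close the file when the crawl ends. A minimal sketch under the same Python 2 / old-Scrapy assumptions:

```python
# -*- coding: utf-8 -*-
# Alternative sketch, not the original pipeline from this post.
import json
import codecs

class LogdownblogspiderPipeline(object):
    def __init__(self):
        self.file = codecs.open('LogdownBlogArticles.json', mode='w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text readable instead of \uXXXX escapes
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # called once when the spider finishes, so the file is flushed and closed
        self.file.close()
```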

## 4. Writing settings.py

```python settings.py
BOT_NAME = 'LogdownBlog'

SPIDER_MODULES = ['LogdownBlog.spiders']
NEWSPIDER_MODULE = 'LogdownBlog.spiders'

COOKIES_ENABLED = False

ITEM_PIPELINES = {
    'LogdownBlog.pipelines.LogdownblogspiderPipeline': 300
}
```
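A couple of remarks of my own on these values (the original post doesn't elaborate): the number 300 in ITEM_PIPELINES is just the pipeline's ordering priority, where lower numbers run first and 0-1000 is the customary range, and COOKIES_ENABLED = False switches off cookie handling since no login session is needed here. The per-spider download_delay used in the next step could equally be configured globally in this file, for example:

```python
# Optional: global politeness delay in seconds, equivalent to the spider-level
# download_delay attribute used in LogdownSpider.py below.
DOWNLOAD_DELAY = 1
```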

## 5. Writing LogdownSpider.py: the crawling and parsing part

Create a new file named LogdownSpider.py under the spiders folder, at which point the project tree looks like this:

Project directory layout:

```
+ LogdownBlog
| + LogdownBlog
| | + spiders
| | | - __init__.py
| | | - LogdownSpider.py
| | - __init__.py
| | - items.py
| | - pipelines.py
| | - settings.py
| - scrapy.cfg
```
The spider code:

```python LogdownSpider.py
# -*- coding: utf-8 -*-

from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from LogdownBlog.items import LogdownblogspiderItem

class LogdownSpider(Spider):
    '''LogdownSpider'''

    name = 'LogdownSpider'

    download_delay = 1
    allowed_domains = ["childhood.logdown.com"]

    first_blog_url = raw_input("Please enter the URL of your first blog post: ")
    start_urls = [
        first_blog_url
    ]

    def parse(self, response):
        sel = Selector(response)
        item = LogdownblogspiderItem()

        article_name = sel.xpath('//article[@class="post"]/h2/a/text()').extract()[0]
        article_url = sel.xpath('//article[@class="post"]/h2/a/@href').extract()[0]
        article_time = sel.xpath('//article[@class="post"]/div[@class="meta"]/div[@class="date"]/time/@datetime').extract()[0]
        article_tags = sel.xpath('//article[@class="post"]/div[@class="meta"]/div[@class="tags"]/a/text()').extract()

        item['article_name'] = article_name.encode('utf-8')
        item['article_url'] = article_url.encode('utf-8')
        item['article_time'] = article_time.encode('utf-8')
        item['article_tags'] = [n.encode('utf-8') for n in article_tags]

        yield item

        # get next article's url
        nextUrl = sel.xpath('//nav[@id="pagenavi"]/a[@class="next"]/@href').extract()[0]
        print nextUrl
        yield Request(nextUrl, callback=self.parse)
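```

One caveat I'll add of my own (the original code doesn't handle it): on the post that has no "next" link, the extract()[0] call raises an IndexError, which Scrapy logs as an error before the spider closes. A slightly more defensive ending for parse() could look like this:

```python
        # Defensive variant of the tail of parse(): only follow the "next"
        # link if the page actually has one.
        next_urls = sel.xpath('//nav[@id="pagenavi"]/a[@class="next"]/@href').extract()
        if next_urls:
            yield Request(next_urls[0], callback=self.parse)
```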

A few points worth noting:

  • Getting the XPath expressions right matters, because they directly determine whether the data we want can be pulled out of the page HTML.

Working out the XPath naturally requires digging into the HTML itself. I used Chrome here, right-clicking and choosing Inspect Element to see where the target data sits in the page structure, taking the "next post" link as the example.

So the XPath '//nav[@id="pagenavi"]/a[@class="next"]/@href' gives us the URL of the next post (see the scrapy shell sketch after this list).

  • Set download_delay to ease the load on the server and avoid getting banned.
  • yield Request(nextUrl, callback=self.parse) hands the "next post" URL found on each page back to the engine, which is what keeps the crawl looping on to the following page.
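
A handy way to try out such XPath expressions before wiring them into the spider (my own habit, not something the original post covers) is Scrapy's interactive shell, pointed at any post, e.g. the one from the sample output below:

```
scrapy shell "http://childhood.logdown.com/posts/196165/physicseditorexporter-for-quickcocos2dx-instructions-for-use"
```

and then at the prompt:

```python
# Older Scrapy shells expose a ready-made `sel` Selector; newer ones use response.xpath(...).
sel.xpath('//nav[@id="pagenavi"]/a[@class="next"]/@href').extract()
sel.xpath('//article[@class="post"]/h2/a/text()').extract()
```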

## 6. Run it

```
scrapy crawl LogdownSpider
```
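
As an aside of my own (not something the original setup uses), Scrapy's built-in feed exports can also write items straight to a file, which would make the custom pipeline optional; the filename here is just an example:

```
scrapy crawl LogdownSpider -o items.json
```

Note that, at least in the Scrapy versions of that era, the built-in JSON exporter escapes non-ASCII characters, which is exactly what the unicode_escape handling in the pipeline above avoids.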

The run looks like the screenshot at the top of this post, and the output format is as follows:

```
...
...
{"article_name": " PhysicsEditorExporter for QuickCocos2dx 使用说明", "article_tags": ["exporter", "Chipmunk", "physicseditor", "quick-x"], "article_url": "http://childhood.logdown.com/posts/196165/physicseditorexporter-for-quickcocos2dx-instructions-for-use", "article_time": "2014-04-28 14:08:00 UTC"}
...
...
```

