6/22/2016

A Web Crawler Built with Scrapy

Reference documentation

The Scrapy framework

Installation (Ubuntu 14.04)

Before running the commands below, make sure Python 2.7, lxml, OpenSSL, pip, and python-dev are already installed:
$ sudo pip install service_identity
$ sudo pip install Scrapy

Creating a new project (tj91)

Below we build a crawler that scrapes the Tongji University career information site, 同济大学就业信息网 (http://tj91.tongji.edu.cn/).

$ scrapy startproject tj91

items.py

# -*- coding: utf-8 -*-
from scrapy.item import Item, Field

class Tj91Item(Item):
    title = Field()  # announcement title (the spider appends the posting date)
    link = Field()   # absolute URL of the announcement
    desc = Field()   # declared but not populated by this spider

pipelines.py

# -*- coding: utf-8 -*-

import json
import codecs

class Tj91Pipeline(object):
    def __init__(self):
        # Text mode; codecs handles the UTF-8 encoding.
        self.file = codecs.open('tj91_data_utf8.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese text readable in the output,
        # avoiding the fragile decode("unicode_escape") round-trip.
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
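The pipeline's only subtlety is encoding: by default `json.dumps` escapes every non-ASCII character, which would make the Chinese titles unreadable in the output file. A stdlib-only sketch of the difference (the dict below stands in for a scraped item, not real crawl data):

```python
# -*- coding: utf-8 -*-
import json

# Hypothetical item, standing in for dict(item) in the pipeline.
item = {'title': u'同济大学就业信息网', 'link': 'http://tj91.tongji.edu.cn/'}

# Default: non-ASCII characters become \uXXXX escape sequences.
escaped = json.dumps(item)
assert '\\u' in escaped

# ensure_ascii=False keeps the Chinese text readable in the file.
readable = json.dumps(item, ensure_ascii=False)
assert u'同济' in readable
```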

settings.py

Add the following:

ITEM_PIPELINES = {
    'tj91.pipelines.Tj91Pipeline': 300,
}

Main program: tj91_spider.py

# -*- coding: utf-8 -*-

from scrapy.spider import Spider
from scrapy.selector import Selector
from tj91.items import Tj91Item

class Tj91Spider(Spider):
    name = "tj91"
    allowed_domains = ["tongji.edu.cn"]
    start_urls = [
        "http://tj91.tongji.edu.cn/detach.portal?.f=pe401&.pmn=view&action=bulletinsMoreView&search=true&.ia=false&.pen=pe401&groupid=20",
        "http://tj91.tongji.edu.cn/detach.portal?.f=pe401&.pmn=view&action=bulletinsMoreView&search=true&.ia=false&.pen=pe401&groupid=21",
        "http://tj91.tongji.edu.cn/detach.portal?pageIndex=2&.pmn=view&groupid=21&action=bulletinsMoreView&pageSize=&.ia=false&.pen=pe401",
        "http://tj91.tongji.edu.cn/detach.portal?pageIndex=2&.pmn=view&groupid=20&action=bulletinsMoreView&pageSize=&.ia=false&.pen=pe401"
]
  
    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@id="blpe401"]/li')
        items = []
        for site in sites:
            item = Tj91Item()

            title = site.xpath('a/text()').extract()         # announcement title
            date = site.xpath('span/span/text()').extract()  # posting date
            link = site.xpath('a/@href').extract()           # hyperlinks
            item['title'] = title + date                     # combine title and date for display
            link[1] = "http://tj91.tongji.edu.cn/" + link[1] # make the relative link absolute
            item['link'] = link[1]
            items.append(item)
        return items
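Note that `extract()` returns a list of strings, so `title + date` is list concatenation, and the relative `href` must be prefixed with the site root before it is usable. A small sketch with hypothetical extracted values (the strings below are made up for illustration):

```python
# -*- coding: utf-8 -*-
# Hypothetical extract() results for one <li> (not real crawled data).
title = [u'XX公司2016校园招聘']
date = [u'2016-06-20']
link = [u'javascript:void(0);', u'detach.portal?.pen=pe401']  # hypothetical hrefs

# List concatenation keeps both strings under the single 'title' field.
item_title = title + date
assert item_title == [u'XX公司2016校园招聘', u'2016-06-20']

# The second href is relative; prefix the site root to make it absolute.
full_link = u'http://tj91.tongji.edu.cn/' + link[1]
assert full_link.startswith(u'http://tj91.tongji.edu.cn/')
```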

Running the crawler

$ scrapy crawl tj91

Running this command generates the file tj91_data_utf8.json in the project root.
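The output file contains one JSON object per line, so it can be read back line by line. A sketch of consuming it, with two hypothetical lines standing in for real crawl results:

```python
# -*- coding: utf-8 -*-
import io
import json

# Two hypothetical lines in the tj91_data_utf8.json format (not real data).
sample = io.StringIO(
    u'{"title": ["招聘会A", "2016-06-20"], "link": "http://tj91.tongji.edu.cn/a"}\n'
    u'{"title": ["招聘会B", "2016-06-21"], "link": "http://tj91.tongji.edu.cn/b"}\n'
)

# Each line parses independently as one item.
items = [json.loads(line) for line in sample]
assert len(items) == 2
assert items[0]['link'] == u'http://tj91.tongji.edu.cn/a'
```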

Done!
