6/22/2016

A Web Crawler Built with Scrapy

Reference documentation

The Scrapy framework

Installation (Ubuntu 14.04)

Before running the commands below, make sure Python 2.7, lxml, OpenSSL, pip, and python-dev are already installed:
$ sudo pip install service_identity
$ sudo pip install Scrapy

Creating a new project (tj91)

Below we build a crawler that scrapes the Tongji University career information site, 同济大学就业信息网 (http://tj91.tongji.edu.cn/).

$ scrapy startproject tj91

items.py

# -*- coding: utf-8 -*-
from scrapy.item import Item, Field

class Tj91Item(Item):
    title = Field()  # announcement title (the spider appends the posting date)
    link = Field()   # absolute URL of the announcement
    desc = Field()   # declared but not populated by this spider

pipelines.py

# -*- coding: utf-8 -*-

import json
import codecs

class Tj91Pipeline(object):
    def __init__(self):
        # Text mode; codecs handles the UTF-8 encoding.
        self.file = codecs.open('tj91_data_utf8.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese text readable in the output,
        # avoiding the fragile decode("unicode_escape") round-trip.
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
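The pipeline's only subtlety is encoding: by default `json.dumps` escapes every non-ASCII character, which would make the Chinese titles unreadable in the output file. A stdlib-only sketch of the difference (the dict below stands in for a scraped item, not real crawl data):

```python
# -*- coding: utf-8 -*-
import json

# Hypothetical item, standing in for dict(item) in the pipeline.
item = {'title': u'同济大学就业信息网', 'link': 'http://tj91.tongji.edu.cn/'}

# Default: non-ASCII characters become \uXXXX escape sequences.
escaped = json.dumps(item)
assert '\\u' in escaped

# ensure_ascii=False keeps the Chinese text readable in the file.
readable = json.dumps(item, ensure_ascii=False)
assert u'同济' in readable
```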

settings.py

Add the following:

ITEM_PIPELINES = {
    'tj91.pipelines.Tj91Pipeline': 300,
}

Main program: tj91_spider.py

# -*- coding: utf-8 -*-

from scrapy.spider import Spider
from scrapy.selector import Selector
from tj91.items import Tj91Item

class Tj91Spider(Spider):
    name = "tj91"
    allowed_domains = ["tongji.edu.cn"]
    start_urls = [
        "http://tj91.tongji.edu.cn/detach.portal?.f=pe401&.pmn=view&action=bulletinsMoreView&search=true&.ia=false&.pen=pe401&groupid=20",
        "http://tj91.tongji.edu.cn/detach.portal?.f=pe401&.pmn=view&action=bulletinsMoreView&search=true&.ia=false&.pen=pe401&groupid=21",
        "http://tj91.tongji.edu.cn/detach.portal?pageIndex=2&.pmn=view&groupid=21&action=bulletinsMoreView&pageSize=&.ia=false&.pen=pe401",
        "http://tj91.tongji.edu.cn/detach.portal?pageIndex=2&.pmn=view&groupid=20&action=bulletinsMoreView&pageSize=&.ia=false&.pen=pe401"
]
  
    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@id="blpe401"]/li')
        items = []
        for site in sites:
            item = Tj91Item()

            title = site.xpath('a/text()').extract()         # announcement title
            date = site.xpath('span/span/text()').extract()  # posting date
            link = site.xpath('a/@href').extract()           # hyperlinks
            item['title'] = title + date                     # combine title and date for display
            link[1] = "http://tj91.tongji.edu.cn/" + link[1] # make the relative link absolute
            item['link'] = link[1]
            items.append(item)
        return items
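Note that `extract()` returns a list of strings, so `title + date` is list concatenation, and the relative `href` must be prefixed with the site root before it is usable. A small sketch with hypothetical extracted values (the strings below are made up for illustration):

```python
# -*- coding: utf-8 -*-
# Hypothetical extract() results for one <li> (not real crawled data).
title = [u'XX公司2016校园招聘']
date = [u'2016-06-20']
link = [u'javascript:void(0);', u'detach.portal?.pen=pe401']  # hypothetical hrefs

# List concatenation keeps both strings under the single 'title' field.
item_title = title + date
assert item_title == [u'XX公司2016校园招聘', u'2016-06-20']

# The second href is relative; prefix the site root to make it absolute.
full_link = u'http://tj91.tongji.edu.cn/' + link[1]
assert full_link.startswith(u'http://tj91.tongji.edu.cn/')
```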

Running the crawler

$ scrapy crawl tj91

Running this command generates the file tj91_data_utf8.json in the project root.
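The output file contains one JSON object per line, so it can be read back line by line. A sketch of consuming it, with two hypothetical lines standing in for real crawl results:

```python
# -*- coding: utf-8 -*-
import io
import json

# Two hypothetical lines in the tj91_data_utf8.json format (not real data).
sample = io.StringIO(
    u'{"title": ["招聘会A", "2016-06-20"], "link": "http://tj91.tongji.edu.cn/a"}\n'
    u'{"title": ["招聘会B", "2016-06-21"], "link": "http://tj91.tongji.edu.cn/b"}\n'
)

# Each line parses independently as one item.
items = [json.loads(line) for line in sample]
assert len(items) == 2
assert items[0]['link'] == u'http://tj91.tongji.edu.cn/a'
```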

Done!
