October 23, 2014

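Python Web Crawler (Part 2)

Last time I covered traversing every page of a website and submitting a search to Baidu, but that code still had plenty of problems.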

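Parsing HTML

Until then I had always parsed HTML with simple regular expressions, so last time I tried two new approaches: BeautifulSoup and SGMLParser. After using both, I find BeautifulSoup somewhat better, and its official documentation is thorough, which will make it easier to use going forward.

That said, both have their limitations as parsers, so they still need to be combined with regular expressions.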
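
As a rough illustration of combining a parser with a regular expression, the sketch below collects every href with a parser and then keeps only absolute http(s) links with a regex. It is written for Python 3 and uses the standard library's html.parser as a stand-in so that it runs without installing BeautifulSoup; `LinkCollector`, `extract_links`, and the sample HTML are hypothetical names and data, not part of the crawler in this post.

```python
import re
from html.parser import HTMLParser

# Keep only absolute http(s) URLs; relative and javascript: links are dropped.
ABSOLUTE_URL = re.compile(r'^https?://\w.+$')

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper)."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_links(html):
    # Step 1: the parser finds the attributes; step 2: the regex filters them.
    parser = LinkCollector()
    parser.feed(html)
    return [h for h in parser.hrefs if ABSOLUTE_URL.match(h)]

sample = ('<a href="http://example.com/a?x=1">A</a>'
          '<a href="/rel">B</a>'
          '<a href="javascript:void(0)">C</a>')
print(extract_links(sample))
```

The parser deals with the tag structure and the regex with the shape of the URL, which is roughly the division of labor described above.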

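Loops and Recursion

Last time the job was only to traverse the URLs found on a single page, so the workload was small and I used recursion, the crudest approach available.

But recursion is clearly a poor, memory-hungry fit here: every pending call holds a stack frame, and Python caps the recursion depth (1000 by default). Anyone who has printed the Fibonacci sequence with naive recursion knows it slows to a crawl after a few dozen terms, because the number of calls grows exponentially. I also heard recently that a certain company rejected a new graduate on the spot for computing Fibonacci recursively in an interview... so it is best avoided.

Python ships with a queue data structure, so driving an iterative loop with a queue is the better option for now.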
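
A minimal sketch of the queue-driven loop, assuming Python 3 and an in-memory dictionary in place of real pages (`crawl_order`, `get_links`, and `site` are hypothetical). It uses collections.deque as the FIFO queue and visits pages breadth-first without any recursive call:

```python
from collections import deque

def crawl_order(start, get_links):
    """Visit pages breadth-first without recursion; return them in visit order."""
    seen = {start}
    pending = deque([start])
    order = []
    while pending:
        page = pending.popleft()      # FIFO: the oldest discovered page comes first
        order.append(page)
        for link in get_links(page):
            if link not in seen:      # skip pages already queued or visited
                seen.add(link)
                pending.append(link)
    return order

# A tiny in-memory "site" with a cycle (a -> b -> a) standing in for real pages.
site = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}
print(crawl_order("a", lambda page: site.get(page, [])))
```

The call depth stays constant however many pages are queued, so the recursion limit never comes into play.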

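Multithreading

In a crawler, after a request is sent the CPU has to wait for the site to respond before it can do anything else: urllib2.urlopen() must return a response before the page content can be read. That waiting time adds up, so to make the crawler more efficient, pages should be fetched on multiple threads.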
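
To show why threads help with this kind of waiting, here is a small Python 3 sketch in which `fake_fetch` (a hypothetical stand-in for urlopen(url).read()) just sleeps to imitate network latency. The example URLs and the worker count are assumptions for illustration only.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    """Stand-in for a real download: sleep to imitate network latency."""
    time.sleep(0.2)
    return "content of " + url

def fetch_all(urls, fetch=fake_fetch, workers=4):
    # map() returns results in the order of urls, however the threads finish
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))

urls = ["http://example.com/%d" % i for i in range(4)]
start = time.time()
pages = fetch_all(urls)
elapsed = time.time() - start
print(len(pages), "pages in %.2f s" % elapsed)
```

With four workers the four waits overlap, so the elapsed time should be much closer to one wait than to the sum of all four.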

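Robustness

If the site's anti-crawler mechanism blocks the crawler while it is running, the resulting exception terminates the whole program, so the urllib2 calls must be wrapped in exception handling.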
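
One way to wrap the request, sketched for Python 3 (where urllib2 became urllib.request). `safe_fetch` and its `opener` parameter are hypothetical; the parameter exists only so the function can be exercised without a network connection.

```python
import urllib.error
import urllib.request

def safe_fetch(url, opener=urllib.request.urlopen, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    try:
        response = opener(url, timeout=timeout)
        try:
            return response.read()
        finally:
            response.close()
    except (OSError, ValueError) as exc:
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs
        print("skipping %s: %s" % (url, exc))
        return None

# A malformed URL fails before any network access and is simply skipped.
print(safe_fetch("not-a-url"))
```

Returning None instead of raising lets the crawl loop skip the page and move on to the next URL.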

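Also, if one IP hits a site with a large number of requests in a short time, the site's anti-crawler measures may kick in (Douban, for example). To keep the crawler alive, set an interval between requests: block the thread for a while and only put it back into the thread queue once the time is up.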
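
A possible shape for that interval, as a Python 3 sketch: a small `Throttle` class (hypothetical, as are its `clock` and `sleep` parameters) that every thread calls before sending a request, so that requests are spaced at least `min_interval` seconds apart.

```python
import threading
import time

class Throttle:
    """Space out requests so they are at least min_interval seconds apart."""
    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it
        with self._lock:
            now = self._clock()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay > 0:
            self._sleep(delay)
        return delay

throttle = Throttle(min_interval=0.1)
for i in range(3):
    throttle.wait()               # blocks until this request is allowed
    print("request", i, "sent")
```

Because the slot is reserved under a lock, several threads can share one `Throttle` without two of them picking the same slot.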

Bloom Filter

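In last time's task of collecting every URL reachable from one URL, a single run could grab several thousand URLs, but nothing guaranteed they were unique. If the links form a cycle, the crawler falls into an infinite loop, so de-duplicating the collected URLs is a problem that has to be solved.

When there is little data, a set is enough. When the data grows large, a Bloom filter is the tool for the job: it accepts a small false-positive rate in exchange for a much smaller memory footprint. The algorithm is not especially complicated, and a ready-made implementation (pybloom) already exists, which makes it easy to use.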
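
To make the idea concrete, here is a toy Bloom filter in Python 3. It is only a sketch of the technique, not how pybloom is implemented: `TinyBloom`, its default sizes, and the use of salted SHA-256 as the hash family are all assumptions chosen for brevity.

```python
import hashlib

class TinyBloom:
    """Minimal Bloom filter: k hash positions per item in an m-bit array."""
    def __init__(self, num_bits=10000, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k positions by salting one hash function with the index i
        for i in range(self.num_hashes):
            data = ("%d:%s" % (i, item)).encode("utf-8")
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = TinyBloom()
for url in ["http://example.com/", "http://example.com/a"]:
    seen.add(url)
print("http://example.com/a" in seen)
```

The filter never stores the URLs themselves, only bits, which is where the memory saving over a set comes from.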

Demo

```Python
>>> from pybloom import BloomFilter
>>> f = BloomFilter(capacity=10000, error_rate=0.001)
>>> for i in range(0, f.capacity):
...     _ = f.add(i)
...
>>> 0 in f
True
>>> f.capacity in f
False
>>> len(f) <= f.capacity
True
>>> (1.0 - (len(f) / float(f.capacity))) <= f.error_rate + 2e-18
True
>>> from pybloom import ScalableBloomFilter
>>> sbf = ScalableBloomFilter(mode=ScalableBloomFilter.SMALL_SET_GROWTH)
>>> count = 10000
>>> for i in range(0, count):
...     _ = sbf.add(i)
...
>>> sbf.capacity > count
True
>>> len(sbf) <= count
True
>>> (1.0 - (len(sbf) / float(count))) <= sbf.error_rate + 2e-18
True
```

------------

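# New Task
>Automatically submit a search request to Baidu for URLs under nuist.edu.cn that contain `?`, extract the URLs from every page of the results, and write them to a file. **This time the point is to crawl every results page, however many there are!**

## Implementation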
```Python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 script: walks the Baidu result pages one by one and crawls each in a worker thread.

import re
import sys
import time
import Queue
import urllib
import urllib2
import urlparse
import threading
from bs4 import BeautifulSoup
from pybloom import BloomFilter

# Bloom filter shared by all threads for URL de-duplication
bf = BloomFilter(1000000, 0.01)
# Guards the Bloom filter and the output file, which all threads share
lock = threading.Lock()

# Python 2 workaround: make UTF-8 the default string encoding
reload(sys)
sys.setdefaultencoding("utf-8")

# Queue of URLs waiting to be crawled, shared by all threads
url_wait = Queue.Queue(0)

class MyThread(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.url = url

    def run(self):
        # Pause before crawling so the requests are not fired in a burst
        time.sleep(5)
        traverse(self.url)

def find_nextpage(new_url):
    """Return the URL of the next result page, or None if there is none."""
    try:
        content = urllib2.urlopen(new_url).read()
    except Exception:
        return None

    soup = BeautifulSoup(content, "html.parser")

    # The "next page" link is the one whose href contains rsv_page=1
    for link in soup.find_all(href=re.compile("rsv_page=1")):
        return urlparse.urljoin("http://www.baidu.com", link.get('href'))
    return None

def traverse(url):
    url_wait.put(url)

    with open("all_url.txt", "a") as fp:
        while True:
            try:
                url = url_wait.get_nowait()
            except Queue.Empty:
                break

            with lock:
                # add() returns True if the URL was already in the filter
                if bf.add(url):
                    continue
                fp.write(url + '\n\n')
                fp.flush()

            try:
                content = urllib2.urlopen(url).read()
            except Exception:
                continue  # skip pages that fail to load

            soup = BeautifulSoup(content, "html.parser")
            for tag in soup.find_all(href=re.compile("http")):
                url_wait.put(tag.get('href'))

def main():
    url_pool = Queue.Queue(0)
    # Percent-encode the query so the literal "?" and the space survive in the URL
    query = urllib.quote("site:(nuist.edu.cn) ?")
    url_pool.put("http://www.baidu.com/s?wd=" + query)
    threads = []

    with open("target.txt", "a") as fp:
        while not url_pool.empty():
            new_url = url_pool.get()
            fp.write(new_url + "\n\n")

            # Queue the next result page only if one exists
            nextpage = find_nextpage(new_url)
            if nextpage is not None:
                url_pool.put(nextpage)

            worker = MyThread(new_url)
            worker.start()
            threads.append(worker)

    # Wait for every worker before exiting
    for worker in threads:
        worker.join()

if __name__ == '__main__':
    main()
```

October 13, 2014

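Python Web Crawler (Part 1)

Writing a crawler from scratch

I. Traversing every page of a website

Approach

To traverse every page of a website, open the target site's source, extract all the URLs from it, then visit them one by one while recording the URLs already processed.

1. Extracting URLs

Extracting usable URLs from a pile of HTML is easy, and there are many ways to do it:

Regular expressions
BeautifulSoup
The SGMLParser class from the built-in sgmllib module

Here I try the third.

2. Storing URLs

Once the URLs are extracted, the next problem is keeping the newly extracted URLs separate from the ones already processed.

My first thought was a list, but then the order in which URLs are taken out and popped became a problem, so I use Python's built-in queue data structure instead.

Processed URLs are written straight to a file and counted; the count is the number of pages visited.

3. The processing function

The next task is running the function that processes the URLs. For now I use plain recursion, the crudest option: it is inefficient and hard on memory. Later I will solve this with multithreading: put the URLs into a pool in memory and cap the number of threads allowed to run at once, which should improve efficiency and speed to some extent.

4. The code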

```Python
# -*- coding: UTF-8 -*-
# Task: find existing crawler code, work out how to traverse every page of a site, and implement it:
# traverse most pages of a site, save the URLs of the reachable pages, and count the pages visited.
import re
import Queue
import urllib2
from sgmllib import SGMLParser

url_queue = Queue.Queue(0)
url_num = 0

class find_url(SGMLParser):
    """Collect the absolute http(s) links of a page in self.url_new."""
    def __init__(self):
        SGMLParser.__init__(self)
        self.url_new = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']

        if re.match(r'^https?:/{2}\w.+$', "".join(href)):
            self.url_new.extend(href)

def open_url():
    # Count one more visited page
    global url_num
    url_num += 1

    url_given = url_queue.get()

    url_traversed.write(url_given + "\n")

    content = urllib2.urlopen(url_given).read()
    result = find_url()
    result.feed(content)
    for urls in result.url_new:
        url_queue.put(urls)

    # Naive recursion, one call per queued URL: nothing is de-duplicated,
    # and Python's recursion limit is reached on any sizeable site
    while not url_queue.empty():
        open_url()


if __name__ == '__main__':

    url = "http://movie.douban.com"

    url_traversed = open('URLSTORE.txt', 'w')

    url_queue.put(url)

    try:
        open_url()
    finally:
        # Close the file and report the count even if the crawl aborts
        url_traversed.close()
        print "The number of the traversed URL is %d" % url_num
```

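## II. Submitting a Search to Baidu
#### 1. Submitting the search
For now I still use the crudest method: directly open a URL that contains the search terms, which returns the source of the results page.

Submitting the query properly with POST and GET is left for next time.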
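
For reference, a search URL of this kind can be assembled with proper escaping rather than typed by hand. The sketch below is Python 3 and assumes only the `wd` parameter that appears in the URL used in this post; `baidu_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

def baidu_search_url(site, keyword):
    """Build a search URL for `keyword` restricted to `site`."""
    query = "site:(%s) %s" % (site, keyword)
    # urlencode percent-escapes the reserved characters in the query value
    return "http://www.baidu.com/s?" + urlencode({"wd": query})

print(baidu_search_url("nuist.edu.cn", "?"))
```

Escaping matters here because the search keyword is itself a `?`, which would otherwise be read as URL syntax.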

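#### 2. Processing the results
I use BeautifulSoup to pull out the content between `<div>` tags, but this only filters roughly and does not return the search results precisely.

#### 3. Open problems
1. Submit the search with POST and GET methods.
2. Process the returned search results in finer detail.
3. Traverse all of the search results.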

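```Python
# -*- coding: UTF-8 -*-
# Task 2: use the advanced search in Baidu's settings to search a given site for results whose URL contains "?". Implement:
#   automatically submit the search to Baidu, restricted to the given site and to URLs containing "?",
#   then extract the Baidu results and write them to a file.
#   For nuist.edu.cn, for example, this means submitting the search string site:(nuist.edu.cn) ?

# http://www.baidu.com/s?wd=
import sys
import urllib
import urllib2
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding('utf8')

def search_baidu():
    # Percent-encode the query so the literal "?" and the space survive in the URL
    query = urllib.quote("site:(nuist.edu.cn) ?")
    response = urllib2.urlopen("http://www.baidu.com/s?wd=" + query)
    html = response.read()

    soup = BeautifulSoup(html, "html.parser")

    # Only a rough filter: the text of the first <div> on the page
    first_div = soup.find(name='div')
    res = first_div.getText('\n') if first_div is not None else ''

    with open('ss.txt', 'w') as ss:
        ss.write(res)

search_baidu()
```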

August 28, 2014

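Douban Crawler

Scrapes the content under any tag in Douban Books, Movies, or Music

When I was just getting started with Python crawlers, I could not find a suitable example to learn from, however hard I looked.

I read many other people's examples but never felt I had grasped the essentials, so after a lot of tinkering I wrote this simple one.

Source code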

# -*- coding: UTF-8 -*-
# A Python 2 source file that contains non-ASCII text must declare its encoding in a comment like the one above; otherwise Python 2 assumes ASCII.

import re
import urllib2

def douban_crawler(url_head, target):
    # Open the output file once and close it automatically when done
    with open("doc_%s.txt" % target, 'a') as dou:
        for page in range(0, 1000, 20):
            # 1000 is the number of entries to retrieve; adjust as needed
            url_rear = "?start=%d&type=T" % page
            # Join the two parts into the actual URL
            url_use = url_head + url_rear
            content = urllib2.urlopen(url_use).read()

            # Remove the title attributes of site links ("listen on FM",
            # "other tags") so they are not captured as entry titles
            content = content.replace(r'title="去FM收听"', "")
            content = content.replace(r'title="去其他标签"', "")

            # Capture the titles with a regular expression
            name = re.findall(r'title="(\S*?)"', content, re.S)
            # Capture the ratings with a regular expression
            num = re.findall(r'<span\s*class="rating_nums">([0-9.]*)<\/span>', content)

            # Pair titles with ratings: [(title, rating), ...]
            for title, rating in zip(name, num):
                dou.write(title + " " + rating + "\n")

if __name__ == '__main__':
    target = raw_input("Douban book, movie, or music: which one do you want to crawl? ")
    if target not in ("book", "movie", "music"):
        raise SystemExit("target must be book, movie, or music")

    tag = raw_input("Enter the tag you want to search: ")

    url_head = "http://%s.douban.com/tag/%s" % (target, tag)

    douban_crawler(url_head, target)

    print "Crawl finished"

Results

Douban Books, tag 小说 (fiction)

月亮和六便士 9.0
百年孤独 9.2
解忧杂货店 8.7
追风筝的人 8.8
霍乱时期的爱情 9.0
平凡的世界（全三部） 9.0
围城 8.9
活着 9.1
一九八四 9.3
人生的枷锁 9.0
陆犯焉识 8.7
…

Douban Movies, tag 悬疑 (suspense)

盗梦空间 9.2
寒战 7.4
嫌疑人X的献身 7.4
七宗罪 8.7
致命ID 8.5
云图 8.0
禁闭岛 8.5
蝴蝶效应 8.6
致命魔术 8.8
恐怖游轮 8.2
…

Douban Music, tag pop

十二新作 8.2
Alright,Still 7.7
范特西 8.5
Apologize 8.9
逆光 7.3
Spin 8.5
阿岳正传 8.3
感官/世界 8.7
八度空间 7.5
PCD 8.4
…