The for loop works by iteration
  To loop over two variables at once, use the `zip` function to "pack" the two sequences together
  for i, j in zip(x, y):
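A minimal sketch of the `zip` pattern, with made-up sample lists (`names` and `scores` are hypothetical, not from these notes). It also shows that `zip` stops at the end of the shorter input.

```python
# Hypothetical sample data, chosen only for illustration.
names = ["a", "b", "c"]
scores = [90, 85, 77]

pairs = []
for name, score in zip(names, scores):
    # Each pass through the loop receives one item from each list.
    pairs.append((name, score))
    print(name, score)

# zip stops at the shorter input, so the extra 3 and 4 are dropped.
short = list(zip([1, 2, 3, 4], "xy"))
print(short)
```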
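Regex matching
`(.*?)` is a non-greedy capture group: it matches any run of characters (newlines excepted by default) and stops as early as the rest of the pattern allows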
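To make the non-greedy behaviour concrete, here is a small sketch comparing `(.*?)` with the greedy `(.*)`. The HTML fragment is a hypothetical sample, not taken from any real page.

```python
import re

# Hypothetical HTML fragment, used only for illustration.
html = '<a href="/page/1">first</a><a href="/page/2">second</a>'

# Non-greedy: each (.*?) stops at the first closing quote it can.
lazy = re.findall(r'href="(.*?)"', html)

# Greedy: (.*) runs on to the last closing quote in the string.
greedy = re.findall(r'href="(.*)"', html)

print(lazy)
print(greedy)
```

By default `.` does not match a newline. For text that spans several lines, passing `re.S` as the flags argument lets `.` match newlines too.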
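BeautifulSoup is a really great tool
Installation
None of the guides online explain this clearly, so I worked it out by trial and error: download the archive (it is just the source code) from the [official site](http://www.crummy.com/software/BeautifulSoup/bs4/download/4.2/), copy the `bs4` folder inside it, and paste it into `C:\Python27\Lib`
Straight to an example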
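One way to confirm that the copied folder is picked up is to try importing it. This is only a sketch: `bs4_version` is a hypothetical helper name, and it simply reports `None` when the import fails. On setups where pip is available, `pip install beautifulsoup4` is the usual alternative to copying the folder by hand.

```python
def bs4_version():
    """Return the installed bs4 version string, or None if bs4 is missing."""
    try:
        import bs4
    except ImportError:
        return None
    # Fall back to a placeholder if the module exposes no version attribute.
    return getattr(bs4, "__version__", "unknown")


version = bs4_version()
print("bs4 version:", version)
```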
 # -*- coding: cp936 -*-
 from bs4 import BeautifulSoup
 html_doc = """
 <html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were</p>
 <div class="thumb">000<a href="/article/94021199?list=hot&s=4720116"
 target="_blank" onclick="_hmt.push(['_trackEvent', 'post', 'click', 'signlePost'])">
 <img src="http://pic.qiushibaike.com/system/pictures/9402/94021199/medium/app94021199.jpg"
 alt="下次再扔这么高就不陪你玩了">
 </a>
 </div>
 <div class="thumb">111
 <a href="/article/94023051?list=hot&s=4720116" target="_blank"
 onclick="_hmt.push(['_trackEvent', 'post', 'click', 'signlePost'])">
 <img src="http://pic.qiushibaike.com/system/pictures/9402/94023051/medium/app94023051.jpg"
 alt="却花了247元给孩子买了这个玩具车">
 </a>
 </div>
 <div class="thumb">222
 <a href="/article/93987073?list=hot&s=4720125" target="_blank"
 onclick="_hmt.push(['_trackEvent', 'post', 'click', 'signlePost'])">
 <img src="http://pic.qiushibaike.com/system/pictures/9398/93987073/medium/app93987073.jpg"
 alt="愿姥姥身体健康">
 </a>
 </div>
 <div >3333
 <a href="/article/93987073?list=hot&s=4720125" target="_blank"
 onclick="_hmt.push(['_trackEvent', 'post', 'click', 'signlePost'])">
 <img src="http://pic.qiushibaike.com/system/pictures/9398/93987073/medium/app93987073.jpg"
 alt="愿姥姥身体健康">
 </a>
 </div>
 <p class="story">...</p>
 """
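 soup = BeautifulSoup(html_doc, "html.parser")
 # Keep only the div blocks whose class is "thumb"
 my_pic = soup.find_all('div', class_="thumb")
 for i, pic in enumerate(my_pic):
     print('---' + str(i))
     # Get the text of the div block, i.e. 000, 111, 222 (the fourth div is
     # skipped because find_all above filtered on class="thumb")
     pic_t = pic.get_text()
     print(pic_t.replace("\n", ""))
     # Get the first a tag inside the div, from <a> through </a>
     pic_img = pic.find('a')
     # Keep this handle: use it to read the tag's attributes, such as href or src.
     # Note that this is get, not find!
     p_link = pic_img.get('href')
     #print(p_link)
- Pay attention to how find_all, find, and get are used
- Triple quotes in Python
- They can enclose everything between them, so there is no need for the backslash (\) line-continuation character to join each line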
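As a small illustration of the triple-quote note, the sketch below builds the same text twice: once with backslash continuations and once with a triple-quoted string. The strings are hypothetical samples.

```python
# The same text written two ways (hypothetical sample strings).

# Without triple quotes: each line needs its own quotes, an explicit \n,
# and a trailing backslash to continue onto the next line.
joined = "<html>\n" \
         "<body>hello</body>\n" \
         "</html>"

# With triple quotes: the line breaks inside the literal are kept as-is.
triple = """<html>
<body>hello</body>
</html>"""

print(joined == triple)
```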