
Python Exercise Book: 0009

Posted on 2018-06-02 in Python Exercise Book

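Problem

    Given an HTML file, find the links in it.

Analysis

First fetch the HTML content with urllib, then use the BeautifulSoup library to parse the <a> tags in the HTML; the official documentation includes an example of extracting links this way. The two third-party packages can be installed with pip: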

pip install beautifulsoup4 lxml

Code

Using the BeautifulSoup library

"""
Given an HTML file, find the links in it.
"""

from urllib import request
from bs4 import BeautifulSoup as BS


url = "https://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000/001432688314740a0aed473a39f47b09c8c7274c9ab6aee000"

req = request.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0')
# Read the page content and decode it as UTF-8
html = request.urlopen(req).read().decode('utf-8')

soup = BS(html, 'lxml')
# Find every <a> tag
for link in soup.find_all('a'):
    # Get the value of the href attribute (None if the tag has none)
    href = link.get('href')
    # Skip noise such as "#" or "./" by keeping only links that start with "http";
    # the explicit None check replaces the original bare try/except
    if href and href.startswith('http'):
        print(href)
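
The filter above drops every relative link. If those should be kept rather than discarded, one possible approach is to resolve them against the page URL with urllib.parse.urljoin from the standard library. The following is only a sketch: the helper name absolutize, the base URL, and the sample href values are all made up for illustration and stand in for the values returned by link.get('href').

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against base_url and keep only http(s) results."""
    resolved = []
    for href in hrefs:
        # Skip missing hrefs and pure in-page anchors such as "#top"
        if not href or href.startswith("#"):
            continue
        absolute = urljoin(base_url, href)
        # Drop non-web schemes such as "javascript:" or "mailto:"
        if absolute.startswith(("http://", "https://")):
            resolved.append(absolute)
    return resolved


# Hypothetical sample values, standing in for link.get('href') results
base = "https://example.com/wiki/page"
sample = ["#top", "./intro", "/about", "https://www.python.org/", None, "javascript:void(0)"]

for link in absolutize(base, sample):
    print(link)
```

With this sample, "./intro" and "/about" are turned into absolute URLs on example.com, while the anchor, the missing value, and the javascript: pseudo-link are skipped.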

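Using regular expressions
Some links are not inside <a> tags, so a quick and crude alternative is to match the contents of every href attribute.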


from urllib import request
import re

url = "https://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000/001432688314740a0aed473a39f47b09c8c7274c9ab6aee000"

req = request.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0')
# Read the page content and decode it as UTF-8
html = request.urlopen(req).read().decode('utf-8')

# Capture every double-quoted href value that starts with "http"
href = re.findall(r'href="(http.*?)"', html)
for link in href:
    print(link)
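
The pattern above only matches double-quoted href values, so it misses attributes written with single quotes. As a possible alternative (not the method used in this post), the standard library's html.parser can collect href from every tag without a regular expression. This is a minimal sketch: the class name HrefCollector and the sample HTML are invented for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every start tag, not just <a>."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if name == "href" and value and value.startswith("http"):
                self.links.append(value)


# Hypothetical sample page: a <link> tag, a single-quoted href,
# an in-page anchor, and an <a> without any href
sample_html = """
<html><head><link rel="stylesheet" href="https://example.com/style.css"></head>
<body>
<a href='https://example.com/single-quoted'>one</a>
<a href="#top">two</a>
<a>no href</a>
</body></html>
"""

collector = HrefCollector()
collector.feed(sample_html)
for link in collector.links:
    print(link)
```

The parser handles quoting itself, so both the <link> stylesheet and the single-quoted <a> are picked up, while the "#top" anchor is filtered out by the same startswith('http') check used earlier.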
