Python Web Scraping: A Basic Guide
Web scraping with Python automates the collection of data from websites and is widely used for data gathering, market analysis, and similar tasks. The core steps are as follows:
1. Choosing the Core Libraries
```python
import requests                 # send HTTP requests
from bs4 import BeautifulSoup   # parse HTML
import pandas as pd             # store the extracted data
```
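These are third-party packages; if any is missing, it can typically be installed with `pip install requests beautifulsoup4 pandas` (add `lxml` for the faster parser used in section 4).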
2. Basic Scraping Workflow
```python
# Send the request
response = requests.get("https://example.com/books")
response.encoding = 'utf-8'  # set the text encoding

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the data
book_titles = [h2.text for h2 in soup.select('.book-title')]
book_prices = [float(div.text.strip('¥'))
               for div in soup.select('.price')]

# Store the data
df = pd.DataFrame({'书名': book_titles, '价格': book_prices})
df.to_csv('book_data.csv', index=False)
```
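Before parsing, it is worth confirming that the request actually succeeded; a minimal sketch using requests' built-in status check (the timeout value is an arbitrary choice):

```python
import requests

response = requests.get("https://example.com/books", timeout=10)
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
```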
3. Key Techniques
- Dealing with anti-scraping measures (a User-Agent rotation sketch follows this list):

  ```python
  headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
      'Cookie': 'sessionid=abc123'
  }
  response = requests.get(url, headers=headers)
  ```
- Handling dynamic pages (with Selenium; see the explicit-wait sketch after this list):

  ```python
  from selenium import webdriver
  from selenium.webdriver.common.by import By

  driver = webdriver.Chrome()
  driver.get(url)
  dynamic_content = driver.find_element(By.CLASS_NAME, 'js-loaded-data').text
  ```
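A common refinement of the fixed-header trick above is to rotate the User-Agent per request. A minimal sketch; the UA strings and URL are illustrative placeholders:

```python
import random
import requests

# Illustrative pool of User-Agent strings; keep these current in real use
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
]

url = "https://example.com/books"  # placeholder URL from the earlier example
response = requests.get(url, headers={'User-Agent': random.choice(USER_AGENTS)})
```

For the Selenium case, reading the element immediately after `driver.get` can fail if the page's JavaScript has not finished rendering; an explicit wait is the usual remedy. A sketch, assuming the same hypothetical `js-loaded-data` class:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/books")  # placeholder URL

# Block for up to 10 seconds until the dynamically loaded element appears
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'js-loaded-data'))
)
print(element.text)
driver.quit()
```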
4. Complete Example: A Douban Books Scraper
```python
def douban_spider():
    url = "https://book.douban.com/top250"
    res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(res.text, 'lxml')

    books = []
    for item in soup.select('.item'):
        title = item.select_one('.pl2 a')['title']
        rating = item.select_one('.rating_nums').text
        books.append((title, float(rating)))

    return pd.DataFrame(books, columns=['书名', '评分'])

df = douban_spider()
df.to_excel('豆瓣图书TOP250.xlsx')
```
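The function above fetches only the first page. A possible extension, assuming the list is paginated via a `start` query parameter (verify this against the live site before relying on it):

```python
import random
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

def douban_spider_all_pages():
    rows = []
    # Assumed pagination scheme: 10 pages of 25 entries addressed via ?start=N
    for start in range(0, 250, 25):
        url = f"https://book.douban.com/top250?start={start}"
        res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        soup = BeautifulSoup(res.text, 'lxml')
        for item in soup.select('.item'):
            title = item.select_one('.pl2 a')['title']
            rating = float(item.select_one('.rating_nums').text)
            rows.append((title, rating))
        time.sleep(random.uniform(1, 3))  # polite delay between page requests
    return pd.DataFrame(rows, columns=['书名', '评分'])
```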
5. Important Notes
- Follow the rules (see the robots.txt sketch after this list):
  - Check robots.txt (e.g. https://site.com/robots.txt)
  - Space out your requests: `time.sleep(random.uniform(1, 3))`
- Exception handling:

  ```python
  try:
      response = requests.get(url, timeout=10)
  except (requests.ConnectionError, requests.Timeout) as e:
      print(f"Request failed: {e}")
  ```
- Data cleaning:

  ```python
  import re

  # Collapse runs of whitespace into single spaces
  clean_text = re.sub(r'\s+', ' ', raw_text).strip()
  ```
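For the robots.txt check above, the standard library's `urllib.robotparser` can automate the decision; a minimal sketch:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://book.douban.com/robots.txt")
rp.read()

# Only fetch a URL if robots.txt allows it for our user agent
target = "https://book.douban.com/top250"
if rp.can_fetch("Mozilla/5.0", target):
    print(f"Allowed to fetch {target}")
else:
    print(f"Disallowed by robots.txt: {target}")
```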
Tip: for complex sites, the Scrapy framework is recommended; its built-in asynchronous processing, item pipelines, and middleware can significantly improve crawling efficiency.
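For orientation, a minimal Scrapy spider has roughly this shape; the URL and `.book-title` selector are placeholders carried over from the example in section 2:

```python
import scrapy

class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://example.com/books"]  # placeholder URL

    def parse(self, response):
        # Yield one item per matched title; Scrapy handles request
        # scheduling, retries, and item pipelines behind the scenes
        for title in response.css(".book-title::text").getall():
            yield {"title": title.strip()}
```

It can be run without creating a full project via `scrapy runspider spider.py -o books.json`.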
