Python Web Scraping: A Basic Guide
Web scraping with Python automates the collection of data from websites and is widely used for data gathering, market analysis, and similar tasks. The core steps are as follows:
1. Choosing the Core Libraries
```python
import requests                 # send HTTP requests
from bs4 import BeautifulSoup   # parse HTML
import pandas as pd             # store the extracted data
```
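These are third-party packages; if any is missing, it can typically be installed with `pip install requests beautifulsoup4 pandas` (add `lxml` for the faster parser used in section 4).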
2. Basic Scraping Workflow
```python
# Send the request
response = requests.get("https://example.com/books")
response.encoding = 'utf-8'  # set the text encoding

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the data
book_titles = [h2.text for h2 in soup.select('.book-title')]
book_prices = [float(div.text.strip('¥'))
               for div in soup.select('.price')]

# Store the data
df = pd.DataFrame({'书名': book_titles, '价格': book_prices})
df.to_csv('book_data.csv', index=False)
```
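Before parsing, it is worth confirming that the request actually succeeded; a minimal sketch using requests' built-in status check (the timeout value is an arbitrary choice):

```python
import requests

response = requests.get("https://example.com/books", timeout=10)
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
```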
3. Key Techniques
- Dealing with anti-scraping measures (a User-Agent rotation sketch follows this list):

  ```python
  headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
      'Cookie': 'sessionid=abc123'
  }
  response = requests.get(url, headers=headers)
  ```
- Handling dynamic pages (with Selenium; see the explicit-wait sketch after this list):

  ```python
  from selenium import webdriver
  from selenium.webdriver.common.by import By

  driver = webdriver.Chrome()
  driver.get(url)
  dynamic_content = driver.find_element(By.CLASS_NAME, 'js-loaded-data').text
  ```
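A common refinement of the fixed-header trick above is to rotate the User-Agent per request. A minimal sketch; the UA strings and URL are illustrative placeholders:

```python
import random
import requests

# Illustrative pool of User-Agent strings; keep these current in real use
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
]

url = "https://example.com/books"  # placeholder URL from the earlier example
response = requests.get(url, headers={'User-Agent': random.choice(USER_AGENTS)})
```

For the Selenium case, reading the element immediately after `driver.get` can fail if the page's JavaScript has not finished rendering; an explicit wait is the usual remedy. A sketch, assuming the same hypothetical `js-loaded-data` class:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/books")  # placeholder URL

# Block for up to 10 seconds until the dynamically loaded element appears
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'js-loaded-data'))
)
print(element.text)
driver.quit()
```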
4. Complete Example: A Douban Books Scraper
```python
def douban_spider():
    url = "https://book.douban.com/top250"
    res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(res.text, 'lxml')

    books = []
    for item in soup.select('.item'):
        title = item.select_one('.pl2 a')['title']
        rating = item.select_one('.rating_nums').text
        books.append((title, float(rating)))

    return pd.DataFrame(books, columns=['书名', '评分'])

df = douban_spider()
df.to_excel('豆瓣图书TOP250.xlsx')
```
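The function above fetches only the first page. A possible extension, assuming the list is paginated via a `start` query parameter (verify this against the live site before relying on it):

```python
import random
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

def douban_spider_all_pages():
    rows = []
    # Assumed pagination scheme: 10 pages of 25 entries addressed via ?start=N
    for start in range(0, 250, 25):
        url = f"https://book.douban.com/top250?start={start}"
        res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        soup = BeautifulSoup(res.text, 'lxml')
        for item in soup.select('.item'):
            title = item.select_one('.pl2 a')['title']
            rating = float(item.select_one('.rating_nums').text)
            rows.append((title, rating))
        time.sleep(random.uniform(1, 3))  # polite delay between page requests
    return pd.DataFrame(rows, columns=['书名', '评分'])
```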
5. Important Notes
- Follow the rules (see the robots.txt sketch after this list):
  - Check robots.txt (e.g. https://site.com/robots.txt)
  - Space out your requests: `time.sleep(random.uniform(1, 3))`
- Exception handling:

  ```python
  try:
      response = requests.get(url, timeout=10)
  except (requests.ConnectionError, requests.Timeout) as e:
      print(f"Request failed: {e}")
  ```
- Data cleaning:

  ```python
  import re

  # Collapse runs of whitespace into single spaces
  clean_text = re.sub(r'\s+', ' ', raw_text).strip()
  ```
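For the robots.txt check above, the standard library's `urllib.robotparser` can automate the decision; a minimal sketch:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://book.douban.com/robots.txt")
rp.read()

# Only fetch a URL if robots.txt allows it for our user agent
target = "https://book.douban.com/top250"
if rp.can_fetch("Mozilla/5.0", target):
    print(f"Allowed to fetch {target}")
else:
    print(f"Disallowed by robots.txt: {target}")
```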
Tip: for complex sites, the Scrapy framework is recommended; its built-in asynchronous processing, item pipelines, and middleware can significantly improve crawling efficiency.
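For orientation, a minimal Scrapy spider has roughly this shape; the URL and `.book-title` selector are placeholders carried over from the example in section 2:

```python
import scrapy

class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://example.com/books"]  # placeholder URL

    def parse(self, response):
        # Yield one item per matched title; Scrapy handles request
        # scheduling, retries, and item pipelines behind the scenes
        for title in response.css(".book-title::text").getall():
            yield {"title": title.strip()}
```

It can be run without creating a full project via `scrapy runspider spider.py -o books.json`.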
