Learning Python, PTT Beauty Board Crawler Exercise: Multi-Page Crawling - bnn00023's Blog

These are my study notes. The article I worked through this time is: http://blog.castman.net/%E6%95%99%E5%AD%B8/2016/12/19/python-data-science-tutorial-1.html

1. Use requests to fetch the page

import requests

def get_web_page(url):
    resp = requests.get(
        url=url,
        cookies={'over18': '1'}  # PTT's over-18 confirmation cookie
    )
    if resp.status_code != 200:  # 200 means the request succeeded
        print('Invalid url:', resp.url)
        return None
    else:
        return resp.text

The return value is the full HTML source of the page, or None if the request did not succeed.
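Since the crawler ends up requesting many pages, one possible variation is to store the over18 cookie once on a requests.Session and reuse that session for every request. This is only a sketch: the names make_ptt_session and get_web_page_with_session, the session parameter, and the 10-second timeout are assumptions for illustration and are not part of the tutorial's code.

```python
import requests


def make_ptt_session():
    """Create a session that sends the over18 cookie with every request."""
    session = requests.Session()
    session.cookies.set('over18', '1')
    return session


def get_web_page_with_session(url, session, timeout=10):
    """Return the page HTML, or None on a network error or non-200 status."""
    try:
        resp = session.get(url, timeout=timeout)
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, malformed URLs, and so on.
        print('Request failed:', exc)
        return None
    if resp.status_code != 200:
        print('Invalid url:', resp.url)
        return None
    return resp.text
```

Usage would look like session = make_ptt_session() followed by get_web_page_with_session('https://www.ptt.cc/bbs/Beauty/index.html', session). Catching requests.RequestException keeps a single failed request from crashing a long crawl.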

2. Use BeautifulSoup (bs4) to extract the contents of tags in the page, storing each post's title, href, and push_count in articles

        if d.find('a'):  # a link is present, so the post exists and has not been deleted
            href = d.find('a')['href']
            title = d.find('a').string
            articles.append({
                'title': title,
                'href': href,
                'push_count': push_count
            })

A separate entry with the key last stores the path of the previous page

    for link in soup.find_all('a', 'btn wide'):  # look for the "previous page" button
        if link.text == '‹ 上頁':
            articles.append({
                'last': link['href']
            })
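To show how the two fragments above might fit together, here is a rough sketch of a complete get_articles run against a small hand-written HTML string. The sample HTML, the r-ent and nrec class names, and the way push_count is read (kept as raw text) are assumptions based on how PTT's index pages are commonly laid out; the fragments above do not show these parts.

```python
from bs4 import BeautifulSoup

# Hypothetical sample imitating a PTT index page (not real data).
SAMPLE_HTML = """
<div class="btn-group btn-group-paging">
  <a class="btn wide" href="/bbs/Beauty/index1.html">最舊</a>
  <a class="btn wide" href="/bbs/Beauty/index1999.html">‹ 上頁</a>
</div>
<div class="r-ent">
  <div class="nrec"><span class="hl f3">12</span></div>
  <div class="title"><a href="/bbs/Beauty/M.1.A.001.html">sample post</a></div>
</div>
<div class="r-ent">
  <div class="nrec"></div>
  <div class="title">(deleted post)</div>
</div>
"""


def get_articles(dom):
    soup = BeautifulSoup(dom, 'html.parser')
    articles = []
    for d in soup.find_all('div', 'r-ent'):  # assumed: one r-ent div per post
        link = d.find('a')
        if not link:
            continue  # no link means the post was deleted
        nrec = d.find('div', 'nrec')  # assumed: push count lives in div.nrec
        articles.append({
            'title': link.get_text(strip=True),
            'href': link['href'],
            'push_count': nrec.get_text(strip=True) if nrec else '',
        })
    for link in soup.find_all('a', 'btn wide'):
        # Only the button labelled "‹ 上頁" points to the previous page.
        if link.get_text(strip=True) == '‹ 上頁' and link.has_attr('href'):
            articles.append({'last': link['href']})
    return articles


result = get_articles(SAMPLE_HTML)
```

With this sample, result holds one article entry followed by one entry carrying the last key; the deleted post is skipped because it has no link.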

The code below is the main program, which walks backwards through the board one page at a time

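PTT_URL = 'https://www.ptt.cc'
url = PTT_URL + '/bbs/Beauty/index.html'
articles = []
for _ in range(1000):  # maximum number of pages to crawl
    page = get_web_page(url)
    if not page:
        break  # stop if the request failed
    next_url = None
    for item in get_articles(page):
        if 'last' in item:
            next_url = PTT_URL + item['last']  # link to the previous (older) page
        else:
            articles.append(item)
    if next_url is None:
        break  # no previous-page link, so there is nothing left to fetch
    url = next_url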
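The pagination logic can also be tried without touching the network by passing the fetch and parse steps in as parameters. The sketch below does that with a tiny fake site; crawl_pages, FAKE_SITE, fake_fetch, fake_parse, and the delay parameter are all hypothetical names made up for this illustration.

```python
import time
from urllib.parse import urljoin


def crawl_pages(start_url, fetch, parse, max_pages=1000, delay=0.0):
    """Follow 'last' links from start_url, collecting article entries."""
    url = start_url
    articles = []
    for _ in range(max_pages):
        page = fetch(url)
        if not page:
            break  # fetch failed
        next_url = None
        for item in parse(page):
            if 'last' in item:
                # Resolve the site-relative path against the current URL.
                next_url = urljoin(url, item['last'])
            else:
                articles.append(item)
        if next_url is None:
            break  # oldest page reached
        url = next_url
        time.sleep(delay)  # optional pause between requests
    return articles


# Fake three-page site: each URL maps to what the parser would return.
FAKE_SITE = {
    'https://www.ptt.cc/bbs/Beauty/index.html':
        [{'title': 'c'}, {'last': '/bbs/Beauty/index2.html'}],
    'https://www.ptt.cc/bbs/Beauty/index2.html':
        [{'title': 'b'}, {'last': '/bbs/Beauty/index1.html'}],
    'https://www.ptt.cc/bbs/Beauty/index1.html':
        [{'title': 'a'}],
}


def fake_fetch(url):
    # The "page" is just the URL itself when it exists in the fake site.
    return url if url in FAKE_SITE else None


def fake_parse(page):
    return FAKE_SITE[page]


all_titles = [a['title'] for a in crawl_pages(
    'https://www.ptt.cc/bbs/Beauty/index.html', fake_fetch, fake_parse)]
first_two_pages = [a['title'] for a in crawl_pages(
    'https://www.ptt.cc/bbs/Beauty/index.html', fake_fetch, fake_parse,
    max_pages=2)]
```

In a real run, fetch and parse would be get_web_page and get_articles, and a non-zero delay would space out the requests.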

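Finishing my first crawler felt quite rewarding. That said, PTT's page structure is about as simple as it gets, so next I want to try crawling more complex sites. I have heard that content rendered with JavaScript and AJAX cannot be fetched this way, and that some sites use anti-crawling measures; both sound like real challenges. Image downloads could also be sped up with multithreading. I still have a lot to learn.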
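As a first look at the multithreading idea, the standard library's ThreadPoolExecutor can run a download function over many URLs at once. This is only a sketch of the pattern: download_all and fake_download are hypothetical names, the URLs are placeholders, and a real version would replace fake_download with a function that actually fetches and saves an image.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, download, max_workers=4):
    """Call download(url) for every URL using a pool of worker threads.

    Results come back in the same order as urls.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download, urls))


def fake_download(url):
    # Stand-in for a real image download; returns the file name part only.
    return url.rsplit('/', 1)[-1]


names = download_all(
    ['https://example.com/a.jpg', 'https://example.com/b.jpg'],
    fake_download,
)
```

Threads suit this job because downloading is mostly waiting on the network, so several requests can overlap.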