BeautifulSoup4 Cheat Sheet (Selectors and More)

8 years ago

A cheat-sheet style summary of the selectors I use most often when scraping with BeautifulSoup4 and extracting elements.

How to use BeautifulSoup4

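BeautifulSoup4 comes up a lot when scraping, so this post summarizes the APIs I use most and how to write selectors.
Two things I keep forgetting and end up asking myself "how did that work again?": BeautifulSoup4 has no XPath-based selectors, and it has no feature that takes a URL and sends the HTTP request for you.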

Installation

Install it with pip as beautifulsoup4, or under its alias bs4.
Note: pip install BeautifulSoup installs the old BeautifulSoup3, so be careful.

$ pip install beautifulsoup4
or 
$ pip install bs4
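
To confirm that the 4.x series was installed rather than the old BeautifulSoup3, one quick check is to import the package and look at its version string. A minimal sketch, assuming the install succeeded in the current environment:

```python
import bs4

# beautifulsoup4 is imported under the name bs4;
# the old BeautifulSoup3 is not importable under this name.
version = bs4.__version__
print(version)
```

If the import fails, the package was most likely installed into a different Python environment than the one running the script.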

Cheat sheet: creating a BeautifulSoup object

Description	Code example
Create a soup object (HTML string)	`BeautifulSoup('html string', 'html.parser')`
Create a soup object (file)	`BeautifulSoup(file_handle, 'html.parser')`

The first argument is an HTML string or a file handle.
The second argument names the parser library. The usual choices are html.parser and lxml.

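html.parser is part of the Python standard library, so it works right away with nothing to install, but it is slow.
lxml is fast, but it is a C-dependent library that has to be installed separately, so whether it is available depends on the runtime environment.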
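
Because lxml may or may not be present in a given environment, one possible approach is to try it first and fall back to html.parser. The sketch below relies on bs4 raising FeatureNotFound when the requested parser is not installed; make_soup is a hypothetical helper name, not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the standard-library parser."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # lxml is not installed in this environment
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<html><body><p>hoge</p></body></html>')
print(soup.p.string)  # hoge
```

The two parsers can build slightly different trees from broken HTML, so a fallback like this is safest when the input markup is reasonably well formed.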

from bs4 import BeautifulSoup
# Create from an HTML string
soup = BeautifulSoup('<html><body>hoge</body></html>', 'html.parser')

# Create from a file handle
with open('index.html') as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')

# There is no API that sends an HTTP request to a URL and fetches the page
# Use a separate module such as requests to fetch it
import requests
res = requests.get('http://b.hatena.ne.jp/hotentry')
soup = BeautifulSoup(res.content, 'html.parser')
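
The requests example above passes res.content, which is bytes rather than str. BeautifulSoup accepts bytes and tries to detect the encoding on its own; when the encoding is already known, it can be given through the from_encoding argument. A small sketch, with an in-memory byte string standing in for a real response body (the sample markup is made up for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in for res.content: UTF-8 encoded bytes, not a str
html_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><p>スクレイピング</p></body></html>'
).encode('utf-8')

# from_encoding is only a hint for bytes input; omit it to let bs4 detect the encoding
soup = BeautifulSoup(html_bytes, 'html.parser', from_encoding='utf-8')
text = soup.p.string
print(text)
```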

BeautifulSoup API cheat sheet

Description	Code example
Child element	`soup.head`
Find all by tag	`soup.find_all('li')`
Find first match	`soup.find('li')`
Find by attribute	`soup.find('a', href="http://www.google.com/")`
Find by class	`soup.find('a', class_="first")`
Get an attribute	`first_link_element['href']`
Text content	`first_link_element.string`
Parent element	`first_link_element.parent`

from bs4 import BeautifulSoup

soup = BeautifulSoup('''
<html>
    <head><title>example</title></head>
    <body>
        <ul>
            <li><a href="http://example.com/" class="first">example.com</a></li>
            <li><a href="http://www.google.com/" data="123">google.com</a></li>
        </ul>
    </body>
</html>''', 'html.parser')

# Walk down to a child element
print(soup.head)        # <head><title>example</title></head>
# The title under head
print(soup.head.title)  # <title>example</title>

# find_all searches the whole tree and returns a list
li_elements = soup.find_all("li")
print(li_elements)  # [<li><a class="first" href="http://example.com/">example.com</a></li>,
                    #  <li><a data="123" href="http://www.google.com/">google.com</a></li>]
# Returns an empty list when nothing matches
nothing_images = soup.find_all('img')
print(nothing_images)   # []

# find returns the first match
first_li_element = soup.find('li')
print(first_li_element) # <li><a class="first" href="http://example.com/">example.com</a></li>

# Returns None when nothing matches
nothing_image_element = soup.find('img')
print(nothing_image_element)    # None

# Add an attribute to the search condition
anchors_to_google = soup.find_all("a", href="http://www.google.com/")
print(anchors_to_google)    # [<a data="123" href="http://www.google.com/">google.com</a>]

# class is a reserved word in Python, so append an underscore
first_link_element = soup.find("a", class_="first")
print(first_link_element)   # <a class="first" href="http://example.com/">example.com</a>

# Get an attribute
print(first_link_element['href'])   # http://example.com/
# Text content
print(first_link_element.string)    # example.com

# Parent element
parent = first_link_element.parent
print(parent)   # <li><a class="first" href="http://example.com/">example.com</a></li>
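
A few edge cases of the calls above are easy to trip over: tag['name'] raises KeyError when the attribute is missing, find() returns None when nothing matches, and .string is None when a tag has more than one child. The sketch below shows one way to handle each, using a trimmed copy of the same sample markup (written on one line so that no whitespace text nodes get in the way):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<ul>'
    '<li><a href="http://example.com/" class="first">example.com</a></li>'
    '<li><a href="http://www.google.com/" data="123">google.com</a></li>'
    '</ul>',
    'html.parser',
)

first_link = soup.find('a', class_='first')

# tag.get() returns None for a missing attribute instead of raising KeyError
missing = first_link.get('data')
print(missing)  # None

# find() may return None, so check before indexing into the result
image = soup.find('img')
image_src = image['src'] if image is not None else None
print(image_src)  # None

# .string is None for a tag with several children; get_text() joins all nested text
ul = soup.find('ul')
print(ul.string)      # None
print(ul.get_text())  # example.comgoogle.com

# Collect one attribute from every match
hrefs = [a['href'] for a in soup.find_all('a')]
print(hrefs)  # ['http://example.com/', 'http://www.google.com/']
```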

BeautifulSoup CSS selector cheat sheet

Description	Code example
Find by tag	`soup.select('li')`
Find first match	`soup.select_one('li')`
Find by attribute	`soup.select('a[href="http://www.google.com/"]')`
Attribute exists	`soup.select('a[data]')`
Find by class	`soup.select('a.first')`

# Search for every matching tag and return a list
li_elements = soup.select("li")
print(li_elements)  # [<li><a class="first" href="http://example.com/">example.com</a></li>,
                    #  <li><a data="123" href="http://www.google.com/">google.com</a></li>]

# select_one returns the first match
first_li_element = soup.select_one('li')
print(first_li_element) # <li><a class="first" href="http://example.com/">example.com</a></li>
# Returns None when nothing matches
nothing_image_element = soup.select_one('img')
print(nothing_image_element)    # None

# Search by attribute
anchors_to_google = soup.select('a[href="http://www.google.com/"]')
print(anchors_to_google)    # [<a data="123" href="http://www.google.com/">google.com</a>]
# The data attribute exists
print(soup.select('a[data]'))  # [<a data="123" href="http://www.google.com/">google.com</a>]

# Select by class
first_link_element = soup.select_one("a.first")
print(first_link_element)   # <a class="first" href="http://example.com/">example.com</a>
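
Selectors can also be combined. As a hedged sketch, assuming a bs4 release recent enough to ship with the Soup Sieve selector engine, the same sample markup can be queried with a child combinator, an attribute prefix match, and a positional pseudo-class:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<ul>'
    '<li><a href="http://example.com/" class="first">example.com</a></li>'
    '<li><a href="http://www.google.com/" data="123">google.com</a></li>'
    '</ul>',
    'html.parser',
)

# Child combinator: only <a> tags directly inside <li> directly inside <ul>
hrefs = [a['href'] for a in soup.select('ul > li > a')]
print(hrefs)  # ['http://example.com/', 'http://www.google.com/']

# Attribute prefix match: href values starting with "http://www."
www_links = soup.select('a[href^="http://www."]')
print(len(www_links))  # 1

# Positional pseudo-class: the link inside the second <li>
second_text = soup.select_one('li:nth-of-type(2) > a').string
print(second_text)  # google.com
```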
