いるかのボックス: The difference between text and string in Beautiful Soup 4

I use Beautiful Soup 4 for scraping with Python 3, and there are two attributes for getting the text inside a tag: text and string. Take the following example.

from bs4 import BeautifulSoup
# A div whose only child is a text node
htmlData = "<div>Sample Text</div>"
htmlParsed = BeautifulSoup( htmlData, "html.parser" )
div = htmlParsed.find( "div" )
print( div.text)
print( div.string)

Running the code above gives the same result for both text and string:

Sample Text
Sample Text

However, there are cases where the results differ, so I looked into what the difference actually is.

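text is an ordinary Python string (str in Python 3; the unicode type in Python 2). For string, the Beautiful Soup 4 documentation gives the following explanation (the numbered restatements after each quote are mine).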

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

① If a tag has exactly one child and that child is a NavigableString, .string returns that child.

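NavigableString is a class specific to Beautiful Soup 4. As far as I can tell, it is a Python string type (a subclass of str in Python 3) with Beautiful Soup's features for navigating the HTML tag tree added on top.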
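To make that concrete, here is a small check of the types involved. It is only a sketch, assuming Python 3 and the built-in html.parser; the variable names are mine and not from the documentation.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<div>Sample Text</div>", "html.parser")
div = soup.find("div")

# .text is a plain Python str built by joining the tag's text nodes
text_is_plain_str = type(div.text) is str

# .string is a NavigableString, which is itself a subclass of str
string_is_navigable = isinstance(div.string, NavigableString)
string_is_also_str = isinstance(div.string, str)

# Unlike a plain str, a NavigableString knows where it sits in the tree
parent_name = div.string.parent.name

print(text_is_plain_str, string_is_navigable, string_is_also_str, parent_name)
```

Because NavigableString subclasses str, the value from string can be passed to most code that expects a string, but it also keeps a reference back into the parsed tree.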

If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:

② If a tag has exactly one child, that child is another tag, and that tag has a .string, the parent's .string is the same as the child's.

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:

③ If a tag has more than one child, there is no single obvious target for .string, so it is defined as None.
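The three rules quoted above can be read as a small recursive procedure. The sketch below is my own restatement of them, not the library's source: resolve_string is a hypothetical helper name, and it assumes that a tag with no children at all also yields None.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def resolve_string(tag):
    """Hypothetical helper that mirrors the three documented rules for .string."""
    children = tag.contents
    if len(children) != 1:
        # Rule 3 (and, by assumption, the empty tag): no single candidate
        return None
    only_child = children[0]
    if isinstance(only_child, NavigableString):
        # Rule 1: the only child is a text node
        return only_child
    if isinstance(only_child, Tag):
        # Rule 2: the only child is a tag, so defer to that tag
        return resolve_string(only_child)
    return None


samples = [
    "<div>Sample Text</div>",
    "<div><span>Sample Text</span></div>",
    "<div>Sample <b>Text</b></div>",
    "<div></div>",
]
results = []
for html in samples:
    div = BeautifulSoup(html, "html.parser").find("div")
    results.append((resolve_string(div), div.string))

for mine, actual in results:
    print(mine, actual)
```

Each printed pair should match, which suggests the number of direct children is what decides the outcome.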

So what string returns depends on how the tag's children are arranged. The documentation explains this with code, but it still did not quite click for me.

So I wrote some code to check it myself.

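from bs4 import BeautifulSoup

# Four divs: a bare text child, a single child tag, mixed children, and no children
htmlData = """
<div>Sample Text</div>
<div><span>Sample Text</span></div>
<div>Sample <b>Text</b></div>
<div></div>
"""
htmlParsed = BeautifulSoup( htmlData, "html.parser" )

# find_all is the current name; findAll is the older alias
divs = htmlParsed.find_all( "div" )
for div in divs:
    print( "----------------------------------")
    # The whole element
    print( "original: %s" % div )
    # Its direct children
    print( "contents: %s" % div.contents)
    # Output of text
    print( "text: %s" % div.text )
    # Output of string
    print( "string: %s" % div.string )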

The children are available as a list via contents. The results are as follows.

----------------------------------
original: <div>Sample Text</div>
contents: ['Sample Text']
text: Sample Text
string: Sample Text
----------------------------------
original: <div><span>Sample Text</span></div>
contents: [<span>Sample Text</span>]
text: Sample Text
string: Sample Text
----------------------------------
original: <div>Sample <b>Text</b></div>
contents: ['Sample ', <b>Text</b>]
text: Sample Text
string: None
----------------------------------
original: <div></div>
contents: []
text:
string: None

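I printed four patterns; from the top, the first three correspond to cases ① through ③, and the fourth is a bonus. The output makes the difference between text and string clear: text always returns the concatenated text of all descendants, while string returns None whenever the tag does not have exactly one child. Case ③ was the one I found hard to grasp, but this cleared it up. In the end, if the goal is simply to get the text, text looks like the better choice.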
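As a follow-up to that conclusion, here is a minimal sketch of a few related ways to pull text out of the mixed-children case. It uses get_text() and stripped_strings, which are part of Beautiful Soup 4 but were not covered in my experiment above; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>Sample <b>Text</b></div>", "html.parser")
div = soup.find("div")

# .text is shorthand for calling get_text() with no arguments
same_as_text = div.get_text()

# A separator shows where one text node ends and the next begins
joined = div.get_text(separator="|")

# stripped_strings yields each text node with surrounding whitespace removed
pieces = list(div.stripped_strings)

# If string is preferred when available, fall back to get_text() on None
fallback = div.string if div.string is not None else div.get_text()

print(repr(same_as_text), repr(joined), pieces, repr(fallback))
```

The fallback line is one way to keep using string where it is defined without having None leak into later string handling.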

Wednesday, June 29, 2016