いるかのボックス: Building a scraping environment on Raspberry Pi 3 Model B + CentOS7

Thursday, January 5, 2017

Building a scraping environment on Raspberry Pi 3 Model B + CentOS7

I installed CentOS7 on a Raspberry Pi 3 Model B, so here I set up an environment for scraping with Python3 and BeautifulSoup4. CentOS7 ships with Python2 by default but not Python3, so the first step is to install Python3.

Installing Python3

Install yum-utils, which provides the yum-builddep command.

$ su -
# yum install yum-utils

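Use the yum-builddep command to install the packages needed to build Python. The command below targets the distribution's python package, and its build dependencies are reused here for building Python3 from source.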

# yum-builddep python

Install make so that the make command is available (it is not present right after installing CentOS7).

# yum install make

Download the Python3 source archive into a suitable directory.

# cd /usr/local/src
# curl -O https://www.python.org/ftp/python/3.5.2/Python-3.5.2.tgz

Build and install Python3.

# tar zxvf Python-3.5.2.tgz
# cd Python-3.5.2
# ./configure
# make
# make install

Confirm the installation.

# python3 -V
Python 3.5.2

pip is installed along with it.

# pip3 -V
pip 8.1.1 from /usr/local/lib/python3.5/site-packages (python 3.5)
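
Because CentOS7 still has Python2 as its default python, a script can also guard itself against being started with the wrong interpreter. The following is a minimal sketch of such a check; the helper name `meets_minimum` and the (3, 5) threshold are assumptions chosen for illustration and are not part of the original setup.

```python
import sys


def meets_minimum(version_info, minimum=(3, 5)):
    """Return True if the (major, minor) part of version_info is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    # sys.version_info describes the interpreter that is running this script
    if not meets_minimum(sys.version_info):
        sys.exit("Python 3.5 or later is required, found " + sys.version.split()[0])
    print("Running on Python", sys.version.split()[0])
```

Started with the system Python2, the script exits with the error message instead of failing later on a Python3-only import.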

Installing BeautifulSoup4

Next, install BeautifulSoup4.

# pip3 install beautifulsoup4
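
To confirm that the package is visible to Python3, one option is to ask the import system whether the module can be found. This is a small sketch rather than a required step; the helper name `module_available` is hypothetical. Note that the beautifulsoup4 package is imported under the module name `bs4`.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module called `name` can be located for import."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # beautifulsoup4 is installed under the module name "bs4"
    if module_available("bs4"):
        print("bs4 is available")
    else:
        print("bs4 was not found; try: pip3 install beautifulsoup4")
```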

Scraping

With the environment in place, let's try scraping a page. The code below fetches the title of the post "Raspberry PiとPythonでスクレイピングをする" and the text of the span tags inside the article.

# Load the libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup

# URL of the page to scrape
target = "http://irukanobox.blogspot.jp/2016/06/raspberry-pipython.html"

# Read the data from the given URL
html_data = urlopen(target).read()
# Parse the data, specifying "html.parser" as the parser
html_parsed = BeautifulSoup(html_data, "html.parser")

# Print the text of the h3 tag whose class is "entry-title"
print("*** h3タグのテキスト ***")
print(html_parsed.find("h3", class_="entry-title").text.strip())
print()

# find_all returns every matching element as a list (findAll is the older alias)
# Print the text of each span tag inside the blog post body
print("*** spanタグのテキスト ***")
for span in html_parsed.find("div", class_="entry-content").find("div").find_all("span"):
    print(span.text)

Running it prints the title and the text of the span tags inside the article, as shown below.

$ python3 test.py
*** h3タグのテキスト ***
Raspberry PiとPythonでスクレイピングをする

*** spanタグのテキスト ***
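
The script above calls urlopen without a timeout or any error handling, so a network problem either ends in a traceback or can leave the script waiting. The following is a minimal sketch of a more defensive fetch, assuming that returning None on failure is acceptable; the helper name `fetch_html` and the 10-second default timeout are illustrative choices, not part of the original script.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        # The response object returned by urlopen works as a context manager
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError as err:
        # URLError, HTTPError and socket timeouts are all subclasses of OSError
        print("Failed to fetch", url, ":", err)
        return None
```

The returned bytes can be passed to BeautifulSoup(html_data, "html.parser") in the same way as in the script, after checking that the result is not None.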
