いるかのボックス: Improving the look of histograms with matplotlib in Python

With matplotlib in Python, you can easily create a histogram from a DataFrame.

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

%matplotlib inline
from IPython.core.pylabtools import figsize
import matplotlib.pyplot as plt

figsize(11, 9)

# Load the scikit-learn iris dataset
iris = load_iris()
# Convert the iris dataset to a pandas DataFrame
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])

colx = df.columns[0]
# Draw the histogram
plt.hist(df[colx], edgecolor='black', linewidth=1.0)
plt.grid(True)
plt.show()

Note: the data is the same scikit-learn iris dataset used in "Creating a labeled scatter plot with matplotlib in Python" (Pythonのmatplotlibでラベル付き散布図を作成する).

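However, the bars and the tick marks do not line up, so the result does not look great. By default hist divides the data range into 10 equal-width bins, so the bar edges land on values that do not coincide with the axis ticks.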
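To see where the odd bar edges come from, the following minimal sketch computes the default bin edges with np.histogram, which also uses 10 equal-width bins when given bins=10. The `values` array is a made-up sample, not the iris data; it only assumes a minimum of 4.3 and a maximum of 7.9.

```python
import numpy as np

# Hypothetical sample: only the minimum (4.3) and maximum (7.9) matter here
values = np.array([4.3, 4.9, 5.1, 5.8, 6.3, 6.7, 7.2, 7.9])

# With bins=10, the range [min, max] is split into 10 equal-width bins
counts, edges = np.histogram(values, bins=10)

bin_width = edges[1] - edges[0]
print(np.round(edges, 2))   # the 11 bin edges
print(round(bin_width, 2))  # width of each bin
```

Because the width is (max - min) / 10, the edges only fall on round numbers when the data range happens to allow it.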

So here we create a histogram whose bars and ticks line up properly and, as a bonus, shows each frequency above its bar.

Environment

Bash on Ubuntu on Windows

$ cat /etc/issue
Ubuntu 16.04.3 LTS \n \l
$ python3 -V
Python 3.5.2
$ pip3 show scikit-learn pandas matplotlib
Name: scikit-learn
Version: 0.19.1
...
Name: pandas
Version: 0.22.0
...
Name: matplotlib
Version: 2.1.0
...

Jupyter Notebook

$ jupyter --version
4.4.0

Aggregating the data

We use the sepal length (cm) column of the scikit-learn iris dataset and plot the frequency of each 0.5-wide interval as a histogram. First, count the frequencies per 0.5 interval.

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load the scikit-learn iris dataset
iris = load_iris()
# Convert the iris dataset to a pandas DataFrame
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])

colx = df.columns[0]

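# Find the integer bounds that enclose sepal length (cm)
x_min = np.floor(df[colx].min())
x_max = np.ceil(df[colx].max())

# Count the observations in each 0.5-wide interval
# np.arange excludes its stop value, so extend it by one step to keep x_max as the last edge
bins = np.arange(x_min, x_max + 0.5, 0.5)
freq = df.groupby(pd.cut(df[colx], bins))[colx].size()
print(freq)

The result is as follows.

sepal length (cm)
(4.0, 4.5]     5
(4.5, 5.0]    27
(5.0, 5.5]    27
(5.5, 6.0]    30
(6.0, 6.5]    31
(6.5, 7.0]    18
(7.0, 7.5]     6
(7.5, 8.0]     6
Name: sepal length (cm), dtype: int64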
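Two details of this kind of aggregation are easy to trip over: np.arange leaves out its stop value, and pd.cut builds right-closed intervals by default and returns NaN for values outside the edges. The sketch below illustrates both with made-up values; the `sample` Series and the two edge arrays are assumptions for illustration, not part of the code above.

```python
import numpy as np
import pandas as pd

# Illustrative values; 7.9 lies above 7.5 and 4.5 sits exactly on an edge
sample = pd.Series([4.3, 4.5, 5.0, 7.9])

# np.arange excludes the stop value: the last edge here is 7.5, not 8.0
short_edges = np.arange(4, 8, 0.5)
# Extending the stop by one step keeps 8.0 as the last edge
full_edges = np.arange(4, 8 + 0.5, 0.5)

# Values outside the edges become NaN and drop out of the counts
short_bins = pd.cut(sample, short_edges)
full_bins = pd.cut(sample, full_edges)

print(short_bins.isna().sum())  # number of values that fell outside the edges
print(full_bins)                # 4.5 lands in (4.0, 4.5], the right-closed interval
```

If left-closed intervals are preferred, pd.cut accepts right=False.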

Creating the histogram

Create the histogram from the frequencies aggregated per 0.5 interval, using the bar method instead of the hist method.

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

%matplotlib inline
from IPython.core.pylabtools import figsize
import matplotlib.pyplot as plt

figsize(11, 9)

# Load the scikit-learn iris dataset
iris = load_iris()
# Convert the iris dataset to a pandas DataFrame
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])

colx = df.columns[0]

# Find the integer bounds that enclose sepal length (cm)
x_min = np.floor(df[colx].min())
x_max = np.ceil(df[colx].max())

# Count the observations in each 0.5-wide interval
# np.arange excludes its stop value, so extend it by one step to keep x_max as the last edge
bins = np.arange(x_min, x_max + 0.5, 0.5)
freq = df.groupby(pd.cut(df[colx], bins))[colx].size()

fig = plt.figure(figsize=(11, 9))
ax = fig.add_subplot(1, 1, 1)

# Draw the histogram
# Use the bar method rather than the hist method
xpos = np.arange(len(freq))
ax.bar(xpos, freq.values, width=1.0, alpha=0.8, edgecolor='black', linewidth=1.0)

# Show each frequency above its bar
for i, count in enumerate(freq.values):
    ax.text(i, count + 0.5, str(count), horizontalalignment='center', color='black', fontweight='bold')

# Set the x-axis tick labels
# Pin the ticks to the bar centers first; otherwise the labels are assigned to
# matplotlib's automatic ticks and end up shifted by one position
ax.set_xticks(xpos)
xticks = ['{:.1f} - {:.1f}'.format(iv.left, iv.right) for iv in freq.index]
ax.set_xticklabels(xticks, rotation=45, size='small')

plt.grid(True)
plt.show()

The bars and the ticks now line up, and each frequency is displayed above its bar.
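The step that matters most for the alignment is pinning the tick positions to the bar centers before assigning labels. Here is a minimal sketch of just that step. It assumes hard-coded counts and labels in place of the aggregated iris data, and the non-interactive Agg backend so that it also runs outside Jupyter; both are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no notebook needed
import matplotlib.pyplot as plt

# Illustrative counts and labels standing in for the aggregated frequencies
counts = [5, 27, 27, 30]
labels = ["4.0 - 4.5", "4.5 - 5.0", "5.0 - 5.5", "5.5 - 6.0"]
positions = list(range(len(counts)))

fig, ax = plt.subplots()
# width=1.0 makes neighbouring bars touch, as in a histogram
ax.bar(positions, counts, width=1.0, edgecolor="black", linewidth=1.0)

# Fix the ticks to the bar centers, then label exactly those ticks
ax.set_xticks(positions)
ax.set_xticklabels(labels, rotation=45, size="small")

# Write each count just above its bar
for x, count in zip(positions, counts):
    ax.text(x, count + 0.5, str(count), horizontalalignment="center")

shown = [t.get_text() for t in ax.get_xticklabels()]
print(shown)
```

Calling set_xticklabels without set_xticks assigns the labels to whatever ticks matplotlib chose automatically, which is a likely reason labels can appear shifted.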


Saturday, April 28, 2018
