[Kaggle] Porto Seguro's Safe Driver Prediction
Porto Seguro's Safe Driver Prediction
Data Description
Porto Seguro's Safe Driver Prediction
- Features that belong to similar groups are tagged in the feature name (e.g. ind, reg, car, calc).
- Binary features carry the postfix "bin" and categorical features the postfix "cat". Features without these postfixes are either continuous or ordinal.
- A value of -1 indicates that the feature was missing for that observation.
- The target column indicates whether or not that policyholder filed a claim.
Loading packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
import plotly.offline as py
import plotly.graph_objs as go
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
import os
import warnings
warnings.filterwarnings('ignore')
Loading data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head()
 | id | target | ps_ind_01 | ps_ind_02_cat | ps_ind_03 | ps_ind_04_cat | ps_ind_05_cat | ps_ind_06_bin | ps_ind_07_bin | ps_ind_08_bin | ... | ps_calc_11 | ps_calc_12 | ps_calc_13 | ps_calc_14 | ps_calc_15_bin | ps_calc_16_bin | ps_calc_17_bin | ps_calc_18_bin | ps_calc_19_bin | ps_calc_20_bin
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 7 | 0 | 2 | 2 | 5 | 1 | 0 | 0 | 1 | 0 | ... | 9 | 1 | 5 | 8 | 0 | 1 | 1 | 0 | 0 | 1 |
1 | 9 | 0 | 1 | 1 | 7 | 0 | 0 | 0 | 0 | 1 | ... | 3 | 1 | 1 | 9 | 0 | 1 | 1 | 0 | 1 | 0 |
2 | 13 | 0 | 5 | 4 | 9 | 1 | 0 | 0 | 0 | 1 | ... | 4 | 2 | 7 | 7 | 0 | 1 | 1 | 0 | 1 | 0 |
3 | 16 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | 0 | ... | 2 | 2 | 4 | 9 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 17 | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 3 | 1 | 1 | 3 | 0 | 0 | 0 | 1 | 1 | 0 |
5 rows × 59 columns
Looking at the variables, we have:

- binary variables
- categorical variables whose category values are integers
- other variables with integer or float values
- variables where -1 indicates a missing value
- the target variable and an ID variable

In other words, the dataset contains binary variables, categorical variables with integer-coded categories, and other numeric variables; some variables use -1 to mark missing values, and the target and identifier (ID) columns are also present.
Metadata
To make variable handling easier, we store meta-information about each variable in a DataFrame. This is convenient when selecting specific variables for analysis, visualization, or modeling.

- role : input variable (input), identifier (id), or target variable (target)
- level : binary, nominal, real, or integer (matching the levels assigned in the code below)
- keep : True or False
- dtype : integer (int), floating point (float), or string (str)
data = []
for f in train.columns:
    if f == 'target':
        role = 'target'
    elif f == 'id':
        role = 'id'
    else:
        role = 'input'

    if 'bin' in f or f == 'target':
        level = 'binary'
    elif 'cat' in f or f == 'id':
        level = 'nominal'
    elif train[f].dtype == float:
        level = 'real'
    elif train[f].dtype == int:
        level = 'integer'

    keep = True
    if f == 'id':
        keep = False

    dtype = train[f].dtype

    f_dict = {
        'varname': f,
        'role': role,
        'level': level,
        'keep': keep,
        'dtype': dtype
    }
    data.append(f_dict)
meta = pd.DataFrame(data, columns=['varname', 'role', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace=True)
pd.DataFrame({'count' : meta.groupby(['role', 'level'])['role'].size()}).reset_index()
 | role | level | count
---|---|---|---
0 | id | nominal | 1 |
1 | input | binary | 17 |
2 | input | integer | 16 |
3 | input | nominal | 14 |
4 | input | real | 10 |
5 | target | binary | 1 |
Descriptive statistics
Real variables
v = meta[(meta.level == 'real') & meta.keep].index
train[v].describe()
 | ps_reg_01 | ps_reg_02 | ps_reg_03 | ps_car_12 | ps_car_13 | ps_car_14 | ps_car_15 | ps_calc_01 | ps_calc_02 | ps_calc_03
---|---|---|---|---|---|---|---|---|---|---
count | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 |
mean | 0.610991 | 0.439184 | 0.551102 | 0.379945 | 0.813265 | 0.276256 | 3.065899 | 0.449756 | 0.449589 | 0.449849 |
std | 0.287643 | 0.404264 | 0.793506 | 0.058327 | 0.224588 | 0.357154 | 0.731366 | 0.287198 | 0.286893 | 0.287153 |
min | 0.000000 | 0.000000 | -1.000000 | -1.000000 | 0.250619 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.400000 | 0.200000 | 0.525000 | 0.316228 | 0.670867 | 0.333167 | 2.828427 | 0.200000 | 0.200000 | 0.200000 |
50% | 0.700000 | 0.300000 | 0.720677 | 0.374166 | 0.765811 | 0.368782 | 3.316625 | 0.500000 | 0.400000 | 0.500000 |
75% | 0.900000 | 0.600000 | 1.000000 | 0.400000 | 0.906190 | 0.396485 | 3.605551 | 0.700000 | 0.700000 | 0.700000 |
max | 0.900000 | 1.800000 | 4.037945 | 1.264911 | 3.720626 | 0.636396 | 3.741657 | 0.900000 | 0.900000 | 0.900000 |
reg variables

- ps_reg_03 has missing values.
- The value ranges differ between variables (scaling could be applied).

car variables

- ps_car_12 and ps_car_14 have missing values.
- The value ranges differ between variables (scaling could be applied).

calc variables

- No missing values.
- The maximum value appears to be 0.9.
- The three calc variables have very similar distributions.
Integer variables
v = meta[(meta.level == 'integer') & (meta.keep)].index
train[v].describe()
 | ps_ind_01 | ps_ind_03 | ps_ind_14 | ps_ind_15 | ps_car_11 | ps_calc_04 | ps_calc_05 | ps_calc_06 | ps_calc_07 | ps_calc_08 | ps_calc_09 | ps_calc_10 | ps_calc_11 | ps_calc_12 | ps_calc_13 | ps_calc_14
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 |
mean | 1.900378 | 4.423318 | 0.012451 | 7.299922 | 2.346072 | 2.372081 | 1.885886 | 7.689445 | 3.005823 | 9.225904 | 2.339034 | 8.433590 | 5.441382 | 1.441918 | 2.872288 | 7.539026 |
std | 1.983789 | 2.699902 | 0.127545 | 3.546042 | 0.832548 | 1.117219 | 1.134927 | 1.334312 | 1.414564 | 1.459672 | 1.246949 | 2.904597 | 2.332871 | 1.202963 | 1.694887 | 2.746652 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 2.000000 | 0.000000 | 5.000000 | 2.000000 | 2.000000 | 1.000000 | 7.000000 | 2.000000 | 8.000000 | 1.000000 | 6.000000 | 4.000000 | 1.000000 | 2.000000 | 6.000000 |
50% | 1.000000 | 4.000000 | 0.000000 | 7.000000 | 3.000000 | 2.000000 | 2.000000 | 8.000000 | 3.000000 | 9.000000 | 2.000000 | 8.000000 | 5.000000 | 1.000000 | 3.000000 | 7.000000 |
75% | 3.000000 | 6.000000 | 0.000000 | 10.000000 | 3.000000 | 3.000000 | 3.000000 | 9.000000 | 4.000000 | 10.000000 | 3.000000 | 10.000000 | 7.000000 | 2.000000 | 4.000000 | 9.000000 |
max | 7.000000 | 11.000000 | 4.000000 | 13.000000 | 3.000000 | 5.000000 | 6.000000 | 10.000000 | 9.000000 | 12.000000 | 7.000000 | 25.000000 | 19.000000 | 10.000000 | 13.000000 | 23.000000 |
- Only ps_car_11 has missing values.
- Scaling could be applied if needed (see the sketch below).
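The scaling remarks above only matter for scale-sensitive models such as logistic regression; the tree-based models used later do not need it. A minimal sketch of standardizing the kept real and integer inputs with the StandardScaler imported above (illustrative only, not applied in the rest of this notebook):

# Illustrative only: standardize the kept real/integer input columns on a copy of train.
scale_cols = meta[meta.level.isin(['real', 'integer']) & meta.keep & (meta.role == 'input')].index
scaler = StandardScaler()
train_scaled = train.copy()
train_scaled[scale_cols] = scaler.fit_transform(train_scaled[scale_cols])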
Binary variables
v = meta[(meta.level == 'binary') & (meta.keep)].index
train[v].describe()
 | target | ps_ind_06_bin | ps_ind_07_bin | ps_ind_08_bin | ps_ind_09_bin | ps_ind_10_bin | ps_ind_11_bin | ps_ind_12_bin | ps_ind_13_bin | ps_ind_16_bin | ps_ind_17_bin | ps_ind_18_bin | ps_calc_15_bin | ps_calc_16_bin | ps_calc_17_bin | ps_calc_18_bin | ps_calc_19_bin | ps_calc_20_bin
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 |
mean | 0.036448 | 0.393742 | 0.257033 | 0.163921 | 0.185304 | 0.000373 | 0.001692 | 0.009439 | 0.000948 | 0.660823 | 0.121081 | 0.153446 | 0.122427 | 0.627840 | 0.554182 | 0.287182 | 0.349024 | 0.153318 |
std | 0.187401 | 0.488579 | 0.436998 | 0.370205 | 0.388544 | 0.019309 | 0.041097 | 0.096693 | 0.030768 | 0.473430 | 0.326222 | 0.360417 | 0.327779 | 0.483381 | 0.497056 | 0.452447 | 0.476662 | 0.360295 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 |
max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
plt.figure(figsize = (40, 30))
n = 0
for i in range(0, len(v)):
    plt.subplot(3, 6, n+1)
    sns.countplot(train[v[i]])
    plt.xlabel(train[v].columns[i], fontsize = 30)
    n += 1
plt.show()
- Looking at the means of the binary variables, the target mean is about 0.0365, so 0 and 1 are very unevenly distributed.
- The means of the other binary variables also show that most of their values are 0.
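The imbalance can be quantified directly; a quick check (roughly 96.4% zeros and 3.6% ones, consistent with the mean above):

print(train['target'].value_counts(normalize=True))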
Exploratory Data Visualization
Categorical Variables
v = meta[(meta.level == 'nominal') & (meta.keep)].index
l = []
for f in v:
    cat_perc = train[[f, 'target']].groupby([f], as_index=False).mean()
    cat_perc.sort_values(by='target', ascending=False, inplace=True)
    l.append(cat_perc)
plt.figure(figsize = (40, 30))
n = 0
for i in range(0, len(l)):
    plt.subplot(4, 3, n+1)
    sns.barplot(x = l[i].columns[0], y = 'target', data = l[i])
    plt.ylabel('% target', fontsize=18)
    plt.xlabel(l[i].columns[0], fontsize=18)
    plt.tick_params(axis='both', which='major', labelsize=18)
    n += 1
plt.show()
The bar plots show which variables contain missing values. Rather than imputing them, it looks better to keep the missing value (-1) as a category of its own, since those categories tend to have a high target rate.
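As an illustration (ps_car_07_cat is picked arbitrarily here), the claim rate per category, with -1 kept as its own group, can be checked with a simple groupby:

print(train.groupby('ps_car_07_cat')['target'].mean())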
Real Variables
def corr_heatmap(v):
    correlations = train[v].corr()

    # Create color map ranging between two colors
    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    fig, ax = plt.subplots(figsize=(10, 10))
    sns.heatmap(correlations, cmap=cmap, vmax=1.0, center=0, fmt='.2f',
                square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .75})
    plt.show()
v = meta[(meta.level == 'real') & (meta.keep)].index
corr_heatmap(v)
Highly correlated variable pairs:

- ps_reg_02 and ps_reg_03 (0.7)
- ps_car_12 and ps_car_13 (0.67)
- ps_car_12 and ps_car_14 (0.58)
- ps_car_13 and ps_car_15 (0.67)
To keep the plots fast, we take a 10% sample of the data and look at the highly correlated pairs.
s = train.sample(frac=0.1)
sns.lmplot(x='ps_reg_02', y='ps_reg_03', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.title('ps_reg_02 and ps_reg_03')
plt.show()
sns.lmplot(x='ps_car_12', y='ps_car_13', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.title('ps_car_12 and ps_car_13')
plt.show()
sns.lmplot(x='ps_car_12', y='ps_car_14', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.title('ps_car_12 and ps_car_14')
plt.show()
sns.lmplot(x='ps_car_15', y='ps_car_13', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.title('ps_car_15 and ps_car_13')
plt.show()
The highly correlated variables could be compressed with principal component analysis (PCA), but since only a few variables are strongly correlated, we can instead leave them in and let the model-based feature selection used later deal with them. A small PCA sketch follows for reference.
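A minimal PCA sketch on the correlated real variables, shown only for reference and not used anywhere else in this notebook (note that the -1 placeholders have not been imputed yet at this point):

from sklearn.decomposition import PCA

# Illustrative only: project the correlated real variables onto two principal components.
corr_vars = ['ps_reg_02', 'ps_reg_03', 'ps_car_12', 'ps_car_13', 'ps_car_15']
pca = PCA(n_components=2)
components = pca.fit_transform(StandardScaler().fit_transform(train[corr_vars]))
print(pca.explained_variance_ratio_)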
Integer Variables
v = meta[(meta.level == 'integer') & (meta.keep)].index
corr_heatmap(v)
There are no strongly correlated integer variables, but we can still look at how they behave when grouped by the target value, as sketched below.
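For example, the per-class means of the integer variables give a quick view of how the two target groups differ (a simple sketch; v still holds the integer variables here):

print(train.groupby('target')[v].mean().T)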
Feature Importance
from sklearn.ensemble import RandomForestClassifier
import plotly.graph_objs as go
rf = RandomForestClassifier(n_estimators=150, max_depth=8, min_samples_leaf=4, max_features=0.2, n_jobs=-1, random_state=0)
rf.fit(train.drop(['id', 'target'],axis=1), train.target)
features = train.drop(['id', 'target'],axis=1).columns.values
print("----- Training Done -----")
----- Training Done -----
x, y = (list(x) for x in zip(*sorted(zip(rf.feature_importances_, features),
reverse = False)))
trace2 = go.Bar(
x=x ,
y=y,
marker=dict(
color=x,
colorscale = 'Viridis',
reversescale = True
),
name='Random Forest Feature importance',
orientation='h',
)
layout = dict(
title='Barplot of Feature importances',
width = 900, height = 2000,
yaxis=dict(
showgrid=False,
showline=False,
showticklabels=True,
# domain=[0, 0.85],
))
fig1 = go.Figure(data=[trace2])
fig1['layout'].update(layout)
py.iplot(fig1, filename='plots')
Handling imbalanced classes
Since we saw above that records with target = 1 are rare, we can either

- oversample records with target = 1, or
- undersample records with target = 0.

Here the records with target = 0 are undersampled so that positives make up about 10% of the data (see the derivation below).
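The undersampling rate used in the code follows from requiring that, after keeping all nb_1 positive records and a fraction r of the nb_0 negative records, positives make up desired_apriori of the data:

desired_apriori = nb_1 / (nb_1 + r * nb_0)  =>  r = ((1 - desired_apriori) * nb_1) / (nb_0 * desired_apriori)

which is exactly the undersampling_rate computed below.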
desired_apriori = 0.10
idx_0 = train[train.target == 0].index
idx_1 = train[train.target == 1].index
nb_0 = len(train.loc[idx_0])
nb_1 = len(train.loc[idx_1])
undersampling_rate = ((1-desired_apriori)*nb_1)/(nb_0*desired_apriori)
undersampled_nb_0 = int(undersampling_rate*nb_0)
print('Rate to undersample records with target=0: {}'.format(undersampling_rate))
print('Number of records with target=0 after undersampling: {}'.format(undersampled_nb_0))
undersampled_idx = shuffle(idx_0, random_state=37, n_samples=undersampled_nb_0)
idx_list = list(undersampled_idx) + list(idx_1)
train = train.loc[idx_list].reset_index(drop=True)
Rate to undersample records with target=0: 0.34043569687437886
Number of records with target=0 after undersampling: 195246
Data Quality Checks
Checking missing values
vars_with_missing = []

for f in train.columns:
    missings = train[train[f] == -1][f].count()
    if missings > 0:
        vars_with_missing.append(f)
        missings_perc = missings / train.shape[0]

        print('Variable {} has {} records ({:.2%}) with missing values'.format(f, missings, missings_perc))

print('In total, there are {} variables with missing values'.format(len(vars_with_missing)))
Variable ps_ind_02_cat has 216 records (0.04%) with missing values
Variable ps_ind_04_cat has 83 records (0.01%) with missing values
Variable ps_ind_05_cat has 5809 records (0.98%) with missing values
Variable ps_reg_03 has 107772 records (18.11%) with missing values
Variable ps_car_01_cat has 107 records (0.02%) with missing values
Variable ps_car_02_cat has 5 records (0.00%) with missing values
Variable ps_car_03_cat has 411231 records (69.09%) with missing values
Variable ps_car_05_cat has 266551 records (44.78%) with missing values
Variable ps_car_07_cat has 11489 records (1.93%) with missing values
Variable ps_car_09_cat has 569 records (0.10%) with missing values
Variable ps_car_11 has 5 records (0.00%) with missing values
Variable ps_car_12 has 1 records (0.00%) with missing values
Variable ps_car_14 has 42620 records (7.16%) with missing values
In total, there are 13 variables with missing values
- ps_car_03_cat and ps_car_05_cat have too many missing values -> drop them.
- ps_reg_03 is about 18% missing -> impute with the mean.
- ps_car_11 has only 5 missing records -> impute with the mode.
- ps_car_14 is about 7% missing -> impute with the mean.
- …
vars_to_drop = ['ps_car_03_cat', 'ps_car_05_cat']
train.drop(vars_to_drop, inplace = True, axis = 1)
test.drop(vars_to_drop, inplace = True, axis = 1)
meta.loc[(vars_to_drop), 'keep'] = False
mean_imp = SimpleImputer(missing_values = -1, strategy = 'mean')
mode_imp = SimpleImputer(missing_values = -1, strategy = 'most_frequent')
train['ps_reg_03'] = mean_imp.fit_transform(train[['ps_reg_03']]).ravel()
train['ps_car_12'] = mean_imp.fit_transform(train[['ps_car_12']]).ravel()
train['ps_car_14'] = mean_imp.fit_transform(train[['ps_car_14']]).ravel()
train['ps_car_11'] = mode_imp.fit_transform(train[['ps_car_11']]).ravel()
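Target encoding ps_car_11_cat

ps_car_11_cat has far more distinct categories than the other categorical variables, so instead of turning it into dummy variables it is replaced below by a smoothed per-category mean of the target, with a small amount of noise added to reduce overfitting (the target_encode helper).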
# Add multiplicative noise to a series
def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

def target_encode(trn_series=None,
                  tst_series=None,
                  target=None,
                  min_samples_leaf=1,
                  smoothing=1,
                  noise_level=0):
    """
    Smoothing is computed like in the following paper by Daniele Micci-Barreca
    https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
    trn_series : training categorical feature as a pd.Series
    tst_series : test categorical feature as a pd.Series
    target : target data as a pd.Series
    min_samples_leaf (int) : minimum samples to take category average into account
    smoothing (int) : smoothing effect to balance categorical average vs prior
    """
    # Check that the training series matches the target length and shares its name with the test series
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name
    # Concatenate the training series and the target
    temp = pd.concat([trn_series, target], axis=1)
    # Compute the target mean and the count per category
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
    # Compute the smoothing factor
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
    # Use the overall target mean as the prior
    prior = target.mean()
    # Blend the per-category mean with the prior, then drop the helper columns
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis=1, inplace=True)
    # Map the encoded averages back onto the training series
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=trn_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index, so restore it
    ft_trn_series.index = trn_series.index
    # Build the same encoding for the test series
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index, so restore it
    ft_tst_series.index = tst_series.index
    return add_noise(ft_trn_series, noise_level), add_noise(ft_tst_series, noise_level)
train_encoded, test_encoded = target_encode(train["ps_car_11_cat"],
test["ps_car_11_cat"],
target=train.target,
min_samples_leaf=100,
smoothing=10,
noise_level=0.01)
train['ps_car_11_cat_te'] = train_encoded
train.drop('ps_car_11_cat', axis=1, inplace=True)
meta.loc['ps_car_11_cat','keep'] = False # Updating the meta
test['ps_car_11_cat_te'] = test_encoded
test.drop('ps_car_11_cat', axis=1, inplace=True)
Checking the cardinality of the categorical variables
Cardinality is the number of distinct values a variable takes.
Since we will create dummy variables from the categorical features, we need to check whether any of them have a large number of distinct values; such variables would generate many dummy columns and should be handled differently.
v = meta[(meta.level == 'nominal') & (meta.keep)].index
for f in v:
    dist_values = train[f].value_counts().shape[0]
    print('Variable {} has {} distinct values'.format(f, dist_values))
Variable ps_ind_02_cat has 5 distinct values
Variable ps_ind_04_cat has 3 distinct values
Variable ps_ind_05_cat has 8 distinct values
Variable ps_car_01_cat has 13 distinct values
Variable ps_car_02_cat has 3 distinct values
Variable ps_car_04_cat has 10 distinct values
Variable ps_car_06_cat has 18 distinct values
Variable ps_car_07_cat has 3 distinct values
Variable ps_car_08_cat has 2 distinct values
Variable ps_car_09_cat has 6 distinct values
Variable ps_car_10_cat has 3 distinct values
Feature Engineering
Dummy variables
The category values of the categorical variables do not represent any order or magnitude, so we handle them by creating dummy variables.
v = meta[(meta.level == 'nominal') & (meta.keep)].index
print('Before dummification we have {} variables in train'.format(train.shape[1]))
train = pd.get_dummies(train, columns = v, drop_first = True)
print('After dummification we have {} variables in train'.format(train.shape[1]))
test = pd.get_dummies(test, columns = v, drop_first = True)
print('After dummification we have {} variables in test'.format(test.shape[1]))
Before dummification we have 57 variables in train
After dummification we have 109 variables in train
After dummification we have 108 variables in test
Derived Variables
Derived variables can be created with PolynomialFeatures.
Presumably these are interaction terms for the real variables, the ones whose correlations we inspected in the heatmap above.
v = meta[(meta.level == 'real') & (meta.keep)].index
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
interactions_train = pd.DataFrame(data=poly.fit_transform(train[v]), columns=poly.get_feature_names(v))
interactions_test = pd.DataFrame(data=poly.transform(test[v]), columns=poly.get_feature_names(v))
interactions_train.drop(v, axis=1, inplace=True)  # Remove the original columns
interactions_test.drop(v, axis=1, inplace=True)  # Remove the original columns
# Concat the interaction variables to the train and test data
print('Before creating interactions we have {} variables in train'.format(train.shape[1]))
train = pd.concat([train, interactions_train], axis=1)
print('After creating interactions we have {} variables in train'.format(train.shape[1]))
test = pd.concat([test, interactions_test], axis=1)
print('After creating interactions we have {} variables in test'.format(test.shape[1]))
Before creating interactions we have 109 variables in train
After creating interactions we have 164 variables in train
After creating interactions we have 163 variables in test
Keep only the important features, using the random forest feature importances with SelectFromModel.
X_train = train.drop(['id', 'target'], axis=1)
feat_labels = X_train.columns
rf = RandomForestClassifier(n_estimators=150, max_depth=8, min_samples_leaf=4, max_features=0.2, n_jobs=-1, random_state=0)
rf.fit(X_train, train['target'])  # refit the forest on the expanded feature set so that prefit=True is valid
sfm = SelectFromModel(rf, threshold='median', prefit=True)
print('Number of features before selection: {}'.format(X_train.shape[1]))
n_features = sfm.transform(X_train).shape[1]
print('Number of features after selection: {}'.format(n_features))
selected_vars = list(feat_labels[sfm.get_support()])
Number of features before selection: 162
Number of features after selection: 81
train = train[selected_vars + ['target']]
Prepare the Data for Model
Drop calc columns
The high-scoring public solutions on the leaderboard drop the calc variables, so we drop them here as well.
drop_col_train = train.columns[train.columns.str.startswith('ps_calc')]
drop_col_test = test.columns[test.columns.str.startswith('ps_calc')]
train.drop(drop_col_train, axis = 1, inplace = True)
test.drop(drop_col_test, axis = 1, inplace = True)
train_X = train.drop(['target'], axis = 1)  # 'id' was already removed during feature selection
train_y = train['target']
test_X = test[train_X.columns]              # keep the same selected features, in the same order, for test
Prepare the model
Ensemble with cross validation
class Ensemble(object):
    def __init__(self, n_splits, stacker, base_models):
        self.n_splits = n_splits
        self.stacker = stacker
        self.base_models = base_models

    def fit_predict(self, X, y, T):
        X = np.array(X)
        y = np.array(y)
        T = np.array(T)

        folds = list(StratifiedKFold(n_splits=self.n_splits, shuffle=True, random_state=314).split(X, y))

        S_train = np.zeros((X.shape[0], len(self.base_models)))
        S_test = np.zeros((T.shape[0], len(self.base_models)))
        for i, clf in enumerate(self.base_models):

            S_test_i = np.zeros((T.shape[0], self.n_splits))

            for j, (train_idx, test_idx) in enumerate(folds):
                X_train = X[train_idx]
                y_train = y[train_idx]
                X_holdout = X[test_idx]

                print("Base model %d: fit %s model | fold %d" % (i+1, str(clf).split('(')[0], j+1))
                clf.fit(X_train, y_train)
                cross_score = cross_val_score(clf, X_train, y_train, cv=3, scoring='roc_auc')
                print("cross_score [roc-auc]: %.5f [gini]: %.5f" % (cross_score.mean(), 2*cross_score.mean()-1))
                y_pred = clf.predict_proba(X_holdout)[:, 1]

                S_train[test_idx, i] = y_pred
                S_test_i[:, j] = clf.predict_proba(T)[:, 1]
            S_test[:, i] = S_test_i.mean(axis=1)

        results = cross_val_score(self.stacker, S_train, y, cv=3, scoring='roc_auc')
        # Calculate gini factor as 2 * AUC - 1
        print("Stacker score [gini]: %.5f" % (2 * results.mean() - 1))

        self.stacker.fit(S_train, y)
        result = self.stacker.predict_proba(S_test)[:, 1]
        return result
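The printed Gini values use the standard relation Gini = 2 * AUC - 1; the competition leaderboard is scored with the Normalized Gini coefficient, for which this is the usual conversion.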
# LightGBM params
# lgb_1
lgb_params1 = {}
lgb_params1['learning_rate'] = 0.02
lgb_params1['n_estimators'] = 650
lgb_params1['max_bin'] = 10
lgb_params1['subsample'] = 0.8
lgb_params1['subsample_freq'] = 10
lgb_params1['colsample_bytree'] = 0.8
lgb_params1['min_child_samples'] = 500
lgb_params1['seed'] = 314
# lgb2
lgb_params2 = {}
lgb_params2['n_estimators'] = 1090
lgb_params2['learning_rate'] = 0.02
lgb_params2['colsample_bytree'] = 0.3
lgb_params2['subsample'] = 0.7
lgb_params2['subsample_freq'] = 2
lgb_params2['num_leaves'] = 16
lgb_params2['seed'] = 314
# lgb3
lgb_params3 = {}
lgb_params3['n_estimators'] = 1100
lgb_params3['max_depth'] = 4
lgb_params3['learning_rate'] = 0.02
lgb_params3['seed'] = 314
# XGBoost params
xgb_params = {}
xgb_params['objective'] = 'binary:logistic'
xgb_params['learning_rate'] = 0.04
xgb_params['n_estimators'] = 490
xgb_params['max_depth'] = 4
xgb_params['subsample'] = 0.9
xgb_params['colsample_bytree'] = 0.9
xgb_params['min_child_weight'] = 10
# Base models
lgb_model1 = LGBMClassifier(**lgb_params1)
lgb_model2 = LGBMClassifier(**lgb_params2)
lgb_model3 = LGBMClassifier(**lgb_params3)
xgb_model = XGBClassifier(**xgb_params)
# Stacking model
log_model = LogisticRegression()
stack = Ensemble(n_splits=5,
stacker = log_model,
base_models = (lgb_model1, lgb_model2, lgb_model3, xgb_model))
y_prediction = stack.fit_predict(train_X, train_y, test_X)
Base model 1: fit LGBMClassifier model | fold 1
cross_score [roc-auc]: 0.63983 [gini]: 0.27966
Base model 1: fit LGBMClassifier model | fold 2
cross_score [roc-auc]: 0.63881 [gini]: 0.27762
Base model 1: fit LGBMClassifier model | fold 3
cross_score [roc-auc]: 0.63903 [gini]: 0.27805
Base model 1: fit LGBMClassifier model | fold 4
cross_score [roc-auc]: 0.63853 [gini]: 0.27706
Base model 1: fit LGBMClassifier model | fold 5
cross_score [roc-auc]: 0.63984 [gini]: 0.27969
Base model 2: fit LGBMClassifier model | fold 1
cross_score [roc-auc]: 0.63799 [gini]: 0.27598
Base model 2: fit LGBMClassifier model | fold 2
cross_score [roc-auc]: 0.63899 [gini]: 0.27799
Base model 2: fit LGBMClassifier model | fold 3
cross_score [roc-auc]: 0.63784 [gini]: 0.27567
Base model 2: fit LGBMClassifier model | fold 4
cross_score [roc-auc]: 0.63734 [gini]: 0.27469
Base model 2: fit LGBMClassifier model | fold 5
cross_score [roc-auc]: 0.63822 [gini]: 0.27644
Base model 3: fit LGBMClassifier model | fold 1
cross_score [roc-auc]: 0.63656 [gini]: 0.27313
Base model 3: fit LGBMClassifier model | fold 2
cross_score [roc-auc]: 0.63683 [gini]: 0.27367
Base model 3: fit LGBMClassifier model | fold 3
cross_score [roc-auc]: 0.63650 [gini]: 0.27300
Base model 3: fit LGBMClassifier model | fold 4
cross_score [roc-auc]: 0.63494 [gini]: 0.26988
Base model 3: fit LGBMClassifier model | fold 5
cross_score [roc-auc]: 0.63688 [gini]: 0.27375
Base model 4: fit XGBClassifier model | fold 1
cross_score [roc-auc]: 0.63875 [gini]: 0.27750
Base model 4: fit XGBClassifier model | fold 2
cross_score [roc-auc]: 0.63857 [gini]: 0.27713
Base model 4: fit XGBClassifier model | fold 3
cross_score [roc-auc]: 0.63861 [gini]: 0.27722
Base model 4: fit XGBClassifier model | fold 4
cross_score [roc-auc]: 0.63741 [gini]: 0.27481
Base model 4: fit XGBClassifier model | fold 5
cross_score [roc-auc]: 0.63914 [gini]: 0.27828
Stacker score [gini]: 0.28532
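A minimal sketch of writing the stacked predictions in the usual id/target submission format (the file name is arbitrary):

# Illustrative only: pair the test ids with the stacked predicted probabilities.
submission = pd.DataFrame({'id': test['id'], 'target': y_prediction})
submission.to_csv('stacked_submission.csv', index=False)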