[Kaggle] Porto Seguro's Safe Driver Prediction
Porto Seguro's Safe Driver Prediction
Data Description
Porto Seguro's Safe Driver Prediction
- Features that belong to similar groups are tagged in the feature name (e.g. ind, reg, car, calc).
- Binary features carry the postfix "bin" and categorical features the postfix "cat". Features without these postfixes are either continuous or ordinal.
- A value of -1 indicates that the feature was missing for that observation.
- The target column indicates whether or not that policyholder filed a claim.
Loading packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
import plotly.offline as py
import plotly.graph_objs as go
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
import os
import warnings
warnings.filterwarnings('ignore')
Loading data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head()
 | id | target | ps_ind_01 | ps_ind_02_cat | ps_ind_03 | ps_ind_04_cat | ps_ind_05_cat | ps_ind_06_bin | ps_ind_07_bin | ps_ind_08_bin | ... | ps_calc_11 | ps_calc_12 | ps_calc_13 | ps_calc_14 | ps_calc_15_bin | ps_calc_16_bin | ps_calc_17_bin | ps_calc_18_bin | ps_calc_19_bin | ps_calc_20_bin
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 7 | 0 | 2 | 2 | 5 | 1 | 0 | 0 | 1 | 0 | ... | 9 | 1 | 5 | 8 | 0 | 1 | 1 | 0 | 0 | 1 |
1 | 9 | 0 | 1 | 1 | 7 | 0 | 0 | 0 | 0 | 1 | ... | 3 | 1 | 1 | 9 | 0 | 1 | 1 | 0 | 1 | 0 |
2 | 13 | 0 | 5 | 4 | 9 | 1 | 0 | 0 | 0 | 1 | ... | 4 | 2 | 7 | 7 | 0 | 1 | 1 | 0 | 1 | 0 |
3 | 16 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | 0 | ... | 2 | 2 | 4 | 9 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 17 | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 3 | 1 | 1 | 3 | 0 | 0 | 0 | 1 | 1 | 0 |
5 rows × 59 columns
Looking at the variables, we have:

- binary variables
- categorical variables whose category values are integers
- other variables with integer or float values
- variables where -1 indicates a missing value
- the target variable and an ID variable

In other words, the dataset contains binary variables, categorical variables with integer-coded categories, and other numeric variables; some variables use -1 to mark missing values, and the target and identifier (ID) columns are also present.
Metadata
To make variable handling easier, we store meta-information about each variable in a DataFrame. This is convenient when selecting specific variables for analysis, visualization, or modeling.

- role : input variable (input), identifier (id), or target variable (target)
- level : binary, nominal, real, or integer (matching the levels assigned in the code below)
- keep : True or False
- dtype : integer (int), floating point (float), or string (str)
data = []
for f in train.columns:
    if f == 'target':
        role = 'target'
    elif f == 'id':
        role = 'id'
    else:
        role = 'input'

    if 'bin' in f or f == 'target':
        level = 'binary'
    elif 'cat' in f or f == 'id':
        level = 'nominal'
    elif train[f].dtype == float:
        level = 'real'
    elif train[f].dtype == int:
        level = 'integer'

    keep = True
    if f == 'id':
        keep = False

    dtype = train[f].dtype

    f_dict = {
        'varname': f,
        'role': role,
        'level': level,
        'keep': keep,
        'dtype': dtype
    }
    data.append(f_dict)
meta = pd.DataFrame(data, columns=['varname', 'role', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace=True)
pd.DataFrame({'count' : meta.groupby(['role', 'level'])['role'].size()}).reset_index()
 | role | level | count
---|---|---|---
0 | id | nominal | 1 |
1 | input | binary | 17 |
2 | input | integer | 16 |
3 | input | nominal | 14 |
4 | input | real | 10 |
5 | target | binary | 1 |
Descriptive statistics
Real variables
v = meta[(meta.level == 'real') & meta.keep].index
train[v].describe()
 | ps_reg_01 | ps_reg_02 | ps_reg_03 | ps_car_12 | ps_car_13 | ps_car_14 | ps_car_15 | ps_calc_01 | ps_calc_02 | ps_calc_03
---|---|---|---|---|---|---|---|---|---|---
count | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 |
mean | 0.610991 | 0.439184 | 0.551102 | 0.379945 | 0.813265 | 0.276256 | 3.065899 | 0.449756 | 0.449589 | 0.449849 |
std | 0.287643 | 0.404264 | 0.793506 | 0.058327 | 0.224588 | 0.357154 | 0.731366 | 0.287198 | 0.286893 | 0.287153 |
min | 0.000000 | 0.000000 | -1.000000 | -1.000000 | 0.250619 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.400000 | 0.200000 | 0.525000 | 0.316228 | 0.670867 | 0.333167 | 2.828427 | 0.200000 | 0.200000 | 0.200000 |
50% | 0.700000 | 0.300000 | 0.720677 | 0.374166 | 0.765811 | 0.368782 | 3.316625 | 0.500000 | 0.400000 | 0.500000 |
75% | 0.900000 | 0.600000 | 1.000000 | 0.400000 | 0.906190 | 0.396485 | 3.605551 | 0.700000 | 0.700000 | 0.700000 |
max | 0.900000 | 1.800000 | 4.037945 | 1.264911 | 3.720626 | 0.636396 | 3.741657 | 0.900000 | 0.900000 | 0.900000 |
reg variables

- ps_reg_03 has missing values.
- The value ranges differ between variables (scaling could be applied).

car variables

- ps_car_12 and ps_car_14 have missing values.
- The value ranges differ between variables (scaling could be applied).

calc variables

- No missing values.
- The maximum value appears to be 0.9.
- The three calc variables have very similar distributions.
Integer variables
v = meta[(meta.level == 'integer') & (meta.keep)].index
train[v].describe()
 | ps_ind_01 | ps_ind_03 | ps_ind_14 | ps_ind_15 | ps_car_11 | ps_calc_04 | ps_calc_05 | ps_calc_06 | ps_calc_07 | ps_calc_08 | ps_calc_09 | ps_calc_10 | ps_calc_11 | ps_calc_12 | ps_calc_13 | ps_calc_14
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 |
mean | 1.900378 | 4.423318 | 0.012451 | 7.299922 | 2.346072 | 2.372081 | 1.885886 | 7.689445 | 3.005823 | 9.225904 | 2.339034 | 8.433590 | 5.441382 | 1.441918 | 2.872288 | 7.539026 |
std | 1.983789 | 2.699902 | 0.127545 | 3.546042 | 0.832548 | 1.117219 | 1.134927 | 1.334312 | 1.414564 | 1.459672 | 1.246949 | 2.904597 | 2.332871 | 1.202963 | 1.694887 | 2.746652 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 2.000000 | 0.000000 | 5.000000 | 2.000000 | 2.000000 | 1.000000 | 7.000000 | 2.000000 | 8.000000 | 1.000000 | 6.000000 | 4.000000 | 1.000000 | 2.000000 | 6.000000 |
50% | 1.000000 | 4.000000 | 0.000000 | 7.000000 | 3.000000 | 2.000000 | 2.000000 | 8.000000 | 3.000000 | 9.000000 | 2.000000 | 8.000000 | 5.000000 | 1.000000 | 3.000000 | 7.000000 |
75% | 3.000000 | 6.000000 | 0.000000 | 10.000000 | 3.000000 | 3.000000 | 3.000000 | 9.000000 | 4.000000 | 10.000000 | 3.000000 | 10.000000 | 7.000000 | 2.000000 | 4.000000 | 9.000000 |
max | 7.000000 | 11.000000 | 4.000000 | 13.000000 | 3.000000 | 5.000000 | 6.000000 | 10.000000 | 9.000000 | 12.000000 | 7.000000 | 25.000000 | 19.000000 | 10.000000 | 13.000000 | 23.000000 |
- Only ps_car_11 has missing values.
- Scaling could be applied if needed (see the sketch below).
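The scaling remarks above only matter for scale-sensitive models such as logistic regression; the tree-based models used later do not need it. A minimal sketch of standardizing the kept real and integer inputs with the StandardScaler imported above (illustrative only, not applied in the rest of this notebook):

# Illustrative only: standardize the kept real/integer input columns on a copy of train.
scale_cols = meta[meta.level.isin(['real', 'integer']) & meta.keep & (meta.role == 'input')].index
scaler = StandardScaler()
train_scaled = train.copy()
train_scaled[scale_cols] = scaler.fit_transform(train_scaled[scale_cols])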
Binary variables
v = meta[(meta.level == 'binary') & (meta.keep)].index
train[v].describe()
 | target | ps_ind_06_bin | ps_ind_07_bin | ps_ind_08_bin | ps_ind_09_bin | ps_ind_10_bin | ps_ind_11_bin | ps_ind_12_bin | ps_ind_13_bin | ps_ind_16_bin | ps_ind_17_bin | ps_ind_18_bin | ps_calc_15_bin | ps_calc_16_bin | ps_calc_17_bin | ps_calc_18_bin | ps_calc_19_bin | ps_calc_20_bin
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 | 595212.000000 |
mean | 0.036448 | 0.393742 | 0.257033 | 0.163921 | 0.185304 | 0.000373 | 0.001692 | 0.009439 | 0.000948 | 0.660823 | 0.121081 | 0.153446 | 0.122427 | 0.627840 | 0.554182 | 0.287182 | 0.349024 | 0.153318 |
std | 0.187401 | 0.488579 | 0.436998 | 0.370205 | 0.388544 | 0.019309 | 0.041097 | 0.096693 | 0.030768 | 0.473430 | 0.326222 | 0.360417 | 0.327779 | 0.483381 | 0.497056 | 0.452447 | 0.476662 | 0.360295 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 |
max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
plt.figure(figsize = (40, 30))
n = 0
for i in range(0, len(v)):
    plt.subplot(3, 6, n+1)
    sns.countplot(train[v[i]])
    plt.xlabel(train[v].columns[i], fontsize = 30)
    n += 1
plt.show()
- Looking at the means of the binary variables, the target mean is about 0.0365, so 0 and 1 are very unevenly distributed.
- The means of the other binary variables also show that most of their values are 0.
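The imbalance can be quantified directly; a quick check (roughly 96.4% zeros and 3.6% ones, consistent with the mean above):

print(train['target'].value_counts(normalize=True))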
Exploratory Data Visualization
Categorical Variables
v = meta[(meta.level == 'nominal') & (meta.keep)].index
l = []
for f in v:
    cat_perc = train[[f, 'target']].groupby([f], as_index=False).mean()
    cat_perc.sort_values(by='target', ascending=False, inplace=True)
    l.append(cat_perc)
plt.figure(figsize = (40, 30))
n = 0
for i in range(0, len(l)):
    plt.subplot(4, 3, n+1)
    sns.barplot(x = l[i].columns[0], y = 'target', data = l[i])
    plt.ylabel('% target', fontsize=18)
    plt.xlabel(l[i].columns[0], fontsize=18)
    plt.tick_params(axis='both', which='major', labelsize=18)
    n += 1
plt.show()
The bar plots show which variables contain missing values. Rather than imputing them, it looks better to keep the missing value (-1) as a category of its own, since those categories tend to have a high target rate.
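As an illustration (ps_car_07_cat is picked arbitrarily here), the claim rate per category, with -1 kept as its own group, can be checked with a simple groupby:

print(train.groupby('ps_car_07_cat')['target'].mean())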
Real Variables
def corr_heatmap(v):
    correlations = train[v].corr()

    # Create color map ranging between two colors
    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    fig, ax = plt.subplots(figsize=(10, 10))
    sns.heatmap(correlations, cmap=cmap, vmax=1.0, center=0, fmt='.2f',
                square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .75})
    plt.show()
v = meta[(meta.level == 'real') & (meta.keep)].index
corr_heatmap(v)
Highly correlated variable pairs:

- ps_reg_02 and ps_reg_03 (0.7)
- ps_car_12 and ps_car_13 (0.67)
- ps_car_12 and ps_car_14 (0.58)
- ps_car_13 and ps_car_15 (0.67)
To keep the plots fast, we take a 10% sample of the data and look at the highly correlated pairs.
s = train.sample(frac=0.1)
sns.lmplot(x='ps_reg_02', y='ps_reg_03', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.title('ps_reg_02 and ps_reg_03')
plt.show()
sns.lmplot(x='ps_car_12', y='ps_car_13', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.title('ps_car_12 and ps_car_13')
plt.show()
sns.lmplot(x='ps_car_12', y='ps_car_14', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.title('ps_car_12 and ps_car_14')
plt.show()
sns.lmplot(x='ps_car_15', y='ps_car_13', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.title('ps_car_15 and ps_car_13')
plt.show()
The highly correlated variables could be compressed with principal component analysis (PCA), but since only a few variables are strongly correlated, we can instead leave them in and let the model-based feature selection used later deal with them. A small PCA sketch follows for reference.
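A minimal PCA sketch on the correlated real variables, shown only for reference and not used anywhere else in this notebook (note that the -1 placeholders have not been imputed yet at this point):

from sklearn.decomposition import PCA

# Illustrative only: project the correlated real variables onto two principal components.
corr_vars = ['ps_reg_02', 'ps_reg_03', 'ps_car_12', 'ps_car_13', 'ps_car_15']
pca = PCA(n_components=2)
components = pca.fit_transform(StandardScaler().fit_transform(train[corr_vars]))
print(pca.explained_variance_ratio_)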
Integer Variables
v = meta[(meta.level == 'integer') & (meta.keep)].index
corr_heatmap(v)
There are no strongly correlated integer variables, but we can still look at how they behave when grouped by the target value, as sketched below.
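For example, the per-class means of the integer variables give a quick view of how the two target groups differ (a simple sketch; v still holds the integer variables here):

print(train.groupby('target')[v].mean().T)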
Feature Importance
from sklearn.ensemble import RandomForestClassifier
import plotly.graph_objs as go
rf = RandomForestClassifier(n_estimators=150, max_depth=8, min_samples_leaf=4, max_features=0.2, n_jobs=-1, random_state=0)
rf.fit(train.drop(['id', 'target'],axis=1), train.target)
features = train.drop(['id', 'target'],axis=1).columns.values
print("----- Training Done -----")
----- Training Done -----
x, y = (list(x) for x in zip(*sorted(zip(rf.feature_importances_, features),
reverse = False)))
trace2 = go.Bar(
x=x ,
y=y,
marker=dict(
color=x,
colorscale = 'Viridis',
reversescale = True
),
name='Random Forest Feature importance',
orientation='h',
)
layout = dict(
title='Barplot of Feature importances',
width = 900, height = 2000,
yaxis=dict(
showgrid=False,
showline=False,
showticklabels=True,
# domain=[0, 0.85],
))
fig1 = go.Figure(data=[trace2])
fig1['layout'].update(layout)
py.iplot(fig1, filename='plots')
Handling imbalanced classes
Since we saw above that records with target = 1 are rare, we can either

- oversample records with target = 1, or
- undersample records with target = 0.

Here the records with target = 0 are undersampled so that positives make up about 10% of the data (see the derivation below).
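The undersampling rate used in the code follows from requiring that, after keeping all nb_1 positive records and a fraction r of the nb_0 negative records, positives make up desired_apriori of the data:

desired_apriori = nb_1 / (nb_1 + r * nb_0)  =>  r = ((1 - desired_apriori) * nb_1) / (nb_0 * desired_apriori)

which is exactly the undersampling_rate computed below.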
desired_apriori = 0.10
idx_0 = train[train.target == 0].index
idx_1 = train[train.target == 1].index
nb_0 = len(train.loc[idx_0])
nb_1 = len(train.loc[idx_1])
undersampling_rate = ((1-desired_apriori)*nb_1)/(nb_0*desired_apriori)
undersampled_nb_0 = int(undersampling_rate*nb_0)
print('Rate to undersample records with target=0: {}'.format(undersampling_rate))
print('Number of records with target=0 after undersampling: {}'.format(undersampled_nb_0))
undersampled_idx = shuffle(idx_0, random_state=37, n_samples=undersampled_nb_0)
idx_list = list(undersampled_idx) + list(idx_1)
train = train.loc[idx_list].reset_index(drop=True)
Rate to undersample records with target=0: 0.34043569687437886
Number of records with target=0 after undersampling: 195246
Data Quality Checks
Checking missing values
vars_with_missing = []

for f in train.columns:
    missings = train[train[f] == -1][f].count()
    if missings > 0:
        vars_with_missing.append(f)
        missings_perc = missings / train.shape[0]

        print('Variable {} has {} records ({:.2%}) with missing values'.format(f, missings, missings_perc))

print('In total, there are {} variables with missing values'.format(len(vars_with_missing)))
Variable ps_ind_02_cat has 216 records (0.04%) with missing values
Variable ps_ind_04_cat has 83 records (0.01%) with missing values
Variable ps_ind_05_cat has 5809 records (0.98%) with missing values
Variable ps_reg_03 has 107772 records (18.11%) with missing values
Variable ps_car_01_cat has 107 records (0.02%) with missing values
Variable ps_car_02_cat has 5 records (0.00%) with missing values
Variable ps_car_03_cat has 411231 records (69.09%) with missing values
Variable ps_car_05_cat has 266551 records (44.78%) with missing values
Variable ps_car_07_cat has 11489 records (1.93%) with missing values
Variable ps_car_09_cat has 569 records (0.10%) with missing values
Variable ps_car_11 has 5 records (0.00%) with missing values
Variable ps_car_12 has 1 records (0.00%) with missing values
Variable ps_car_14 has 42620 records (7.16%) with missing values
In total, there are 13 variables with missing values
- ps_car_03_cat and ps_car_05_cat have too many missing values -> drop them.
- ps_reg_03 is about 18% missing -> impute with the mean.
- ps_car_11 has only 5 missing records -> impute with the mode.
- ps_car_14 is about 7% missing -> impute with the mean.
- …
vars_to_drop = ['ps_car_03_cat', 'ps_car_05_cat']
train.drop(vars_to_drop, inplace = True, axis = 1)
test.drop(vars_to_drop, inplace = True, axis = 1)
meta.loc[(vars_to_drop), 'keep'] = False
mean_imp = SimpleImputer(missing_values = -1, strategy = 'mean')
mode_imp = SimpleImputer(missing_values = -1, strategy = 'most_frequent')
train['ps_reg_03'] = mean_imp.fit_transform(train[['ps_reg_03']]).ravel()
train['ps_car_12'] = mean_imp.fit_transform(train[['ps_car_12']]).ravel()
train['ps_car_14'] = mean_imp.fit_transform(train[['ps_car_14']]).ravel()
train['ps_car_11'] = mode_imp.fit_transform(train[['ps_car_11']]).ravel()
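Target encoding ps_car_11_cat

ps_car_11_cat has far more distinct categories than the other categorical variables, so instead of turning it into dummy variables it is replaced below by a smoothed per-category mean of the target, with a small amount of noise added to reduce overfitting (the target_encode helper).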
# Add multiplicative noise to a series
def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

def target_encode(trn_series=None,
                  tst_series=None,
                  target=None,
                  min_samples_leaf=1,
                  smoothing=1,
                  noise_level=0):
    """
    Smoothing is computed like in the following paper by Daniele Micci-Barreca
    https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
    trn_series : training categorical feature as a pd.Series
    tst_series : test categorical feature as a pd.Series
    target : target data as a pd.Series
    min_samples_leaf (int) : minimum samples to take category average into account
    smoothing (int) : smoothing effect to balance categorical average vs prior
    """
    # Check that the training series matches the target length and shares its name with the test series
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name
    # Concatenate the training series and the target
    temp = pd.concat([trn_series, target], axis=1)
    # Compute the target mean and the count per category
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
    # Compute the smoothing factor
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
    # Use the overall target mean as the prior
    prior = target.mean()
    # Blend the per-category mean with the prior, then drop the helper columns
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis=1, inplace=True)
    # Map the encoded averages back onto the training series
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=trn_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index, so restore it
    ft_trn_series.index = trn_series.index
    # Build the same encoding for the test series
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index, so restore it
    ft_tst_series.index = tst_series.index
    return add_noise(ft_trn_series, noise_level), add_noise(ft_tst_series, noise_level)
train_encoded, test_encoded = target_encode(train["ps_car_11_cat"],
test["ps_car_11_cat"],
target=train.target,
min_samples_leaf=100,
smoothing=10,
noise_level=0.01)
train['ps_car_11_cat_te'] = train_encoded
train.drop('ps_car_11_cat', axis=1, inplace=True)
meta.loc['ps_car_11_cat','keep'] = False # Updating the meta
test['ps_car_11_cat_te'] = test_encoded
test.drop('ps_car_11_cat', axis=1, inplace=True)
Checking the cardinality of the categorical variables
Cardinality is the number of distinct values a variable takes.
Since we will create dummy variables from the categorical features, we need to check whether any of them have a large number of distinct values; such variables would generate many dummy columns and should be handled differently.
v = meta[(meta.level == 'nominal') & (meta.keep)].index
for f in v:
    dist_values = train[f].value_counts().shape[0]
    print('Variable {} has {} distinct values'.format(f, dist_values))
Variable ps_ind_02_cat has 5 distinct values
Variable ps_ind_04_cat has 3 distinct values
Variable ps_ind_05_cat has 8 distinct values
Variable ps_car_01_cat has 13 distinct values
Variable ps_car_02_cat has 3 distinct values
Variable ps_car_04_cat has 10 distinct values
Variable ps_car_06_cat has 18 distinct values
Variable ps_car_07_cat has 3 distinct values
Variable ps_car_08_cat has 2 distinct values
Variable ps_car_09_cat has 6 distinct values
Variable ps_car_10_cat has 3 distinct values
Feature Engineering
Dummy variables
The category values of the categorical variables do not represent any order or magnitude, so we handle them by creating dummy variables.
v = meta[(meta.level == 'nominal') & (meta.keep)].index
print('Before dummification we have {} variables in train'.format(train.shape[1]))
train = pd.get_dummies(train, columns = v, drop_first = True)
print('After dummification we have {} variables in train'.format(train.shape[1]))
test = pd.get_dummies(test, columns = v, drop_first = True)
print('After dummification we have {} variables in test'.format(test.shape[1]))
Before dummification we have 57 variables in train
After dummification we have 109 variables in train
After dummification we have 108 variables in test
Derived Variables
Derived variables can be created with PolynomialFeatures.
Presumably these are interaction terms for the real variables, the ones whose correlations we inspected in the heatmap above.
v = meta[(meta.level == 'real') & (meta.keep)].index
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
interactions_train = pd.DataFrame(data=poly.fit_transform(train[v]), columns=poly.get_feature_names(v))
interactions_test = pd.DataFrame(data=poly.transform(test[v]), columns=poly.get_feature_names(v))
interactions_train.drop(v, axis=1, inplace=True)  # Remove the original columns
interactions_test.drop(v, axis=1, inplace=True)  # Remove the original columns
# Concat the interaction variables to the train and test data
print('Before creating interactions we have {} variables in train'.format(train.shape[1]))
train = pd.concat([train, interactions_train], axis=1)
print('After creating interactions we have {} variables in train'.format(train.shape[1]))
test = pd.concat([test, interactions_test], axis=1)
print('After creating interactions we have {} variables in test'.format(test.shape[1]))
Before creating interactions we have 109 variables in train
After creating interactions we have 164 variables in train
After creating interactions we have 163 variables in test
Keep only the important features, using the random forest feature importances with SelectFromModel.
X_train = train.drop(['id', 'target'], axis=1)
feat_labels = X_train.columns
rf = RandomForestClassifier(n_estimators=150, max_depth=8, min_samples_leaf=4, max_features=0.2, n_jobs=-1, random_state=0)
rf.fit(X_train, train['target'])  # refit the forest on the expanded feature set so that prefit=True is valid
sfm = SelectFromModel(rf, threshold='median', prefit=True)
print('Number of features before selection: {}'.format(X_train.shape[1]))
n_features = sfm.transform(X_train).shape[1]
print('Number of features after selection: {}'.format(n_features))
selected_vars = list(feat_labels[sfm.get_support()])
Number of features before selection: 162
Number of features after selection: 81
train = train[selected_vars + ['target']]
Prepare the Data for Model
Drop calc columns
The high-scoring public solutions on the leaderboard drop the calc variables, so we drop them here as well.
drop_col_train = train.columns[train.columns.str.startswith('ps_calc')]
drop_col_test = test.columns[test.columns.str.startswith('ps_calc')]
train.drop(drop_col_train, axis = 1, inplace = True)
test.drop(drop_col_test, axis = 1, inplace = True)
train_X = train.drop(['target'], axis = 1)  # 'id' was already removed during feature selection
train_y = train['target']
test_X = test[train_X.columns]              # keep the same selected features, in the same order, for test
Prepare the model
Ensemble with cross validation
class Ensemble(object):
    def __init__(self, n_splits, stacker, base_models):
        self.n_splits = n_splits
        self.stacker = stacker
        self.base_models = base_models

    def fit_predict(self, X, y, T):
        X = np.array(X)
        y = np.array(y)
        T = np.array(T)

        folds = list(StratifiedKFold(n_splits=self.n_splits, shuffle=True, random_state=314).split(X, y))

        S_train = np.zeros((X.shape[0], len(self.base_models)))
        S_test = np.zeros((T.shape[0], len(self.base_models)))
        for i, clf in enumerate(self.base_models):

            S_test_i = np.zeros((T.shape[0], self.n_splits))

            for j, (train_idx, test_idx) in enumerate(folds):
                X_train = X[train_idx]
                y_train = y[train_idx]
                X_holdout = X[test_idx]

                print("Base model %d: fit %s model | fold %d" % (i+1, str(clf).split('(')[0], j+1))
                clf.fit(X_train, y_train)
                cross_score = cross_val_score(clf, X_train, y_train, cv=3, scoring='roc_auc')
                print("cross_score [roc-auc]: %.5f [gini]: %.5f" % (cross_score.mean(), 2*cross_score.mean()-1))
                y_pred = clf.predict_proba(X_holdout)[:, 1]

                S_train[test_idx, i] = y_pred
                S_test_i[:, j] = clf.predict_proba(T)[:, 1]
            S_test[:, i] = S_test_i.mean(axis=1)

        results = cross_val_score(self.stacker, S_train, y, cv=3, scoring='roc_auc')
        # Calculate gini factor as 2 * AUC - 1
        print("Stacker score [gini]: %.5f" % (2 * results.mean() - 1))

        self.stacker.fit(S_train, y)
        result = self.stacker.predict_proba(S_test)[:, 1]
        return result
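The printed Gini values use the standard relation Gini = 2 * AUC - 1; the competition leaderboard is scored with the Normalized Gini coefficient, for which this is the usual conversion.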
# LightGBM params
# lgb_1
lgb_params1 = {}
lgb_params1['learning_rate'] = 0.02
lgb_params1['n_estimators'] = 650
lgb_params1['max_bin'] = 10
lgb_params1['subsample'] = 0.8
lgb_params1['subsample_freq'] = 10
lgb_params1['colsample_bytree'] = 0.8
lgb_params1['min_child_samples'] = 500
lgb_params1['seed'] = 314
# lgb2
lgb_params2 = {}
lgb_params2['n_estimators'] = 1090
lgb_params2['learning_rate'] = 0.02
lgb_params2['colsample_bytree'] = 0.3
lgb_params2['subsample'] = 0.7
lgb_params2['subsample_freq'] = 2
lgb_params2['num_leaves'] = 16
lgb_params2['seed'] = 314
# lgb3
lgb_params3 = {}
lgb_params3['n_estimators'] = 1100
lgb_params3['max_depth'] = 4
lgb_params3['learning_rate'] = 0.02
lgb_params3['seed'] = 314
# XGBoost params
xgb_params = {}
xgb_params['objective'] = 'binary:logistic'
xgb_params['learning_rate'] = 0.04
xgb_params['n_estimators'] = 490
xgb_params['max_depth'] = 4
xgb_params['subsample'] = 0.9
xgb_params['colsample_bytree'] = 0.9
xgb_params['min_child_weight'] = 10
# Base models
lgb_model1 = LGBMClassifier(**lgb_params1)
lgb_model2 = LGBMClassifier(**lgb_params2)
lgb_model3 = LGBMClassifier(**lgb_params3)
xgb_model = XGBClassifier(**xgb_params)
# Stacking model
log_model = LogisticRegression()
stack = Ensemble(n_splits=5,
stacker = log_model,
base_models = (lgb_model1, lgb_model2, lgb_model3, xgb_model))
y_prediction = stack.fit_predict(train_X, train_y, test_X)
Base model 1: fit LGBMClassifier model | fold 1
cross_score [roc-auc]: 0.63983 [gini]: 0.27966
Base model 1: fit LGBMClassifier model | fold 2
cross_score [roc-auc]: 0.63881 [gini]: 0.27762
Base model 1: fit LGBMClassifier model | fold 3
cross_score [roc-auc]: 0.63903 [gini]: 0.27805
Base model 1: fit LGBMClassifier model | fold 4
cross_score [roc-auc]: 0.63853 [gini]: 0.27706
Base model 1: fit LGBMClassifier model | fold 5
cross_score [roc-auc]: 0.63984 [gini]: 0.27969
Base model 2: fit LGBMClassifier model | fold 1
cross_score [roc-auc]: 0.63799 [gini]: 0.27598
Base model 2: fit LGBMClassifier model | fold 2
cross_score [roc-auc]: 0.63899 [gini]: 0.27799
Base model 2: fit LGBMClassifier model | fold 3
cross_score [roc-auc]: 0.63784 [gini]: 0.27567
Base model 2: fit LGBMClassifier model | fold 4
cross_score [roc-auc]: 0.63734 [gini]: 0.27469
Base model 2: fit LGBMClassifier model | fold 5
cross_score [roc-auc]: 0.63822 [gini]: 0.27644
Base model 3: fit LGBMClassifier model | fold 1
cross_score [roc-auc]: 0.63656 [gini]: 0.27313
Base model 3: fit LGBMClassifier model | fold 2
cross_score [roc-auc]: 0.63683 [gini]: 0.27367
Base model 3: fit LGBMClassifier model | fold 3
cross_score [roc-auc]: 0.63650 [gini]: 0.27300
Base model 3: fit LGBMClassifier model | fold 4
cross_score [roc-auc]: 0.63494 [gini]: 0.26988
Base model 3: fit LGBMClassifier model | fold 5
cross_score [roc-auc]: 0.63688 [gini]: 0.27375
Base model 4: fit XGBClassifier model | fold 1
cross_score [roc-auc]: 0.63875 [gini]: 0.27750
Base model 4: fit XGBClassifier model | fold 2
cross_score [roc-auc]: 0.63857 [gini]: 0.27713
Base model 4: fit XGBClassifier model | fold 3
cross_score [roc-auc]: 0.63861 [gini]: 0.27722
Base model 4: fit XGBClassifier model | fold 4
cross_score [roc-auc]: 0.63741 [gini]: 0.27481
Base model 4: fit XGBClassifier model | fold 5
cross_score [roc-auc]: 0.63914 [gini]: 0.27828
Stacker score [gini]: 0.28532
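A minimal sketch of writing the stacked predictions in the usual id/target submission format (the file name is arbitrary):

# Illustrative only: pair the test ids with the stacked predicted probabilities.
submission = pd.DataFrame({'id': test['id'], 'target': y_prediction})
submission.to_csv('stacked_submission.csv', index=False)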