[Kaggle] Porto Seguro's Safe Driver Prediction

15 minute read

Porto Seguro's Safe Driver Prediction

Data Description

Porto Seguro's Safe Driver Prediction

  • Features that belong to similar groups are tagged accordingly in the feature names (e.g. ind, reg, car, calc).

  • Binary features carry the postfix "bin" and categorical features the postfix "cat". Features without these postfixes are either continuous or ordinal.

  • A value of -1 indicates that the feature was missing for that observation.

  • The target column indicates whether or not that policy holder filed a claim. (A quick way to group the columns by these tags is sketched right after this list.)
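
For reference, a small snippet that groups the column names by these tags and postfixes. This is only a sketch; it assumes the training file is train.csv, as used in the Loading data section below.

from collections import defaultdict
import pandas as pd

train = pd.read_csv('train.csv')

groups = defaultdict(list)
for col in train.columns.drop(['id', 'target']):
    tag = col.split('_')[1]  # ind, reg, car or calc
    if col.endswith('_bin'):
        kind = 'binary'
    elif col.endswith('_cat'):
        kind = 'categorical'
    else:
        kind = 'continuous/ordinal'
    groups[(tag, kind)].append(col)

for (tag, kind), cols in sorted(groups.items()):
    print(tag, kind, len(cols))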

Loading packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
import plotly.offline as py
import plotly.graph_objs as go
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
import os

import warnings
warnings.filterwarnings('ignore')

Loading data

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head()
id target ps_ind_01 ps_ind_02_cat ps_ind_03 ps_ind_04_cat ps_ind_05_cat ps_ind_06_bin ps_ind_07_bin ps_ind_08_bin ... ps_calc_11 ps_calc_12 ps_calc_13 ps_calc_14 ps_calc_15_bin ps_calc_16_bin ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin
0 7 0 2 2 5 1 0 0 1 0 ... 9 1 5 8 0 1 1 0 0 1
1 9 0 1 1 7 0 0 0 0 1 ... 3 1 1 9 0 1 1 0 1 0
2 13 0 5 4 9 1 0 0 0 1 ... 4 2 7 7 0 1 1 0 1 0
3 16 0 0 1 2 0 0 1 0 0 ... 2 2 4 9 0 0 0 0 0 0
4 17 0 0 2 0 1 0 1 0 0 ... 3 1 1 3 0 0 0 1 1 0

5 rows × 59 columns

Looking at the variables, we have:

  • binary variables

  • categorical variables whose category values are integers

  • other variables with integer or floating-point values

  • variables where -1 indicates a missing value

  • the target variable and an id variable

In other words, the dataset contains binary variables, categorical variables with integer-coded categories, and other numeric variables, some of which use -1 to mark missing values. It also includes the target variable and an identifier (id) column.

MetaData

To make variable management easier, we store meta-information about each variable in a DataFrame. This is handy later when selecting specific variables for analysis, visualization and modelling.

  • role : input variable (input), identifier (id), target variable (target)

  • level : nominal, real, integer, binary (as assigned in the code below)

  • keep : True or False

  • dtype : integer (int), floating point (float), string (str)

data = []

for f in train.columns:
    
    if f == 'target':
        role = 'target'
    elif f == 'id':
        role = 'id'
    else:
        role = 'input'
         
    if 'bin' in f or f == 'target':
        level = 'binary'
    elif 'cat' in f or f == 'id':
        level = 'nominal'
    elif train[f].dtype == float:
        level = 'real'
    elif train[f].dtype == int:
        level = 'integer'
        
    keep = True
    if f == 'id':
        keep = False
    
    dtype = train[f].dtype
    
    f_dict = {
        'varname': f,
        'role': role,
        'level': level,
        'keep': keep,
        'dtype': dtype
    }
    data.append(f_dict)
    
meta = pd.DataFrame(data, columns=['varname', 'role', 'level', 'keep', 'dtype'])

meta.set_index('varname', inplace=True)

pd.DataFrame({'count' : meta.groupby(['role', 'level'])['role'].size()}).reset_index()
role level count
0 id nominal 1
1 input binary 17
2 input integer 16
3 input nominal 14
4 input real 10
5 target binary 1

Descriptive statistics

Real variables

v = meta[(meta.level == 'real') & meta.keep].index

train[v].describe()
ps_reg_01 ps_reg_02 ps_reg_03 ps_car_12 ps_car_13 ps_car_14 ps_car_15 ps_calc_01 ps_calc_02 ps_calc_03
count 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000
mean 0.610991 0.439184 0.551102 0.379945 0.813265 0.276256 3.065899 0.449756 0.449589 0.449849
std 0.287643 0.404264 0.793506 0.058327 0.224588 0.357154 0.731366 0.287198 0.286893 0.287153
min 0.000000 0.000000 -1.000000 -1.000000 0.250619 -1.000000 0.000000 0.000000 0.000000 0.000000
25% 0.400000 0.200000 0.525000 0.316228 0.670867 0.333167 2.828427 0.200000 0.200000 0.200000
50% 0.700000 0.300000 0.720677 0.374166 0.765811 0.368782 3.316625 0.500000 0.400000 0.500000
75% 0.900000 0.600000 1.000000 0.400000 0.906190 0.396485 3.605551 0.700000 0.700000 0.700000
max 0.900000 1.800000 4.037945 1.264911 3.720626 0.636396 3.741657 0.900000 0.900000 0.900000

reg variables

  • ps_reg_03 has missing values.

  • The value ranges differ across variables (scaling could be applied).

car variables

  • ps_car_12 and ps_car_14 have missing values (their minimum is -1).

  • The value ranges differ across variables (scaling could be applied).

calc variables

  • No missing values.

  • The maximum value appears to be 0.9.

  • They have similar distributions.

Integer variables

v = meta[(meta.level == 'integer') & (meta.keep)].index

train[v].describe()
ps_ind_01 ps_ind_03 ps_ind_14 ps_ind_15 ps_car_11 ps_calc_04 ps_calc_05 ps_calc_06 ps_calc_07 ps_calc_08 ps_calc_09 ps_calc_10 ps_calc_11 ps_calc_12 ps_calc_13 ps_calc_14
count 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000
mean 1.900378 4.423318 0.012451 7.299922 2.346072 2.372081 1.885886 7.689445 3.005823 9.225904 2.339034 8.433590 5.441382 1.441918 2.872288 7.539026
std 1.983789 2.699902 0.127545 3.546042 0.832548 1.117219 1.134927 1.334312 1.414564 1.459672 1.246949 2.904597 2.332871 1.202963 1.694887 2.746652
min 0.000000 0.000000 0.000000 0.000000 -1.000000 0.000000 0.000000 0.000000 0.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 0.000000 5.000000 2.000000 2.000000 1.000000 7.000000 2.000000 8.000000 1.000000 6.000000 4.000000 1.000000 2.000000 6.000000
50% 1.000000 4.000000 0.000000 7.000000 3.000000 2.000000 2.000000 8.000000 3.000000 9.000000 2.000000 8.000000 5.000000 1.000000 3.000000 7.000000
75% 3.000000 6.000000 0.000000 10.000000 3.000000 3.000000 3.000000 9.000000 4.000000 10.000000 3.000000 10.000000 7.000000 2.000000 4.000000 9.000000
max 7.000000 11.000000 4.000000 13.000000 3.000000 5.000000 6.000000 10.000000 9.000000 12.000000 7.000000 25.000000 19.000000 10.000000 13.000000 23.000000
  • Only ps_car_11 has missing values.

  • Scaling can be applied if needed.

Binary variables

v = meta[(meta.level == 'binary') & (meta.keep)].index

train[v].describe()
target ps_ind_06_bin ps_ind_07_bin ps_ind_08_bin ps_ind_09_bin ps_ind_10_bin ps_ind_11_bin ps_ind_12_bin ps_ind_13_bin ps_ind_16_bin ps_ind_17_bin ps_ind_18_bin ps_calc_15_bin ps_calc_16_bin ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin
count 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000 595212.000000
mean 0.036448 0.393742 0.257033 0.163921 0.185304 0.000373 0.001692 0.009439 0.000948 0.660823 0.121081 0.153446 0.122427 0.627840 0.554182 0.287182 0.349024 0.153318
std 0.187401 0.488579 0.436998 0.370205 0.388544 0.019309 0.041097 0.096693 0.030768 0.473430 0.326222 0.360417 0.327779 0.483381 0.497056 0.452447 0.476662 0.360295
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 0.000000
75% 0.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
plt.figure(figsize=(40, 30))

for i in range(len(v)):
    plt.subplot(3, 6, i + 1)
    sns.countplot(x=train[v[i]])
    plt.xlabel(v[i], fontsize=30)

plt.show()

  • Looking at the means of the binary variables, the target mean is about 0.0365, so the 0s and 1s are very unevenly distributed (strong class imbalance).

  • The means of the other binary variables also show that they consist mostly of 0s. (A quick check of the imbalance follows below.)
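
As a quick sanity check of the imbalance (a sketch; at this point train is still the full training set):

# Share of each target class: roughly 96.4% zeros vs 3.6% ones
print(train['target'].value_counts(normalize=True))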

Exploratory Data Visualization

Categorical Variables

v = meta[(meta.level == 'nominal') & (meta.keep)].index

l = []

for f in v:
    cat_perc = train[[f, 'target']].groupby([f],as_index=False).mean()
    cat_perc.sort_values(by='target', ascending=False, inplace=True)
    
    l.append(cat_perc)
    
plt.figure(figsize=(40, 30))

for i in range(len(l)):
    plt.subplot(5, 3, i + 1)  # 14 nominal variables, so a 5x3 grid
    sns.barplot(x=l[i].columns[0], y='target', data=l[i])
    plt.ylabel('% target', fontsize=18)
    plt.xlabel(l[i].columns[0], fontsize=18)
    plt.tick_params(axis='both', which='major', labelsize=18)

plt.show()

The bar plots show which variables contain missing values. Rather than imputing them, it looks better to keep the missing value as a category of its own, since the missing category tends to have a high target rate (a quick check follows below).
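
For example, one hedged way to inspect this for a single variable (ps_car_07_cat is just an illustration; any categorical variable with missing values works the same way):

# Mean claim rate per category; the -1 group corresponds to the missing values
print(train.groupby('ps_car_07_cat')['target'].mean())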

Real Variables

def corr_heatmap(v):
    correlations = train[v].corr()

    # Create color map ranging between two colors
    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    fig, ax = plt.subplots(figsize=(10,10))
    sns.heatmap(correlations, cmap=cmap, vmax=1.0, center=0, fmt='.2f',
                square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .75})
    plt.show();
    
v = meta[(meta.level == 'real') & (meta.keep)].index

corr_heatmap(v)

Highly correlated variables:

  • ps_reg_02 and ps_reg_03 (0.7)

  • ps_car_12 and ps_car_13 (0.67)

  • ps_car_12 and ps_car_14 (0.58)

  • ps_car_13 and ps_car_15 (0.67)

To keep the analysis fast, let's look at these highly correlated pairs on a 10% sample of the data.

s = train.sample(frac=0.1)
sns.lmplot(x='ps_reg_02', y='ps_reg_03', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.title('ps_reg_02 and ps_reg_03')
plt.show()

sns.lmplot(x='ps_car_12', y='ps_car_13', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.title('ps_car_12 and ps_car_13')

plt.show()

sns.lmplot(x='ps_car_12', y='ps_car_14', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.title('ps_car_12 and ps_car_14')
plt.show()

sns.lmplot(x='ps_car_15', y='ps_car_13', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.title('ps_car_15 and ps_car_13')
plt.show()

For the highly correlated variables we could run a principal component analysis (PCA), but since there are only a few such variables, we can also let the model handle them and rely on feature selection instead. A minimal PCA sketch is shown below for reference.
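
This is only a sketch of what that PCA would look like, assuming we restrict it to the correlated reg/car variables listed above (the variable list and the 95% variance threshold are illustrative choices, not part of the original analysis):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

corr_vars = ['ps_reg_02', 'ps_reg_03', 'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15']

# Standardize first, then keep enough components to explain ~95% of the variance
pca = PCA(n_components=0.95)
components = pca.fit_transform(StandardScaler().fit_transform(train[corr_vars]))
print(pca.explained_variance_ratio_)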

Integer Variables

v = meta[(meta.level == 'integer') & (meta.keep)].index
corr_heatmap(v)

There are no highly correlated integer variables, but we can still look at them grouped by the target value (see below).
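
A quick, hedged way to do that grouping (v is the set of integer variables selected just above):

# Mean of each integer variable within each target class
print(train.groupby('target')[list(v)].mean().T)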

Feature Importance

from sklearn.ensemble import RandomForestClassifier
import plotly.graph_objs as go

rf = RandomForestClassifier(n_estimators=150, max_depth=8, min_samples_leaf=4, max_features=0.2, n_jobs=-1, random_state=0)

rf.fit(train.drop(['id', 'target'],axis=1), train.target)

features = train.drop(['id', 'target'],axis=1).columns.values

print("----- Training Done -----")
----- Training Done -----
x, y = (list(x) for x in zip(*sorted(zip(rf.feature_importances_, features), 
                                                            reverse = False)))
trace2 = go.Bar(
    x=x ,
    y=y,
    marker=dict(
        color=x,
        colorscale = 'Viridis',
        reversescale = True
    ),
    name='Random Forest Feature importance',
    orientation='h',
)

layout = dict(
    title='Barplot of Feature importances',
     width = 900, height = 2000,
    yaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=True,
#         domain=[0, 0.85],
    ))

fig1 = go.Figure(data=[trace2])
fig1['layout'].update(layout)
py.iplot(fig1, filename='plots')

Handling imbalanced classes

Since we saw above that records with target=1 are rare, we can either

  • oversample the records with target=1, or

  • undersample the records with target=0 (the undersampling rate used in the code below is derived right after this list).
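
Why that rate: if we keep all nb_1 positive records and a fraction r of the nb_0 negative records, the positive share becomes nb_1 / (r * nb_0 + nb_1). Setting this equal to desired_apriori and solving for r gives r = ((1 - desired_apriori) * nb_1) / (desired_apriori * nb_0), which is exactly the expression used in the code below (about 0.34 for this dataset).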

desired_apriori = 0.10

idx_0 = train[train.target == 0].index
idx_1 = train[train.target == 1].index

nb_0 = len(train.loc[idx_0])
nb_1 = len(train.loc[idx_1])

undersampling_rate = ((1-desired_apriori)*nb_1)/(nb_0*desired_apriori)
undersampled_nb_0 = int(undersampling_rate*nb_0)
print('Rate to undersample records with target=0: {}'.format(undersampling_rate))
print('Number of records with target=0 after undersampling: {}'.format(undersampled_nb_0))

undersampled_idx = shuffle(idx_0, random_state=37, n_samples=undersampled_nb_0)

idx_list = list(undersampled_idx) + list(idx_1)

train = train.loc[idx_list].reset_index(drop=True)
Rate to undersample records with target=0: 0.34043569687437886
Number of records with target=0 after undersampling: 195246

Data Quality Checks

Checking missing values

vars_with_missing = []

for f in train.columns:
    missings = train[train[f] == -1][f].count()
    
    if missings > 0:
        vars_with_missing.append(f)
        missings_perc = missings/train.shape[0]
        
        print('Variable {} has {} records ({:.2%}) with missing values'.format(f, missings, missings_perc))
        
print('In total, there are {} variables with missing values'.format(len(vars_with_missing)))
Variable ps_ind_02_cat has 216 records (0.04%) with missing values
Variable ps_ind_04_cat has 83 records (0.01%) with missing values
Variable ps_ind_05_cat has 5809 records (0.98%) with missing values
Variable ps_reg_03 has 107772 records (18.11%) with missing values
Variable ps_car_01_cat has 107 records (0.02%) with missing values
Variable ps_car_02_cat has 5 records (0.00%) with missing values
Variable ps_car_03_cat has 411231 records (69.09%) with missing values
Variable ps_car_05_cat has 266551 records (44.78%) with missing values
Variable ps_car_07_cat has 11489 records (1.93%) with missing values
Variable ps_car_09_cat has 569 records (0.10%) with missing values
Variable ps_car_11 has 5 records (0.00%) with missing values
Variable ps_car_12 has 1 records (0.00%) with missing values
Variable ps_car_14 has 42620 records (7.16%) with missing values
In total, there are 13 variables with missing values
  • ps_car_03_cat and ps_car_05_cat have too many missing values -> drop them.

  • ps_reg_03 is about 18% missing -> impute with the mean.

  • ps_car_11 has 5 missing values and ps_car_12 has a single one -> impute ps_car_11 with the mode and ps_car_12 with the mean.

  • ps_car_14 is about 7% missing -> impute with the mean.

vars_to_drop = ['ps_car_03_cat', 'ps_car_05_cat']
train.drop(vars_to_drop, inplace = True, axis = 1)
test.drop(vars_to_drop, inplace = True, axis = 1)
meta.loc[(vars_to_drop), 'keep'] = False 
mean_imp = SimpleImputer(missing_values = -1, strategy = 'mean')
mode_imp = SimpleImputer(missing_values = -1, strategy = 'most_frequent')

train['ps_reg_03'] = mean_imp.fit_transform(train[['ps_reg_03']]).ravel()
train['ps_car_12'] = mean_imp.fit_transform(train[['ps_car_12']]).ravel()
train['ps_car_14'] = mean_imp.fit_transform(train[['ps_car_14']]).ravel()
train['ps_car_11'] = mode_imp.fit_transform(train[['ps_car_11']]).ravel()
# Add multiplicative noise to a series
def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

def target_encode(trn_series = None, 
                  tst_series = None, 
                  target = None, 
                  min_samples_leaf = 1, 
                  smoothing = 1,
                  noise_level = 0):
    """
    Smoothing is computed like in the following paper by Daniele Micci-Barreca
    https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
    trn_series : training categorical feature as a pd.Series
    tst_series : test categorical feature as a pd.Series
    target : target data as a pd.Series
    min_samples_leaf (int) : minimum samples to take category average into account
    smoothing (int) : smoothing effect to balance categorical average vs prior  
    """ 
    
    # Check that the training series and the target have the same length, and that the train/test series share a name
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name 
    
    # Concatenate the training series with the target
    temp = pd.concat([trn_series, target], axis=1) 
    
    # Compute the mean target and the count for each category value
    averages = temp.groupby(by = trn_series.name)[target.name].agg(["mean", "count"])
    
    # Compute the smoothing factor
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
    
    # Use the overall target mean as the prior
    prior = target.mean()
    
    # Blend each category mean with the prior using the smoothing factor, then drop the helper columns
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis = 1, inplace = True)
    
    # Map the blended averages back onto the training series
    ft_trn_series = pd.merge(   trn_series.to_frame(trn_series.name),
                                averages.reset_index().rename(columns = {'index': target.name, target.name: 'average'}),
                                on = trn_series.name,
                                how = 'left')['average'].rename(trn_series.name + '_mean').fillna(prior)

    # Do the same for the test series
    ft_trn_series.index = trn_series.index 
    ft_tst_series = pd.merge(   tst_series.to_frame(tst_series.name),
                                averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
                                on=tst_series.name,
                                how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)

    # pd.merge does not keep the index so restore it
    ft_tst_series.index = tst_series.index
    return add_noise(ft_trn_series, noise_level), add_noise(ft_tst_series, noise_level)
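
In other words, a category value c that appears n_c times in the training data is encoded as prior * (1 - lambda) + mean_c * lambda, where lambda = 1 / (1 + exp(-(n_c - min_samples_leaf) / smoothing)). Rare categories are therefore pulled toward the overall claim rate (the prior), frequent categories keep their own mean, and the optional multiplicative noise makes it harder for downstream models to overfit the encoding.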

train_encoded, test_encoded = target_encode(train["ps_car_11_cat"], 
                              test["ps_car_11_cat"], 
                              target=train.target, 
                              min_samples_leaf=100,
                              smoothing=10,
                              noise_level=0.01)
    
train['ps_car_11_cat_te'] = train_encoded
train.drop('ps_car_11_cat', axis=1, inplace=True)
meta.loc['ps_car_11_cat','keep'] = False  # Updating the meta
test['ps_car_11_cat_te'] = test_encoded
test.drop('ps_car_11_cat', axis=1, inplace=True)

Checking the cardinality of the categorical variables

Cardinality refers to the number of distinct values within a variable.

Since we will create dummy variables from the categorical features later, we need to check whether any of them have a large number of distinct values; such variables would generate many dummy columns and have to be handled differently (which is why ps_car_11_cat was target-encoded above).

v = meta[(meta.level == 'nominal') & (meta.keep)].index

for f in v:
    dist_values = train[f].value_counts().shape[0]
    print('Variable {} has {} distinct values'.format(f, dist_values))
Variable ps_ind_02_cat has 5 distinct values
Variable ps_ind_04_cat has 3 distinct values
Variable ps_ind_05_cat has 8 distinct values
Variable ps_car_01_cat has 13 distinct values
Variable ps_car_02_cat has 3 distinct values
Variable ps_car_04_cat has 10 distinct values
Variable ps_car_06_cat has 18 distinct values
Variable ps_car_07_cat has 3 distinct values
Variable ps_car_08_cat has 2 distinct values
Variable ps_car_09_cat has 6 distinct values
Variable ps_car_10_cat has 3 distinct values

Feature Engineering

Dummy variables

Since the categorical variables do not encode any ordering, we can handle them by creating dummy variables.

v = meta[(meta.level == 'nominal') & (meta.keep)].index
print('Before dummification we have {} variables in train'.format(train.shape[1]))
train = pd.get_dummies(train, columns = v, drop_first = True)
print('After dummification we have {} variables in train'.format(train.shape[1]))
test = pd.get_dummies(test, columns = v, drop_first = True)
print('After dummification we have {} variables in test'.format(test.shape[1]))
Before dummification we have 57 variables in train
After dummification we have 109 variables in train
After dummification we have 108 variables in test
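
Note that the one-column gap between train (109) and test (108) is simply the target column. Still, a small sanity check like the one below can be worthwhile, because get_dummies produces different columns whenever train and test disagree on category levels (this is an optional check, not part of the original pipeline):

# Columns produced in one set but not in the other (ignoring the target column)
print(set(train.columns) - set(test.columns) - {'target'})
print(set(test.columns) - set(train.columns))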

Derived Variables

We can create derived variables with PolynomialFeatures.

This presumably generates interaction features for the real variables, the ones whose correlations we inspected in the heatmap above.

v = meta[(meta.level == 'real') & (meta.keep)].index
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
interactions_train = pd.DataFrame(data=poly.fit_transform(train[v]), columns=poly.get_feature_names(v))
interactions_test = pd.DataFrame(data=poly.transform(test[v]), columns=poly.get_feature_names(v))

interactions_train.drop(v, axis=1, inplace=True)  # Remove the original columns
interactions_test.drop(v, axis=1, inplace=True)  # Remove the original columns

# Concat the interaction variables to the train data
print('Before creating interactions we have {} variables in train'.format(train.shape[1]))

train = pd.concat([train, interactions_train], axis=1)
print('After creating interactions we have {} variables in train'.format(train.shape[1]))

test = pd.concat([test, interactions_test], axis=1)
print('After creating interactions we have {} variables in test'.format(test.shape[1]))
Before creating interactions we have 109 variables in train
After creating interactions we have 164 variables in train
After creating interactions we have 163 variables in test

Keep only the important variables (the random forest is refit on the current feature set and SelectFromModel selects features above the median importance).

# Refit rf on the current feature matrix so SelectFromModel sees the same columns
X_train = train.drop(['id', 'target'], axis=1)
feat_labels = X_train.columns
rf.fit(X_train, train['target'])

sfm = SelectFromModel(rf, threshold='median', prefit=True)
print('Number of features before selection: {}'.format(X_train.shape[1]))
n_features = sfm.transform(X_train).shape[1]
print('Number of features after selection: {}'.format(n_features))
selected_vars = list(feat_labels[sfm.get_support()])
Number of features before selection: 162
Number of features after selection: 81
train = train[selected_vars + ['target']]
test = test[['id'] + selected_vars]  # keep the same selected features in the test set

Prepare the Data for Model

Drop calc columns

Solutions that scored high on the leaderboard dropped the calc variables, so let's drop them here as well.

drop_col_train = train.columns[train.columns.str.startswith('ps_calc')]
drop_col_test = test.columns[test.columns.str.startswith('ps_calc')]
train.drop(drop_col_train, axis = 1, inplace = True)
test.drop(drop_col_test, axis = 1, inplace = True)
train_X = train.drop(['target'], axis = 1)  # 'id' was already removed during feature selection
train_y = train['target']
test_X = test.drop(['id'], axis = 1)

Prepare the model

Ensemble with cross-validation

class Ensemble(object):
    def __init__(self, n_splits, stacker, base_models):
        self.n_splits = n_splits
        self.stacker = stacker
        self.base_models = base_models

    def fit_predict(self, X, y, T):
        X = np.array(X)
        y = np.array(y)
        T = np.array(T)

        folds = list(StratifiedKFold(n_splits=self.n_splits, shuffle=True, random_state=314).split(X, y))

        S_train = np.zeros((X.shape[0], len(self.base_models)))
        S_test = np.zeros((T.shape[0], len(self.base_models)))
        for i, clf in enumerate(self.base_models):

            S_test_i = np.zeros((T.shape[0], self.n_splits))

            for j, (train_idx, test_idx) in enumerate(folds):
                X_train = X[train_idx]
                y_train = y[train_idx]
                X_holdout = X[test_idx]


                print ("Base model %d: fit %s model | fold %d" % (i+1, str(clf).split('(')[0], j+1))
                clf.fit(X_train, y_train)
                cross_score = cross_val_score(clf, X_train, y_train, cv=3, scoring='roc_auc')
                print("cross_score [roc-auc]: %.5f [gini]: %.5f" % (cross_score.mean(), 2*cross_score.mean()-1))
                y_pred = clf.predict_proba(X_holdout)[:,1]                

                S_train[test_idx, i] = y_pred
                S_test_i[:, j] = clf.predict_proba(T)[:,1]
            S_test[:, i] = S_test_i.mean(axis=1)

        results = cross_val_score(self.stacker, S_train, y, cv=3, scoring='roc_auc')
        # Calculate gini factor as 2 * AUC - 1
        print("Stacker score [gini]: %.5f" % (2 * results.mean() - 1))

        self.stacker.fit(S_train, y)
        result = self.stacker.predict_proba(S_test)[:,1]
        return result
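
The Gini scores printed by this class are just 2 * AUC - 1, the normalized Gini coefficient used as the competition metric. As a small standalone helper (not part of the original class above):

from sklearn.metrics import roc_auc_score

def gini_normalized(y_true, y_prob):
    # Normalized Gini coefficient; for a binary target this equals 2 * ROC AUC - 1
    return 2 * roc_auc_score(y_true, y_prob) - 1
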
# LightGBM params
# lgb_1
lgb_params1 = {}
lgb_params1['learning_rate'] = 0.02
lgb_params1['n_estimators'] = 650
lgb_params1['max_bin'] = 10
lgb_params1['subsample'] = 0.8
lgb_params1['subsample_freq'] = 10
lgb_params1['colsample_bytree'] = 0.8   
lgb_params1['min_child_samples'] = 500
lgb_params1['seed'] = 314

# lgb2
lgb_params2 = {}
lgb_params2['n_estimators'] = 1090
lgb_params2['learning_rate'] = 0.02
lgb_params2['colsample_bytree'] = 0.3   
lgb_params2['subsample'] = 0.7
lgb_params2['subsample_freq'] = 2
lgb_params2['num_leaves'] = 16
lgb_params2['seed'] = 314

# lgb3
lgb_params3 = {}
lgb_params3['n_estimators'] = 1100
lgb_params3['max_depth'] = 4
lgb_params3['learning_rate'] = 0.02
lgb_params3['seed'] = 314

# XGBoost params
xgb_params = {}
xgb_params['objective'] = 'binary:logistic'
xgb_params['learning_rate'] = 0.04
xgb_params['n_estimators'] = 490
xgb_params['max_depth'] = 4
xgb_params['subsample'] = 0.9
xgb_params['colsample_bytree'] = 0.9  
xgb_params['min_child_weight'] = 10
# Base models
lgb_model1 = LGBMClassifier(**lgb_params1)

lgb_model2 = LGBMClassifier(**lgb_params2)
       
lgb_model3 = LGBMClassifier(**lgb_params3)

xgb_model = XGBClassifier(**xgb_params)

# Stacking model
log_model = LogisticRegression()
stack = Ensemble(n_splits=5,
        stacker = log_model,
        base_models = (lgb_model1, lgb_model2, lgb_model3, xgb_model))
y_prediction = stack.fit_predict(train_X, train_y, test_X)        
Base model 1: fit LGBMClassifier model | fold 1
cross_score [roc-auc]: 0.63983 [gini]: 0.27966
Base model 1: fit LGBMClassifier model | fold 2
cross_score [roc-auc]: 0.63881 [gini]: 0.27762
Base model 1: fit LGBMClassifier model | fold 3
cross_score [roc-auc]: 0.63903 [gini]: 0.27805
Base model 1: fit LGBMClassifier model | fold 4
cross_score [roc-auc]: 0.63853 [gini]: 0.27706
Base model 1: fit LGBMClassifier model | fold 5
cross_score [roc-auc]: 0.63984 [gini]: 0.27969
Base model 2: fit LGBMClassifier model | fold 1
cross_score [roc-auc]: 0.63799 [gini]: 0.27598
Base model 2: fit LGBMClassifier model | fold 2
cross_score [roc-auc]: 0.63899 [gini]: 0.27799
Base model 2: fit LGBMClassifier model | fold 3
cross_score [roc-auc]: 0.63784 [gini]: 0.27567
Base model 2: fit LGBMClassifier model | fold 4
cross_score [roc-auc]: 0.63734 [gini]: 0.27469
Base model 2: fit LGBMClassifier model | fold 5
cross_score [roc-auc]: 0.63822 [gini]: 0.27644
Base model 3: fit LGBMClassifier model | fold 1
cross_score [roc-auc]: 0.63656 [gini]: 0.27313
Base model 3: fit LGBMClassifier model | fold 2
cross_score [roc-auc]: 0.63683 [gini]: 0.27367
Base model 3: fit LGBMClassifier model | fold 3
cross_score [roc-auc]: 0.63650 [gini]: 0.27300
Base model 3: fit LGBMClassifier model | fold 4
cross_score [roc-auc]: 0.63494 [gini]: 0.26988
Base model 3: fit LGBMClassifier model | fold 5
cross_score [roc-auc]: 0.63688 [gini]: 0.27375
Base model 4: fit XGBClassifier model | fold 1
cross_score [roc-auc]: 0.63875 [gini]: 0.27750
Base model 4: fit XGBClassifier model | fold 2
cross_score [roc-auc]: 0.63857 [gini]: 0.27713
Base model 4: fit XGBClassifier model | fold 3
cross_score [roc-auc]: 0.63861 [gini]: 0.27722
Base model 4: fit XGBClassifier model | fold 4
cross_score [roc-auc]: 0.63741 [gini]: 0.27481
Base model 4: fit XGBClassifier model | fold 5
cross_score [roc-auc]: 0.63914 [gini]: 0.27828
Stacker score [gini]: 0.28532
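
The stacked predictions can then be written out in the competition's submission format (a minimal sketch; the file name is arbitrary):

submission = pd.DataFrame({'id': test['id'], 'target': y_prediction})
submission.to_csv('stacked_submission.csv', index=False)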
