[Kaggle] Home Credit Default Risk Competition
Dataset Description
- application_{train|test}.csv : This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET). Static data for all applications; one row represents one loan in our data sample.
- bureau.csv : All clients' previous credits provided by other financial institutions that were reported to the Credit Bureau (for clients who have a loan in our sample). For every loan in our sample, there are as many rows as the number of credits the client had in the Credit Bureau before the application date.
- bureau_balance.csv : Monthly balances of previous credits in the Credit Bureau. One row for each month of history of every previous credit reported to the Credit Bureau, i.e. the table has (#loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows.
- POS_CASH_balance.csv : Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit. One row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample, i.e. the table has (#loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows.
- credit_card_balance.csv : Monthly balance snapshots of previous credit cards that the applicant has with Home Credit. One row for each month of history of every previous credit card in Home Credit related to loans in our sample, i.e. the table has (#loans in sample * # of relative previous credit cards * # of months where we have some history observable for the previous credit card) rows.
- previous_application.csv : All previous applications for Home Credit loans of clients who have loans in our sample. One row for each previous application related to loans in our data sample.
- installments_payments.csv : Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample. There is a) one row for every payment that was made plus b) one row for each missed payment. One row is equivalent to one payment of one installment OR one installment corresponding to one payment of one previous Home Credit credit related to loans in our sample.
- HomeCredit_columns_description.csv : This file contains descriptions for the columns in the various data files.
Loading Packages
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import os
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
Read Data
app_train = pd.read_csv('application_train.csv')
app_test = pd.read_csv('application_test.csv')
print('Training data shape : ', app_train.shape)
app_train.head()
Training data shape : (307511, 122)
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
Exploratory Data Analysis
Target Variable
sns.countplot(app_train['TARGET'])
print(app_train['TARGET'].value_counts())
0 282686
1 24825
Name: TARGET, dtype: int64
The TARGET values are imbalanced: there are far more loans that were repaid on time than loans that were not. When we move on to machine learning models, we can account for this imbalance by weighting the classes according to their representation in the data (see the sketch below).
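As a hedged sketch of one way to do this (not used in the rest of this notebook), scikit-learn lets us pass class weights, either precomputed with compute_class_weight or via the class_weight = 'balanced' shortcut; the LogisticRegression settings here are illustrative only.

import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.linear_model import LogisticRegression

# Weight each class inversely to its frequency in the training labels
classes = np.array([0, 1])
weights = compute_class_weight(class_weight = 'balanced', classes = classes, y = app_train['TARGET'])
print(dict(zip(classes, weights)))   # roughly {0: 0.54, 1: 6.19} given the counts above

# Equivalent shortcut built into the estimator itself
weighted_model = LogisticRegression(class_weight = 'balanced', max_iter = 1000)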
Missing Values
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()
    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    # Create the missing values table
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis = 1)
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(columns = {0 : 'Missing Values', 1 : '% of Total Values'})
    # Keep only columns with missing values, sorted by percentage descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values('% of Total Values', ascending = False).round(1)
    print("Your selected dataframe has " + str(df.shape[1]) + " columns. \nThere are " + str(mis_val_table_ren_columns.shape[0]) + " columns that have missing values.")
    return mis_val_table_ren_columns
missing_values = missing_values_table(app_train)
missing_values
Your selected dataframe has 122 columns.
There are 67 columns that have missing values.
Missing Values | % of Total Values | |
---|---|---|
COMMONAREA_MEDI | 214865 | 69.9 |
COMMONAREA_AVG | 214865 | 69.9 |
COMMONAREA_MODE | 214865 | 69.9 |
NONLIVINGAPARTMENTS_MEDI | 213514 | 69.4 |
NONLIVINGAPARTMENTS_MODE | 213514 | 69.4 |
... | ... | ... |
EXT_SOURCE_2 | 660 | 0.2 |
AMT_GOODS_PRICE | 278 | 0.1 |
AMT_ANNUITY | 12 | 0.0 |
CNT_FAM_MEMBERS | 2 | 0.0 |
DAYS_LAST_PHONE_CHANGE | 1 | 0.0 |
67 rows × 2 columns
Columns types
app_train.dtypes.value_counts()
float64 65
int64 41
object 16
dtype: int64
Object (Categorical) columns
app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)
NAME_CONTRACT_TYPE 2
CODE_GENDER 3
FLAG_OWN_CAR 2
FLAG_OWN_REALTY 2
NAME_TYPE_SUITE 7
NAME_INCOME_TYPE 8
NAME_EDUCATION_TYPE 5
NAME_FAMILY_STATUS 6
NAME_HOUSING_TYPE 6
OCCUPATION_TYPE 18
WEEKDAY_APPR_PROCESS_START 7
ORGANIZATION_TYPE 58
FONDKAPREMONT_MODE 4
HOUSETYPE_MODE 3
WALLSMATERIAL_MODE 7
EMERGENCYSTATE_MODE 2
dtype: int64
Encoding
- Label Encoding : assigns an arbitrary integer to each category; since these integers carry no intrinsic meaning about the categories, the model may wrongly exploit their relative magnitudes when assigning weights.
- One-Hot Encoding : creates a separate binary column for each category value, but it can run into the Dummy Variable Trap.
- Dummy Variable Trap : one of the one-hot encoded columns can be predicted from the others, which means the variables are correlated with each other and causes multicollinearity. (A small sketch of both encodings follows below.)
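A minimal sketch of the two encodings on a toy column; the toy data and the drop_first choice are illustrative only and not part of the original analysis.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({'NAME_CONTRACT_TYPE': ['Cash loans', 'Revolving loans', 'Cash loans']})

# Label encoding: a single integer column, fine when there are only 2 categories
print(LabelEncoder().fit_transform(toy['NAME_CONTRACT_TYPE']))   # [0 1 0]

# One-hot encoding: one indicator column per category; drop_first=True removes one
# column so the remaining ones are not perfectly collinear (avoids the dummy variable trap)
print(pd.get_dummies(toy, drop_first = True))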
le = LabelEncoder()
le_count = 0
# Label encode every categorical column with 2 or fewer unique values
for col in app_train:
    if app_train[col].dtype == 'object':
        if len(list(app_train[col].unique())) <= 2:
            # Fit on the training data, then transform both train and test
            le.fit(app_train[col])
            app_train[col] = le.transform(app_train[col])
            app_test[col] = le.transform(app_test[col])
            le_count += 1
print('%d columns were label encoded.' % le_count)
3 columns were label encoded.
print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)
Training Features shape: (307511, 122)
Testing Features shape: (48744, 121)
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)
print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)
Training Features shape: (307511, 243)
Testing Features shape: (48744, 239)
Aligning Training and Testing Data
train_y = app_train['TARGET']
app_train, app_test = app_train.align(app_test, join = 'inner', axis = 1)
app_train['TARGET'] = train_y
print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)
Training Features shape: (307511, 240)
Testing Features shape: (48744, 239)
Anomalies
(app_train['DAYS_BIRTH'] / -365).describe()
count 307511.000000
mean 43.936973
std 11.956133
min 20.517808
25% 34.008219
50% 43.150685
75% 53.923288
max 69.120548
Name: DAYS_BIRTH, dtype: float64
Since DAYS_BIRTH is recorded relative to the current loan application, it is negative; dividing it by -365 gives the applicants' ages in years.
(app_train['DAYS_EMPLOYED']).describe()
count 307511.000000
mean 63815.045904
std 141275.766519
min -17912.000000
25% -2760.000000
50% -1213.000000
75% -289.000000
max 365243.000000
Name: DAYS_EMPLOYED, dtype: float64
The max value is 365243 days, which is more than 1000 years, so it is almost certainly an anomaly.
plt.hist(app_train['DAYS_EMPLOYED'])
plt.title('Days Employment Histogram')
Text(0.5, 1.0, 'Days Employment Histogram')
anom = app_train[app_train['DAYS_EMPLOYED'] == 365243]
non_anom = app_train[app_train['DAYS_EMPLOYED'] != 365243]
print('The non-anomalies default on %0.2f%% of loans' % (100 * non_anom['TARGET'].mean()))
print('The anomalies default on %0.2f%% of loans' % (100 * anom['TARGET'].mean()))
print('There are %d anomalous days of employment' % len(anom))
The non-anomalies default on 8.66% of loans
The anomalies default on 5.40% of loans
There are 55374 anomalous days of employment
There are 55374 of these anomalous values, and they actually default at a lower rate than the non-anomalous loans. One of the safest ways to handle anomalies is to set them to missing and then impute the missing values. First, let's look at the distribution with the anomalies excluded.
app_train['DYAS_EMPLOYED_ANOM'] = app_train['DAYS_EMPLOYED'] == 365243
app_train['DAYS_EMPLOYED'].replace({365243: np.nan}, inplace = True)
plt.hist(app_train['DAYS_EMPLOYED'])
plt.title('Days Employment Histogram')
Text(0.5, 1.0, 'Days Employment Histogram')
The distribution now looks as expected. Let's apply the same treatment to the test set!
app_test['DAYS_EMPLOYED_ANOM'] = app_test["DAYS_EMPLOYED"] == 365243
app_test["DAYS_EMPLOYED"].replace({365243: np.nan}, inplace = True)
print('There are %d anomalies in the test data out of %d entries' % (app_test["DAYS_EMPLOYED_ANOM"].sum(), len(app_test)))
There are 9274 anomalies in the test data out of 48744 entries
Correlations
The correlation coefficient is not the best way to measure relevance, but it does give a first look at the relationships between the variables and the target. A common interpretation of its magnitude is the following (the Pearson formula is recalled just after the list):
- .00-.19 “very weak”
- .20-.39 “weak”
- .40-.59 “moderate”
- .60-.79 “strong”
- .80-1.0 “very strong”
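For reference, the Pearson correlation coefficient computed below is

$r_{XY} = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$

and it always lies in $[-1, 1]$.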
# Find correlations with the target and sort
correlations = app_train.corr()['TARGET'].sort_values()
# Display correlations
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))
Most Positive Correlations:
OCCUPATION_TYPE_Laborers 0.043019
FLAG_DOCUMENT_3 0.044346
REG_CITY_NOT_LIVE_CITY 0.044395
FLAG_EMP_PHONE 0.045982
NAME_EDUCATION_TYPE_Secondary / secondary special 0.049824
REG_CITY_NOT_WORK_CITY 0.050994
DAYS_ID_PUBLISH 0.051457
CODE_GENDER_M 0.054713
DAYS_LAST_PHONE_CHANGE 0.055218
NAME_INCOME_TYPE_Working 0.057481
REGION_RATING_CLIENT 0.058899
REGION_RATING_CLIENT_W_CITY 0.060893
DAYS_EMPLOYED 0.074958
DAYS_BIRTH 0.078239
TARGET 1.000000
Name: TARGET, dtype: float64
Most Negative Correlations:
EXT_SOURCE_3 -0.178919
EXT_SOURCE_2 -0.160472
EXT_SOURCE_1 -0.155317
NAME_EDUCATION_TYPE_Higher education -0.056593
CODE_GENDER_F -0.054704
NAME_INCOME_TYPE_Pensioner -0.046209
DYAS_EMPLOYED_ANOM -0.045987
ORGANIZATION_TYPE_XNA -0.045987
FLOORSMAX_AVG -0.044003
FLOORSMAX_MEDI -0.043768
FLOORSMAX_MODE -0.043226
EMERGENCYSTATE_MODE_No -0.042201
HOUSETYPE_MODE_block of flats -0.040594
AMT_GOODS_PRICE -0.039645
REGION_POPULATION_RELATIVE -0.037227
Name: TARGET, dtype: float64
DAYS_BIRTH has the largest positive correlation with the target. DAYS_BIRTH is the client's age in days at the time of the loan, stored as a negative number.
The correlation is positive, but because the variable itself is negative, this actually means that the older the client, the less likely they are to default.
Since negative ages are confusing, let's take the absolute value!
# Find the correlation of the positive days since birth and target
app_train['DAYS_BIRTH'] = abs(app_train['DAYS_BIRTH'])
plt.style.use('seaborn-whitegrid')
plt.hist(app_train['DAYS_BIRTH'] / 365, edgecolor = 'k', bins = 25)
plt.title('Age of Client')
plt.xlabel('Age (years)')
plt.ylabel('Count')
Text(0, 0.5, 'Count')
plt.figure(figsize = (10, 8))
# KDE plot of loans that were repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, 'DAYS_BIRTH'] / 365, label = 'target == 0')
# KDE plot of loans which were not repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, 'DAYS_BIRTH'] / 365, label = 'target == 1')
# Labeling of plot
plt.xlabel('Age (years)')
plt.ylabel('Density')
plt.title('Distribution of Ages')
plt.legend(loc='upper right')
<matplotlib.legend.Legend at 0x7f7c5aa53550>
The target == 1 curve is skewed toward the younger end of the age range.
To look at this more closely, let's plot the average failure-to-repay rate by age group.
# Age information into a separate dataframe
age_data = app_train[['TARGET', 'DAYS_BIRTH']]
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'] / 365
# Bin the age data
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins = np.linspace(20, 70, num = 11))
age_data.head(10)
TARGET | DAYS_BIRTH | YEARS_BIRTH | YEARS_BINNED | |
---|---|---|---|---|
0 | 1 | 9461 | 25.920548 | (25.0, 30.0] |
1 | 0 | 16765 | 45.931507 | (45.0, 50.0] |
2 | 0 | 19046 | 52.180822 | (50.0, 55.0] |
3 | 0 | 19005 | 52.068493 | (50.0, 55.0] |
4 | 0 | 19932 | 54.608219 | (50.0, 55.0] |
5 | 0 | 16941 | 46.413699 | (45.0, 50.0] |
6 | 0 | 13778 | 37.747945 | (35.0, 40.0] |
7 | 0 | 18850 | 51.643836 | (50.0, 55.0] |
8 | 0 | 20099 | 55.065753 | (55.0, 60.0] |
9 | 0 | 14469 | 39.641096 | (35.0, 40.0] |
# Group by the bin and calculate averages
age_groups = age_data.groupby('YEARS_BINNED').mean()
age_groups
TARGET | DAYS_BIRTH | YEARS_BIRTH | |
---|---|---|---|
YEARS_BINNED | |||
(20.0, 25.0] | 0.123036 | 8532.795625 | 23.377522 |
(25.0, 30.0] | 0.111436 | 10155.219250 | 27.822518 |
(30.0, 35.0] | 0.102814 | 11854.848377 | 32.479037 |
(35.0, 40.0] | 0.089414 | 13707.908253 | 37.555913 |
(40.0, 45.0] | 0.078491 | 15497.661233 | 42.459346 |
(45.0, 50.0] | 0.074171 | 17323.900441 | 47.462741 |
(50.0, 55.0] | 0.066968 | 19196.494791 | 52.593136 |
(55.0, 60.0] | 0.055314 | 20984.262742 | 57.491131 |
(60.0, 65.0] | 0.052737 | 22780.547460 | 62.412459 |
(65.0, 70.0] | 0.037270 | 24292.614340 | 66.555108 |
plt.figure(figsize = (8, 8))
# Graph the age bins and the average of the target as a bar plot
plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'])
# Plot labeling
plt.xticks(rotation = 75); plt.xlabel('Age Group (years)'); plt.ylabel('Failure to Repay (%)')
plt.title('Failure to Repay by Age Group');
The trend is clear: younger clients fail to repay more often.
The three youngest age groups all have failure rates above 10%, while the oldest group is below 5%.
Exterior Sources
The three variables with the strongest negative correlations with TARGET are EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3, which are normalized scores from external data sources.
We don't know exactly what they measure, but they may be cumulative credit ratings built from various data sources.
ext_data = app_train[['TARGET', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]
ext_data_corrs = ext_data.corr()
ext_data_corrs
TARGET | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | DAYS_BIRTH | |
---|---|---|---|---|---|
TARGET | 1.000000 | -0.155317 | -0.160472 | -0.178919 | -0.078239 |
EXT_SOURCE_1 | -0.155317 | 1.000000 | 0.213982 | 0.186846 | 0.600610 |
EXT_SOURCE_2 | -0.160472 | 0.213982 | 1.000000 | 0.109167 | 0.091996 |
EXT_SOURCE_3 | -0.178919 | 0.186846 | 0.109167 | 1.000000 | 0.205478 |
DAYS_BIRTH | -0.078239 | 0.600610 | 0.091996 | 0.205478 | 1.000000 |
plt.figure(figsize = (8, 6))
# Heatmap of correlations
sns.heatmap(ext_data_corrs, cmap = plt.cm.RdYlBu_r, vmin = -0.25, annot = True, vmax = 0.6)
plt.title('Correlation Heatmap')
Text(0.5, 1.0, 'Correlation Heatmap')
All three EXT_SOURCE variables have negative correlations with TARGET, while EXT_SOURCE_1 has a positive correlation with DAYS_BIRTH.
This suggests that the client's age may be one of the factors behind the EXT_SOURCE_1 score.
plt.figure(figsize = (10, 12))
# iterate through the sources
for i, source in enumerate(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']):
    # create a new subplot for each source
    plt.subplot(3, 1, i + 1)
    # plot repaid loans
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, source], label = 'target == 0')
    # plot loans that were not repaid
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, source], label = 'target == 1')
    # Label the plots
    plt.title('Distribution of %s by Target Value' % source)
    plt.xlabel('%s' % source)
    plt.ylabel('Density')
    plt.legend(loc='upper right')
plt.tight_layout(h_pad = 2.5)
With respect to repayment, EXT_SOURCE_3 and EXT_SOURCE_1 show the largest difference in distribution between the two target values.
Feature Engineering
Polynomial Features
Polynomial features are built from powers of the existing variables and interactions between them. Each variable on its own may have only a slight effect on the target, but combined into an interaction term the effect can be much stronger.
They are often used in statistical models to capture the joint effect of several variables, but they are used less often in machine learning.
# Make a new dataframe for polynomial features
poly_features = app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'TARGET']]
poly_features_test = app_test[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]
# imputer for handling missing values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'median')
poly_target = poly_features['TARGET']
poly_features = poly_features.drop(columns = ['TARGET'])
# Need to impute missing values
poly_features = imputer.fit_transform(poly_features)
poly_features_test = imputer.transform(poly_features_test)
from sklearn.preprocessing import PolynomialFeatures
# Create the polynomial object with specified degree
poly_transformer = PolynomialFeatures(degree = 3)
# Train the polynomial features
poly_transformer.fit(poly_features)
# Transform the features
poly_features = poly_transformer.transform(poly_features)
poly_features_test = poly_transformer.transform(poly_features_test)
print('Polynomial Features shape: ', poly_features.shape)
Polynomial Features shape: (307511, 35)
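A quick sanity check on the 35 columns: PolynomialFeatures with degree = 3 and the default include_bias = True emits every monomial of total degree at most 3 in the 4 inputs, which is C(4 + 3, 3) = 35.

# Number of monomials of total degree <= 3 in 4 variables, bias term included
from math import comb
print(comb(4 + 3, 3))   # 35, matching the shape printed above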
poly_transformer.get_feature_names(input_features = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH'])[:15]
['1',
'EXT_SOURCE_1',
'EXT_SOURCE_2',
'EXT_SOURCE_3',
'DAYS_BIRTH',
'EXT_SOURCE_1^2',
'EXT_SOURCE_1 EXT_SOURCE_2',
'EXT_SOURCE_1 EXT_SOURCE_3',
'EXT_SOURCE_1 DAYS_BIRTH',
'EXT_SOURCE_2^2',
'EXT_SOURCE_2 EXT_SOURCE_3',
'EXT_SOURCE_2 DAYS_BIRTH',
'EXT_SOURCE_3^2',
'EXT_SOURCE_3 DAYS_BIRTH',
'DAYS_BIRTH^2']
# Create a dataframe of the features
poly_features = pd.DataFrame(poly_features,
columns = poly_transformer.get_feature_names(['EXT_SOURCE_1', 'EXT_SOURCE_2',
'EXT_SOURCE_3', 'DAYS_BIRTH']))
# Add in the target
poly_features['TARGET'] = poly_target
# Find the correlations with the target
poly_corrs = poly_features.corr()['TARGET'].sort_values()
# Display most negative and most positive
print(poly_corrs.head(10))
print(poly_corrs.tail(5))
EXT_SOURCE_2 EXT_SOURCE_3 -0.193939
EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 -0.189605
EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH -0.181283
EXT_SOURCE_2^2 EXT_SOURCE_3 -0.176428
EXT_SOURCE_2 EXT_SOURCE_3^2 -0.172282
EXT_SOURCE_1 EXT_SOURCE_2 -0.166625
EXT_SOURCE_1 EXT_SOURCE_3 -0.164065
EXT_SOURCE_2 -0.160295
EXT_SOURCE_2 DAYS_BIRTH -0.156873
EXT_SOURCE_1 EXT_SOURCE_2^2 -0.156867
Name: TARGET, dtype: float64
DAYS_BIRTH -0.078239
DAYS_BIRTH^2 -0.076672
DAYS_BIRTH^3 -0.074273
TARGET 1.000000
1 NaN
Name: TARGET, dtype: float64
Some of the new variables have a larger absolute correlation with TARGET than the original variables do.
Let's evaluate models with and without the new variables to decide whether to keep them (see the cross-validation sketch after the merge below)!
# Put test features into dataframe
poly_features_test = pd.DataFrame(poly_features_test,
columns = poly_transformer.get_feature_names(['EXT_SOURCE_1', 'EXT_SOURCE_2',
'EXT_SOURCE_3', 'DAYS_BIRTH']))
# Merge polynomial features into training dataframe
poly_features['SK_ID_CURR'] = app_train['SK_ID_CURR']
app_train_poly = app_train.merge(poly_features, on = 'SK_ID_CURR', how = 'left')
# Merge polynomial features into testing dataframe
poly_features_test['SK_ID_CURR'] = app_test['SK_ID_CURR']
app_test_poly = app_test.merge(poly_features_test, on = 'SK_ID_CURR', how = 'left')
# Align the dataframes
app_train_poly, app_test_poly = app_train_poly.align(app_test_poly, join = 'inner', axis = 1)
# Print out the new shapes
print('Training data with polynomial features shape: ', app_train_poly.shape)
print('Testing data with polynomial features shape: ', app_test_poly.shape)
Training data with polynomial features shape: (307511, 274)
Testing data with polynomial features shape: (48744, 274)
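One hedged way to make the comparison mentioned above is cross-validated ROC AUC with identical preprocessing for both feature sets. This is only a sketch: the pipeline, fold count, and solver settings are arbitrary choices, and app_train, app_train_poly, and train_y come from the cells above.

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_auc(df):
    # Drop the target if present, impute, scale, then score a simple linear model
    X = df.drop(columns = ['TARGET'], errors = 'ignore')
    model = make_pipeline(SimpleImputer(strategy = 'median'),
                          MinMaxScaler(),
                          LogisticRegression(C = 0.0001, max_iter = 1000))
    return cross_val_score(model, X, train_y, cv = 3, scoring = 'roc_auc').mean()

print('Baseline features CV AUC    : %0.3f' % cv_auc(app_train))
print('+ polynomial features CV AUC: %0.3f' % cv_auc(app_train_poly))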
Next, add a few domain-knowledge features as ratios of existing columns (CREDIT_INCOME_PERCENT, ANNUITY_INCOME_PERCENT, CREDIT_TERM, DAYS_EMPLOYED_PERCENT) on copies of the train and test data.
app_train_domain = app_train.copy()
app_test_domain = app_test.copy()
app_train_domain['CREDIT_INCOME_PERCENT'] = app_train_domain['AMT_CREDIT'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['ANNUITY_INCOME_PERCENT'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['CREDIT_TERM'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_CREDIT']
app_train_domain['DAYS_EMPLOYED_PERCENT'] = app_train_domain['DAYS_EMPLOYED'] / app_train_domain['DAYS_BIRTH']
app_test_domain['CREDIT_INCOME_PERCENT'] = app_test_domain['AMT_CREDIT'] / app_test_domain['AMT_INCOME_TOTAL']
app_test_domain['ANNUITY_INCOME_PERCENT'] = app_test_domain['AMT_ANNUITY'] / app_test_domain['AMT_INCOME_TOTAL']
app_test_domain['CREDIT_TERM'] = app_test_domain['AMT_ANNUITY'] / app_test_domain['AMT_CREDIT']
app_test_domain['DAYS_EMPLOYED_PERCENT'] = app_test_domain['DAYS_EMPLOYED'] / app_test_domain['DAYS_BIRTH']
plt.figure(figsize = (12, 20))
# iterate through the new features
for i, feature in enumerate(['CREDIT_INCOME_PERCENT', 'ANNUITY_INCOME_PERCENT', 'CREDIT_TERM', 'DAYS_EMPLOYED_PERCENT']):
    # create a new subplot for each feature
    plt.subplot(4, 1, i + 1)
    # plot repaid loans
    sns.kdeplot(app_train_domain.loc[app_train_domain['TARGET'] == 0, feature], label = 'target == 0')
    # plot loans that were not repaid
    sns.kdeplot(app_train_domain.loc[app_train_domain['TARGET'] == 1, feature], label = 'target == 1')
    # Label the plots
    plt.title('Distribution of %s by Target Value' % feature)
    plt.xlabel('%s' % feature)
    plt.ylabel('Density')
    plt.legend(loc = 'best')
plt.tight_layout(h_pad = 2.5)
Baseline
Logistic Regression Implementation
After encoding the categorical variables, use all of the features to build a baseline model.
Preprocess the data by filling in the missing values and normalizing the feature ranges.
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
# Drop the TARGET variable
if 'TARGET' in app_train:
train = app_train.drop(columns = ['TARGET'])
else:
train = app_train.copy()
features = list(train.columns)
test = app_test.copy()
imputer = SimpleImputer(strategy = 'median')
scaler = MinMaxScaler(feature_range = (0, 1))
imputer.fit(train)
train = imputer.transform(train)
test = imputer.transform(app_test)
print('Training data shape : ', train.shape)
print('Test data shape :', test.shape)
Training data shape : (307511, 240)
Test data shape : (48744, 240)
Let's build a basic LogisticRegression where we only set the regularization parameter C, which controls overfitting.
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(C = 0.0001)
log_reg.fit(train, train_y)
LogisticRegression(C=0.0001)
To predict the probability of default, use the predict_proba method.
It returns an m × 2 array: the first column is the probability of target 0 and the second column is the probability of target 1. We want the probability that the loan is not repaid, so we take the second column.
log_reg_pred = log_reg.predict_proba(test)[:, 1]
log_reg_pred
array([0.62609191, 0.52680293, 0.47829203, ..., 0.39582795, 0.32130762,
0.44120819])
Submitting this baseline (submit) gives a score of 0.671 (a quick local ROC AUC check is sketched after the next cell).
submit = app_test[['SK_ID_CURR']]
submit['TARGET'] = log_reg_pred
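Since the competition metric is ROC AUC, we can also sanity-check the baseline locally before submitting. This is only a sketch, not part of the original run; it reuses the preprocessed train array and train_y labels from above, and the split size and random_state are arbitrary choices.

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

# Hold out 20% of the training data and score the same kind of model on it
X_tr, X_val, y_tr, y_val = train_test_split(train, train_y, test_size = 0.2,
                                            random_state = 42, stratify = train_y)
val_model = LogisticRegression(C = 0.0001)
val_model.fit(X_tr, y_tr)
val_pred = val_model.predict_proba(X_val)[:, 1]
print('Validation ROC AUC: %0.3f' % roc_auc_score(y_val, val_pred))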
Improved Model
Random Forest
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)
random_forest.fit(train, train_y)
feature_importance_values = random_forest.feature_importances_
feature_importances = pd.DataFrame({'feature' : features, 'importance' : feature_importance_values})
predictions = random_forest.predict_proba(test)[:, 1]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 10.0s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 25.6s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.1s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed: 0.2s finished
submit = app_test[['SK_ID_CURR']]
submit['TARGET'] = predictions
submit.to_csv('random_forest_baseline.csv', index = False)
The Random Forest scores 0.678.
+ Polynomial features
poly_features_names = list(app_train_poly.columns)
# Impute the polynomial features
imputer = SimpleImputer(strategy = 'median')
poly_features = imputer.fit_transform(app_train_poly)
poly_features_test = imputer.transform(app_test_poly)
# Scale the polynomial features
scaler = MinMaxScaler(feature_range = (0, 1))
poly_features = scaler.fit_transform(poly_features)
poly_features_test = scaler.transform(poly_features_test)
random_forest_poly = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)
# Train on the training data
random_forest_poly.fit(poly_features, train_y)
# Make predictions on the test data
predictions = random_forest_poly.predict_proba(poly_features_test)[:, 1]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 15.6s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 40.3s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.1s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed: 0.1s finished
The random forest with the polynomial features added also scores 0.678.
Bureau DataSet
- bureau : each client's previous credits from other financial institutions that were reported to the Credit Bureau, for clients who have a loan with Home Credit (one row per previous credit).
- bureau_balance : monthly data for the previous credits in bureau (one row per month of each previous credit).
Counts of a client’s previous loans
bureau = pd.read_csv('bureau.csv')
bureau.head()
SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
Group by SK_ID_CURR to count each client's previous loans, then rename the resulting column.
previous_loan_counts = bureau.groupby('SK_ID_CURR', as_index = False)['SK_ID_BUREAU'].count().rename(columns = {'SK_ID_BUREAU' : 'previous_loan_counts'})
previous_loan_counts.head()
SK_ID_CURR | previous_loan_counts | |
---|---|---|
0 | 100001 | 7 |
1 | 100002 | 8 |
2 | 100003 | 4 |
3 | 100004 | 2 |
4 | 100005 | 3 |
Merge with application_train.
train = pd.read_csv('application_train.csv')
train = train.merge(previous_loan_counts, on = 'SK_ID_CURR', how = 'left')
train.head()
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | previous_loan_counts | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 |
1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
5 rows × 123 columns
train['previous_loan_counts'] = train['previous_loan_counts'].fillna(0)
train.head()
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | previous_loan_counts | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 |
1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 |
4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
5 rows × 123 columns
To evaluate the usefulness of the newly added previous_loan_counts variable, let's look at its Pearson correlation coefficient with the target.
The larger the absolute r-value against the target, the more likely the variable is to be useful, so looking for the variables with the largest absolute r-values is one way to improve the model (a quick sketch of this follows).
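As a quick illustration of that idea (a sketch, not part of the original output), the numeric columns with the largest absolute correlation with the target can be listed directly:

# Rank the numeric columns of the merged training data by |correlation with TARGET|
corr_with_target = train.select_dtypes('number').corr()['TARGET']
print(corr_with_target.drop('TARGET').abs().sort_values(ascending = False).head(10))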
A kernel density estimate (KDE) plot shows the distribution of a single variable; the helper below draws it colored by the value of the target.
# Plots the distribution of a variable colored by value of the target
def kde_target(var_name, df):
    '''
    Args:
        var_name : str, name of the column to plot
        df : DataFrame, data containing the TARGET column
    Returns: None
    '''
    # Calculate the correlation coefficient between the new variable and the target
    corr = df['TARGET'].corr(df[var_name])
    # Calculate medians for repaid vs not repaid
    avg_repaid = df.loc[df['TARGET'] == 0, var_name].median()
    avg_not_repaid = df.loc[df['TARGET'] == 1, var_name].median()
    plt.figure(figsize = (12, 6))
    # Plot the distribution for target == 0 and target == 1
    sns.kdeplot(df.loc[df['TARGET'] == 0, var_name], label = 'TARGET == 0')
    sns.kdeplot(df.loc[df['TARGET'] == 1, var_name], label = 'TARGET == 1')
    # Label the plot
    plt.xlabel(var_name); plt.ylabel('Density'); plt.title('%s Distribution' % var_name)
    plt.legend()
    # Print out the correlation
    print('The correlation between %s and the TARGET is %0.4f' % (var_name, corr))
    # Print out median values
    print('Median value for loan that was not repaid = %0.4f' % avg_not_repaid)
    print('Median value for loan that was repaid = %0.4f' % avg_repaid)
kde_target('EXT_SOURCE_3', train)
The correlation between EXT_SOURCE_3 and the TARGET is -0.1789
Median value for loan that was not repaid = 0.3791
Median value for loan that was repaid = 0.5460
kde_target('previous_loan_counts', train)
The correlation between previous_loan_counts and the TARGET is -0.0100
Median value for loan that was not repaid = 3.0000
Median value for loan that was repaid = 4.0000
The newly created previous_loan_counts variable has a very small correlation with the target, and its distribution barely differs between the two target values.
Aggregating Numeric Columns
To make use of the numeric columns in bureau, let's compute summary statistics (count, mean, max, min, sum) per client and merge them into the training set (a sketch of the merge follows the aggregated table).
bureau_agg = bureau.drop(columns = ['SK_ID_BUREAU']).groupby('SK_ID_CURR', as_index = False).agg(['count','mean','max','min','sum']).reset_index()
bureau_agg.head()
SK_ID_CURR | DAYS_CREDIT | CREDIT_DAY_OVERDUE | ... | DAYS_CREDIT_UPDATE | AMT_ANNUITY | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | max | min | sum | count | mean | max | min | ... | count | mean | max | min | sum | count | mean | max | min | sum | ||
0 | 100001 | 7 | -735.000000 | -49 | -1572 | -5145 | 7 | 0.0 | 0 | 0 | ... | 7 | -93.142857 | -6 | -155 | -652 | 7 | 3545.357143 | 10822.5 | 0.0 | 24817.5 |
1 | 100002 | 8 | -874.000000 | -103 | -1437 | -6992 | 8 | 0.0 | 0 | 0 | ... | 8 | -499.875000 | -7 | -1185 | -3999 | 7 | 0.000000 | 0.0 | 0.0 | 0.0 |
2 | 100003 | 4 | -1400.750000 | -606 | -2586 | -5603 | 4 | 0.0 | 0 | 0 | ... | 4 | -816.000000 | -43 | -2131 | -3264 | 0 | NaN | NaN | NaN | 0.0 |
3 | 100004 | 2 | -867.000000 | -408 | -1326 | -1734 | 2 | 0.0 | 0 | 0 | ... | 2 | -532.000000 | -382 | -682 | -1064 | 0 | NaN | NaN | NaN | 0.0 |
4 | 100005 | 3 | -190.666667 | -62 | -373 | -572 | 3 | 0.0 | 0 | 0 | ... | 3 | -54.333333 | -11 | -121 | -163 | 3 | 1420.500000 | 4261.5 | 0.0 | 4261.5 |
5 rows × 61 columns
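The aggregated table has a two-level column index. A hedged sketch of the merge promised above: flatten the (variable, statistic) pairs into single-level names and join them onto the training data. The naming scheme is an arbitrary choice, and the code assumes the layout printed above, where ('SK_ID_CURR', '') is the first column.

# Flatten the MultiIndex columns into names like 'bureau_DAYS_CREDIT_mean',
# then merge the aggregates into the training data on SK_ID_CURR
bureau_agg.columns = ['SK_ID_CURR'] + ['bureau_%s_%s' % (var, stat)
                                       for var, stat in bureau_agg.columns[1:]]
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')
print('Training data with bureau aggregates: ', train.shape)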