Data Transformation

[2]:
import numpy as np
import pandas as pd
[10]:
train_data = pd.read_csv('./train.csv')
train_data.head()
[10]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

Types of Data

Numerical data

  • Data made up of continuous numeric values

    • e.g., Age, Fare

Categorical data

  • Data whose values are not continuous (in most cases, anything other than numbers)

    • e.g., Name, Sex, Ticket, Cabin, Embarked

  • Sometimes a column stored as a number should still be treated as categorical because of what it means

    • e.g., Pclass
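
As a rough illustration of the distinction above, the sketch below splits columns by how pandas stores them. The small DataFrame is made up for this example and only mirrors a few Titanic columns; note that Pclass lands on the numeric side even though it is conceptually categorical.

```python
import pandas as pd

# Toy rows mirroring a few Titanic columns (illustrative values only).
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 3],
})

# Columns stored as numbers vs. everything else.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)  # includes Pclass, because it is stored as an integer
print(other_cols)
```

The storage type is only a starting point; deciding whether a numeric column is really a category still takes knowledge of the data.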

Checking Data Types

[4]:
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
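
The same information that info() prints can also be pulled out as values, which may be handy when it needs to be used programmatically. A minimal sketch on a made-up DataFrame (the rows are not from train.csv):

```python
import numpy as np
import pandas as pd

# Toy frame with some missing values (illustrative only).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [None, "C85", None],
    "Pclass": [3, 1, 3],
})

# Per-column dtype, as strings.
dtypes = df.dtypes.astype(str).to_dict()

# Per-column count of missing entries (the complement of info()'s non-null count).
missing = df.isna().sum().to_dict()

print(dtypes)
print(missing)
```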

Converting Data Types

[5]:
train_data['Pclass'] = train_data['Pclass'].astype(str)
[6]:
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null object
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(4), object(6)
memory usage: 83.6+ KB
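
The notebook converts Pclass with astype(str). As a possible alternative, not used in this notebook, pandas also has a dedicated category dtype. The sketch below compares the two on a made-up Series:

```python
import pandas as pd

# Illustrative class labels, not taken from train.csv.
pclass = pd.Series([3, 1, 3, 1, 2], name="Pclass")

# Approach used in the notebook: plain strings.
as_str = pclass.astype(str)

# Alternative: pandas' categorical dtype, which records the distinct levels.
as_cat = pclass.astype("category")

print(as_str.tolist())
print(as_cat.dtype)
print(as_cat.cat.categories.tolist())
```

Either form stops the column from being treated as a quantity; the category dtype additionally keeps the list of levels around.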

Transforming Data with a Function

[11]:
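import math

def age_categorize(age):
    """Bucket an age into its decade (e.g. 22.0 -> 20); missing ages map to -1."""
    if math.isnan(age):
        return -1  # sentinel for a missing Age
    return math.floor(age / 10) * 10  # round down to the nearest multiple of 10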
[12]:
train_data.head()
[12]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
[13]:
train_data['Age'].apply(age_categorize)
[13]:
0      20
1      30
2      20
3      30
4      30
5      -1
6      50
7       0
8      20
9      10
10      0
11     50
12     20
13     30
14     10
15     50
16      0
17     -1
18     30
19     -1
20     30
21     30
22     10
23     20
24      0
25     30
26     -1
27     10
28     -1
29     -1
       ..
861    20
862    40
863    -1
864    20
865    40
866    20
867    30
868    -1
869     0
870    20
871    40
872    30
873    40
874    20
875    10
876    20
877    10
878    -1
879    50
880    20
881    30
882    20
883    20
884    20
885    30
886    20
887    10
888    -1
889    20
890    30
Name: Age, Length: 891, dtype: int64
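
The apply call above returns a new Series and leaves train_data unchanged. The sketch below, on a made-up Age column, shows the same bucketing next to a vectorized equivalent; the variable names by_apply and by_vector are hypothetical and only used for this comparison.

```python
import math

import numpy as np
import pandas as pd

def age_categorize(age):
    if math.isnan(age):
        return -1
    return math.floor(age / 10) * 10

# Illustrative ages, including one missing value.
ages = pd.Series([22.0, 38.0, np.nan, 54.0, 2.0], name="Age")

# Element-wise, as in the notebook.
by_apply = ages.apply(age_categorize)

# Vectorized: floor-divide into decades, then mark missing ages with -1.
by_vector = (ages // 10 * 10).fillna(-1).astype(int)

print(by_apply.tolist())
print(by_vector.tolist())
```

To keep the result, it would have to be assigned back, for example to a new column of the DataFrame.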

One-hot encoding

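  • Categorical data is hard to compute with during analysis, so it needs to be converted to a numerical form

  • Each category of a categorical column becomes its own column

  • Each new column is filled with 1 where the row belongs to that category and 0 otherwise

  • Use the pandas.get_dummies function

    • drop_first: when True, the first category of each encoded column is dropped, leaving k-1 indicator columns for k categories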

[14]:
pd.get_dummies(train_data, columns=['Pclass', 'Sex', 'Embarked'], drop_first=False).head()
[14]:
   PassengerId  Survived  Name                                               Age   SibSp  Parch  Ticket            Fare     Cabin  Pclass_1  Pclass_2  Pclass_3  Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
0  1            0         Braund, Mr. Owen Harris                            22.0  1      0      A/5 21171         7.2500   NaN    0         0         1         0           1         0           0           1
1  2            1         Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0  1      0      PC 17599          71.2833  C85    1         0         0         1           0         1           0           0
2  3            1         Heikkinen, Miss. Laina                             26.0  0      0      STON/O2. 3101282  7.9250   NaN    0         0         1         1           0         0           0           1
3  4            1         Futrelle, Mrs. Jacques Heath (Lily May Peel)       35.0  1      0      113803            53.1000  C123   1         0         0         1           0         0           0           1
4  5            0         Allen, Mr. William Henry                           35.0  0      0      373450            8.0500   NaN    0         0         1         0           1         0           0           1
[15]:
pd.get_dummies(train_data, columns=['Pclass', 'Sex', 'Embarked'], drop_first=True).head()
[15]:
   PassengerId  Survived  Name                                               Age   SibSp  Parch  Ticket            Fare     Cabin  Pclass_2  Pclass_3  Sex_male  Embarked_Q  Embarked_S
0  1            0         Braund, Mr. Owen Harris                            22.0  1      0      A/5 21171         7.2500   NaN    0         1         1         0           1
1  2            1         Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0  1      0      PC 17599          71.2833  C85    0         0         0         0           0
2  3            1         Heikkinen, Miss. Laina                             26.0  0      0      STON/O2. 3101282  7.9250   NaN    0         1         0         0           1
3  4            1         Futrelle, Mrs. Jacques Heath (Lily May Peel)       35.0  1      0      113803            53.1000  C123   0         0         0         0           1
4  5            0         Allen, Mr. William Henry                           35.0  0      0      373450            8.0500   NaN    0         1         1         0           1
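
To see why dropping the first category loses no information, the sketch below encodes a made-up Embarked column both ways. Rows whose remaining indicators are all 0 must belong to the dropped category. Passing dtype=int is an addition here, not something the notebook does: recent pandas versions return boolean columns by default, and dtype=int gives the 0/1 output shown above.

```python
import pandas as pd

# Illustrative values, not taken from train.csv.
df = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]})

# One indicator column per category.
full = pd.get_dummies(df, columns=["Embarked"], dtype=int)

# First category (C) dropped: two columns are enough for three categories.
reduced = pd.get_dummies(df, columns=["Embarked"], drop_first=True, dtype=int)

# A row with all-zero indicators can only be the dropped category.
dropped = reduced.sum(axis=1) == 0

print(full.columns.tolist())
print(reduced.columns.tolist())
print(dropped.tolist())
```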