데이터 변환¶
[2]:
import numpy as np
import pandas as pd
[10]:
train_data = pd.read_csv('./train.csv')
train_data.head()
[10]:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
데이터 종류¶
숫자형(Numerical Type) 데이터¶
연속성을 띄는 숫자로 이루어진 데이터
예) Age, Fare 등
범주형(Categorical Type) 데이터¶
연속적이지 않은 값(대부분의 경우 숫자를 제외한 나머지 값)을 갖는 데이터를 의미
예) Name, Sex, Ticket, Cabin, Embarked
어떤 경우, 숫자형 타입이라 할지라도 개념적으로 범주형으로 처리해야할 경우가 있음
예) Pclass
데이터 타입 확인¶
[4]:
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
데이터 타입 변환¶
[5]:
train_data['Pclass'] = train_data['Pclass'].astype(str)
[6]:
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null object
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(4), object(6)
memory usage: 83.6+ KB
함수를 이용하여 데이터 변환하기¶
[11]:
import math
def age_categorize(age):
if math.isnan(age):
return -1
return math.floor(age / 10) * 10
[12]:
train_data.head()
[12]:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
[13]:
train_data['Age'].apply(age_categorize)
[13]:
0 20
1 30
2 20
3 30
4 30
5 -1
6 50
7 0
8 20
9 10
10 0
11 50
12 20
13 30
14 10
15 50
16 0
17 -1
18 30
19 -1
20 30
21 30
22 10
23 20
24 0
25 30
26 -1
27 10
28 -1
29 -1
..
861 20
862 40
863 -1
864 20
865 40
866 20
867 30
868 -1
869 0
870 20
871 40
872 30
873 40
874 20
875 10
876 20
877 10
878 -1
879 50
880 20
881 30
882 20
883 20
884 20
885 30
886 20
887 10
888 -1
889 20
890 30
Name: Age, Length: 891, dtype: int64
One-hot encoding¶
범주형 데이터는 분석단계에서 계산이 어렵기 때문에 숫자형으로 변경이 필요함
범주형 데이터의 각 범주(category)를 column레벨로 변경
해당 범주에 해당하면 1, 아니면 0으로 채우는 인코딩 기법
pandas.get_dummies 함수 사용
drop_first : 첫번째 카테고리 값은 사용하지 않음
[14]:
pd.get_dummies(train_data, columns=['Pclass', 'Sex', 'Embarked'], drop_first=False).head()
[14]:
| PassengerId | Survived | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 3 | 1 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 5 | 0 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
[15]:
pd.get_dummies(train_data, columns=['Pclass', 'Sex', 'Embarked'], drop_first=True).head()
[15]:
| PassengerId | Survived | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Pclass_2 | Pclass_3 | Sex_male | Embarked_Q | Embarked_S | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 0 | 1 | 1 | 0 | 1 |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | 1 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 0 | 1 | 0 | 0 | 1 |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 0 | 0 | 0 | 0 | 1 |
| 4 | 5 | 0 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 0 | 1 | 1 | 0 | 1 |