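Feature selection (variable selection)

Overview

1) Feature selection is one of the core steps in machine learning that directly affects the model

2) Features that are irrelevant, or only weakly related, to the target can be a major cause of degraded model performance

Common approaches

1) Remove features with missing values (for example, columns where most values are missing)

2) When features are highly correlated with each other, keep one and remove the rest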
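
The missing-value approach above can be sketched as follows. This is a minimal sketch on made-up toy data; the DataFrame `df` and the cutoff `MAX_MISSING_RATIO` are assumptions for illustration and should be tuned per dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is mostly missing.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "c": [0.5, np.nan, 1.5, 2.5, 3.5],
})

MAX_MISSING_RATIO = 0.5  # assumed cutoff, not a universal rule

# Fraction of missing values per column
missing_ratio = df.isna().mean()

# Keep only the columns whose missing ratio is at or below the cutoff
keep_cols = missing_ratio[missing_ratio <= MAX_MISSING_RATIO].index.tolist()
df_reduced = df[keep_cols]

print(keep_cols)
```

Here "b" is dropped because 80% of its values are missing, while "c" (20% missing) is kept and can be imputed later.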

Benefits

1) Reduced overfitting

2) Improved accuracy

3) Shorter training time

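Filter method

1) Measures the statistical relationship between variables (for example, correlation) and reports it, without training a model

- Fast to compute and well suited to finding correlations, so it is widely used during preprocessing

2) e.g.

- chi-square test

- correlation coefficient (heat map)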
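
As a rough illustration of the chi-square filter, the sketch below uses scikit-learn's `SelectKBest` with `chi2` on the iris dataset. The dataset and `k=2` are assumptions chosen for the example; note that `chi2` expects non-negative feature values.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Iris is used only as a convenient example; all of its features are non-negative
X, y = load_iris(return_X_y=True, as_frame=True)

# Score every feature against the target and keep the 2 highest-scoring ones
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

# Chi-square score per feature, and the names of the selected features
scores = dict(zip(X.columns, selector.scores_))
selected = list(X.columns[selector.get_support()])

print(scores)
print(selected)
```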

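Wrapper method

1) Repeatedly builds subsets of features, tests each subset with a predictive model, and selects the useful features

- Requires substantial computing power

2) e.g.

- recursive feature elimination (RFE) : fits a model such as a linear SVM, removes the least important features, and repeats

- sequential feature selection (SFS) : greedy search that adds or removes one feature at a time (forward selection, backward elimination, stepwise selection)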
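
Both wrapper methods are available in scikit-learn. The sketch below runs them on synthetic data; the dataset, the choice of estimators, and `n_features_to_select=3` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic data: 8 features, of which 3 are informative
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# RFE: fit a linear SVM, drop the feature with the smallest weight, repeat
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=3, step=1)
rfe.fit(X, y)
rfe_mask = rfe.support_  # boolean mask of the features RFE kept

# SFS: greedily add the feature that improves the cross-validated score most
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
    cv=3,
)
sfs.fit(X, y)
sfs_mask = sfs.get_support()  # boolean mask of the features SFS kept

print(rfe_mask)
print(sfs_mask)
```

Setting `direction="backward"` starts from all features and removes one at a time instead. Either way the model is refitted many times, which is where the computing cost comes from.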

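Embedded method

1) Learns which features contribute to model accuracy as part of training the model itself

2) Adds a constraint (regularization) that pushes the fit toward a regression equation with fewer or smaller coefficients

3) e.g.

- LASSO : L1-norm penalty (can shrink coefficients to exactly zero)

- Ridge : L2-norm penalty (shrinks coefficients but does not set them exactly to zero)

- Elastic Net : LASSO + Ridge (combination)

- SelectFromModel : selects features by a fitted model's importance scores, commonly from tree-based models (RandomForest, LightGBM, etc.)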
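
To see the L1 penalty acting as a feature selector, here is a minimal sketch that fits a LASSO model on synthetic data and passes it to `SelectFromModel`. The dataset and `alpha=1.0` are assumptions for the example; in practice alpha is usually tuned, for instance by cross-validation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, of which 3 actually drive the target
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Scale first so the penalty treats every feature equally
X_scaled = StandardScaler().fit_transform(X)

# The L1 penalty drives the coefficients of unhelpful features to zero
lasso = Lasso(alpha=1.0).fit(X_scaled, y)
n_zero = int(np.sum(lasso.coef_ == 0))

# Keep only the features whose coefficient is (effectively) non-zero
selector = SelectFromModel(lasso, prefit=True)
X_selected = selector.transform(X_scaled)

print(n_zero, X_selected.shape)
```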

Code examples

1) RandomForest

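import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X_train is assumed to be a pandas DataFrame and y_train the target labels
feat_labels = X_train.columns

rf = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)

rf.fit(X_train, y_train)
importances = rf.feature_importances_

# argsort : returns the indices that would sort the array in ascending order, [::-1] : reverses it to descending
indices = np.argsort(importances)[::-1]

# Print every feature with its importance, most important first
for i in range(X_train.shape[1]):
    print(feat_labels[indices[i]], importances[indices[i]])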

2) SelectFromModel

- The threshold controls how important a feature must be in order to be selected

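from sklearn.feature_selection import SelectFromModel

# threshold : 'median', 'mean', or a number; prefit=True reuses the already fitted rf
sfm = SelectFromModel(rf, threshold='median', prefit=True)

# Training data reduced to the columns whose importance is at or above the threshold
X_train_selected = sfm.transform(X_train)

selected_features = list(feat_labels[sfm.get_support()])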

3) Using correlation (drop features whose absolute correlation exceeds 0.95)

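import numpy as np

corr_matrix = data.corr()

# Upper triangle only (k=1 excludes the diagonal), so each pair is checked once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Columns whose absolute correlation with an earlier column exceeds 0.95
to_drop = [column for column in upper.columns if any(abs(upper[column]) > 0.95)]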

