
Big Data Analysis Engineer (빅데이터분석기사) Practical Exam Example - Task Type 2 (5)

by 해모해모, June 20, 2023

Adult census income prediction (0 if income is 50K or less, 1 if it exceeds 50K)
# X_train.shape, X_test.shape, y_train.shape, y_test.shape
#((26048, 15), (6513, 15), (26048, 2), (6513, 2))

# print(X_train.head())
# print(y_train.head()) # id income
# print(X_train.info())
# print(y_train['income'].value_counts())

# Check for missing values
# print(X_train.isnull().sum())
# print(X_test.isnull().sum())

# Handle missing values by dropping the affected columns
cols = ['workclass', 'occupation', 'native.country']
X_train = X_train.drop(columns=cols)
X_test = X_test.drop(columns=cols)

# print(X_train.isnull().sum())
# print(X_test.isnull().sum())

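# Label-encode the object-type columns
from sklearn.preprocessing import LabelEncoder
obj = ['education', 'marital.status', 'relationship', 'race', 'sex']
for col in obj:
    le = LabelEncoder()
    # Fit on the training data, then reuse the same mapping for the test data.
    # Note: transform() raises ValueError if X_test contains a category not seen in X_train.
    X_train[col] = le.fit_transform(X_train[col])
    X_test[col] = le.transform(X_test[col])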

# print(X_train.info(), X_test.info())

# Remove columns that are not needed as features
# print(X_train.shape, X_test.shape)
X_train = X_train.drop('id', axis=1)
test_id = X_test.pop('id')  # keep the test ids for the submission file
# print(X_train.shape, X_test.shape)

# Convert the target to 0 (<=50K) and 1 (>50K)
# print(y_train.head())
y = (y_train['income'] == '>50K').astype(int)  # True (1) if income is >50K, False (0) if <=50K
# print(y)

# Train the model and predict
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=2023)
model.fit(X_train, y)
pred = model.predict(X_test)
# print(pred)

# Save the predictions as the submission file
import pandas as pd
pd.DataFrame({
    'id': test_id,
    'income': pred
}).to_csv('0000.csv', index=False)

print(pd.read_csv('0000.csv'))

# Accuracy: 0.8436972209427299
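
The accuracy figure above is reported without the code that produced it. Below is a minimal sketch of one way such a score could be computed, assuming `y_test` has the same `id`/`income` layout as `y_train` (the shapes listed at the top suggest it does). The tiny DataFrames and the `to_binary` helper are hypothetical stand-ins so the snippet runs on its own. They are not the real census data, so the printed value will not match the figure above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy data standing in for the post's X_train / y_train / X_test / y_test
X_train = pd.DataFrame({'age': [25, 38, 52, 46, 29, 61, 33, 57],
                        'hours.per.week': [40, 50, 60, 45, 35, 20, 40, 55]})
y_train = pd.DataFrame({'id': range(8),
                        'income': ['<=50K', '<=50K', '>50K', '>50K',
                                   '<=50K', '<=50K', '<=50K', '>50K']})
X_test = pd.DataFrame({'age': [30, 55], 'hours.per.week': [40, 60]})
y_test = pd.DataFrame({'id': [8, 9], 'income': ['<=50K', '>50K']})


def to_binary(income):
    """Map the income labels to 1 (>50K) and 0 (<=50K), as in the post (helper name is an assumption)."""
    return (income == '>50K').astype(int)


# Same model settings as in the post
model = RandomForestClassifier(random_state=2023)
model.fit(X_train, to_binary(y_train['income']))
pred = model.predict(X_test)

# Compare the predictions with the true test labels, encoded the same way as the training target
accuracy = accuracy_score(to_binary(y_test['income']), pred)
print(accuracy)
```

The test labels must go through the same 0/1 conversion as the training target; comparing the raw `'>50K'` strings against integer predictions would not give a meaningful score.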