Those will be the people we approach first for political donations.
Feature explanation:
age: continuous.
workclass: categorical; Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: final weight, a weight assigned by the US Census Bureau to each row. Read literally, each row would have to be replicated fnlwgt times to reconstruct the full data, which would come to roughly 6.1 billion rows. Don't be shocked by the size: it is data accumulated over decades.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: Black, White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
Income: Whether a person's income is more than $50,000 or not. This is our dependent variable.
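To make the meaning of fnlwgt concrete, here is a minimal sketch on a made-up three-row frame. The name `toy` and all of its values are illustrative assumptions, not rows from adult.csv. It replicates each row fnlwgt times, as the description above suggests, and then shows that the same share of high earners can be obtained from a weighted sum without building the expanded rows.

```python
import pandas as pd

# Toy data (hypothetical): weights are tiny so the expansion stays small.
toy = pd.DataFrame({
    "age": [50, 38, 53],
    "fnlwgt": [3, 1, 2],
    "Income": [0, 1, 0],
})

# Literal reading of fnlwgt: replicate each row `fnlwgt` times.
expanded = toy.loc[toy.index.repeat(toy["fnlwgt"])].reset_index(drop=True)

# Weighted share of high earners, computed without replicating rows.
weighted_share = (toy["Income"] * toy["fnlwgt"]).sum() / toy["fnlwgt"].sum()

print(len(expanded))               # rows after expansion
print(expanded["Income"].mean())   # share computed on the expanded rows
print(weighted_share)              # same share from the weights alone
```

For a file the size of adult.csv the weighted form is the practical one, since the expanded frame would not fit in memory.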
import numpy as np
import pandas as pd
import os
from matplotlib import pyplot as plt
pd.options.mode.chained_assignment = None  # suppress pandas' SettingWithCopyWarning
os.chdir("C:\\Users\\ASUS")
census = pd.read_csv("adult.csv")
census.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 1 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 2 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 4 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
#Unique values of Income variable
census.Income.unique()
array([' <=50K', ' >50K'], dtype=object)
#Frequency distribution of Income variable
census.Income.value_counts()
<=50K 24719
>50K 7841
Name: Income, dtype: int64
census["Income"] = np.where(census["Income"] == ' <=50K',0,1 )
#Frequency distribution of Income variable
census.Income.value_counts()
0 24719
1 7841
Name: Income, dtype: int64
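The comparison above works only because the label is written with its leading space (' <=50K'), exactly as it appears in the raw file. A slightly more defensive variant, sketched here on a made-up Series named `income_raw`, strips whitespace first and then maps each known label explicitly, so an unexpected label would surface as NaN instead of silently becoming 1.

```python
import pandas as pd

# Illustrative labels (hypothetical) with the stray spaces seen in the raw file.
income_raw = pd.Series([" <=50K", " >50K", " <=50K", " >50K "])

# Strip surrounding whitespace, then map each known label to 0/1.
# Labels missing from the mapping would become NaN.
income_encoded = income_raw.str.strip().map({"<=50K": 0, ">50K": 1})

print(income_encoded.tolist())
print(income_encoded.isna().sum())  # count of labels that did not match
```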
census.head(2)
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | 0 |
| 1 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | 0 |
# number of missing values by variables
census.isnull().sum()
age 0
workclass 0
fnlwgt 0
education 0
education-num 0
marital-status 0
occupation 0
relationship 0
race 0
sex 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 0
Income 0
dtype: int64
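A zero count from `isnull()` does not by itself mean the data is complete: in this file, missing entries are written as the string " ?", which pandas treats as an ordinary value. The sketch below uses a small made-up frame (`toy` is a hypothetical name) to show one way to expose such placeholders by converting them to NaN before counting.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical) mimicking the " ?" placeholder and leading spaces.
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "native-country": [" United-States", " Cuba", " ?"],
})

# isnull() finds nothing because "?" is a regular string.
print(toy.isnull().sum().sum())

# Strip whitespace, then turn the "?" placeholder into a real NaN.
cleaned = toy.apply(lambda col: col.str.strip()).replace("?", np.nan)
missing_per_column = cleaned.isnull().sum()
print(missing_per_column)
```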
#segregating the numeric and categorical variables
dataset_categorical = census.select_dtypes(exclude = "number")
dataset_numeric = census.select_dtypes(include = "number")
#create the dummy variables
#dataset_categorical= pd.get_dummies(data = dataset_categorical, drop_first = True)
dataset_categorical.head(2)
| | workclass | education | marital-status | occupation | relationship | race | sex | native-country |
|---|---|---|---|---|---|---|---|---|
| 0 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | United-States |
| 1 | Private | HS-grad | Divorced | Handlers-cleaners | Not-in-family | White | Male | United-States |
dataset_categorical.describe(include='object')
| | workclass | education | marital-status | occupation | relationship | race | sex | native-country |
|---|---|---|---|---|---|---|---|---|
| count | 32560 | 32560 | 32560 | 32560 | 32560 | 32560 | 32560 | 32560 |
| unique | 9 | 16 | 7 | 15 | 6 | 5 | 2 | 42 |
| top | Private | HS-grad | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States |
| freq | 22696 | 10501 | 14976 | 4140 | 13193 | 27815 | 21789 | 29169 |
dataset_categorical["native-country"].unique()
array([' United-States', ' Cuba', ' Jamaica', ' India', ' ?', ' Mexico',
' South', ' Puerto-Rico', ' Honduras', ' England', ' Canada',
' Germany', ' Iran', ' Philippines', ' Italy', ' Poland',
' Columbia', ' Cambodia', ' Thailand', ' Ecuador', ' Laos',
' Taiwan', ' Haiti', ' Portugal', ' Dominican-Republic',
' El-Salvador', ' France', ' Guatemala', ' China', ' Japan',
' Yugoslavia', ' Peru', ' Outlying-US(Guam-USVI-etc)', ' Scotland',
' Trinadad&Tobago', ' Greece', ' Nicaragua', ' Vietnam', ' Hong',
' Ireland', ' Hungary', ' Holand-Netherlands'], dtype=object)
dataset_categorical= dataset_categorical.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
dataset_categorical["native-country"].value_counts()
United-States 29169
Mexico 643
? 583
Philippines 198
Germany 137
Canada 121
Puerto-Rico 114
El-Salvador 106
India 100
Cuba 95
England 90
Jamaica 81
South 80
China 75
Italy 73
Dominican-Republic 70
Vietnam 67
Guatemala 64
Japan 62
Poland 60
Columbia 59
Taiwan 51
Haiti 44
Iran 43
Portugal 37
Nicaragua 34
Peru 31
Greece 29
France 29
Ecuador 28
Ireland 24
Hong 20
Trinadad&Tobago 19
Cambodia 19
Laos 18
Thailand 18
Yugoslavia 16
Outlying-US(Guam-USVI-etc) 14
Hungary 13
Honduras 13
Scotland 12
Holand-Netherlands 1
Name: native-country, dtype: int64
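# Replace "?" (which appears to mark a missing value) with the mode, i.e. United-States.
# regex=False makes pandas treat "?" as a literal character rather than a regex quantifier.
dataset_categorical["native-country"] = dataset_categorical["native-country"].str.replace("?", "United-States", regex=False)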
dataset_categorical["native-country"].value_counts().plot(kind='bar', ylabel='frequency')
plt.show()
dataset_categorical["native-country"] = np.where(dataset_categorical["native-country"] == "United-States",1,0 )
dataset_categorical["native-country"].value_counts()
1 29752
0 2808
Name: native-country, dtype: int64
dataset_categorical["workclass"].value_counts()
Private 22696
Self-emp-not-inc 2541
Local-gov 2093
? 1836
State-gov 1297
Self-emp-inc 1116
Federal-gov 960
Without-pay 14
Never-worked 7
Name: workclass, dtype: int64
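# Replace "?" (which appears to mark a missing value) with the mode, i.e. Private.
# regex=False makes pandas treat "?" as a literal character rather than a regex quantifier.
dataset_categorical["workclass"] = dataset_categorical["workclass"].str.replace("?", "Private", regex=False)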
dataset_categorical["workclass"].value_counts().plot(kind='bar', ylabel='frequency')
plt.show()
dataset_categorical["workclass"] = np.where(dataset_categorical["workclass"] == "Private",1,0 )
dataset_categorical["workclass"].value_counts()
1 24532
0 8028
Name: workclass, dtype: int64
dataset_categorical["occupation"].value_counts()
Prof-specialty 4140
Craft-repair 4099
Exec-managerial 4066
Adm-clerical 3769
Sales 3650
Other-service 3295
Machine-op-inspct 2002
? 1843
Transport-moving 1597
Handlers-cleaners 1370
Farming-fishing 994
Tech-support 928
Protective-serv 649
Priv-house-serv 149
Armed-Forces 9
Name: occupation, dtype: int64
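# Replace "?" (which appears to mark a missing value) with the mode, i.e. Prof-specialty.
# regex=False makes pandas treat "?" as a literal character rather than a regex quantifier.
dataset_categorical["occupation"] = dataset_categorical["occupation"].str.replace("?", "Prof-specialty", regex=False)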
dataset_categorical["occupation"].value_counts().plot(kind='bar', ylabel='frequency')
plt.show()
dataset_categorical["occupation"] = np.where(dataset_categorical["occupation"].isin(["Farming-fishing",
"Tech-support","Protective-serv","Priv-house-serv","Armed-Forces"]),"others",dataset_categorical["occupation"])
dataset_categorical.head(2)
| | workclass | education | marital-status | occupation | relationship | race | sex | native-country |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | 1 |
| 1 | 1 | HS-grad | Divorced | Handlers-cleaners | Not-in-family | White | Male | 1 |
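The five occupations grouped into "others" above are the ones with fewer than 1,000 rows in the earlier value counts. A minimal sketch of the same idea driven by a frequency threshold rather than a hand-written list follows; the helper name `group_rare`, its `min_count` parameter, and the toy values are assumptions for illustration.

```python
import pandas as pd

def group_rare(series: pd.Series, min_count: int, other_label: str = "others") -> pd.Series:
    """Collapse categories seen fewer than `min_count` times into one label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest with `other_label`.
    return series.where(~series.isin(rare), other_label)

# Toy column (hypothetical): two frequent and two rare occupations.
occupation = pd.Series(
    ["Sales"] * 4 + ["Craft-repair"] * 3 + ["Armed-Forces"] + ["Priv-house-serv"]
)
grouped = group_rare(occupation, min_count=2)
print(grouped.value_counts())
```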
dataset_categorical=pd.get_dummies(data=dataset_categorical,columns=['education', 'marital-status',
"occupation","relationship","race","sex"],drop_first=True)
dataset_categorical.head(2)
| | workclass | native-country | education_11th | education_12th | education_1st-4th | education_5th-6th | education_7th-8th | education_9th | education_Assoc-acdm | education_Assoc-voc | ... | relationship_Not-in-family | relationship_Other-relative | relationship_Own-child | relationship_Unmarried | relationship_Wife | race_Asian-Pac-Islander | race_Black | race_Other | race_White | sex_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
2 rows × 42 columns
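To see what `drop_first=True` does in the `get_dummies` call above, the sketch below encodes a tiny made-up frame both ways (`toy`, `full`, and `reduced` are illustrative names). With `drop_first=True`, the alphabetically first level of each column is omitted and becomes the implicit baseline, which is why no `sex_Female` column appears in the output above.

```python
import pandas as pd

# Toy frame (hypothetical values).
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "race": ["White", "Black", "Other"],
})

# One indicator column per level.
full = pd.get_dummies(toy, columns=["sex", "race"])

# One column fewer per variable: the first level is the baseline.
reduced = pd.get_dummies(toy, columns=["sex", "race"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```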
dataset_numeric.head(2)
| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | Income |
|---|---|---|---|---|---|---|---|
| 0 | 50 | 83311 | 13 | 0 | 0 | 13 | 0 |
| 1 | 38 | 215646 | 9 | 0 | 0 | 40 | 0 |
import seaborn as sns
import matplotlib.pyplot as plt
correlation_mat = dataset_numeric.corr()
sns.heatmap(correlation_mat, annot = True)
plt.show()
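A heatmap shows every pairwise correlation, but for feature screening the column for the target is often the part of interest. As a sketch on made-up numbers (the frame `toy` is hypothetical and its values are not from adult.csv), the correlations with Income can be pulled out and ordered by absolute strength:

```python
import pandas as pd

# Toy numeric frame (hypothetical values).
toy = pd.DataFrame({
    "age": [25, 38, 50, 28, 45, 60],
    "hours-per-week": [40, 40, 13, 40, 50, 60],
    "education-num": [9, 9, 13, 13, 14, 16],
    "Income": [0, 0, 0, 0, 1, 1],
})

# Pearson correlation of each feature with the target,
# strongest absolute correlation first.
corr_with_target = (
    toy.corr()["Income"]
    .drop("Income")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```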
#Distribution plot
dataset_numeric.hist(figsize=(15,30),layout=(9,3))
[Figure: histograms of age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week, and Income]
dataset_numeric.drop("fnlwgt",axis=1,inplace=True)
dataset_numeric.head(2)
| | age | education-num | capital-gain | capital-loss | hours-per-week | Income |
|---|---|---|---|---|---|---|
| 0 | 50 | 13 | 0 | 0 | 13 | 0 |
| 1 | 38 | 9 | 0 | 0 | 40 | 0 |
#dataset_categorical = census.select_dtypes(exclude = "number")
#dataset_numeric = census.select_dtypes(include = "number")
data = pd.concat([dataset_categorical,dataset_numeric],axis=1)
data.head(2)
| | workclass | native-country | education_11th | education_12th | education_1st-4th | education_5th-6th | education_7th-8th | education_9th | education_Assoc-acdm | education_Assoc-voc | ... | race_Black | race_Other | race_White | sex_Male | age | education-num | capital-gain | capital-loss | hours-per-week | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 50 | 13 | 0 | 0 | 13 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 38 | 9 | 0 | 0 | 40 | 0 |
2 rows × 48 columns
#segregate data into dependent and independent variables
X = data.drop("Income", axis = 1)#independent variables
y = data["Income"]#dependent variable
# Splitting it into training and testing (70% train & 30% test)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
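The Income classes are imbalanced (24,719 zeros against 7,841 ones, roughly 24% positives), so a purely random split can leave the train and test sets with slightly different class ratios. One common option, shown here as a sketch on synthetic arrays rather than as the split used in this notebook, is the `stratify` argument of `train_test_split`. The names `X_toy` and `y_toy` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 100 rows, 24% positives.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 76 + [1] * 24)

# stratify=y_toy keeps the class ratio close to 24% in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())  # share of positives in each part
```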
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, y_train)
DecisionTreeClassifier(random_state=0)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(classification_report(y_test,y_pred))
precision recall f1-score support
0 0.87 0.89 0.88 7395
1 0.62 0.58 0.60 2373
accuracy 0.81 9768
macro avg 0.75 0.73 0.74 9768
weighted avg 0.81 0.81 0.81 9768
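# Convert the report to a DataFrame: one row per class or summary metric
report = classification_report(y_test, y_pred, output_dict=True)
df = pd.DataFrame(report).transpose()
df["model"] = "Decision Tree"
# Keep the two class rows plus accuracy; keep every column except "support"
df1 = df.iloc[0:3, np.r_[0:3, 4]]
df1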
| | precision | recall | f1-score | model |
|---|---|---|---|---|
| 0 | 0.868724 | 0.886815 | 0.877677 | Decision Tree |
| 1 | 0.622803 | 0.582385 | 0.601916 | Decision Tree |
| accuracy | 0.812858 | 0.812858 | 0.812858 | Decision Tree |
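The precision, recall, and f1-score figures in the report all derive from the four cells of the confusion matrix. The sketch below recomputes them for class 1 on a small hand-made example; `y_true` and `y_hat` are hypothetical labels, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for ten samples.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# For binary labels the matrix unravels as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_1 = tp / (tp + fp)   # of predicted positives, how many are right
recall_1 = tp / (tp + fn)      # of actual positives, how many are found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)
print(precision_1, recall_1, f1_1, accuracy)
```

For a donation campaign, recall on class 1 indicates how many high earners the model finds, while precision indicates how many of the people contacted are actually high earners.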
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=1000,random_state=0)
rf.fit(X_train, y_train)
RandomForestClassifier(n_estimators=1000, random_state=0)
y_pred = rf.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(classification_report(y_test,y_pred))
precision recall f1-score support
0 0.88 0.92 0.90 7395
1 0.71 0.62 0.66 2373
accuracy 0.85 9768
macro avg 0.80 0.77 0.78 9768
weighted avg 0.84 0.85 0.84 9768
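# Convert the report to a DataFrame: one row per class or summary metric
report = classification_report(y_test, y_pred, output_dict=True)
df = pd.DataFrame(report).transpose()
df["model"] = "Random Forest"
# Keep the two class rows plus accuracy; keep every column except "support"
df2 = df.iloc[0:3, np.r_[0:3, 4]]
df2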
| | precision | recall | f1-score | model |
|---|---|---|---|---|
| 0 | 0.881826 | 0.919270 | 0.900159 | Random Forest |
| 1 | 0.710053 | 0.616098 | 0.659747 | Random Forest |
| accuracy | 0.845618 | 0.845618 | 0.845618 | Random Forest |
import xgboost as xgb
from xgboost import XGBClassifier
classi =XGBClassifier()
classi.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=8,
num_parallel_tree=1, predictor='auto', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
y_pred = classi.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(classification_report(y_test,y_pred))
precision recall f1-score support
0 0.90 0.94 0.92 7395
1 0.77 0.66 0.71 2373
accuracy 0.87 9768
macro avg 0.83 0.80 0.81 9768
weighted avg 0.86 0.87 0.87 9768
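# Convert the report to a DataFrame: one row per class or summary metric
report = classification_report(y_test, y_pred, output_dict=True)
df = pd.DataFrame(report).transpose()
df["model"] = "XGBoost"
# Keep the two class rows plus accuracy; keep every column except "support"
df3 = df.iloc[0:3, np.r_[0:3, 4]]
# Stack the three per-model summaries for comparison
pd.concat([df1, df2, df3], axis=0)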
| | precision | recall | f1-score | model |
|---|---|---|---|---|
| 0 | 0.868724 | 0.886815 | 0.877677 | Decision Tree |
| 1 | 0.622803 | 0.582385 | 0.601916 | Decision Tree |
| accuracy | 0.812858 | 0.812858 | 0.812858 | Decision Tree |
| 0 | 0.881826 | 0.919270 | 0.900159 | Random Forest |
| 1 | 0.710053 | 0.616098 | 0.659747 | Random Forest |
| accuracy | 0.845618 | 0.845618 | 0.845618 | Random Forest |
| 0 | 0.895855 | 0.935227 | 0.915117 | XGBoost |
| 1 | 0.766113 | 0.661188 | 0.709794 | XGBoost |
| accuracy | 0.868653 | 0.868653 | 0.868653 | XGBoost |
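The stacked table is easier to compare when each model occupies one row. The sketch below rebuilds the f1-score column from the table above, rounded to three decimals, and pivots it. The frame and column names (`results`, `row`) are illustrative. In the "accuracy" row the f1-score column simply holds the accuracy.

```python
import pandas as pd

# Values copied from the combined table above, rounded to three decimals.
results = pd.DataFrame({
    "model": ["Decision Tree"] * 3 + ["Random Forest"] * 3 + ["XGBoost"] * 3,
    "row": ["0", "1", "accuracy"] * 3,
    "f1-score": [0.878, 0.602, 0.813, 0.900, 0.660, 0.846, 0.915, 0.710, 0.869],
})

# One row per model, one column per report row.
comparison = results.pivot(index="model", columns="row", values="f1-score")
best_model = comparison["accuracy"].idxmax()

print(comparison)
print(best_model)
```

On these numbers XGBoost leads on accuracy and on the f1-score of both classes, followed by Random Forest and then the single Decision Tree.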