Analytics Educator

Data science has risen to prominence in the last decade because of its predictive capabilities. While many business verticals value predictive algorithms, insurance companies place particular importance on them because data science helps keep premiums low. Data has always been at the core of what insurers do: they analyze claims, the kind of vehicle a customer drives, and how many miles they drive per day, among other factors.

The data science field keeps gaining strength with improvements in technology and the availability of statistical libraries for regression and classification. Actuaries, the data scientists of insurance companies as they were called a decade ago, used to collate data from different sources and analyze premium and claim data to identify fraudulent transactions, which helped keep premiums low. If anything, today's data science technology gives them far more tools to perform their analysis.

The dataset contains a few ordinal and categorical features that need to be parsed and encoded properly.

Our goal is to predict a binary outcome: 1 indicates a safe driver, and 0 indicates that the driver's data needs review. We will also look at the continuous variables and fill in missing data with the mean or median so as not to skew our results.

After cleaning up the data and filling in missing values, we will look at the features and their correlations so that we can drop highly correlated features, which may distort our results.
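As a preview of that correlation screen, here is a minimal sketch; the DataFrame name df and the 0.9 cutoff are illustrative assumptions rather than values used later in this notebook.

import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    # Absolute pairwise correlations between the numeric features
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # Columns correlated above the cutoff with an earlier column
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)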

We are importing all the required packages

In [34]:
# Import the necessary packages of Python that we will/may use in this notebook
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import LabelBinarizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import roc_curve, precision_recall_curve, auc, make_scorer, recall_score, accuracy_score, precision_score, confusion_matrix, fbeta_score, roc_auc_score
from sklearn.metrics import classification_report as cr
import matplotlib.pyplot as plt
plt.style.use("ggplot")
import os
                

Import the data into Python

In [35]:
# Read the data from the local drive
os.chdir("C:\\Users\\ASUS\\Desktop")
safe_driver = pd.read_csv('safe_driver.csv')
safe_driver.head()
                
Out[35]:
ID target Gender EngineHP credit_history Years_Experience annual_claims Marital_Status Vehical_type Miles_driven_annually size_of_family Age_bucket EngineHP_bucket Years_Experience_bucket Miles_driven_annually_bucket credit_history_bucket State
0 1 1 F 522 656 1 0 Married Car 14749.0 5 <18 >350 <3 <15k Fair IL
1 2 1 F 691 704 16 0 Married Car 15389.0 6 28-34 >350 15-30 15k-25k Good NJ
2 3 1 M 133 691 15 0 Married Van 9956.0 3 >40 90-160 15-30 <15k Good CT
3 4 1 M 146 720 9 0 Married Van 77323.0 3 18-27 90-160 9-14' >25k Good CT
4 5 1 M 128 771 33 1 Married Van 14183.0 4 >40 90-160 >30 <15k Very Good WY

In the above data, target is our dependent variable: 1 indicates a safe driver and 0 a driver who is not safe. The other independent variables describe the type of vehicle the driver drives along with their personal and professional details.

Checking the descriptive statistics for any obviously impossible values (such as a negative family size)

Here we do not see any problems with the data.

In [36]:
safe_driver.describe()
                
Out[36]:
ID target EngineHP credit_history Years_Experience annual_claims Miles_driven_annually size_of_family
count 30240.000000 30240.00000 30240.000000 30240.000000 30240.000000 30240.000000 30232.000000 30240.000000
mean 15120.500000 0.70754 196.604266 685.769775 13.255721 1.138459 17422.938939 4.521296
std 8729.680407 0.45490 132.346961 102.454307 9.890246 1.082913 17483.782840 2.286531
min 1.000000 0.00000 80.000000 300.000000 1.000000 0.000000 5000.000000 1.000000
25% 7560.750000 0.00000 111.000000 668.000000 5.000000 0.000000 9668.500000 3.000000
50% 15120.500000 1.00000 141.000000 705.000000 10.000000 1.000000 12280.000000 5.000000
75% 22680.250000 1.00000 238.000000 753.000000 20.000000 2.000000 14697.250000 7.000000
max 30240.000000 1.00000 1005.000000 850.000000 40.000000 4.000000 99943.000000 8.000000

Checking if the data has any missing values

We see that there are two variables with missing values.

In [37]:
# Check if there are any NULL data that need to be dropped
safe_driver.isnull().mean()*100
                
Out[37]:
ID                              0.000000
                target                          0.000000
                Gender                          0.000000
                EngineHP                        0.000000
                credit_history                  0.000000
                Years_Experience                0.000000
                annual_claims                   0.000000
                Marital_Status                  0.000000
                Vehical_type                    0.000000
                Miles_driven_annually           0.026455
                size_of_family                  0.000000
                Age_bucket                      0.000000
                EngineHP_bucket                 0.000000
                Years_Experience_bucket         0.000000
                Miles_driven_annually_bucket    0.026455
                credit_history_bucket           0.000000
                State                           0.000000
                dtype: float64
In [38]:
#safe_driver = safe_driver.dropna()
                
In [39]:
safe_driver.head()
                
Out[39]:
ID target Gender EngineHP credit_history Years_Experience annual_claims Marital_Status Vehical_type Miles_driven_annually size_of_family Age_bucket EngineHP_bucket Years_Experience_bucket Miles_driven_annually_bucket credit_history_bucket State
0 1 1 F 522 656 1 0 Married Car 14749.0 5 <18 >350 <3 <15k Fair IL
1 2 1 F 691 704 16 0 Married Car 15389.0 6 28-34 >350 15-30 15k-25k Good NJ
2 3 1 M 133 691 15 0 Married Van 9956.0 3 >40 90-160 15-30 <15k Good CT
3 4 1 M 146 720 9 0 Married Van 77323.0 3 18-27 90-160 9-14' >25k Good CT
4 5 1 M 128 771 33 1 Married Van 14183.0 4 >40 90-160 >30 <15k Very Good WY

Here we check the frequency distribution of the dependent variable to see whether the data is balanced.

We see that the data is moderately imbalanced, with the proportions of 1 and 0 being roughly 70% and 30% respectively.

In [40]:
safe_driver.target.value_counts(normalize=True)*100
                
Out[40]:
1    70.753968
                0    29.246032
                Name: target, dtype: float64

We are now extracting the categorical variables, to transform/drop them as required.

In [41]:
cat_features = safe_driver.select_dtypes(include=['object'])
print(cat_features.columns)
                
Index(['Gender', 'Marital_Status', 'Vehical_type', 'Age_bucket',
                       'EngineHP_bucket', 'Years_Experience_bucket',
                       'Miles_driven_annually_bucket', 'credit_history_bucket', 'State'],
                      dtype='object')
                
In [42]:
cat_features.head(2)
                
Out[42]:
Gender Marital_Status Vehical_type Age_bucket EngineHP_bucket Years_Experience_bucket Miles_driven_annually_bucket credit_history_bucket State
0 F Married Car <18 >350 <3 <15k Fair IL
1 F Married Car 28-34 >350 15-30 15k-25k Good NJ

Among the categorical variables we retain the following:

  1. Gender
  2. Marital_Status
  3. Vehicle_Type, and
  4. Age_bucket

    EngineHP_bucket, Years_Experience_bucket, Miles_driven_annually_bucket, and credit_history_bucket each have a corresponding continuous variable, so creating dummies for them alongside the continuous variable does not make sense. We keep Age_bucket because there is no continuous variable representing age.

    We could split the dataset by State (one sub-dataset for each state) and, since each US state has its own regulations, analyze each state by itself (a minimal sketch follows below). We could aggregate our results across states later to get a national statistic.

    Or, for now, we could drop the State column and analyze the data across the nation later.
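A minimal sketch of that per-state split, assuming the safe_driver DataFrame loaded above; it is not actually run in this notebook, and the state code 'NJ' is just an example.

# One sub-dataset per state, keyed by the two-letter state code
state_frames = {state: frame.drop(columns="State")
                for state, frame in safe_driver.groupby("State")}

# Example: inspect a single state in isolation
nj_drivers = state_frames["NJ"]
print(nj_drivers["target"].value_counts(normalize=True))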

Drop these 5 columns: ID, EngineHP_bucket, Years_Experience_bucket, Miles_driven_annually_bucket, credit_history_bucket

In [43]:
safe_driver.drop(['ID', 'EngineHP_bucket', 'Years_Experience_bucket',
                  'Miles_driven_annually_bucket',
                  'credit_history_bucket'], axis=1, inplace=True)
                
In [44]:
# Check if the dataset has any NaN values as these values will make our algorithms throw an exception
safe_driver.isnull().sum()
                
Out[44]:
target                   0
                Gender                   0
                EngineHP                 0
                credit_history           0
                Years_Experience         0
                annual_claims            0
                Marital_Status           0
                Vehical_type             0
                Miles_driven_annually    8
                size_of_family           0
                Age_bucket               0
                State                    0
                dtype: int64

The Miles_driven_annually feature has some null values. Let us explore which particular cells have NaN and impute them with the median.

In [45]:
safe_driver[safe_driver.isnull().any(axis=1)]
                
Out[45]:
target Gender EngineHP credit_history Years_Experience annual_claims Marital_Status Vehical_type Miles_driven_annually size_of_family Age_bucket State
1235 1 F 124 793 27 0 Married Truck NaN 3 >40 NJ
7365 0 F 465 696 5 0 Married Truck NaN 8 18-27 SD
11464 1 F 137 787 18 1 Married Truck NaN 1 >40 CT
18158 0 F 108 747 8 1 Married Truck NaN 1 18-27 OR
19795 1 F 121 774 19 0 Married Truck NaN 2 28-34 NY
25731 1 F 355 694 15 1 Married Truck NaN 5 28-34 CT
26512 1 F 109 743 40 0 Married Truck NaN 1 >40 OR
27045 1 F 83 784 21 0 Married Truck NaN 1 >40 CT

It may make sense to impute the median for Vehical_type == 'Truck', as all the NaN values are for trucks only. Let us look at the median of Miles_driven_annually for each vehicle type.

In [46]:
safe_driver.head(2)
                
Out[46]:
target Gender EngineHP credit_history Years_Experience annual_claims Marital_Status Vehical_type Miles_driven_annually size_of_family Age_bucket State
0 1 F 522 656 1 0 Married Car 14749.0 5 <18 IL
1 1 F 691 704 16 0 Married Car 15389.0 6 28-34 NJ

Replace NaN values in Miles_driven_annually with the median value for Truck. There may be better ways to impute missing data, but we have just 8 NaN cells out of some 30,000+ rows, which is less than 0.03%. So imputing the median for all 8 cells is not going to skew our results.

In [47]:
m = safe_driver.groupby("Vehical_type")["Miles_driven_annually"].median()
m = pd.DataFrame(m)
median_values = m.loc["Truck",]
median_values = pd.DataFrame(median_values)
median_values.iloc[0,0]
                
Out[47]:
12370.5
In [48]:
# Replace NaN values in Miles_driven_annually with the median value for Truck.
# There may be better ways to impute missing data, but we have just 8 NaN cells
# out of some 30,000+ rows, which is less than 0.03%, so imputing the median
# for all 8 cells is not going to skew our results.

#safe_driver.fillna(median_values.loc['Truck', 'Miles_driven_annually'], inplace=True)

safe_driver.loc[safe_driver["Miles_driven_annually"].isnull() &
                (safe_driver["Vehical_type"] == "Truck"),
                "Miles_driven_annually"] = median_values.iloc[0,0]
                
In [49]:
safe_driver.loc[safe_driver["Miles_driven_annually"] == 12370.5,]
                
Out[49]:
target Gender EngineHP credit_history Years_Experience annual_claims Marital_Status Vehical_type Miles_driven_annually size_of_family Age_bucket State
1235 1 F 124 793 27 0 Married Truck 12370.5 3 >40 NJ
7365 0 F 465 696 5 0 Married Truck 12370.5 8 18-27 SD
11464 1 F 137 787 18 1 Married Truck 12370.5 1 >40 CT
18158 0 F 108 747 8 1 Married Truck 12370.5 1 18-27 OR
19795 1 F 121 774 19 0 Married Truck 12370.5 2 28-34 NY
25731 1 F 355 694 15 1 Married Truck 12370.5 5 28-34 CT
26512 1 F 109 743 40 0 Married Truck 12370.5 1 >40 OR
27045 1 F 83 784 21 0 Married Truck 12370.5 1 >40 CT

Check for null values again to make sure we did not miss any accidentally

In [50]:
safe_driver[safe_driver.isnull().any(axis=1)]
                
Out[50]:
target Gender EngineHP credit_history Years_Experience annual_claims Marital_Status Vehical_type Miles_driven_annually size_of_family Age_bucket State

Check the data types of all remaining features

In [51]:
safe_driver.info()
                
<class 'pandas.core.frame.DataFrame'>
                RangeIndex: 30240 entries, 0 to 30239
                Data columns (total 12 columns):
                 #   Column                 Non-Null Count  Dtype  
                ---  ------                 --------------  -----  
                 0   target                 30240 non-null  int64  
                 1   Gender                 30240 non-null  object 
                 2   EngineHP               30240 non-null  int64  
                 3   credit_history         30240 non-null  int64  
                 4   Years_Experience       30240 non-null  int64  
                 5   annual_claims          30240 non-null  int64  
                 6   Marital_Status         30240 non-null  object 
                 7   Vehical_type           30240 non-null  object 
                 8   Miles_driven_annually  30240 non-null  float64
                 9   size_of_family         30240 non-null  int64  
                 10  Age_bucket             30240 non-null  object 
                 11  State                  30240 non-null  object 
                dtypes: float64(1), int64(6), object(5)
                memory usage: 2.8+ MB
                

Looking at the feature values above, their ranges vary a lot. For example, 'Miles_driven_annually' is in the tens of thousands, whereas 'credit_history' is in the hundreds and 'annual_claims' is in single digits. Because of these varying magnitudes we will scale the features to Z-scores using sklearn.preprocessing.scale.

In [52]:
# To standardize the numeric features we need to isolate them first into a separate dataframe
safe_driver_num_features = safe_driver.drop(safe_driver.select_dtypes(['object']), axis=1)

# Do not standardize 'target', which is our label
safe_driver_num_features.drop(['target'], axis=1, inplace=True)

safe_driver_cat_features = safe_driver.select_dtypes(['object'])
                

Check if there are any NaN values one more time

In [53]:
safe_driver_num_features[safe_driver_num_features.isnull().any(axis=1)]
                
Out[53]:
EngineHP credit_history Years_Experience annual_claims Miles_driven_annually size_of_family

Scale the numeric features, restoring the column names from the original dataset

We now have the scaled feature set. Next we need to concatenate the categorical features back with our scaled dataset before encoding the categorical variables.

In [54]:
from sklearn import preprocessing
safe_driver_scaled = pd.DataFrame(preprocessing.scale(safe_driver_num_features),
                                  columns=safe_driver_num_features.columns)
                
In [55]:
# We will concatenate the scaled dataframe with the categorical feature set
safe_driver = pd.concat([safe_driver_scaled, safe_driver['target'], safe_driver_cat_features], axis=1)
                
In [56]:
safe_driver.head(2)
                
Out[56]:
EngineHP credit_history Years_Experience annual_claims Miles_driven_annually size_of_family target Gender Marital_Status Vehical_type Age_bucket State
0 2.458697 -0.290571 -1.239193 -1.051311 -0.152883 0.209362 1 F Married Car <18 IL
1 3.735665 0.177938 0.277478 -1.051311 -0.116272 0.646712 1 F Married Car 28-34 NJ

Univariate analysis and dummy variable creation

Now we are going to extract all object variables (categorical variables) and check their values

In [57]:
char = safe_driver.select_dtypes(exclude='number')
num = safe_driver.select_dtypes(include='number')
char.head()
                
Out[57]:
Gender Marital_Status Vehical_type Age_bucket State
0 F Married Car <18 IL
1 F Married Car 28-34 NJ
2 M Married Van >40 CT
3 M Married Van 18-27 CT
4 M Married Van >40 WY

Now we are going to check the frequency distribution of each categorical variable and treat them according to their distributions

In [58]:
# This option will display all rows
pd.set_option('display.max_rows', None)

# We are extracting all the unique values of the categorical variables
char.apply(lambda x: x.value_counts()).T.stack()
                
Out[58]:
Gender          F          13881.0
                                M          16359.0
                Marital_Status  Married    19820.0
                                Single     10420.0
                Vehical_type    Car        11582.0
                                Truck       8798.0
                                Utility     4007.0
                                Van         5853.0
                Age_bucket      18-27       8097.0
                                28-34       2056.0
                                35-40       6546.0
                                <18          911.0
                                >40        12630.0
                State           AK           205.0
                                AL           246.0
                                AR           255.0
                                AZ           225.0
                                CA           251.0
                                CO           272.0
                                CT          4444.0
                                DE           261.0
                                FL           251.0
                                GA           242.0
                                HI           225.0
                                IA           242.0
                                ID           251.0
                                IL           220.0
                                IN           241.0
                                KS           241.0
                                KY           248.0
                                LA           264.0
                                MA           284.0
                                MD           247.0
                                ME           248.0
                                MI           235.0
                                MN           242.0
                                MO           237.0
                                MS           220.0
                                MT           238.0
                                NC           221.0
                                ND           245.0
                                NE           222.0
                                NH           229.0
                                NJ          4884.0
                                NM           236.0
                                NV           239.0
                                NY          3686.0
                                OH           223.0
                                OK           260.0
                                OR          3838.0
                                PA           257.0
                                RI           242.0
                                SC           249.0
                                SD           229.0
                                TN           242.0
                                TX           233.0
                                UT           244.0
                                VA           252.0
                                VT          1429.0
                                WA           233.0
                                WI           271.0
                                WV          1253.0
                                WY           288.0
                dtype: float64

Observation: We can see that the variable State has too many values to create dummy variables for. Hence, we will drop it. For the rest of the variables, we will create dummy variables (a quick check of the dummy-column counts follows below).
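As a quick illustrative check (not part of the original analysis), we can count the levels of each categorical column and the number of dummy columns encoding would create with and without State:

# Number of distinct levels in each categorical column
print(char.nunique())

# Dummy columns created with and without State
print("with State:   ", pd.get_dummies(char, drop_first=True).shape[1])
print("without State:", pd.get_dummies(char.drop(columns="State"), drop_first=True).shape[1])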

In [59]:
# Dropping the State column
char = char.drop("State", axis=1)
                
In [60]:
char = pd.get_dummies(data=char,drop_first=True)
                
In [61]:
safe_driver_num_features = pd.concat([safe_driver_num_features, safe_driver['target']], axis=1)
                
In [62]:
safe_driver_num_features.info()
                
<class 'pandas.core.frame.DataFrame'>
                RangeIndex: 30240 entries, 0 to 30239
                Data columns (total 7 columns):
                 #   Column                 Non-Null Count  Dtype  
                ---  ------                 --------------  -----  
                 0   EngineHP               30240 non-null  int64  
                 1   credit_history         30240 non-null  int64  
                 2   Years_Experience       30240 non-null  int64  
                 3   annual_claims          30240 non-null  int64  
                 4   Miles_driven_annually  30240 non-null  float64
                 5   size_of_family         30240 non-null  int64  
                 6   target                 30240 non-null  int64  
                dtypes: float64(1), int64(6)
                memory usage: 1.6 MB
                

Below, we separate the feature set from the target label and convert all the categorical variables to numeric. Then we split the feature set into training and test data sets.

Let us convert the categorical features into numeric form, giving a weightage to each value:

  1. Gender: 1 = Female and 2 = Male
  2. Marital_Status: 1 = Single and 2 = Married
  3. Vehicle_Type: Use LabelEncoder
  4. Age_bucket: Use LabelEncoder

    We are not using dummies or OneHotEncoder because these create sparse matrices and increase dimensionality. By assigning a 1 or a 2 for, say, Marital_Status, we give a higher weightage to Married (value 2).
In [63]:
safe_driver.head(3)
                
Out[63]:
EngineHP credit_history Years_Experience annual_claims Miles_driven_annually size_of_family target Gender Marital_Status Vehical_type Age_bucket State
0 2.458697 -0.290571 -1.239193 -1.051311 -0.152883 0.209362 1 F Married Car <18 IL
1 3.735665 0.177938 0.277478 -1.051311 -0.116272 0.646712 1 F Married Car 28-34 NJ
2 -0.480595 0.051050 0.176366 -1.051311 -0.427060 -0.665340 1 M Married Van >40 CT
In [64]:
# Convert Gender to a 1 or a 2
safe_driver['Gender'] = np.where(safe_driver['Gender'] == 'F', 1, 2)

# Convert Marital_Status to a 1 or a 2
safe_driver['Marital_Status'] = np.where(safe_driver['Marital_Status'] == 'Single', 1, 2)

# Convert Vehical_type using LabelEncoder
le = preprocessing.LabelEncoder()
le.fit(safe_driver['Vehical_type'])
safe_driver['Vehical_type'] = le.transform(safe_driver['Vehical_type'])

# Convert Age_bucket using LabelEncoder
le.fit(safe_driver['Age_bucket'])
safe_driver['Age_bucket'] = le.transform(safe_driver['Age_bucket'])
                
In [93]:
safe_driver.head(2)
                
Out[93]:
EngineHP credit_history Years_Experience annual_claims Miles_driven_annually size_of_family target Gender Marital_Status Vehical_type Age_bucket State
0 2.458697 -0.290571 -1.239193 -1.051311 -0.152883 0.209362 1 1 2 0 3 IL
1 3.735665 0.177938 0.277478 -1.051311 -0.116272 0.646712 1 1 2 0 1 NJ
In [176]:
# Restore safe_driver from the backup copy taken earlier in this cell:
#panu = safe_driver.copy()
safe_driver = panu.copy()
                

Segregating the independent and dependent variables as X and y

In [177]:
# Drop the 'target' column (our label) and the 'State' column from the training dataframe
X = safe_driver.drop(['target', 'State'], axis=1)

# The 'target' column is our label or outcome that we want to predict
y = safe_driver['target']
                

We found out much earlier that our target label is 70% success (good driver or target == 1) and 30% failure (bad driver or target == 0). Let us do class balancing using SMOTE and see the distribution.

In [178]:
from imblearn.over_sampling import SMOTE

# Use a name other than 'os' so we do not shadow the os module imported earlier
smote = SMOTE(random_state=0)

columns = X.columns
os_data_X, os_data_y = smote.fit_resample(X, y)
#os_data_X = pd.DataFrame(data=os_data_X, columns=columns)
#os_data_y = pd.DataFrame(data=os_data_y, columns=['y'])
                

Split the resulting balanced data set into train and test sets

In [181]:
X_train, X_test, y_train, y_test = train_test_split(os_data_X, os_data_y, test_size=0.3, random_state=0)
                

Decision Tree Classifier

In [180]:
def dt():
    from sklearn.tree import DecisionTreeClassifier
    classifier = DecisionTreeClassifier(random_state=0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    # Validation of the model
    from sklearn.metrics import classification_report, confusion_matrix
    r = classifier.score(X_test, y_test)
    print(classification_report(y_test, y_pred))
    return r

dt()
                
              precision    recall  f1-score   support
                
                           0       0.66      0.68      0.67      6418
                           1       0.67      0.65      0.66      6420
                
                    accuracy                           0.67     12838
                   macro avg       0.67      0.67      0.67     12838
                weighted avg       0.67      0.67      0.67     12838
                
                
Out[180]:
0.6664589499922107

We achieved an accuracy of 66.6% using the Decision Tree


Random Forest Classifier

In [183]:
def rf():
    from sklearn.ensemble import RandomForestClassifier
    rf = RandomForestClassifier(n_estimators=1000, random_state=0)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    # Validation of the model
    from sklearn.metrics import classification_report, confusion_matrix
    r = rf.score(X_test, y_test)
    print(classification_report(y_test, y_pred))
    return r

rf()
                
              precision    recall  f1-score   support
                
                           0       0.79      0.70      0.74      6418
                           1       0.73      0.81      0.77      6420
                
                    accuracy                           0.76     12838
                   macro avg       0.76      0.76      0.76     12838
                weighted avg       0.76      0.76      0.76     12838
                
                
Out[183]:
0.75767253466272

We achieved an accuracy of 75.7% using Random Forest


Stochastic Gradient Descent Classifier

In [184]:
def sgd():
    from sklearn import linear_model
    clf = linear_model.SGDClassifier(max_iter=200, tol=1e-3)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Validation of the model
    from sklearn.metrics import classification_report, confusion_matrix
    r = clf.score(X_test, y_test)
    print(classification_report(y_test, y_pred))
    return r

sgd()
                
              precision    recall  f1-score   support
                
                           0       0.27      0.00      0.00      6418
                           1       0.50      1.00      0.67      6420
                
                    accuracy                           0.50     12838
                   macro avg       0.39      0.50      0.33     12838
                weighted avg       0.39      0.50      0.33     12838
                
                
Out[184]:
0.49968842498831595

We achieved an accuracy of only 49.9% using Stochastic Gradient Descent
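The near-random score suggests the SGD classifier did not converge well here. A hedged sketch of one possible remedy, not run above: standardize every column, including the integer-coded categorical features, inside a pipeline and allow more iterations. The parameter values are illustrative assumptions.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

# Scale all features; the label-encoded categoricals were never standardized above
sgd_pipe = make_pipeline(StandardScaler(),
                         SGDClassifier(max_iter=2000, tol=1e-3, random_state=0))
sgd_pipe.fit(X_train, y_train)
print(sgd_pipe.score(X_test, y_test))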


Ridge Classifier

In [185]:
def ridge():
    from sklearn.linear_model import RidgeClassifier
    clf = RidgeClassifier()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Validation of the model
    from sklearn.metrics import classification_report, confusion_matrix
    r = clf.score(X_test, y_test)
    print(classification_report(y_test, y_pred))
    return r

ridge()
                
              precision    recall  f1-score   support
                
                           0       0.50      0.49      0.50      6418
                           1       0.50      0.52      0.51      6420
                
                    accuracy                           0.50     12838
                   macro avg       0.50      0.50      0.50     12838
                weighted avg       0.50      0.50      0.50     12838
                
                
Out[185]:
0.5037389001402087

We achieved an accuracy of 50.3% using the Ridge Classifier


Gradient Boosting Classifier

In [186]:
def gb():
    from sklearn.ensemble import GradientBoostingClassifier
    clf = GradientBoostingClassifier()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Validation of the model
    from sklearn.metrics import classification_report, confusion_matrix
    r = clf.score(X_test, y_test)
    print(classification_report(y_test, y_pred))
    return r

gb()
                
              precision    recall  f1-score   support
                
                           0       0.84      0.49      0.62      6418
                           1       0.64      0.91      0.75      6420
                
                    accuracy                           0.70     12838
                   macro avg       0.74      0.70      0.68     12838
                weighted avg       0.74      0.70      0.68     12838
                
                
Out[186]:
0.6983174949369061

We achieved an accuracy of 69.8% using the Gradient Boosting Classifier


Extreme Gradient Boosting Classifier

In [188]:
def xgb():
    from xgboost import XGBClassifier
    clf = XGBClassifier()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Validation of the model
    from sklearn.metrics import classification_report, confusion_matrix
    r = clf.score(X_test, y_test)
    print(classification_report(y_test, y_pred))
    return r

xgb()
                
[18:18:05] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
                              precision    recall  f1-score   support
                
                           0       0.90      0.58      0.70      6418
                           1       0.69      0.94      0.79      6420
                
                    accuracy                           0.76     12838
                   macro avg       0.80      0.76      0.75     12838
                weighted avg       0.80      0.76      0.75     12838
                
                
Out[188]:
0.7568157033805889

We achieved an accuracy of 75.8% using Extreme Gradient Boosting

In [195]:
models = {
    'Algorithm': ['Decision Tree', 'Random Forest', 'Ridge', 'Gradient Boosting', 'Extreme Gradient Boosting'],
    'Accuracy': [66.6, 75.6, 50.3, 69.8, 75.8],
}
score = pd.DataFrame(models)
score = score.sort_values("Accuracy", ascending=False)
score
                
Out[195]:
Algorithm Accuracy
4 Extreme Gradient Boosting 75.8
1 Random Forest 75.6
3 Gradient Boosting 69.8
0 Decision Tree 66.6
2 Ridge 50.3
In [202]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 3))
sns.barplot(x='Algorithm', y='Accuracy', data=score)
                
Out[202]:
<AxesSubplot:xlabel='Algorithm', ylabel='Accuracy'>
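The first cell imported GridSearchCV and StratifiedKFold, which we never used. As a closing sketch, here is one way the ensemble models could be tuned further; the parameter grid below is an illustrative assumption, not a result from this notebook.

# Hyperparameter tuning sketch for the Random Forest using the unused imports
param_grid = {
    "n_estimators": [200, 500, 1000],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                    scoring="accuracy", cv=cv, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))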

Analytics Educator, based in Kolkata, is the best institute for Data Science courses. We specialize in training students in data science even when they come from non-technical backgrounds with zero programming or statistical knowledge, and we help them learn data science and get a job in this field.

We provide a 100% money-back guarantee on learning: every student of Analytics Educator will be able to understand every line of code and every algorithm, or we will refund the fee.

Readers of this blog are welcome to mail us their opinions on how to further improve the model; our contact details are available here.

If you want to read more such case studies, click on Whom should you ask for donations for a charity or Identify if a patient has cancer.

Regression problems can be found at House Price Prediction and Insurance Premium Prediction.

Copyright © 2017 Analytics Educator