Analytics Educator

Data science has risen to prominence in the last decade because of its predictive capabilities. While many business verticals value predictive algorithms, insurance companies place particular importance on them because data science helps keep premiums low. Data has always been at the core of what insurers do: they analyze claims, the kind of vehicle a customer drives, and how many miles they drive per day, among other factors.

The data science field keeps gaining strength with improvements in technology and the availability of statistical libraries for regression and classification. Actuaries, the data scientists of insurance companies as they were called a decade ago, used to collate data from different sources and analyze premium and claim data to identify fraudulent transactions, which helped keep premiums low. If anything, today's data science technology gives them far more tools to perform their analysis.

The dataset contains a few ordinal and categorical features that need to be parsed and encoded properly.

Our goal is to predict a binary outcome: 1 indicates a safe driver, and 0 indicates that the driver's data needs review. We will also look at the continuous variables and fill in missing data with the mean or median so as not to skew our results.

After cleaning up the data and filling in missing values, we will look at the features and their correlations so that we can drop highly correlated features, which may distort our results.
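As a preview of that correlation screen, here is a minimal sketch; the DataFrame name df and the 0.9 cutoff are illustrative assumptions rather than values used later in this notebook.

import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    # Absolute pairwise correlations between the numeric features
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # Columns correlated above the cutoff with an earlier column
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)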

We are importing all the required packages

In [34]:
# Import the necessary packages of Python that we will/may use in this notebook
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import LabelBinarizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import roc_curve, precision_recall_curve, auc, make_scorer, recall_score, accuracy_score, precision_score, confusion_matrix, fbeta_score, roc_auc_score
from sklearn.metrics import classification_report as cr
import matplotlib.pyplot as plt
plt.style.use("ggplot")
import os
                

Import the data into Python

In [35]:
# Read the data from the local drive
os.chdir("C:\\Users\\ASUS\\Desktop")
safe_driver = pd.read_csv('safe_driver.csv')
safe_driver.head()
                
Out[35]:
ID target Gender EngineHP credit_history Years_Experience annual_claims Marital_Status Vehical_type Miles_driven_annually size_of_family Age_bucket EngineHP_bucket Years_Experience_bucket Miles_driven_annually_bucket credit_history_bucket State
0 1 1 F 522 656 1 0 Married Car 14749.0 5 <18 >350 <3 <15k Fair IL
1 2 1 F 691 704 16 0 Married Car 15389.0 6 28-34 >350 15-30 15k-25k Good NJ
2 3 1 M 133 691 15 0 Married Van 9956.0 3 >40 90-160 15-30 <15k Good CT
3 4 1 M 146 720 9 0 Married Van 77323.0 3 18-27 90-160 9-14' >25k Good CT
4 5 1 M 128 771 33 1 Married Van 14183.0 4 >40 90-160 >30 <15k Very Good WY

In the above data, target is our dependent variable: 1 indicates a safe driver and 0 a driver who is not safe. The other independent variables describe the type of vehicle the driver drives along with their personal and professional details.

Checking the descriptive statistics for any obviously impossible values (such as a negative family size)

Here we do not see any problems with the data.

In [36]:
safe_driver.describe()
                
Out[36]:
ID target EngineHP credit_history Years_Experience annual_claims Miles_driven_annually size_of_family
count 30240.000000 30240.00000 30240.000000 30240.000000 30240.000000 30240.000000 30232.000000 30240.000000
mean 15120.500000 0.70754 196.604266 685.769775 13.255721 1.138459 17422.938939 4.521296
std 8729.680407 0.45490 132.346961 102.454307 9.890246 1.082913 17483.782840 2.286531
min 1.000000 0.00000 80.000000 300.000000 1.000000 0.000000 5000.000000 1.000000
25% 7560.750000 0.00000 111.000000 668.000000 5.000000 0.000000 9668.500000 3.000000
50% 15120.500000 1.00000 141.000000 705.000000 10.000000 1.000000 12280.000000 5.000000
75% 22680.250000 1.00000 238.000000 753.000000 20.000000 2.000000 14697.250000 7.000000
max 30240.000000 1.00000 1005.000000 850.000000 40.000000 4.000000 99943.000000 8.000000

Checking if the data has any missing values

We see that there are two variables with missing values.

In [37]:
# Check if there are any NULL data that need to be dropped
safe_driver.isnull().mean()*100
                
Out[37]:
ID                              0.000000
                target                          0.000000
                Gender                          0.000000
                EngineHP                        0.000000
                credit_history                  0.000000
                Years_Experience                0.000000
                annual_claims                   0.000000
                Marital_Status                  0.000000
                Vehical_type                    0.000000
                Miles_driven_annually           0.026455
                size_of_family                  0.000000
                Age_bucket                      0.000000
                EngineHP_bucket                 0.000000
                Years_Experience_bucket         0.000000
                Miles_driven_annually_bucket    0.026455
                credit_history_bucket           0.000000
                State                           0.000000
                dtype: float64
In [38]:
#safe_driver = safe_driver.dropna()
                
In [39]:
safe_driver.head()
                
Out[39]:
ID target Gender EngineHP credit_history Years_Experience annual_claims Marital_Status Vehical_type Miles_driven_annually size_of_family Age_bucket EngineHP_bucket Years_Experience_bucket Miles_driven_annually_bucket credit_history_bucket State
0 1 1 F 522 656 1 0 Married Car 14749.0 5 <18 >350 <3 <15k Fair IL
1 2 1 F 691 704 16 0 Married Car 15389.0 6 28-34 >350 15-30 15k-25k Good NJ
2 3 1 M 133 691 15 0 Married Van 9956.0 3 >40 90-160 15-30 <15k Good CT
3 4 1 M 146 720 9 0 Married Van 77323.0 3 18-27 90-160 9-14' >25k Good CT
4 5 1 M 128 771 33 1 Married Van 14183.0 4 >40 90-160 >30 <15k Very Good WY

Here we check the frequency distribution of the dependent variable to see whether the data is balanced.

We see that the data is moderately imbalanced, with the proportions of 1 and 0 being roughly 70% and 30% respectively.

In [40]:
safe_driver.target.value_counts(normalize=True)*100
                
Out[40]:
1    70.753968
                0    29.246032
                Name: target, dtype: float64

We are now extracting the categorical variables, to transform/drop them as required.

In [41]:
cat_features = safe_driver.select_dtypes(include=['object'])
print(cat_features.columns)
                
Index(['Gender', 'Marital_Status', 'Vehical_type', 'Age_bucket',
                       'EngineHP_bucket', 'Years_Experience_bucket',
                       'Miles_driven_annually_bucket', 'credit_history_bucket', 'State'],
                      dtype='object')
                
In [42]:
cat_features.head(2)
                
Out[42]:
Gender Marital_Status Vehical_type Age_bucket EngineHP_bucket Years_Experience_bucket Miles_driven_annually_bucket credit_history_bucket State
0 F Married Car <18 >350 <3 <15k Fair IL
1 F Married Car 28-34 >350 15-30 15k-25k Good NJ

Among the categorical variables we retain the following:

  1. Gender
  2. Marital_Status
  3. Vehicle_Type, and
  4. Age_bucket

    EngineHP_bucket, Years_Experience_bucket, Miles_driven_annually_bucket, and credit_history_bucket each have a corresponding continuous variable, so creating dummies for them alongside the continuous variable does not make sense. We keep Age_bucket because there is no continuous variable representing age.

    We could split the dataset by State (one sub-dataset for each state) and, since each US state has its own regulations, analyze each state by itself (a minimal sketch follows below). We could aggregate our results across states later to get a national statistic.

    Or, for now, we could drop the State column and analyze the data across the nation later.
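A minimal sketch of that per-state split, assuming the safe_driver DataFrame loaded above; it is not actually run in this notebook, and the state code 'NJ' is just an example.

# One sub-dataset per state, keyed by the two-letter state code
state_frames = {state: frame.drop(columns="State")
                for state, frame in safe_driver.groupby("State")}

# Example: inspect a single state in isolation
nj_drivers = state_frames["NJ"]
print(nj_drivers["target"].value_counts(normalize=True))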

Drop these 5 columns: ID, EngineHP_bucket, Years_Experience_bucket, Miles_driven_annually_bucket, credit_history_bucket

In [43]:
safe_driver.drop(['ID', 'EngineHP_bucket', 'Years_Experience_bucket',
                  'Miles_driven_annually_bucket',
                  'credit_history_bucket'], axis=1, inplace=True)
                
In [44]:
# Check if the dataset has any NaN values as these values will make our algorithms throw an exception
safe_driver.isnull().sum()
                
Out[44]:
target                   0
                Gender                   0
                EngineHP                 0
                credit_history           0
                Years_Experience         0
                annual_claims            0
                Marital_Status           0
                Vehical_type             0
                Miles_driven_annually    8
                size_of_family           0
                Age_bucket               0
                State                    0
                dtype: int64

The Miles_driven_annually feature has some null values. Let us explore which particular cells have NaN and impute them with the median.

In [45]:
safe_driver[safe_driver.isnull().any(axis=1)]
                
Out[45]:
target Gender EngineHP credit_history Years_Experience annual_claims Marital_Status Vehical_type Miles_driven_annually size_of_family Age_bucket State
1235 1 F 124 793 27 0 Married Truck NaN 3 >40 NJ
7365 0 F 465 696 5 0 Married Truck NaN 8 18-27 SD
11464 1 F 137 787 18 1 Married Truck NaN 1 >40 CT
18158 0 F 108 747 8 1 Married Truck NaN 1 18-27 OR
19795 1 F 121 774 19 0 Married Truck NaN 2 28-34 NY
25731 1 F 355 694 15 1 Married Truck NaN 5 28-34 CT
26512 1 F 109 743 40 0 Married Truck NaN 1 >40 OR
27045 1 F 83 784 21 0 Married Truck NaN 1 >40 CT

It may make sense to impute the median for Vehical_type == 'Truck', as all the NaN values are for trucks only. Let us look at the median of Miles_driven_annually for each vehicle type.

In [46]:
safe_driver.head(2)
                
Out[46]:
target Gender EngineHP credit_history Years_Experience annual_claims Marital_Status Vehical_type Miles_driven_annually size_of_family Age_bucket State
0 1 F 522 656 1 0 Married Car 14749.0 5 <18 IL
1 1 F 691 704 16 0 Married Car 15389.0 6 28-34 NJ

Replace NaN values in Miles_driven_annually with the median value for Truck. There may be better ways to impute missing data, but we have just 8 NaN cells out of some 30,000+ rows, which is less than 0.03%. So imputing the median for all 8 cells is not going to skew our results.

In [47]:
m = safe_driver.groupby("Vehical_type")["Miles_driven_annually"].median()
m = pd.DataFrame(m)
median_values = m.loc["Truck",]
median_values = pd.DataFrame(median_values)
median_values.iloc[0,0]
                
Out[47]:
12370.5
In [48]:
# Replace NaN values in Miles_driven_annually with the median value for Truck.
# There may be better ways to impute missing data, but we have just 8 NaN cells
# out of some 30,000+ rows, which is less than 0.03%, so imputing the median
# for all 8 cells is not going to skew our results.

#safe_driver.fillna(median_values.loc['Truck', 'Miles_driven_annually'], inplace=True)

safe_driver.loc[safe_driver["Miles_driven_annually"].isnull() &
                (safe_driver["Vehical_type"] == "Truck"),
                "Miles_driven_annually"] = median_values.iloc[0,0]
                
In [49]:
safe_driver.loc[safe_driver["Miles_driven_annually"] == 12370.5,]
                
Out[49]:
target Gender EngineHP credit_history Years_Experience annual_claims Marital_Status Vehical_type Miles_driven_annually size_of_family Age_bucket State
1235 1 F 124 793 27 0 Married Truck 12370.5 3 >40 NJ
7365 0 F 465 696 5 0 Married Truck 12370.5 8 18-27 SD
11464 1 F 137 787 18 1 Married Truck 12370.5 1 >40 CT
18158 0 F 108 747 8 1 Married Truck 12370.5 1 18-27 OR
19795 1 F 121 774 19 0 Married Truck 12370.5 2 28-34 NY
25731 1 F 355 694 15 1 Married Truck 12370.5 5 28-34 CT
26512 1 F 109 743 40 0 Married Truck 12370.5 1 >40 OR
27045 1 F 83 784 21 0 Married Truck 12370.5 1 >40 CT

Check for null values again to make sure we did not miss any accidentally

In [50]:
safe_driver[safe_driver.isnull().any(axis=1)]
                
Out[50]:
target Gender EngineHP credit_history Years_Experience annual_claims Marital_Status Vehical_type Miles_driven_annually size_of_family Age_bucket State

Check the data types of all remaining features

In [51]:
safe_driver.info()
                
<class 'pandas.core.frame.DataFrame'>
                RangeIndex: 30240 entries, 0 to 30239
                Data columns (total 12 columns):
                 #   Column                 Non-Null Count  Dtype  
                ---  ------                 --------------  -----  
                 0   target                 30240 non-null  int64  
                 1   Gender                 30240 non-null  object 
                 2   EngineHP               30240 non-null  int64  
                 3   credit_history         30240 non-null  int64  
                 4   Years_Experience       30240 non-null  int64  
                 5   annual_claims          30240 non-null  int64  
                 6   Marital_Status         30240 non-null  object 
                 7   Vehical_type           30240 non-null  object 
                 8   Miles_driven_annually  30240 non-null  float64
                 9   size_of_family         30240 non-null  int64  
                 10  Age_bucket             30240 non-null  object 
                 11  State                  30240 non-null  object 
                dtypes: float64(1), int64(6), object(5)
                memory usage: 2.8+ MB
                

Looking at the feature values above, their ranges vary a lot. For example, 'Miles_driven_annually' is in the tens of thousands, whereas 'credit_history' is in the hundreds and 'annual_claims' is in single digits. Because of these varying magnitudes we will scale the features to Z-scores using sklearn.preprocessing.scale.

In [52]:
# To standardize the numeric features we need to isolate them first into a separate dataframe
safe_driver_num_features = safe_driver.drop(safe_driver.select_dtypes(['object']), axis=1)

# Do not standardize 'target', which is our label
safe_driver_num_features.drop(['target'], axis=1, inplace=True)

safe_driver_cat_features = safe_driver.select_dtypes(['object'])
                

Check if there are any NaN values one more time

In [53]:
safe_driver_num_features[safe_driver_num_features.isnull().any(axis=1)]
                
Out[53]:
EngineHP credit_history Years_Experience annual_claims Miles_driven_annually size_of_family

Scale the numeric features, restoring the column names from the original dataset

We now have the scaled feature set. Next we need to concatenate the categorical features back with our scaled dataset before encoding the categorical variables.

In [54]:
from sklearn import preprocessing
safe_driver_scaled = pd.DataFrame(preprocessing.scale(safe_driver_num_features),
                                  columns=safe_driver_num_features.columns)
                
In [55]:
# We will concatenate the scaled dataframe with the categorical feature set
safe_driver = pd.concat([safe_driver_scaled, safe_driver['target'], safe_driver_cat_features], axis=1)
                
In [56]:
safe_driver.head(2)
                
Out[56]:
EngineHP credit_history Years_Experience annual_claims Miles_driven_annually size_of_family target Gender Marital_Status Vehical_type Age_bucket State
0 2.458697 -0.290571 -1.239193 -1.051311 -0.152883 0.209362 1 F Married Car <18 IL
1 3.735665 0.177938 0.277478 -1.051311 -0.116272 0.646712 1 F Married Car 28-34 NJ

Univariate analysis and dummy variable creation

Now we are going to extract all object variables (categorical variables) and check their values

In [57]:
char = safe_driver.select_dtypes(exclude='number')
num = safe_driver.select_dtypes(include='number')
char.head()
                
Out[57]:
Gender Marital_Status Vehical_type Age_bucket State
0 F Married Car <18 IL
1 F Married Car 28-34 NJ
2 M Married Van >40 CT
3 M Married Van 18-27 CT
4 M Married Van >40 WY

Now we are going to check the frequency distribution of each categorical variable and treat them according to their distributions

In [58]:
# This option will display all rows
pd.set_option('display.max_rows', None)

# We are extracting all the unique values of the categorical variables
char.apply(lambda x: x.value_counts()).T.stack()
                
Out[58]:
Gender          F          13881.0
                                M          16359.0
                Marital_Status  Married    19820.0
                                Single     10420.0
                Vehical_type    Car        11582.0
                                Truck       8798.0
                                Utility     4007.0
                                Van         5853.0
                Age_bucket      18-27       8097.0
                                28-34       2056.0
                                35-40       6546.0
                                <18          911.0
                                >40        12630.0
                State           AK           205.0
                                AL           246.0
                                AR           255.0
                                AZ           225.0
                                CA           251.0
                                CO           272.0
                                CT          4444.0
                                DE           261.0
                                FL           251.0
                                GA           242.0
                                HI           225.0
                                IA           242.0
                                ID           251.0
                                IL           220.0
                                IN           241.0
                                KS           241.0
                                KY           248.0
                                LA           264.0
                                MA           284.0
                                MD           247.0
                                ME           248.0
                                MI           235.0
                                MN           242.0
                                MO           237.0
                                MS           220.0
                                MT           238.0
                                NC           221.0
                                ND           245.0
                                NE           222.0
                                NH           229.0
                                NJ          4884.0
                                NM           236.0
                                NV           239.0
                                NY          3686.0
                                OH           223.0
                                OK           260.0
                                OR          3838.0
                                PA           257.0
                                RI           242.0
                                SC           249.0
                                SD           229.0
                                TN           242.0
                                TX           233.0
                                UT           244.0
                                VA           252.0
                                VT          1429.0
                                WA           233.0
                                WI           271.0
                                WV          1253.0
                                WY           288.0
                dtype: float64

Observation: We can see that the variable State has too many values to create dummy variables for. Hence, we will drop it. For the rest of the variables, we will create dummy variables (a quick check of the dummy-column counts follows below).
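As a quick illustrative check (not part of the original analysis), we can count the levels of each categorical column and the number of dummy columns encoding would create with and without State:

# Number of distinct levels in each categorical column
print(char.nunique())

# Dummy columns created with and without State
print("with State:   ", pd.get_dummies(char, drop_first=True).shape[1])
print("without State:", pd.get_dummies(char.drop(columns="State"), drop_first=True).shape[1])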

In [59]:
# Dropping the State column
char = char.drop("State", axis=1)
                
In [60]:
char = pd.get_dummies(data=char,drop_first=True)
                
In [61]:
safe_driver_num_features = pd.concat([safe_driver_num_features, safe_driver['target']], axis=1)
                
In [62]:
safe_driver_num_features.info()
                
<class 'pandas.core.frame.DataFrame'>
                RangeIndex: 30240 entries, 0 to 30239
                Data columns (total 7 columns):
                 #   Column                 Non-Null Count  Dtype  
                ---  ------                 --------------  -----  
                 0   EngineHP               30240 non-null  int64  
                 1   credit_history         30240 non-null  int64  
                 2   Years_Experience       30240 non-null  int64  
                 3   annual_claims          30240 non-null  int64  
                 4   Miles_driven_annually  30240 non-null  float64
                 5   size_of_family         30240 non-null  int64  
                 6   target                 30240 non-null  int64  
                dtypes: float64(1), int64(6)
                memory usage: 1.6 MB
                

Below, we separate the feature set from the target label and convert all the categorical variables to numeric. Then we split the feature set into training and test data sets.

Let us convert the categorical features into numeric form, giving a weightage to each value:

  1. Gender: 1 = Female and 2 = Male
  2. Marital_Status: 1 = Single and 2 = Married
  3. Vehicle_Type: Use LabelEncoder
  4. Age_bucket: Use LabelEncoder

    We are not using dummies or OneHotEncoder because these create sparse matrices and increase dimensionality. By assigning a 1 or a 2 for, say, Marital_Status, we give a higher weightage to Married (value 2).
In [63]:
safe_driver.head(3)
                
Out[63]:
EngineHP credit_history Years_Experience annual_claims Miles_driven_annually size_of_family target Gender Marital_Status Vehical_type Age_bucket State
0 2.458697 -0.290571 -1.239193 -1.051311 -0.152883 0.209362 1 F Married Car <18 IL
1 3.735665 0.177938 0.277478 -1.051311 -0.116272 0.646712 1 F Married Car 28-34 NJ
2 -0.480595 0.051050 0.176366 -1.051311 -0.427060 -0.665340 1 M Married Van >40 CT
In [64]:
# Convert Gender to a 1 or a 2
safe_driver['Gender'] = np.where(safe_driver['Gender'] == 'F', 1, 2)

# Convert Marital_Status to a 1 or a 2
safe_driver['Marital_Status'] = np.where(safe_driver['Marital_Status'] == 'Single', 1, 2)

# Convert Vehical_type using LabelEncoder
le = preprocessing.LabelEncoder()
le.fit(safe_driver['Vehical_type'])
safe_driver['Vehical_type'] = le.transform(safe_driver['Vehical_type'])

# Convert Age_bucket using LabelEncoder
le.fit(safe_driver['Age_bucket'])
safe_driver['Age_bucket'] = le.transform(safe_driver['Age_bucket'])
                
In [93]:
safe_driver.head(2)
                
Out[93]:
EngineHP credit_history Years_Experience annual_claims Miles_driven_annually size_of_family target Gender Marital_Status Vehical_type Age_bucket State
0 2.458697 -0.290571 -1.239193 -1.051311 -0.152883 0.209362 1 1 2 0 3 IL
1 3.735665 0.177938 0.277478 -1.051311 -0.116272 0.646712 1 1 2 0 1 NJ
In [176]:
# Restore safe_driver from the backup copy taken earlier in this cell:
#panu = safe_driver.copy()
safe_driver = panu.copy()
                

Segregating the independent and dependent variables as X and y

In [177]:
# Drop the 'target' column (our label) and the 'State' column from the training dataframe
X = safe_driver.drop(['target', 'State'], axis=1)

# The 'target' column is our label or outcome that we want to predict
y = safe_driver['target']
                

We found out much earlier that our target label is 70% success (good driver or target == 1) and 30% failure (bad driver or target == 0). Let us do class balancing using SMOTE and see the distribution.

In [178]:
from imblearn.over_sampling import SMOTE

# Use a name other than 'os' so we do not shadow the os module imported earlier
smote = SMOTE(random_state=0)

columns = X.columns
os_data_X, os_data_y = smote.fit_resample(X, y)
#os_data_X = pd.DataFrame(data=os_data_X, columns=columns)
#os_data_y = pd.DataFrame(data=os_data_y, columns=['y'])
                

Split the resulting balanced data set into train and test sets

In [181]:
X_train, X_test, y_train, y_test = train_test_split(os_data_X, os_data_y, test_size=0.3, random_state=0)
                

Decision Tree Classifier

In [180]:
def dt():
    from sklearn.tree import DecisionTreeClassifier
    classifier = DecisionTreeClassifier(random_state=0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    # Validation of the model
    from sklearn.metrics import classification_report, confusion_matrix
    r = classifier.score(X_test, y_test)
    print(classification_report(y_test, y_pred))
    return r

dt()
                
              precision    recall  f1-score   support
                
                           0       0.66      0.68      0.67      6418
                           1       0.67      0.65      0.66      6420
                
                    accuracy                           0.67     12838
                   macro avg       0.67      0.67      0.67     12838
                weighted avg       0.67      0.67      0.67     12838
                
                
Out[180]:
0.6664589499922107

We achieved an accuracy of 66.6% using the Decision Tree


Random Forest Classifier

In [183]:
def rf():
    from sklearn.ensemble import RandomForestClassifier
    rf = RandomForestClassifier(n_estimators=1000, random_state=0)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    # Validation of the model
    from sklearn.metrics import classification_report, confusion_matrix
    r = rf.score(X_test, y_test)
    print(classification_report(y_test, y_pred))
    return r

rf()
                
              precision    recall  f1-score   support
                
                           0       0.79      0.70      0.74      6418
                           1       0.73      0.81      0.77      6420
                
                    accuracy                           0.76     12838
                   macro avg       0.76      0.76      0.76     12838
                weighted avg       0.76      0.76      0.76     12838
                
                
Out[183]:
0.75767253466272

We achieved an accuracy of 75.7% using Random Forest


Stochastic Gradient Descent Classifier

In [184]:
def sgd():
    from sklearn import linear_model
    clf = linear_model.SGDClassifier(max_iter=200, tol=1e-3)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Validation of the model
    from sklearn.metrics import classification_report, confusion_matrix
    r = clf.score(X_test, y_test)
    print(classification_report(y_test, y_pred))
    return r

sgd()
                
              precision    recall  f1-score   support
                
                           0       0.27      0.00      0.00      6418
                           1       0.50      1.00      0.67      6420
                
                    accuracy                           0.50     12838
                   macro avg       0.39      0.50      0.33     12838
                weighted avg       0.39      0.50      0.33     12838
                
                
Out[184]:
0.49968842498831595

We achieved an accuracy of only 49.9% using Stochastic Gradient Descent
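The near-random score suggests the SGD classifier did not converge well here. A hedged sketch of one possible remedy, not run above: standardize every column, including the integer-coded categorical features, inside a pipeline and allow more iterations. The parameter values are illustrative assumptions.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

# Scale all features; the label-encoded categoricals were never standardized above
sgd_pipe = make_pipeline(StandardScaler(),
                         SGDClassifier(max_iter=2000, tol=1e-3, random_state=0))
sgd_pipe.fit(X_train, y_train)
print(sgd_pipe.score(X_test, y_test))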


Ridge Classifier

In [185]:
def ridge():
    from sklearn.linear_model import RidgeClassifier
    clf = RidgeClassifier()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Validation of the model
    from sklearn.metrics import classification_report, confusion_matrix
    r = clf.score(X_test, y_test)
    print(classification_report(y_test, y_pred))
    return r

ridge()
                
              precision    recall  f1-score   support
                
                           0       0.50      0.49      0.50      6418
                           1       0.50      0.52      0.51      6420
                
                    accuracy                           0.50     12838
                   macro avg       0.50      0.50      0.50     12838
                weighted avg       0.50      0.50      0.50     12838
                
                
Out[185]:
0.5037389001402087

We achieved an accuracy of 50.3% using the Ridge Classifier


Gradient Boosting Classifier

In [186]:
def gb():
    from sklearn.ensemble import GradientBoostingClassifier
    clf = GradientBoostingClassifier()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Validation of the model
    from sklearn.metrics import classification_report, confusion_matrix
    r = clf.score(X_test, y_test)
    print(classification_report(y_test, y_pred))
    return r

gb()
                
              precision    recall  f1-score   support
                
                           0       0.84      0.49      0.62      6418
                           1       0.64      0.91      0.75      6420
                
                    accuracy                           0.70     12838
                   macro avg       0.74      0.70      0.68     12838
                weighted avg       0.74      0.70      0.68     12838
                
                
Out[186]:
0.6983174949369061

We achieved an accuracy of 69.8% using the Gradient Boosting Classifier


Extreme Gradient Boosting Classifier

In [188]:
def xgb():
    from xgboost import XGBClassifier
    clf = XGBClassifier()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Validation of the model
    from sklearn.metrics import classification_report, confusion_matrix
    r = clf.score(X_test, y_test)
    print(classification_report(y_test, y_pred))
    return r

xgb()
                
[18:18:05] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
                              precision    recall  f1-score   support
                
                           0       0.90      0.58      0.70      6418
                           1       0.69      0.94      0.79      6420
                
                    accuracy                           0.76     12838
                   macro avg       0.80      0.76      0.75     12838
                weighted avg       0.80      0.76      0.75     12838
                
                
Out[188]:
0.7568157033805889

We achieved an accuracy of 75.8% using Extreme Gradient Boosting

In [195]:
models = {
    'Algorithm': ['Decision Tree', 'Random Forest', 'Ridge', 'Gradient Boosting', 'Extreme Gradient Boosting'],
    'Accuracy': [66.6, 75.6, 50.3, 69.8, 75.8],
}
score = pd.DataFrame(models)
score = score.sort_values("Accuracy", ascending=False)
score
                
Out[195]:
Algorithm Accuracy
4 Extreme Gradient Boosting 75.8
1 Random Forest 75.6
3 Gradient Boosting 69.8
0 Decision Tree 66.6
2 Ridge 50.3
In [202]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 3))
sns.barplot(x='Algorithm', y='Accuracy', data=score)
                
Out[202]:
<AxesSubplot:xlabel='Algorithm', ylabel='Accuracy'>
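The first cell imported GridSearchCV and StratifiedKFold, which we never used. As a closing sketch, here is one way the ensemble models could be tuned further; the parameter grid below is an illustrative assumption, not a result from this notebook.

# Hyperparameter tuning sketch for the Random Forest using the unused imports
param_grid = {
    "n_estimators": [200, 500, 1000],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                    scoring="accuracy", cv=cv, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))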

Analytics Educator, based in Kolkata, is the best institute for Data Science courses. We specialize in training students in data science even when they come from non-technical backgrounds with zero programming or statistical knowledge, and we help them learn data science and get a job in this field.

We provide a 100% money-back guarantee on learning: every student of Analytics Educator will be able to understand every line of code and every algorithm, or we will refund the fee.

Readers of this blog are welcome to mail us their opinions on how to further improve the model; our contact details are available here.

If you want to read more such case studies, click on Whom should you ask for donations for a charity or Identify if a patient has cancer.

Regression problems can be found at House Price Prediction and Insurance Premium Prediction.

Copyright © 2017 Analytics Educator