
Curtailing losses at the Calcutta Medical Clinic¶

The manager of the Calcutta Medical Clinic, Dr. Joyita Sanyal, is troubled by the clinic's losses. She was recently promoted, and she knows the clinic has excellent staff and has been running efficiently. To reassure herself about the financial side, she hired a third-party firm to audit the finance department; the auditors, however, found nothing that explained the problem. She needs to gather evidence before the upcoming board meeting to explain this oddity.¶

To investigate further, Dr. Joyita Sanyal obtains the clinic's historical transaction data.¶

Get the dataset¶

Dr. Joyita Sanyal asks Analytics Educator to help her dig into the data and extract insights from it.¶

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

os.chdir("C:\\Users\\ASUS\\Desktop")
data = pd.read_csv("clinic_data.csv")
data.head(3)
Out[15]:
Age Gender AppointmentRegistration ApointmentData DayOfTheWeek Status Diabetes Alcoolism HiperTension Handcap Smokes Scholarship Tuberculosis Sms_Reminder AwaitingTime
0 19 M 2014-12-16T14:46:25Z 2015-01-14T00:00:00Z Wednesday Show-Up 0 0 0 0 0 0 0 0 -29
1 24 F 2015-08-18T07:01:26Z 2015-08-19T00:00:00Z Wednesday Show-Up 0 0 0 0 0 0 0 0 -1
2 4 F 2014-02-17T12:53:46Z 2014-02-18T00:00:00Z Tuesday Show-Up 0 0 0 0 0 0 0 0 -1

Data Dictionary for the clinic data¶

Age¶

Age of patient

Gender¶

Gender of patient

AppointmentRegistration¶

Date on which appointment was issued to the patient

ApointmentData¶

Date for which appointment was issued to the patient

DayOfTheWeek¶

Day of the week for which appointment was issued

Status¶

Whether the patient showed up for the appointment or not (dependent variable)

Diabetes¶

Whether the patient has diabetes or not

Alcoolism¶

Whether the patient is affected by Alcoolism or not

HiperTension¶

Whether the patient has HiperTension or not

Handicap¶

Whether the patient is handicapped or not

Smokes¶

Whether the patient smokes or not

Tuberculosis¶

Whether the patient has tuberculosis or not

Scholarship¶

Whether the patient has been granted a scholarship from a social welfare organization. Poor families may benefit by receiving financial aid.

Sms_Reminder¶

Whether SMS reminder for appointment has been issued to the patient or not

AwaitingTime¶

AwaitingTime = AppointmentRegistration – ApointmentData
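
As a rough sanity check on that definition, the column can be reproduced from the two date fields (a minimal sketch, assuming the timestamp format shown in the preview above):

In [ ]:
# difference, in whole days, between the registration timestamp and the appointment date
reg = pd.to_datetime(data["AppointmentRegistration"])
appt = pd.to_datetime(data["ApointmentData"])
(reg - appt).dt.days.head(3)   # matches the AwaitingTime values -29, -1, -1 shown above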

Dr. Joyita was hoping to do the following with the information from the data dump:¶

Discover why losses are mounting even though the number of appointments is going up.

If patients are not turning up for their scheduled appointments, build a method to predict whether a patient will show up based on his/her characteristics. She believed that knowing which patients were likely not to show up would enable the clinic to take countermeasures such as the following:

Send regular appointment reminders and confirmations.

Align the head count of doctors and hospital staff with the expected demand.


She observes the following about the variables:¶

Integer:¶

Age, AwaitingTime

String:¶

Gender, DayOfTheWeek, Status

Datetime:¶

AppointmentRegistration, ApointmentData

Boolean:¶

Diabetes, Alcoolism, HiperTension, Handicap, Smokes, Scholarship, Tuberculosis, Sms_Reminder
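
These groupings can be checked directly from pandas (a minimal sketch; the Boolean-style flags are stored as 0/1 integers, and the two date columns come in as plain strings until they are parsed):

In [ ]:
# dtype inferred by pandas for each column
data.dtypes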

Descriptive Analytics¶

In [35]:
data.describe()
              
Out[35]:
Age Diabetes Alcoolism HiperTension Handcap Smokes Scholarship Tuberculosis Sms_Reminder AwaitingTime
count 300000.000000 300000.000000 300000.000000 300000.000000 300000.000000 300000.000000 300000.000000 300000.000000 300000.000000 300000.000000
mean 37.808017 0.077967 0.025010 0.215890 0.020523 0.052370 0.096897 0.000450 0.574173 -13.841813
std 22.809014 0.268120 0.156156 0.411439 0.155934 0.222772 0.295818 0.021208 0.499826 15.687697
min -2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -398.000000
25% 19.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -20.000000
50% 38.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 -8.000000
75% 56.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 -4.000000
max 113.000000 1.000000 1.000000 1.000000 4.000000 1.000000 1.000000 1.000000 2.000000 -1.000000

Show the number of unique values for each variable¶

In [12]:
n = data.nunique(axis=0)
n
Out[12]:
Age                           109
Gender                          2
AppointmentRegistration    295425
ApointmentData                534
DayOfTheWeek                    7
Status                          2
Diabetes                        2
Alcoolism                       2
HiperTension                    2
Handcap                         5
Smokes                          2
Scholarship                     2
Tuberculosis                    2
Sms_Reminder                    3
AwaitingTime                  213
dtype: int64

Create visualizations to gain insights¶

In [31]:
def features_plots(discrete_vars):

    plt.figure(figsize=(20, 30))

    # histograms for the continuous variables
    for i, cv in enumerate(['Age', 'AwaitingTime']):
        plt.subplot(7, 2, i + 1)
        plt.hist(data[cv], bins=len(data[cv].unique()))
        plt.title(cv)
        plt.ylabel('Frequency')

    # bar charts for the discrete variables
    for i, dv in enumerate(discrete_vars):
        plt.subplot(7, 2, i + 3)
        data[dv].value_counts().plot(kind='bar', title=dv)
        plt.ylabel('Frequency')

discrete_vars = ['Gender', 'DayOfTheWeek', 'Status', 'Diabetes', 'Alcoolism', 'HiperTension', 'Handcap', 'Smokes',
                 'Scholarship', 'Tuberculosis', 'Sms_Reminder']
features_plots(discrete_vars)

Looking at the plots, Dr. Joyita realized the following:¶

Age: Age ranged from -2 to 113. Ages between 0 and 113 made sense, but she was surprised that age could be negative; these looked like outliers.

Handicap: Instead of being Boolean, this feature had values ranging from 0 to 4.

Sms_Reminder: Instead of being Boolean, this feature had values ranging from 0 to 2. It appeared to her that Sms_Reminder represented the number of reminders sent to each patient.

AwaitingTime: Dr. Joyita was puzzled to see negative values in AwaitingTime. By definition this feature represented the number of days between the date on which the appointment was issued and the date for which it was issued, so positive numbers would have made more sense.

Data Cleaning¶

The negative values within "Age" didn't make any intuitive sense to her, so she treated them as noise and decided to delete those rows.¶

In [16]:
# keep only rows with a non-negative age
data = data.loc[data['Age'] >= 0]

Handicap was supposed to be a binary variable. However, Dr. Joyita noticed that the value 0 made up more than 98% of the observations. She concluded that this variable didn't have enough variance to have any impact, so it would be dropped from the data.¶

In [39]:
data.Handcap.value_counts(normalize=True)
              
Out[39]:
0    0.981343
1    0.016994
2    0.001497
3    0.000130
4    0.000037
Name: Handcap, dtype: float64
In [4]:
data = data.drop("Handcap",axis=1)
              

Dr. Joyita also recalled that some of the waiting-time values were negative, so it made sense to convert them to positive values.¶

In [17]:
data["AwaitingTime"] = abs(data["AwaitingTime"])
              
In [5]:
data.head(2)
              
Out[5]:
Age Gender AppointmentRegistration ApointmentData DayOfTheWeek Status Diabetes Alcoolism HiperTension Smokes Scholarship Tuberculosis Sms_Reminder AwaitingTime
0 19 M 2014-12-16T14:46:25Z 2015-01-14T00:00:00Z Wednesday Show-Up 0 0 0 0 0 0 0 29
1 24 F 2015-08-18T07:01:26Z 2015-08-19T00:00:00Z Wednesday Show-Up 0 0 0 0 0 0 0 1

Dr. Joyita recalls reading that machine learning works best with numbers rather than strings. Hence, she decides to convert the string variables Gender, DayOfTheWeek, and Status into numbers.¶

In [18]:
from sklearn import preprocessing

# label-encode the binary string columns
le = preprocessing.LabelEncoder()
data["Status"] = le.fit_transform(data["Status"])
data["Gender"] = le.fit_transform(data["Gender"])

Analytics Educator had advised Dr. Joyita not to use the same technique on DayOfTheWeek, because LabelEncoder assigns numbers to the strings in alphabetical order. For example, Friday would have been coded as 0, Monday as 1, and Saturday as 2.¶
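
A quick illustration of that pitfall (a minimal sketch on a hand-made list of weekday names, separate from the clinic data):

In [ ]:
from sklearn import preprocessing

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
le_demo = preprocessing.LabelEncoder()
le_demo.fit(days)

# classes_ are sorted alphabetically, so the integer codes ignore the weekday order:
# ['Friday' 'Monday' 'Saturday' 'Sunday' 'Thursday' 'Tuesday' 'Wednesday']
print(le_demo.classes_)
print(le_demo.transform(['Friday', 'Monday', 'Saturday']))   # [0 1 2]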

Hence it was decided to use a mapping to convert the days of the week into numbers in their natural order¶

In [19]:
dow_mapping = {'Monday': 0, 'Tuesday': 1, 'Wednesday': 2, 'Thursday': 3, 'Friday': 4, 'Saturday': 5, 'Sunday': 6}
data['DayOfTheWeek'] = data['DayOfTheWeek'].map(dow_mapping)

Now Dr. Joyita checks the visualizations once again.¶

In [12]:
discrete_vars = ['Gender', 'DayOfTheWeek', 'Status', 'Diabetes',
                 'Alcoolism', 'HiperTension', 'Smokes',
                 'Scholarship', 'Tuberculosis', 'Sms_Reminder']

features_plots(discrete_vars)

Dr. Joyita noticed that AwaitingTime decayed in a roughly exponential fashion. She observed a large spike of patients with an age of 0 (i.e., infants whose age is recorded in months), and she also pointed out the spikes at ages 19, 38, and 57. Another notable fact was that only about one-third of patients were male, and that roughly the same proportion of patients didn't show up at the date and time of their appointments. This gave her a clue as to why the clinic was seeing losses despite an increase in the number of appointments. She also noticed that although the majority of patients were sent at least one SMS reminder, a sizeable share of appointments (over 40%) had no reminder sent at all. The absence of appointment reminders, she believed, might be part of the reason patients were not showing up.

Once she understood the features within the dataset and had removed the ambiguities through data cleaning, Dr. Joyita was interested in identifying relationships between different features. She wanted to perform this multivariate analysis to gain an intuitive understanding of the types of patients who don't show up on their appointment dates and times.

Exploratory Data Analysis¶

Dr. Joyita had a preconceived notion, like many people, that patients need to see a doctor more often as they grow older. Hence, with the help of Analytics Educator, she created a scatter plot of Age against AwaitingTime.

She is disappointed to see that her hypothesis was wrong; no such correlation between the variables exists.¶

In [22]:
plt.figure(figsize=(15, 5))
sns.scatterplot(data=data, x="Age", y="AwaitingTime", hue="Status")
plt.xlim(0, 120)
plt.ylim(0, 120)
plt.show()

Now she is interested to see whether an SMS reminder increases the chance that people show up.¶

In [20]:
data_Analytics_Educator = data.groupby(['Sms_Reminder', 'Status'])['Sms_Reminder'].count().unstack('Status').fillna(0)
data_Analytics_Educator
Out[20]:
Status             0        1
Sms_Reminder
0              38915    89631
1              51546   119103
2                268      531
In [22]:
data_Analytics_Educator[[0, 1]].plot(kind='bar', stacked=True)
plt.title('Frequency of people showing up and not showing up by number of SMS reminders sent')
plt.xlabel('Number of SMS reminders')
plt.ylabel('Frequency')
plt.show()

Dr. Joyita noticed that the number of people who showed up after 1 SMS reminder is noticeably higher than the number who showed up with 0 SMS reminders: about ((119103 - 89631) / 89631) * 100 ≈ 32.9% more.

So it seems that an SMS reminder does increase, though marginally, the likelihood of a patient turning up¶
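
The same comparison can be expressed as a show-up rate rather than raw counts (a minimal sketch reusing the grouped table above; column 1 corresponds to Show-Up after label encoding, and the show_up_rate column name is ours):

In [ ]:
# share of appointments per reminder count that ended in a show-up
rates = data_Analytics_Educator.copy()
rates["show_up_rate"] = rates[1] / (rates[0] + rates[1])
rates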

Now she is interested to see whether different days of the week affect the chance that people show up.¶

In [27]:
data_AE = data.groupby(['DayOfTheWeek', 'Status'])['DayOfTheWeek'].count().unstack('Status').fillna(0)
data_AE.plot(kind='bar', stacked=True)
plt.title('Frequency of people showing up and not showing up by Day of the week')
plt.xlabel('Day of the week')
plt.ylabel('Frequency')
plt.show()

She noticed that hardly any patients come on Saturday (day 5), and on Sunday the clinic is closed. Otherwise, the number of no-shows is fairly consistent throughout the week, while the distribution of show-ups looks roughly like a normal distribution.

Main reason for losing money¶

Now Dr. Joyita has many insights about the data and understands that the clinic is losing money mainly because of patient no-shows. The clinic has to pay for the doctors' time, but there are not enough patients for the doctors; even if patients eventually turn up at a later time, at the moment of the missed appointment the doctors' capacity goes underutilized.

Hence, Dr. Joyita decides to build a Machine Learning model with the help of Analytics Educator¶

Her expectation from the Machine Learning algorithm is that, by learning from the data, it can predict whether a patient will show up or not. This is a typical binary classification problem.

In machine learning, the greater the number of observations and features in the dataset, the greater the likelihood that the model captures its underlying variability. Dr. Joyita had no way to increase the number of observations, since she didn't have any more data. So she decided to extract the year and month from the appointment date as new features and then drop the original variable.

In [21]:
from datetime import date, time, datetime

# extract the date part of the appointment timestamp and parse it
data["app_date"] = data["ApointmentData"].str[:10]
data["app_date"] = pd.to_datetime(data["app_date"], format="%Y-%m-%d")

# derive year and month as new features
data["app_year"] = data["app_date"].dt.year
data["app_month"] = data["app_date"].dt.month
In [22]:
# dropping the variables app_date and ApointmentData
data = data.drop(["app_date", "ApointmentData"], axis=1)
In [25]:
# She decided to drop AppointmentRegistration as well, since it will be of no other use
data = data.drop(["AppointmentRegistration"], axis=1)
In [26]:
data.head()
              
Out[26]:
Age Gender DayOfTheWeek Status Diabetes Alcoolism HiperTension Handcap Smokes Scholarship Tuberculosis Sms_Reminder AwaitingTime app_year app_month
0 19 1 2 1 0 0 0 0 0 0 0 0 29 2015 1
1 24 0 2 1 0 0 0 0 0 0 0 0 1 2015 8
2 4 0 1 1 0 0 0 0 0 0 0 0 1 2014 2
3 5 1 3 1 0 0 0 0 0 0 0 1 15 2014 8
4 38 1 1 1 0 0 0 0 0 0 0 1 6 2015 10

app_year and app_month are categorical variables: one value does not carry more weight than another. E.g., an app_month value of 2 doesn't mean it is double an app_month value of 1. Hence, they need to be converted into dummy variables¶

In [27]:
data = pd.get_dummies(data=data,columns=['app_year', 'app_month'],drop_first=True)
              

Now Dr. Joyita also checks if there are any missing values in the data¶

In [31]:
# There are no missing values
data.isnull().sum()
Out[31]:
Age              0
Gender           0
DayOfTheWeek     0
Status           0
Diabetes         0
Alcoolism        0
HiperTension     0
Handcap          0
Smokes           0
Scholarship      0
Tuberculosis     0
Sms_Reminder     0
AwaitingTime     0
app_year_2015    0
app_month_2      0
app_month_3      0
app_month_4      0
app_month_5      0
app_month_6      0
app_month_7      0
app_month_8      0
app_month_9      0
app_month_10     0
app_month_11     0
app_month_12     0
dtype: int64

Classification¶

Classification helps us decide which of a given set of classes a new observation falls into. Classification is a form of supervised learning, where the model can only be trained once labeled data is provided as input. These labels are usually categorical variables, which can be nominal as well as Boolean in nature. For example, suppose a man applies for a loan at a bank; the bank is keen to predict whether the person will pay back his loan on time or default. The following criteria can be used to evaluate a classification model:

Accuracy: Classifier and predictor accuracy

Speed: Time to train and predict from the model

Robustness: Handling missing values and noise

Scalability: Efficiency with large, disk-resident databases

Interpretability: Predictions made by the model make intuitive sense

Model Evaluation Techniques¶

Python's scikit-learn provides several score, loss, and utility functions for measuring classification performance. Depending on the metric, these require probability estimates of the positive class, confidence values, or binary decisions, and most allow a weighted contribution of each sample to the overall score through the sample_weight parameter. These metrics can be divided in several ways.

Confusion Matrix¶

The confusion matrix counts the true negatives, false positives, false negatives, and true positives.

True negatives is the frequency of instances in which the model correctly predicted 0 as 0.

False negatives is the frequency of instances in which the model predicted 1 as 0.

True positives is the frequency of instances in which the model correctly predicted 1 as 1.

False positives is the frequency of instances in which the model predicted 0 as 1.
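
A small illustration of these four counts on made-up labels (a minimal sketch; the toy vectors are ours, not the clinic data):

In [ ]:
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # actual labels
y_hat  = [0, 1, 1, 0, 1, 0, 1, 1]   # predicted labels

# rows are actual classes, columns are predicted classes:
# [[TN FP]     -> [[2 2]
#  [FN TP]]        [1 3]]
print(confusion_matrix(y_true, y_hat))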

Dr. Joyita writes the following code to run the Machine Learning algorithms¶

Data split: Segregating the independent variables as X and the dependent variable as y¶

In [32]:
# We remove the label values from our training data
X = data.drop(['Status'], axis=1)

# We assign those label values to our y dataset
y = data['Status']

Now we will split the data into a training set (70% of the data); the remaining 30%, called the test set, will be kept aside for later use.¶

In [33]:
# Split it to a 70:30 Ratio Train:Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Decision Tree Classification¶

Decision trees form a tree in a hierarchical fashion, with each node applying a decision rule to proceed downward. The tree stops branching at the level where no further splits are possible. Interior nodes represent input variables, with edges to each of their children; the children split the values of that input variable. Partitioning the data at each level, with nodes branching out into children, is known as recursive partitioning. Decision trees are easy to interpret and time efficient, so they work well with large datasets. They can also handle both numerical and categorical data: regression in the case of a numerical target and classification in the case of a categorical target. However, the accuracy of a single decision tree is often not as good as that of other machine learning classification algorithms.

Moreover, decision trees fit very closely to the training dataset and thus are highly susceptible to overfitting. A decision tree aims to partition the data so that each partition contains similar/homogeneous values. Entropy is used to measure the homogeneity of a sample: a completely homogeneous sample has an entropy of 0, while a sample split equally between the classes has an entropy of 1. A decision tree is a non-parametric supervised learning method, and by non-parametric we mean that it can be applied to data regardless of its underlying distribution.
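
The entropy values mentioned above can be verified with a couple of lines (a minimal sketch; the helper function is ours):

In [ ]:
import numpy as np

def binary_entropy(p):
    # entropy (in bits) of a node where a fraction p of the samples belong to class 1
    if p in (0, 1):
        return 0.0                     # completely homogeneous node
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

print(binary_entropy(1.0))             # 0.0 -> pure node
print(binary_entropy(0.5))             # 1.0 -> node split evenly between the classes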

In [34]:
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
Out[34]:
DecisionTreeClassifier()

Analytics Educator pointed out to Dr. Joyita that because no configuration parameters were passed to the decision tree classifier, it used the default values. The next step was to apply the trained model to the test dataset to obtain the predicted labels for Status, and then to compare the predicted labels to the original labels to calculate the accuracy of the model.

Predict the results¶

In [36]:
# predict the test data
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.33      0.36      0.34     27214
           1       0.71      0.68      0.69     62785

    accuracy                           0.58     89999
   macro avg       0.52      0.52      0.52     89999
weighted avg       0.59      0.58      0.59     89999

Interpretation¶

Analytics Educator explained to Dr. Joyita that here 0 means the people who didn't turn up for the appointment (no-shows). The classification report above shows that the precision for class 0 is 33%, meaning that out of all the 0s predicted by the model, only 33% were correct. The recall, on the other hand, states that out of all the actual 0s in the data, only 36% were identified correctly.
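
Those two numbers can also be computed directly for the no-show class (a minimal sketch, assuming the y_test and y_pred arrays from the cell above):

In [ ]:
from sklearn.metrics import precision_score, recall_score

# treat class 0 (no-show) as the class of interest
print(precision_score(y_test, y_pred, pos_label=0))   # ~0.33
print(recall_score(y_test, y_pred, pos_label=0))      # ~0.36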

Dr. Joyita is not very happy with this result, hence she decides to try another method, an ensemble technique.¶

Ensemble Methods¶

Bagging¶

Bagging, short for bootstrap aggregating, aims to minimize variance. It does so by generating additional training sets by resampling the original data with replacement, producing multiple sets of the same size as the original. Bagging is ideal when a model overfits and tends toward high variance: fitting many resamples, each of which may overfit, and averaging them together cancels out some of that variance. An ensemble method combines predictions from multiple machine learning models, which usually results in more accurate predictions than any individual model could produce. Ensemble methods are usually divided into two variants, Bagging and Boosting. Decision trees are sensitive to the specific data they are trained on: if the training data changes, the resulting tree can be quite different and can yield different predictions. Since a decision tree is a high-variance algorithm, it is a natural candidate for Bagging via the bootstrap procedure. Consider a dataset with 50 features and 3,000 observations: a bagged (and feature-subsampled) ensemble might create 500 trees, each trained on a random sample of the observations and a random subset of, say, 20 features, and finally average the predictions of those 500 trees to get the final prediction.
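
This is roughly what plain bagging of decision trees would look like in scikit-learn on this problem (a minimal sketch with illustrative parameter values):

In [ ]:
from sklearn.ensemble import BaggingClassifier

# 100 bootstrap resamples of the training rows, one decision tree fit per resample;
# the predictions of the 100 trees are combined by voting
bag = BaggingClassifier(n_estimators=100, bootstrap=True)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))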

Boosting¶

Boosting defines an objective function to measure the performance of a model given a certain set of parameters. The objective function contains two parts, training loss and regularization, which are added together. The training loss measures how well the model predicts the training data; commonly used training losses include mean squared error and logistic loss. The regularization term controls the complexity of the model, which helps avoid overfitting. Boosted trees use tree ensembles because they sum together the predictions of multiple trees.

Random Forest Classification¶

Random forest classification is a Bagging-style method and one of the most powerful machine learning algorithms currently available. In decision tree classification, different subtrees can have a lot of structural similarity, which can result in prediction outputs that are strongly correlated with each other. The random forest classifier reduces this correlation among trees by limiting the features considered at each split point: instead of choosing from all available variables, it searches for the best split within a limited random sample of features. Random forest classifiers are fast and can work with data that is unbalanced or has missing values.

In [43]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
Out[43]:
RandomForestClassifier()

Predict the result¶

In [44]:
y_pred = rf.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.35      0.23      0.28     27214
           1       0.71      0.82      0.76     62785

    accuracy                           0.64     89999
   macro avg       0.53      0.52      0.52     89999
weighted avg       0.60      0.64      0.61     89999

Here the precision for the no-show class has improved, but its recall has gone down. We may try one more algorithm to improve the accuracy¶

Gradient Boosting¶

In Boosting, samples are selected by giving progressively more weight to hard-to-classify observations. Gradient boosting produces a prediction model in the form of an ensemble of weak predictive models, usually decision trees. It generalizes boosting by optimizing an arbitrary differentiable loss function: at each stage, a regression tree is fit on the negative gradient of the loss (e.g., the binomial or multinomial deviance for classification).

In simple terminology, the gradient boosting classifier does the following:

Gradient boosting builds an ensemble of trees one by one.

Predictions of all individual trees are summed.

Discrepancy between target function and current ensemble prediction (i.e., residual) is reconstructed.

The next tree in the ensemble should complement existing trees and minimize the residual of the ensemble.
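
The four steps above can be mimicked by hand on a toy regression problem (a minimal sketch, separate from the clinic data; the learning rate and tree depth are illustrative):

In [ ]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(0, 0.1, 200)

prediction = np.zeros_like(y_toy)        # current ensemble prediction
learning_rate = 0.1
for _ in range(50):
    residual = y_toy - prediction                                     # discrepancy between target and current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residual)    # next tree complements the existing trees
    prediction += learning_rate * tree.predict(X_toy)                 # predictions of the individual trees are summed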

In [58]:
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Predict the result¶

In [59]:
# note: the arguments here are (y_pred, y_test), the reverse of the (y_test, y_pred)
# order used in the earlier cells, so the precision and recall columns below are
# effectively swapped relative to those reports
print(classification_report(y_pred, y_test))

              precision    recall  f1-score   support

           0       0.01      0.52      0.03       780
           1       0.99      0.70      0.82     89219

    accuracy                           0.70     89999
   macro avg       0.50      0.61      0.43     89999
weighted avg       0.99      0.70      0.81     89999

Here it is observed that the reported recall has improved significantly, but the reported precision for the no-show class has dropped to almost 0. Dr. Joyita thought of applying a Deep Neural Network to check whether it improves the results further, but since she has a board meeting scheduled in the next half hour, she decides to stop with these findings. Readers of this blog are welcome to mail their suggestions on how to further improve the model; you will find our contact details here.

If you want to read more such case studies then click on Whom should you ask for donations for a charity or Identify if a patient has cancer

Regression problems can be found at House Price Prediction and Insurance Premium Prediction

Copyright © 2017 Analytics Educator