{ "cells": [ { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "# Logistic Regression\n", "---" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "Lets first import required libraries:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn import preprocessing\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import classification_report, confusion_matrix, jaccard_score, log_loss\n", "import itertools\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Customer churn with Logistic Regression\n", "\n", "A telecommunications company is concerned about the number of customers leaving their land-line business for cable competitors. They need to understand who is leaving. Imagine that you are an analyst at this company and you have to find out who is leaving and why." ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "

About the dataset

\n", "\n", "We will use a telecommunications dataset for predicting customer churn. This is a historical customer dataset where each row represents one customer. The data is relatively easy to understand, and you may uncover insights you can use immediately. Typically it is less expensive to keep customers than acquire new ones, so the focus of this analysis is to predict the customers who will stay with the company. \n", "\n", "This data set provides information to help you predict what behavior will help you to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.\n", "\n", "The dataset includes information about:\n", "\n", "- Customers who left within the last month – the column is called Churn\n", "- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies\n", "- Customer account information – how long they had been a customer, contract, payment method, paperless billing, monthly charges, and total charges\n", "- Demographic info about customers – gender, age range, and if they have partners and dependents\n" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "Telco Churn is a hypothetical data file that concerns a telecommunications company's efforts to reduce turnover in its customer base. Each case corresponds to a separate customer and it records various demographic and service usage information. Before you can work with the data, you must use the URL to get the ChurnData.csv.\n" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "### Load Data From CSV File " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tenureageaddressincomeedemployequipcallcardwirelesslongmon...pagerinternetcallwaitconferebillloglonglogtolllninccustcatchurn
011.033.07.0136.05.05.00.01.01.04.40...1.00.01.01.00.01.4823.0334.9134.01.0
133.033.012.033.02.00.00.00.00.09.45...0.00.00.00.00.02.2463.2403.4971.01.0
223.030.09.030.01.02.00.00.00.06.30...0.00.00.01.00.01.8413.2403.4013.00.0
338.035.05.076.02.010.01.01.01.06.05...1.01.01.01.01.01.8003.8074.3314.00.0
47.035.014.080.02.015.00.01.00.07.10...0.00.01.01.00.01.9603.0914.3823.00.0
\n", "

5 rows × 28 columns

\n", "
" ], "text/plain": [ " tenure age address income ed employ equip callcard wireless \\\n", "0 11.0 33.0 7.0 136.0 5.0 5.0 0.0 1.0 1.0 \n", "1 33.0 33.0 12.0 33.0 2.0 0.0 0.0 0.0 0.0 \n", "2 23.0 30.0 9.0 30.0 1.0 2.0 0.0 0.0 0.0 \n", "3 38.0 35.0 5.0 76.0 2.0 10.0 1.0 1.0 1.0 \n", "4 7.0 35.0 14.0 80.0 2.0 15.0 0.0 1.0 0.0 \n", "\n", " longmon ... pager internet callwait confer ebill loglong logtoll \\\n", "0 4.40 ... 1.0 0.0 1.0 1.0 0.0 1.482 3.033 \n", "1 9.45 ... 0.0 0.0 0.0 0.0 0.0 2.246 3.240 \n", "2 6.30 ... 0.0 0.0 0.0 1.0 0.0 1.841 3.240 \n", "3 6.05 ... 1.0 1.0 1.0 1.0 1.0 1.800 3.807 \n", "4 7.10 ... 0.0 0.0 1.0 1.0 0.0 1.960 3.091 \n", "\n", " lninc custcat churn \n", "0 4.913 4.0 1.0 \n", "1 3.497 1.0 1.0 \n", "2 3.401 3.0 0.0 \n", "3 4.331 4.0 0.0 \n", "4 4.382 3.0 0.0 \n", "\n", "[5 rows x 28 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "churn_df = pd.read_csv(\"ChurnData.csv\")\n", "churn_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Data pre-processing and selection

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets select some features for the modeling. Also we change the target data type to be integer, as it is a requirement by the skitlearn algorithm:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tenureageaddressincomeedemployequipcallcardwirelesschurn
011.033.07.0136.05.05.00.01.01.01
133.033.012.033.02.00.00.00.00.01
223.030.09.030.01.02.00.00.00.00
338.035.05.076.02.010.01.01.01.00
47.035.014.080.02.015.00.01.00.00
\n", "
" ], "text/plain": [ " tenure age address income ed employ equip callcard wireless \\\n", "0 11.0 33.0 7.0 136.0 5.0 5.0 0.0 1.0 1.0 \n", "1 33.0 33.0 12.0 33.0 2.0 0.0 0.0 0.0 0.0 \n", "2 23.0 30.0 9.0 30.0 1.0 2.0 0.0 0.0 0.0 \n", "3 38.0 35.0 5.0 76.0 2.0 10.0 1.0 1.0 1.0 \n", "4 7.0 35.0 14.0 80.0 2.0 15.0 0.0 1.0 0.0 \n", "\n", " churn \n", "0 1 \n", "1 1 \n", "2 0 \n", "3 0 \n", "4 0 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "churn_df = churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip', 'callcard', 'wireless','churn']]\n", "churn_df['churn'] = churn_df['churn'].astype('int')\n", "churn_df.head()" ] }, { "cell_type": "markdown", "metadata": { "button": true, "new_sheet": true, "run_control": { "read_only": false } }, "source": [ "How many rows and columns are in this dataset in total? What are the name of columns?" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(200, 10)\n", "Empty DataFrame\n", "Columns: [tenure, age, address, income, ed, employ, equip, callcard, wireless, churn]\n", "Index: []\n" ] } ], "source": [ "print(churn_df.shape)\n", "print(churn_df.head(0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets define X, and y for our dataset:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 11., 33., 7., 136., 5., 5., 0.],\n", " [ 33., 33., 12., 33., 2., 0., 0.],\n", " [ 23., 30., 9., 30., 1., 2., 0.],\n", " [ 38., 35., 5., 76., 2., 10., 1.],\n", " [ 7., 35., 14., 80., 2., 15., 0.]])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = np.asarray(churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']])\n", "X[0:5]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 1, 0, 0, 0])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = np.asarray(churn_df['churn'])\n", "y [0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, we normalize the dataset:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[-1.13518441, -0.62595491, -0.4588971 , 0.4751423 , 1.6961288 ,\n", " -0.58477841, -0.85972695],\n", " [-0.11604313, -0.62595491, 0.03454064, -0.32886061, -0.6433592 ,\n", " -1.14437497, -0.85972695],\n", " [-0.57928917, -0.85594447, -0.261522 , -0.35227817, -1.42318853,\n", " -0.92053635, -0.85972695],\n", " [ 0.11557989, -0.47262854, -0.65627219, 0.00679109, -0.6433592 ,\n", " -0.02518185, 1.16316 ],\n", " [-1.32048283, -0.47262854, 0.23191574, 0.03801451, -0.6433592 ,\n", " 0.53441472, -0.85972695]])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "X = preprocessing.StandardScaler().fit(X).transform(X)\n", "X[0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train/Test dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, we split our dataset into train and test set:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train set: (160, 7) (160,)\n", "Test set: (40, 7) (40,)\n" ] } ], "source": [ "\n", "X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)\n", "print ('Train set:', X_train.shape, y_train.shape)\n", "print ('Test set:', X_test.shape, y_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Modeling (Logistic Regression with Scikit-learn)

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets build our model using __LogisticRegression__ from Scikit-learn package. This function implements logistic regression and can use different numerical optimizers to find parameters, including ‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’ solvers. You can find extensive information about the pros and cons of these optimizers if you search it in internet.\n", "\n", "The version of Logistic Regression in Scikit-learn, support regularization. Regularization is a technique used to solve the overfitting problem in machine learning models.\n", "__C__ parameter indicates __inverse of regularization strength__ which must be a positive float. Smaller values specify stronger regularization. \n", "Now lets fit our model with train set:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(C=0.01, solver='liblinear')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)\n", "LR" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can predict using our test set:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1\n", " 0 0 0]\n", "[0 0 1 0 1 1 1 0 1 1 0 0 0 1 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1\n", " 1 0 0]\n" ] } ], "source": [ "yhat = LR.predict(X_test)\n", "print(yhat)\n", "print(y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__predict_proba__ returns estimates for all classes, ordered by the label of classes. So, the first column is the probability of class 1, P(Y=1|X), and second column is probability of class 0, P(Y=0|X):" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "yhat_prob = LR.predict_proba(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Evaluation

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### jaccard index\n", "Lets try jaccard index for accuracy evaluation. we can define jaccard as the size of the intersection divided by the size of the union of two label sets. If the entire set of predicted labels for a sample strictly match with the true set of labels, then the subset accuracy is 1.0; otherwise it is 0.0." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.375" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jaccard_score(y_test, yhat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### confusion matrix\n", "Another way of looking at accuracy of classifier is to look at __confusion matrix__." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 6 9]\n", " [ 1 24]]\n" ] } ], "source": [ "def plot_confusion_matrix(cm, classes,\n", " normalize=False,\n", " title='Confusion matrix',\n", " cmap=plt.cm.Blues):\n", " \"\"\"\n", " This function prints and plots the confusion matrix.\n", " Normalization can be applied by setting `normalize=True`.\n", " \"\"\"\n", " if normalize:\n", " cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", " print(\"Normalized confusion matrix\")\n", " else:\n", " print('Confusion matrix, without normalization')\n", " print(cm)\n", " plt.imshow(cm, interpolation='nearest', cmap=cmap)\n", " plt.title(title)\n", " plt.colorbar()\n", " tick_marks = np.arange(len(classes))\n", " plt.xticks(tick_marks, classes, rotation=45)\n", " plt.yticks(tick_marks, classes)\n", " fmt = '.2f' if normalize else 'd'\n", " thresh = cm.max() / 2.\n", " for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n", " plt.text(j, i, format(cm[i, j], fmt),\n", " horizontalalignment=\"center\",\n", " color=\"white\" if cm[i, j] > thresh else \"black\")\n", " plt.tight_layout()\n", " plt.ylabel('True label')\n", " plt.xlabel('Predicted label')\n", "print(confusion_matrix(y_test, yhat, labels=[1,0]))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confusion matrix, without normalization\n", "[[ 6 9]\n", " [ 1 24]]\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAU4AAAEmCAYAAAAN9HleAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAeoElEQVR4nO3de7wd49338c93J0ScG5FIQoQWaWjFoUVKhPRpKa1qqaJoy42qarVOrT5CtfetrVYPlEa5KUKoQxFFHw2aOiYRhziXOCQhJyEIkvg9f8ysWra911qTrL3WzF7ft9e8suawrvntTPbPNdd1zTWKCMzMrHZtzQ7AzKxonDjNzDJy4jQzy8iJ08wsIydOM7OMnDjNzDJy4rS6kdRb0g2SXpV01QqUc6CkW+sZW7NI2knSE82Ow+pLHsfZeiQdAHwfGAosAqYBP4uISStY7kHAd4AREbF0RePMO0kBbBIRTzc7Fmss1zhbjKTvA78B/hvoDwwG/gDsVYfiNwSebIWkWQtJPZsdg3WRiPDSIguwFvA6sG+FY3qRJNZZ6fIboFe6bxTwIvADYA4wG/hGuu804B1gSXqOQ4FTgUvLyh4CBNAzXf868AxJrfdZ4MCy7ZPKvjcCuB94Nf1zRNm+24HTgX+l5dwK9O3kZyvFf0JZ/F8EPgc8CSwAflR2/CeBu4GF6bFnAyun++5Mf5Y30p93v7LyTwReAi4pbUu/8+H0HFun6wOBecCoZv/b8JJtcY2ztewArAJcW+GYk4HtgeHAliTJ48dl+9cjScCDSJLjOZI+FBFjSGqx4yNi9Yi4oFIgklYDfgfsHhFrkCTHaR0c1weYkB67DvBrYIKkdcoOOwD4BtAPWBk4rsKp1yP5OxgEnAKcD3wN2AbYCThF0sbpscuAY4G+JH93o4GjACJiZHrMlunPO76s/D4kte/Dy08cEf8mSaqXSVoV+F/gooi4vUK8lkNOnK1lHWBeVL6VPhD4SUTMiYi5JDXJg8r2L0n3L4mIm0hqW5stZzzvAltI6h0RsyNiegfH7AE8FRGXRMTSiLgceBz4fNkx/xsRT0bEYuBKkqTfmSUk7blLgCtIkuJvI2JRev7pwMcBImJKRNyTnncG8Edg5xp+pjER8XYaz/tExPnAU8C9wACS/1FZwThxtpb5QN8qbW8DgefK1p9Lt/2njHaJ901g9ayBRMQbJLe3RwKzJU2QNLSGeEoxDSpbfylDPPMjYln6uZTYXi7bv7j0fUmbSrpR0kuSXiOpUfetUDbA3Ih4q8ox5wNbAL+PiLerHGs55MTZWu4G3iJp1+vMLJLbzJLB6bbl8Qawatn6euU7I+KWiPg/JDWvx0kSSrV4SjHNXM6YsjiXJK5NImJN4EeAqnyn4jAVSauTtBtfAJyaNkVYwThxtpCIeJWkXe8cSV+UtKqklSTtLukX6WGXAz+WtK6kvunxly7nKacBIyUNlrQW8MPSDkn9JX0hbet8m+SWf1kHZdwEbCrpAEk9Je0HDANuXM6YslgDeA14Pa0Nf6vd/peBjT/wrcp+C0yJiMNI2m7PW+EoreGcOFtMRPyaZAznj4G5wAvA0cB16SE/BSYDDwEPA1PTbctzrr8D49OypvD+ZNdG0js/i6SneWfSjpd2ZcwH9kyPnU/SI75nRMxbnpgyOo6k42kRSW14fLv9pwIXS1oo6SvVCpO0F7AbSfMEJNdha0kH1i1iawgPgDczy8g1TjOzjJw4zcwycuI0M8vIidPMLCNPQlDF2n3WiYHrD252GNaBt5Z2NHrJmm3urBdZtHBBtfGumfRYc8OIpR94EOsDYvHcWyJit3qeuyNOnFUMXH8wl1x/R7PDsA48ueC1ZodgHTj5a5+re5mxdDG9Nqs64ou3pp1T7cmuunDiNLMCECg/LYtOnGaWfwLaejQ7iv9w4jSzYlBdm01XiBOnmRWAb9XNzLJzjdPMLAPJbZxmZpn5Vt3MLCPfqpuZZeHOITOzbDyO08wsK9c4zcyya3Mbp5lZ7YRrnGZm2Xgcp5lZdh6OZGaWkW/VzcwykFzjNDPLzG2cZmZZeBynmVl2vlU3M8vA4zjNzLLyOE4zs+xc4zQzy8htnGZmGci96mZmmanNidPMrGYC5Ft1M7MMlC454cRpZgUg1zjNzLJqcxunmVk2eapx5ieFm5l1RjUu1YqRNpA0UdJjkqZL+m66vY+kv0t6Kv3zQ5XKceI0s9xT2sZZbanBUuAHEfFRYHvg25KGAScBt0XEJsBt6XqnfKtuZoVQjzbOiJgNzE4/L5L0GDAI2AsYlR52MXA7cGJn5Thxmlkh1Fij7Ctpctn62IgY20l5Q4CtgHuB/mlSJSJmS+pX6SROnGaWf7WP45wXEdtWLU5aHbga+F5EvJa148ltnGZWCHVq40TSSiRJ87KIuCbd/LKkAen+AcCcSmU4cZpZ7gnR1tZWdalaTpJdLwAei4hfl+26Hjgk/XwI8NdK5fhW3cyKoT7DOD8FHAQ8LGlauu1HwBnAlZIOBZ4H9q1UiBOnmeWf6jMAPiIm0XkKHl1rOU6cZlYIeXpyyInTzHKv1MaZF06cZlYM+alwule9VSx6bSEnfOsgvjx6W/b59Cd4aOp9zQ7JgL+Nu4ATvjKa4/cdzd/G/anZ4eSX6jccqR5c42wRZ552EiN2/jS/OPcSlrzzDm+99WazQ2p5Lzz9OBOvG8fpF99Iz5VW4ozvHMTwHUczYPBGzQ4tl/LUxukaZwt4fdFrPHDfv9hrv4MBWGnllVljzbWbG5Qx89mn+cgWW9Ord2969OzJR7fejskTb252WLmlNlVdGsWJswXMfGEGa/fpy2nHH8UBe+zI6ScezeI332h2WC1vg49sxuMP3Muiha/w9uLFTPvXROa/PKvZYeVWnm7VG5o4JV0kaZ9GnrPd+S+UNEfSI82KoRmWLV3KE9MfZJ8DD2XchEn0XnU1Ljr3rGaH1fIGbbQJnz/kKP7nqAP4+Xe+xoabDqNHjx7NDiuXakma3TZxrihJK/qv6iJgtzqEUij9Bgyi33qD2GKrZO6D0bvvxePTH2xyVAawyxe/yn+P+xun/OlqVltzLdbbwO2bnWmZxCnpYEkPSXpQ0iXp5pGS7pL0TKn2KWmUpBvLvne2pK+nn2dIOkXSJGDfdP00SVMlPSxpaK3xRMSdwIL6/YTF0Hfd/vQfMIgZ/34KgPvuuoONP7JZk6MygFcXzANg3uyZ3P+Pm9lht72aHFF+5SlxdlmvuqTNgZOBT0XEPEl9gF8DA4AdgaEkD9b/pYbi3oqIHdNyzyCZOmprSUcBxwGHSdoF6Oj+882IGJEx9sOBwwHWG7hBlq/m1vGn/YL/e+xhLHlnCYMGD2HML89pdkgG/Ob4w3n91YX06NmTb5z0U1Z3p12nGtn5U01XDkfaFfhLRMwDiIgF6f8RrouId4FHJfWvsazx7dZLU0FNAb6Ulj8RGL6iQadljQXGAgz7+FZRjzKbbbNhH+eS6+9odhjWzpgLrql+kNXtWfV66crEKaCjpPN2u2MgeQ9IebPBKu2+074LuFTGMtKfoZ41TjPLFwE5yptdmjhvA66VdFZEzE9v1TvzHDBMUi+SpDkamJTlZPWscZpZ3jS2DbOaLkucETFd0s+AOyQtAx6ocOwLkq4EHgKeqnTsipB0OckLmfpKehEYExEXdMW5zKy+2lqkjZOIuJjkjXGd7V+97PMJwAkdHDOks/WImMx7b6arJZ79az3WzHJErXOrbmZWF6KFapxmZvXiGqeZWRZyjdPMLJNkOJITp5lZBi0yHMnMrJ5ylDedOM2sANzGaWaWjds4zcyWQ47yphOnmRWDa5xmZlm4jdPMLJtWmlbOzKxOPI7TzCyzHOVNJ04zKwC3cZqZZeNxnGZmy8GJ08wsoxzlTSdOMysAt3GamWWjnA1Haqt+iJlZ80nVl+pl6EJJcyQ9UrbtVEkzJU1Ll89VK8eJ08wKoU2qutTgImC3DrafFRHD0+WmaoX4Vt3Mck91auOMiDslDVnRcjpNnJJ+D0SFAI5Z0ZObmdWqxrzZV9LksvWxETG2hu8dLelgYDLwg4h4pdLBlWqckyvsMzNrqBo7h+ZFxLYZiz4XOJ2kong68Cvgm5W+0GnijIiLy9clrRYRb2QMyMysLrqqUz0iXn7vHDofuLHad6p2DknaQdKjwGPp+paS/rAigZqZZSGgh1R1Wa6ypQFlq3sDj3R2bEktnUO/AT4LXA8QEQ9KGrk8AZqZLRfVZxynpMuBUSRtoS8CY4BRkoaT3KrPAI6oVk5NveoR8UK7oJdlC9fMbMXU41Y9IvbvYPMFWcupJXG+IGkEEJJWBo4hvW03M2sEQa3jNBuilsR5JPBbYBAwE7gF+HZXBmVm1l6hnlWPiHnAgQ2IxcysQ7U+UtkotfSqbyzpBklz02c8/ypp40YEZ2ZWUqdHLusTSw3HjAOuBAYAA4GrgMu7Migzs/ZUw9IotSRORcQlEbE0XS6lwqOYZmb1JqBHm6oujVLpWfU+6ceJkk4CriBJmPsBExoQm5lZok7jOOulUufQFJJEWYq2fFBo6ZlOM7OGyFHerPis+kaNDMTMrJKi1Dj/Q9IWwDBgldK2iPhzVwVlZlau1MaZF1UTp6QxJM92DgNuAnYHJgFOnGbWMPlJm7X1qu8DjAZeiohvAFsCvbo0KjOzMlK+xnHWcqu+OCLelbRU0prAHMAD4M2soXLUxFlT4pwsaW3gfJKe9teB+7oyKDOz9grVORQRR6Ufz5N0M7BmRDzUtWGZmb1HNHaAezWVBsBvXWlfREztmpDMzNrJ2SQflWqcv6qwL4Bd6xxLLvVeqQebr79ms8OwDuy494+aHYJ14O0ZL3VJuYW4VY+IXRoZiJlZJbUMAWqUmgbAm5k1U+EGwJuZ5UGO8qYTp5nlXzIDfH4yZy0zwEvS1ySdkq4PlvTJrg/NzOw9baq+NCyWGo75A7ADUHqt5iLgnC6LyMysncJMZFxmu4jYWtIDABHxSvqaYDOzhilar/oSST1IX5chaV3g3S6NysysnRw1cdaUOH8HXAv0k/QzktmSftylUZmZlVGDZz+qppZn1S+TNIVkajkBX4yIx7o8MjOzMj1ydK9ey0TGg4E3gRvKt0XE810ZmJlZiaBYNU6SN1qWXtq2CrAR8ASweRfGZWb2PjnKmzXdqn+sfD2dNemITg43M6u/Bo/TrCbzk0MRMVXSJ7oiGDOzjgjokaMqZy1tnN8vW20DtgbmdllEZmYdKFqNc42yz0tJ2jyv7ppwzMw6lqdn1SsmznTg++oRcXyD4jEz+4CkV73ZUbyn0qszekbE0kqv0DAzawgVZz7O+0jaM6dJuh64CnijtDMiruni2MzMgPrVOCVdCOwJzImILdJtfYDxwBBgBvCViHilUjm1jMXvA8wnecfQnsDn0z/NzBpGqr7U4CJgt3bbTgJui4hNgNvS9Yoq1Tj7pT3qj/DeAPiSqClEM7O6EG2seJUzIu6UNKTd5r2AUenni4HbgRMrlVMpcfYAVocOo3XiNLOGkbr0WfX+ETEbICJmS+pX7QuVEufsiPhJ3UIzM1sBNT6r3lfS5LL1sRExtt6xVEqc+enCMrOWJmpuw5wXEdtmLP5lSQPS2uYAYE61L1Sq/I7OeHIzsy7Tls7JWWlZTtcDh6SfDwH+Wu0LndY4I2LB8kZhZlZPybPqdShHupykI6ivpBeBMcAZwJWSDgWeB/atVo5fD2xm+Ven1wNHxP6d7Mp0h+3EaWaFkKdOFydOM8u9Is4Ab2bWdDl6VN2J08yKQMWZVs7MLA9EbRNrNIoTp5kVgmucZmZZyJ1DZmaZ+FbdzGw5+FbdzCyj/KRNJ04zK4DCvVfdzCwPcpQ3nTjNrAiEcnSz7sRpZoXgGqeZWQbJcKT8ZE4nTjPLP0FbjgZy5igU6ypHHPZNBg/sxzbDt2h2KC1v/f5rc/PYY3jg6h8z5S8n8+39R71v//cOGs3iB85mnbVXa06AOaYa/msUJ84WcNAhX+evN97c7DAMWLrsXU769TVs9eWfsvPBZ3LEfiMZuvF6QJJUd91+KM/P9ltr2kvm46y+NIoTZwvYcaeR9OnTp9lhGPDSvNeY9viLALz+5ts8/uxLDFx3bQB+cdyXOfm31xERTYwwv/JU43Qbp1mTDB7Qh+Gbrc/9j8xgj50/xqw5C3n4yZnNDiu38jTJR0NrnJIukrRPI8/Z7vy7SXpC0tOSTmpWHGar9V6Zy888jOPPvJqly5Zx4qGf5SfnTmh2WLnlW/UVIKnHCn73HGB3YBiwv6Rh9YrNrFY9e7Zx+Zn/xfi/Teav/3iQjddflw0HrcN943/I4xNOY1C/tbl73In0X2eNZoeaI7XcqHeTziFJB0t6SNKDki5JN4+UdJekZ0q1T0mjJN1Y9r2zJX09/TxD0imSJgH7puunSZoq6WFJQ2sM55PA0xHxTES8A1wB7FW3H9asRueNOZAnnn2J3136DwCmPz2LDUf/kKF7jGHoHmOYOWchOxzwc16ev6jJkeaIkgHw1ZZG6bLEKWlz4GRg14jYEvhuumsAsCOwJ8mL4GvxVkTsGBFXpOvzImJr4FzguPR8u0ia1sFyV/qdQcALZWW+mG7r9g7+2v6M2mkHnnziCT48ZH0uuvCCZofUskYM35gD99yOnT+xKfdccRL3XHESn93RNz7VlCb5qLY0Sld2Du0K/CUi5gFExIJ0Pr3rIuJd4FFJ/Wssa3y79WvSP6cAX0rLnwgMr1BGR3+rHXZfSjocOBxgg8GDawwxv/586eXNDsFSd017ht5bHV3xmKF7jGlQNMWSn66hrk2couPE9Ha7YwCW8v7a7yrtvvNGJ2UsI/0ZJO0CnNXB+d6MiBEkNcwNyravD8zqKPCIGAuMBdhmm209NsQsD3KUObsycd4GXCvprIiYL6nSQMLngGGSepEkzdHApCwnq6HGeT+wiaSNgJnAV4EDspzDzJqnJWZHiojpkn4G3CFpGfBAhWNfkHQl8BDwVKVjVyCepZKOBm4BegAXRsT0ep/HzLpGI4cbVdOlA+Aj4mLg4gr7Vy/7fAJwQgfHDOlsPSImA6MyxHMTcFOtx5tZjrRK4jQzqwfRIrfqZmZ10+BxmtU4cZpZIThxmpll4ncOmZll5hqnmVkGIled6k6cZlYMylGV04nTzAqhXnlT0gxgEckj20sjYtusZThxmlkh1Lm+uUtpAqLl4cRpZvmXs0bOQs0Ab2atKXl1hqouQF9Jk8uWwzsoLoBbJU3pZH9VrnGaWSHUWOGcV0Ob5aciYpakfsDfJT0eEXdmicU1TjMrBtWw1CAiZqV/zgGuJXmtTiZOnGZWCPV4WZuk1SStUfoMfAZ4JGssvlU3s0Ko03yc/UkmWIck/42LiJuzFuLEaWbFUIfEGRHPAFuuaDlOnGaWe56P08wsK8/HaWaWnROnmVkmno/TzCwz1zjNzDLI2aPqTpxmVgyej9PMLKMc5U0nTjMrhhzlTSdOMysAj+M0M8tGuI3TzCyz/KRNJ04zK4gcVTidOM2sGPzkkJlZRq5xmpllIPeqm5ll51t1M7Os8pM3nTjNrBhylDedOM2sCERbjho5nTjNLPeSJ4eaHcV7/F51M7OMXOM0s0LIU43TidPM8k+4jdPMLAu/OsPMbHnkKHM6cZpZIfjJITOzjNrykzedOM2sIJw4zcyyydOtuiKi2THkmqS5wHPNjqNO+gLzmh2Edag7XZsNI2LdehYo6WaSv6Nq5kXEbvU8d4fxOHG2DkmTI2LbZsdhH+RrUyx+5NLMLCMnTjOzjJw4W8vYZgdgnfK1KRC3cZqZZeQap5lZRk6cZmYZOXGamWXkxGkfIMn/LnJI0srt1vPzKE2LceeQ/YekTwBzIuI5SW0R8W6zY7KEpM8CewBzgRuA6RGxRJLCv8QN55qFASBpd+CfwARJm0XEu6555kP6P7TLgNuBDYGDgeMkrRwR4Zpn4/kXw5DUG9gbOBI4G7isLHn2aG50BvQBLoyIa4BjgFuB/sCxknq6xtl4nh3JiIjFkk4BlkXEXElrkyTPgyLisSaHZ/Ay8GVJ10XEXZJuI5lk7TPAh4EnmhpdC3KN0wCIiJciYm76+QzgL8AlktaQNELSrs2NsDWlbc3TgF8Bh0naMiKWkNy29wf2bGJ4Lcs1zhYnqUdELCt1BpU6GyLiDEkLgBeBt4ARTQ615ZSuTbp6BbAW8D1JF0TEJEn3Av3aHWcN4BpnCytLmoOBSyX1SjsbSu2aS4A3gF0j4t/Ni7T1lF2bDSVdCiwExgGTSZpRzgNOBi5z0mw8D0dqUWW/mOsD44Hfk/Sqvx0R8yStCfwO+FVEPNzMWFtNB9fmbJJb87ci4hVJw4C1gZkR0V0m2S4UJ84W1O4X8yrgl8ADwC3A4RFxe3rcyhHxTvMibT0Vrs2tJNdmYlMDNMC36i2p7Pb8GuAXJL+YVwHfj4jbS+MCnTQbr8K1OTYiJnrMZj64xtkC2j9dkrZhnkHSXnY/ScfD6RFxQ5NCbFm+NsXkxNnNlf9iShoCLIyIhelzz32BfwAnRMT1TQyzJfnaFJcTZzfW7hfzWJIng+4Gno2I09JbwoERcU8z42xFvjbF5jbObqzsF3N7YDOSxyrPAzaX9LOIeD4i7vFjlY3na1NsTpzdnKSdgQkkj1M+CkwFTgc+IulsSDokmhhiy/K1KS4nzm6mvNdV0qHAEOA04DOStkl7yqeTdECsIalfUwJtQb423Ycfuexmym4BPwNsTjKAfaakAMalE3fcJ+lB4L885KhxfG26DyfObqJdZ8NqJO1lLwOl59B/L2kpyXybu0XEFMC/mA3ga9P9+Fa9myj7xdwWWAUYCfQCvl6ayT0izgV+RPLcszWIr0334+FIBVeqzaSztfcleURvBvAbktl0JgB/joifNy3IFuVr0325xllwZU+dKCLmAH8A1gGOBl4heU/N99KxgtZAvjbdlxNnNyBpJPBnSb0j4l7gYpIe25NJXu61HeCnT5rA16Z7cuIsoA4mephDMtnwWZJWjYj7SSaG+CpwBPCi59NsDF+b1uDEWTCSVinrbNhK0scj4nHgVCBI5tAEeBv4F3B5+DW/DeFr0zrcOVQgkj4GbA9cCnwT+C7wEvByROwraSBwJskjfCsB+4VfttYQvjatxeM4i2VDYHdgVWAH4JPpbDr3SroqIvYFDpA0gmSyiNnNDLbF+Nq0EN+qF0A6nIWIuJHkFm9L4EMkQ1yIiO2AQZL+ka7f5V/MxvC1aU1OnAVQageTdCSwNfD/gNeAnSRtkB4zAng3feWCNYivTWvyrXpBSPoC8G1gj4h4XtJrwH7JLk2MiGcj4tPNjbI1+dq0HifO4hhI0gv7vKSeEXGjpGUkHRGLJb1AMj2Ze/saz9emxfhWvTieI7n92ywilqbb2oD5wMSIWOpfzKbxtWkxHo5UEErec34CyS/kXSTv1T4G+GpEPNPE0Fqer03rceIsEEkDgL2ALwCvAv8TEQ81NyoDX5tW48RZQOlbEP3e8xzytWkNTpxmZhm5c8jMLCMnTjOzjJw4zcwycuI0M8vIidPMLCMnTquJpGWSpkl6RNJVklZdgbIukrRP+vlPkoZVOHZUOhVb1nPMkNS31u3tjnk947lOlXRc1hituJw4rVaLI2J4RGxB8s7vI8t3SuqxPIVGxGER8WiFQ0YBmROnWVdy4rTl8U/gI2ltcKKkccDDknpI+qWk+yU9JOkISKYIknS2pEclTQD6lQqSdHv6vnEk7SZpqqQHJd0maQhJgj42re3uJGldSVen57hf0qfS764j6VZJD0j6I9D+3T8fIOk6SVMkTZd0eLt9v0pjuU3Suum2D0u6Of3OPyUNrcvfphWOZ0eyTCT1JJnp/OZ00yeBLSLi2TT5vBoRn5DUC/iXpFuBrUheGfExoD/wKHBhu3LXBc4HRqZl9YmIBZLOA16PiDPT48YBZ0XEJEmDgVuAjwJjgEkR8RNJewDvS4Sd+GZ6jt7A/ZKujoj5wGrA1Ij4gaRT0rKPBsYCR0bEU5K2I3nd767L8ddoBefEabXqLWla+vmfwAUkt9D3RcSz6fbPAB8vtV8CawGbACNJpl1bBswqzYbezvbAnaWyImJBJ3F8Ghim914muaakNdJzfCn97gRJr9TwMx0jae/08wZprPOBd4Hx6fZLgWskrZ7+vFeVnbtXDeewbsiJ02q1OCKGl29IE8gb5ZuA70TELe2O+xzJWx4rUQ3HQNK8tENELO4glpqfH5Y0iiQJ7xARb0q6HVilk8MjPe/C9n8H1prcxmn1dAvwLUkrAUjaVNJqwJ3AV9M20AHALh18925gZ0kbpd/tk25fBKxRdtytJLfNpMcNTz/eCRyYbtud5L0/lawFvJImzaEkNd6SNqBUaz6ApAngNeBZSfum55CkLaucw7opJ06rpz+RtF9OlfQI8EeSu5prgaeAh4FzgTvafzEi5pK0S14j6UHeu1W+Adi71DlEMs/ltmnn06O817t/GjBS0lSSJoPnq8R6M9BT0kPA6cA9ZfveADaXNIWkDfMn6fYDgUPT+KaTTCNnLcizI5mZZeQap5lZRk6cZmYZOXGamWXkxGlmlpETp5lZRk6cZmYZOXGamWX0/wHUBszU3QiRVwAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Compute confusion matrix\n", "cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0])\n", "np.set_printoptions(precision=2)\n", "\n", "# Plot non-normalized confusion matrix\n", "plt.figure()\n", "plot_confusion_matrix(cnf_matrix, classes=['churn=1','churn=0'],normalize= False, title='Confusion matrix')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look at first row. The first row is for customers whose actual churn value in test set is 1.\n", "As you can calculate, out of 40 customers, the churn value of 15 of them is 1. \n", "And out of these 15, the classifier correctly predicted 6 of them as 1, and 9 of them as 0. \n", "\n", "It means, for 6 customers, the actual churn value were 1 in test set, and classifier also correctly predicted those as 1. However, while the actual label of 9 customers were 1, the classifier predicted those as 0, which is not very good. We can consider it as error of the model for first row.\n", "\n", "What about the customers with churn value 0? Lets look at the second row.\n", "It looks like there were 25 customers whom their churn value were 0. \n", "\n", "\n", "The classifier correctly predicted 24 of them as 0, and one of them wrongly as 1. So, it has done a good job in predicting the customers with churn value 0. A good thing about confusion matrix is that shows the model’s ability to correctly predict or separate the classes. In specific case of binary classifier, such as this example, we can interpret these numbers as the count of true positives, false positives, true negatives, and false negatives. " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.73 0.96 0.83 25\n", " 1 0.86 0.40 0.55 15\n", "\n", " accuracy 0.75 40\n", " macro avg 0.79 0.68 0.69 40\n", "weighted avg 0.78 0.75 0.72 40\n", "\n" ] } ], "source": [ "print (classification_report(y_test, yhat))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the count of each section, we can calculate precision and recall of each label:\n", "\n", "\n", "- __Precision__ is a measure of the accuracy provided that a class label has been predicted. It is defined by: precision = TP / (TP + FP)\n", "\n", "- __Recall__ is true positive rate. It is defined as: Recall =  TP / (TP + FN)\n", "\n", " \n", "So, we can calculate precision and recall of each class.\n", "\n", "__F1 score:__\n", "Now we are in the position to calculate the F1 scores for each label based on the precision and recall of that label. \n", "\n", "The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. It is a good way to show that a classifer has a good value for both recall and precision.\n", "\n", "\n", "And finally, we can tell the average accuracy for this classifier is the average of the F1-score for both labels, which is 0.72 in our case." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### log loss\n", "Now, lets try __log loss__ for evaluation. In logistic regression, the output can be the probability of customer churn is yes (or equals to 1). This probability is a value between 0 and 1.\n", "Log loss( Logarithmic loss) measures the performance of a classifier where the predicted output is a probability value between 0 and 1. \n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6017092478101186" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "log_loss(y_test, yhat_prob)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Thanks for Reading :)\n", "Created by [Saeed Aghabozorgi](https://www.linkedin.com/in/saeedaghabozorgi/) and modified by [Tarun Kamboj](https://www.linkedin.com/in/kambojtarun/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 2 }