{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SVM (Support Vector Machines)\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Importing needed packages" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from sklearn import preprocessing, svm\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import classification_report, confusion_matrix, f1_score\n", "import itertools\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline " ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "## Load the Cancer data\n", "\n", "The example is based on a dataset that is publicly available from the UCI Machine Learning Repository (Asuncion and Newman, 2007) http://mlearn.ics.uci.edu/MLRepository.html. The dataset consists of several hundred human cell sample records, each of which contains the values of a set of cell characteristics. The fields in each record are:\n", "\n", "|Field name|Description|\n", "|--- |--- |\n", "|ID|Patient identifier|\n", "|Clump|Clump thickness|\n", "|UnifSize|Uniformity of cell size|\n", "|UnifShape|Uniformity of cell shape|\n", "|MargAdh|Marginal adhesion|\n", "|SingEpiSize|Single epithelial cell size|\n", "|BareNuc|Bare nuclei|\n", "|BlandChrom|Bland chromatin|\n", "|NormNucl|Normal nucleoli|\n", "|Mit|Mitoses|\n", "|Class|Benign or malignant|\n" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "### Load Data From CSV File " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
 "  <thead>\n",
 "    <tr><th></th><th>ID</th><th>Clump</th><th>UnifSize</th><th>UnifShape</th><th>MargAdh</th><th>SingEpiSize</th><th>BareNuc</th><th>BlandChrom</th><th>NormNucl</th><th>Mit</th><th>Class</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>1000025</td><td>5</td><td>1</td><td>1</td><td>1</td><td>2</td><td>1</td><td>3</td><td>1</td><td>1</td><td>2</td></tr>\n",
 "    <tr><th>1</th><td>1002945</td><td>5</td><td>4</td><td>4</td><td>5</td><td>7</td><td>10</td><td>3</td><td>2</td><td>1</td><td>2</td></tr>\n",
 "    <tr><th>2</th><td>1015425</td><td>3</td><td>1</td><td>1</td><td>1</td><td>2</td><td>2</td><td>3</td><td>1</td><td>1</td><td>2</td></tr>\n",
 "    <tr><th>3</th><td>1016277</td><td>6</td><td>8</td><td>8</td><td>1</td><td>3</td><td>4</td><td>3</td><td>7</td><td>1</td><td>2</td></tr>\n",
 "    <tr><th>4</th><td>1017023</td><td>4</td><td>1</td><td>1</td><td>3</td><td>2</td><td>1</td><td>3</td><td>1</td><td>1</td><td>2</td></tr>\n",
 "  </tbody>\n",
 "</table>
" ], "text/plain": [ " ID Clump UnifSize UnifShape MargAdh SingEpiSize BareNuc \\\n", "0 1000025 5 1 1 1 2 1 \n", "1 1002945 5 4 4 5 7 10 \n", "2 1015425 3 1 1 1 2 2 \n", "3 1016277 6 8 8 1 3 4 \n", "4 1017023 4 1 1 3 2 1 \n", "\n", " BlandChrom NormNucl Mit Class \n", "0 3 1 1 2 \n", "1 3 2 1 2 \n", "2 3 1 1 2 \n", "3 3 7 1 2 \n", "4 3 1 1 2 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cell_df = pd.read_csv(\"cell_samples.csv\")\n", "cell_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ID field contains the patient identifiers. The characteristics of the cell samples from each patient are contained in fields Clump to Mit. The values are graded from 1 to 10, with 1 being the closest to benign.\n", "\n", "The Class field contains the diagnosis, as confirmed by separate medical procedures, as to whether the samples are benign (value = 2) or malignant (value = 4).\n", "\n", "Lets look at the distribution of the classes based on Clump thickness and Uniformity of cell size:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "ax = cell_df[cell_df['Class'] == 4][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='DarkBlue', label='malignant');\n", "cell_df[cell_df['Class'] == 2][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='Yellow', label='benign', ax=ax);\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data pre-processing and selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets first look at columns data types:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ID int64\n", "Clump int64\n", "UnifSize int64\n", "UnifShape int64\n", "MargAdh int64\n", "SingEpiSize int64\n", "BareNuc object\n", "BlandChrom int64\n", "NormNucl int64\n", "Mit int64\n", "Class int64\n", "dtype: object" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cell_df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like the __BareNuc__ column includes some values that are not numerical. 
We can drop those rows and convert the column to integers:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ID int64\n", "Clump int64\n", "UnifSize int64\n", "UnifShape int64\n", "MargAdh int64\n", "SingEpiSize int64\n", "BareNuc int32\n", "BlandChrom int64\n", "NormNucl int64\n", "Mit int64\n", "Class int64\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keep only rows whose BareNuc value parses as a number; copy to avoid chained-assignment warnings\n", "cell_df = cell_df[pd.to_numeric(cell_df['BareNuc'], errors='coerce').notnull()].copy()\n", "cell_df['BareNuc'] = cell_df['BareNuc'].astype('int')\n", "cell_df.dtypes" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 5, 1, 1, 1, 2, 1, 3, 1, 1],\n", " [ 5, 4, 4, 5, 7, 10, 3, 2, 1],\n", " [ 3, 1, 1, 1, 2, 2, 3, 1, 1],\n", " [ 6, 8, 8, 1, 3, 4, 3, 7, 1],\n", " [ 4, 1, 1, 3, 2, 1, 3, 1, 1]], dtype=int64)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_df = cell_df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]\n", "X = np.asarray(feature_df)\n", "X[0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want the model to predict the value of Class (that is, benign (=2) or malignant (=4)). As this field can take only two possible values, we make sure it is stored as an integer so it can serve as the class label." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 4, 2, 4])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cell_df['Class'] = cell_df['Class'].astype('int')\n", "y = np.asarray(cell_df['Class'])\n", "y[0:15]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train/Test dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we split the dataset into training and test sets:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train set: (546, 9) (546,)\n", "Test set: (137, 9) (137,)\n" ] } ], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)\n", "print('Train set:', X_train.shape, y_train.shape)\n", "print('Test set:', X_test.shape, y_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
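As a rough illustration of what the 80/20 split above does, the sketch below shuffles row indices with Python's standard library and slices off the test portion. `simple_split` is a hypothetical helper, not part of scikit-learn, and its shuffle order will not match `train_test_split`. It assumes the test count is rounded up, which is consistent with the 546/137 sizes printed above for 683 rows.

```python
import math
import random

def simple_split(n_samples, test_size=0.2, seed=4):
    # Hypothetical helper: returns (train_indices, test_indices).
    # Assumption: the test count is rounded up, so 683 rows give 137 test rows.
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(683)
print(len(train_idx), len(test_idx))
```

Every row index ends up in exactly one of the two lists, which is the property that keeps the test set unseen during training.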

## Modeling (SVM with Scikit-learn)

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The SVM algorithm offers a choice of kernel functions for performing its processing. Basically, mapping data into a higher dimensional space is called kernelling. The mathematical function used for the transformation is known as the kernel function, and can be of different types, such as:\n", "\n", " 1.Linear\n", " 2.Polynomial\n", " 3.Radial basis function (RBF)\n", " 4.Sigmoid\n", "Each of these functions has its characteristics, its pros and cons, and its equation, but as there's no easy way of knowing which function performs best with any given dataset, we usually choose different functions in turn and compare the results. Let's just use the default, RBF (Radial Basis Function) for this lab." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SVC()" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf = svm.SVC(kernel='rbf')\n", "clf.fit(X_train, y_train) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After being fitted, the model can then be used to predict new values:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2, 4, 2, 4, 2])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "yhat = clf.predict(X_test)\n", "yhat [0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

## Evaluation
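
The precision, recall and F1 figures reported in this section all follow from the confusion matrix. As a small cross-check, the sketch below recomputes them by hand from the counts `[[85, 5], [0, 47]]` that this notebook prints for its test split (rows are the true classes 2 and 4, columns are the predictions). `per_class_scores` is a hypothetical helper and assumes no row or column of the matrix sums to zero.

```python
# Confusion matrix for labels [2, 4]: rows are true classes, columns are predictions
cm = [[85, 5],
      [0, 47]]
labels = [2, 4]

def per_class_scores(cm, i):
    # Hypothetical helper: precision, recall, F1 and support for class index i
    tp = cm[i][i]
    predicted = sum(row[i] for row in cm)  # column total
    actual = sum(cm[i])                    # row total (support)
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, actual

total = sum(sum(row) for row in cm)
weighted_f1 = 0.0
for i, label in enumerate(labels):
    p, r, f1, support = per_class_scores(cm, i)
    # Weight each class by its share of the test samples
    weighted_f1 += f1 * support / total
    print(label, round(p, 2), round(r, 2), round(f1, 2), support)
print('weighted F1:', round(weighted_f1, 4))
```

The weighted F1 is the support-weighted mean of the per-class F1 values, which is what `average='weighted'` asks `f1_score` to compute.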

" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def plot_confusion_matrix(cm, classes,\n", " normalize=False,\n", " title='Confusion matrix',\n", " cmap=plt.cm.Blues):\n", " \"\"\"\n", " This function prints and plots the confusion matrix.\n", " Normalization can be applied by setting `normalize=True`.\n", " \"\"\"\n", " if normalize:\n", " cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", " print(\"Normalized confusion matrix\")\n", " else:\n", " print('Confusion matrix, without normalization')\n", "\n", " print(cm)\n", "\n", " plt.imshow(cm, interpolation='nearest', cmap=cmap)\n", " plt.title(title)\n", " plt.colorbar()\n", " tick_marks = np.arange(len(classes))\n", " plt.xticks(tick_marks, classes, rotation=45)\n", " plt.yticks(tick_marks, classes)\n", "\n", " fmt = '.2f' if normalize else 'd'\n", " thresh = cm.max() / 2.\n", " for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n", " plt.text(j, i, format(cm[i, j], fmt),\n", " horizontalalignment=\"center\",\n", " color=\"white\" if cm[i, j] > thresh else \"black\")\n", "\n", " plt.tight_layout()\n", " plt.ylabel('True label')\n", " plt.xlabel('Predicted label')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 2 1.00 0.94 0.97 90\n", " 4 0.90 1.00 0.95 47\n", "\n", " accuracy 0.96 137\n", " macro avg 0.95 0.97 0.96 137\n", "weighted avg 0.97 0.96 0.96 137\n", "\n" ] } ], "source": [ "# Compute confusion matrix\n", "cnf_matrix = confusion_matrix(y_test, yhat, labels=[2,4])\n", "np.set_printoptions(precision=2)\n", "\n", "print (classification_report(y_test, yhat))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confusion matrix, without normalization\n", "[[85 5]\n", " [ 0 47]]\n" ] }, { "data": { "image/png": "\n", 
"text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Plot non-normalized confusion matrix\n", "plt.figure()\n", "plot_confusion_matrix(cnf_matrix, classes=['Benign(2)','Malignant(4)'],normalize= False, title='Confusion matrix')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also easily use the __f1_score__ from sklearn library:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9639038982104676" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f1_score(y_test, yhat, average='weighted') " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Thanks for reading :)\n", "Created by [Saeed Aghabozorgi](https://www.linkedin.com/in/saeedaghabozorgi/) and modified by [Tarun Kamboj](https://www.linkedin.com/in/kambojtarun/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }