{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pie Chart\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's import all the dependencies first." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# primary data structure library\n", "import pandas as pd \n", "\n", "# primary plotting library\n", "import matplotlib as mpl \n", "\n", "# importing the pyplot layer of matplotlib for easy usage\n", "import matplotlib.pyplot as plt \n", "\n", "# optional: for ggplot-like style of plots\n", "mpl.style.use(['ggplot']) \n", "\n", "# using the inline backend\n", "%matplotlib inline " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocessing Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dataset: Immigration to Canada from 1980 to 2013 - International migration flows to and from selected countries - The 2015 revision from the United Nations website.\n", "\n", "The dataset contains annual data on the flows of international migrants as recorded by the countries of destination. The data present both inflows and outflows according to the place of birth, citizenship, or place of previous / next residence, for both foreigners and nationals. In this lab, we will focus on the Canadian immigration data." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>Type</th><th>Coverage</th><th>OdName</th><th>AREA</th><th>AreaName</th><th>REG</th><th>RegName</th><th>DEV</th><th>DevName</th><th>1980</th><th>...</th><th>2004</th><th>2005</th><th>2006</th><th>2007</th><th>2008</th><th>2009</th><th>2010</th><th>2011</th><th>2012</th><th>2013</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Immigrants</td><td>Foreigners</td><td>Afghanistan</td><td>935</td><td>Asia</td><td>5501</td><td>Southern Asia</td><td>902</td><td>Developing regions</td><td>16</td><td>...</td><td>2978</td><td>3436</td><td>3009</td><td>2652</td><td>2111</td><td>1746</td><td>1758</td><td>2203</td><td>2635</td><td>2004</td></tr>\n",
"<tr><th>1</th><td>Immigrants</td><td>Foreigners</td><td>Albania</td><td>908</td><td>Europe</td><td>925</td><td>Southern Europe</td><td>901</td><td>Developed regions</td><td>1</td><td>...</td><td>1450</td><td>1223</td><td>856</td><td>702</td><td>560</td><td>716</td><td>561</td><td>539</td><td>620</td><td>603</td></tr>\n",
"<tr><th>2</th><td>Immigrants</td><td>Foreigners</td><td>Algeria</td><td>903</td><td>Africa</td><td>912</td><td>Northern Africa</td><td>902</td><td>Developing regions</td><td>80</td><td>...</td><td>3616</td><td>3626</td><td>4807</td><td>3623</td><td>4005</td><td>5393</td><td>4752</td><td>4325</td><td>3774</td><td>4331</td></tr>\n",
"<tr><th>3</th><td>Immigrants</td><td>Foreigners</td><td>American Samoa</td><td>909</td><td>Oceania</td><td>957</td><td>Polynesia</td><td>902</td><td>Developing regions</td><td>0</td><td>...</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>4</th><td>Immigrants</td><td>Foreigners</td><td>Andorra</td><td>908</td><td>Europe</td><td>925</td><td>Southern Europe</td><td>901</td><td>Developed regions</td><td>0</td><td>...</td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows × 43 columns</p>\n",
"</div>
" ], "text/plain": [ " Type Coverage OdName AREA AreaName REG \\\n", "0 Immigrants Foreigners Afghanistan 935 Asia 5501 \n", "1 Immigrants Foreigners Albania 908 Europe 925 \n", "2 Immigrants Foreigners Algeria 903 Africa 912 \n", "3 Immigrants Foreigners American Samoa 909 Oceania 957 \n", "4 Immigrants Foreigners Andorra 908 Europe 925 \n", "\n", " RegName DEV DevName 1980 ... 2004 2005 2006 \\\n", "0 Southern Asia 902 Developing regions 16 ... 2978 3436 3009 \n", "1 Southern Europe 901 Developed regions 1 ... 1450 1223 856 \n", "2 Northern Africa 902 Developing regions 80 ... 3616 3626 4807 \n", "3 Polynesia 902 Developing regions 0 ... 0 0 1 \n", "4 Southern Europe 901 Developed regions 0 ... 0 0 1 \n", "\n", " 2007 2008 2009 2010 2011 2012 2013 \n", "0 2652 2111 1746 1758 2203 2635 2004 \n", "1 702 560 716 561 539 620 603 \n", "2 3623 4005 5393 4752 4325 3774 4331 \n", "3 0 0 0 0 0 0 0 \n", "4 1 0 0 0 0 1 1 \n", "\n", "[5 rows x 43 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_can = pd.read_excel('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Canada.xlsx',\n", " sheet_name='Canada by Citizenship',\n", " skiprows=range(20),\n", " skipfooter=2\n", " )\n", "\n", "df_can.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clean up data. We will make some modifications to the original dataset to make it easier to create our visualizations." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data dimensions: (195, 38)\n" ] } ], "source": [ "# clean up the dataset to remove unnecessary columns (eg. 
REG) \n", "df_can.drop(['AREA', 'REG', 'DEV', 'Type', 'Coverage'], axis=1, inplace=True)\n", "\n", "# let's rename the columns so that they make sense\n", "df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent','RegName':'Region'}, inplace=True)\n", "\n", "# for the sake of consistency, let's also make all column labels of type string\n", "df_can.columns = list(map(str, df_can.columns))\n", "\n", "# set the country name as index - useful for quickly looking up countries using the .loc method\n", "df_can.set_index('Country', inplace=True)\n", "\n", "# add a total column; numeric_only=True skips the text columns (Continent, Region, DevName)\n", "df_can['Total'] = df_can.sum(axis=1, numeric_only=True)\n", "\n", "# years that we will be using in this lesson - useful for plotting later on\n", "years = list(map(str, range(1980, 2014)))\n", "print('data dimensions:', df_can.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use a pie chart to explore the proportion (percentage) of new immigrants grouped by continent for the entire time period from 1980 to 2013. \n", "\n", "Step 1: Gather data. \n", "\n", "We will use the *pandas* `groupby` method to summarize the immigration data by `Continent`. The general process of `groupby` involves the following steps:\n", "\n", "1. **Split:** Splitting the data into groups based on some criteria.\n", "2. **Apply:** Applying a function to each group independently:\n", " .sum()\n", " .count()\n", " .mean() \n", " .std() \n", " .aggregate()\n", " .apply()\n", " etc.\n", "3. **Combine:** Combining the results into a data structure." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>1980</th><th>1981</th><th>1982</th><th>1983</th><th>1984</th><th>1985</th><th>1986</th><th>1987</th><th>1988</th><th>1989</th><th>...</th><th>2005</th><th>2006</th><th>2007</th><th>2008</th><th>2009</th><th>2010</th><th>2011</th><th>2012</th><th>2013</th><th>Total</th></tr>\n",
"<tr><th>Continent</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>Africa</th><td>3951</td><td>4363</td><td>3819</td><td>2671</td><td>2639</td><td>2650</td><td>3782</td><td>7494</td><td>7552</td><td>9894</td><td>...</td><td>27523</td><td>29188</td><td>28284</td><td>29890</td><td>34534</td><td>40892</td><td>35441</td><td>38083</td><td>38543</td><td>618948</td></tr>\n",
"<tr><th>Asia</th><td>31025</td><td>34314</td><td>30214</td><td>24696</td><td>27274</td><td>23850</td><td>28739</td><td>43203</td><td>47454</td><td>60256</td><td>...</td><td>159253</td><td>149054</td><td>133459</td><td>139894</td><td>141434</td><td>163845</td><td>146894</td><td>152218</td><td>155075</td><td>3317794</td></tr>\n",
"<tr><th>Europe</th><td>39760</td><td>44802</td><td>42720</td><td>24638</td><td>22287</td><td>20844</td><td>24370</td><td>46698</td><td>54726</td><td>60893</td><td>...</td><td>35955</td><td>33053</td><td>33495</td><td>34692</td><td>35078</td><td>33425</td><td>26778</td><td>29177</td><td>28691</td><td>1410947</td></tr>\n",
"<tr><th>Latin America and the Caribbean</th><td>13081</td><td>15215</td><td>16769</td><td>15427</td><td>13678</td><td>15171</td><td>21179</td><td>28471</td><td>21924</td><td>25060</td><td>...</td><td>24747</td><td>24676</td><td>26011</td><td>26547</td><td>26867</td><td>28818</td><td>27856</td><td>27173</td><td>24950</td><td>765148</td></tr>\n",
"<tr><th>Northern America</th><td>9378</td><td>10030</td><td>9074</td><td>7100</td><td>6661</td><td>6543</td><td>7074</td><td>7705</td><td>6469</td><td>6790</td><td>...</td><td>8394</td><td>9613</td><td>9463</td><td>10190</td><td>8995</td><td>8142</td><td>7677</td><td>7892</td><td>8503</td><td>241142</td></tr>\n",
"<tr><th>Oceania</th><td>1942</td><td>1839</td><td>1675</td><td>1018</td><td>878</td><td>920</td><td>904</td><td>1200</td><td>1181</td><td>1539</td><td>...</td><td>1585</td><td>1473</td><td>1693</td><td>1834</td><td>1860</td><td>1834</td><td>1548</td><td>1679</td><td>1775</td><td>55174</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>6 rows × 35 columns</p>\n",
"</div>" ], "text/plain": [ " 1980 1981 1982 1983 1984 1985 \\\n", "Continent \n", "Africa 3951 4363 3819 2671 2639 2650 \n", "Asia 31025 34314 30214 24696 27274 23850 \n", "Europe 39760 44802 42720 24638 22287 20844 \n", "Latin America and the Caribbean 13081 15215 16769 15427 13678 15171 \n", "Northern America 9378 10030 9074 7100 6661 6543 \n", "Oceania 1942 1839 1675 1018 878 920 \n", "\n", " 1986 1987 1988 1989 ... 2005 \\\n", "Continent ... \n", "Africa 3782 7494 7552 9894 ... 27523 \n", "Asia 28739 43203 47454 60256 ... 159253 \n", "Europe 24370 46698 54726 60893 ... 35955 \n", "Latin America and the Caribbean 21179 28471 21924 25060 ... 24747 \n", "Northern America 7074 7705 6469 6790 ... 8394 \n", "Oceania 904 1200 1181 1539 ... 1585 \n", "\n", " 2006 2007 2008 2009 2010 \\\n", "Continent \n", "Africa 29188 28284 29890 34534 40892 \n", "Asia 149054 133459 139894 141434 163845 \n", "Europe 33053 33495 34692 35078 33425 \n", "Latin America and the Caribbean 24676 26011 26547 26867 28818 \n", "Northern America 9613 9463 10190 8995 8142 \n", "Oceania 1473 1693 1834 1860 1834 \n", "\n", " 2011 2012 2013 Total \n", "Continent \n", "Africa 35441 38083 38543 618948 \n", "Asia 146894 152218 155075 3317794 \n", "Europe 26778 29177 28691 1410947 \n", "Latin America and the Caribbean 27856 27173 24950 765148 \n", "Northern America 7677 7892 8503 241142 \n", "Oceania 1548 1679 1775 55174 \n", "\n", "[6 rows x 35 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# group countries by continent and sum the numeric columns\n", "df_continents = df_can.groupby('Continent').sum(numeric_only=True)\n", "\n", "# note: the output of the groupby method is a `groupby` object;\n", "# we cannot use it further until we apply a function (e.g. .sum())\n", "print(type(df_can.groupby('Continent')))\n", "\n", "df_continents.head(6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Step 2: Plot the data. 
We will pass in the `kind='pie'` keyword, along with the following additional parameters:\n", "- `autopct` - a string or function used to label the wedges with their numeric value. The label will be placed inside the wedge. If it is a format string, the label will be `fmt%pct`.\n", "- `startangle` - rotates the start of the pie chart by angle degrees counterclockwise from the x-axis.\n", "- `shadow` - draws a shadow beneath the pie (to give a 3D feel)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# autopct adds percentage labels; startangle sets where the first wedge begins\n", "df_continents['Total'].plot(kind='pie',\n", " figsize=(5, 6),\n", " autopct='%1.1f%%', # add in percentages\n", " startangle=90, # start angle 90° (Africa)\n", " shadow=True, # add shadow \n", " )\n", "\n", "plt.title('Immigration to Canada by Continent [1980 - 2013]')\n", "plt.axis('equal') # Sets the pie chart to look like a circle.\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above visual is not very clear: the numbers and text overlap in some instances. Let's make a few modifications to improve the visuals:\n", "\n", "* Remove the text labels on the pie chart by passing in `labels=None`, and add them as a separate legend using `plt.legend()`.\n", "* Push out the percentages to sit just outside the pie chart by passing in the `pctdistance` parameter.\n", "* Pass in a custom set of colors for the continents by passing in the `colors` parameter.\n", "* **Explode** the pie chart to emphasize the lowest three continents (Africa, Northern America, and Oceania) by passing in the `explode` parameter.\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "colors_list = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue', 'lightgreen', 'pink']\n", "explode_list = [0.1, 0, 0, 0, 0.1, 0.1] # ratio for each continent with which to offset each wedge.\n", "\n", "df_continents['Total'].plot(kind='pie',\n", " figsize=(15, 6),\n", " autopct='%1.1f%%', \n", " startangle=90, \n", " shadow=True, \n", " labels=None, # turn off labels on pie chart\n", " pctdistance=1.12, # the ratio between the center of each pie slice and the start of the text generated by autopct \n", " colors=colors_list, # add custom colors\n", " explode=explode_list # 'explode' lowest 3 continents\n", " )\n", "\n", "# move the title up by 12% to match pctdistance\n", "plt.title('Immigration to Canada by Continent [1980 - 2013]', y=1.12) \n", "\n", "plt.axis('equal') \n", "\n", "# add legend\n", "plt.legend(labels=df_continents.index, loc='upper left') \n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now this looks pretty nice!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Thanks for reading :)\n", "Created by [Tarun Kamboj](https://www.linkedin.com/in/kambojtarun/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }