{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# A bit more of Pandas \n",
    "### and some RegEx\n",
    "<br><br>\n",
    "<div style=\"background-color:rgba(128, 128, 0, 0.1); text-align:left; vertical-align: middle; padding:20px 0;\">\n",
    "<p style=\"font-size:134%;color:Deep Teal;\">SC 4125: Developing Data Products</p>\n",
    "    <p style=\"font-size:100%;color:Olive;\">Module-2: Basics of Pandas structures and manipulation</p><br>\n",
    "\n",
    "    \n",
    "<br> \n",
    "by <a href=\"https://personal.ntu.edu.sg/anwitaman/\" style=\"font-size:100%;color:Deep Teal;\">Anwitaman DATTA</a><br>\n",
    "School of Computer Science and Engineering, NTU Singapore.        \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Teaching material\n",
    "- <a href=\"M2-PanEx.slides.html\">.html</a> deck of slides\n",
    "- <a href=\"M2-PanEx.ipynb\">.ipynb</a> Jupyter notebook"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Disclaimer/Caveat emptor\n",
    "\n",
    "- Non-systematic and non-exhaustive review\n",
    "- Example solutions are not necessarily the most efficient or elegant, let alone unique"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "### Positioning this module in the big picture\n",
    "\n",
    "<img src=\"pics/Module-2-In-BigPic.png\" alt=\"Big picture\" width=\"500\"/><br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<img src=\"pics/cute-panda.png\" alt=\"Basic Pandas Training\" width=\"300\"/><br>\n",
    "\n",
    "- Some basic tools for cleaning and manipulating your data: Apply, Lambdas, Merge, Groupby, Missing data, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# housekeeping - imports necessary libraries\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import re\n",
    "import datetime "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Load data and carry out some preliminary cleaning\n",
    "\n",
    "Let's load the same gapminder data we had worked with in the previous module, but from a .tsv file this time. Likewise, let's load the CountriesOfTheWorld data from a .csv file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>continent</th>\n",
       "      <th>year</th>\n",
       "      <th>lifeExp</th>\n",
       "      <th>pop</th>\n",
       "      <th>gdpPercap</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>761</th>\n",
       "      <td>Israel</td>\n",
       "      <td>Asia</td>\n",
       "      <td>1977</td>\n",
       "      <td>73.060</td>\n",
       "      <td>3495918</td>\n",
       "      <td>13306.619210</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1263</th>\n",
       "      <td>Reunion</td>\n",
       "      <td>Africa</td>\n",
       "      <td>1967</td>\n",
       "      <td>60.542</td>\n",
       "      <td>414024</td>\n",
       "      <td>4021.175739</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>787</th>\n",
       "      <td>Jamaica</td>\n",
       "      <td>Americas</td>\n",
       "      <td>1987</td>\n",
       "      <td>71.770</td>\n",
       "      <td>2326606</td>\n",
       "      <td>6351.237495</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>868</th>\n",
       "      <td>Lebanon</td>\n",
       "      <td>Asia</td>\n",
       "      <td>1972</td>\n",
       "      <td>65.421</td>\n",
       "      <td>2680018</td>\n",
       "      <td>7486.384341</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1473</th>\n",
       "      <td>Sweden</td>\n",
       "      <td>Europe</td>\n",
       "      <td>1997</td>\n",
       "      <td>79.390</td>\n",
       "      <td>8897619</td>\n",
       "      <td>25266.594990</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      country continent  year  lifeExp      pop     gdpPercap\n",
       "761    Israel      Asia  1977   73.060  3495918  13306.619210\n",
       "1263  Reunion    Africa  1967   60.542   414024   4021.175739\n",
       "787   Jamaica  Americas  1987   71.770  2326606   6351.237495\n",
       "868   Lebanon      Asia  1972   65.421  2680018   7486.384341\n",
       "1473   Sweden    Europe  1997   79.390  8897619  25266.594990"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load data\n",
    "gapminderdatapath ='data/gapminder/' # change this to adjust relative path\n",
    "gap_df = pd.read_csv(gapminderdatapath+'gapminder.tsv', sep='\\t')\n",
    "gap_df.sample(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Country</th>\n",
       "      <th>Region</th>\n",
       "      <th>Population</th>\n",
       "      <th>Area (sq. mi.)</th>\n",
       "      <th>Pop. Density (per sq. mi.)</th>\n",
       "      <th>Coastline (coast/area ratio)</th>\n",
       "      <th>Net migration</th>\n",
       "      <th>Infant mortality (per 1000 births)</th>\n",
       "      <th>GDP ($ per capita)</th>\n",
       "      <th>Literacy (%)</th>\n",
       "      <th>Phones (per 1000)</th>\n",
       "      <th>Arable (%)</th>\n",
       "      <th>Crops (%)</th>\n",
       "      <th>Other (%)</th>\n",
       "      <th>Climate</th>\n",
       "      <th>Birthrate</th>\n",
       "      <th>Deathrate</th>\n",
       "      <th>Agriculture</th>\n",
       "      <th>Industry</th>\n",
       "      <th>Service</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>127</th>\n",
       "      <td>Maldives</td>\n",
       "      <td>ASIA (EX. NEAR EAST)</td>\n",
       "      <td>359008</td>\n",
       "      <td>300</td>\n",
       "      <td>1196,7</td>\n",
       "      <td>214,67</td>\n",
       "      <td>0</td>\n",
       "      <td>56,52</td>\n",
       "      <td>3900.0</td>\n",
       "      <td>97,2</td>\n",
       "      <td>90,0</td>\n",
       "      <td>13,33</td>\n",
       "      <td>16,67</td>\n",
       "      <td>70</td>\n",
       "      <td>2</td>\n",
       "      <td>34,81</td>\n",
       "      <td>7,06</td>\n",
       "      <td>0,2</td>\n",
       "      <td>0,18</td>\n",
       "      <td>0,62</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Aruba</td>\n",
       "      <td>LATIN AMER. &amp; CARIB</td>\n",
       "      <td>71891</td>\n",
       "      <td>193</td>\n",
       "      <td>372,5</td>\n",
       "      <td>35,49</td>\n",
       "      <td>0</td>\n",
       "      <td>5,89</td>\n",
       "      <td>28000.0</td>\n",
       "      <td>97,0</td>\n",
       "      <td>516,1</td>\n",
       "      <td>10,53</td>\n",
       "      <td>0</td>\n",
       "      <td>89,47</td>\n",
       "      <td>2</td>\n",
       "      <td>11,03</td>\n",
       "      <td>6,68</td>\n",
       "      <td>0,004</td>\n",
       "      <td>0,333</td>\n",
       "      <td>0,663</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>130</th>\n",
       "      <td>Marshall Islands</td>\n",
       "      <td>OCEANIA</td>\n",
       "      <td>60422</td>\n",
       "      <td>11854</td>\n",
       "      <td>5,1</td>\n",
       "      <td>3,12</td>\n",
       "      <td>-6,04</td>\n",
       "      <td>29,45</td>\n",
       "      <td>1600.0</td>\n",
       "      <td>93,7</td>\n",
       "      <td>91,2</td>\n",
       "      <td>16,67</td>\n",
       "      <td>38,89</td>\n",
       "      <td>44,44</td>\n",
       "      <td>2</td>\n",
       "      <td>33,05</td>\n",
       "      <td>4,78</td>\n",
       "      <td>0,317</td>\n",
       "      <td>0,149</td>\n",
       "      <td>0,534</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>163</th>\n",
       "      <td>Poland</td>\n",
       "      <td>EASTERN EUROPE</td>\n",
       "      <td>38536869</td>\n",
       "      <td>312685</td>\n",
       "      <td>123,3</td>\n",
       "      <td>0,16</td>\n",
       "      <td>-0,49</td>\n",
       "      <td>8,51</td>\n",
       "      <td>11100.0</td>\n",
       "      <td>99,8</td>\n",
       "      <td>306,3</td>\n",
       "      <td>45,91</td>\n",
       "      <td>1,12</td>\n",
       "      <td>52,97</td>\n",
       "      <td>3</td>\n",
       "      <td>9,85</td>\n",
       "      <td>9,89</td>\n",
       "      <td>0,05</td>\n",
       "      <td>0,311</td>\n",
       "      <td>0,64</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>185</th>\n",
       "      <td>Slovakia</td>\n",
       "      <td>EASTERN EUROPE</td>\n",
       "      <td>5439448</td>\n",
       "      <td>48845</td>\n",
       "      <td>111,4</td>\n",
       "      <td>0,00</td>\n",
       "      <td>0,3</td>\n",
       "      <td>7,41</td>\n",
       "      <td>13300.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>220,1</td>\n",
       "      <td>30,16</td>\n",
       "      <td>2,62</td>\n",
       "      <td>67,22</td>\n",
       "      <td>3</td>\n",
       "      <td>10,65</td>\n",
       "      <td>9,45</td>\n",
       "      <td>0,035</td>\n",
       "      <td>0,294</td>\n",
       "      <td>0,672</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               Country                               Region  Population  \\\n",
       "127          Maldives         ASIA (EX. NEAR EAST)               359008   \n",
       "10              Aruba               LATIN AMER. & CARIB           71891   \n",
       "130  Marshall Islands   OCEANIA                                   60422   \n",
       "163            Poland   EASTERN EUROPE                         38536869   \n",
       "185          Slovakia   EASTERN EUROPE                          5439448   \n",
       "\n",
       "     Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio)  \\\n",
       "127             300                     1196,7                       214,67   \n",
       "10              193                      372,5                        35,49   \n",
       "130           11854                        5,1                         3,12   \n",
       "163          312685                      123,3                         0,16   \n",
       "185           48845                      111,4                         0,00   \n",
       "\n",
       "    Net migration Infant mortality (per 1000 births)  GDP ($ per capita)  \\\n",
       "127             0                              56,52              3900.0   \n",
       "10              0                               5,89             28000.0   \n",
       "130         -6,04                              29,45              1600.0   \n",
       "163         -0,49                               8,51             11100.0   \n",
       "185           0,3                               7,41             13300.0   \n",
       "\n",
       "    Literacy (%) Phones (per 1000) Arable (%) Crops (%) Other (%) Climate  \\\n",
       "127         97,2              90,0      13,33     16,67        70       2   \n",
       "10          97,0             516,1      10,53         0     89,47       2   \n",
       "130         93,7              91,2      16,67     38,89     44,44       2   \n",
       "163         99,8             306,3      45,91      1,12     52,97       3   \n",
       "185          NaN             220,1      30,16      2,62     67,22       3   \n",
       "\n",
       "    Birthrate Deathrate Agriculture Industry Service  \n",
       "127     34,81      7,06         0,2     0,18    0,62  \n",
       "10      11,03      6,68       0,004    0,333   0,663  \n",
       "130     33,05      4,78       0,317    0,149   0,534  \n",
       "163      9,85      9,89        0,05    0,311    0,64  \n",
       "185     10,65      9,45       0,035    0,294   0,672  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load data\n",
    "countriesdatapath ='data/'\n",
    "CoTW_df = pd.read_csv(countriesdatapath+'CountriesOfTheWorld.csv')\n",
    "CoTW_df.sample(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "What is the list of countries in the gapminder and CoTW datasets?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina', 'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium']\n"
     ]
    }
   ],
   "source": [
    "gap_country_names_series=gap_df['country'].drop_duplicates()\n",
    "print(gap_country_names_series.to_list()[:10]) #if you print the length, you will get a result of 142"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Afghanistan ' 'Albania ' 'Algeria ' 'American Samoa ' 'Andorra '\n",
      " 'Angola ' 'Anguilla ' 'Antigua & Barbuda ' 'Argentina ' 'Armenia ']\n"
     ]
    }
   ],
   "source": [
    "CoTW_country_names=CoTW_df['Country'].unique()\n",
    "print(CoTW_country_names[:10]) #if you print the length, you will get a result of 227"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Afghanistan' 'Albania' 'Algeria' 'American Samoa' 'Andorra' 'Angola'\n",
      " 'Anguilla' 'Antigua & Barbuda' 'Argentina' 'Armenia']\n"
     ]
    }
   ],
   "source": [
    "# Let's get rid of the trailing space in the records\n",
    "CoTW_df['Country']=CoTW_df.Country.apply(lambda x: x[:-1])\n",
    "# Alternative: CoTW_df['Country']=CoTW_df.Country.apply(lambda x: x.rstrip()) \n",
    "# Alternate way would be to `strip' the space (in this case, rstrip would also work)\n",
    "CoTW_country_names=CoTW_df['Country'].unique()\n",
    "print(CoTW_country_names[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Apply & Lambda\n",
    "- We just saw the use of two very interesting tools (which often go hand-in-hand when processing a data frame):\n",
    "    1. Apply: Apply a function along an axis of the data frame. It can also be used over a column. \n",
    "        * See details at the <a href=\"https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html\">doc page</a> \n",
    "    2. Lambda: A small anonymous function. You can of-course use a named function inside your lambda, if you need something more complicated.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Rank</th>\n",
       "      <th>NOC</th>\n",
       "      <th>Gold</th>\n",
       "      <th>Silver</th>\n",
       "      <th>Bronze</th>\n",
       "      <th>Total</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>84</th>\n",
       "      <td>78</td>\n",
       "      <td>Trinidad and Tobago</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>16</td>\n",
       "      <td>Jamaica</td>\n",
       "      <td>6</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>35</td>\n",
       "      <td>Thailand</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>United States</td>\n",
       "      <td>46</td>\n",
       "      <td>37</td>\n",
       "      <td>38</td>\n",
       "      <td>121</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>11</td>\n",
       "      <td>Netherlands</td>\n",
       "      <td>8</td>\n",
       "      <td>7</td>\n",
       "      <td>4</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Rank                  NOC  Gold  Silver  Bronze  Total\n",
       "84   78  Trinidad and Tobago     0       0       1      1\n",
       "15   16              Jamaica     6       3       2     11\n",
       "35   35             Thailand     2       2       2      6\n",
       "0     1        United States    46      37      38    121\n",
       "10   11          Netherlands     8       7       4     19"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load data\n",
    "olymp_df=pd.read_html(r'https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table')\n",
    "olymp2016medals=olymp_df[2][:-1]\n",
    "olymp2016medals.sample(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "#### The following code was there because previously, each country name in the table was followed with a (ThreeLetterAcronym)\n",
    "olymp2016medals['NOC']=olymp2016medals.NOC.apply(lambda x: x.split(\"(\",1)[0][:-1])\n",
    "olymp2016medals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Bahamas', 'Brazil*', 'Chinese Taipei', 'Great Britain', 'Independent Olympic Athletes', 'Ivory Coast', 'Kosovo', 'North Korea', 'South Korea', 'Trinidad and Tobago']\n",
      "['Afghanistan', 'Albania', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antigua & Barbuda', 'Aruba', 'Bahamas, The', 'Bangladesh', 'Barbados', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia & Herzegovina', 'Botswana', 'Brazil', 'British Virgin Is.', 'Brunei', 'Burkina Faso', 'Burma', 'Cambodia', 'Cameroon', 'Cape Verde', 'Cayman Islands', 'Central African Rep.', 'Chad', 'Chile', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Repub. of the', 'Cook Islands', 'Costa Rica', \"Cote d'Ivoire\", 'Cyprus', 'Djibouti', 'Dominica', 'East Timor', 'Ecuador', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Faroe Islands', 'French Guiana', 'French Polynesia', 'Gabon', 'Gambia, The', 'Gaza Strip', 'Ghana', 'Gibraltar', 'Greenland', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras', 'Hong Kong', 'Iceland', 'Iraq', 'Isle of Man', 'Jersey', 'Kiribati', 'Korea, North', 'Korea, South', 'Kuwait', 'Kyrgyzstan', 'Laos', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Luxembourg', 'Macau', 'Macedonia', 'Madagascar', 'Malawi', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Micronesia, Fed. St.', 'Moldova', 'Monaco', 'Montserrat', 'Mozambique', 'N. Mariana Islands', 'Namibia', 'Nauru', 'Nepal', 'Netherlands Antilles', 'New Caledonia', 'Nicaragua', 'Oman', 'Pakistan', 'Palau', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Reunion', 'Rwanda', 'Saint Helena', 'Saint Kitts & Nevis', 'Saint Lucia', 'Saint Vincent and the Grenadines', 'Samoa', 'San Marino', 'Sao Tome & Principe', 'Saudi Arabia', 'Senegal', 'Seychelles', 'Sierra Leone', 'Solomon Islands', 'Somalia', 'Sri Lanka', 'St Pierre & Miquelon', 'Sudan', 'Suriname', 'Swaziland', 'Syria', 'Taiwan', 'Tanzania', 'Togo', 'Tonga', 'Trinidad & Tobago', 'Turkmenistan', 'Turks & Caicos Is', 'Tuvalu', 'Uganda', 'United Kingdom', 'Uruguay', 'Vanuatu', 'Virgin Islands', 'Wallis and Futuna', 'West Bank', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe']\n"
     ]
    }
   ],
   "source": [
    "print(sorted(set(olymp2016medals['NOC'])-set(CoTW_country_names)) )\n",
    "print(sorted(set(CoTW_country_names)-set(olymp2016medals['NOC'])))\n",
    "# Note that several of the same countries are named or spelled differently in the two lists"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Independent Olympic Athletes', 'Kosovo']\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\datta\\anaconda3\\lib\\site-packages\\pandas\\core\\indexing.py:1765: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  isetter(loc, value)\n"
     ]
    }
   ],
   "source": [
    "CoTW_df.loc[(CoTW_df.Country==\"Bahamas, The\"),'Country']='Bahamas'\n",
    "CoTW_df.loc[(CoTW_df.Country=='Taiwan'),'Country']='Chinese Taipei'\n",
    "CoTW_df.loc[(CoTW_df.Country=='United Kingdom'),'Country']='Great Britain'\n",
    "CoTW_df.loc[(CoTW_df.Country==\"Cote d'Ivoire\"),'Country']='Ivory Coast'\n",
    "CoTW_df.loc[(CoTW_df.Country=='Korea, North'),'Country']='North Korea'\n",
    "CoTW_df.loc[(CoTW_df.Country=='Korea, South'),'Country']='South Korea'\n",
    "CoTW_df.loc[(CoTW_df.Country==\"Trinidad & Tobago\"),'Country']='Trinidad and Tobago'\n",
    "olymp2016medals.loc[(olymp2016medals.NOC==\"Brazil*\"),'NOC']='Brazil'\n",
    "CoTW_country_names=CoTW_df['Country'].unique()\n",
    "print(sorted(set(olymp2016medals['NOC'])-set(CoTW_country_names)) )\n",
    "#print(sorted(set(CoTW_country_names)-set(olymp2016medals['NOC'])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Need to `fix' all the anomalous data, particularly if we want to correlate the information across the data frames. Above, I did this manually.\n",
    "\n",
    "Talking points: \n",
    "1. Challenges to do this at scale.\n",
    "    * Check for similarity of words: e.g., Korea, North vs North Korea; Trindad & Tobago vs Trinidad and Tobago\n",
    "    * What about Cote d'Ivoire vs Ivory Coast? Swaziland versus Switzerland? Niger vs Nigeria?\n",
    "    * What to do with United Kingdom vs Great Britain, Taiwan vs Chinese Taipei?\n",
    "2. Sometimes, you may be lucky and can leverage on Named entity recognition and matching tools.\n",
    "\n",
    "Caveat: Different domains may be more or less amenable to certain solutions/heuristics. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Detour\n",
    "\n",
    "For the particular problem at hand, namely, each country has been spelled variously, and we want to identify them to be equivalent, instead of writing our own tool, we may leverage on a nice tool someone already built!\n",
    "\n",
    "Namely, country_converter: https://github.com/konstantinstadler/country_converter\n",
    "\n",
    "There's also pycountry, which is useful in some related cases, but it maynot be suitable for our particular scenario (at least, I didn't figure out how to use it for our problem): https://pypi.org/project/pycountry/ "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "#!pip install country_converter --upgrade\n",
    "import country_converter as coco"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Alice's Wonderland not found in regex\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Niger', 'Nigeria', 'North Korea', 'North Korea', \"Cote d'Ivoire\", \"Cote d'Ivoire\", 'Brazil', 'Brazil', 'United Kingdom', 'United Kingdom', 'Taiwan', 'Taiwan', 'Trinidad and Tobago', 'Trinidad and Tobago', 'not found']\n"
     ]
    }
   ],
   "source": [
    "# sanity check\n",
    "some_names = ['Niger', 'Nigeria', 'Korea, North', \"North Korea\", \"Cote d'Ivoire\", \"Ivory Coast\", \"Brazil*\", \"Brazil\", \"Great Britain\", \"United Kingdom\", \"Taiwan\", \"Chinese taipei\", \"Trinidad & Tobago\", \"Trinidad and Tobago\", \"Alice's Wonderland\"]\n",
    "standard_names = coco.convert(names=some_names, to='name_short')\n",
    "print(standard_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Merge data frames\n",
    "- how: {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’\n",
    "    * <a href=\"https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html\">Pandas doc</a>\n",
    "    * A nice discussion on various options how merger is done can be at <a href=\"https://stackoverflow.com/questions/448023/what-is-the-difference-between-left-right-outer-and-inner-joins\">Stack Overflow</a>. [Below figure is from there.]\n",
    "    \n",
    "<img src=\"pics/joins-overflow.png\" alt=\"kinds of joins\" width=\"350\">    \n",
    "\n",
    "- Note that Pandas has also a Join method, which however in-fact join columns with other DataFrame either on index or on a key column. <a href=\"https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html\">A good resource to check</a>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Rank</th>\n",
       "      <th>NOC</th>\n",
       "      <th>Gold</th>\n",
       "      <th>Silver</th>\n",
       "      <th>Bronze</th>\n",
       "      <th>Total</th>\n",
       "      <th>Country</th>\n",
       "      <th>Population</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>57</th>\n",
       "      <td>60</td>\n",
       "      <td>Malaysia</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>Malaysia</td>\n",
       "      <td>24385858</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58</th>\n",
       "      <td>61</td>\n",
       "      <td>Mexico</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>Mexico</td>\n",
       "      <td>107449525</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59</th>\n",
       "      <td>62</td>\n",
       "      <td>Venezuela</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>Venezuela</td>\n",
       "      <td>25730435</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>60</th>\n",
       "      <td>63</td>\n",
       "      <td>Algeria</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>Algeria</td>\n",
       "      <td>32930091</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>64</th>\n",
       "      <td>67</td>\n",
       "      <td>India</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>India</td>\n",
       "      <td>1095351995</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>69</th>\n",
       "      <td>69</td>\n",
       "      <td>Philippines</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Philippines</td>\n",
       "      <td>89468677</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>72</th>\n",
       "      <td>75</td>\n",
       "      <td>Egypt</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>Egypt</td>\n",
       "      <td>78887007</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>79</th>\n",
       "      <td>78</td>\n",
       "      <td>Morocco</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Morocco</td>\n",
       "      <td>33241259</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>80</th>\n",
       "      <td>78</td>\n",
       "      <td>Nigeria</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Nigeria</td>\n",
       "      <td>131859731</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Rank          NOC  Gold  Silver  Bronze  Total      Country  Population\n",
       "57   60     Malaysia     0       4       1      5     Malaysia    24385858\n",
       "58   61       Mexico     0       3       2      5       Mexico   107449525\n",
       "59   62    Venezuela     0       2       1      3    Venezuela    25730435\n",
       "60   63      Algeria     0       2       0      2      Algeria    32930091\n",
       "64   67        India     0       1       1      2        India  1095351995\n",
       "69   69  Philippines     0       1       0      1  Philippines    89468677\n",
       "72   75        Egypt     0       0       3      3        Egypt    78887007\n",
       "79   78      Morocco     0       0       1      1      Morocco    33241259\n",
       "80   78      Nigeria     0       0       1      1      Nigeria   131859731"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Assuming that the data has been `cleaned' \n",
    "# Determine all the countries with a population of more than 20 million \n",
    "# as per the CountriesOfTheWorld table, that got no gold medals.  \n",
    "Merged_df=olymp2016medals.merge(CoTW_df[['Country','Population']],left_on='NOC', right_on='Country')\n",
    "Merged_df[(Merged_df['Population']>20000000) & (Merged_df['Gold']==0)]\n",
    "# Note: For the specific purpose, from CoTW_df, I do not need the other columns than Country and Population\n",
    "# I could have cleaned the data more, e.g. NOC and Country columns are redundant."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<b>Ungraded task 2.1: Ratio of total medals to per capita GDP</b><br><br>\n",
    "\n",
    "Determine the ratio of the number of total medals to the per capita GDP for each country as per the CountriesOfTheWorld table and sort them in decreasing order. Compare the results with others.\n",
    "\n",
    "- Discuss: What subtleties are involved? \n",
    "    * How do you account for the countries which are not in the original Olympics table?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Groupby, Hierarchical index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "year\n",
       "1952    28.801\n",
       "1957    30.332\n",
       "1962    31.997\n",
       "1967    34.020\n",
       "1972    35.400\n",
       "1977    31.220\n",
       "1982    38.445\n",
       "1987    39.906\n",
       "1992    23.599\n",
       "1997    36.087\n",
       "2002    39.193\n",
       "2007    39.613\n",
       "Name: lifeExp, dtype: float64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# For each year for which data is available in the gap_df Dataframe\n",
    "# What was the minimum recorded life expectancy?\n",
    "gap_df.groupby([\"year\"])[\"lifeExp\"].min()                                                    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>country</th>\n",
       "      <th>lifeExp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1952</td>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>28.801</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1957</td>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>30.332</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1962</td>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>31.997</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1967</td>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>34.020</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1348</th>\n",
       "      <td>1972</td>\n",
       "      <td>Sierra Leone</td>\n",
       "      <td>35.400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>221</th>\n",
       "      <td>1977</td>\n",
       "      <td>Cambodia</td>\n",
       "      <td>31.220</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1350</th>\n",
       "      <td>1982</td>\n",
       "      <td>Sierra Leone</td>\n",
       "      <td>38.445</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43</th>\n",
       "      <td>1987</td>\n",
       "      <td>Angola</td>\n",
       "      <td>39.906</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1292</th>\n",
       "      <td>1992</td>\n",
       "      <td>Rwanda</td>\n",
       "      <td>23.599</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1293</th>\n",
       "      <td>1997</td>\n",
       "      <td>Rwanda</td>\n",
       "      <td>36.087</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1690</th>\n",
       "      <td>2002</td>\n",
       "      <td>Zambia</td>\n",
       "      <td>39.193</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1463</th>\n",
       "      <td>2007</td>\n",
       "      <td>Swaziland</td>\n",
       "      <td>39.613</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      year       country  lifeExp\n",
       "0     1952   Afghanistan   28.801\n",
       "1     1957   Afghanistan   30.332\n",
       "2     1962   Afghanistan   31.997\n",
       "3     1967   Afghanistan   34.020\n",
       "1348  1972  Sierra Leone   35.400\n",
       "221   1977      Cambodia   31.220\n",
       "1350  1982  Sierra Leone   38.445\n",
       "43    1987        Angola   39.906\n",
       "1292  1992        Rwanda   23.599\n",
       "1293  1997        Rwanda   36.087\n",
       "1690  2002        Zambia   39.193\n",
       "1463  2007     Swaziland   39.613"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# What if we also want to know, which country had said minimum life expectancy? \n",
    "gap_df.loc[gap_df.groupby([\"year\"])[\"lifeExp\"].idxmin()][['year','country','lifeExp']]                                                        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "scrolled": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>Population</th>\n",
       "      <th>Area (sq. mi.)</th>\n",
       "      <th>GDP ($ per capita)</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Region</th>\n",
       "      <th>Country</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">ASIA (EX. NEAR EAST)</th>\n",
       "      <th>Afghanistan</th>\n",
       "      <td>31056997</td>\n",
       "      <td>647500</td>\n",
       "      <td>700.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Bangladesh</th>\n",
       "      <td>147365352</td>\n",
       "      <td>144000</td>\n",
       "      <td>1900.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Bhutan</th>\n",
       "      <td>2279723</td>\n",
       "      <td>47000</td>\n",
       "      <td>1300.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Brunei</th>\n",
       "      <td>379444</td>\n",
       "      <td>5770</td>\n",
       "      <td>18600.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Burma</th>\n",
       "      <td>47382633</td>\n",
       "      <td>678500</td>\n",
       "      <td>1800.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">WESTERN EUROPE</th>\n",
       "      <th>Portugal</th>\n",
       "      <td>10605870</td>\n",
       "      <td>92391</td>\n",
       "      <td>18000.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>San Marino</th>\n",
       "      <td>29251</td>\n",
       "      <td>61</td>\n",
       "      <td>34600.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Spain</th>\n",
       "      <td>40397842</td>\n",
       "      <td>504782</td>\n",
       "      <td>22000.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sweden</th>\n",
       "      <td>9016596</td>\n",
       "      <td>449964</td>\n",
       "      <td>26800.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Switzerland</th>\n",
       "      <td>7523934</td>\n",
       "      <td>41290</td>\n",
       "      <td>32700.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>227 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                 Population  Area (sq. mi.)  \\\n",
       "Region                              Country                                   \n",
       "ASIA (EX. NEAR EAST)                Afghanistan    31056997          647500   \n",
       "                                    Bangladesh    147365352          144000   \n",
       "                                    Bhutan          2279723           47000   \n",
       "                                    Brunei           379444            5770   \n",
       "                                    Burma          47382633          678500   \n",
       "...                                                     ...             ...   \n",
       "WESTERN EUROPE                      Portugal       10605870           92391   \n",
       "                                    San Marino        29251              61   \n",
       "                                    Spain          40397842          504782   \n",
       "                                    Sweden          9016596          449964   \n",
       "                                    Switzerland     7523934           41290   \n",
       "\n",
       "                                                 GDP ($ per capita)  \n",
       "Region                              Country                          \n",
       "ASIA (EX. NEAR EAST)                Afghanistan               700.0  \n",
       "                                    Bangladesh               1900.0  \n",
       "                                    Bhutan                   1300.0  \n",
       "                                    Brunei                  18600.0  \n",
       "                                    Burma                    1800.0  \n",
       "...                                                             ...  \n",
       "WESTERN EUROPE                      Portugal                18000.0  \n",
       "                                    San Marino              34600.0  \n",
       "                                    Spain                   22000.0  \n",
       "                                    Sweden                  26800.0  \n",
       "                                    Switzerland             32700.0  \n",
       "\n",
       "[227 rows x 3 columns]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Groupby can be useful in creating hierarchical indexing\n",
    "# e.g., When we want to check the data by regions.\n",
    "CoTW_df.groupby(['Region','Country']).mean()\n",
    "# Many of the data, though representing numeric values, are in-fact treated as string in the current Dataframe. \n",
    "# Refer to Ungraded Task 2.2 (below) for more on this. The nature of the data may limit what can be computed.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>Population</th>\n",
       "      <th>Area (sq. mi.)</th>\n",
       "      <th>Pop. Density (per sq. mi.)</th>\n",
       "      <th>Coastline (coast/area ratio)</th>\n",
       "      <th>Net migration</th>\n",
       "      <th>Infant mortality (per 1000 births)</th>\n",
       "      <th>GDP ($ per capita)</th>\n",
       "      <th>Literacy (%)</th>\n",
       "      <th>Phones (per 1000)</th>\n",
       "      <th>Arable (%)</th>\n",
       "      <th>Crops (%)</th>\n",
       "      <th>Other (%)</th>\n",
       "      <th>Climate</th>\n",
       "      <th>Birthrate</th>\n",
       "      <th>Deathrate</th>\n",
       "      <th>Agriculture</th>\n",
       "      <th>Industry</th>\n",
       "      <th>Service</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Region</th>\n",
       "      <th>Country</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">ASIA (EX. NEAR EAST)</th>\n",
       "      <th>Afghanistan</th>\n",
       "      <td>31056997</td>\n",
       "      <td>647500</td>\n",
       "      <td>48,0</td>\n",
       "      <td>0,00</td>\n",
       "      <td>23,06</td>\n",
       "      <td>163,07</td>\n",
       "      <td>700.0</td>\n",
       "      <td>36,0</td>\n",
       "      <td>3,2</td>\n",
       "      <td>12,13</td>\n",
       "      <td>0,22</td>\n",
       "      <td>87,65</td>\n",
       "      <td>1</td>\n",
       "      <td>46,6</td>\n",
       "      <td>20,34</td>\n",
       "      <td>0,38</td>\n",
       "      <td>0,24</td>\n",
       "      <td>0,38</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Bangladesh</th>\n",
       "      <td>147365352</td>\n",
       "      <td>144000</td>\n",
       "      <td>1023,4</td>\n",
       "      <td>0,40</td>\n",
       "      <td>-0,71</td>\n",
       "      <td>62,6</td>\n",
       "      <td>1900.0</td>\n",
       "      <td>43,1</td>\n",
       "      <td>7,3</td>\n",
       "      <td>62,11</td>\n",
       "      <td>3,07</td>\n",
       "      <td>34,82</td>\n",
       "      <td>2</td>\n",
       "      <td>29,8</td>\n",
       "      <td>8,27</td>\n",
       "      <td>0,199</td>\n",
       "      <td>0,198</td>\n",
       "      <td>0,603</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Bhutan</th>\n",
       "      <td>2279723</td>\n",
       "      <td>47000</td>\n",
       "      <td>48,5</td>\n",
       "      <td>0,00</td>\n",
       "      <td>0</td>\n",
       "      <td>100,44</td>\n",
       "      <td>1300.0</td>\n",
       "      <td>42,2</td>\n",
       "      <td>14,3</td>\n",
       "      <td>3,09</td>\n",
       "      <td>0,43</td>\n",
       "      <td>96,48</td>\n",
       "      <td>2</td>\n",
       "      <td>33,65</td>\n",
       "      <td>12,7</td>\n",
       "      <td>0,258</td>\n",
       "      <td>0,379</td>\n",
       "      <td>0,363</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Brunei</th>\n",
       "      <td>379444</td>\n",
       "      <td>5770</td>\n",
       "      <td>65,8</td>\n",
       "      <td>2,79</td>\n",
       "      <td>3,59</td>\n",
       "      <td>12,61</td>\n",
       "      <td>18600.0</td>\n",
       "      <td>93,9</td>\n",
       "      <td>237,2</td>\n",
       "      <td>0,57</td>\n",
       "      <td>0,76</td>\n",
       "      <td>98,67</td>\n",
       "      <td>2</td>\n",
       "      <td>18,79</td>\n",
       "      <td>3,45</td>\n",
       "      <td>0,036</td>\n",
       "      <td>0,561</td>\n",
       "      <td>0,403</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Burma</th>\n",
       "      <td>47382633</td>\n",
       "      <td>678500</td>\n",
       "      <td>69,8</td>\n",
       "      <td>0,28</td>\n",
       "      <td>-1,8</td>\n",
       "      <td>67,24</td>\n",
       "      <td>1800.0</td>\n",
       "      <td>85,3</td>\n",
       "      <td>10,1</td>\n",
       "      <td>15,19</td>\n",
       "      <td>0,97</td>\n",
       "      <td>83,84</td>\n",
       "      <td>2</td>\n",
       "      <td>17,91</td>\n",
       "      <td>9,83</td>\n",
       "      <td>0,564</td>\n",
       "      <td>0,082</td>\n",
       "      <td>0,353</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">WESTERN EUROPE</th>\n",
       "      <th>Portugal</th>\n",
       "      <td>10605870</td>\n",
       "      <td>92391</td>\n",
       "      <td>114,8</td>\n",
       "      <td>1,94</td>\n",
       "      <td>3,57</td>\n",
       "      <td>5,05</td>\n",
       "      <td>18000.0</td>\n",
       "      <td>93,3</td>\n",
       "      <td>399,2</td>\n",
       "      <td>21,75</td>\n",
       "      <td>7,81</td>\n",
       "      <td>70,44</td>\n",
       "      <td>3</td>\n",
       "      <td>10,72</td>\n",
       "      <td>10,5</td>\n",
       "      <td>0,053</td>\n",
       "      <td>0,274</td>\n",
       "      <td>0,673</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>San Marino</th>\n",
       "      <td>29251</td>\n",
       "      <td>61</td>\n",
       "      <td>479,5</td>\n",
       "      <td>0,00</td>\n",
       "      <td>10,98</td>\n",
       "      <td>5,73</td>\n",
       "      <td>34600.0</td>\n",
       "      <td>96,0</td>\n",
       "      <td>704,3</td>\n",
       "      <td>16,67</td>\n",
       "      <td>0</td>\n",
       "      <td>83,33</td>\n",
       "      <td>NaN</td>\n",
       "      <td>10,02</td>\n",
       "      <td>8,17</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Spain</th>\n",
       "      <td>40397842</td>\n",
       "      <td>504782</td>\n",
       "      <td>80,0</td>\n",
       "      <td>0,98</td>\n",
       "      <td>0,99</td>\n",
       "      <td>4,42</td>\n",
       "      <td>22000.0</td>\n",
       "      <td>97,9</td>\n",
       "      <td>453,5</td>\n",
       "      <td>26,07</td>\n",
       "      <td>9,87</td>\n",
       "      <td>64,06</td>\n",
       "      <td>3</td>\n",
       "      <td>10,06</td>\n",
       "      <td>9,72</td>\n",
       "      <td>0,04</td>\n",
       "      <td>0,295</td>\n",
       "      <td>0,665</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sweden</th>\n",
       "      <td>9016596</td>\n",
       "      <td>449964</td>\n",
       "      <td>20,0</td>\n",
       "      <td>0,72</td>\n",
       "      <td>1,67</td>\n",
       "      <td>2,77</td>\n",
       "      <td>26800.0</td>\n",
       "      <td>99,0</td>\n",
       "      <td>715,0</td>\n",
       "      <td>6,54</td>\n",
       "      <td>0,01</td>\n",
       "      <td>93,45</td>\n",
       "      <td>3</td>\n",
       "      <td>10,27</td>\n",
       "      <td>10,31</td>\n",
       "      <td>0,011</td>\n",
       "      <td>0,282</td>\n",
       "      <td>0,707</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Switzerland</th>\n",
       "      <td>7523934</td>\n",
       "      <td>41290</td>\n",
       "      <td>182,2</td>\n",
       "      <td>0,00</td>\n",
       "      <td>4,05</td>\n",
       "      <td>4,39</td>\n",
       "      <td>32700.0</td>\n",
       "      <td>99,0</td>\n",
       "      <td>680,9</td>\n",
       "      <td>10,42</td>\n",
       "      <td>0,61</td>\n",
       "      <td>88,97</td>\n",
       "      <td>3</td>\n",
       "      <td>9,71</td>\n",
       "      <td>8,49</td>\n",
       "      <td>0,015</td>\n",
       "      <td>0,34</td>\n",
       "      <td>0,645</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>227 rows × 18 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                 Population  Area (sq. mi.)  \\\n",
       "Region                              Country                                   \n",
       "ASIA (EX. NEAR EAST)                Afghanistan    31056997          647500   \n",
       "                                    Bangladesh    147365352          144000   \n",
       "                                    Bhutan          2279723           47000   \n",
       "                                    Brunei           379444            5770   \n",
       "                                    Burma          47382633          678500   \n",
       "...                                                     ...             ...   \n",
       "WESTERN EUROPE                      Portugal       10605870           92391   \n",
       "                                    San Marino        29251              61   \n",
       "                                    Spain          40397842          504782   \n",
       "                                    Sweden          9016596          449964   \n",
       "                                    Switzerland     7523934           41290   \n",
       "\n",
       "                                                Pop. Density (per sq. mi.)  \\\n",
       "Region                              Country                                  \n",
       "ASIA (EX. NEAR EAST)                Afghanistan                       48,0   \n",
       "                                    Bangladesh                      1023,4   \n",
       "                                    Bhutan                            48,5   \n",
       "                                    Brunei                            65,8   \n",
       "                                    Burma                             69,8   \n",
       "...                                                                    ...   \n",
       "WESTERN EUROPE                      Portugal                         114,8   \n",
       "                                    San Marino                       479,5   \n",
       "                                    Spain                             80,0   \n",
       "                                    Sweden                            20,0   \n",
       "                                    Switzerland                      182,2   \n",
       "\n",
       "                                                Coastline (coast/area ratio)  \\\n",
       "Region                              Country                                    \n",
       "ASIA (EX. NEAR EAST)                Afghanistan                         0,00   \n",
       "                                    Bangladesh                          0,40   \n",
       "                                    Bhutan                              0,00   \n",
       "                                    Brunei                              2,79   \n",
       "                                    Burma                               0,28   \n",
       "...                                                                      ...   \n",
       "WESTERN EUROPE                      Portugal                            1,94   \n",
       "                                    San Marino                          0,00   \n",
       "                                    Spain                               0,98   \n",
       "                                    Sweden                              0,72   \n",
       "                                    Switzerland                         0,00   \n",
       "\n",
       "                                                Net migration  \\\n",
       "Region                              Country                     \n",
       "ASIA (EX. NEAR EAST)                Afghanistan         23,06   \n",
       "                                    Bangladesh          -0,71   \n",
       "                                    Bhutan                  0   \n",
       "                                    Brunei               3,59   \n",
       "                                    Burma                -1,8   \n",
       "...                                                       ...   \n",
       "WESTERN EUROPE                      Portugal             3,57   \n",
       "                                    San Marino          10,98   \n",
       "                                    Spain                0,99   \n",
       "                                    Sweden               1,67   \n",
       "                                    Switzerland          4,05   \n",
       "\n",
       "                                                Infant mortality (per 1000 births)  \\\n",
       "Region                              Country                                          \n",
       "ASIA (EX. NEAR EAST)                Afghanistan                             163,07   \n",
       "                                    Bangladesh                                62,6   \n",
       "                                    Bhutan                                  100,44   \n",
       "                                    Brunei                                   12,61   \n",
       "                                    Burma                                    67,24   \n",
       "...                                                                            ...   \n",
       "WESTERN EUROPE                      Portugal                                  5,05   \n",
       "                                    San Marino                                5,73   \n",
       "                                    Spain                                     4,42   \n",
       "                                    Sweden                                    2,77   \n",
       "                                    Switzerland                               4,39   \n",
       "\n",
       "                                                 GDP ($ per capita)  \\\n",
       "Region                              Country                           \n",
       "ASIA (EX. NEAR EAST)                Afghanistan               700.0   \n",
       "                                    Bangladesh               1900.0   \n",
       "                                    Bhutan                   1300.0   \n",
       "                                    Brunei                  18600.0   \n",
       "                                    Burma                    1800.0   \n",
       "...                                                             ...   \n",
       "WESTERN EUROPE                      Portugal                18000.0   \n",
       "                                    San Marino              34600.0   \n",
       "                                    Spain                   22000.0   \n",
       "                                    Sweden                  26800.0   \n",
       "                                    Switzerland             32700.0   \n",
       "\n",
       "                                                Literacy (%)  \\\n",
       "Region                              Country                    \n",
       "ASIA (EX. NEAR EAST)                Afghanistan         36,0   \n",
       "                                    Bangladesh          43,1   \n",
       "                                    Bhutan              42,2   \n",
       "                                    Brunei              93,9   \n",
       "                                    Burma               85,3   \n",
       "...                                                      ...   \n",
       "WESTERN EUROPE                      Portugal            93,3   \n",
       "                                    San Marino          96,0   \n",
       "                                    Spain               97,9   \n",
       "                                    Sweden              99,0   \n",
       "                                    Switzerland         99,0   \n",
       "\n",
       "                                                Phones (per 1000) Arable (%)  \\\n",
       "Region                              Country                                    \n",
       "ASIA (EX. NEAR EAST)                Afghanistan               3,2      12,13   \n",
       "                                    Bangladesh                7,3      62,11   \n",
       "                                    Bhutan                   14,3       3,09   \n",
       "                                    Brunei                  237,2       0,57   \n",
       "                                    Burma                    10,1      15,19   \n",
       "...                                                           ...        ...   \n",
       "WESTERN EUROPE                      Portugal                399,2      21,75   \n",
       "                                    San Marino              704,3      16,67   \n",
       "                                    Spain                   453,5      26,07   \n",
       "                                    Sweden                  715,0       6,54   \n",
       "                                    Switzerland             680,9      10,42   \n",
       "\n",
       "                                                Crops (%) Other (%) Climate  \\\n",
       "Region                              Country                                   \n",
       "ASIA (EX. NEAR EAST)                Afghanistan      0,22     87,65       1   \n",
       "                                    Bangladesh       3,07     34,82       2   \n",
       "                                    Bhutan           0,43     96,48       2   \n",
       "                                    Brunei           0,76     98,67       2   \n",
       "                                    Burma            0,97     83,84       2   \n",
       "...                                                   ...       ...     ...   \n",
       "WESTERN EUROPE                      Portugal         7,81     70,44       3   \n",
       "                                    San Marino          0     83,33     NaN   \n",
       "                                    Spain            9,87     64,06       3   \n",
       "                                    Sweden           0,01     93,45       3   \n",
       "                                    Switzerland      0,61     88,97       3   \n",
       "\n",
       "                                                Birthrate Deathrate  \\\n",
       "Region                              Country                           \n",
       "ASIA (EX. NEAR EAST)                Afghanistan      46,6     20,34   \n",
       "                                    Bangladesh       29,8      8,27   \n",
       "                                    Bhutan          33,65      12,7   \n",
       "                                    Brunei          18,79      3,45   \n",
       "                                    Burma           17,91      9,83   \n",
       "...                                                   ...       ...   \n",
       "WESTERN EUROPE                      Portugal        10,72      10,5   \n",
       "                                    San Marino      10,02      8,17   \n",
       "                                    Spain           10,06      9,72   \n",
       "                                    Sweden          10,27     10,31   \n",
       "                                    Switzerland      9,71      8,49   \n",
       "\n",
       "                                                Agriculture Industry Service  \n",
       "Region                              Country                                   \n",
       "ASIA (EX. NEAR EAST)                Afghanistan        0,38     0,24    0,38  \n",
       "                                    Bangladesh        0,199    0,198   0,603  \n",
       "                                    Bhutan            0,258    0,379   0,363  \n",
       "                                    Brunei            0,036    0,561   0,403  \n",
       "                                    Burma             0,564    0,082   0,353  \n",
       "...                                                     ...      ...     ...  \n",
       "WESTERN EUROPE                      Portugal          0,053    0,274   0,673  \n",
       "                                    San Marino          NaN      NaN     NaN  \n",
       "                                    Spain              0,04    0,295   0,665  \n",
       "                                    Sweden            0,011    0,282   0,707  \n",
       "                                    Switzerland       0,015     0,34   0,645  \n",
       "\n",
       "[227 rows x 18 columns]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "CoTW_df.groupby(['Region','Country']).max()\n",
    "# With something like max, Pandas seems more generous in ignoring data type ;)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>Population</th>\n",
       "      <th>Area (sq. mi.)</th>\n",
       "      <th>Pop. Density (per sq. mi.)</th>\n",
       "      <th>Coastline (coast/area ratio)</th>\n",
       "      <th>Net migration</th>\n",
       "      <th>Infant mortality (per 1000 births)</th>\n",
       "      <th>GDP ($ per capita)</th>\n",
       "      <th>Literacy (%)</th>\n",
       "      <th>Phones (per 1000)</th>\n",
       "      <th>Arable (%)</th>\n",
       "      <th>Crops (%)</th>\n",
       "      <th>Other (%)</th>\n",
       "      <th>Climate</th>\n",
       "      <th>Birthrate</th>\n",
       "      <th>Deathrate</th>\n",
       "      <th>Agriculture</th>\n",
       "      <th>Industry</th>\n",
       "      <th>Service</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Region</th>\n",
       "      <th>Country</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">ASIA (EX. NEAR EAST)</th>\n",
       "      <th>Afghanistan</th>\n",
       "      <td>31056997</td>\n",
       "      <td>647500</td>\n",
       "      <td>48,0</td>\n",
       "      <td>0,00</td>\n",
       "      <td>23,06</td>\n",
       "      <td>163,07</td>\n",
       "      <td>700.0</td>\n",
       "      <td>36,0</td>\n",
       "      <td>3,2</td>\n",
       "      <td>12,13</td>\n",
       "      <td>0,22</td>\n",
       "      <td>87,65</td>\n",
       "      <td>1</td>\n",
       "      <td>46,6</td>\n",
       "      <td>20,34</td>\n",
       "      <td>0,38</td>\n",
       "      <td>0,24</td>\n",
       "      <td>0,38</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Bangladesh</th>\n",
       "      <td>147365352</td>\n",
       "      <td>144000</td>\n",
       "      <td>1023,4</td>\n",
       "      <td>0,40</td>\n",
       "      <td>-0,71</td>\n",
       "      <td>62,6</td>\n",
       "      <td>1900.0</td>\n",
       "      <td>43,1</td>\n",
       "      <td>7,3</td>\n",
       "      <td>62,11</td>\n",
       "      <td>3,07</td>\n",
       "      <td>34,82</td>\n",
       "      <td>2</td>\n",
       "      <td>29,8</td>\n",
       "      <td>8,27</td>\n",
       "      <td>0,199</td>\n",
       "      <td>0,198</td>\n",
       "      <td>0,603</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Bhutan</th>\n",
       "      <td>2279723</td>\n",
       "      <td>47000</td>\n",
       "      <td>48,5</td>\n",
       "      <td>0,00</td>\n",
       "      <td>0</td>\n",
       "      <td>100,44</td>\n",
       "      <td>1300.0</td>\n",
       "      <td>42,2</td>\n",
       "      <td>14,3</td>\n",
       "      <td>3,09</td>\n",
       "      <td>0,43</td>\n",
       "      <td>96,48</td>\n",
       "      <td>2</td>\n",
       "      <td>33,65</td>\n",
       "      <td>12,7</td>\n",
       "      <td>0,258</td>\n",
       "      <td>0,379</td>\n",
       "      <td>0,363</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Brunei</th>\n",
       "      <td>379444</td>\n",
       "      <td>5770</td>\n",
       "      <td>65,8</td>\n",
       "      <td>2,79</td>\n",
       "      <td>3,59</td>\n",
       "      <td>12,61</td>\n",
       "      <td>18600.0</td>\n",
       "      <td>93,9</td>\n",
       "      <td>237,2</td>\n",
       "      <td>0,57</td>\n",
       "      <td>0,76</td>\n",
       "      <td>98,67</td>\n",
       "      <td>2</td>\n",
       "      <td>18,79</td>\n",
       "      <td>3,45</td>\n",
       "      <td>0,036</td>\n",
       "      <td>0,561</td>\n",
       "      <td>0,403</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Burma</th>\n",
       "      <td>47382633</td>\n",
       "      <td>678500</td>\n",
       "      <td>69,8</td>\n",
       "      <td>0,28</td>\n",
       "      <td>-1,8</td>\n",
       "      <td>67,24</td>\n",
       "      <td>1800.0</td>\n",
       "      <td>85,3</td>\n",
       "      <td>10,1</td>\n",
       "      <td>15,19</td>\n",
       "      <td>0,97</td>\n",
       "      <td>83,84</td>\n",
       "      <td>2</td>\n",
       "      <td>17,91</td>\n",
       "      <td>9,83</td>\n",
       "      <td>0,564</td>\n",
       "      <td>0,082</td>\n",
       "      <td>0,353</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">WESTERN EUROPE</th>\n",
       "      <th>Portugal</th>\n",
       "      <td>10605870</td>\n",
       "      <td>92391</td>\n",
       "      <td>114,8</td>\n",
       "      <td>1,94</td>\n",
       "      <td>3,57</td>\n",
       "      <td>5,05</td>\n",
       "      <td>18000.0</td>\n",
       "      <td>93,3</td>\n",
       "      <td>399,2</td>\n",
       "      <td>21,75</td>\n",
       "      <td>7,81</td>\n",
       "      <td>70,44</td>\n",
       "      <td>3</td>\n",
       "      <td>10,72</td>\n",
       "      <td>10,5</td>\n",
       "      <td>0,053</td>\n",
       "      <td>0,274</td>\n",
       "      <td>0,673</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>San Marino</th>\n",
       "      <td>29251</td>\n",
       "      <td>61</td>\n",
       "      <td>479,5</td>\n",
       "      <td>0,00</td>\n",
       "      <td>10,98</td>\n",
       "      <td>5,73</td>\n",
       "      <td>34600.0</td>\n",
       "      <td>96,0</td>\n",
       "      <td>704,3</td>\n",
       "      <td>16,67</td>\n",
       "      <td>0</td>\n",
       "      <td>83,33</td>\n",
       "      <td>NaN</td>\n",
       "      <td>10,02</td>\n",
       "      <td>8,17</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Spain</th>\n",
       "      <td>40397842</td>\n",
       "      <td>504782</td>\n",
       "      <td>80,0</td>\n",
       "      <td>0,98</td>\n",
       "      <td>0,99</td>\n",
       "      <td>4,42</td>\n",
       "      <td>22000.0</td>\n",
       "      <td>97,9</td>\n",
       "      <td>453,5</td>\n",
       "      <td>26,07</td>\n",
       "      <td>9,87</td>\n",
       "      <td>64,06</td>\n",
       "      <td>3</td>\n",
       "      <td>10,06</td>\n",
       "      <td>9,72</td>\n",
       "      <td>0,04</td>\n",
       "      <td>0,295</td>\n",
       "      <td>0,665</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Sweden</th>\n",
       "      <td>9016596</td>\n",
       "      <td>449964</td>\n",
       "      <td>20,0</td>\n",
       "      <td>0,72</td>\n",
       "      <td>1,67</td>\n",
       "      <td>2,77</td>\n",
       "      <td>26800.0</td>\n",
       "      <td>99,0</td>\n",
       "      <td>715,0</td>\n",
       "      <td>6,54</td>\n",
       "      <td>0,01</td>\n",
       "      <td>93,45</td>\n",
       "      <td>3</td>\n",
       "      <td>10,27</td>\n",
       "      <td>10,31</td>\n",
       "      <td>0,011</td>\n",
       "      <td>0,282</td>\n",
       "      <td>0,707</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Switzerland</th>\n",
       "      <td>7523934</td>\n",
       "      <td>41290</td>\n",
       "      <td>182,2</td>\n",
       "      <td>0,00</td>\n",
       "      <td>4,05</td>\n",
       "      <td>4,39</td>\n",
       "      <td>32700.0</td>\n",
       "      <td>99,0</td>\n",
       "      <td>680,9</td>\n",
       "      <td>10,42</td>\n",
       "      <td>0,61</td>\n",
       "      <td>88,97</td>\n",
       "      <td>3</td>\n",
       "      <td>9,71</td>\n",
       "      <td>8,49</td>\n",
       "      <td>0,015</td>\n",
       "      <td>0,34</td>\n",
       "      <td>0,645</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>227 rows × 18 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                 Population  Area (sq. mi.)  \\\n",
       "Region                              Country                                   \n",
       "ASIA (EX. NEAR EAST)                Afghanistan    31056997          647500   \n",
       "                                    Bangladesh    147365352          144000   \n",
       "                                    Bhutan          2279723           47000   \n",
       "                                    Brunei           379444            5770   \n",
       "                                    Burma          47382633          678500   \n",
       "...                                                     ...             ...   \n",
       "WESTERN EUROPE                      Portugal       10605870           92391   \n",
       "                                    San Marino        29251              61   \n",
       "                                    Spain          40397842          504782   \n",
       "                                    Sweden          9016596          449964   \n",
       "                                    Switzerland     7523934           41290   \n",
       "\n",
       "                                                Pop. Density (per sq. mi.)  \\\n",
       "Region                              Country                                  \n",
       "ASIA (EX. NEAR EAST)                Afghanistan                       48,0   \n",
       "                                    Bangladesh                      1023,4   \n",
       "                                    Bhutan                            48,5   \n",
       "                                    Brunei                            65,8   \n",
       "                                    Burma                             69,8   \n",
       "...                                                                    ...   \n",
       "WESTERN EUROPE                      Portugal                         114,8   \n",
       "                                    San Marino                       479,5   \n",
       "                                    Spain                             80,0   \n",
       "                                    Sweden                            20,0   \n",
       "                                    Switzerland                      182,2   \n",
       "\n",
       "                                                Coastline (coast/area ratio)  \\\n",
       "Region                              Country                                    \n",
       "ASIA (EX. NEAR EAST)                Afghanistan                         0,00   \n",
       "                                    Bangladesh                          0,40   \n",
       "                                    Bhutan                              0,00   \n",
       "                                    Brunei                              2,79   \n",
       "                                    Burma                               0,28   \n",
       "...                                                                      ...   \n",
       "WESTERN EUROPE                      Portugal                            1,94   \n",
       "                                    San Marino                          0,00   \n",
       "                                    Spain                               0,98   \n",
       "                                    Sweden                              0,72   \n",
       "                                    Switzerland                         0,00   \n",
       "\n",
       "                                                Net migration  \\\n",
       "Region                              Country                     \n",
       "ASIA (EX. NEAR EAST)                Afghanistan         23,06   \n",
       "                                    Bangladesh          -0,71   \n",
       "                                    Bhutan                  0   \n",
       "                                    Brunei               3,59   \n",
       "                                    Burma                -1,8   \n",
       "...                                                       ...   \n",
       "WESTERN EUROPE                      Portugal             3,57   \n",
       "                                    San Marino          10,98   \n",
       "                                    Spain                0,99   \n",
       "                                    Sweden               1,67   \n",
       "                                    Switzerland          4,05   \n",
       "\n",
       "                                                Infant mortality (per 1000 births)  \\\n",
       "Region                              Country                                          \n",
       "ASIA (EX. NEAR EAST)                Afghanistan                             163,07   \n",
       "                                    Bangladesh                                62,6   \n",
       "                                    Bhutan                                  100,44   \n",
       "                                    Brunei                                   12,61   \n",
       "                                    Burma                                    67,24   \n",
       "...                                                                            ...   \n",
       "WESTERN EUROPE                      Portugal                                  5,05   \n",
       "                                    San Marino                                5,73   \n",
       "                                    Spain                                     4,42   \n",
       "                                    Sweden                                    2,77   \n",
       "                                    Switzerland                               4,39   \n",
       "\n",
       "                                                 GDP ($ per capita)  \\\n",
       "Region                              Country                           \n",
       "ASIA (EX. NEAR EAST)                Afghanistan               700.0   \n",
       "                                    Bangladesh               1900.0   \n",
       "                                    Bhutan                   1300.0   \n",
       "                                    Brunei                  18600.0   \n",
       "                                    Burma                    1800.0   \n",
       "...                                                             ...   \n",
       "WESTERN EUROPE                      Portugal                18000.0   \n",
       "                                    San Marino              34600.0   \n",
       "                                    Spain                   22000.0   \n",
       "                                    Sweden                  26800.0   \n",
       "                                    Switzerland             32700.0   \n",
       "\n",
       "                                                Literacy (%)  \\\n",
       "Region                              Country                    \n",
       "ASIA (EX. NEAR EAST)                Afghanistan         36,0   \n",
       "                                    Bangladesh          43,1   \n",
       "                                    Bhutan              42,2   \n",
       "                                    Brunei              93,9   \n",
       "                                    Burma               85,3   \n",
       "...                                                      ...   \n",
       "WESTERN EUROPE                      Portugal            93,3   \n",
       "                                    San Marino          96,0   \n",
       "                                    Spain               97,9   \n",
       "                                    Sweden              99,0   \n",
       "                                    Switzerland         99,0   \n",
       "\n",
       "                                                Phones (per 1000) Arable (%)  \\\n",
       "Region                              Country                                    \n",
       "ASIA (EX. NEAR EAST)                Afghanistan               3,2      12,13   \n",
       "                                    Bangladesh                7,3      62,11   \n",
       "                                    Bhutan                   14,3       3,09   \n",
       "                                    Brunei                  237,2       0,57   \n",
       "                                    Burma                    10,1      15,19   \n",
       "...                                                           ...        ...   \n",
       "WESTERN EUROPE                      Portugal                399,2      21,75   \n",
       "                                    San Marino              704,3      16,67   \n",
       "                                    Spain                   453,5      26,07   \n",
       "                                    Sweden                  715,0       6,54   \n",
       "                                    Switzerland             680,9      10,42   \n",
       "\n",
       "                                                Crops (%) Other (%) Climate  \\\n",
       "Region                              Country                                   \n",
       "ASIA (EX. NEAR EAST)                Afghanistan      0,22     87,65       1   \n",
       "                                    Bangladesh       3,07     34,82       2   \n",
       "                                    Bhutan           0,43     96,48       2   \n",
       "                                    Brunei           0,76     98,67       2   \n",
       "                                    Burma            0,97     83,84       2   \n",
       "...                                                   ...       ...     ...   \n",
       "WESTERN EUROPE                      Portugal         7,81     70,44       3   \n",
       "                                    San Marino          0     83,33     NaN   \n",
       "                                    Spain            9,87     64,06       3   \n",
       "                                    Sweden           0,01     93,45       3   \n",
       "                                    Switzerland      0,61     88,97       3   \n",
       "\n",
       "                                                Birthrate Deathrate  \\\n",
       "Region                              Country                           \n",
       "ASIA (EX. NEAR EAST)                Afghanistan      46,6     20,34   \n",
       "                                    Bangladesh       29,8      8,27   \n",
       "                                    Bhutan          33,65      12,7   \n",
       "                                    Brunei          18,79      3,45   \n",
       "                                    Burma           17,91      9,83   \n",
       "...                                                   ...       ...   \n",
       "WESTERN EUROPE                      Portugal        10,72      10,5   \n",
       "                                    San Marino      10,02      8,17   \n",
       "                                    Spain           10,06      9,72   \n",
       "                                    Sweden          10,27     10,31   \n",
       "                                    Switzerland      9,71      8,49   \n",
       "\n",
       "                                                Agriculture Industry Service  \n",
       "Region                              Country                                   \n",
       "ASIA (EX. NEAR EAST)                Afghanistan        0,38     0,24    0,38  \n",
       "                                    Bangladesh        0,199    0,198   0,603  \n",
       "                                    Bhutan            0,258    0,379   0,363  \n",
       "                                    Brunei            0,036    0,561   0,403  \n",
       "                                    Burma             0,564    0,082   0,353  \n",
       "...                                                     ...      ...     ...  \n",
       "WESTERN EUROPE                      Portugal          0,053    0,274   0,673  \n",
       "                                    San Marino          NaN      NaN     NaN  \n",
       "                                    Spain              0,04    0,295   0,665  \n",
       "                                    Sweden            0,011    0,282   0,707  \n",
       "                                    Switzerland       0,015     0,34   0,645  \n",
       "\n",
       "[227 rows x 18 columns]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# For what we want, an alternate, and simpler (and more robust, since we don't need to worry about data types) \n",
    "# option is to just use create hierarchical index using set_index, and then sort_index.\n",
    "CoTW_df.set_index(['Region','Country']).sort_index()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Missing data \n",
    "\n",
    "There are many reasons that data may be missing. e.g.:\n",
    "- It may be inherently missing: not collected, not entered, etc.\n",
    "- It may be implicitly missing.\n",
    "- Missing entries may get created during data processing. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Date</th>\n",
       "      <th>Day</th>\n",
       "      <th>Cases_Guinea</th>\n",
       "      <th>Cases_Liberia</th>\n",
       "      <th>Cases_SierraLeone</th>\n",
       "      <th>Cases_Nigeria</th>\n",
       "      <th>Cases_Senegal</th>\n",
       "      <th>Cases_UnitedStates</th>\n",
       "      <th>Cases_Spain</th>\n",
       "      <th>Cases_Mali</th>\n",
       "      <th>Deaths_Guinea</th>\n",
       "      <th>Deaths_Liberia</th>\n",
       "      <th>Deaths_SierraLeone</th>\n",
       "      <th>Deaths_Nigeria</th>\n",
       "      <th>Deaths_Senegal</th>\n",
       "      <th>Deaths_UnitedStates</th>\n",
       "      <th>Deaths_Spain</th>\n",
       "      <th>Deaths_Mali</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>108</th>\n",
       "      <td>4/14/2014</td>\n",
       "      <td>23</td>\n",
       "      <td>168.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>108.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>88</th>\n",
       "      <td>6/3/2014</td>\n",
       "      <td>73</td>\n",
       "      <td>344.0</td>\n",
       "      <td>13.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>215.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100</th>\n",
       "      <td>4/24/2014</td>\n",
       "      <td>33</td>\n",
       "      <td>NaN</td>\n",
       "      <td>35.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>10/19/2014</td>\n",
       "      <td>211</td>\n",
       "      <td>1540.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3706.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>904.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1259.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>11/11/2014</td>\n",
       "      <td>234</td>\n",
       "      <td>1919.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5586.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1166.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1187.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>96</th>\n",
       "      <td>5/5/2014</td>\n",
       "      <td>44</td>\n",
       "      <td>235.0</td>\n",
       "      <td>13.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>157.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \\\n",
       "108   4/14/2014   23         168.0            NaN                NaN   \n",
       "88     6/3/2014   73         344.0           13.0                NaN   \n",
       "100   4/24/2014   33           NaN           35.0                0.0   \n",
       "35   10/19/2014  211        1540.0            NaN             3706.0   \n",
       "22   11/11/2014  234        1919.0            NaN             5586.0   \n",
       "96     5/5/2014   44         235.0           13.0                0.0   \n",
       "\n",
       "     Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  \\\n",
       "108            NaN            NaN                 NaN          NaN   \n",
       "88             NaN            NaN                 NaN          NaN   \n",
       "100            NaN            NaN                 NaN          NaN   \n",
       "35            20.0            1.0                 3.0          1.0   \n",
       "22            20.0            1.0                 4.0          1.0   \n",
       "96             NaN            NaN                 NaN          NaN   \n",
       "\n",
       "     Cases_Mali  Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  \\\n",
       "108         NaN          108.0             NaN                 NaN   \n",
       "88          NaN          215.0            12.0                 6.0   \n",
       "100         NaN            NaN             NaN                 0.0   \n",
       "35          NaN          904.0             NaN              1259.0   \n",
       "22          4.0         1166.0             NaN              1187.0   \n",
       "96          NaN          157.0            11.0                 0.0   \n",
       "\n",
       "     Deaths_Nigeria  Deaths_Senegal  Deaths_UnitedStates  Deaths_Spain  \\\n",
       "108             NaN             NaN                  NaN           NaN   \n",
       "88              NaN             NaN                  NaN           NaN   \n",
       "100             NaN             NaN                  NaN           NaN   \n",
       "35              8.0             0.0                  1.0           0.0   \n",
       "22              8.0             0.0                  1.0           0.0   \n",
       "96              NaN             NaN                  NaN           NaN   \n",
       "\n",
       "     Deaths_Mali  \n",
       "108          NaN  \n",
       "88           NaN  \n",
       "100          NaN  \n",
       "35           NaN  \n",
       "22           3.0  \n",
       "96           NaN  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load data (source: https://github.com/chendaniely/pandas_for_everyone) \n",
    "pfe_path ='data/pandas_for_everyone_data/' # change this to adjust relative path\n",
    "#visited_df = pd.read_csv(pfe_path+'survey_visited.csv')\n",
    "#survey_df = pd.read_csv(pfe_path+'survey_survey.csv')\n",
    "ebola_df= pd.read_csv(pfe_path+'country_timeseries.csv')\n",
    "ebola_df.sample(6)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Date</th>\n",
       "      <th>Day</th>\n",
       "      <th>Cases_Guinea</th>\n",
       "      <th>Cases_Liberia</th>\n",
       "      <th>Cases_SierraLeone</th>\n",
       "      <th>Cases_Nigeria</th>\n",
       "      <th>Cases_Senegal</th>\n",
       "      <th>Cases_UnitedStates</th>\n",
       "      <th>Cases_Spain</th>\n",
       "      <th>Cases_Mali</th>\n",
       "      <th>Deaths_Guinea</th>\n",
       "      <th>Deaths_Liberia</th>\n",
       "      <th>Deaths_SierraLeone</th>\n",
       "      <th>Deaths_Nigeria</th>\n",
       "      <th>Deaths_Senegal</th>\n",
       "      <th>Deaths_UnitedStates</th>\n",
       "      <th>Deaths_Spain</th>\n",
       "      <th>Deaths_Mali</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1/5/2015</td>\n",
       "      <td>289</td>\n",
       "      <td>2776.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>10030.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1786.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2977.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1/4/2015</td>\n",
       "      <td>288</td>\n",
       "      <td>2775.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>9780.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1781.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2943.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1/3/2015</td>\n",
       "      <td>287</td>\n",
       "      <td>2769.0</td>\n",
       "      <td>8166.0</td>\n",
       "      <td>9722.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1767.0</td>\n",
       "      <td>3496.0</td>\n",
       "      <td>2915.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1/2/2015</td>\n",
       "      <td>286</td>\n",
       "      <td>0.0</td>\n",
       "      <td>8157.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3496.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>12/31/2014</td>\n",
       "      <td>284</td>\n",
       "      <td>2730.0</td>\n",
       "      <td>8115.0</td>\n",
       "      <td>9633.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1739.0</td>\n",
       "      <td>3471.0</td>\n",
       "      <td>2827.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>117</th>\n",
       "      <td>3/27/2014</td>\n",
       "      <td>5</td>\n",
       "      <td>103.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>118</th>\n",
       "      <td>3/26/2014</td>\n",
       "      <td>4</td>\n",
       "      <td>86.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>62.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>119</th>\n",
       "      <td>3/25/2014</td>\n",
       "      <td>3</td>\n",
       "      <td>86.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>60.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>120</th>\n",
       "      <td>3/24/2014</td>\n",
       "      <td>2</td>\n",
       "      <td>86.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>59.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>121</th>\n",
       "      <td>3/22/2014</td>\n",
       "      <td>0</td>\n",
       "      <td>49.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>29.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>122 rows × 18 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "           Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \\\n",
       "0      1/5/2015  289        2776.0            0.0            10030.0   \n",
       "1      1/4/2015  288        2775.0            0.0             9780.0   \n",
       "2      1/3/2015  287        2769.0         8166.0             9722.0   \n",
       "3      1/2/2015  286           0.0         8157.0                0.0   \n",
       "4    12/31/2014  284        2730.0         8115.0             9633.0   \n",
       "..          ...  ...           ...            ...                ...   \n",
       "117   3/27/2014    5         103.0            8.0                6.0   \n",
       "118   3/26/2014    4          86.0            0.0                0.0   \n",
       "119   3/25/2014    3          86.0            0.0                0.0   \n",
       "120   3/24/2014    2          86.0            0.0                0.0   \n",
       "121   3/22/2014    0          49.0            0.0                0.0   \n",
       "\n",
       "     Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  \\\n",
       "0              0.0            0.0                 0.0          0.0   \n",
       "1              0.0            0.0                 0.0          0.0   \n",
       "2              0.0            0.0                 0.0          0.0   \n",
       "3              0.0            0.0                 0.0          0.0   \n",
       "4              0.0            0.0                 0.0          0.0   \n",
       "..             ...            ...                 ...          ...   \n",
       "117            0.0            0.0                 0.0          0.0   \n",
       "118            0.0            0.0                 0.0          0.0   \n",
       "119            0.0            0.0                 0.0          0.0   \n",
       "120            0.0            0.0                 0.0          0.0   \n",
       "121            0.0            0.0                 0.0          0.0   \n",
       "\n",
       "     Cases_Mali  Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  \\\n",
       "0           0.0         1786.0             0.0              2977.0   \n",
       "1           0.0         1781.0             0.0              2943.0   \n",
       "2           0.0         1767.0          3496.0              2915.0   \n",
       "3           0.0            0.0          3496.0                 0.0   \n",
       "4           0.0         1739.0          3471.0              2827.0   \n",
       "..          ...            ...             ...                 ...   \n",
       "117         0.0           66.0             6.0                 5.0   \n",
       "118         0.0           62.0             0.0                 0.0   \n",
       "119         0.0           60.0             0.0                 0.0   \n",
       "120         0.0           59.0             0.0                 0.0   \n",
       "121         0.0           29.0             0.0                 0.0   \n",
       "\n",
       "     Deaths_Nigeria  Deaths_Senegal  Deaths_UnitedStates  Deaths_Spain  \\\n",
       "0               0.0             0.0                  0.0           0.0   \n",
       "1               0.0             0.0                  0.0           0.0   \n",
       "2               0.0             0.0                  0.0           0.0   \n",
       "3               0.0             0.0                  0.0           0.0   \n",
       "4               0.0             0.0                  0.0           0.0   \n",
       "..              ...             ...                  ...           ...   \n",
       "117             0.0             0.0                  0.0           0.0   \n",
       "118             0.0             0.0                  0.0           0.0   \n",
       "119             0.0             0.0                  0.0           0.0   \n",
       "120             0.0             0.0                  0.0           0.0   \n",
       "121             0.0             0.0                  0.0           0.0   \n",
       "\n",
       "     Deaths_Mali  \n",
       "0            0.0  \n",
       "1            0.0  \n",
       "2            0.0  \n",
       "3            0.0  \n",
       "4            0.0  \n",
       "..           ...  \n",
       "117          0.0  \n",
       "118          0.0  \n",
       "119          0.0  \n",
       "120          0.0  \n",
       "121          0.0  \n",
       "\n",
       "[122 rows x 18 columns]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# One option is to consider missing data to have some default value.\n",
    "# e.g., in the case of number of Ebola cases, it may be fair to consider missing values to be 0.\n",
    "ebola_df.fillna(0,inplace=True)\n",
    "ebola_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pop</th>\n",
       "      <th>gdpPercap</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>year</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2002</th>\n",
       "      <td>25268405</td>\n",
       "      <td>726.734055</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2007</th>\n",
       "      <td>31889923</td>\n",
       "      <td>974.580338</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           pop   gdpPercap\n",
       "year                      \n",
       "2002  25268405  726.734055\n",
       "2007  31889923  974.580338"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Let's consider an example of how missing data may get created during the processing (because it was implicitly missing)\n",
    "afg_df = gap_df[gap_df['country']=='Afghanistan'][['year','pop','gdpPercap']].tail(2).set_index('year')\n",
    "afg_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "scrolled": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pop</th>\n",
       "      <th>gdpPercap</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>year</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2000</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2001</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2002</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>726.734055</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2003</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2004</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2005</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2006</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2007</th>\n",
       "      <td>31889923.0</td>\n",
       "      <td>974.580338</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2008</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2009</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2010</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             pop   gdpPercap\n",
       "year                        \n",
       "2000         NaN         NaN\n",
       "2001         NaN         NaN\n",
       "2002  25268405.0  726.734055\n",
       "2003         NaN         NaN\n",
       "2004         NaN         NaN\n",
       "2005         NaN         NaN\n",
       "2006         NaN         NaN\n",
       "2007  31889923.0  974.580338\n",
       "2008         NaN         NaN\n",
       "2009         NaN         NaN\n",
       "2010         NaN         NaN"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# If we for some reason need yearly population and per capita GDP\n",
    "# That information was missing implicitly, and during our processing, it may get so explicitly\n",
    "afg_df=afg_df.reindex(afg_df.index.tolist() + list(range(2000, 2011)))\n",
    "afg_df=afg_df[~afg_df.index.duplicated(keep='first')].sort_index()\n",
    "afg_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pop</th>\n",
       "      <th>gdpPercap</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>year</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2000</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2001</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2002</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>726.734055</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2003</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2004</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2005</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2006</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2007</th>\n",
       "      <td>31889923.0</td>\n",
       "      <td>974.580338</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2008</th>\n",
       "      <td>31889923.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2009</th>\n",
       "      <td>31889923.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2010</th>\n",
       "      <td>31889923.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             pop   gdpPercap\n",
       "year                        \n",
       "2000         NaN         NaN\n",
       "2001         NaN         NaN\n",
       "2002  25268405.0  726.734055\n",
       "2003  25268405.0         NaN\n",
       "2004  25268405.0         NaN\n",
       "2005  25268405.0         NaN\n",
       "2006  25268405.0         NaN\n",
       "2007  31889923.0  974.580338\n",
       "2008  31889923.0         NaN\n",
       "2009  31889923.0         NaN\n",
       "2010  31889923.0         NaN"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# forward fill (the population)\n",
    "afg_df.loc[:,'pop'] = afg_df.loc[:,'pop'].ffill()\n",
    "afg_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pop</th>\n",
       "      <th>gdpPercap</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>year</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2000</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2001</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2002</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>726.734055</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2003</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2004</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2005</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2006</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2007</th>\n",
       "      <td>31889923.0</td>\n",
       "      <td>974.580338</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2008</th>\n",
       "      <td>31889923.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2009</th>\n",
       "      <td>31889923.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2010</th>\n",
       "      <td>31889923.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             pop   gdpPercap\n",
       "year                        \n",
       "2000  25268405.0         NaN\n",
       "2001  25268405.0         NaN\n",
       "2002  25268405.0  726.734055\n",
       "2003  25268405.0         NaN\n",
       "2004  25268405.0         NaN\n",
       "2005  25268405.0         NaN\n",
       "2006  25268405.0         NaN\n",
       "2007  31889923.0  974.580338\n",
       "2008  31889923.0         NaN\n",
       "2009  31889923.0         NaN\n",
       "2010  31889923.0         NaN"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# backward fill (the population)\n",
    "afg_df.loc[:,'pop'] = afg_df.loc[:,'pop'].bfill()\n",
    "afg_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pop</th>\n",
       "      <th>gdpPercap</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>year</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2000</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>726.734055</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2001</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>726.734055</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2002</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>726.734055</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2003</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>776.303312</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2004</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>825.872568</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2005</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>875.441825</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2006</th>\n",
       "      <td>25268405.0</td>\n",
       "      <td>925.011082</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2007</th>\n",
       "      <td>31889923.0</td>\n",
       "      <td>974.580338</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2008</th>\n",
       "      <td>31889923.0</td>\n",
       "      <td>974.580338</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2009</th>\n",
       "      <td>31889923.0</td>\n",
       "      <td>974.580338</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2010</th>\n",
       "      <td>31889923.0</td>\n",
       "      <td>974.580338</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             pop   gdpPercap\n",
       "year                        \n",
       "2000  25268405.0  726.734055\n",
       "2001  25268405.0  726.734055\n",
       "2002  25268405.0  726.734055\n",
       "2003  25268405.0  776.303312\n",
       "2004  25268405.0  825.872568\n",
       "2005  25268405.0  875.441825\n",
       "2006  25268405.0  925.011082\n",
       "2007  31889923.0  974.580338\n",
       "2008  31889923.0  974.580338\n",
       "2009  31889923.0  974.580338\n",
       "2010  31889923.0  974.580338"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# interpolate fill: many caveats apply\n",
    "# only linear interpolation is supported\n",
    "# the gaps are filled with linear interpolation. \n",
    "# the extremities just repeat the data\n",
    "\n",
    "afg_df.loc[:,'gdpPercap'] = afg_df.loc[:,'gdpPercap'].interpolate(limit_direction='both')\n",
    "afg_df\n",
    "# you may need your own custom method"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>time</th>\n",
       "      <th>user</th>\n",
       "      <th>video</th>\n",
       "      <th>playback position</th>\n",
       "      <th>paused</th>\n",
       "      <th>volume</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1469974424</td>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>5</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1469974454</td>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1469974544</td>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>9</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1469974574</td>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>10</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1469977514</td>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1469977544</td>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>1469977574</td>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>1469977604</td>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1469974604</td>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>11</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>1469974694</td>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>14</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>1469974724</td>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>15</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>1469974454</td>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>24</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>1469974524</td>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>25</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>1469974424</td>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>23</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>1469974554</td>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>26</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>1469974624</td>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>27</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>1469974654</td>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>28</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>1469974724</td>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>29</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>1469974484</td>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>7</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>1469974514</td>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>8</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>1469974754</td>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>30</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>1469974824</td>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>31</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>1469974854</td>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>32</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>1469974924</td>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>33</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>1469977424</td>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>1469977454</td>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>1469977484</td>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>1469977634</td>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>1469977664</td>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>1469974634</td>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>1469974664</td>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>13</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>1469977694</td>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <td>1469977724</td>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          time    user          video  playback position paused  volume\n",
       "0   1469974424  cheryl     intro.html                  5  False    10.0\n",
       "1   1469974454  cheryl     intro.html                  6    NaN     NaN\n",
       "2   1469974544  cheryl     intro.html                  9    NaN     NaN\n",
       "3   1469974574  cheryl     intro.html                 10    NaN     NaN\n",
       "4   1469977514     bob     intro.html                  1    NaN     NaN\n",
       "5   1469977544     bob     intro.html                  1    NaN     NaN\n",
       "6   1469977574     bob     intro.html                  1    NaN     NaN\n",
       "7   1469977604     bob     intro.html                  1    NaN     NaN\n",
       "8   1469974604  cheryl     intro.html                 11    NaN     NaN\n",
       "9   1469974694  cheryl     intro.html                 14    NaN     NaN\n",
       "10  1469974724  cheryl     intro.html                 15    NaN     NaN\n",
       "11  1469974454     sue  advanced.html                 24    NaN     NaN\n",
       "12  1469974524     sue  advanced.html                 25    NaN     NaN\n",
       "13  1469974424     sue  advanced.html                 23  False    10.0\n",
       "14  1469974554     sue  advanced.html                 26    NaN     NaN\n",
       "15  1469974624     sue  advanced.html                 27    NaN     NaN\n",
       "16  1469974654     sue  advanced.html                 28    NaN     5.0\n",
       "17  1469974724     sue  advanced.html                 29    NaN     NaN\n",
       "18  1469974484  cheryl     intro.html                  7    NaN     NaN\n",
       "19  1469974514  cheryl     intro.html                  8    NaN     NaN\n",
       "20  1469974754     sue  advanced.html                 30    NaN     NaN\n",
       "21  1469974824     sue  advanced.html                 31    NaN     NaN\n",
       "22  1469974854     sue  advanced.html                 32    NaN     NaN\n",
       "23  1469974924     sue  advanced.html                 33    NaN     NaN\n",
       "24  1469977424     bob     intro.html                  1   True    10.0\n",
       "25  1469977454     bob     intro.html                  1    NaN     NaN\n",
       "26  1469977484     bob     intro.html                  1    NaN     NaN\n",
       "27  1469977634     bob     intro.html                  1    NaN     NaN\n",
       "28  1469977664     bob     intro.html                  1    NaN     NaN\n",
       "29  1469974634  cheryl     intro.html                 12    NaN     NaN\n",
       "30  1469974664  cheryl     intro.html                 13    NaN     NaN\n",
       "31  1469977694     bob     intro.html                  1    NaN     NaN\n",
       "32  1469977724     bob     intro.html                  1    NaN     NaN"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Be mindful regarding the semantics of the data\n",
    "# This data is from a 'toy example' on video playbacks \n",
    "# from Prof. Christopher Brooks' course on Data Science in Python \n",
    "# Usage logs being collected from multiple users parallely, with some missing data\n",
    "data_from_coursera_path ='data/coursera_sourced_data/'\n",
    "log_df=pd.read_csv(data_from_coursera_path+'log.csv')\n",
    "log_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user</th>\n",
       "      <th>video</th>\n",
       "      <th>playback position</th>\n",
       "      <th>paused</th>\n",
       "      <th>volume</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>time</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1469974424</th>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>5</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974424</th>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>23</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974454</th>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974454</th>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>24</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974484</th>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>7</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974514</th>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>8</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974524</th>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>25</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974544</th>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>9</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974554</th>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>26</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974574</th>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>10</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974604</th>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>11</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974624</th>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>27</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974634</th>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974654</th>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>28</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974664</th>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>13</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974694</th>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>14</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974724</th>\n",
       "      <td>cheryl</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>15</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974724</th>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>29</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974754</th>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>30</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974824</th>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>31</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974854</th>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>32</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974924</th>\n",
       "      <td>sue</td>\n",
       "      <td>advanced.html</td>\n",
       "      <td>33</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977424</th>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977454</th>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977484</th>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977514</th>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977544</th>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977574</th>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977604</th>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977634</th>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977664</th>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977694</th>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977724</th>\n",
       "      <td>bob</td>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              user          video  playback position paused  volume\n",
       "time                                                               \n",
       "1469974424  cheryl     intro.html                  5  False    10.0\n",
       "1469974424     sue  advanced.html                 23  False    10.0\n",
       "1469974454  cheryl     intro.html                  6    NaN     NaN\n",
       "1469974454     sue  advanced.html                 24    NaN     NaN\n",
       "1469974484  cheryl     intro.html                  7    NaN     NaN\n",
       "1469974514  cheryl     intro.html                  8    NaN     NaN\n",
       "1469974524     sue  advanced.html                 25    NaN     NaN\n",
       "1469974544  cheryl     intro.html                  9    NaN     NaN\n",
       "1469974554     sue  advanced.html                 26    NaN     NaN\n",
       "1469974574  cheryl     intro.html                 10    NaN     NaN\n",
       "1469974604  cheryl     intro.html                 11    NaN     NaN\n",
       "1469974624     sue  advanced.html                 27    NaN     NaN\n",
       "1469974634  cheryl     intro.html                 12    NaN     NaN\n",
       "1469974654     sue  advanced.html                 28    NaN     5.0\n",
       "1469974664  cheryl     intro.html                 13    NaN     NaN\n",
       "1469974694  cheryl     intro.html                 14    NaN     NaN\n",
       "1469974724  cheryl     intro.html                 15    NaN     NaN\n",
       "1469974724     sue  advanced.html                 29    NaN     NaN\n",
       "1469974754     sue  advanced.html                 30    NaN     NaN\n",
       "1469974824     sue  advanced.html                 31    NaN     NaN\n",
       "1469974854     sue  advanced.html                 32    NaN     NaN\n",
       "1469974924     sue  advanced.html                 33    NaN     NaN\n",
       "1469977424     bob     intro.html                  1   True    10.0\n",
       "1469977454     bob     intro.html                  1    NaN     NaN\n",
       "1469977484     bob     intro.html                  1    NaN     NaN\n",
       "1469977514     bob     intro.html                  1    NaN     NaN\n",
       "1469977544     bob     intro.html                  1    NaN     NaN\n",
       "1469977574     bob     intro.html                  1    NaN     NaN\n",
       "1469977604     bob     intro.html                  1    NaN     NaN\n",
       "1469977634     bob     intro.html                  1    NaN     NaN\n",
       "1469977664     bob     intro.html                  1    NaN     NaN\n",
       "1469977694     bob     intro.html                  1    NaN     NaN\n",
       "1469977724     bob     intro.html                  1    NaN     NaN"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Let's view the data sorted by time\n",
    "log_df=log_df.set_index('time').sort_index() \n",
    "log_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>video</th>\n",
       "      <th>playback position</th>\n",
       "      <th>paused</th>\n",
       "      <th>volume</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>user</th>\n",
       "      <th>time</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"11\" valign=\"top\">bob</th>\n",
       "      <th>1469977424</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977454</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977484</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977514</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977544</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977574</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977604</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977634</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977664</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977694</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977724</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"11\" valign=\"top\">cheryl</th>\n",
       "      <th>1469974424</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>5</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974454</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974484</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>7</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974514</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>8</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974544</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>9</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974574</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>10</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974604</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>11</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974634</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974664</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>13</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974694</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>14</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974724</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>15</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"11\" valign=\"top\">sue</th>\n",
       "      <th>1469974424</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>23</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974454</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>24</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974524</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>25</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974554</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>26</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974624</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>27</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974654</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>28</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974724</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>29</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974754</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>30</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974824</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>31</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974854</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>32</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974924</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>33</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                           video  playback position paused  volume\n",
       "user   time                                                       \n",
       "bob    1469977424     intro.html                  1   True    10.0\n",
       "       1469977454     intro.html                  1    NaN     NaN\n",
       "       1469977484     intro.html                  1    NaN     NaN\n",
       "       1469977514     intro.html                  1    NaN     NaN\n",
       "       1469977544     intro.html                  1    NaN     NaN\n",
       "       1469977574     intro.html                  1    NaN     NaN\n",
       "       1469977604     intro.html                  1    NaN     NaN\n",
       "       1469977634     intro.html                  1    NaN     NaN\n",
       "       1469977664     intro.html                  1    NaN     NaN\n",
       "       1469977694     intro.html                  1    NaN     NaN\n",
       "       1469977724     intro.html                  1    NaN     NaN\n",
       "cheryl 1469974424     intro.html                  5  False    10.0\n",
       "       1469974454     intro.html                  6    NaN     NaN\n",
       "       1469974484     intro.html                  7    NaN     NaN\n",
       "       1469974514     intro.html                  8    NaN     NaN\n",
       "       1469974544     intro.html                  9    NaN     NaN\n",
       "       1469974574     intro.html                 10    NaN     NaN\n",
       "       1469974604     intro.html                 11    NaN     NaN\n",
       "       1469974634     intro.html                 12    NaN     NaN\n",
       "       1469974664     intro.html                 13    NaN     NaN\n",
       "       1469974694     intro.html                 14    NaN     NaN\n",
       "       1469974724     intro.html                 15    NaN     NaN\n",
       "sue    1469974424  advanced.html                 23  False    10.0\n",
       "       1469974454  advanced.html                 24    NaN     NaN\n",
       "       1469974524  advanced.html                 25    NaN     NaN\n",
       "       1469974554  advanced.html                 26    NaN     NaN\n",
       "       1469974624  advanced.html                 27    NaN     NaN\n",
       "       1469974654  advanced.html                 28    NaN     5.0\n",
       "       1469974724  advanced.html                 29    NaN     NaN\n",
       "       1469974754  advanced.html                 30    NaN     NaN\n",
       "       1469974824  advanced.html                 31    NaN     NaN\n",
       "       1469974854  advanced.html                 32    NaN     NaN\n",
       "       1469974924  advanced.html                 33    NaN     NaN"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "log_df=log_df.reset_index().set_index(['user','time']).sort_index()\n",
    "# We want to use a 2-level index, so that all entries for a given user are together\n",
    "# Chaining can keep your code short. Make sure it does not make it too cryptic though ;) \n",
    "# Motivation: Using user Cheryl's volume setting to fill-in user Bob's volume may not be meaningful\n",
    "log_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "scrolled": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>video</th>\n",
       "      <th>playback position</th>\n",
       "      <th>paused</th>\n",
       "      <th>volume</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>user</th>\n",
       "      <th>time</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"11\" valign=\"top\">bob</th>\n",
       "      <th>1469977424</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977454</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977484</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977514</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977544</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977574</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977604</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977634</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977664</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977694</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469977724</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"11\" valign=\"top\">cheryl</th>\n",
       "      <th>1469974424</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>5</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974454</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>6</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974484</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>7</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974514</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>8</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974544</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>9</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974574</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>10</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974604</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>11</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974634</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>12</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974664</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>13</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974694</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>14</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974724</th>\n",
       "      <td>intro.html</td>\n",
       "      <td>15</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"11\" valign=\"top\">sue</th>\n",
       "      <th>1469974424</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>23</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974454</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>24</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974524</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>25</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974554</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>26</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974624</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>27</td>\n",
       "      <td>False</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974654</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>28</td>\n",
       "      <td>False</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974724</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>29</td>\n",
       "      <td>False</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974754</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>30</td>\n",
       "      <td>False</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974824</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>31</td>\n",
       "      <td>False</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974854</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>32</td>\n",
       "      <td>False</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1469974924</th>\n",
       "      <td>advanced.html</td>\n",
       "      <td>33</td>\n",
       "      <td>False</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                           video  playback position  paused  volume\n",
       "user   time                                                        \n",
       "bob    1469977424     intro.html                  1    True    10.0\n",
       "       1469977454     intro.html                  1    True    10.0\n",
       "       1469977484     intro.html                  1    True    10.0\n",
       "       1469977514     intro.html                  1    True    10.0\n",
       "       1469977544     intro.html                  1    True    10.0\n",
       "       1469977574     intro.html                  1    True    10.0\n",
       "       1469977604     intro.html                  1    True    10.0\n",
       "       1469977634     intro.html                  1    True    10.0\n",
       "       1469977664     intro.html                  1    True    10.0\n",
       "       1469977694     intro.html                  1    True    10.0\n",
       "       1469977724     intro.html                  1    True    10.0\n",
       "cheryl 1469974424     intro.html                  5   False    10.0\n",
       "       1469974454     intro.html                  6   False    10.0\n",
       "       1469974484     intro.html                  7   False    10.0\n",
       "       1469974514     intro.html                  8   False    10.0\n",
       "       1469974544     intro.html                  9   False    10.0\n",
       "       1469974574     intro.html                 10   False    10.0\n",
       "       1469974604     intro.html                 11   False    10.0\n",
       "       1469974634     intro.html                 12   False    10.0\n",
       "       1469974664     intro.html                 13   False    10.0\n",
       "       1469974694     intro.html                 14   False    10.0\n",
       "       1469974724     intro.html                 15   False    10.0\n",
       "sue    1469974424  advanced.html                 23   False    10.0\n",
       "       1469974454  advanced.html                 24   False    10.0\n",
       "       1469974524  advanced.html                 25   False    10.0\n",
       "       1469974554  advanced.html                 26   False    10.0\n",
       "       1469974624  advanced.html                 27   False    10.0\n",
       "       1469974654  advanced.html                 28   False     5.0\n",
       "       1469974724  advanced.html                 29   False     5.0\n",
       "       1469974754  advanced.html                 30   False     5.0\n",
       "       1469974824  advanced.html                 31   False     5.0\n",
       "       1469974854  advanced.html                 32   False     5.0\n",
       "       1469974924  advanced.html                 33   False     5.0"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Now, it makes certain sense to assume that \n",
    "# the missing values correspond to the last recorded value for that particular user\n",
    "# Forward the Fill! (based on a reasonable Foundation of the assumptions ;)\n",
    "log_df = log_df.fillna(method='ffill')\n",
    "log_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Some basic text processing & RegEx\n",
    "<a href=\"https://www.explainxkcd.com/wiki/index.php/1638:_Backslashes\"><img src=\"pics/backslashes.png\" alt=\"XKCD backslashes\" width=\"400\"/></a><br>\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Doctors', 'may', 'soon', 'identify', '#diabetes', 'patients', 'who', 'have', 'a', 'higher', 'risk', 'of', 'vascular', '#inflammation', 'with', 'a', 'one-step', 'process', 'using', 'a', '‘lab-on-a-chip’', 'device', 'developed', 'by', 'scientists', 'from', '@NTUsg,', '@TTSH', 'and', '@MIT.', '#NTUsg2025', '#NTUsgResearch', 'https://bit.ly/ExoDFFchip']\n"
     ]
    }
   ],
   "source": [
    "# Tokens\n",
    "NTU_tweet =\"\"\"Doctors may soon identify #diabetes patients who have a higher risk of vascular #inflammation with a one-step process using a ‘lab-on-a-chip’ device developed by scientists from @NTUsg, @TTSH and @MIT. #NTUsg2025 #NTUsgResearch https://bit.ly/ExoDFFchip\"\"\"\n",
    "Tweet_tokens=NTU_tweet.split()\n",
    "print(Tweet_tokens)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Doctors', 'identify', 'patients', 'higher', 'vascular', 'one-step', 'process', 'using', '‘lab-on-a-chip’', 'device', 'developed', 'scientists']\n"
     ]
    }
   ],
   "source": [
    "# Find 'long' words (more than 4 letters) which are not hashtags, call outs, or urls!\n",
    "Tweet_longwords=[x for x in Tweet_tokens if (len(x)>4) & (x.find('#')<0) & (x.find('@')<0) & (x.find('://')<0)]\n",
    "print(Tweet_longwords)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['#diabetes', '#inflammation', '#NTUsg2025', '#NTUsgResearch']\n",
      "['#diabetes', '#inflammation', '#NTUsg2025', '#NTUsgResearch', 'stranger#things']\n"
     ]
    }
   ],
   "source": [
    "# What about, if we want to find all the hashtags?\n",
    "\n",
    "# Attempt 1:\n",
    "Tweet_hashtags1a=[x for x in Tweet_tokens if (x.find('#')>=0) ]\n",
    "Tweet_tokens.append(\"stranger#things\")\n",
    "Tweet_hashtags1b=[x for x in Tweet_tokens if (x.find('#')>=0) ]\n",
    "\n",
    "print(Tweet_hashtags1a)\n",
    "print(Tweet_hashtags1b)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['#diabetes', '#inflammation', '#NTUsg2025', '#NTUsgResearch']\n",
      "['#diabetes', '#inflammation', '#NTUsg2025', '#NTUsgResearch']\n"
     ]
    }
   ],
   "source": [
    "# Attempt 2:\n",
    "# Just in case there's something like `stranger#things' which is not a hashtag!\n",
    "# We might want to use one of the following:\n",
    "Tweet_hashtags2=[x for x in Tweet_tokens if (x.find('#')==0) ]  \n",
    "Tweet_hashtags3=[x for x in Tweet_tokens if x.startswith('#') ]\n",
    "\n",
    "print(Tweet_hashtags2)\n",
    "print(Tweet_hashtags3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['@NTUsg,', '#NTUsg2025', '#NTUsgResearch']\n",
      "['@NTUsg,', '#NTUsg2025', '#NTUsgResearch']\n"
     ]
    }
   ],
   "source": [
    "# Find all mentions of 'NTU'\n",
    "Contains_NTU1=[x for x in Tweet_tokens if (x.find('NTU')>=0) ]\n",
    "Contains_NTU2=[x for x in Tweet_tokens if 'NTU' in x]\n",
    "\n",
    "print(Contains_NTU1)\n",
    "print(Contains_NTU2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Some frequently used & useful string methods \n",
    "\n",
    "- checking: isalpha/isdigit/isalnum \n",
    "- transforming: lower/upper/title/replace/strip/rstrip\n",
    "- locate: find/rfind\n",
    "\n",
    "See [https://docs.python.org/3/library/stdtypes.html] for a more exhaustive list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['#diabetes', '#inflammation', '#NTUsg2025', '#NTUsgResearch']"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Let's redo the extraction of hashtags with a RegEx\n",
    "\n",
    "[x for x in Tweet_tokens if re.search('\\B#[A-Za-z-0-9_]+',x)] \n",
    "\n",
    "# For the specifc example tweet, if we used re.search('#',x), it would have sufficed.\n",
    "# But then it would not have managed to handle the 'stranger' cases such as isolated #, ## or hash sandwiched by text.."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### How to interpret <span style=\"color:red\">'\\B#[A-Za-z-0-9_]+'</span>?\n",
    "     [x for x in Tweet_tokens if re.search('\\B#[A-Za-z-0-9_]+',x)] \n",
    "     \n",
    "- <span style=\"color:red\">\\B</span> Matches boundary with non-alphanumeric characters\n",
    "\n",
    "- <span style=\"color:red\">[A-Za-z-0-9_]</span> Matches any of A to Z, a to z, 0 to 9 and _\n",
    "\n",
    "- <span style=\"color:red\">+</span> Greedily matches that expression on its immediate left is present at least once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['#diabetes', '#inflammation', '#NTUsg2025', '#NTUsgResearch']"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The same RegEx can be rewritten in a much shorter (and slightly cryptic) manner\n",
    "[x for x in Tweet_tokens if re.search('\\B#\\w+',x)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### RegEx cheatsheet snippet\n",
    "Source: <a href=\"https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf\">Dataquest.io</a>\n",
    "\n",
    "<img src=\"pics/regex-charclass.png\" alt=\"RegExCharacter Class\" width=\"300\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### RegEx cheatsheet snippet\n",
    "Source: <a href=\"https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf\">Dataquest.io</a>\n",
    "\n",
    "<img src=\"pics/regex-specialchar.png\" alt=\"RegExSpecialChar\" width=\"240\"/>\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0             03/25/93 Total time of visit (in minutes):\n",
       "1                           6/18/85 Primary Care Doctor:\n",
       "2      sshe plans to move as of 7/8/71 In-Home Servic...\n",
       "3                    7 on 9/27/75 Audit C Score Current:\n",
       "4      2/6/96 sleep studyPain Treatment Pain Level (N...\n",
       "                             ...                        \n",
       "495    1979 Family Psych History: Family History of S...\n",
       "496    therapist and friend died in ~2006 Parental/Ca...\n",
       "497                           2008 partial thyroidectomy\n",
       "498    sPt describes a history of sexual abuse as a c...\n",
       "499    . In 1980, patient was living in Naples and de...\n",
       "Length: 500, dtype: object"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Open a file, read it line-by-line and create a Data Series \n",
    "# The following data is obtained from Prof V. Vydiswaran's course on Applied Text Mining in Python\n",
    "\n",
    "doc = []\n",
    "with open(data_from_coursera_path+'text-with-dates.txt','r') as file:\n",
    "    for line in file:\n",
    "        doc.append(line.rstrip()) # used rstrip to remove trailing newlines \\n\n",
    "\n",
    "datetxt_df = pd.Series(doc)\n",
    "datetxt_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "scrolled": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['03/25/93 Total time of visit (in minutes):',\n",
       " '6/18/85 Primary Care Doctor:',\n",
       " 'sshe plans to move as of 7/8/71 In-Home Services: None',\n",
       " '7 on 9/27/75 Audit C Score Current:',\n",
       " '2/6/96 sleep studyPain Treatment Pain Level (Numeric Scale): 7',\n",
       " '.Per 7/06/79 Movement D/O note:',\n",
       " \"4, 5/18/78 Patient's thoughts about current substance abuse:\",\n",
       " '10/24/89 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
       " '3/7/86 SOS-10 Total Score:',\n",
       " '(4/10/71)Score-1Audit C Score Current:',\n",
       " '(5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC_12.6Activities of Daily Living (ADL) Bathing: Independent',\n",
       " '4/09/75 SOS-10 Total Score:',\n",
       " '8/01/98 Communication with referring physician?: Done',\n",
       " '1/26/72 Communication with referring physician?: Not Done',\n",
       " '5/24/1990 CPT Code: 90792: With medical services',\n",
       " '1/25/2011 CPT Code: 90792: With medical services',\n",
       " '4/12/82 Total time of visit (in minutes):',\n",
       " '1; 10/13/1976 Audit C Score, Highest/Date:',\n",
       " '4, 4/24/98 Relevant Drug History:',\n",
       " ') 59 yo unemployed w referred by Urgent Care for psychiatric evaluation and follow up. The patient reports she has been dx w BAD. Her main complaint today is anergy. Ms. Hartman was evaluated on one occasion 5/21/77. She was a cooperative but somewhat vague historian.History of Present Illness and Precipitating Events',\n",
       " '7/21/98 Total time of visit (in minutes):',\n",
       " '10/21/79 SOS-10 Total Score:',\n",
       " '3/03/90 CPT Code: 90792: With medical services',\n",
       " '2/11/76 CPT Code: 90792: With medical services',\n",
       " '07/25/1984 CPT Code: 90791: No medical services',\n",
       " '4-13-82 Other Child Mental Health Outcomes Scales Used:',\n",
       " '9/22/89 CPT Code: 90792: With medical services',\n",
       " '9/02/76 CPT Code: 90791: No medical services',\n",
       " '9/12/71 [report_end]',\n",
       " '10/24/86 Communication with referring physician?: Done',\n",
       " '03/31/1985 Total time of visit (in minutes):',\n",
       " '7/20/72 CPT Code: 90791: No medical services',\n",
       " '4/12/87= 1, negativeAudit C Score Current:',\n",
       " '06/20/91 Total time of visit (in minutes):',\n",
       " '5/12/2012 Primary Care Doctor:',\n",
       " '3/15/83 SOS-10 Total Score:',\n",
       " '2/14/73 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
       " '5/24/88 CPT Code: 90792: With medical services',\n",
       " '7/27/1986 Total time of visit (in minutes):',\n",
       " '1-14-81 Communication with referring physician?: Done',\n",
       " '7-29-75 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
       " '(6/24/87) TSH-2.18; Activities of Daily Living (ADL) Bathing: Independent',\n",
       " '8/14/94 Primary Care Doctor:',\n",
       " '4/13/2002 Primary Care Doctor:',\n",
       " '8/16/82 CPT Code: 90792: With medical services',\n",
       " '2/15/1998 Total time of visit (in minutes):',\n",
       " '7/15/91 CPT Code: 90792: With medical services',\n",
       " '06/12/94 SOS-10 Total Score:',\n",
       " '9/17/84 Communication with referring physician?: Done',\n",
       " '2/28/75 Other Adult Mental Health Outcomes Scales Used:',\n",
       " 'sOP WPM - Dr. Romo-psychopharm since 11/22/75',\n",
       " '\"In the context of the patient\\'s work up, she had Neuropsychological  testing 5/24/91.  She was referred because of her cognitive decline and symptoms of motor dysfunction to help characterize her current functioning. :',\n",
       " '6/13/92 CPT Code: 90791: No medical services',\n",
       " '7/11/71 SOS-10 Total Score:',\n",
       " '12/26/86 CPT Code: 90791: No medical services',\n",
       " '10/11/1987 CPT Code: 90791: No medical services',\n",
       " '3/14/95 Primary Care Doctor:',\n",
       " '12/01/73 CPT Code: 90791: No medical services',\n",
       " '0: 12/5/2010 Audit C Score Current:',\n",
       " '08/20/1982 SOS-10 Total Score:',\n",
       " '7/24/95 SOS-10 Total Score:',\n",
       " '8/06/83 CPT Code: 90791: No medical services',\n",
       " '02/22/92 SOS-10 Total Score:',\n",
       " '6/28/87 Impression Strengths/Abilities:',\n",
       " '07/29/1994 CPT code: 99203',\n",
       " '08/11/78 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
       " '10/29/91 Communication with referring physician?: Done',\n",
       " '7/6/91 SOS-10 Total Score:',\n",
       " 'onone as of 1/21/87 Protective Factors:',\n",
       " '24 yo right handed woman with history of large right frontal mass s/p resection 11/3/1985 who had recent urgent R cranial wound revision and placement of L EVD for declining vision and increased drainage from craniotomy incision site and possible infection. She has a hx of secondary mania related to psychosis and manipulation of her right frontal lobe.',\n",
       " '7/04/82 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
       " '4-13-89 Communication with referring physician?: Not Done',\n",
       " 'Lithium 0.25 (7/11/77).  LFTS wnl.  Urine tox neg.  Serum tox + fluoxetine 500; otherwise neg.  TSH 3.28.  BUN/Cr: 16/0.83.  Lipids unremarkable.  B12 363, Folate >20.  CBC: 4.9/36/308 Pertinent Medical Review of Systems Constitutional:',\n",
       " '4/12/74= 1Caffeine / Tobacco Use Caffeinated products: Yes',\n",
       " '09/19/81 CPT Code: 90792: With medical services',\n",
       " '9/6/79 Primary Care Doctor:',\n",
       " '12/5/87 Total time of visit (in minutes):',\n",
       " '01/05/1999 [report_end]',\n",
       " '4/22/80 SOS-10 Total Score:',\n",
       " '10/04/98 SOS-10 Total Score:',\n",
       " '.Mr. Echeverria described having panic-like experiences on at least three occasions. The first of these episodes occurred when facial meshing was placed on him during radiation therapy. He recalled the experience vividly and described \"freaking out\", involving a sensation of crushing in his chest, difficulty breathing, a fear that he could die, and intense emotional disquiet. The second episode occurred during the last 10-15 minutes of an MRI scan. Mr. Echeverria noted feeling surprised and fearful when the feelings of panic started. Similar physiological and psychological symptoms were reported as above. The third episode occurred recently on 6/29/81. Mr.Echeverria was driving his car and felt similar symptoms as described above. He reported fearing that he would crash his vehicle and subsequently pulled over and abandoned his vehicle.',\n",
       " '. Wile taking additional history, pt endorsed moderate depression and anxiety last 2 weeks, even though her period was on 8/04/78. Pt then acknowledged low to moderate mood and anxiety symptoms throughout the month, with premenstrual worsening. This mood pattern seems to have existed for many years, while premenstrual worsening increased in severity during last ~ 2 years, in context of likely peri menopause transition. Pt reported average of 3 hot flashes per day and 2 night sweats.',\n",
       " 'Death of mother; 7/07/1974 Meaningful activities/supports:',\n",
       " '09/14/2000 CPT Code: 90792: With medical services',\n",
       " '5/18/71 Total time of visit (in minutes):',\n",
       " '8/09/1981 Communication with referring physician?: Done',\n",
       " '6/05/93 CPT Code: 90791: No medical services',\n",
       " ')Dilantin (PHENYTOIN) 100 MG CAPSULE Take 2 Capsule(s) PO HS; No Change (Taking Differently), Comments: decreased from 290 daily to 260 mg daily due to elevated phenytoin level from 8/9/97',\n",
       " '12/8/82 Audit C=3Audit C Score Current:',\n",
       " '8/26/89 CPT Code: 90791: No medical services',\n",
       " '10/13/95 CPT Code: 90791: No medical services',\n",
       " '4/19/91 Communication with referring physician?: Not Done',\n",
       " '.APS - Psychiatry consult paged/requested in person at 04/08/2004 16:39 Patrick, Christian [hpp2]',\n",
       " '9/20/76 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
       " '12/08/1990 @11 am [report_end]',\n",
       " '4/11/1974 Chief Complaint / HPI Chief Complaint (Patients own words)',\n",
       " '7/18/86 SOS-10 Total Score:',\n",
       " '3/31/91 Communication with referring physician?: Done',\n",
       " '5/13/72 Other Adult Mental Health Outcomes Scales Used:',\n",
       " '011/14/83 Audit C Score Current:',\n",
       " '8/16/92 SOS-10 Total Score:',\n",
       " '10/05/97 CPT Code: 90791: No medical services',\n",
       " '07/18/2002 CPT Code: 90792: With medical services',\n",
       " '9/22/82 Total time of visit (in minutes):',\n",
       " '2/24/74 SOS-10 Total Score:',\n",
       " '(2/03/78) TSH-0.90 Activities of Daily Living (ADL) Bathing: Independent',\n",
       " '2/11/2006 CPT Code: 90791: No medical services',\n",
       " \"Pt is a 21 year old, single, heterosexual identified, Liechtenstein male. He resides in Rabat, in a rented apartment, with a roommate. He has not graduated from high school nor has he earned his GED. He has been working full time, steadily, since the age of 16. He recently left a job at jacobs engineering group where he worked for about 5 years. He started working at becton dickinson a few weeks ago. Referral to SMH was precipitated by pt seeing his PCP, Dr. Ford, for a physical and sharing concerns re: depression. He was having suicidal thoughts the week of 8/22/83 and noticed that his depression and anxiety symptoms were starting to interfere with work. Per Dr. Ford's recommendation, he contacted PAL for assistance. When screened for substance use, pt was referred to SMH for an intake. Pt was scheduled to be seen last week due to urgency of his depression and SI, but missed the intake appointment due to getting extremely intoxicated the night before. Pt presents today seeking individual therapy and psychiatry services to address longstanding depression.\",\n",
       " 'PET Scan (DPSH 5/04/74): 1) Marked hypometabolism involving the bilateral caudate nuclei. This is a non-specific finding, although this raises the question of a multisystem atrophy, specifically MSA-P (striatonigral degeneration), 2) Cortical hypometabolism involving the right inferior frontal gyrus. This is a non-specific finding of uncertain significance, and is not definitively related to a neurodegenerative process, although pathology in this region is not excluded',\n",
       " '7/20/2011 [report_end]',\n",
       " '6/17/95 Total time of visit (in minutes):',\n",
       " '6/10/72 SOS-10 Total Score:',\n",
       " 'nPt denied use to me but endorsed use on 10/16/82 as part of BVH initial visit summary supplement by Nicholas BenjaminOpiates: Yes',\n",
       " '12/15/92 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
       " '12/8/97 SOS-10 Total Score:',\n",
       " '4/05/89 Primary Care Doctor:',\n",
       " '12/04/87 SOS-10 Total Score:',\n",
       " '4 (6/20/77)Audit C Score Current:',\n",
       " 'see 4/27/2006 consult note Dr. GuevaraWhat factors in prior treatment were helpful/not helpful:',\n",
       " '07/17/92 CPT Code: 90791: No medical services',\n",
       " '12/22/98 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
       " '10/02/96 Age:',\n",
       " '11/05/90 CPT Code: 90792: With medical services',\n",
       " '5/04/77 CPT Code: 90792: With medical services',\n",
       " '2/27/96 Communication with referring physician?: Done']"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# What does the following code do?\n",
    "[x for x in datetxt_df if re.findall(r'\\d{1,2}[/|-]\\d{1,2}[/|-]\\d{2,4}',x)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['03/25/93 Total time of visit (in minutes):',\n",
       " '6/18/85 Primary Care Doctor:',\n",
       " 'sshe plans to move as of 7/8/71 In-Home Services: None',\n",
       " '7 on 9/27/75 Audit C Score Current:',\n",
       " '2/6/96 sleep studyPain Treatment Pain Level (Numeric Scale): 7',\n",
       " '.Per 7/06/79 Movement D/O note:',\n",
       " \"4, 5/18/78 Patient's thoughts about current substance abuse:\",\n",
       " '10/24/89 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
       " '3/7/86 SOS-10 Total Score:',\n",
       " '(4/10/71)Score-1Audit C Score Current:',\n",
       " '(5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC_12.6Activities of Daily Living (ADL) Bathing: Independent',\n",
       " '4/09/75 SOS-10 Total Score:',\n",
       " '8/01/98 Communication with referring physician?: Done',\n",
       " '1/26/72 Communication with referring physician?: Not Done',\n",
       " '5/24/1990 CPT Code: 90792: With medical services',\n",
       " '1/25/2011 CPT Code: 90792: With medical services',\n",
       " '4/12/82 Total time of visit (in minutes):',\n",
       " '1; 10/13/1976 Audit C Score, Highest/Date:',\n",
       " '4, 4/24/98 Relevant Drug History:',\n",
       " ') 59 yo unemployed w referred by Urgent Care for psychiatric evaluation and follow up. The patient reports she has been dx w BAD. Her main complaint today is anergy. Ms. Hartman was evaluated on one occasion 5/21/77. She was a cooperative but somewhat vague historian.History of Present Illness and Precipitating Events',\n",
       " '7/21/98 Total time of visit (in minutes):',\n",
       " '10/21/79 SOS-10 Total Score:',\n",
       " '3/03/90 CPT Code: 90792: With medical services',\n",
       " '2/11/76 CPT Code: 90792: With medical services',\n",
       " '07/25/1984 CPT Code: 90791: No medical services',\n",
       " '4-13-82 Other Child Mental Health Outcomes Scales Used:',\n",
       " '9/22/89 CPT Code: 90792: With medical services',\n",
       " '9/02/76 CPT Code: 90791: No medical services',\n",
       " '9/12/71 [report_end]',\n",
       " '10/24/86 Communication with referring physician?: Done',\n",
       " '03/31/1985 Total time of visit (in minutes):',\n",
       " '7/20/72 CPT Code: 90791: No medical services',\n",
       " '4/12/87= 1, negativeAudit C Score Current:',\n",
       " '06/20/91 Total time of visit (in minutes):',\n",
       " '5/12/2012 Primary Care Doctor:',\n",
       " '3/15/83 SOS-10 Total Score:',\n",
       " '2/14/73 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
       " '5/24/88 CPT Code: 90792: With medical services',\n",
       " '7/27/1986 Total time of visit (in minutes):',\n",
       " '1-14-81 Communication with referring physician?: Done',\n",
       " '7-29-75 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
       " '(6/24/87) TSH-2.18; Activities of Daily Living (ADL) Bathing: Independent',\n",
       " '8/14/94 Primary Care Doctor:',\n",
       " '4/13/2002 Primary Care Doctor:',\n",
       " '8/16/82 CPT Code: 90792: With medical services',\n",
       " '2/15/1998 Total time of visit (in minutes):',\n",
       " '7/15/91 CPT Code: 90792: With medical services',\n",
       " '06/12/94 SOS-10 Total Score:',\n",
       " '9/17/84 Communication with referring physician?: Done',\n",
       " '2/28/75 Other Adult Mental Health Outcomes Scales Used:',\n",
       " 'sOP WPM - Dr. Romo-psychopharm since 11/22/75',\n",
       " '\"In the context of the patient\\'s work up, she had Neuropsychological  testing 5/24/91.  She was referred because of her cognitive decline and symptoms of motor dysfunction to help characterize her current functioning. :',\n",
       " '6/13/92 CPT Code: 90791: No medical services',\n",
       " '7/11/71 SOS-10 Total Score:',\n",
       " '12/26/86 CPT Code: 90791: No medical services',\n",
       " '10/11/1987 CPT Code: 90791: No medical services',\n",
       " '3/14/95 Primary Care Doctor:',\n",
       " '12/01/73 CPT Code: 90791: No medical services',\n",
       " '0: 12/5/2010 Audit C Score Current:',\n",
       " '08/20/1982 SOS-10 Total Score:',\n",
       " '7/24/95 SOS-10 Total Score:',\n",
       " '8/06/83 CPT Code: 90791: No medical services',\n",
       " '02/22/92 SOS-10 Total Score:',\n",
       " '6/28/87 Impression Strengths/Abilities:',\n",
       " '07/29/1994 CPT code: 99203',\n",
       " '08/11/78 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
       " '10/29/91 Communication with referring physician?: Done',\n",
       " '7/6/91 SOS-10 Total Score:',\n",
       " 'onone as of 1/21/87 Protective Factors:',\n",
       " '24 yo right handed woman with history of large right frontal mass s/p resection 11/3/1985 who had recent urgent R cranial wound revision and placement of L EVD for declining vision and increased drainage from craniotomy incision site and possible infection. She has a hx of secondary mania related to psychosis and manipulation of her right frontal lobe.',\n",
       " '7/04/82 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
       " '4-13-89 Communication with referring physician?: Not Done',\n",
       " 'Lithium 0.25 (7/11/77).  LFTS wnl.  Urine tox neg.  Serum tox + fluoxetine 500; otherwise neg.  TSH 3.28.  BUN/Cr: 16/0.83.  Lipids unremarkable.  B12 363, Folate >20.  CBC: 4.9/36/308 Pertinent Medical Review of Systems Constitutional:',\n",
       " '4/12/74= 1Caffeine / Tobacco Use Caffeinated products: Yes',\n",
       " '09/19/81 CPT Code: 90792: With medical services',\n",
       " '9/6/79 Primary Care Doctor:',\n",
       " '12/5/87 Total time of visit (in minutes):',\n",
       " '01/05/1999 [report_end]',\n",
       " '4/22/80 SOS-10 Total Score:',\n",
       " '10/04/98 SOS-10 Total Score:',\n",
       " '.Mr. Echeverria described having panic-like experiences on at least three occasions. The first of these episodes occurred when facial meshing was placed on him during radiation therapy. He recalled the experience vividly and described \"freaking out\", involving a sensation of crushing in his chest, difficulty breathing, a fear that he could die, and intense emotional disquiet. The second episode occurred during the last 10-15 minutes of an MRI scan. Mr. Echeverria noted feeling surprised and fearful when the feelings of panic started. Similar physiological and psychological symptoms were reported as above. The third episode occurred recently on 6/29/81. Mr.Echeverria was driving his car and felt similar symptoms as described above. He reported fearing that he would crash his vehicle and subsequently pulled over and abandoned his vehicle.',\n",
       " '. Wile taking additional history, pt endorsed moderate depression and anxiety last 2 weeks, even though her period was on 8/04/78. Pt then acknowledged low to moderate mood and anxiety symptoms throughout the month, with premenstrual worsening. This mood pattern seems to have existed for many years, while premenstrual worsening increased in severity during last ~ 2 years, in context of likely peri menopause transition. Pt reported average of 3 hot flashes per day and 2 night sweats.',\n",
       " 'Death of mother; 7/07/1974 Meaningful activities/supports:',\n",
       " '09/14/2000 CPT Code: 90792: With medical services',\n",
       " '5/18/71 Total time of visit (in minutes):',\n",
       " '8/09/1981 Communication with referring physician?: Done',\n",
       " '6/05/93 CPT Code: 90791: No medical services',\n",
       " ')Dilantin (PHENYTOIN) 100 MG CAPSULE Take 2 Capsule(s) PO HS; No Change (Taking Differently), Comments: decreased from 290 daily to 260 mg daily due to elevated phenytoin level from 8/9/97',\n",
       " '12/8/82 Audit C=3Audit C Score Current:',\n",
       " '8/26/89 CPT Code: 90791: No medical services',\n",
       " '10/13/95 CPT Code: 90791: No medical services',\n",
       " '4/19/91 Communication with referring physician?: Not Done',\n",
       " '.APS - Psychiatry consult paged/requested in person at 04/08/2004 16:39 Patrick, Christian [hpp2]',\n",
       " '9/20/76 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
       " '12/08/1990 @11 am [report_end]',\n",
       " '4/11/1974 Chief Complaint / HPI Chief Complaint (Patients own words)',\n",
       " '7/18/86 SOS-10 Total Score:',\n",
       " '3/31/91 Communication with referring physician?: Done',\n",
       " '5/13/72 Other Adult Mental Health Outcomes Scales Used:',\n",
       " '011/14/83 Audit C Score Current:',\n",
       " '8/16/92 SOS-10 Total Score:',\n",
       " '10/05/97 CPT Code: 90791: No medical services',\n",
       " '07/18/2002 CPT Code: 90792: With medical services',\n",
       " '9/22/82 Total time of visit (in minutes):',\n",
       " '2/24/74 SOS-10 Total Score:',\n",
       " '(2/03/78) TSH-0.90 Activities of Daily Living (ADL) Bathing: Independent',\n",
       " '2/11/2006 CPT Code: 90791: No medical services',\n",
       " \"Pt is a 21 year old, single, heterosexual identified, Liechtenstein male. He resides in Rabat, in a rented apartment, with a roommate. He has not graduated from high school nor has he earned his GED. He has been working full time, steadily, since the age of 16. He recently left a job at jacobs engineering group where he worked for about 5 years. He started working at becton dickinson a few weeks ago. Referral to SMH was precipitated by pt seeing his PCP, Dr. Ford, for a physical and sharing concerns re: depression. He was having suicidal thoughts the week of 8/22/83 and noticed that his depression and anxiety symptoms were starting to interfere with work. Per Dr. Ford's recommendation, he contacted PAL for assistance. When screened for substance use, pt was referred to SMH for an intake. Pt was scheduled to be seen last week due to urgency of his depression and SI, but missed the intake appointment due to getting extremely intoxicated the night before. Pt presents today seeking individual therapy and psychiatry services to address longstanding depression.\",\n",
       " 'PET Scan (DPSH 5/04/74): 1) Marked hypometabolism involving the bilateral caudate nuclei. This is a non-specific finding, although this raises the question of a multisystem atrophy, specifically MSA-P (striatonigral degeneration), 2) Cortical hypometabolism involving the right inferior frontal gyrus. This is a non-specific finding of uncertain significance, and is not definitively related to a neurodegenerative process, although pathology in this region is not excluded',\n",
       " '7/20/2011 [report_end]',\n",
       " '6/17/95 Total time of visit (in minutes):',\n",
       " '6/10/72 SOS-10 Total Score:',\n",
       " 'nPt denied use to me but endorsed use on 10/16/82 as part of BVH initial visit summary supplement by Nicholas BenjaminOpiates: Yes',\n",
       " '12/15/92 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
       " '12/8/97 SOS-10 Total Score:',\n",
       " '4/05/89 Primary Care Doctor:',\n",
       " '12/04/87 SOS-10 Total Score:',\n",
       " '4 (6/20/77)Audit C Score Current:',\n",
       " 'see 4/27/2006 consult note Dr. GuevaraWhat factors in prior treatment were helpful/not helpful:',\n",
       " '07/17/92 CPT Code: 90791: No medical services',\n",
       " '12/22/98 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
       " '10/02/96 Age:',\n",
       " '11/05/90 CPT Code: 90792: With medical services',\n",
       " '5/04/77 CPT Code: 90792: With medical services',\n",
       " '2/27/96 Communication with referring physician?: Done']"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[x for x in datetxt_df if re.search(r'\\d{1,2}[/|-]\\d{1,2}[/|-]\\d{2,4}',x)]\n",
    "\n",
    "# Search returns the first occurrence, while findall finds all the (non-overlapping) matches.\n",
    "# For our usage here, either suffices.\n",
    "# Even re.match() would suffice for the current purpose. \n",
    "# Check the subtle differences between match/search/findall for yourself! "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### pandas.Series.str.extract & extractall\n",
    "\n",
    "Extract captures groups in the regex pattern as columns in DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "scrolled": true,
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>03/25/93</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>6/18/85</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>7/8/71</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>9/27/75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2/6/96</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>495</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>496</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>497</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>498</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>499</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>500 rows × 1 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "            0\n",
       "0    03/25/93\n",
       "1     6/18/85\n",
       "2      7/8/71\n",
       "3     9/27/75\n",
       "4      2/6/96\n",
       "..        ...\n",
       "495       NaN\n",
       "496       NaN\n",
       "497       NaN\n",
       "498       NaN\n",
       "499       NaN\n",
       "\n",
       "[500 rows x 1 columns]"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Extract: Grouping done using ()\n",
    "datetxt_df.str.extract(r'(\\d{1,2}[/|-]\\d{1,2}[/|-]\\d{2,4})')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>match</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <th>0</th>\n",
       "      <td>03/25/93</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <th>0</th>\n",
       "      <td>6/18/85</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <th>0</th>\n",
       "      <td>7/8/71</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <th>0</th>\n",
       "      <td>9/27/75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <th>0</th>\n",
       "      <td>2/6/96</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>120</th>\n",
       "      <th>0</th>\n",
       "      <td>12/22/98</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>121</th>\n",
       "      <th>0</th>\n",
       "      <td>10/02/96</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>122</th>\n",
       "      <th>0</th>\n",
       "      <td>11/05/90</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>123</th>\n",
       "      <th>0</th>\n",
       "      <td>5/04/77</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>124</th>\n",
       "      <th>0</th>\n",
       "      <td>2/27/96</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>126 rows × 1 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                  0\n",
       "    match          \n",
       "0   0      03/25/93\n",
       "1   0       6/18/85\n",
       "2   0        7/8/71\n",
       "3   0       9/27/75\n",
       "4   0        2/6/96\n",
       "...             ...\n",
       "120 0      12/22/98\n",
       "121 0      10/02/96\n",
       "122 0      11/05/90\n",
       "123 0       5/04/77\n",
       "124 0       2/27/96\n",
       "\n",
       "[126 rows x 1 columns]"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Extract all\n",
    "datetxt_df.str.extractall(r'(\\d{1,2}[/|-]\\d{1,2}[/|-]\\d{2,4})')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>match</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <th>0</th>\n",
       "      <td>03</td>\n",
       "      <td>25</td>\n",
       "      <td>93</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <th>0</th>\n",
       "      <td>6</td>\n",
       "      <td>18</td>\n",
       "      <td>85</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <th>0</th>\n",
       "      <td>7</td>\n",
       "      <td>8</td>\n",
       "      <td>71</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <th>0</th>\n",
       "      <td>9</td>\n",
       "      <td>27</td>\n",
       "      <td>75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "      <td>96</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>120</th>\n",
       "      <th>0</th>\n",
       "      <td>12</td>\n",
       "      <td>22</td>\n",
       "      <td>98</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>121</th>\n",
       "      <th>0</th>\n",
       "      <td>10</td>\n",
       "      <td>02</td>\n",
       "      <td>96</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>122</th>\n",
       "      <th>0</th>\n",
       "      <td>11</td>\n",
       "      <td>05</td>\n",
       "      <td>90</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>123</th>\n",
       "      <th>0</th>\n",
       "      <td>5</td>\n",
       "      <td>04</td>\n",
       "      <td>77</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>124</th>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>27</td>\n",
       "      <td>96</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>126 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "            0   1   2\n",
       "    match            \n",
       "0   0      03  25  93\n",
       "1   0       6  18  85\n",
       "2   0       7   8  71\n",
       "3   0       9  27  75\n",
       "4   0       2   6  96\n",
       "...        ..  ..  ..\n",
       "120 0      12  22  98\n",
       "121 0      10  02  96\n",
       "122 0      11  05  90\n",
       "123 0       5  04  77\n",
       "124 0       2  27  96\n",
       "\n",
       "[126 rows x 3 columns]"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Further grouping with () \n",
    "datetxt_df.str.extractall(r'(\\d{1,2})[/|-](\\d{1,2})[/|-](\\d{2,4})')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>match</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <th>0</th>\n",
       "      <td>03/25/93</td>\n",
       "      <td>03</td>\n",
       "      <td>25/93</td>\n",
       "      <td>25</td>\n",
       "      <td>93</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <th>0</th>\n",
       "      <td>6/18/85</td>\n",
       "      <td>6</td>\n",
       "      <td>18/85</td>\n",
       "      <td>18</td>\n",
       "      <td>85</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <th>0</th>\n",
       "      <td>7/8/71</td>\n",
       "      <td>7</td>\n",
       "      <td>8/71</td>\n",
       "      <td>8</td>\n",
       "      <td>71</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <th>0</th>\n",
       "      <td>9/27/75</td>\n",
       "      <td>9</td>\n",
       "      <td>27/75</td>\n",
       "      <td>27</td>\n",
       "      <td>75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <th>0</th>\n",
       "      <td>2/6/96</td>\n",
       "      <td>2</td>\n",
       "      <td>6/96</td>\n",
       "      <td>6</td>\n",
       "      <td>96</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>120</th>\n",
       "      <th>0</th>\n",
       "      <td>12/22/98</td>\n",
       "      <td>12</td>\n",
       "      <td>22/98</td>\n",
       "      <td>22</td>\n",
       "      <td>98</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>121</th>\n",
       "      <th>0</th>\n",
       "      <td>10/02/96</td>\n",
       "      <td>10</td>\n",
       "      <td>02/96</td>\n",
       "      <td>02</td>\n",
       "      <td>96</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>122</th>\n",
       "      <th>0</th>\n",
       "      <td>11/05/90</td>\n",
       "      <td>11</td>\n",
       "      <td>05/90</td>\n",
       "      <td>05</td>\n",
       "      <td>90</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>123</th>\n",
       "      <th>0</th>\n",
       "      <td>5/04/77</td>\n",
       "      <td>5</td>\n",
       "      <td>04/77</td>\n",
       "      <td>04</td>\n",
       "      <td>77</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>124</th>\n",
       "      <th>0</th>\n",
       "      <td>2/27/96</td>\n",
       "      <td>2</td>\n",
       "      <td>27/96</td>\n",
       "      <td>27</td>\n",
       "      <td>96</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>126 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                  0   1      2   3   4\n",
       "    match                             \n",
       "0   0      03/25/93  03  25/93  25  93\n",
       "1   0       6/18/85   6  18/85  18  85\n",
       "2   0        7/8/71   7   8/71   8  71\n",
       "3   0       9/27/75   9  27/75  27  75\n",
       "4   0        2/6/96   2   6/96   6  96\n",
       "...             ...  ..    ...  ..  ..\n",
       "120 0      12/22/98  12  22/98  22  98\n",
       "121 0      10/02/96  10  02/96  02  96\n",
       "122 0      11/05/90  11  05/90  05  90\n",
       "123 0       5/04/77   5  04/77  04  77\n",
       "124 0       2/27/96   2  27/96  27  96\n",
       "\n",
       "[126 rows x 5 columns]"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Grouping can be carried out in a nested fashion!\n",
    "datetxt_df.str.extractall(r'((\\d{1,2})[/|-]((\\d{1,2})[/|-](\\d{2,4})))')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Date</th>\n",
       "      <th>Month</th>\n",
       "      <th>Day</th>\n",
       "      <th>Year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>03/25/93</td>\n",
       "      <td>03</td>\n",
       "      <td>25</td>\n",
       "      <td>93</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>6/18/85</td>\n",
       "      <td>6</td>\n",
       "      <td>18</td>\n",
       "      <td>85</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>7/8/71</td>\n",
       "      <td>7</td>\n",
       "      <td>8</td>\n",
       "      <td>71</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>9/27/75</td>\n",
       "      <td>9</td>\n",
       "      <td>27</td>\n",
       "      <td>75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2/6/96</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "      <td>96</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>495</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>496</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>497</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>498</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>499</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>500 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         Date Month  Day Year\n",
       "0    03/25/93    03   25   93\n",
       "1     6/18/85     6   18   85\n",
       "2      7/8/71     7    8   71\n",
       "3     9/27/75     9   27   75\n",
       "4      2/6/96     2    6   96\n",
       "..        ...   ...  ...  ...\n",
       "495       NaN   NaN  NaN  NaN\n",
       "496       NaN   NaN  NaN  NaN\n",
       "497       NaN   NaN  NaN  NaN\n",
       "498       NaN   NaN  NaN  NaN\n",
       "499       NaN   NaN  NaN  NaN\n",
       "\n",
       "[500 rows x 4 columns]"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Named groups ?P<NamedGroup> \n",
    "datetxt_df.str.extract(r'(?P<Date>(?P<Month>\\d{1,2})[/|-](?P<Day>\\d{1,2})[/|-](?P<Year>\\d{2,4}))')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "scrolled": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Original_txt</th>\n",
       "      <th>Date</th>\n",
       "      <th>Month</th>\n",
       "      <th>Day</th>\n",
       "      <th>Year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>03/25/93 Total time of visit (in minutes):</td>\n",
       "      <td>03/25/93</td>\n",
       "      <td>03</td>\n",
       "      <td>25</td>\n",
       "      <td>93</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>6/18/85 Primary Care Doctor:</td>\n",
       "      <td>6/18/85</td>\n",
       "      <td>6</td>\n",
       "      <td>18</td>\n",
       "      <td>85</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>sshe plans to move as of 7/8/71 In-Home Servic...</td>\n",
       "      <td>7/8/71</td>\n",
       "      <td>7</td>\n",
       "      <td>8</td>\n",
       "      <td>71</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>7 on 9/27/75 Audit C Score Current:</td>\n",
       "      <td>9/27/75</td>\n",
       "      <td>9</td>\n",
       "      <td>27</td>\n",
       "      <td>75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2/6/96 sleep studyPain Treatment Pain Level (N...</td>\n",
       "      <td>2/6/96</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "      <td>96</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>495</th>\n",
       "      <td>1979 Family Psych History: Family History of S...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>496</th>\n",
       "      <td>therapist and friend died in ~2006 Parental/Ca...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>497</th>\n",
       "      <td>2008 partial thyroidectomy</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>498</th>\n",
       "      <td>sPt describes a history of sexual abuse as a c...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>499</th>\n",
       "      <td>. In 1980, patient was living in Naples and de...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>500 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                          Original_txt      Date Month  Day  \\\n",
       "0           03/25/93 Total time of visit (in minutes):  03/25/93    03   25   \n",
       "1                         6/18/85 Primary Care Doctor:   6/18/85     6   18   \n",
       "2    sshe plans to move as of 7/8/71 In-Home Servic...    7/8/71     7    8   \n",
       "3                  7 on 9/27/75 Audit C Score Current:   9/27/75     9   27   \n",
       "4    2/6/96 sleep studyPain Treatment Pain Level (N...    2/6/96     2    6   \n",
       "..                                                 ...       ...   ...  ...   \n",
       "495  1979 Family Psych History: Family History of S...       NaN   NaN  NaN   \n",
       "496  therapist and friend died in ~2006 Parental/Ca...       NaN   NaN  NaN   \n",
       "497                         2008 partial thyroidectomy       NaN   NaN  NaN   \n",
       "498  sPt describes a history of sexual abuse as a c...       NaN   NaN  NaN   \n",
       "499  . In 1980, patient was living in Naples and de...       NaN   NaN  NaN   \n",
       "\n",
       "    Year  \n",
       "0     93  \n",
       "1     85  \n",
       "2     71  \n",
       "3     75  \n",
       "4     96  \n",
       "..   ...  \n",
       "495  NaN  \n",
       "496  NaN  \n",
       "497  NaN  \n",
       "498  NaN  \n",
       "499  NaN  \n",
       "\n",
       "[500 rows x 5 columns]"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# It can come handy to add derived columns, e.g. with a .join method to join Dataframes\n",
    "# Since datetxt_df is a Pandas Series, we first need to convert it into a Dataframe\n",
    "datetxt_df.to_frame(name=\"Original_txt\")\\\n",
    ".join(datetxt_df.str.extract\\\n",
    "(r'(?P<Date>(?P<Month>\\d{1,2})[/|-](?P<Day>\\d{1,2})[/|-](?P<Year>\\d{2,4}))'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Putting a few of the pieces together\n",
    "- Prelude to Ungraded task 2.2\n",
    "    - A bit of data cleaning \n",
    "    - A bit of missing data handling\n",
    "    - Doing so with RegEx "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Country</th>\n",
       "      <th>Region</th>\n",
       "      <th>Population</th>\n",
       "      <th>Area (sq. mi.)</th>\n",
       "      <th>Pop. Density (per sq. mi.)</th>\n",
       "      <th>Coastline (coast/area ratio)</th>\n",
       "      <th>Net migration</th>\n",
       "      <th>Infant mortality (per 1000 births)</th>\n",
       "      <th>GDP ($ per capita)</th>\n",
       "      <th>Literacy (%)</th>\n",
       "      <th>Phones (per 1000)</th>\n",
       "      <th>Arable (%)</th>\n",
       "      <th>Crops (%)</th>\n",
       "      <th>Other (%)</th>\n",
       "      <th>Climate</th>\n",
       "      <th>Birthrate</th>\n",
       "      <th>Deathrate</th>\n",
       "      <th>Agriculture</th>\n",
       "      <th>Industry</th>\n",
       "      <th>Service</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>221</th>\n",
       "      <td>Wallis and Futuna</td>\n",
       "      <td>OCEANIA</td>\n",
       "      <td>16025</td>\n",
       "      <td>274</td>\n",
       "      <td>58,5</td>\n",
       "      <td>47,08</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3700.0</td>\n",
       "      <td>50,0</td>\n",
       "      <td>118,6</td>\n",
       "      <td>5</td>\n",
       "      <td>25</td>\n",
       "      <td>70</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>222</th>\n",
       "      <td>West Bank</td>\n",
       "      <td>NEAR EAST</td>\n",
       "      <td>2460492</td>\n",
       "      <td>5860</td>\n",
       "      <td>419,9</td>\n",
       "      <td>0,00</td>\n",
       "      <td>2,98</td>\n",
       "      <td>19,62</td>\n",
       "      <td>800.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>145,2</td>\n",
       "      <td>16,9</td>\n",
       "      <td>18,97</td>\n",
       "      <td>64,13</td>\n",
       "      <td>3</td>\n",
       "      <td>31,67</td>\n",
       "      <td>3,92</td>\n",
       "      <td>0,09</td>\n",
       "      <td>0,28</td>\n",
       "      <td>0,63</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>223</th>\n",
       "      <td>Western Sahara</td>\n",
       "      <td>NORTHERN AFRICA</td>\n",
       "      <td>273008</td>\n",
       "      <td>266000</td>\n",
       "      <td>1,0</td>\n",
       "      <td>0,42</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0,02</td>\n",
       "      <td>0</td>\n",
       "      <td>99,98</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0,4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>224</th>\n",
       "      <td>Yemen</td>\n",
       "      <td>NEAR EAST</td>\n",
       "      <td>21456188</td>\n",
       "      <td>527970</td>\n",
       "      <td>40,6</td>\n",
       "      <td>0,36</td>\n",
       "      <td>0</td>\n",
       "      <td>61,5</td>\n",
       "      <td>800.0</td>\n",
       "      <td>50,2</td>\n",
       "      <td>37,2</td>\n",
       "      <td>2,78</td>\n",
       "      <td>0,24</td>\n",
       "      <td>96,98</td>\n",
       "      <td>1</td>\n",
       "      <td>42,89</td>\n",
       "      <td>8,3</td>\n",
       "      <td>0,135</td>\n",
       "      <td>0,472</td>\n",
       "      <td>0,393</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>225</th>\n",
       "      <td>Zambia</td>\n",
       "      <td>SUB-SAHARAN AFRICA</td>\n",
       "      <td>11502010</td>\n",
       "      <td>752614</td>\n",
       "      <td>15,3</td>\n",
       "      <td>0,00</td>\n",
       "      <td>0</td>\n",
       "      <td>88,29</td>\n",
       "      <td>800.0</td>\n",
       "      <td>80,6</td>\n",
       "      <td>8,2</td>\n",
       "      <td>7,08</td>\n",
       "      <td>0,03</td>\n",
       "      <td>92,9</td>\n",
       "      <td>2</td>\n",
       "      <td>41</td>\n",
       "      <td>19,93</td>\n",
       "      <td>0,22</td>\n",
       "      <td>0,29</td>\n",
       "      <td>0,489</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>226</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>SUB-SAHARAN AFRICA</td>\n",
       "      <td>12236805</td>\n",
       "      <td>390580</td>\n",
       "      <td>31,3</td>\n",
       "      <td>0,00</td>\n",
       "      <td>0</td>\n",
       "      <td>67,69</td>\n",
       "      <td>1900.0</td>\n",
       "      <td>90,7</td>\n",
       "      <td>26,8</td>\n",
       "      <td>8,32</td>\n",
       "      <td>0,34</td>\n",
       "      <td>91,34</td>\n",
       "      <td>2</td>\n",
       "      <td>28,01</td>\n",
       "      <td>21,84</td>\n",
       "      <td>0,179</td>\n",
       "      <td>0,243</td>\n",
       "      <td>0,579</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               Country                               Region  Population  \\\n",
       "221  Wallis and Futuna  OCEANIA                                   16025   \n",
       "222          West Bank  NEAR EAST                               2460492   \n",
       "223     Western Sahara  NORTHERN AFRICA                          273008   \n",
       "224              Yemen  NEAR EAST                              21456188   \n",
       "225             Zambia  SUB-SAHARAN AFRICA                     11502010   \n",
       "226           Zimbabwe  SUB-SAHARAN AFRICA                     12236805   \n",
       "\n",
       "     Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio)  \\\n",
       "221             274                       58,5                        47,08   \n",
       "222            5860                      419,9                         0,00   \n",
       "223          266000                        1,0                         0,42   \n",
       "224          527970                       40,6                         0,36   \n",
       "225          752614                       15,3                         0,00   \n",
       "226          390580                       31,3                         0,00   \n",
       "\n",
       "    Net migration Infant mortality (per 1000 births)  GDP ($ per capita)  \\\n",
       "221           NaN                                NaN              3700.0   \n",
       "222          2,98                              19,62               800.0   \n",
       "223           NaN                                NaN                 NaN   \n",
       "224             0                               61,5               800.0   \n",
       "225             0                              88,29               800.0   \n",
       "226             0                              67,69              1900.0   \n",
       "\n",
       "    Literacy (%) Phones (per 1000) Arable (%) Crops (%) Other (%) Climate  \\\n",
       "221         50,0             118,6          5        25        70       2   \n",
       "222          NaN             145,2       16,9     18,97     64,13       3   \n",
       "223          NaN               NaN       0,02         0     99,98       1   \n",
       "224         50,2              37,2       2,78      0,24     96,98       1   \n",
       "225         80,6               8,2       7,08      0,03      92,9       2   \n",
       "226         90,7              26,8       8,32      0,34     91,34       2   \n",
       "\n",
       "    Birthrate Deathrate Agriculture Industry Service  \n",
       "221       NaN       NaN         NaN      NaN     NaN  \n",
       "222     31,67      3,92        0,09     0,28    0,63  \n",
       "223       NaN       NaN         NaN      NaN     0,4  \n",
       "224     42,89       8,3       0,135    0,472   0,393  \n",
       "225        41     19,93        0,22     0,29   0,489  \n",
       "226     28,01     21,84       0,179    0,243   0,579  "
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Let's go back to the CountriesOfTheWorld data.\n",
    "# We previously loaded it in the CoTW_df DataFrame.\n",
    "# We see lot's of missing data. \n",
    "# We had noticed previously that some of the numeric data is stored improperly (as string objects). \n",
    "CoTW_df.tail(6)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Country                                object\n",
       "Region                                 object\n",
       "Population                              int64\n",
       "Area (sq. mi.)                          int64\n",
       "Pop. Density (per sq. mi.)             object\n",
       "Coastline (coast/area ratio)           object\n",
       "Net migration                          object\n",
       "Infant mortality (per 1000 births)     object\n",
       "GDP ($ per capita)                    float64\n",
       "Literacy (%)                           object\n",
       "Phones (per 1000)                      object\n",
       "Arable (%)                             object\n",
       "Crops (%)                              object\n",
       "Other (%)                              object\n",
       "Climate                                object\n",
       "Birthrate                              object\n",
       "Deathrate                              object\n",
       "Agriculture                            object\n",
       "Industry                               object\n",
       "Service                                object\n",
       "dtype: object"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Let's check the dtypes to confirm this.\n",
    "CoTW_df.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Country</th>\n",
       "      <th>Region</th>\n",
       "      <th>Population</th>\n",
       "      <th>Area (sq. mi.)</th>\n",
       "      <th>Pop. Density (per sq. mi.)</th>\n",
       "      <th>Coastline (coast/area ratio)</th>\n",
       "      <th>Net migration</th>\n",
       "      <th>Infant mortality (per 1000 births)</th>\n",
       "      <th>GDP ($ per capita)</th>\n",
       "      <th>Literacy (%)</th>\n",
       "      <th>Phones (per 1000)</th>\n",
       "      <th>Arable (%)</th>\n",
       "      <th>Crops (%)</th>\n",
       "      <th>Other (%)</th>\n",
       "      <th>Climate</th>\n",
       "      <th>Birthrate</th>\n",
       "      <th>Deathrate</th>\n",
       "      <th>Agriculture</th>\n",
       "      <th>Industry</th>\n",
       "      <th>Service</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>217</th>\n",
       "      <td>Vanuatu</td>\n",
       "      <td>OCEANIA</td>\n",
       "      <td>208869</td>\n",
       "      <td>12200</td>\n",
       "      <td>17,1</td>\n",
       "      <td>20,72</td>\n",
       "      <td>0</td>\n",
       "      <td>55,16</td>\n",
       "      <td>2900.0</td>\n",
       "      <td>53.0</td>\n",
       "      <td>32,6</td>\n",
       "      <td>2,46</td>\n",
       "      <td>7,38</td>\n",
       "      <td>90,16</td>\n",
       "      <td>2</td>\n",
       "      <td>22,72</td>\n",
       "      <td>7,82</td>\n",
       "      <td>0,26</td>\n",
       "      <td>0,12</td>\n",
       "      <td>0,62</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>218</th>\n",
       "      <td>Venezuela</td>\n",
       "      <td>LATIN AMER. &amp; CARIB</td>\n",
       "      <td>25730435</td>\n",
       "      <td>912050</td>\n",
       "      <td>28,2</td>\n",
       "      <td>0,31</td>\n",
       "      <td>-0,04</td>\n",
       "      <td>22,2</td>\n",
       "      <td>4800.0</td>\n",
       "      <td>93.4</td>\n",
       "      <td>140,1</td>\n",
       "      <td>2,95</td>\n",
       "      <td>0,92</td>\n",
       "      <td>96,13</td>\n",
       "      <td>2</td>\n",
       "      <td>18,71</td>\n",
       "      <td>4,92</td>\n",
       "      <td>0,04</td>\n",
       "      <td>0,419</td>\n",
       "      <td>0,541</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>219</th>\n",
       "      <td>Vietnam</td>\n",
       "      <td>ASIA (EX. NEAR EAST)</td>\n",
       "      <td>84402966</td>\n",
       "      <td>329560</td>\n",
       "      <td>256,1</td>\n",
       "      <td>1,05</td>\n",
       "      <td>-0,45</td>\n",
       "      <td>25,95</td>\n",
       "      <td>2500.0</td>\n",
       "      <td>90.3</td>\n",
       "      <td>187,7</td>\n",
       "      <td>19,97</td>\n",
       "      <td>5,95</td>\n",
       "      <td>74,08</td>\n",
       "      <td>2</td>\n",
       "      <td>16,86</td>\n",
       "      <td>6,22</td>\n",
       "      <td>0,209</td>\n",
       "      <td>0,41</td>\n",
       "      <td>0,381</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>220</th>\n",
       "      <td>Virgin Islands</td>\n",
       "      <td>LATIN AMER. &amp; CARIB</td>\n",
       "      <td>108605</td>\n",
       "      <td>1910</td>\n",
       "      <td>56,9</td>\n",
       "      <td>9,84</td>\n",
       "      <td>-8,94</td>\n",
       "      <td>8,03</td>\n",
       "      <td>17200.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>652,8</td>\n",
       "      <td>11,76</td>\n",
       "      <td>2,94</td>\n",
       "      <td>85,3</td>\n",
       "      <td>2</td>\n",
       "      <td>13,96</td>\n",
       "      <td>6,43</td>\n",
       "      <td>0,01</td>\n",
       "      <td>0,19</td>\n",
       "      <td>0,8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>221</th>\n",
       "      <td>Wallis and Futuna</td>\n",
       "      <td>OCEANIA</td>\n",
       "      <td>16025</td>\n",
       "      <td>274</td>\n",
       "      <td>58,5</td>\n",
       "      <td>47,08</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3700.0</td>\n",
       "      <td>50.0</td>\n",
       "      <td>118,6</td>\n",
       "      <td>5</td>\n",
       "      <td>25</td>\n",
       "      <td>70</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>222</th>\n",
       "      <td>West Bank</td>\n",
       "      <td>NEAR EAST</td>\n",
       "      <td>2460492</td>\n",
       "      <td>5860</td>\n",
       "      <td>419,9</td>\n",
       "      <td>0,00</td>\n",
       "      <td>2,98</td>\n",
       "      <td>19,62</td>\n",
       "      <td>800.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>145,2</td>\n",
       "      <td>16,9</td>\n",
       "      <td>18,97</td>\n",
       "      <td>64,13</td>\n",
       "      <td>3</td>\n",
       "      <td>31,67</td>\n",
       "      <td>3,92</td>\n",
       "      <td>0,09</td>\n",
       "      <td>0,28</td>\n",
       "      <td>0,63</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>223</th>\n",
       "      <td>Western Sahara</td>\n",
       "      <td>NORTHERN AFRICA</td>\n",
       "      <td>273008</td>\n",
       "      <td>266000</td>\n",
       "      <td>1,0</td>\n",
       "      <td>0,42</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0,02</td>\n",
       "      <td>0</td>\n",
       "      <td>99,98</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0,4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>224</th>\n",
       "      <td>Yemen</td>\n",
       "      <td>NEAR EAST</td>\n",
       "      <td>21456188</td>\n",
       "      <td>527970</td>\n",
       "      <td>40,6</td>\n",
       "      <td>0,36</td>\n",
       "      <td>0</td>\n",
       "      <td>61,5</td>\n",
       "      <td>800.0</td>\n",
       "      <td>50.2</td>\n",
       "      <td>37,2</td>\n",
       "      <td>2,78</td>\n",
       "      <td>0,24</td>\n",
       "      <td>96,98</td>\n",
       "      <td>1</td>\n",
       "      <td>42,89</td>\n",
       "      <td>8,3</td>\n",
       "      <td>0,135</td>\n",
       "      <td>0,472</td>\n",
       "      <td>0,393</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>225</th>\n",
       "      <td>Zambia</td>\n",
       "      <td>SUB-SAHARAN AFRICA</td>\n",
       "      <td>11502010</td>\n",
       "      <td>752614</td>\n",
       "      <td>15,3</td>\n",
       "      <td>0,00</td>\n",
       "      <td>0</td>\n",
       "      <td>88,29</td>\n",
       "      <td>800.0</td>\n",
       "      <td>80.6</td>\n",
       "      <td>8,2</td>\n",
       "      <td>7,08</td>\n",
       "      <td>0,03</td>\n",
       "      <td>92,9</td>\n",
       "      <td>2</td>\n",
       "      <td>41</td>\n",
       "      <td>19,93</td>\n",
       "      <td>0,22</td>\n",
       "      <td>0,29</td>\n",
       "      <td>0,489</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>226</th>\n",
       "      <td>Zimbabwe</td>\n",
       "      <td>SUB-SAHARAN AFRICA</td>\n",
       "      <td>12236805</td>\n",
       "      <td>390580</td>\n",
       "      <td>31,3</td>\n",
       "      <td>0,00</td>\n",
       "      <td>0</td>\n",
       "      <td>67,69</td>\n",
       "      <td>1900.0</td>\n",
       "      <td>90.7</td>\n",
       "      <td>26,8</td>\n",
       "      <td>8,32</td>\n",
       "      <td>0,34</td>\n",
       "      <td>91,34</td>\n",
       "      <td>2</td>\n",
       "      <td>28,01</td>\n",
       "      <td>21,84</td>\n",
       "      <td>0,179</td>\n",
       "      <td>0,243</td>\n",
       "      <td>0,579</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               Country                               Region  Population  \\\n",
       "217            Vanuatu  OCEANIA                                  208869   \n",
       "218          Venezuela              LATIN AMER. & CARIB        25730435   \n",
       "219            Vietnam        ASIA (EX. NEAR EAST)             84402966   \n",
       "220     Virgin Islands              LATIN AMER. & CARIB          108605   \n",
       "221  Wallis and Futuna  OCEANIA                                   16025   \n",
       "222          West Bank  NEAR EAST                               2460492   \n",
       "223     Western Sahara  NORTHERN AFRICA                          273008   \n",
       "224              Yemen  NEAR EAST                              21456188   \n",
       "225             Zambia  SUB-SAHARAN AFRICA                     11502010   \n",
       "226           Zimbabwe  SUB-SAHARAN AFRICA                     12236805   \n",
       "\n",
       "     Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio)  \\\n",
       "217           12200                       17,1                        20,72   \n",
       "218          912050                       28,2                         0,31   \n",
       "219          329560                      256,1                         1,05   \n",
       "220            1910                       56,9                         9,84   \n",
       "221             274                       58,5                        47,08   \n",
       "222            5860                      419,9                         0,00   \n",
       "223          266000                        1,0                         0,42   \n",
       "224          527970                       40,6                         0,36   \n",
       "225          752614                       15,3                         0,00   \n",
       "226          390580                       31,3                         0,00   \n",
       "\n",
       "    Net migration Infant mortality (per 1000 births)  GDP ($ per capita)  \\\n",
       "217             0                              55,16              2900.0   \n",
       "218         -0,04                               22,2              4800.0   \n",
       "219         -0,45                              25,95              2500.0   \n",
       "220         -8,94                               8,03             17200.0   \n",
       "221           NaN                                NaN              3700.0   \n",
       "222          2,98                              19,62               800.0   \n",
       "223           NaN                                NaN                 NaN   \n",
       "224             0                               61,5               800.0   \n",
       "225             0                              88,29               800.0   \n",
       "226             0                              67,69              1900.0   \n",
       "\n",
       "     Literacy (%) Phones (per 1000) Arable (%) Crops (%) Other (%) Climate  \\\n",
       "217          53.0              32,6       2,46      7,38     90,16       2   \n",
       "218          93.4             140,1       2,95      0,92     96,13       2   \n",
       "219          90.3             187,7      19,97      5,95     74,08       2   \n",
       "220           NaN             652,8      11,76      2,94      85,3       2   \n",
       "221          50.0             118,6          5        25        70       2   \n",
       "222           NaN             145,2       16,9     18,97     64,13       3   \n",
       "223           NaN               NaN       0,02         0     99,98       1   \n",
       "224          50.2              37,2       2,78      0,24     96,98       1   \n",
       "225          80.6               8,2       7,08      0,03      92,9       2   \n",
       "226          90.7              26,8       8,32      0,34     91,34       2   \n",
       "\n",
       "    Birthrate Deathrate Agriculture Industry Service  \n",
       "217     22,72      7,82        0,26     0,12    0,62  \n",
       "218     18,71      4,92        0,04    0,419   0,541  \n",
       "219     16,86      6,22       0,209     0,41   0,381  \n",
       "220     13,96      6,43        0,01     0,19     0,8  \n",
       "221       NaN       NaN         NaN      NaN     NaN  \n",
       "222     31,67      3,92        0,09     0,28    0,63  \n",
       "223       NaN       NaN         NaN      NaN     0,4  \n",
       "224     42,89       8,3       0,135    0,472   0,393  \n",
       "225        41     19,93        0,22     0,29   0,489  \n",
       "226     28,01     21,84       0,179    0,243   0,579  "
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Let's convert the Literacy (%) entries into float() type. \n",
    "# We need to replace the ',' with '.'; following which, we can apply astype(float)\n",
    "CoTW_df['Literacy (%)']=CoTW_df['Literacy (%)'].str.replace(',', '.').astype(float)\n",
    "CoTW_df.tail(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Country                                object\n",
       "Region                                 object\n",
       "Population                              int64\n",
       "Area (sq. mi.)                          int64\n",
       "Pop. Density (per sq. mi.)             object\n",
       "Coastline (coast/area ratio)           object\n",
       "Net migration                          object\n",
       "Infant mortality (per 1000 births)     object\n",
       "GDP ($ per capita)                    float64\n",
       "Literacy (%)                          float64\n",
       "Phones (per 1000)                      object\n",
       "Arable (%)                             object\n",
       "Crops (%)                              object\n",
       "Other (%)                              object\n",
       "Climate                                object\n",
       "Birthrate                              object\n",
       "Deathrate                              object\n",
       "Agriculture                            object\n",
       "Industry                               object\n",
       "Service                                object\n",
       "dtype: object"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# reconfirm type change\n",
    "CoTW_df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Datetime object\n",
    "    https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "False\n"
     ]
    }
   ],
   "source": [
    "date1='03-25-93'\n",
    "date2=\"3/25/1993\"\n",
    "date3=\"25 Mar 1993\"\n",
    "print(date1==date3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1993-03-25 00:00:00\n"
     ]
    }
   ],
   "source": [
    "datetime_object1 = datetime.datetime.strptime(date1,\"%m-%d-%y\")\n",
    "print(datetime_object1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1993-03-25 00:00:00\n"
     ]
    }
   ],
   "source": [
    "datetime_object2 = datetime.datetime.strptime(date2,\"%m/%d/%Y\")\n",
    "print(datetime_object2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1993-03-25 00:00:00\n",
      "True\n"
     ]
    }
   ],
   "source": [
    "datetime_object3 = datetime.datetime.strptime(date3,\"%d %b %Y\")\n",
    "print(datetime_object3)\n",
    "print(datetime_object1==datetime_object3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<b> Ungraded task 2.2: Clean the CoTW_df Dataframe</b>\n",
    "\n",
    "Clean the data in the CoTW_df Dataframe as follows:\n",
    "\n",
    "1. Create rule based methods (e.g., using regular expressions) to replace the column names as follows: \n",
    "    * Remove everything in brackets, e.g., replace \"Coastline (coast/area ratio)\" with \"Coastline\"\n",
    "    * If there are multiple words in the title, then join them with underscores. As such, \"Infant mortality (per 1000 births)\" would become \"Infant_mortality\". \n",
    "\n",
    "2. Replace all the numeric values which have been stored as strings, into floats.\n",
    "\n",
    "3. For all the missing numeric values (NaN), replace them with the Mean value available for the \"Region\". e.g., Wallis and Futuna is from OCEANIA region, so replace its missing Net_Migration information with the mean Net_Migration as determined for the values available for other countries in OCEANIA. \n",
    "    * Discuss alternatives for dealing with the NaN values. These alternatives may be dependent on type of column and/or the rows involved."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<b>Ungraded task 2.3: Find all the dates in the datetxt_df Series</b><br><br>\n",
    "\n",
    "- Identify the (first instance of) date in each line of the Data Series.\n",
    "    - If only month and year information is available, then assume that the day of the month is 1st.\n",
    "    - If only year information is available, consider the month to be January.\n",
    "- Create a DataFrame (as shown below) with two columns, the Position of Date in the original Series, and the Date identified, sorted in increasing order of the date. \n",
    "\n",
    "|     |   Position | Date       |\n",
    "|----:|-----------:|:-----------|\n",
    "|   0 |          9 | 10/4/1971  |\n",
    "|   1 |         84 | 18/5/1971  |\n",
    "|   2 |          2 | 8/7/1971   |\n",
    "| ... |        ... |    ...     |\n",
    "| ... |        ... |    ...     |\n",
    "| 498 |        161 | 19/10/2016 |\n",
    "| 499 |        413 | 1/11/2016  |\n",
    "\n",
    "\n",
    "- Discuss: Assuming that you have access to a correct/ideal solution (e.g., the one provided), and you obtain some solution based on your processing, how will you quantitatively compare the quality of result you obtained w.r.to the ideal solution? If you consider several metrics, discuss pros and cons of such various metrics. Implement the corresponding metric and quantify the quality of result you obtain.\n",
    "- A tentative solution has been provided in an accompanying file named sorted-dates.csv\n",
    "    - Caveat: The solution provided may be faulty. If you find a better solution, please share with me. Please also let me know how you verified inaccuracies of the tentative solution."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<p style=\"font-size:134%;color:Deep Teal;\">That's it folks!</p>\n",
    "\n",
    "<img src=\"pics/tangram.png\" alt=\"Bear in Tangram\" width=\"500\"/>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}