{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# A bit more of Pandas \n",
"### and some RegEx\n",
"\n",
"SC 4125: Developing Data Products\n",
"\n",
"Module-2: Basics of Pandas structures and manipulation\n",
"\n",
"by Anwitaman DATTA\n",
"School of Computer Science and Engineering, NTU Singapore. \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Teaching material\n",
"- .html deck of slides\n",
"- .ipynb Jupyter notebook"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Disclaimer/Caveat emptor\n",
"\n",
"- Non-systematic and non-exhaustive review\n",
"- Example solutions are not necessarily the most efficient or elegant, let alone unique"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### Positioning this module in the big picture\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"- Some basic tools for cleaning and manipulating your data: Apply, Lambdas, Merge, Groupby, Missing data, etc."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# housekeeping - import necessary libraries\n",
"import numpy as np\n",
"import pandas as pd\n",
"import re\n",
"import datetime "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Load data and carry out some preliminary cleaning\n",
"\n",
"Let's load the same gapminder data we had worked with in the previous module, but from a .tsv file this time. Likewise, let's load the CountriesOfTheWorld data from a .csv file."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" country continent year lifeExp pop gdpPercap\n",
"761 Israel Asia 1977 73.060 3495918 13306.619210\n",
"1263 Reunion Africa 1967 60.542 414024 4021.175739\n",
"787 Jamaica Americas 1987 71.770 2326606 6351.237495\n",
"868 Lebanon Asia 1972 65.421 2680018 7486.384341\n",
"1473 Sweden Europe 1997 79.390 8897619 25266.594990"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# load data\n",
"gapminderdatapath ='data/gapminder/' # change this to adjust relative path\n",
"gap_df = pd.read_csv(gapminderdatapath+'gapminder.tsv', sep='\\t')\n",
"gap_df.sample(5)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Country Region Population \\\n",
"127 Maldives ASIA (EX. NEAR EAST) 359008 \n",
"10 Aruba LATIN AMER. & CARIB 71891 \n",
"130 Marshall Islands OCEANIA 60422 \n",
"163 Poland EASTERN EUROPE 38536869 \n",
"185 Slovakia EASTERN EUROPE 5439448 \n",
"\n",
" Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio) \\\n",
"127 300 1196,7 214,67 \n",
"10 193 372,5 35,49 \n",
"130 11854 5,1 3,12 \n",
"163 312685 123,3 0,16 \n",
"185 48845 111,4 0,00 \n",
"\n",
" Net migration Infant mortality (per 1000 births) GDP ($ per capita) \\\n",
"127 0 56,52 3900.0 \n",
"10 0 5,89 28000.0 \n",
"130 -6,04 29,45 1600.0 \n",
"163 -0,49 8,51 11100.0 \n",
"185 0,3 7,41 13300.0 \n",
"\n",
" Literacy (%) Phones (per 1000) Arable (%) Crops (%) Other (%) Climate \\\n",
"127 97,2 90,0 13,33 16,67 70 2 \n",
"10 97,0 516,1 10,53 0 89,47 2 \n",
"130 93,7 91,2 16,67 38,89 44,44 2 \n",
"163 99,8 306,3 45,91 1,12 52,97 3 \n",
"185 NaN 220,1 30,16 2,62 67,22 3 \n",
"\n",
" Birthrate Deathrate Agriculture Industry Service \n",
"127 34,81 7,06 0,2 0,18 0,62 \n",
"10 11,03 6,68 0,004 0,333 0,663 \n",
"130 33,05 4,78 0,317 0,149 0,534 \n",
"163 9,85 9,89 0,05 0,311 0,64 \n",
"185 10,65 9,45 0,035 0,294 0,672 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# load data\n",
"countriesdatapath ='data/'\n",
"CoTW_df = pd.read_csv(countriesdatapath+'CountriesOfTheWorld.csv')\n",
"CoTW_df.sample(5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"What is the list of countries in the gapminder and CoTW datasets?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina', 'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium']\n"
]
}
],
"source": [
"gap_country_names_series=gap_df['country'].drop_duplicates()\n",
"print(gap_country_names_series.to_list()[:10]) #if you print the length, you will get a result of 142"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Afghanistan ' 'Albania ' 'Algeria ' 'American Samoa ' 'Andorra '\n",
" 'Angola ' 'Anguilla ' 'Antigua & Barbuda ' 'Argentina ' 'Armenia ']\n"
]
}
],
"source": [
"CoTW_country_names=CoTW_df['Country'].unique()\n",
"print(CoTW_country_names[:10]) #if you print the length, you will get a result of 227"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Afghanistan' 'Albania' 'Algeria' 'American Samoa' 'Andorra' 'Angola'\n",
" 'Anguilla' 'Antigua & Barbuda' 'Argentina' 'Armenia']\n"
]
}
],
"source": [
"# Let's get rid of the trailing space in the records\n",
"CoTW_df['Country']=CoTW_df.Country.apply(lambda x: x[:-1])\n",
"# Alternative: strip the trailing space explicitly\n",
"# CoTW_df['Country']=CoTW_df.Country.apply(lambda x: x.rstrip()) \n",
"# (strip would also work here, since only trailing spaces are present)\n",
"CoTW_country_names=CoTW_df['Country'].unique()\n",
"print(CoTW_country_names[:10])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Apply & Lambda\n",
"- We just saw the use of two very interesting tools (which often go hand-in-hand when processing a data frame):\n",
" 1. Apply: applies a function along an axis of the data frame. It can also be used on a single column. \n",
" * See details at the doc page \n",
" 2. Lambda: a small anonymous function. You can of course use a named function inside your lambda, if you need something more complicated.\n"
]
},
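As a minimal sketch of the two tools together (on a tiny made-up frame, not the course data), apply with a lambda works both on a single column and row-wise across the frame:

```python
import pandas as pd

# A tiny made-up frame with trailing spaces in the names
df = pd.DataFrame({"Country": ["Albania ", "Algeria "], "Pop": [3, 44]})

# apply with a lambda over a single column (a Series): strip trailing whitespace
df["Country"] = df["Country"].apply(lambda x: x.rstrip())

# apply along an axis of the frame: axis=1 passes each row to the lambda
df["Label"] = df.apply(lambda row: f"{row['Country']}:{row['Pop']}", axis=1)

print(df["Country"].tolist())  # ['Albania', 'Algeria']
print(df["Label"].tolist())    # ['Albania:3', 'Algeria:44']
```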
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Rank NOC Gold Silver Bronze Total\n",
"84 78 Trinidad and Tobago 0 0 1 1\n",
"15 16 Jamaica 6 3 2 11\n",
"35 35 Thailand 2 2 2 6\n",
"0 1 United States 46 37 38 121\n",
"10 11 Netherlands 8 7 4 19"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# load data\n",
"olymp_df=pd.read_html(r'https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table')\n",
"olymp2016medals=olymp_df[2][:-1]\n",
"olymp2016medals.sample(5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"#### The following code was needed previously, when each country name in the table was followed by a three-letter acronym in parentheses\n",
"olymp2016medals['NOC']=olymp2016medals.NOC.apply(lambda x: x.split(\"(\",1)[0][:-1])\n",
"olymp2016medals"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Bahamas', 'Brazil*', 'Chinese Taipei', 'Great Britain', 'Independent Olympic Athletes', 'Ivory Coast', 'Kosovo', 'North Korea', 'South Korea', 'Trinidad and Tobago']\n",
"['Afghanistan', 'Albania', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antigua & Barbuda', 'Aruba', 'Bahamas, The', 'Bangladesh', 'Barbados', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia & Herzegovina', 'Botswana', 'Brazil', 'British Virgin Is.', 'Brunei', 'Burkina Faso', 'Burma', 'Cambodia', 'Cameroon', 'Cape Verde', 'Cayman Islands', 'Central African Rep.', 'Chad', 'Chile', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Repub. of the', 'Cook Islands', 'Costa Rica', \"Cote d'Ivoire\", 'Cyprus', 'Djibouti', 'Dominica', 'East Timor', 'Ecuador', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Faroe Islands', 'French Guiana', 'French Polynesia', 'Gabon', 'Gambia, The', 'Gaza Strip', 'Ghana', 'Gibraltar', 'Greenland', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras', 'Hong Kong', 'Iceland', 'Iraq', 'Isle of Man', 'Jersey', 'Kiribati', 'Korea, North', 'Korea, South', 'Kuwait', 'Kyrgyzstan', 'Laos', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Luxembourg', 'Macau', 'Macedonia', 'Madagascar', 'Malawi', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Micronesia, Fed. St.', 'Moldova', 'Monaco', 'Montserrat', 'Mozambique', 'N. 
Mariana Islands', 'Namibia', 'Nauru', 'Nepal', 'Netherlands Antilles', 'New Caledonia', 'Nicaragua', 'Oman', 'Pakistan', 'Palau', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Reunion', 'Rwanda', 'Saint Helena', 'Saint Kitts & Nevis', 'Saint Lucia', 'Saint Vincent and the Grenadines', 'Samoa', 'San Marino', 'Sao Tome & Principe', 'Saudi Arabia', 'Senegal', 'Seychelles', 'Sierra Leone', 'Solomon Islands', 'Somalia', 'Sri Lanka', 'St Pierre & Miquelon', 'Sudan', 'Suriname', 'Swaziland', 'Syria', 'Taiwan', 'Tanzania', 'Togo', 'Tonga', 'Trinidad & Tobago', 'Turkmenistan', 'Turks & Caicos Is', 'Tuvalu', 'Uganda', 'United Kingdom', 'Uruguay', 'Vanuatu', 'Virgin Islands', 'Wallis and Futuna', 'West Bank', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe']\n"
]
}
],
"source": [
"print(sorted(set(olymp2016medals['NOC'])-set(CoTW_country_names)) )\n",
"print(sorted(set(CoTW_country_names)-set(olymp2016medals['NOC'])))\n",
"# Note that several countries are named or spelled differently in the two lists"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Independent Olympic Athletes', 'Kosovo']\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\datta\\anaconda3\\lib\\site-packages\\pandas\\core\\indexing.py:1765: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" isetter(loc, value)\n"
]
}
],
"source": [
"CoTW_df.loc[(CoTW_df.Country==\"Bahamas, The\"),'Country']='Bahamas'\n",
"CoTW_df.loc[(CoTW_df.Country=='Taiwan'),'Country']='Chinese Taipei'\n",
"CoTW_df.loc[(CoTW_df.Country=='United Kingdom'),'Country']='Great Britain'\n",
"CoTW_df.loc[(CoTW_df.Country==\"Cote d'Ivoire\"),'Country']='Ivory Coast'\n",
"CoTW_df.loc[(CoTW_df.Country=='Korea, North'),'Country']='North Korea'\n",
"CoTW_df.loc[(CoTW_df.Country=='Korea, South'),'Country']='South Korea'\n",
"CoTW_df.loc[(CoTW_df.Country==\"Trinidad & Tobago\"),'Country']='Trinidad and Tobago'\n",
"olymp2016medals.loc[(olymp2016medals.NOC==\"Brazil*\"),'NOC']='Brazil'\n",
"CoTW_country_names=CoTW_df['Country'].unique()\n",
"print(sorted(set(olymp2016medals['NOC'])-set(CoTW_country_names)) )\n",
"#print(sorted(set(CoTW_country_names)-set(olymp2016medals['NOC'])))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We need to `fix' all the anomalous data, particularly if we want to correlate information across the data frames. Above, I did this manually.\n",
"\n",
"Talking points: \n",
"1. Challenges of doing this at scale.\n",
" * Check for similarity of words: e.g., Korea, North vs North Korea; Trinidad & Tobago vs Trinidad and Tobago\n",
" * What about Cote d'Ivoire vs Ivory Coast? Swaziland versus Switzerland? Niger vs Nigeria?\n",
" * What to do with United Kingdom vs Great Britain, Taiwan vs Chinese Taipei?\n",
"2. Sometimes you may be lucky and can leverage named-entity recognition and matching tools.\n",
"\n",
"Caveat: Different domains may be more or less amenable to certain solutions/heuristics. "
]
},
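One simple heuristic for the "at scale" case is string similarity from Python's standard library. A sketch (with a hypothetical shortlist of names): it recovers "Trinidad & Tobago" from "Trinidad and Tobago", but it is exactly the kind of tool that gets fooled by pairs like Swaziland vs Switzerland, as noted above:

```python
import difflib

# A hypothetical shortlist of reference names
cotw = ["Korea, North", "Trinidad & Tobago", "Swaziland", "Niger", "Nigeria"]

def closest(name, candidates, cutoff=0.6):
    # Return the most similar candidate above the cutoff, or None
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest("Trinidad and Tobago", cotw))  # Trinidad & Tobago
print(closest("Switzerland", cotw))          # Swaziland -- a false match!
```

This is why purely syntactic similarity needs to be combined with domain knowledge or a curated mapping.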
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Detour\n",
"\n",
"For the particular problem at hand (country names spelled in various ways that we want to recognize as equivalent), rather than writing our own tool, we can leverage a nice tool someone already built!\n",
"\n",
"Namely, country_converter: https://github.com/konstantinstadler/country_converter\n",
"\n",
"There's also pycountry, which is useful in some related cases, but it may not be suitable for our particular scenario (at least, I didn't figure out how to use it for our problem): https://pypi.org/project/pycountry/ "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"#!pip install country_converter --upgrade\n",
"import country_converter as coco"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Alice's Wonderland not found in regex\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Niger', 'Nigeria', 'North Korea', 'North Korea', \"Cote d'Ivoire\", \"Cote d'Ivoire\", 'Brazil', 'Brazil', 'United Kingdom', 'United Kingdom', 'Taiwan', 'Taiwan', 'Trinidad and Tobago', 'Trinidad and Tobago', 'not found']\n"
]
}
],
"source": [
"# sanity check\n",
"some_names = ['Niger', 'Nigeria', 'Korea, North', \"North Korea\", \"Cote d'Ivoire\", \"Ivory Coast\", \"Brazil*\", \"Brazil\", \"Great Britain\", \"United Kingdom\", \"Taiwan\", \"Chinese taipei\", \"Trinidad & Tobago\", \"Trinidad and Tobago\", \"Alice's Wonderland\"]\n",
"standard_names = coco.convert(names=some_names, to='name_short')\n",
"print(standard_names)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Merge data frames\n",
"- how: {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’\n",
" * Pandas doc\n",
" * A nice discussion of the various merge options can be found at Stack Overflow.\n",
"\n",
"- Note that Pandas also has a join method, which in fact joins columns with another DataFrame either on the index or on a key column. A good resource to check."
]
},
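A minimal sketch of how the how option changes the result, on two toy frames (made-up numbers): inner keeps only keys present on both sides, while outer keeps everything and fills the rest with NaN:

```python
import pandas as pd

left = pd.DataFrame({"NOC": ["Brazil", "Kosovo"], "Gold": [7, 1]})
right = pd.DataFrame({"Country": ["Brazil", "Niger"], "Population": [186, 12]})

inner = left.merge(right, left_on="NOC", right_on="Country", how="inner")
outer = left.merge(right, left_on="NOC", right_on="Country", how="outer")

print(len(inner))  # 1: only Brazil appears in both frames
print(len(outer))  # 3: Brazil, plus unmatched Kosovo and Niger
```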
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Rank NOC Gold Silver Bronze Total Country Population\n",
"57 60 Malaysia 0 4 1 5 Malaysia 24385858\n",
"58 61 Mexico 0 3 2 5 Mexico 107449525\n",
"59 62 Venezuela 0 2 1 3 Venezuela 25730435\n",
"60 63 Algeria 0 2 0 2 Algeria 32930091\n",
"64 67 India 0 1 1 2 India 1095351995\n",
"69 69 Philippines 0 1 0 1 Philippines 89468677\n",
"72 75 Egypt 0 0 3 3 Egypt 78887007\n",
"79 78 Morocco 0 0 1 1 Morocco 33241259\n",
"80 78 Nigeria 0 0 1 1 Nigeria 131859731"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Assuming that the data has been `cleaned' \n",
"# Determine all the countries with a population of more than 20 million \n",
"# as per the CountriesOfTheWorld table, that got no gold medals. \n",
"Merged_df=olymp2016medals.merge(CoTW_df[['Country','Population']],left_on='NOC', right_on='Country')\n",
"Merged_df[(Merged_df['Population']>20000000) & (Merged_df['Gold']==0)]\n",
"# Note: for this specific purpose, I do not need any columns from CoTW_df other than Country and Population\n",
"# I could have cleaned the data further, e.g. the NOC and Country columns are redundant."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Ungraded task 2.1: Ratio of total medals to per capita GDP\n",
"\n",
"Determine the ratio of the number of total medals to the per capita GDP for each country as per the CountriesOfTheWorld table and sort them in decreasing order. Compare the results with others.\n",
"\n",
"- Discuss: What subtleties are involved? \n",
" * How do you account for the countries which are not in the original Olympics table?"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Groupby, Hierarchical index"
]
},
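A minimal sketch of the two groupby patterns used below, on a tiny made-up frame: min gives the per-group minimum, while idxmin gives the row label attaining it, which loc then turns back into full rows:

```python
import pandas as pd

# Made-up miniature of the gapminder layout
df = pd.DataFrame({
    "year":    [1952, 1952, 1957, 1957],
    "country": ["A", "B", "A", "B"],
    "lifeExp": [40.0, 35.0, 42.0, 36.0],
})

# Minimum life expectancy per year
mins = df.groupby("year")["lifeExp"].min()
print(mins.tolist())  # [35.0, 36.0]

# Which country attained each minimum: idxmin yields row labels, loc recovers rows
rows = df.loc[df.groupby("year")["lifeExp"].idxmin()]
print(rows["country"].tolist())  # ['B', 'B']
```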
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"year\n",
"1952 28.801\n",
"1957 30.332\n",
"1962 31.997\n",
"1967 34.020\n",
"1972 35.400\n",
"1977 31.220\n",
"1982 38.445\n",
"1987 39.906\n",
"1992 23.599\n",
"1997 36.087\n",
"2002 39.193\n",
"2007 39.613\n",
"Name: lifeExp, dtype: float64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# For each year for which data is available in the gap_df Dataframe\n",
"# What was the minimum recorded life expectancy?\n",
"gap_df.groupby([\"year\"])[\"lifeExp\"].min() "
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" year country lifeExp\n",
"0 1952 Afghanistan 28.801\n",
"1 1957 Afghanistan 30.332\n",
"2 1962 Afghanistan 31.997\n",
"3 1967 Afghanistan 34.020\n",
"1348 1972 Sierra Leone 35.400\n",
"221 1977 Cambodia 31.220\n",
"1350 1982 Sierra Leone 38.445\n",
"43 1987 Angola 39.906\n",
"1292 1992 Rwanda 23.599\n",
"1293 1997 Rwanda 36.087\n",
"1690 2002 Zambia 39.193\n",
"1463 2007 Swaziland 39.613"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# What if we also want to know, which country had said minimum life expectancy? \n",
"gap_df.loc[gap_df.groupby([\"year\"])[\"lifeExp\"].idxmin()][['year','country','lifeExp']] "
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Population Area (sq. mi.) \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 31056997 647500 \n",
" Bangladesh 147365352 144000 \n",
" Bhutan 2279723 47000 \n",
" Brunei 379444 5770 \n",
" Burma 47382633 678500 \n",
"... ... ... \n",
"WESTERN EUROPE Portugal 10605870 92391 \n",
" San Marino 29251 61 \n",
" Spain 40397842 504782 \n",
" Sweden 9016596 449964 \n",
" Switzerland 7523934 41290 \n",
"\n",
" GDP ($ per capita) \n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 700.0 \n",
" Bangladesh 1900.0 \n",
" Bhutan 1300.0 \n",
" Brunei 18600.0 \n",
" Burma 1800.0 \n",
"... ... \n",
"WESTERN EUROPE Portugal 18000.0 \n",
" San Marino 34600.0 \n",
" Spain 22000.0 \n",
" Sweden 26800.0 \n",
" Switzerland 32700.0 \n",
"\n",
"[227 rows x 3 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Groupby can be useful in creating a hierarchical index,\n",
"# e.g., when we want to examine the data by region.\n",
"CoTW_df.groupby(['Region','Country']).mean()\n",
"# Much of the data, though representing numeric values, is in fact stored as strings in the current DataFrame. \n",
"# Refer to Ungraded Task 2.2 (below) for more on this. The nature of the data may limit what can be computed. "
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Population Area (sq. mi.) \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 31056997 647500 \n",
" Bangladesh 147365352 144000 \n",
" Bhutan 2279723 47000 \n",
" Brunei 379444 5770 \n",
" Burma 47382633 678500 \n",
"... ... ... \n",
"WESTERN EUROPE Portugal 10605870 92391 \n",
" San Marino 29251 61 \n",
" Spain 40397842 504782 \n",
" Sweden 9016596 449964 \n",
" Switzerland 7523934 41290 \n",
"\n",
" Pop. Density (per sq. mi.) \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 48,0 \n",
" Bangladesh 1023,4 \n",
" Bhutan 48,5 \n",
" Brunei 65,8 \n",
" Burma 69,8 \n",
"... ... \n",
"WESTERN EUROPE Portugal 114,8 \n",
" San Marino 479,5 \n",
" Spain 80,0 \n",
" Sweden 20,0 \n",
" Switzerland 182,2 \n",
"\n",
" Coastline (coast/area ratio) \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 0,00 \n",
" Bangladesh 0,40 \n",
" Bhutan 0,00 \n",
" Brunei 2,79 \n",
" Burma 0,28 \n",
"... ... \n",
"WESTERN EUROPE Portugal 1,94 \n",
" San Marino 0,00 \n",
" Spain 0,98 \n",
" Sweden 0,72 \n",
" Switzerland 0,00 \n",
"\n",
" Net migration \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 23,06 \n",
" Bangladesh -0,71 \n",
" Bhutan 0 \n",
" Brunei 3,59 \n",
" Burma -1,8 \n",
"... ... \n",
"WESTERN EUROPE Portugal 3,57 \n",
" San Marino 10,98 \n",
" Spain 0,99 \n",
" Sweden 1,67 \n",
" Switzerland 4,05 \n",
"\n",
" Infant mortality (per 1000 births) \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 163,07 \n",
" Bangladesh 62,6 \n",
" Bhutan 100,44 \n",
" Brunei 12,61 \n",
" Burma 67,24 \n",
"... ... \n",
"WESTERN EUROPE Portugal 5,05 \n",
" San Marino 5,73 \n",
" Spain 4,42 \n",
" Sweden 2,77 \n",
" Switzerland 4,39 \n",
"\n",
" GDP ($ per capita) \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 700.0 \n",
" Bangladesh 1900.0 \n",
" Bhutan 1300.0 \n",
" Brunei 18600.0 \n",
" Burma 1800.0 \n",
"... ... \n",
"WESTERN EUROPE Portugal 18000.0 \n",
" San Marino 34600.0 \n",
" Spain 22000.0 \n",
" Sweden 26800.0 \n",
" Switzerland 32700.0 \n",
"\n",
" Literacy (%) \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 36,0 \n",
" Bangladesh 43,1 \n",
" Bhutan 42,2 \n",
" Brunei 93,9 \n",
" Burma 85,3 \n",
"... ... \n",
"WESTERN EUROPE Portugal 93,3 \n",
" San Marino 96,0 \n",
" Spain 97,9 \n",
" Sweden 99,0 \n",
" Switzerland 99,0 \n",
"\n",
" Phones (per 1000) Arable (%) \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 3,2 12,13 \n",
" Bangladesh 7,3 62,11 \n",
" Bhutan 14,3 3,09 \n",
" Brunei 237,2 0,57 \n",
" Burma 10,1 15,19 \n",
"... ... ... \n",
"WESTERN EUROPE Portugal 399,2 21,75 \n",
" San Marino 704,3 16,67 \n",
" Spain 453,5 26,07 \n",
" Sweden 715,0 6,54 \n",
" Switzerland 680,9 10,42 \n",
"\n",
" Crops (%) Other (%) Climate \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 0,22 87,65 1 \n",
" Bangladesh 3,07 34,82 2 \n",
" Bhutan 0,43 96,48 2 \n",
" Brunei 0,76 98,67 2 \n",
" Burma 0,97 83,84 2 \n",
"... ... ... ... \n",
"WESTERN EUROPE Portugal 7,81 70,44 3 \n",
" San Marino 0 83,33 NaN \n",
" Spain 9,87 64,06 3 \n",
" Sweden 0,01 93,45 3 \n",
" Switzerland 0,61 88,97 3 \n",
"\n",
" Birthrate Deathrate \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 46,6 20,34 \n",
" Bangladesh 29,8 8,27 \n",
" Bhutan 33,65 12,7 \n",
" Brunei 18,79 3,45 \n",
" Burma 17,91 9,83 \n",
"... ... ... \n",
"WESTERN EUROPE Portugal 10,72 10,5 \n",
" San Marino 10,02 8,17 \n",
" Spain 10,06 9,72 \n",
" Sweden 10,27 10,31 \n",
" Switzerland 9,71 8,49 \n",
"\n",
" Agriculture Industry Service \n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 0,38 0,24 0,38 \n",
" Bangladesh 0,199 0,198 0,603 \n",
" Bhutan 0,258 0,379 0,363 \n",
" Brunei 0,036 0,561 0,403 \n",
" Burma 0,564 0,082 0,353 \n",
"... ... ... ... \n",
"WESTERN EUROPE Portugal 0,053 0,274 0,673 \n",
" San Marino NaN NaN NaN \n",
" Spain 0,04 0,295 0,665 \n",
" Sweden 0,011 0,282 0,707 \n",
" Switzerland 0,015 0,34 0,645 \n",
"\n",
"[227 rows x 18 columns]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"CoTW_df.groupby(['Region','Country']).max()\n",
"# With an aggregation like max, Pandas is more forgiving about mixed data types ;)"
]
},
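{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# A hedged toy example (data is illustrative, not from our datasets): groupby\n",
"# followed by .agg lets different columns use different aggregation functions.\n",
"toy_df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})\n",
"toy_df.groupby('g').agg({'x': 'max', 'y': 'mean'})"
]
},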
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" Population | \n",
" Area (sq. mi.) | \n",
" Pop. Density (per sq. mi.) | \n",
" Coastline (coast/area ratio) | \n",
" Net migration | \n",
" Infant mortality (per 1000 births) | \n",
" GDP ($ per capita) | \n",
" Literacy (%) | \n",
" Phones (per 1000) | \n",
" Arable (%) | \n",
" Crops (%) | \n",
" Other (%) | \n",
" Climate | \n",
" Birthrate | \n",
" Deathrate | \n",
" Agriculture | \n",
" Industry | \n",
" Service | \n",
"
\n",
" \n",
" Region | \n",
" Country | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" ASIA (EX. NEAR EAST) | \n",
" Afghanistan | \n",
" 31056997 | \n",
" 647500 | \n",
" 48,0 | \n",
" 0,00 | \n",
" 23,06 | \n",
" 163,07 | \n",
" 700.0 | \n",
" 36,0 | \n",
" 3,2 | \n",
" 12,13 | \n",
" 0,22 | \n",
" 87,65 | \n",
" 1 | \n",
" 46,6 | \n",
" 20,34 | \n",
" 0,38 | \n",
" 0,24 | \n",
" 0,38 | \n",
"
\n",
" \n",
" Bangladesh | \n",
" 147365352 | \n",
" 144000 | \n",
" 1023,4 | \n",
" 0,40 | \n",
" -0,71 | \n",
" 62,6 | \n",
" 1900.0 | \n",
" 43,1 | \n",
" 7,3 | \n",
" 62,11 | \n",
" 3,07 | \n",
" 34,82 | \n",
" 2 | \n",
" 29,8 | \n",
" 8,27 | \n",
" 0,199 | \n",
" 0,198 | \n",
" 0,603 | \n",
"
\n",
" \n",
" Bhutan | \n",
" 2279723 | \n",
" 47000 | \n",
" 48,5 | \n",
" 0,00 | \n",
" 0 | \n",
" 100,44 | \n",
" 1300.0 | \n",
" 42,2 | \n",
" 14,3 | \n",
" 3,09 | \n",
" 0,43 | \n",
" 96,48 | \n",
" 2 | \n",
" 33,65 | \n",
" 12,7 | \n",
" 0,258 | \n",
" 0,379 | \n",
" 0,363 | \n",
"
\n",
" \n",
" Brunei | \n",
" 379444 | \n",
" 5770 | \n",
" 65,8 | \n",
" 2,79 | \n",
" 3,59 | \n",
" 12,61 | \n",
" 18600.0 | \n",
" 93,9 | \n",
" 237,2 | \n",
" 0,57 | \n",
" 0,76 | \n",
" 98,67 | \n",
" 2 | \n",
" 18,79 | \n",
" 3,45 | \n",
" 0,036 | \n",
" 0,561 | \n",
" 0,403 | \n",
"
\n",
" \n",
" Burma | \n",
" 47382633 | \n",
" 678500 | \n",
" 69,8 | \n",
" 0,28 | \n",
" -1,8 | \n",
" 67,24 | \n",
" 1800.0 | \n",
" 85,3 | \n",
" 10,1 | \n",
" 15,19 | \n",
" 0,97 | \n",
" 83,84 | \n",
" 2 | \n",
" 17,91 | \n",
" 9,83 | \n",
" 0,564 | \n",
" 0,082 | \n",
" 0,353 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" WESTERN EUROPE | \n",
" Portugal | \n",
" 10605870 | \n",
" 92391 | \n",
" 114,8 | \n",
" 1,94 | \n",
" 3,57 | \n",
" 5,05 | \n",
" 18000.0 | \n",
" 93,3 | \n",
" 399,2 | \n",
" 21,75 | \n",
" 7,81 | \n",
" 70,44 | \n",
" 3 | \n",
" 10,72 | \n",
" 10,5 | \n",
" 0,053 | \n",
" 0,274 | \n",
" 0,673 | \n",
"
\n",
" \n",
" San Marino | \n",
" 29251 | \n",
" 61 | \n",
" 479,5 | \n",
" 0,00 | \n",
" 10,98 | \n",
" 5,73 | \n",
" 34600.0 | \n",
" 96,0 | \n",
" 704,3 | \n",
" 16,67 | \n",
" 0 | \n",
" 83,33 | \n",
" NaN | \n",
" 10,02 | \n",
" 8,17 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" Spain | \n",
" 40397842 | \n",
" 504782 | \n",
" 80,0 | \n",
" 0,98 | \n",
" 0,99 | \n",
" 4,42 | \n",
" 22000.0 | \n",
" 97,9 | \n",
" 453,5 | \n",
" 26,07 | \n",
" 9,87 | \n",
" 64,06 | \n",
" 3 | \n",
" 10,06 | \n",
" 9,72 | \n",
" 0,04 | \n",
" 0,295 | \n",
" 0,665 | \n",
"
\n",
" \n",
" Sweden | \n",
" 9016596 | \n",
" 449964 | \n",
" 20,0 | \n",
" 0,72 | \n",
" 1,67 | \n",
" 2,77 | \n",
" 26800.0 | \n",
" 99,0 | \n",
" 715,0 | \n",
" 6,54 | \n",
" 0,01 | \n",
" 93,45 | \n",
" 3 | \n",
" 10,27 | \n",
" 10,31 | \n",
" 0,011 | \n",
" 0,282 | \n",
" 0,707 | \n",
"
\n",
" \n",
" Switzerland | \n",
" 7523934 | \n",
" 41290 | \n",
" 182,2 | \n",
" 0,00 | \n",
" 4,05 | \n",
" 4,39 | \n",
" 32700.0 | \n",
" 99,0 | \n",
" 680,9 | \n",
" 10,42 | \n",
" 0,61 | \n",
" 88,97 | \n",
" 3 | \n",
" 9,71 | \n",
" 8,49 | \n",
" 0,015 | \n",
" 0,34 | \n",
" 0,645 | \n",
"
\n",
" \n",
"
\n",
"
227 rows × 18 columns
\n",
"
"
],
"text/plain": [
" Population Area (sq. mi.) \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 31056997 647500 \n",
" Bangladesh 147365352 144000 \n",
" Bhutan 2279723 47000 \n",
" Brunei 379444 5770 \n",
" Burma 47382633 678500 \n",
"... ... ... \n",
"WESTERN EUROPE Portugal 10605870 92391 \n",
" San Marino 29251 61 \n",
" Spain 40397842 504782 \n",
" Sweden 9016596 449964 \n",
" Switzerland 7523934 41290 \n",
"\n",
" Pop. Density (per sq. mi.) \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 48,0 \n",
" Bangladesh 1023,4 \n",
" Bhutan 48,5 \n",
" Brunei 65,8 \n",
" Burma 69,8 \n",
"... ... \n",
"WESTERN EUROPE Portugal 114,8 \n",
" San Marino 479,5 \n",
" Spain 80,0 \n",
" Sweden 20,0 \n",
" Switzerland 182,2 \n",
"\n",
" Coastline (coast/area ratio) \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 0,00 \n",
" Bangladesh 0,40 \n",
" Bhutan 0,00 \n",
" Brunei 2,79 \n",
" Burma 0,28 \n",
"... ... \n",
"WESTERN EUROPE Portugal 1,94 \n",
" San Marino 0,00 \n",
" Spain 0,98 \n",
" Sweden 0,72 \n",
" Switzerland 0,00 \n",
"\n",
" Net migration \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 23,06 \n",
" Bangladesh -0,71 \n",
" Bhutan 0 \n",
" Brunei 3,59 \n",
" Burma -1,8 \n",
"... ... \n",
"WESTERN EUROPE Portugal 3,57 \n",
" San Marino 10,98 \n",
" Spain 0,99 \n",
" Sweden 1,67 \n",
" Switzerland 4,05 \n",
"\n",
" Infant mortality (per 1000 births) \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 163,07 \n",
" Bangladesh 62,6 \n",
" Bhutan 100,44 \n",
" Brunei 12,61 \n",
" Burma 67,24 \n",
"... ... \n",
"WESTERN EUROPE Portugal 5,05 \n",
" San Marino 5,73 \n",
" Spain 4,42 \n",
" Sweden 2,77 \n",
" Switzerland 4,39 \n",
"\n",
" GDP ($ per capita) \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 700.0 \n",
" Bangladesh 1900.0 \n",
" Bhutan 1300.0 \n",
" Brunei 18600.0 \n",
" Burma 1800.0 \n",
"... ... \n",
"WESTERN EUROPE Portugal 18000.0 \n",
" San Marino 34600.0 \n",
" Spain 22000.0 \n",
" Sweden 26800.0 \n",
" Switzerland 32700.0 \n",
"\n",
" Literacy (%) \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 36,0 \n",
" Bangladesh 43,1 \n",
" Bhutan 42,2 \n",
" Brunei 93,9 \n",
" Burma 85,3 \n",
"... ... \n",
"WESTERN EUROPE Portugal 93,3 \n",
" San Marino 96,0 \n",
" Spain 97,9 \n",
" Sweden 99,0 \n",
" Switzerland 99,0 \n",
"\n",
" Phones (per 1000) Arable (%) \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 3,2 12,13 \n",
" Bangladesh 7,3 62,11 \n",
" Bhutan 14,3 3,09 \n",
" Brunei 237,2 0,57 \n",
" Burma 10,1 15,19 \n",
"... ... ... \n",
"WESTERN EUROPE Portugal 399,2 21,75 \n",
" San Marino 704,3 16,67 \n",
" Spain 453,5 26,07 \n",
" Sweden 715,0 6,54 \n",
" Switzerland 680,9 10,42 \n",
"\n",
" Crops (%) Other (%) Climate \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 0,22 87,65 1 \n",
" Bangladesh 3,07 34,82 2 \n",
" Bhutan 0,43 96,48 2 \n",
" Brunei 0,76 98,67 2 \n",
" Burma 0,97 83,84 2 \n",
"... ... ... ... \n",
"WESTERN EUROPE Portugal 7,81 70,44 3 \n",
" San Marino 0 83,33 NaN \n",
" Spain 9,87 64,06 3 \n",
" Sweden 0,01 93,45 3 \n",
" Switzerland 0,61 88,97 3 \n",
"\n",
" Birthrate Deathrate \\\n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 46,6 20,34 \n",
" Bangladesh 29,8 8,27 \n",
" Bhutan 33,65 12,7 \n",
" Brunei 18,79 3,45 \n",
" Burma 17,91 9,83 \n",
"... ... ... \n",
"WESTERN EUROPE Portugal 10,72 10,5 \n",
" San Marino 10,02 8,17 \n",
" Spain 10,06 9,72 \n",
" Sweden 10,27 10,31 \n",
" Switzerland 9,71 8,49 \n",
"\n",
" Agriculture Industry Service \n",
"Region Country \n",
"ASIA (EX. NEAR EAST) Afghanistan 0,38 0,24 0,38 \n",
" Bangladesh 0,199 0,198 0,603 \n",
" Bhutan 0,258 0,379 0,363 \n",
" Brunei 0,036 0,561 0,403 \n",
" Burma 0,564 0,082 0,353 \n",
"... ... ... ... \n",
"WESTERN EUROPE Portugal 0,053 0,274 0,673 \n",
" San Marino NaN NaN NaN \n",
" Spain 0,04 0,295 0,665 \n",
" Sweden 0,011 0,282 0,707 \n",
" Switzerland 0,015 0,34 0,645 \n",
"\n",
"[227 rows x 18 columns]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# For what we want, an alternative and simpler (and more robust, since we don't need to worry about data types)\n",
"# option is to simply create a hierarchical index using set_index, and then sort_index.\n",
"CoTW_df.set_index(['Region','Country']).sort_index()"
]
},
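{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# A hedged toy example: with a hierarchical index, sort_index can also sort by a chosen level.\n",
"toy_df = pd.DataFrame({'v': [3, 1, 2]},\n",
"                      index=pd.MultiIndex.from_tuples([('B', 'x'), ('A', 'y'), ('A', 'x')],\n",
"                                                      names=['outer', 'inner']))\n",
"toy_df.sort_index(level='outer')"
]
},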
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Missing data \n",
"\n",
"There are many reasons data may be missing, e.g.:\n",
"- It may be inherently missing: not collected, not entered, etc.\n",
"- It may be implicitly missing: e.g., expected rows or dates that are simply absent from the data.\n",
"- Missing entries may get created during data processing."
]
},
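{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# A quick, hedged illustration with toy data: .isna() flags explicitly\n",
"# missing entries, and .isna().sum() counts them per column.\n",
"toy_df = pd.DataFrame({'a': [1.0, None, 3.0], 'b': [None, 'x', 'y']})\n",
"toy_df.isna().sum()"
]
},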
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" Day | \n",
" Cases_Guinea | \n",
" Cases_Liberia | \n",
" Cases_SierraLeone | \n",
" Cases_Nigeria | \n",
" Cases_Senegal | \n",
" Cases_UnitedStates | \n",
" Cases_Spain | \n",
" Cases_Mali | \n",
" Deaths_Guinea | \n",
" Deaths_Liberia | \n",
" Deaths_SierraLeone | \n",
" Deaths_Nigeria | \n",
" Deaths_Senegal | \n",
" Deaths_UnitedStates | \n",
" Deaths_Spain | \n",
" Deaths_Mali | \n",
"
\n",
" \n",
" \n",
" \n",
" 108 | \n",
" 4/14/2014 | \n",
" 23 | \n",
" 168.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 108.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 88 | \n",
" 6/3/2014 | \n",
" 73 | \n",
" 344.0 | \n",
" 13.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 215.0 | \n",
" 12.0 | \n",
" 6.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 100 | \n",
" 4/24/2014 | \n",
" 33 | \n",
" NaN | \n",
" 35.0 | \n",
" 0.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 0.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 35 | \n",
" 10/19/2014 | \n",
" 211 | \n",
" 1540.0 | \n",
" NaN | \n",
" 3706.0 | \n",
" 20.0 | \n",
" 1.0 | \n",
" 3.0 | \n",
" 1.0 | \n",
" NaN | \n",
" 904.0 | \n",
" NaN | \n",
" 1259.0 | \n",
" 8.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 22 | \n",
" 11/11/2014 | \n",
" 234 | \n",
" 1919.0 | \n",
" NaN | \n",
" 5586.0 | \n",
" 20.0 | \n",
" 1.0 | \n",
" 4.0 | \n",
" 1.0 | \n",
" 4.0 | \n",
" 1166.0 | \n",
" NaN | \n",
" 1187.0 | \n",
" 8.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 3.0 | \n",
"
\n",
" \n",
" 96 | \n",
" 5/5/2014 | \n",
" 44 | \n",
" 235.0 | \n",
" 13.0 | \n",
" 0.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 157.0 | \n",
" 11.0 | \n",
" 0.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone \\\n",
"108 4/14/2014 23 168.0 NaN NaN \n",
"88 6/3/2014 73 344.0 13.0 NaN \n",
"100 4/24/2014 33 NaN 35.0 0.0 \n",
"35 10/19/2014 211 1540.0 NaN 3706.0 \n",
"22 11/11/2014 234 1919.0 NaN 5586.0 \n",
"96 5/5/2014 44 235.0 13.0 0.0 \n",
"\n",
" Cases_Nigeria Cases_Senegal Cases_UnitedStates Cases_Spain \\\n",
"108 NaN NaN NaN NaN \n",
"88 NaN NaN NaN NaN \n",
"100 NaN NaN NaN NaN \n",
"35 20.0 1.0 3.0 1.0 \n",
"22 20.0 1.0 4.0 1.0 \n",
"96 NaN NaN NaN NaN \n",
"\n",
" Cases_Mali Deaths_Guinea Deaths_Liberia Deaths_SierraLeone \\\n",
"108 NaN 108.0 NaN NaN \n",
"88 NaN 215.0 12.0 6.0 \n",
"100 NaN NaN NaN 0.0 \n",
"35 NaN 904.0 NaN 1259.0 \n",
"22 4.0 1166.0 NaN 1187.0 \n",
"96 NaN 157.0 11.0 0.0 \n",
"\n",
" Deaths_Nigeria Deaths_Senegal Deaths_UnitedStates Deaths_Spain \\\n",
"108 NaN NaN NaN NaN \n",
"88 NaN NaN NaN NaN \n",
"100 NaN NaN NaN NaN \n",
"35 8.0 0.0 1.0 0.0 \n",
"22 8.0 0.0 1.0 0.0 \n",
"96 NaN NaN NaN NaN \n",
"\n",
" Deaths_Mali \n",
"108 NaN \n",
"88 NaN \n",
"100 NaN \n",
"35 NaN \n",
"22 3.0 \n",
"96 NaN "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# load data (source: https://github.com/chendaniely/pandas_for_everyone) \n",
"pfe_path = 'data/pandas_for_everyone_data/' # change this to adjust the relative path\n",
"#visited_df = pd.read_csv(pfe_path+'survey_visited.csv')\n",
"#survey_df = pd.read_csv(pfe_path+'survey_survey.csv')\n",
"ebola_df = pd.read_csv(pfe_path+'country_timeseries.csv')\n",
"ebola_df.sample(6)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" Day | \n",
" Cases_Guinea | \n",
" Cases_Liberia | \n",
" Cases_SierraLeone | \n",
" Cases_Nigeria | \n",
" Cases_Senegal | \n",
" Cases_UnitedStates | \n",
" Cases_Spain | \n",
" Cases_Mali | \n",
" Deaths_Guinea | \n",
" Deaths_Liberia | \n",
" Deaths_SierraLeone | \n",
" Deaths_Nigeria | \n",
" Deaths_Senegal | \n",
" Deaths_UnitedStates | \n",
" Deaths_Spain | \n",
" Deaths_Mali | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1/5/2015 | \n",
" 289 | \n",
" 2776.0 | \n",
" 0.0 | \n",
" 10030.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1786.0 | \n",
" 0.0 | \n",
" 2977.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1/4/2015 | \n",
" 288 | \n",
" 2775.0 | \n",
" 0.0 | \n",
" 9780.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1781.0 | \n",
" 0.0 | \n",
" 2943.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1/3/2015 | \n",
" 287 | \n",
" 2769.0 | \n",
" 8166.0 | \n",
" 9722.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1767.0 | \n",
" 3496.0 | \n",
" 2915.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 1/2/2015 | \n",
" 286 | \n",
" 0.0 | \n",
" 8157.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 3496.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 12/31/2014 | \n",
" 284 | \n",
" 2730.0 | \n",
" 8115.0 | \n",
" 9633.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1739.0 | \n",
" 3471.0 | \n",
" 2827.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 117 | \n",
" 3/27/2014 | \n",
" 5 | \n",
" 103.0 | \n",
" 8.0 | \n",
" 6.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 66.0 | \n",
" 6.0 | \n",
" 5.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 118 | \n",
" 3/26/2014 | \n",
" 4 | \n",
" 86.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 62.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 119 | \n",
" 3/25/2014 | \n",
" 3 | \n",
" 86.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 60.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 120 | \n",
" 3/24/2014 | \n",
" 2 | \n",
" 86.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 59.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 121 | \n",
" 3/22/2014 | \n",
" 0 | \n",
" 49.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 29.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
122 rows × 18 columns
\n",
"
"
],
"text/plain": [
" Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone \\\n",
"0 1/5/2015 289 2776.0 0.0 10030.0 \n",
"1 1/4/2015 288 2775.0 0.0 9780.0 \n",
"2 1/3/2015 287 2769.0 8166.0 9722.0 \n",
"3 1/2/2015 286 0.0 8157.0 0.0 \n",
"4 12/31/2014 284 2730.0 8115.0 9633.0 \n",
".. ... ... ... ... ... \n",
"117 3/27/2014 5 103.0 8.0 6.0 \n",
"118 3/26/2014 4 86.0 0.0 0.0 \n",
"119 3/25/2014 3 86.0 0.0 0.0 \n",
"120 3/24/2014 2 86.0 0.0 0.0 \n",
"121 3/22/2014 0 49.0 0.0 0.0 \n",
"\n",
" Cases_Nigeria Cases_Senegal Cases_UnitedStates Cases_Spain \\\n",
"0 0.0 0.0 0.0 0.0 \n",
"1 0.0 0.0 0.0 0.0 \n",
"2 0.0 0.0 0.0 0.0 \n",
"3 0.0 0.0 0.0 0.0 \n",
"4 0.0 0.0 0.0 0.0 \n",
".. ... ... ... ... \n",
"117 0.0 0.0 0.0 0.0 \n",
"118 0.0 0.0 0.0 0.0 \n",
"119 0.0 0.0 0.0 0.0 \n",
"120 0.0 0.0 0.0 0.0 \n",
"121 0.0 0.0 0.0 0.0 \n",
"\n",
" Cases_Mali Deaths_Guinea Deaths_Liberia Deaths_SierraLeone \\\n",
"0 0.0 1786.0 0.0 2977.0 \n",
"1 0.0 1781.0 0.0 2943.0 \n",
"2 0.0 1767.0 3496.0 2915.0 \n",
"3 0.0 0.0 3496.0 0.0 \n",
"4 0.0 1739.0 3471.0 2827.0 \n",
".. ... ... ... ... \n",
"117 0.0 66.0 6.0 5.0 \n",
"118 0.0 62.0 0.0 0.0 \n",
"119 0.0 60.0 0.0 0.0 \n",
"120 0.0 59.0 0.0 0.0 \n",
"121 0.0 29.0 0.0 0.0 \n",
"\n",
" Deaths_Nigeria Deaths_Senegal Deaths_UnitedStates Deaths_Spain \\\n",
"0 0.0 0.0 0.0 0.0 \n",
"1 0.0 0.0 0.0 0.0 \n",
"2 0.0 0.0 0.0 0.0 \n",
"3 0.0 0.0 0.0 0.0 \n",
"4 0.0 0.0 0.0 0.0 \n",
".. ... ... ... ... \n",
"117 0.0 0.0 0.0 0.0 \n",
"118 0.0 0.0 0.0 0.0 \n",
"119 0.0 0.0 0.0 0.0 \n",
"120 0.0 0.0 0.0 0.0 \n",
"121 0.0 0.0 0.0 0.0 \n",
"\n",
" Deaths_Mali \n",
"0 0.0 \n",
"1 0.0 \n",
"2 0.0 \n",
"3 0.0 \n",
"4 0.0 \n",
".. ... \n",
"117 0.0 \n",
"118 0.0 \n",
"119 0.0 \n",
"120 0.0 \n",
"121 0.0 \n",
"\n",
"[122 rows x 18 columns]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# One option is to consider missing data to have some default value.\n",
"# e.g., in the case of number of Ebola cases, it may be fair to consider missing values to be 0.\n",
"ebola_df.fillna(0, inplace=True)\n",
"ebola_df"
]
},
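{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# A hedged sketch with toy data: fillna also accepts a per-column dict,\n",
"# so different columns can get different defaults when one value does not fit all.\n",
"toy_df = pd.DataFrame({'cases': [5.0, None], 'notes': [None, 'ok']})\n",
"toy_df.fillna({'cases': 0, 'notes': 'unknown'})"
]
},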
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pop | \n",
" gdpPercap | \n",
"
\n",
" \n",
" year | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2002 | \n",
" 25268405 | \n",
" 726.734055 | \n",
"
\n",
" \n",
" 2007 | \n",
" 31889923 | \n",
" 974.580338 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pop gdpPercap\n",
"year \n",
"2002 25268405 726.734055\n",
"2007 31889923 974.580338"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Let's consider an example of how missing data may get created during processing (because it was implicitly missing)\n",
"afg_df = gap_df[gap_df['country']=='Afghanistan'][['year','pop','gdpPercap']].tail(2).set_index('year')\n",
"afg_df"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pop | \n",
" gdpPercap | \n",
"
\n",
" \n",
" year | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2000 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2001 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2002 | \n",
" 25268405.0 | \n",
" 726.734055 | \n",
"
\n",
" \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2005 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2006 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2007 | \n",
" 31889923.0 | \n",
" 974.580338 | \n",
"
\n",
" \n",
" 2008 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2009 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2010 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pop gdpPercap\n",
"year \n",
"2000 NaN NaN\n",
"2001 NaN NaN\n",
"2002 25268405.0 726.734055\n",
"2003 NaN NaN\n",
"2004 NaN NaN\n",
"2005 NaN NaN\n",
"2006 NaN NaN\n",
"2007 31889923.0 974.580338\n",
"2008 NaN NaN\n",
"2009 NaN NaN\n",
"2010 NaN NaN"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Suppose we need yearly population and per capita GDP figures.\n",
"# That information was implicitly missing; reindexing makes it explicitly missing (NaN rows appear).\n",
"afg_df = afg_df.reindex(afg_df.index.tolist() + list(range(2000, 2011)))\n",
"afg_df = afg_df[~afg_df.index.duplicated(keep='first')].sort_index()\n",
"afg_df"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pop | \n",
" gdpPercap | \n",
"
\n",
" \n",
" year | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2000 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2001 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2002 | \n",
" 25268405.0 | \n",
" 726.734055 | \n",
"
\n",
" \n",
" 2003 | \n",
" 25268405.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2004 | \n",
" 25268405.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2005 | \n",
" 25268405.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2006 | \n",
" 25268405.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2007 | \n",
" 31889923.0 | \n",
" 974.580338 | \n",
"
\n",
" \n",
" 2008 | \n",
" 31889923.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2009 | \n",
" 31889923.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2010 | \n",
" 31889923.0 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pop gdpPercap\n",
"year \n",
"2000 NaN NaN\n",
"2001 NaN NaN\n",
"2002 25268405.0 726.734055\n",
"2003 25268405.0 NaN\n",
"2004 25268405.0 NaN\n",
"2005 25268405.0 NaN\n",
"2006 25268405.0 NaN\n",
"2007 31889923.0 974.580338\n",
"2008 31889923.0 NaN\n",
"2009 31889923.0 NaN\n",
"2010 31889923.0 NaN"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# forward fill (the population)\n",
"afg_df.loc[:,'pop'] = afg_df.loc[:,'pop'].ffill()\n",
"afg_df"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pop | \n",
" gdpPercap | \n",
"
\n",
" \n",
" year | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2000 | \n",
" 25268405.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2001 | \n",
" 25268405.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2002 | \n",
" 25268405.0 | \n",
" 726.734055 | \n",
"
\n",
" \n",
" 2003 | \n",
" 25268405.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2004 | \n",
" 25268405.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2005 | \n",
" 25268405.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2006 | \n",
" 25268405.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2007 | \n",
" 31889923.0 | \n",
" 974.580338 | \n",
"
\n",
" \n",
" 2008 | \n",
" 31889923.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2009 | \n",
" 31889923.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2010 | \n",
" 31889923.0 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pop gdpPercap\n",
"year \n",
"2000 25268405.0 NaN\n",
"2001 25268405.0 NaN\n",
"2002 25268405.0 726.734055\n",
"2003 25268405.0 NaN\n",
"2004 25268405.0 NaN\n",
"2005 25268405.0 NaN\n",
"2006 25268405.0 NaN\n",
"2007 31889923.0 974.580338\n",
"2008 31889923.0 NaN\n",
"2009 31889923.0 NaN\n",
"2010 31889923.0 NaN"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# backward fill (the population)\n",
"afg_df.loc[:,'pop'] = afg_df.loc[:,'pop'].bfill()\n",
"afg_df"
]
},
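{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# A hedged aside with a toy series: both ffill and bfill take a limit parameter,\n",
"# capping how many consecutive missing entries get filled.\n",
"s = pd.Series([1.0, None, None, None, 5.0])\n",
"s.ffill(limit=1)"
]
},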
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pop | \n",
" gdpPercap | \n",
"
\n",
" \n",
" year | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2000 | \n",
" 25268405.0 | \n",
" 726.734055 | \n",
"
\n",
" \n",
" 2001 | \n",
" 25268405.0 | \n",
" 726.734055 | \n",
"
\n",
" \n",
" 2002 | \n",
" 25268405.0 | \n",
" 726.734055 | \n",
"
\n",
" \n",
" 2003 | \n",
" 25268405.0 | \n",
" 776.303312 | \n",
"
\n",
" \n",
" 2004 | \n",
" 25268405.0 | \n",
" 825.872568 | \n",
"
\n",
" \n",
" 2005 | \n",
" 25268405.0 | \n",
" 875.441825 | \n",
"
\n",
" \n",
" 2006 | \n",
" 25268405.0 | \n",
" 925.011082 | \n",
"
\n",
" \n",
" 2007 | \n",
" 31889923.0 | \n",
" 974.580338 | \n",
"
\n",
" \n",
" 2008 | \n",
" 31889923.0 | \n",
" 974.580338 | \n",
"
\n",
" \n",
" 2009 | \n",
" 31889923.0 | \n",
" 974.580338 | \n",
"
\n",
" \n",
" 2010 | \n",
" 31889923.0 | \n",
" 974.580338 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pop gdpPercap\n",
"year \n",
"2000 25268405.0 726.734055\n",
"2001 25268405.0 726.734055\n",
"2002 25268405.0 726.734055\n",
"2003 25268405.0 776.303312\n",
"2004 25268405.0 825.872568\n",
"2005 25268405.0 875.441825\n",
"2006 25268405.0 925.011082\n",
"2007 31889923.0 974.580338\n",
"2008 31889923.0 974.580338\n",
"2009 31889923.0 974.580338\n",
"2010 31889923.0 974.580338"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# interpolate fill: many caveats apply\n",
"# linear interpolation is the default (other methods, e.g., polynomial or spline, are also supported)\n",
"# the gaps are filled with linearly interpolated values\n",
"# with limit_direction='both', the extremities just repeat the nearest known value\n",
"\n",
"afg_df.loc[:,'gdpPercap'] = afg_df.loc[:,'gdpPercap'].interpolate(limit_direction='both')\n",
"afg_df\n",
"# you may need your own custom method"
]
},
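{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# A hedged sketch of a custom method, on a toy series: for quantities that grow\n",
"# multiplicatively (like GDP), interpolating the logarithm and exponentiating back\n",
"# may be more sensible than straight linear interpolation.\n",
"s = pd.Series([100.0, None, None, 800.0])\n",
"np.exp(np.log(s).interpolate())"
]
},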
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"scrolled": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" time | \n",
" user | \n",
" video | \n",
" playback position | \n",
" paused | \n",
" volume | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1469974424 | \n",
" cheryl | \n",
" intro.html | \n",
" 5 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1469974454 | \n",
" cheryl | \n",
" intro.html | \n",
" 6 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" 1469974544 | \n",
" cheryl | \n",
" intro.html | \n",
" 9 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" 1469974574 | \n",
" cheryl | \n",
" intro.html | \n",
" 10 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 1469977514 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 5 | \n",
" 1469977544 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 6 | \n",
" 1469977574 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 7 | \n",
" 1469977604 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 8 | \n",
" 1469974604 | \n",
" cheryl | \n",
" intro.html | \n",
" 11 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 9 | \n",
" 1469974694 | \n",
" cheryl | \n",
" intro.html | \n",
" 14 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 10 | \n",
" 1469974724 | \n",
" cheryl | \n",
" intro.html | \n",
" 15 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 11 | \n",
" 1469974454 | \n",
" sue | \n",
" advanced.html | \n",
" 24 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 12 | \n",
" 1469974524 | \n",
" sue | \n",
" advanced.html | \n",
" 25 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 13 | \n",
" 1469974424 | \n",
" sue | \n",
" advanced.html | \n",
" 23 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 14 | \n",
" 1469974554 | \n",
" sue | \n",
" advanced.html | \n",
" 26 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 15 | \n",
" 1469974624 | \n",
" sue | \n",
" advanced.html | \n",
" 27 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 16 | \n",
" 1469974654 | \n",
" sue | \n",
" advanced.html | \n",
" 28 | \n",
" NaN | \n",
" 5.0 | \n",
"
\n",
" \n",
" 17 | \n",
" 1469974724 | \n",
" sue | \n",
" advanced.html | \n",
" 29 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 18 | \n",
" 1469974484 | \n",
" cheryl | \n",
" intro.html | \n",
" 7 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 19 | \n",
" 1469974514 | \n",
" cheryl | \n",
" intro.html | \n",
" 8 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 20 | \n",
" 1469974754 | \n",
" sue | \n",
" advanced.html | \n",
" 30 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 21 | \n",
" 1469974824 | \n",
" sue | \n",
" advanced.html | \n",
" 31 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 22 | \n",
" 1469974854 | \n",
" sue | \n",
" advanced.html | \n",
" 32 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 23 | \n",
" 1469974924 | \n",
" sue | \n",
" advanced.html | \n",
" 33 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 24 | \n",
" 1469977424 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" True | \n",
" 10.0 | \n",
"
\n",
" \n",
" 25 | \n",
" 1469977454 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 26 | \n",
" 1469977484 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 27 | \n",
" 1469977634 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 28 | \n",
" 1469977664 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 29 | \n",
" 1469974634 | \n",
" cheryl | \n",
" intro.html | \n",
" 12 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 30 | \n",
" 1469974664 | \n",
" cheryl | \n",
" intro.html | \n",
" 13 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 31 | \n",
" 1469977694 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 32 | \n",
" 1469977724 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" time user video playback position paused volume\n",
"0 1469974424 cheryl intro.html 5 False 10.0\n",
"1 1469974454 cheryl intro.html 6 NaN NaN\n",
"2 1469974544 cheryl intro.html 9 NaN NaN\n",
"3 1469974574 cheryl intro.html 10 NaN NaN\n",
"4 1469977514 bob intro.html 1 NaN NaN\n",
"5 1469977544 bob intro.html 1 NaN NaN\n",
"6 1469977574 bob intro.html 1 NaN NaN\n",
"7 1469977604 bob intro.html 1 NaN NaN\n",
"8 1469974604 cheryl intro.html 11 NaN NaN\n",
"9 1469974694 cheryl intro.html 14 NaN NaN\n",
"10 1469974724 cheryl intro.html 15 NaN NaN\n",
"11 1469974454 sue advanced.html 24 NaN NaN\n",
"12 1469974524 sue advanced.html 25 NaN NaN\n",
"13 1469974424 sue advanced.html 23 False 10.0\n",
"14 1469974554 sue advanced.html 26 NaN NaN\n",
"15 1469974624 sue advanced.html 27 NaN NaN\n",
"16 1469974654 sue advanced.html 28 NaN 5.0\n",
"17 1469974724 sue advanced.html 29 NaN NaN\n",
"18 1469974484 cheryl intro.html 7 NaN NaN\n",
"19 1469974514 cheryl intro.html 8 NaN NaN\n",
"20 1469974754 sue advanced.html 30 NaN NaN\n",
"21 1469974824 sue advanced.html 31 NaN NaN\n",
"22 1469974854 sue advanced.html 32 NaN NaN\n",
"23 1469974924 sue advanced.html 33 NaN NaN\n",
"24 1469977424 bob intro.html 1 True 10.0\n",
"25 1469977454 bob intro.html 1 NaN NaN\n",
"26 1469977484 bob intro.html 1 NaN NaN\n",
"27 1469977634 bob intro.html 1 NaN NaN\n",
"28 1469977664 bob intro.html 1 NaN NaN\n",
"29 1469974634 cheryl intro.html 12 NaN NaN\n",
"30 1469974664 cheryl intro.html 13 NaN NaN\n",
"31 1469977694 bob intro.html 1 NaN NaN\n",
"32 1469977724 bob intro.html 1 NaN NaN"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Be mindful regarding the semantics of the data\n",
"# This data is from a 'toy example' on video playbacks \n",
"# from Prof. Christopher Brooks' course on Data Science in Python \n",
"# Usage logs are collected from multiple users in parallel, with some missing data\n",
"data_from_coursera_path ='data/coursera_sourced_data/'\n",
"log_df=pd.read_csv(data_from_coursera_path+'log.csv')\n",
"log_df"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user | \n",
" video | \n",
" playback position | \n",
" paused | \n",
" volume | \n",
"
\n",
" \n",
" time | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 1469974424 | \n",
" cheryl | \n",
" intro.html | \n",
" 5 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974424 | \n",
" sue | \n",
" advanced.html | \n",
" 23 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974454 | \n",
" cheryl | \n",
" intro.html | \n",
" 6 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974454 | \n",
" sue | \n",
" advanced.html | \n",
" 24 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974484 | \n",
" cheryl | \n",
" intro.html | \n",
" 7 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974514 | \n",
" cheryl | \n",
" intro.html | \n",
" 8 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974524 | \n",
" sue | \n",
" advanced.html | \n",
" 25 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974544 | \n",
" cheryl | \n",
" intro.html | \n",
" 9 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974554 | \n",
" sue | \n",
" advanced.html | \n",
" 26 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974574 | \n",
" cheryl | \n",
" intro.html | \n",
" 10 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974604 | \n",
" cheryl | \n",
" intro.html | \n",
" 11 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974624 | \n",
" sue | \n",
" advanced.html | \n",
" 27 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974634 | \n",
" cheryl | \n",
" intro.html | \n",
" 12 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974654 | \n",
" sue | \n",
" advanced.html | \n",
" 28 | \n",
" NaN | \n",
" 5.0 | \n",
"
\n",
" \n",
" 1469974664 | \n",
" cheryl | \n",
" intro.html | \n",
" 13 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974694 | \n",
" cheryl | \n",
" intro.html | \n",
" 14 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974724 | \n",
" cheryl | \n",
" intro.html | \n",
" 15 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974724 | \n",
" sue | \n",
" advanced.html | \n",
" 29 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974754 | \n",
" sue | \n",
" advanced.html | \n",
" 30 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974824 | \n",
" sue | \n",
" advanced.html | \n",
" 31 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974854 | \n",
" sue | \n",
" advanced.html | \n",
" 32 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974924 | \n",
" sue | \n",
" advanced.html | \n",
" 33 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977424 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" True | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469977454 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977484 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977514 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977544 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977574 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977604 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977634 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977664 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977694 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977724 | \n",
" bob | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user video playback position paused volume\n",
"time \n",
"1469974424 cheryl intro.html 5 False 10.0\n",
"1469974424 sue advanced.html 23 False 10.0\n",
"1469974454 cheryl intro.html 6 NaN NaN\n",
"1469974454 sue advanced.html 24 NaN NaN\n",
"1469974484 cheryl intro.html 7 NaN NaN\n",
"1469974514 cheryl intro.html 8 NaN NaN\n",
"1469974524 sue advanced.html 25 NaN NaN\n",
"1469974544 cheryl intro.html 9 NaN NaN\n",
"1469974554 sue advanced.html 26 NaN NaN\n",
"1469974574 cheryl intro.html 10 NaN NaN\n",
"1469974604 cheryl intro.html 11 NaN NaN\n",
"1469974624 sue advanced.html 27 NaN NaN\n",
"1469974634 cheryl intro.html 12 NaN NaN\n",
"1469974654 sue advanced.html 28 NaN 5.0\n",
"1469974664 cheryl intro.html 13 NaN NaN\n",
"1469974694 cheryl intro.html 14 NaN NaN\n",
"1469974724 cheryl intro.html 15 NaN NaN\n",
"1469974724 sue advanced.html 29 NaN NaN\n",
"1469974754 sue advanced.html 30 NaN NaN\n",
"1469974824 sue advanced.html 31 NaN NaN\n",
"1469974854 sue advanced.html 32 NaN NaN\n",
"1469974924 sue advanced.html 33 NaN NaN\n",
"1469977424 bob intro.html 1 True 10.0\n",
"1469977454 bob intro.html 1 NaN NaN\n",
"1469977484 bob intro.html 1 NaN NaN\n",
"1469977514 bob intro.html 1 NaN NaN\n",
"1469977544 bob intro.html 1 NaN NaN\n",
"1469977574 bob intro.html 1 NaN NaN\n",
"1469977604 bob intro.html 1 NaN NaN\n",
"1469977634 bob intro.html 1 NaN NaN\n",
"1469977664 bob intro.html 1 NaN NaN\n",
"1469977694 bob intro.html 1 NaN NaN\n",
"1469977724 bob intro.html 1 NaN NaN"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Let's view the data sorted by time\n",
"log_df=log_df.set_index('time').sort_index() \n",
"log_df"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" video | \n",
" playback position | \n",
" paused | \n",
" volume | \n",
"
\n",
" \n",
" user | \n",
" time | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" bob | \n",
" 1469977424 | \n",
" intro.html | \n",
" 1 | \n",
" True | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469977454 | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977484 | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977514 | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977544 | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977574 | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977604 | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977634 | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977664 | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977694 | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469977724 | \n",
" intro.html | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" cheryl | \n",
" 1469974424 | \n",
" intro.html | \n",
" 5 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974454 | \n",
" intro.html | \n",
" 6 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974484 | \n",
" intro.html | \n",
" 7 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974514 | \n",
" intro.html | \n",
" 8 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974544 | \n",
" intro.html | \n",
" 9 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974574 | \n",
" intro.html | \n",
" 10 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974604 | \n",
" intro.html | \n",
" 11 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974634 | \n",
" intro.html | \n",
" 12 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974664 | \n",
" intro.html | \n",
" 13 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974694 | \n",
" intro.html | \n",
" 14 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974724 | \n",
" intro.html | \n",
" 15 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" sue | \n",
" 1469974424 | \n",
" advanced.html | \n",
" 23 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974454 | \n",
" advanced.html | \n",
" 24 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974524 | \n",
" advanced.html | \n",
" 25 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974554 | \n",
" advanced.html | \n",
" 26 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974624 | \n",
" advanced.html | \n",
" 27 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974654 | \n",
" advanced.html | \n",
" 28 | \n",
" NaN | \n",
" 5.0 | \n",
"
\n",
" \n",
" 1469974724 | \n",
" advanced.html | \n",
" 29 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974754 | \n",
" advanced.html | \n",
" 30 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974824 | \n",
" advanced.html | \n",
" 31 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974854 | \n",
" advanced.html | \n",
" 32 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1469974924 | \n",
" advanced.html | \n",
" 33 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" video playback position paused volume\n",
"user time \n",
"bob 1469977424 intro.html 1 True 10.0\n",
" 1469977454 intro.html 1 NaN NaN\n",
" 1469977484 intro.html 1 NaN NaN\n",
" 1469977514 intro.html 1 NaN NaN\n",
" 1469977544 intro.html 1 NaN NaN\n",
" 1469977574 intro.html 1 NaN NaN\n",
" 1469977604 intro.html 1 NaN NaN\n",
" 1469977634 intro.html 1 NaN NaN\n",
" 1469977664 intro.html 1 NaN NaN\n",
" 1469977694 intro.html 1 NaN NaN\n",
" 1469977724 intro.html 1 NaN NaN\n",
"cheryl 1469974424 intro.html 5 False 10.0\n",
" 1469974454 intro.html 6 NaN NaN\n",
" 1469974484 intro.html 7 NaN NaN\n",
" 1469974514 intro.html 8 NaN NaN\n",
" 1469974544 intro.html 9 NaN NaN\n",
" 1469974574 intro.html 10 NaN NaN\n",
" 1469974604 intro.html 11 NaN NaN\n",
" 1469974634 intro.html 12 NaN NaN\n",
" 1469974664 intro.html 13 NaN NaN\n",
" 1469974694 intro.html 14 NaN NaN\n",
" 1469974724 intro.html 15 NaN NaN\n",
"sue 1469974424 advanced.html 23 False 10.0\n",
" 1469974454 advanced.html 24 NaN NaN\n",
" 1469974524 advanced.html 25 NaN NaN\n",
" 1469974554 advanced.html 26 NaN NaN\n",
" 1469974624 advanced.html 27 NaN NaN\n",
" 1469974654 advanced.html 28 NaN 5.0\n",
" 1469974724 advanced.html 29 NaN NaN\n",
" 1469974754 advanced.html 30 NaN NaN\n",
" 1469974824 advanced.html 31 NaN NaN\n",
" 1469974854 advanced.html 32 NaN NaN\n",
" 1469974924 advanced.html 33 NaN NaN"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"log_df=log_df.reset_index().set_index(['user','time']).sort_index()\n",
"# We want to use a 2-level index, so that all entries for a given user are together\n",
"# Chaining can keep your code short. Make sure it does not make it too cryptic though ;) \n",
"# Motivation: using user Cheryl's volume setting to fill in user Bob's volume may not be meaningful\n",
"log_df"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" video | \n",
" playback position | \n",
" paused | \n",
" volume | \n",
"
\n",
" \n",
" user | \n",
" time | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" bob | \n",
" 1469977424 | \n",
" intro.html | \n",
" 1 | \n",
" True | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469977454 | \n",
" intro.html | \n",
" 1 | \n",
" True | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469977484 | \n",
" intro.html | \n",
" 1 | \n",
" True | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469977514 | \n",
" intro.html | \n",
" 1 | \n",
" True | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469977544 | \n",
" intro.html | \n",
" 1 | \n",
" True | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469977574 | \n",
" intro.html | \n",
" 1 | \n",
" True | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469977604 | \n",
" intro.html | \n",
" 1 | \n",
" True | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469977634 | \n",
" intro.html | \n",
" 1 | \n",
" True | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469977664 | \n",
" intro.html | \n",
" 1 | \n",
" True | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469977694 | \n",
" intro.html | \n",
" 1 | \n",
" True | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469977724 | \n",
" intro.html | \n",
" 1 | \n",
" True | \n",
" 10.0 | \n",
"
\n",
" \n",
" cheryl | \n",
" 1469974424 | \n",
" intro.html | \n",
" 5 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974454 | \n",
" intro.html | \n",
" 6 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974484 | \n",
" intro.html | \n",
" 7 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974514 | \n",
" intro.html | \n",
" 8 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974544 | \n",
" intro.html | \n",
" 9 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974574 | \n",
" intro.html | \n",
" 10 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974604 | \n",
" intro.html | \n",
" 11 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974634 | \n",
" intro.html | \n",
" 12 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974664 | \n",
" intro.html | \n",
" 13 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974694 | \n",
" intro.html | \n",
" 14 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974724 | \n",
" intro.html | \n",
" 15 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" sue | \n",
" 1469974424 | \n",
" advanced.html | \n",
" 23 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974454 | \n",
" advanced.html | \n",
" 24 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974524 | \n",
" advanced.html | \n",
" 25 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974554 | \n",
" advanced.html | \n",
" 26 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974624 | \n",
" advanced.html | \n",
" 27 | \n",
" False | \n",
" 10.0 | \n",
"
\n",
" \n",
" 1469974654 | \n",
" advanced.html | \n",
" 28 | \n",
" False | \n",
" 5.0 | \n",
"
\n",
" \n",
" 1469974724 | \n",
" advanced.html | \n",
" 29 | \n",
" False | \n",
" 5.0 | \n",
"
\n",
" \n",
" 1469974754 | \n",
" advanced.html | \n",
" 30 | \n",
" False | \n",
" 5.0 | \n",
"
\n",
" \n",
" 1469974824 | \n",
" advanced.html | \n",
" 31 | \n",
" False | \n",
" 5.0 | \n",
"
\n",
" \n",
" 1469974854 | \n",
" advanced.html | \n",
" 32 | \n",
" False | \n",
" 5.0 | \n",
"
\n",
" \n",
" 1469974924 | \n",
" advanced.html | \n",
" 33 | \n",
" False | \n",
" 5.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" video playback position paused volume\n",
"user time \n",
"bob 1469977424 intro.html 1 True 10.0\n",
" 1469977454 intro.html 1 True 10.0\n",
" 1469977484 intro.html 1 True 10.0\n",
" 1469977514 intro.html 1 True 10.0\n",
" 1469977544 intro.html 1 True 10.0\n",
" 1469977574 intro.html 1 True 10.0\n",
" 1469977604 intro.html 1 True 10.0\n",
" 1469977634 intro.html 1 True 10.0\n",
" 1469977664 intro.html 1 True 10.0\n",
" 1469977694 intro.html 1 True 10.0\n",
" 1469977724 intro.html 1 True 10.0\n",
"cheryl 1469974424 intro.html 5 False 10.0\n",
" 1469974454 intro.html 6 False 10.0\n",
" 1469974484 intro.html 7 False 10.0\n",
" 1469974514 intro.html 8 False 10.0\n",
" 1469974544 intro.html 9 False 10.0\n",
" 1469974574 intro.html 10 False 10.0\n",
" 1469974604 intro.html 11 False 10.0\n",
" 1469974634 intro.html 12 False 10.0\n",
" 1469974664 intro.html 13 False 10.0\n",
" 1469974694 intro.html 14 False 10.0\n",
" 1469974724 intro.html 15 False 10.0\n",
"sue 1469974424 advanced.html 23 False 10.0\n",
" 1469974454 advanced.html 24 False 10.0\n",
" 1469974524 advanced.html 25 False 10.0\n",
" 1469974554 advanced.html 26 False 10.0\n",
" 1469974624 advanced.html 27 False 10.0\n",
" 1469974654 advanced.html 28 False 5.0\n",
" 1469974724 advanced.html 29 False 5.0\n",
" 1469974754 advanced.html 30 False 5.0\n",
" 1469974824 advanced.html 31 False 5.0\n",
" 1469974854 advanced.html 32 False 5.0\n",
" 1469974924 advanced.html 33 False 5.0"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Now it makes sense to assume that the missing values\n",
"# correspond to the last recorded value for that particular user\n",
"# Forward the Fill! (based on a reasonable Foundation of the assumptions ;)\n",
"log_df = log_df.ffill() # fillna(method='ffill') is deprecated in recent pandas\n",
"log_df"
]
},
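{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Aside (a sketch, not part of the original flow): a plain forward fill can leak the last value of one user into the next user's rows whenever a user's *first* row is missing. Grouping by the `user` index level before filling rules that out; here the result is identical, because each user's first row happens to be complete."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# Safer alternative: forward-fill within each user group only,\n",
"# so values can never propagate across user boundaries\n",
"log_df.groupby(level='user').ffill()"
]
},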
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Some basic text processing & RegEx\n",
"
\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Doctors', 'may', 'soon', 'identify', '#diabetes', 'patients', 'who', 'have', 'a', 'higher', 'risk', 'of', 'vascular', '#inflammation', 'with', 'a', 'one-step', 'process', 'using', 'a', '‘lab-on-a-chip’', 'device', 'developed', 'by', 'scientists', 'from', '@NTUsg,', '@TTSH', 'and', '@MIT.', '#NTUsg2025', '#NTUsgResearch', 'https://bit.ly/ExoDFFchip']\n"
]
}
],
"source": [
"# Tokens: split the tweet on whitespace\n",
"NTU_tweet =\"\"\"Doctors may soon identify #diabetes patients who have a higher risk of vascular #inflammation with a one-step process using a ‘lab-on-a-chip’ device developed by scientists from @NTUsg, @TTSH and @MIT. #NTUsg2025 #NTUsgResearch https://bit.ly/ExoDFFchip\"\"\"\n",
"Tweet_tokens=NTU_tweet.split()\n",
"print(Tweet_tokens)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Doctors', 'identify', 'patients', 'higher', 'vascular', 'one-step', 'process', 'using', '‘lab-on-a-chip’', 'device', 'developed', 'scientists']\n"
]
}
],
"source": [
"# Find 'long' words (more than 4 letters) that are not hashtags, mentions, or URLs\n",
"Tweet_longwords=[x for x in Tweet_tokens if len(x)>4 and '#' not in x and '@' not in x and '://' not in x]\n",
"print(Tweet_longwords)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['#diabetes', '#inflammation', '#NTUsg2025', '#NTUsgResearch']\n",
"['#diabetes', '#inflammation', '#NTUsg2025', '#NTUsgResearch', 'stranger#things']\n"
]
}
],
"source": [
"# What if we want to find all the hashtags?\n",
"\n",
"# Attempt 1:\n",
"Tweet_hashtags1a=[x for x in Tweet_tokens if (x.find('#')>=0) ]\n",
"Tweet_tokens.append(\"stranger#things\")\n",
"Tweet_hashtags1b=[x for x in Tweet_tokens if (x.find('#')>=0) ]\n",
"\n",
"print(Tweet_hashtags1a)\n",
"print(Tweet_hashtags1b)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['#diabetes', '#inflammation', '#NTUsg2025', '#NTUsgResearch']\n",
"['#diabetes', '#inflammation', '#NTUsg2025', '#NTUsgResearch']\n"
]
}
],
"source": [
"# Attempt 2:\n",
"# Just in case there's something like 'stranger#things', which is not a hashtag!\n",
"# We might want to use one of the following:\n",
"Tweet_hashtags2=[x for x in Tweet_tokens if (x.find('#')==0) ] \n",
"Tweet_hashtags3=[x for x in Tweet_tokens if x.startswith('#') ]\n",
"\n",
"print(Tweet_hashtags2)\n",
"print(Tweet_hashtags3)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['@NTUsg,', '#NTUsg2025', '#NTUsgResearch']\n",
"['@NTUsg,', '#NTUsg2025', '#NTUsgResearch']\n"
]
}
],
"source": [
"# Find all mentions of 'NTU'\n",
"Contains_NTU1=[x for x in Tweet_tokens if (x.find('NTU')>=0) ]\n",
"Contains_NTU2=[x for x in Tweet_tokens if 'NTU' in x]\n",
"\n",
"print(Contains_NTU1)\n",
"print(Contains_NTU2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Some frequently used & useful string methods \n",
"\n",
"- checking: isalpha/isdigit/isalnum \n",
"- transforming: lower/upper/title/replace/strip/rstrip\n",
"- locate: find/rfind\n",
"\n",
"See https://docs.python.org/3/library/stdtypes.html for a more exhaustive list."
]
},
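{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# A quick illustration of a few of these string methods (added examples)\n",
"print('NTUsg2025'.isalnum()) # True: letters and digits only\n",
"print('Doctors'.lower()) # doctors\n",
"print('  padded  '.strip()) # strips leading/trailing whitespace\n",
"print('@NTUsg,'.rstrip(',')) # @NTUsg\n",
"print('intro.html'.replace('.html','')) # intro"
]
},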
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['#diabetes', '#inflammation', '#NTUsg2025', '#NTUsgResearch']"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Let's redo the extraction of hashtags with a RegEx\n",
"\n",
"[x for x in Tweet_tokens if re.search(r'\\B#[A-Za-z0-9_]+',x)]\n",
"\n",
"# For this specific example tweet, re.search('#',x) would have sufficed,\n",
"# but it would not handle the 'stranger' cases: an isolated #, ##, or a hash sandwiched inside text."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### How to interpret r'\\B#[A-Za-z0-9_]+'?\n",
"    [x for x in Tweet_tokens if re.search(r'\\B#[A-Za-z0-9_]+',x)] \n",
"    \n",
"- \\B matches a position that is *not* a word boundary; since # is a non-word character, this holds when # starts the token or follows another non-word character, and fails for cases like stranger#things\n",
"\n",
"- [A-Za-z0-9_] matches any of A to Z, a to z, 0 to 9 and _\n",
"\n",
"- + greedily matches the preceding expression one or more times."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['#diabetes', '#inflammation', '#NTUsg2025', '#NTUsgResearch']"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The same RegEx can be rewritten in a much shorter (and slightly more cryptic) manner\n",
"[x for x in Tweet_tokens if re.search(r'\\B#\\w+',x)]"
]
},
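{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# Aside (added sketch): instead of tokenising first, re.findall can\n",
"# extract all hashtags from the raw tweet string in one pass\n",
"re.findall(r'\\B#\\w+', NTU_tweet)"
]
},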
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### RegEx cheatsheet snippet\n",
"Source: Dataquest.io\n",
"\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### RegEx cheatsheet snippet\n",
"Source: Dataquest.io\n",
"\n",
"
\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 03/25/93 Total time of visit (in minutes):\n",
"1 6/18/85 Primary Care Doctor:\n",
"2 sshe plans to move as of 7/8/71 In-Home Servic...\n",
"3 7 on 9/27/75 Audit C Score Current:\n",
"4 2/6/96 sleep studyPain Treatment Pain Level (N...\n",
" ... \n",
"495 1979 Family Psych History: Family History of S...\n",
"496 therapist and friend died in ~2006 Parental/Ca...\n",
"497 2008 partial thyroidectomy\n",
"498 sPt describes a history of sexual abuse as a c...\n",
"499 . In 1980, patient was living in Naples and de...\n",
"Length: 500, dtype: object"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Open a file, read it line by line, and create a pandas Series\n",
"# The following data is obtained from Prof V. Vydiswaran's course on Applied Text Mining in Python\n",
"\n",
"doc = []\n",
"with open(data_from_coursera_path+'text-with-dates.txt','r') as file:\n",
" for line in file:\n",
" doc.append(line.rstrip()) # used rstrip to remove trailing newlines \\n\n",
"\n",
"datetxt_df = pd.Series(doc)\n",
"datetxt_df"
]
},
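{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# A first sketch (added): pull out slash-separated dates like 03/25/93 or 5/24/1990.\n",
"# This covers only one of the several date formats present in this file.\n",
"datetxt_df.str.extract(r'(\\d{1,2}/\\d{1,2}/\\d{2,4})')"
]
},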
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['03/25/93 Total time of visit (in minutes):',\n",
" '6/18/85 Primary Care Doctor:',\n",
" 'sshe plans to move as of 7/8/71 In-Home Services: None',\n",
" '7 on 9/27/75 Audit C Score Current:',\n",
" '2/6/96 sleep studyPain Treatment Pain Level (Numeric Scale): 7',\n",
" '.Per 7/06/79 Movement D/O note:',\n",
" \"4, 5/18/78 Patient's thoughts about current substance abuse:\",\n",
" '10/24/89 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
" '3/7/86 SOS-10 Total Score:',\n",
" '(4/10/71)Score-1Audit C Score Current:',\n",
" '(5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC_12.6Activities of Daily Living (ADL) Bathing: Independent',\n",
" '4/09/75 SOS-10 Total Score:',\n",
" '8/01/98 Communication with referring physician?: Done',\n",
" '1/26/72 Communication with referring physician?: Not Done',\n",
" '5/24/1990 CPT Code: 90792: With medical services',\n",
" '1/25/2011 CPT Code: 90792: With medical services',\n",
" '4/12/82 Total time of visit (in minutes):',\n",
" '1; 10/13/1976 Audit C Score, Highest/Date:',\n",
" '4, 4/24/98 Relevant Drug History:',\n",
" ') 59 yo unemployed w referred by Urgent Care for psychiatric evaluation and follow up. The patient reports she has been dx w BAD. Her main complaint today is anergy. Ms. Hartman was evaluated on one occasion 5/21/77. She was a cooperative but somewhat vague historian.History of Present Illness and Precipitating Events',\n",
" '7/21/98 Total time of visit (in minutes):',\n",
" '10/21/79 SOS-10 Total Score:',\n",
" '3/03/90 CPT Code: 90792: With medical services',\n",
" '2/11/76 CPT Code: 90792: With medical services',\n",
" '07/25/1984 CPT Code: 90791: No medical services',\n",
" '4-13-82 Other Child Mental Health Outcomes Scales Used:',\n",
" '9/22/89 CPT Code: 90792: With medical services',\n",
" '9/02/76 CPT Code: 90791: No medical services',\n",
" '9/12/71 [report_end]',\n",
" '10/24/86 Communication with referring physician?: Done',\n",
" '03/31/1985 Total time of visit (in minutes):',\n",
" '7/20/72 CPT Code: 90791: No medical services',\n",
" '4/12/87= 1, negativeAudit C Score Current:',\n",
" '06/20/91 Total time of visit (in minutes):',\n",
" '5/12/2012 Primary Care Doctor:',\n",
" '3/15/83 SOS-10 Total Score:',\n",
" '2/14/73 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
" '5/24/88 CPT Code: 90792: With medical services',\n",
" '7/27/1986 Total time of visit (in minutes):',\n",
" '1-14-81 Communication with referring physician?: Done',\n",
" '7-29-75 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
" '(6/24/87) TSH-2.18; Activities of Daily Living (ADL) Bathing: Independent',\n",
" '8/14/94 Primary Care Doctor:',\n",
" '4/13/2002 Primary Care Doctor:',\n",
" '8/16/82 CPT Code: 90792: With medical services',\n",
" '2/15/1998 Total time of visit (in minutes):',\n",
" '7/15/91 CPT Code: 90792: With medical services',\n",
" '06/12/94 SOS-10 Total Score:',\n",
" '9/17/84 Communication with referring physician?: Done',\n",
" '2/28/75 Other Adult Mental Health Outcomes Scales Used:',\n",
" 'sOP WPM - Dr. Romo-psychopharm since 11/22/75',\n",
" '\"In the context of the patient\\'s work up, she had Neuropsychological testing 5/24/91. She was referred because of her cognitive decline and symptoms of motor dysfunction to help characterize her current functioning. :',\n",
" '6/13/92 CPT Code: 90791: No medical services',\n",
" '7/11/71 SOS-10 Total Score:',\n",
" '12/26/86 CPT Code: 90791: No medical services',\n",
" '10/11/1987 CPT Code: 90791: No medical services',\n",
" '3/14/95 Primary Care Doctor:',\n",
" '12/01/73 CPT Code: 90791: No medical services',\n",
" '0: 12/5/2010 Audit C Score Current:',\n",
" '08/20/1982 SOS-10 Total Score:',\n",
" '7/24/95 SOS-10 Total Score:',\n",
" '8/06/83 CPT Code: 90791: No medical services',\n",
" '02/22/92 SOS-10 Total Score:',\n",
" '6/28/87 Impression Strengths/Abilities:',\n",
" '07/29/1994 CPT code: 99203',\n",
" '08/11/78 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
" '10/29/91 Communication with referring physician?: Done',\n",
" '7/6/91 SOS-10 Total Score:',\n",
" 'onone as of 1/21/87 Protective Factors:',\n",
" '24 yo right handed woman with history of large right frontal mass s/p resection 11/3/1985 who had recent urgent R cranial wound revision and placement of L EVD for declining vision and increased drainage from craniotomy incision site and possible infection. She has a hx of secondary mania related to psychosis and manipulation of her right frontal lobe.',\n",
" '7/04/82 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
" '4-13-89 Communication with referring physician?: Not Done',\n",
" 'Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox + fluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical Review of Systems Constitutional:',\n",
" '4/12/74= 1Caffeine / Tobacco Use Caffeinated products: Yes',\n",
" '09/19/81 CPT Code: 90792: With medical services',\n",
" '9/6/79 Primary Care Doctor:',\n",
" '12/5/87 Total time of visit (in minutes):',\n",
" '01/05/1999 [report_end]',\n",
" '4/22/80 SOS-10 Total Score:',\n",
" '10/04/98 SOS-10 Total Score:',\n",
" '.Mr. Echeverria described having panic-like experiences on at least three occasions. The first of these episodes occurred when facial meshing was placed on him during radiation therapy. He recalled the experience vividly and described \"freaking out\", involving a sensation of crushing in his chest, difficulty breathing, a fear that he could die, and intense emotional disquiet. The second episode occurred during the last 10-15 minutes of an MRI scan. Mr. Echeverria noted feeling surprised and fearful when the feelings of panic started. Similar physiological and psychological symptoms were reported as above. The third episode occurred recently on 6/29/81. Mr.Echeverria was driving his car and felt similar symptoms as described above. He reported fearing that he would crash his vehicle and subsequently pulled over and abandoned his vehicle.',\n",
" '. Wile taking additional history, pt endorsed moderate depression and anxiety last 2 weeks, even though her period was on 8/04/78. Pt then acknowledged low to moderate mood and anxiety symptoms throughout the month, with premenstrual worsening. This mood pattern seems to have existed for many years, while premenstrual worsening increased in severity during last ~ 2 years, in context of likely peri menopause transition. Pt reported average of 3 hot flashes per day and 2 night sweats.',\n",
" 'Death of mother; 7/07/1974 Meaningful activities/supports:',\n",
" '09/14/2000 CPT Code: 90792: With medical services',\n",
" '5/18/71 Total time of visit (in minutes):',\n",
" '8/09/1981 Communication with referring physician?: Done',\n",
" '6/05/93 CPT Code: 90791: No medical services',\n",
" ')Dilantin (PHENYTOIN) 100 MG CAPSULE Take 2 Capsule(s) PO HS; No Change (Taking Differently), Comments: decreased from 290 daily to 260 mg daily due to elevated phenytoin level from 8/9/97',\n",
" '12/8/82 Audit C=3Audit C Score Current:',\n",
" '8/26/89 CPT Code: 90791: No medical services',\n",
" '10/13/95 CPT Code: 90791: No medical services',\n",
" '4/19/91 Communication with referring physician?: Not Done',\n",
" '.APS - Psychiatry consult paged/requested in person at 04/08/2004 16:39 Patrick, Christian [hpp2]',\n",
" '9/20/76 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
" '12/08/1990 @11 am [report_end]',\n",
" '4/11/1974 Chief Complaint / HPI Chief Complaint (Patients own words)',\n",
" '7/18/86 SOS-10 Total Score:',\n",
" '3/31/91 Communication with referring physician?: Done',\n",
" '5/13/72 Other Adult Mental Health Outcomes Scales Used:',\n",
" '011/14/83 Audit C Score Current:',\n",
" '8/16/92 SOS-10 Total Score:',\n",
" '10/05/97 CPT Code: 90791: No medical services',\n",
" '07/18/2002 CPT Code: 90792: With medical services',\n",
" '9/22/82 Total time of visit (in minutes):',\n",
" '2/24/74 SOS-10 Total Score:',\n",
" '(2/03/78) TSH-0.90 Activities of Daily Living (ADL) Bathing: Independent',\n",
" '2/11/2006 CPT Code: 90791: No medical services',\n",
" \"Pt is a 21 year old, single, heterosexual identified, Liechtenstein male. He resides in Rabat, in a rented apartment, with a roommate. He has not graduated from high school nor has he earned his GED. He has been working full time, steadily, since the age of 16. He recently left a job at jacobs engineering group where he worked for about 5 years. He started working at becton dickinson a few weeks ago. Referral to SMH was precipitated by pt seeing his PCP, Dr. Ford, for a physical and sharing concerns re: depression. He was having suicidal thoughts the week of 8/22/83 and noticed that his depression and anxiety symptoms were starting to interfere with work. Per Dr. Ford's recommendation, he contacted PAL for assistance. When screened for substance use, pt was referred to SMH for an intake. Pt was scheduled to be seen last week due to urgency of his depression and SI, but missed the intake appointment due to getting extremely intoxicated the night before. Pt presents today seeking individual therapy and psychiatry services to address longstanding depression.\",\n",
" 'PET Scan (DPSH 5/04/74): 1) Marked hypometabolism involving the bilateral caudate nuclei. This is a non-specific finding, although this raises the question of a multisystem atrophy, specifically MSA-P (striatonigral degeneration), 2) Cortical hypometabolism involving the right inferior frontal gyrus. This is a non-specific finding of uncertain significance, and is not definitively related to a neurodegenerative process, although pathology in this region is not excluded',\n",
" '7/20/2011 [report_end]',\n",
" '6/17/95 Total time of visit (in minutes):',\n",
" '6/10/72 SOS-10 Total Score:',\n",
" 'nPt denied use to me but endorsed use on 10/16/82 as part of BVH initial visit summary supplement by Nicholas BenjaminOpiates: Yes',\n",
" '12/15/92 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
" '12/8/97 SOS-10 Total Score:',\n",
" '4/05/89 Primary Care Doctor:',\n",
" '12/04/87 SOS-10 Total Score:',\n",
" '4 (6/20/77)Audit C Score Current:',\n",
" 'see 4/27/2006 consult note Dr. GuevaraWhat factors in prior treatment were helpful/not helpful:',\n",
" '07/17/92 CPT Code: 90791: No medical services',\n",
" '12/22/98 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
" '10/02/96 Age:',\n",
" '11/05/90 CPT Code: 90792: With medical services',\n",
" '5/04/77 CPT Code: 90792: With medical services',\n",
" '2/27/96 Communication with referring physician?: Done']"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# What does the following code do?\n",
"[x for x in datetxt_df if re.findall(r'\\d{1,2}[/|-]\\d{1,2}[/|-]\\d{2,4}',x)]"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"scrolled": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['03/25/93 Total time of visit (in minutes):',\n",
" '6/18/85 Primary Care Doctor:',\n",
" 'sshe plans to move as of 7/8/71 In-Home Services: None',\n",
" '7 on 9/27/75 Audit C Score Current:',\n",
" '2/6/96 sleep studyPain Treatment Pain Level (Numeric Scale): 7',\n",
" '.Per 7/06/79 Movement D/O note:',\n",
" \"4, 5/18/78 Patient's thoughts about current substance abuse:\",\n",
" '10/24/89 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
" '3/7/86 SOS-10 Total Score:',\n",
" '(4/10/71)Score-1Audit C Score Current:',\n",
" '(5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC_12.6Activities of Daily Living (ADL) Bathing: Independent',\n",
" '4/09/75 SOS-10 Total Score:',\n",
" '8/01/98 Communication with referring physician?: Done',\n",
" '1/26/72 Communication with referring physician?: Not Done',\n",
" '5/24/1990 CPT Code: 90792: With medical services',\n",
" '1/25/2011 CPT Code: 90792: With medical services',\n",
" '4/12/82 Total time of visit (in minutes):',\n",
" '1; 10/13/1976 Audit C Score, Highest/Date:',\n",
" '4, 4/24/98 Relevant Drug History:',\n",
" ') 59 yo unemployed w referred by Urgent Care for psychiatric evaluation and follow up. The patient reports she has been dx w BAD. Her main complaint today is anergy. Ms. Hartman was evaluated on one occasion 5/21/77. She was a cooperative but somewhat vague historian.History of Present Illness and Precipitating Events',\n",
" '7/21/98 Total time of visit (in minutes):',\n",
" '10/21/79 SOS-10 Total Score:',\n",
" '3/03/90 CPT Code: 90792: With medical services',\n",
" '2/11/76 CPT Code: 90792: With medical services',\n",
" '07/25/1984 CPT Code: 90791: No medical services',\n",
" '4-13-82 Other Child Mental Health Outcomes Scales Used:',\n",
" '9/22/89 CPT Code: 90792: With medical services',\n",
" '9/02/76 CPT Code: 90791: No medical services',\n",
" '9/12/71 [report_end]',\n",
" '10/24/86 Communication with referring physician?: Done',\n",
" '03/31/1985 Total time of visit (in minutes):',\n",
" '7/20/72 CPT Code: 90791: No medical services',\n",
" '4/12/87= 1, negativeAudit C Score Current:',\n",
" '06/20/91 Total time of visit (in minutes):',\n",
" '5/12/2012 Primary Care Doctor:',\n",
" '3/15/83 SOS-10 Total Score:',\n",
" '2/14/73 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
" '5/24/88 CPT Code: 90792: With medical services',\n",
" '7/27/1986 Total time of visit (in minutes):',\n",
" '1-14-81 Communication with referring physician?: Done',\n",
" '7-29-75 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
" '(6/24/87) TSH-2.18; Activities of Daily Living (ADL) Bathing: Independent',\n",
" '8/14/94 Primary Care Doctor:',\n",
" '4/13/2002 Primary Care Doctor:',\n",
" '8/16/82 CPT Code: 90792: With medical services',\n",
" '2/15/1998 Total time of visit (in minutes):',\n",
" '7/15/91 CPT Code: 90792: With medical services',\n",
" '06/12/94 SOS-10 Total Score:',\n",
" '9/17/84 Communication with referring physician?: Done',\n",
" '2/28/75 Other Adult Mental Health Outcomes Scales Used:',\n",
" 'sOP WPM - Dr. Romo-psychopharm since 11/22/75',\n",
" '\"In the context of the patient\\'s work up, she had Neuropsychological testing 5/24/91. She was referred because of her cognitive decline and symptoms of motor dysfunction to help characterize her current functioning. :',\n",
" '6/13/92 CPT Code: 90791: No medical services',\n",
" '7/11/71 SOS-10 Total Score:',\n",
" '12/26/86 CPT Code: 90791: No medical services',\n",
" '10/11/1987 CPT Code: 90791: No medical services',\n",
" '3/14/95 Primary Care Doctor:',\n",
" '12/01/73 CPT Code: 90791: No medical services',\n",
" '0: 12/5/2010 Audit C Score Current:',\n",
" '08/20/1982 SOS-10 Total Score:',\n",
" '7/24/95 SOS-10 Total Score:',\n",
" '8/06/83 CPT Code: 90791: No medical services',\n",
" '02/22/92 SOS-10 Total Score:',\n",
" '6/28/87 Impression Strengths/Abilities:',\n",
" '07/29/1994 CPT code: 99203',\n",
" '08/11/78 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
" '10/29/91 Communication with referring physician?: Done',\n",
" '7/6/91 SOS-10 Total Score:',\n",
" 'onone as of 1/21/87 Protective Factors:',\n",
" '24 yo right handed woman with history of large right frontal mass s/p resection 11/3/1985 who had recent urgent R cranial wound revision and placement of L EVD for declining vision and increased drainage from craniotomy incision site and possible infection. She has a hx of secondary mania related to psychosis and manipulation of her right frontal lobe.',\n",
" '7/04/82 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
" '4-13-89 Communication with referring physician?: Not Done',\n",
" 'Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox + fluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical Review of Systems Constitutional:',\n",
" '4/12/74= 1Caffeine / Tobacco Use Caffeinated products: Yes',\n",
" '09/19/81 CPT Code: 90792: With medical services',\n",
" '9/6/79 Primary Care Doctor:',\n",
" '12/5/87 Total time of visit (in minutes):',\n",
" '01/05/1999 [report_end]',\n",
" '4/22/80 SOS-10 Total Score:',\n",
" '10/04/98 SOS-10 Total Score:',\n",
" '.Mr. Echeverria described having panic-like experiences on at least three occasions. The first of these episodes occurred when facial meshing was placed on him during radiation therapy. He recalled the experience vividly and described \"freaking out\", involving a sensation of crushing in his chest, difficulty breathing, a fear that he could die, and intense emotional disquiet. The second episode occurred during the last 10-15 minutes of an MRI scan. Mr. Echeverria noted feeling surprised and fearful when the feelings of panic started. Similar physiological and psychological symptoms were reported as above. The third episode occurred recently on 6/29/81. Mr.Echeverria was driving his car and felt similar symptoms as described above. He reported fearing that he would crash his vehicle and subsequently pulled over and abandoned his vehicle.',\n",
" '. Wile taking additional history, pt endorsed moderate depression and anxiety last 2 weeks, even though her period was on 8/04/78. Pt then acknowledged low to moderate mood and anxiety symptoms throughout the month, with premenstrual worsening. This mood pattern seems to have existed for many years, while premenstrual worsening increased in severity during last ~ 2 years, in context of likely peri menopause transition. Pt reported average of 3 hot flashes per day and 2 night sweats.',\n",
" 'Death of mother; 7/07/1974 Meaningful activities/supports:',\n",
" '09/14/2000 CPT Code: 90792: With medical services',\n",
" '5/18/71 Total time of visit (in minutes):',\n",
" '8/09/1981 Communication with referring physician?: Done',\n",
" '6/05/93 CPT Code: 90791: No medical services',\n",
" ')Dilantin (PHENYTOIN) 100 MG CAPSULE Take 2 Capsule(s) PO HS; No Change (Taking Differently), Comments: decreased from 290 daily to 260 mg daily due to elevated phenytoin level from 8/9/97',\n",
" '12/8/82 Audit C=3Audit C Score Current:',\n",
" '8/26/89 CPT Code: 90791: No medical services',\n",
" '10/13/95 CPT Code: 90791: No medical services',\n",
" '4/19/91 Communication with referring physician?: Not Done',\n",
" '.APS - Psychiatry consult paged/requested in person at 04/08/2004 16:39 Patrick, Christian [hpp2]',\n",
" '9/20/76 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
" '12/08/1990 @11 am [report_end]',\n",
" '4/11/1974 Chief Complaint / HPI Chief Complaint (Patients own words)',\n",
" '7/18/86 SOS-10 Total Score:',\n",
" '3/31/91 Communication with referring physician?: Done',\n",
" '5/13/72 Other Adult Mental Health Outcomes Scales Used:',\n",
" '011/14/83 Audit C Score Current:',\n",
" '8/16/92 SOS-10 Total Score:',\n",
" '10/05/97 CPT Code: 90791: No medical services',\n",
" '07/18/2002 CPT Code: 90792: With medical services',\n",
" '9/22/82 Total time of visit (in minutes):',\n",
" '2/24/74 SOS-10 Total Score:',\n",
" '(2/03/78) TSH-0.90 Activities of Daily Living (ADL) Bathing: Independent',\n",
" '2/11/2006 CPT Code: 90791: No medical services',\n",
" \"Pt is a 21 year old, single, heterosexual identified, Liechtenstein male. He resides in Rabat, in a rented apartment, with a roommate. He has not graduated from high school nor has he earned his GED. He has been working full time, steadily, since the age of 16. He recently left a job at jacobs engineering group where he worked for about 5 years. He started working at becton dickinson a few weeks ago. Referral to SMH was precipitated by pt seeing his PCP, Dr. Ford, for a physical and sharing concerns re: depression. He was having suicidal thoughts the week of 8/22/83 and noticed that his depression and anxiety symptoms were starting to interfere with work. Per Dr. Ford's recommendation, he contacted PAL for assistance. When screened for substance use, pt was referred to SMH for an intake. Pt was scheduled to be seen last week due to urgency of his depression and SI, but missed the intake appointment due to getting extremely intoxicated the night before. Pt presents today seeking individual therapy and psychiatry services to address longstanding depression.\",\n",
" 'PET Scan (DPSH 5/04/74): 1) Marked hypometabolism involving the bilateral caudate nuclei. This is a non-specific finding, although this raises the question of a multisystem atrophy, specifically MSA-P (striatonigral degeneration), 2) Cortical hypometabolism involving the right inferior frontal gyrus. This is a non-specific finding of uncertain significance, and is not definitively related to a neurodegenerative process, although pathology in this region is not excluded',\n",
" '7/20/2011 [report_end]',\n",
" '6/17/95 Total time of visit (in minutes):',\n",
" '6/10/72 SOS-10 Total Score:',\n",
" 'nPt denied use to me but endorsed use on 10/16/82 as part of BVH initial visit summary supplement by Nicholas BenjaminOpiates: Yes',\n",
" '12/15/92 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
" '12/8/97 SOS-10 Total Score:',\n",
" '4/05/89 Primary Care Doctor:',\n",
" '12/04/87 SOS-10 Total Score:',\n",
" '4 (6/20/77)Audit C Score Current:',\n",
" 'see 4/27/2006 consult note Dr. GuevaraWhat factors in prior treatment were helpful/not helpful:',\n",
" '07/17/92 CPT Code: 90791: No medical services',\n",
" '12/22/98 CPT Code: 90801 - Psychiatric Diagnosis Interview',\n",
" '10/02/96 Age:',\n",
" '11/05/90 CPT Code: 90792: With medical services',\n",
" '5/04/77 CPT Code: 90792: With medical services',\n",
" '2/27/96 Communication with referring physician?: Done']"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[x for x in datetxt_df if re.search(r'\\d{1,2}[/|-]\\d{1,2}[/|-]\\d{2,4}',x)]\n",
"\n",
"# search() returns the first match anywhere in the string, while findall() returns all\n",
"# (non-overlapping) matches; for simply testing whether a date occurs, either suffices.\n",
"# Note that re.match() would NOT suffice here: it only matches at the beginning of the\n",
"# string, so entries such as 'sshe plans to move as of 7/8/71 ...' would be missed.\n",
"# Also note that | is literal inside a character class, so [/|-] matches '/', '|' or '-';\n",
"# the intended '/ or -' separator is written more cleanly as [/-]."
]
},
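{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# A minimal sketch of the match/search/findall differences on a toy string\n",
"# (hypothetical example, not taken from the dataset):\n",
"s = 'seen on 7/8/71 and again on 9/27/75'\n",
"print(re.match(r'\\d{1,2}[/|-]\\d{1,2}[/|-]\\d{2,4}', s))    # None: match() anchors at the start\n",
"print(re.search(r'\\d{1,2}[/|-]\\d{1,2}[/|-]\\d{2,4}', s))   # first occurrence only\n",
"print(re.findall(r'\\d{1,2}[/|-]\\d{1,2}[/|-]\\d{2,4}', s))  # all non-overlapping matches"
]
},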
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### pandas.Series.str.extract & extractall\n",
"\n",
"Extract returns the capture groups of the regex pattern as columns of a DataFrame; extractall additionally returns every (non-overlapping) match per row."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 03/25/93 | \n",
"
\n",
" \n",
" 1 | \n",
" 6/18/85 | \n",
"
\n",
" \n",
" 2 | \n",
" 7/8/71 | \n",
"
\n",
" \n",
" 3 | \n",
" 9/27/75 | \n",
"
\n",
" \n",
" 4 | \n",
" 2/6/96 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 495 | \n",
" NaN | \n",
"
\n",
" \n",
" 496 | \n",
" NaN | \n",
"
\n",
" \n",
" 497 | \n",
" NaN | \n",
"
\n",
" \n",
" 498 | \n",
" NaN | \n",
"
\n",
" \n",
" 499 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
500 rows × 1 columns
\n",
"
"
],
"text/plain": [
" 0\n",
"0 03/25/93\n",
"1 6/18/85\n",
"2 7/8/71\n",
"3 9/27/75\n",
"4 2/6/96\n",
".. ...\n",
"495 NaN\n",
"496 NaN\n",
"497 NaN\n",
"498 NaN\n",
"499 NaN\n",
"\n",
"[500 rows x 1 columns]"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Extract: grouping is done using (); only the first match per row is returned\n",
"datetxt_df.str.extract(r'(\\d{1,2}[/|-]\\d{1,2}[/|-]\\d{2,4})')"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" 0 | \n",
"
\n",
" \n",
" | \n",
" match | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 03/25/93 | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 6/18/85 | \n",
"
\n",
" \n",
" 2 | \n",
" 0 | \n",
" 7/8/71 | \n",
"
\n",
" \n",
" 3 | \n",
" 0 | \n",
" 9/27/75 | \n",
"
\n",
" \n",
" 4 | \n",
" 0 | \n",
" 2/6/96 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 120 | \n",
" 0 | \n",
" 12/22/98 | \n",
"
\n",
" \n",
" 121 | \n",
" 0 | \n",
" 10/02/96 | \n",
"
\n",
" \n",
" 122 | \n",
" 0 | \n",
" 11/05/90 | \n",
"
\n",
" \n",
" 123 | \n",
" 0 | \n",
" 5/04/77 | \n",
"
\n",
" \n",
" 124 | \n",
" 0 | \n",
" 2/27/96 | \n",
"
\n",
" \n",
"
\n",
"
126 rows × 1 columns
\n",
"
"
],
"text/plain": [
" 0\n",
" match \n",
"0 0 03/25/93\n",
"1 0 6/18/85\n",
"2 0 7/8/71\n",
"3 0 9/27/75\n",
"4 0 2/6/96\n",
"... ...\n",
"120 0 12/22/98\n",
"121 0 10/02/96\n",
"122 0 11/05/90\n",
"123 0 5/04/77\n",
"124 0 2/27/96\n",
"\n",
"[126 rows x 1 columns]"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Extract all: returns every (non-overlapping) match, with an extra 'match' index level\n",
"datetxt_df.str.extractall(r'(\\d{1,2}[/|-]\\d{1,2}[/|-]\\d{2,4})')"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" | \n",
" match | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 03 | \n",
" 25 | \n",
" 93 | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 6 | \n",
" 18 | \n",
" 85 | \n",
"
\n",
" \n",
" 2 | \n",
" 0 | \n",
" 7 | \n",
" 8 | \n",
" 71 | \n",
"
\n",
" \n",
" 3 | \n",
" 0 | \n",
" 9 | \n",
" 27 | \n",
" 75 | \n",
"
\n",
" \n",
" 4 | \n",
" 0 | \n",
" 2 | \n",
" 6 | \n",
" 96 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 120 | \n",
" 0 | \n",
" 12 | \n",
" 22 | \n",
" 98 | \n",
"
\n",
" \n",
" 121 | \n",
" 0 | \n",
" 10 | \n",
" 02 | \n",
" 96 | \n",
"
\n",
" \n",
" 122 | \n",
" 0 | \n",
" 11 | \n",
" 05 | \n",
" 90 | \n",
"
\n",
" \n",
" 123 | \n",
" 0 | \n",
" 5 | \n",
" 04 | \n",
" 77 | \n",
"
\n",
" \n",
" 124 | \n",
" 0 | \n",
" 2 | \n",
" 27 | \n",
" 96 | \n",
"
\n",
" \n",
"
\n",
"
126 rows × 3 columns
\n",
"
"
],
"text/plain": [
" 0 1 2\n",
" match \n",
"0 0 03 25 93\n",
"1 0 6 18 85\n",
"2 0 7 8 71\n",
"3 0 9 27 75\n",
"4 0 2 6 96\n",
"... .. .. ..\n",
"120 0 12 22 98\n",
"121 0 10 02 96\n",
"122 0 11 05 90\n",
"123 0 5 04 77\n",
"124 0 2 27 96\n",
"\n",
"[126 rows x 3 columns]"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Further grouping with (): one output column per capture group\n",
"datetxt_df.str.extractall(r'(\\d{1,2})[/|-](\\d{1,2})[/|-](\\d{2,4})')"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" | \n",
" match | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 03/25/93 | \n",
" 03 | \n",
" 25/93 | \n",
" 25 | \n",
" 93 | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 6/18/85 | \n",
" 6 | \n",
" 18/85 | \n",
" 18 | \n",
" 85 | \n",
"
\n",
" \n",
" 2 | \n",
" 0 | \n",
" 7/8/71 | \n",
" 7 | \n",
" 8/71 | \n",
" 8 | \n",
" 71 | \n",
"
\n",
" \n",
" 3 | \n",
" 0 | \n",
" 9/27/75 | \n",
" 9 | \n",
" 27/75 | \n",
" 27 | \n",
" 75 | \n",
"
\n",
" \n",
" 4 | \n",
" 0 | \n",
" 2/6/96 | \n",
" 2 | \n",
" 6/96 | \n",
" 6 | \n",
" 96 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 120 | \n",
" 0 | \n",
" 12/22/98 | \n",
" 12 | \n",
" 22/98 | \n",
" 22 | \n",
" 98 | \n",
"
\n",
" \n",
" 121 | \n",
" 0 | \n",
" 10/02/96 | \n",
" 10 | \n",
" 02/96 | \n",
" 02 | \n",
" 96 | \n",
"
\n",
" \n",
" 122 | \n",
" 0 | \n",
" 11/05/90 | \n",
" 11 | \n",
" 05/90 | \n",
" 05 | \n",
" 90 | \n",
"
\n",
" \n",
" 123 | \n",
" 0 | \n",
" 5/04/77 | \n",
" 5 | \n",
" 04/77 | \n",
" 04 | \n",
" 77 | \n",
"
\n",
" \n",
" 124 | \n",
" 0 | \n",
" 2/27/96 | \n",
" 2 | \n",
" 27/96 | \n",
" 27 | \n",
" 96 | \n",
"
\n",
" \n",
"
\n",
"
126 rows × 5 columns
\n",
"
"
],
"text/plain": [
" 0 1 2 3 4\n",
" match \n",
"0 0 03/25/93 03 25/93 25 93\n",
"1 0 6/18/85 6 18/85 18 85\n",
"2 0 7/8/71 7 8/71 8 71\n",
"3 0 9/27/75 9 27/75 27 75\n",
"4 0 2/6/96 2 6/96 6 96\n",
"... ... .. ... .. ..\n",
"120 0 12/22/98 12 22/98 22 98\n",
"121 0 10/02/96 10 02/96 02 96\n",
"122 0 11/05/90 11 05/90 05 90\n",
"123 0 5/04/77 5 04/77 04 77\n",
"124 0 2/27/96 2 27/96 27 96\n",
"\n",
"[126 rows x 5 columns]"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Grouping can be carried out in a nested fashion!\n",
"datetxt_df.str.extractall(r'((\\d{1,2})[/|-]((\\d{1,2})[/|-](\\d{2,4})))')"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" Month | \n",
" Day | \n",
" Year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 03/25/93 | \n",
" 03 | \n",
" 25 | \n",
" 93 | \n",
"
\n",
" \n",
" 1 | \n",
" 6/18/85 | \n",
" 6 | \n",
" 18 | \n",
" 85 | \n",
"
\n",
" \n",
" 2 | \n",
" 7/8/71 | \n",
" 7 | \n",
" 8 | \n",
" 71 | \n",
"
\n",
" \n",
" 3 | \n",
" 9/27/75 | \n",
" 9 | \n",
" 27 | \n",
" 75 | \n",
"
\n",
" \n",
" 4 | \n",
" 2/6/96 | \n",
" 2 | \n",
" 6 | \n",
" 96 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 495 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 496 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 497 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 498 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 499 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
500 rows × 4 columns
\n",
"
"
],
"text/plain": [
" Date Month Day Year\n",
"0 03/25/93 03 25 93\n",
"1 6/18/85 6 18 85\n",
"2 7/8/71 7 8 71\n",
"3 9/27/75 9 27 75\n",
"4 2/6/96 2 6 96\n",
".. ... ... ... ...\n",
"495 NaN NaN NaN NaN\n",
"496 NaN NaN NaN NaN\n",
"497 NaN NaN NaN NaN\n",
"498 NaN NaN NaN NaN\n",
"499 NaN NaN NaN NaN\n",
"\n",
"[500 rows x 4 columns]"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Named groups: ?P<name>\n",
"datetxt_df.str.extract(r'(?P<Date>(?P<Month>\\d{1,2})[/|-](?P<Day>\\d{1,2})[/|-](?P<Year>\\d{2,4}))')"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Original_txt | \n",
" Date | \n",
" Month | \n",
" Day | \n",
" Year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 03/25/93 Total time of visit (in minutes): | \n",
" 03/25/93 | \n",
" 03 | \n",
" 25 | \n",
" 93 | \n",
"
\n",
" \n",
" 1 | \n",
" 6/18/85 Primary Care Doctor: | \n",
" 6/18/85 | \n",
" 6 | \n",
" 18 | \n",
" 85 | \n",
"
\n",
" \n",
" 2 | \n",
" sshe plans to move as of 7/8/71 In-Home Servic... | \n",
" 7/8/71 | \n",
" 7 | \n",
" 8 | \n",
" 71 | \n",
"
\n",
" \n",
" 3 | \n",
" 7 on 9/27/75 Audit C Score Current: | \n",
" 9/27/75 | \n",
" 9 | \n",
" 27 | \n",
" 75 | \n",
"
\n",
" \n",
" 4 | \n",
" 2/6/96 sleep studyPain Treatment Pain Level (N... | \n",
" 2/6/96 | \n",
" 2 | \n",
" 6 | \n",
" 96 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 495 | \n",
" 1979 Family Psych History: Family History of S... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 496 | \n",
" therapist and friend died in ~2006 Parental/Ca... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 497 | \n",
" 2008 partial thyroidectomy | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 498 | \n",
" sPt describes a history of sexual abuse as a c... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 499 | \n",
" . In 1980, patient was living in Naples and de... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
500 rows × 5 columns
\n",
"
"
],
"text/plain": [
" Original_txt Date Month Day \\\n",
"0 03/25/93 Total time of visit (in minutes): 03/25/93 03 25 \n",
"1 6/18/85 Primary Care Doctor: 6/18/85 6 18 \n",
"2 sshe plans to move as of 7/8/71 In-Home Servic... 7/8/71 7 8 \n",
"3 7 on 9/27/75 Audit C Score Current: 9/27/75 9 27 \n",
"4 2/6/96 sleep studyPain Treatment Pain Level (N... 2/6/96 2 6 \n",
".. ... ... ... ... \n",
"495 1979 Family Psych History: Family History of S... NaN NaN NaN \n",
"496 therapist and friend died in ~2006 Parental/Ca... NaN NaN NaN \n",
"497 2008 partial thyroidectomy NaN NaN NaN \n",
"498 sPt describes a history of sexual abuse as a c... NaN NaN NaN \n",
"499 . In 1980, patient was living in Naples and de... NaN NaN NaN \n",
"\n",
" Year \n",
"0 93 \n",
"1 85 \n",
"2 71 \n",
"3 75 \n",
"4 96 \n",
".. ... \n",
"495 NaN \n",
"496 NaN \n",
"497 NaN \n",
"498 NaN \n",
"499 NaN \n",
"\n",
"[500 rows x 5 columns]"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# It can come in handy to add derived columns, e.g. with the .join method for joining DataFrames.\n",
"# Since datetxt_df is a pandas Series, we first need to convert it into a DataFrame.\n",
"datetxt_df.to_frame(name=\"Original_txt\")\\\n",
".join(datetxt_df.str.extract\\\n",
"(r'(?P<Date>(?P<Month>\\d{1,2})[/|-](?P<Day>\\d{1,2})[/|-](?P<Year>\\d{2,4}))'))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Putting a few of the pieces together\n",
"- Prelude to Ungraded task 2.2\n",
" - A bit of data cleaning \n",
" - A bit of missing data handling\n",
" - Doing so with RegEx "
]
},
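{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# A minimal sketch (reusing datetxt_df from above): rows where the regex finds no date\n",
"# come back as NaN, so the usual missing-data tools apply directly.\n",
"extracted = datetxt_df.str.extract(r'(\\d{1,2}[/|-]\\d{1,2}[/|-]\\d{2,4})')\n",
"extracted[0].isna().sum()  # number of notes without a '/'- or '-'-formatted date"
]
},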
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Country | \n",
" Region | \n",
" Population | \n",
" Area (sq. mi.) | \n",
" Pop. Density (per sq. mi.) | \n",
" Coastline (coast/area ratio) | \n",
" Net migration | \n",
" Infant mortality (per 1000 births) | \n",
" GDP ($ per capita) | \n",
" Literacy (%) | \n",
" Phones (per 1000) | \n",
" Arable (%) | \n",
" Crops (%) | \n",
" Other (%) | \n",
" Climate | \n",
" Birthrate | \n",
" Deathrate | \n",
" Agriculture | \n",
" Industry | \n",
" Service | \n",
"
\n",
" \n",
" \n",
" \n",
" 221 | \n",
" Wallis and Futuna | \n",
" OCEANIA | \n",
" 16025 | \n",
" 274 | \n",
" 58,5 | \n",
" 47,08 | \n",
" NaN | \n",
" NaN | \n",
" 3700.0 | \n",
" 50,0 | \n",
" 118,6 | \n",
" 5 | \n",
" 25 | \n",
" 70 | \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 222 | \n",
" West Bank | \n",
" NEAR EAST | \n",
" 2460492 | \n",
" 5860 | \n",
" 419,9 | \n",
" 0,00 | \n",
" 2,98 | \n",
" 19,62 | \n",
" 800.0 | \n",
" NaN | \n",
" 145,2 | \n",
" 16,9 | \n",
" 18,97 | \n",
" 64,13 | \n",
" 3 | \n",
" 31,67 | \n",
" 3,92 | \n",
" 0,09 | \n",
" 0,28 | \n",
" 0,63 | \n",
"
\n",
" \n",
" 223 | \n",
" Western Sahara | \n",
" NORTHERN AFRICA | \n",
" 273008 | \n",
" 266000 | \n",
" 1,0 | \n",
" 0,42 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 0,02 | \n",
" 0 | \n",
" 99,98 | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 0,4 | \n",
"
\n",
" \n",
" 224 | \n",
" Yemen | \n",
" NEAR EAST | \n",
" 21456188 | \n",
" 527970 | \n",
" 40,6 | \n",
" 0,36 | \n",
" 0 | \n",
" 61,5 | \n",
" 800.0 | \n",
" 50,2 | \n",
" 37,2 | \n",
" 2,78 | \n",
" 0,24 | \n",
" 96,98 | \n",
" 1 | \n",
" 42,89 | \n",
" 8,3 | \n",
" 0,135 | \n",
" 0,472 | \n",
" 0,393 | \n",
"
\n",
" \n",
" 225 | \n",
" Zambia | \n",
" SUB-SAHARAN AFRICA | \n",
" 11502010 | \n",
" 752614 | \n",
" 15,3 | \n",
" 0,00 | \n",
" 0 | \n",
" 88,29 | \n",
" 800.0 | \n",
" 80,6 | \n",
" 8,2 | \n",
" 7,08 | \n",
" 0,03 | \n",
" 92,9 | \n",
" 2 | \n",
" 41 | \n",
" 19,93 | \n",
" 0,22 | \n",
" 0,29 | \n",
" 0,489 | \n",
"
\n",
" \n",
" 226 | \n",
" Zimbabwe | \n",
" SUB-SAHARAN AFRICA | \n",
" 12236805 | \n",
" 390580 | \n",
" 31,3 | \n",
" 0,00 | \n",
" 0 | \n",
" 67,69 | \n",
" 1900.0 | \n",
" 90,7 | \n",
" 26,8 | \n",
" 8,32 | \n",
" 0,34 | \n",
" 91,34 | \n",
" 2 | \n",
" 28,01 | \n",
" 21,84 | \n",
" 0,179 | \n",
" 0,243 | \n",
" 0,579 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Country Region Population \\\n",
"221 Wallis and Futuna OCEANIA 16025 \n",
"222 West Bank NEAR EAST 2460492 \n",
"223 Western Sahara NORTHERN AFRICA 273008 \n",
"224 Yemen NEAR EAST 21456188 \n",
"225 Zambia SUB-SAHARAN AFRICA 11502010 \n",
"226 Zimbabwe SUB-SAHARAN AFRICA 12236805 \n",
"\n",
" Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio) \\\n",
"221 274 58,5 47,08 \n",
"222 5860 419,9 0,00 \n",
"223 266000 1,0 0,42 \n",
"224 527970 40,6 0,36 \n",
"225 752614 15,3 0,00 \n",
"226 390580 31,3 0,00 \n",
"\n",
" Net migration Infant mortality (per 1000 births) GDP ($ per capita) \\\n",
"221 NaN NaN 3700.0 \n",
"222 2,98 19,62 800.0 \n",
"223 NaN NaN NaN \n",
"224 0 61,5 800.0 \n",
"225 0 88,29 800.0 \n",
"226 0 67,69 1900.0 \n",
"\n",
" Literacy (%) Phones (per 1000) Arable (%) Crops (%) Other (%) Climate \\\n",
"221 50,0 118,6 5 25 70 2 \n",
"222 NaN 145,2 16,9 18,97 64,13 3 \n",
"223 NaN NaN 0,02 0 99,98 1 \n",
"224 50,2 37,2 2,78 0,24 96,98 1 \n",
"225 80,6 8,2 7,08 0,03 92,9 2 \n",
"226 90,7 26,8 8,32 0,34 91,34 2 \n",
"\n",
" Birthrate Deathrate Agriculture Industry Service \n",
"221 NaN NaN NaN NaN NaN \n",
"222 31,67 3,92 0,09 0,28 0,63 \n",
"223 NaN NaN NaN NaN 0,4 \n",
"224 42,89 8,3 0,135 0,472 0,393 \n",
"225 41 19,93 0,22 0,29 0,489 \n",
"226 28,01 21,84 0,179 0,243 0,579 "
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Let's go back to the CountriesOfTheWorld data.\n",
"# We previously loaded it into the CoTW_df DataFrame.\n",
"# We see lots of missing data.\n",
"# We had noticed previously that some of the numeric data is stored improperly (as string objects).\n",
"CoTW_df.tail(6)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Country object\n",
"Region object\n",
"Population int64\n",
"Area (sq. mi.) int64\n",
"Pop. Density (per sq. mi.) object\n",
"Coastline (coast/area ratio) object\n",
"Net migration object\n",
"Infant mortality (per 1000 births) object\n",
"GDP ($ per capita) float64\n",
"Literacy (%) object\n",
"Phones (per 1000) object\n",
"Arable (%) object\n",
"Crops (%) object\n",
"Other (%) object\n",
"Climate object\n",
"Birthrate object\n",
"Deathrate object\n",
"Agriculture object\n",
"Industry object\n",
"Service object\n",
"dtype: object"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Let's check the dtypes to confirm this.\n",
"CoTW_df.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
" \n",
" \n",
" | \n",
" Country | \n",
" Region | \n",
" Population | \n",
" Area (sq. mi.) | \n",
" Pop. Density (per sq. mi.) | \n",
" Coastline (coast/area ratio) | \n",
" Net migration | \n",
" Infant mortality (per 1000 births) | \n",
" GDP ($ per capita) | \n",
" Literacy (%) | \n",
" Phones (per 1000) | \n",
" Arable (%) | \n",
" Crops (%) | \n",
" Other (%) | \n",
" Climate | \n",
" Birthrate | \n",
" Deathrate | \n",
" Agriculture | \n",
" Industry | \n",
" Service | \n",
"\n",
" \n",
" \n",
" \n",
" 217 | \n",
" Vanuatu | \n",
" OCEANIA | \n",
" 208869 | \n",
" 12200 | \n",
" 17,1 | \n",
" 20,72 | \n",
" 0 | \n",
" 55,16 | \n",
" 2900.0 | \n",
" 53.0 | \n",
" 32,6 | \n",
" 2,46 | \n",
" 7,38 | \n",
" 90,16 | \n",
" 2 | \n",
" 22,72 | \n",
" 7,82 | \n",
" 0,26 | \n",
" 0,12 | \n",
" 0,62 | \n",
"\n",
" \n",
" 218 | \n",
" Venezuela | \n",
" LATIN AMER. & CARIB | \n",
" 25730435 | \n",
" 912050 | \n",
" 28,2 | \n",
" 0,31 | \n",
" -0,04 | \n",
" 22,2 | \n",
" 4800.0 | \n",
" 93.4 | \n",
" 140,1 | \n",
" 2,95 | \n",
" 0,92 | \n",
" 96,13 | \n",
" 2 | \n",
" 18,71 | \n",
" 4,92 | \n",
" 0,04 | \n",
" 0,419 | \n",
" 0,541 | \n",
"\n",
" \n",
" 219 | \n",
" Vietnam | \n",
" ASIA (EX. NEAR EAST) | \n",
" 84402966 | \n",
" 329560 | \n",
" 256,1 | \n",
" 1,05 | \n",
" -0,45 | \n",
" 25,95 | \n",
" 2500.0 | \n",
" 90.3 | \n",
" 187,7 | \n",
" 19,97 | \n",
" 5,95 | \n",
" 74,08 | \n",
" 2 | \n",
" 16,86 | \n",
" 6,22 | \n",
" 0,209 | \n",
" 0,41 | \n",
" 0,381 | \n",
"\n",
" \n",
" 220 | \n",
" Virgin Islands | \n",
" LATIN AMER. & CARIB | \n",
" 108605 | \n",
" 1910 | \n",
" 56,9 | \n",
" 9,84 | \n",
" -8,94 | \n",
" 8,03 | \n",
" 17200.0 | \n",
" NaN | \n",
" 652,8 | \n",
" 11,76 | \n",
" 2,94 | \n",
" 85,3 | \n",
" 2 | \n",
" 13,96 | \n",
" 6,43 | \n",
" 0,01 | \n",
" 0,19 | \n",
" 0,8 | \n",
"\n",
" \n",
" 221 | \n",
" Wallis and Futuna | \n",
" OCEANIA | \n",
" 16025 | \n",
" 274 | \n",
" 58,5 | \n",
" 47,08 | \n",
" NaN | \n",
" NaN | \n",
" 3700.0 | \n",
" 50.0 | \n",
" 118,6 | \n",
" 5 | \n",
" 25 | \n",
" 70 | \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"\n",
" \n",
" 222 | \n",
" West Bank | \n",
" NEAR EAST | \n",
" 2460492 | \n",
" 5860 | \n",
" 419,9 | \n",
" 0,00 | \n",
" 2,98 | \n",
" 19,62 | \n",
" 800.0 | \n",
" NaN | \n",
" 145,2 | \n",
" 16,9 | \n",
" 18,97 | \n",
" 64,13 | \n",
" 3 | \n",
" 31,67 | \n",
" 3,92 | \n",
" 0,09 | \n",
" 0,28 | \n",
" 0,63 | \n",
"\n",
" \n",
" 223 | \n",
" Western Sahara | \n",
" NORTHERN AFRICA | \n",
" 273008 | \n",
" 266000 | \n",
" 1,0 | \n",
" 0,42 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 0,02 | \n",
" 0 | \n",
" 99,98 | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 0,4 | \n",
"\n",
" \n",
" 224 | \n",
" Yemen | \n",
" NEAR EAST | \n",
" 21456188 | \n",
" 527970 | \n",
" 40,6 | \n",
" 0,36 | \n",
" 0 | \n",
" 61,5 | \n",
" 800.0 | \n",
" 50.2 | \n",
" 37,2 | \n",
" 2,78 | \n",
" 0,24 | \n",
" 96,98 | \n",
" 1 | \n",
" 42,89 | \n",
" 8,3 | \n",
" 0,135 | \n",
" 0,472 | \n",
" 0,393 | \n",
"\n",
" \n",
" 225 | \n",
" Zambia | \n",
" SUB-SAHARAN AFRICA | \n",
" 11502010 | \n",
" 752614 | \n",
" 15,3 | \n",
" 0,00 | \n",
" 0 | \n",
" 88,29 | \n",
" 800.0 | \n",
" 80.6 | \n",
" 8,2 | \n",
" 7,08 | \n",
" 0,03 | \n",
" 92,9 | \n",
" 2 | \n",
" 41 | \n",
" 19,93 | \n",
" 0,22 | \n",
" 0,29 | \n",
" 0,489 | \n",
"\n",
" \n",
" 226 | \n",
" Zimbabwe | \n",
" SUB-SAHARAN AFRICA | \n",
" 12236805 | \n",
" 390580 | \n",
" 31,3 | \n",
" 0,00 | \n",
" 0 | \n",
" 67,69 | \n",
" 1900.0 | \n",
" 90.7 | \n",
" 26,8 | \n",
" 8,32 | \n",
" 0,34 | \n",
" 91,34 | \n",
" 2 | \n",
" 28,01 | \n",
" 21,84 | \n",
" 0,179 | \n",
" 0,243 | \n",
" 0,579 | \n",
"\n",
" \n",
"\n",
""
],
"text/plain": [
" Country Region Population \\\n",
"217 Vanuatu OCEANIA 208869 \n",
"218 Venezuela LATIN AMER. & CARIB 25730435 \n",
"219 Vietnam ASIA (EX. NEAR EAST) 84402966 \n",
"220 Virgin Islands LATIN AMER. & CARIB 108605 \n",
"221 Wallis and Futuna OCEANIA 16025 \n",
"222 West Bank NEAR EAST 2460492 \n",
"223 Western Sahara NORTHERN AFRICA 273008 \n",
"224 Yemen NEAR EAST 21456188 \n",
"225 Zambia SUB-SAHARAN AFRICA 11502010 \n",
"226 Zimbabwe SUB-SAHARAN AFRICA 12236805 \n",
"\n",
" Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio) \\\n",
"217 12200 17,1 20,72 \n",
"218 912050 28,2 0,31 \n",
"219 329560 256,1 1,05 \n",
"220 1910 56,9 9,84 \n",
"221 274 58,5 47,08 \n",
"222 5860 419,9 0,00 \n",
"223 266000 1,0 0,42 \n",
"224 527970 40,6 0,36 \n",
"225 752614 15,3 0,00 \n",
"226 390580 31,3 0,00 \n",
"\n",
" Net migration Infant mortality (per 1000 births) GDP ($ per capita) \\\n",
"217 0 55,16 2900.0 \n",
"218 -0,04 22,2 4800.0 \n",
"219 -0,45 25,95 2500.0 \n",
"220 -8,94 8,03 17200.0 \n",
"221 NaN NaN 3700.0 \n",
"222 2,98 19,62 800.0 \n",
"223 NaN NaN NaN \n",
"224 0 61,5 800.0 \n",
"225 0 88,29 800.0 \n",
"226 0 67,69 1900.0 \n",
"\n",
" Literacy (%) Phones (per 1000) Arable (%) Crops (%) Other (%) Climate \\\n",
"217 53.0 32,6 2,46 7,38 90,16 2 \n",
"218 93.4 140,1 2,95 0,92 96,13 2 \n",
"219 90.3 187,7 19,97 5,95 74,08 2 \n",
"220 NaN 652,8 11,76 2,94 85,3 2 \n",
"221 50.0 118,6 5 25 70 2 \n",
"222 NaN 145,2 16,9 18,97 64,13 3 \n",
"223 NaN NaN 0,02 0 99,98 1 \n",
"224 50.2 37,2 2,78 0,24 96,98 1 \n",
"225 80.6 8,2 7,08 0,03 92,9 2 \n",
"226 90.7 26,8 8,32 0,34 91,34 2 \n",
"\n",
" Birthrate Deathrate Agriculture Industry Service \n",
"217 22,72 7,82 0,26 0,12 0,62 \n",
"218 18,71 4,92 0,04 0,419 0,541 \n",
"219 16,86 6,22 0,209 0,41 0,381 \n",
"220 13,96 6,43 0,01 0,19 0,8 \n",
"221 NaN NaN NaN NaN NaN \n",
"222 31,67 3,92 0,09 0,28 0,63 \n",
"223 NaN NaN NaN NaN 0,4 \n",
"224 42,89 8,3 0,135 0,472 0,393 \n",
"225 41 19,93 0,22 0,29 0,489 \n",
"226 28,01 21,84 0,179 0,243 0,579 "
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Let's convert the 'Literacy (%)' entries to float type.\n",
"# We first replace the decimal ',' with '.'; after that, astype(float) works.\n",
"CoTW_df['Literacy (%)'] = CoTW_df['Literacy (%)'].str.replace(',', '.', regex=False).astype(float)\n",
"CoTW_df.tail(10)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Country object\n",
"Region object\n",
"Population int64\n",
"Area (sq. mi.) int64\n",
"Pop. Density (per sq. mi.) object\n",
"Coastline (coast/area ratio) object\n",
"Net migration object\n",
"Infant mortality (per 1000 births) object\n",
"GDP ($ per capita) float64\n",
"Literacy (%) float64\n",
"Phones (per 1000) object\n",
"Arable (%) object\n",
"Crops (%) object\n",
"Other (%) object\n",
"Climate object\n",
"Birthrate object\n",
"Deathrate object\n",
"Agriculture object\n",
"Industry object\n",
"Service object\n",
"dtype: object"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# reconfirm type change\n",
"CoTW_df.dtypes"
]
},
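{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The same comma-to-dot fix can be applied to all the remaining numeric-but-object columns in one pass. A minimal sketch, assuming ',' is the decimal separator throughout; pd.to_numeric with errors='coerce' turns anything unparseable into NaN:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# Sketch: convert the remaining numeric-looking object columns to float,\n",
"# working on a copy. Assumes ',' is the decimal separator; unparseable\n",
"# entries become NaN via errors='coerce'.\n",
"clean_df = CoTW_df.copy()\n",
"text_cols = ['Country', 'Region']\n",
"for col in clean_df.columns:\n",
"    if col not in text_cols and clean_df[col].dtype == object:\n",
"        clean_df[col] = pd.to_numeric(\n",
"            clean_df[col].str.replace(',', '.', regex=False), errors='coerce')\n",
"clean_df.dtypes"
]
},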
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Datetime objects\n",
"See https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior for the strftime/strptime format codes.\n"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"False\n"
]
}
],
"source": [
"date1='03-25-93'\n",
"date2=\"3/25/1993\"\n",
"date3=\"25 Mar 1993\"\n",
"print(date1==date3)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1993-03-25 00:00:00\n"
]
}
],
"source": [
"datetime_object1 = datetime.datetime.strptime(date1,\"%m-%d-%y\")\n",
"print(datetime_object1)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1993-03-25 00:00:00\n"
]
}
],
"source": [
"datetime_object2 = datetime.datetime.strptime(date2,\"%m/%d/%Y\")\n",
"print(datetime_object2)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1993-03-25 00:00:00\n",
"True\n"
]
}
],
"source": [
"datetime_object3 = datetime.datetime.strptime(date3,\"%d %b %Y\")\n",
"print(datetime_object3)\n",
"print(datetime_object1==datetime_object3)"
]
},
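{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"pandas offers the same parsing at the Series level. A small sketch using the sample strings defined above: applying pd.to_datetime element-wise lets each heterogeneous format be inferred separately, and errors='coerce' yields NaT for unparseable entries:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# Parse heterogeneous date strings element-wise with pd.to_datetime;\n",
"# unparseable entries become NaT instead of raising an error.\n",
"dates = pd.Series([date1, date2, date3])\n",
"parsed = dates.apply(lambda d: pd.to_datetime(d, errors='coerce'))\n",
"print(parsed)"
]
},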
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Ungraded task 2.2: Clean the CoTW_df DataFrame\n",
"\n",
"Clean the data in the CoTW_df DataFrame as follows:\n",
"\n",
"1. Create rule-based methods (e.g., using regular expressions) to rename the columns as follows:\n",
"    * Remove everything in brackets, e.g., replace \"Coastline (coast/area ratio)\" with \"Coastline\".\n",
"    * If multiple words remain in the title, join them with underscores; as such, \"Infant mortality (per 1000 births)\" becomes \"Infant_mortality\".\n",
"\n",
"2. Convert all the numeric values that have been stored as strings into floats.\n",
"\n",
"3. Replace all the missing numeric values (NaN) with the mean value available for the corresponding \"Region\". E.g., Wallis and Futuna is in the OCEANIA region, so replace its missing Net_migration information with the mean Net_migration computed from the values available for the other countries in OCEANIA.\n",
"    * Discuss alternatives for dealing with the NaN values. These alternatives may depend on the type of column and/or the rows involved."
]
},
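{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"For step 3, one convenient pandas idiom is a group-wise transform. A minimal sketch for a single column; the Literacy (%) column was converted to float earlier, and the other columns are assumed to have been converted first:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# Sketch of step 3: fill NaNs in a numeric column with its Region's mean.\n",
"# Assumes the column has already been converted to float.\n",
"region_mean = CoTW_df.groupby('Region')['Literacy (%)'].transform('mean')\n",
"CoTW_df['Literacy (%)'] = CoTW_df['Literacy (%)'].fillna(region_mean)\n",
"CoTW_df['Literacy (%)'].isna().sum()"
]
},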
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Ungraded task 2.3: Find all the dates in the datetxt_df Series\n",
"\n",
"- Identify the first instance of a date in each line of the data Series.\n",
"    - If only month and year information is available, assume the day of the month is the 1st.\n",
"    - If only year information is available, consider the month to be January.\n",
"- Create a DataFrame (as shown below) with two columns, the Position of the entry in the original Series and the Date identified, sorted in increasing order of date.\n",
"\n",
"| | Position | Date |\n",
"|----:|-----------:|:-----------|\n",
"| 0 | 9 | 10/4/1971 |\n",
"| 1 | 84 | 18/5/1971 |\n",
"| 2 | 2 | 8/7/1971 |\n",
"| ... | ... | ... |\n",
"| ... | ... | ... |\n",
"| 498 | 161 | 19/10/2016 |\n",
"| 499 | 413 | 1/11/2016 |\n",
"\n",
"\n",
"- Discuss: Assuming that you have access to a correct/ideal solution (e.g., the one provided), and you obtain some solution based on your processing, how would you quantitatively compare the quality of the result you obtained with respect to the ideal solution? If you consider several metrics, discuss the pros and cons of each. Implement the corresponding metric and quantify the quality of the result you obtain.\n",
"- A tentative solution is provided in an accompanying file named sorted-dates.csv.\n",
"    - Caveat: The solution provided may be faulty. If you find a better solution, please share it with me. Please also let me know how you verified the inaccuracies of the tentative solution."
]
},
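{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"As a starting point for the regex side of this task, a minimal sketch that finds the first dd/mm/yyyy-style date in a line. The sample line below is hypothetical; the real data will need several alternative patterns (month names, year-only entries, etc.):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# Minimal sketch: find the first dd/mm/yyyy-style date in a line of text.\n",
"# The sample line is hypothetical; real data needs more alternative patterns.\n",
"pattern = r'\\b(\\d{1,2})/(\\d{1,2})/(\\d{4})\\b'\n",
"line = 'Follow-up scheduled on 19/10/2016 at the clinic.'\n",
"match = re.search(pattern, line)\n",
"if match:\n",
"    day, month, year = match.groups()\n",
"    print(day, month, year)"
]
},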
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"That's it folks!\n",
"\n",
""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}