{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "# A little bit of SQL(ite) \n", "### and very little Pandas\n", "

\n", "
\n", "

SC 4125: Developing Data Products

\n", "

Module-1: Introduction


\n", "\n", "
\n", "by Anwitaman DATTA
\n", "School of Computer Science and Engineering, NTU Singapore. \n", "
\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Teaching material\n", "- .pdf deck of slides (complements the html slides)\n", "- .html deck of slides\n", "- .ipynb Jupyter notebook" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " This is companion material for Module-1: Introduction lecture for the Developing Data Products course.
\n", " \n", "The executable source .ipynb notebook file for these HTML slides, and the CountriesDB.db file that goes with it can be found through the links. \n", "\n", "\"SQLLite" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### A very non-systematic review/recap\n", "\n", "You are expected to have already taken introductory courses \n", "- SC 1015: Introduction to Data Science & AI\n", "- SC 2207: Introduction to Database Systems" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Some useful references for beginners/catching-up:\n", "\n", "https://www.sqlitetutorial.net/\n", "\n", "https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### SQLite Browser\n", "\n", "https://sqlitebrowser.org/\n", "\n", "\"SQLLite" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "This notebook is accompanied with CountriesDB.db file, which comprises of three tables. The tables correspond to the Countries of the world and gapminder data obtained from Kaggle, and the Olympics 2016 medals data obtained from Wikipedia. \n", "\n", "You may want to install and use the SQLite Browser to check out the data in the .db file (as shown in the previous panel). You may also want to check out the source files indicated above. \n", "\n", "It is easy to import the .csv (and other similar formats, such as .tsv) directly using the graphical interface of SQLite Browser. Nevertheless, we will next explore how to do so, and other actions such as manipulate and query the data from the Jupyter programming environment. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Import modules\n", "\n", "Let's start with importing necessary modules. We will also check the environment/version/etc." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python version: 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]\n", "Numpy version: 1.19.2\n", "Pandas version: 1.1.3\n", "SQLite version: 2.6.0\n" ] } ], "source": [ "# Housekeeping: We will first carry out the imports and print version numbers\n", "import sys\n", "print(\"Python version: \" +str(sys.version))\n", "import numpy as np\n", "import pandas as pd\n", "print(\"Numpy version: \" +str(np.version.version))\n", "print(\"Pandas version: \" +str(pd.__version__))\n", "\n", "#import xml.etree.ElementTree as ET\n", "import sqlite3\n", "print(\"SQLite version: \" + str(sqlite3.version))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "jupyter core : 4.6.3\n", "jupyter-notebook : 6.1.4\n", "qtconsole : 4.7.7\n", "ipython : 7.19.0\n", "ipykernel : 5.3.4\n", "jupyter client : 6.1.7\n", "jupyter lab : 3.0.16\n", "nbconvert : 6.0.7\n", "ipywidgets : 7.5.1\n", "nbformat : 5.0.8\n", "traitlets : 5.0.5\n" ] } ], "source": [ "! jupyter --version" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " country continent year lifeExp pop gdpPercap\n", "0 Afghanistan Asia 1952 28.801 8425333 779.445314\n", "1 Afghanistan Asia 1957 30.332 9240934 820.853030\n", "2 Afghanistan Asia 1962 31.997 10267083 853.100710\n", "3 Afghanistan Asia 1967 34.020 11537966 836.197138\n", "4 Afghanistan Asia 1972 36.088 13079460 739.981106\n", "... ... ... ... ... ... ...\n", "1699 Zimbabwe Africa 1987 62.351 9216418 706.157306\n", "1700 Zimbabwe Africa 1992 60.377 10704340 693.420786\n", "1701 Zimbabwe Africa 1997 46.809 11404948 792.449960\n", "1702 Zimbabwe Africa 2002 39.989 11926563 672.038623\n", "1703 Zimbabwe Africa 2007 43.487 12311143 469.709298\n", "\n", "[1704 rows x 6 columns]\n", "['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap']\n", " country continent year lifeExp pop gdpPercap\n", "0 Afghanistan Asia 1952 28.801 8425333 779.445314\n", "1 Afghanistan Asia 1957 30.332 9240934 820.853030\n", "2 Afghanistan Asia 1962 31.997 10267083 853.100710\n", "3 Afghanistan Asia 1967 34.020 11537966 836.197138\n", "4 Afghanistan Asia 1972 36.088 13079460 739.981106\n", "5 Afghanistan Asia 1977 38.438 14880372 786.113360\n", "6 Afghanistan Asia 1982 39.854 12881816 978.011439\n", "7 Afghanistan Asia 1987 40.822 13867957 852.395945\n", "8 Afghanistan Asia 1992 41.674 16317921 649.341395\n", "9 Afghanistan Asia 1997 41.763 22227415 635.341351\n", "10 Afghanistan Asia 2002 42.129 25268405 726.734055\n", "11 Afghanistan Asia 2007 43.828 31889923 974.580338\n" ] } ], "source": [ "# load data from .tsv file (as an alternate to loading from the DB file)\n", "\n", "gapminderdatapath ='data/gapminder/' # change this to adjust relative path\n", "gap_df = pd.read_csv(gapminderdatapath+'gapminder.tsv', sep='\\t')\n", "print(gap_df)\n", "print(gap_df.columns.to_list())\n", "print(gap_df[gap_df['country']=='Afghanistan'])\n", "country_names_series=gap_df['country'].drop_duplicates()\n", "country_names_array=gap_df['country'].unique()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We begin by connecting to the CountriesDB.db database, using the sqlite3.connect() function. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "conn = sqlite3.connect('CountriesDB.db')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We then create a cursor object using the cursor() function. We determine the tables stored in the database, we can query (using a SELECT statement) a special table called the sqlite_schema. It can also be referenced as sqlite_master. \n", "\n", "To retrieve data after executing a SELECT statement, one can either treat the cursor as an iterator, call the cursor’s fetchone() method to retrieve a single matching row, or call fetchall() to get a list of the matching rows." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['CountriesOfTheWorld', 'gapminder', 'Olymp2016']\n" ] } ], "source": [ "cursor = conn.cursor()\n", "cursor.execute(\"SELECT name FROM sqlite_schema WHERE type='table';\")\n", "db_tables= [x[0] for x in cursor.fetchall()] # we are storing this information in a list\n", "print(db_tables)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "If we want to determine the schema of one of these tables, say, the gapminder table, we may achieve it as follows." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('CREATE TABLE \"gapminder\" (\\n\\t\"country\"\\tTEXT,\\n\\t\"continent\"\\tTEXT,\\n\\t\"year\"\\tINTEGER,\\n\\t\"lifeExp\"\\tREAL,\\n\\t\"pop\"\\tINTEGER,\\n\\t\"gdpPercap\"\\tREAL\\n)',)]\n" ] } ], "source": [ "cursor.execute(\"SELECT sql FROM sqlite_master WHERE name = 'gapminder';\")\n", "print(cursor.fetchall())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Alternatively, we may also use the description method to determine the schema instead. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap']\n" ] } ], "source": [ "data=cursor.execute(\"SELECT * FROM 'gapminder'\")\n", "data_descr=[x[0] for x in data.description]\n", "print(data_descr)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Country', 'Region', 'Population', 'Area(sq.mi.)', 'Pop.Density(persq.mi.)', 'Coastline(coast/arearatio)', 'Netmigration', 'Infantmortality(per1000births)', 'GDP($percapita)', 'Literacy(%)', 'Phones(per1000)', 'Arable(%)', 'Crops(%)', 'Other(%)', 'Climate', 'Birthrate', 'Deathrate', 'Agriculture', 'Industry', 'Service']\n" ] } ], "source": [ "# If you do not want to type the names, you can ofcourse \n", "# refer to the db_tables and iterate as needed.\n", "data=cursor.execute('SELECT * FROM ' +str(db_tables[0]))\n", "data_descr=[x[0] for x in data.description]\n", "print(data_descr)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "How do we check the data itself? Depending on the purpose, we can obtain it in different manners. We have already seen the most basic option, of carrying out a cursor.execute() followed by fetchone()/fetchall(). \n", "\n", "One can also instead just carry out a pandas read_sql_query, and store the result in a dataframe. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CountryRegionPopulationArea(sq.mi.)Pop.Density(persq.mi.)Coastline(coast/arearatio)NetmigrationInfantmortality(per1000births)GDP($percapita)Literacy(%)Phones(per1000)Arable(%)Crops(%)Other(%)ClimateBirthrateDeathrateAgricultureIndustryService
0AfghanistanASIA (EX. NEAR EAST)3105699764750048,00,0023,06163,07700.036,03,212,130,2287,65146,620,340,380,240,38
1AlbaniaEASTERN EUROPE358165528748124,61,26-4,9321,524500.086,571,221,094,4274,49315,115,220,2320,1880,579
2AlgeriaNORTHERN AFRICA32930091238174013,80,04-0,39316000.070,078,13,220,2596,53117,144,610,1010,60,298
3American SamoaOCEANIA57794199290,458,29-20,719,278000.097,0259,5101575222,463,27NoneNoneNone
4AndorraWESTERN EUROPE71201468152,10,006,64,0519000.0100,0497,22,22097,7838,716,25NoneNoneNone
...............................................................
222West BankNEAR EAST24604925860419,90,002,9819,62800.0None145,216,918,9764,13331,673,920,090,280,63
223Western SaharaNORTHERN AFRICA2730082660001,00,42NoneNoneNaNNoneNone0,02099,981NoneNoneNoneNone0,4
224YemenNEAR EAST2145618852797040,60,36061,5800.050,237,22,780,2496,98142,898,30,1350,4720,393
225ZambiaSUB-SAHARAN AFRICA1150201075261415,30,00088,29800.080,68,27,080,0392,924119,930,220,290,489
226ZimbabweSUB-SAHARAN AFRICA1223680539058031,30,00067,691900.090,726,88,320,3491,34228,0121,840,1790,2430,579
\n", "

227 rows × 20 columns

\n", "
" ], "text/plain": [ " Country Region Population Area(sq.mi.) \\\n", "0 Afghanistan ASIA (EX. NEAR EAST) 31056997 647500 \n", "1 Albania EASTERN EUROPE 3581655 28748 \n", "2 Algeria NORTHERN AFRICA 32930091 2381740 \n", "3 American Samoa OCEANIA 57794 199 \n", "4 Andorra WESTERN EUROPE 71201 468 \n", ".. ... ... ... ... \n", "222 West Bank NEAR EAST 2460492 5860 \n", "223 Western Sahara NORTHERN AFRICA 273008 266000 \n", "224 Yemen NEAR EAST 21456188 527970 \n", "225 Zambia SUB-SAHARAN AFRICA 11502010 752614 \n", "226 Zimbabwe SUB-SAHARAN AFRICA 12236805 390580 \n", "\n", " Pop.Density(persq.mi.) Coastline(coast/arearatio) Netmigration \\\n", "0 48,0 0,00 23,06 \n", "1 124,6 1,26 -4,93 \n", "2 13,8 0,04 -0,39 \n", "3 290,4 58,29 -20,71 \n", "4 152,1 0,00 6,6 \n", ".. ... ... ... \n", "222 419,9 0,00 2,98 \n", "223 1,0 0,42 None \n", "224 40,6 0,36 0 \n", "225 15,3 0,00 0 \n", "226 31,3 0,00 0 \n", "\n", " Infantmortality(per1000births) GDP($percapita) Literacy(%) \\\n", "0 163,07 700.0 36,0 \n", "1 21,52 4500.0 86,5 \n", "2 31 6000.0 70,0 \n", "3 9,27 8000.0 97,0 \n", "4 4,05 19000.0 100,0 \n", ".. ... ... ... \n", "222 19,62 800.0 None \n", "223 None NaN None \n", "224 61,5 800.0 50,2 \n", "225 88,29 800.0 80,6 \n", "226 67,69 1900.0 90,7 \n", "\n", " Phones(per1000) Arable(%) Crops(%) Other(%) Climate Birthrate Deathrate \\\n", "0 3,2 12,13 0,22 87,65 1 46,6 20,34 \n", "1 71,2 21,09 4,42 74,49 3 15,11 5,22 \n", "2 78,1 3,22 0,25 96,53 1 17,14 4,61 \n", "3 259,5 10 15 75 2 22,46 3,27 \n", "4 497,2 2,22 0 97,78 3 8,71 6,25 \n", ".. ... ... ... ... ... ... ... \n", "222 145,2 16,9 18,97 64,13 3 31,67 3,92 \n", "223 None 0,02 0 99,98 1 None None \n", "224 37,2 2,78 0,24 96,98 1 42,89 8,3 \n", "225 8,2 7,08 0,03 92,9 2 41 19,93 \n", "226 26,8 8,32 0,34 91,34 2 28,01 21,84 \n", "\n", " Agriculture Industry Service \n", "0 0,38 0,24 0,38 \n", "1 0,232 0,188 0,579 \n", "2 0,101 0,6 0,298 \n", "3 None None None \n", "4 None None None \n", ".. ... ... ... \n", "222 0,09 0,28 0,63 \n", "223 None None 0,4 \n", "224 0,135 0,472 0,393 \n", "225 0,22 0,29 0,489 \n", "226 0,179 0,243 0,579 \n", "\n", "[227 rows x 20 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Countries_df = pd.read_sql_query(\"SELECT * FROM CountriesOfTheWorld\", conn) # we need to provide the DB connection information\n", "Countries_df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We will next create the Olymp2016 table using data from Wikipedia for the Rio 2016 Olympics.\n", "\n", "\"socially" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's start by reading the data from the Wikipedia page. The read_html() method works neatly (assuming that the HTML source file itself is constructed neatly, otherwise, it may get miserable!). The read_html() obtains several tables. After inspection of the results, it turned out that the table indexed '2' is the medal tally table for the specific Wiki-page (as on 9th August 2021). " ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RankNOCGoldSilverBronzeTotal
01United States (USA)463738121
12Great Britain (GBR)27231767
23China (CHN)26182670
34Russia (RUS)19172056
45Germany (GER)17101542
.....................
8278Nigeria (NGR)0011
8378Portugal (POR)0011
8478Trinidad and Tobago (TTO)0011
8578United Arab Emirates (UAE)0011
86Totals (86 NOCs)Totals (86 NOCs)307307359973
\n", "

87 rows × 6 columns

\n", "
" ], "text/plain": [ " Rank NOC Gold Silver Bronze Total\n", "0 1 United States (USA) 46 37 38 121\n", "1 2 Great Britain (GBR) 27 23 17 67\n", "2 3 China (CHN) 26 18 26 70\n", "3 4 Russia (RUS) 19 17 20 56\n", "4 5 Germany (GER) 17 10 15 42\n", ".. ... ... ... ... ... ...\n", "82 78 Nigeria (NGR) 0 0 1 1\n", "83 78 Portugal (POR) 0 0 1 1\n", "84 78 Trinidad and Tobago (TTO) 0 0 1 1\n", "85 78 United Arab Emirates (UAE) 0 0 1 1\n", "86 Totals (86 NOCs) Totals (86 NOCs) 307 307 359 973\n", "\n", "[87 rows x 6 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "olymp_df=pd.read_html(r'https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table')\n", "olymp2016medals=olymp_df[2]\n", "olymp2016medals" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We do not want to populate the database with a derived information of Totals, and wish to eliminate the last line. Let's try that!" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RankNOCGoldSilverBronzeTotal
01United States (USA)463738121
12Great Britain (GBR)27231767
23China (CHN)26182670
34Russia (RUS)19172056
45Germany (GER)17101542
.....................
8178Morocco (MAR)0011
8278Nigeria (NGR)0011
8378Portugal (POR)0011
8478Trinidad and Tobago (TTO)0011
8578United Arab Emirates (UAE)0011
\n", "

86 rows × 6 columns

\n", "
" ], "text/plain": [ " Rank NOC Gold Silver Bronze Total\n", "0 1 United States (USA) 46 37 38 121\n", "1 2 Great Britain (GBR) 27 23 17 67\n", "2 3 China (CHN) 26 18 26 70\n", "3 4 Russia (RUS) 19 17 20 56\n", "4 5 Germany (GER) 17 10 15 42\n", ".. ... ... ... ... ... ...\n", "81 78 Morocco (MAR) 0 0 1 1\n", "82 78 Nigeria (NGR) 0 0 1 1\n", "83 78 Portugal (POR) 0 0 1 1\n", "84 78 Trinidad and Tobago (TTO) 0 0 1 1\n", "85 78 United Arab Emirates (UAE) 0 0 1 1\n", "\n", "[86 rows x 6 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "olymp2016medals=olymp_df[2][:-1]\n", "olymp2016medals" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "NOC, standing for National Olympic Committees may become difficult to recall later on. Let's rename the column as Country instead." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RankCountryGoldSilverBronzeTotal
01United States (USA)463738121
12Great Britain (GBR)27231767
23China (CHN)26182670
34Russia (RUS)19172056
45Germany (GER)17101542
.....................
8178Morocco (MAR)0011
8278Nigeria (NGR)0011
8378Portugal (POR)0011
8478Trinidad and Tobago (TTO)0011
8578United Arab Emirates (UAE)0011
\n", "

86 rows × 6 columns

\n", "
" ], "text/plain": [ " Rank Country Gold Silver Bronze Total\n", "0 1 United States (USA) 46 37 38 121\n", "1 2 Great Britain (GBR) 27 23 17 67\n", "2 3 China (CHN) 26 18 26 70\n", "3 4 Russia (RUS) 19 17 20 56\n", "4 5 Germany (GER) 17 10 15 42\n", ".. ... ... ... ... ... ...\n", "81 78 Morocco (MAR) 0 0 1 1\n", "82 78 Nigeria (NGR) 0 0 1 1\n", "83 78 Portugal (POR) 0 0 1 1\n", "84 78 Trinidad and Tobago (TTO) 0 0 1 1\n", "85 78 United Arab Emirates (UAE) 0 0 1 1\n", "\n", "[86 rows x 6 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "olymp2016medals = olymp2016medals.rename(columns={'NOC':'Country'})\n", "olymp2016medals" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's check the data types." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "Rank object\n", "Country object\n", "Gold int64\n", "Silver int64\n", "Bronze int64\n", "Total int64\n", "dtype: object" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "olymp2016medals.dtypes" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "How about we designate rank explicitly as integer? " ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "Rank int64\n", "Country object\n", "Gold int64\n", "Silver int64\n", "Bronze int64\n", "Total int64\n", "dtype: object" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "olymp2016medals=olymp2016medals.astype({'Rank': 'int64'}, copy=True)\n", "olymp2016medals.dtypes" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We are almost there. However, since I have populated the database with this data previously, and (may) reuse this teaching material, it is possible that the table already exists in the DB! As such, lets do a conditional deletion of the table, in case it exists, using a DROP operation. And now, you know how to DROP a table from your DB!\n", "\n", "\"SQLLite\n", "\n", "Once that is done, let's add the Table using Panda's to_sql() function." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Drop table from DB conditionally, if it already existed.\n", "if \"Olymp2016\" in db_tables:\n", " cursor.execute('''DROP TABLE Olymp2016''')\n", " \n", "# Add table to DB \n", "olymp2016medals.to_sql('Olymp2016', conn, if_exists='replace', index = False)\n", "# With the use of if_exists option, we could in fact get rid of the explicit conditional DROP TABLE op. \n", "# other options for if_exists: fail, append (default: fail)\n", "conn.commit() # Commit the changes to be sure. " ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RankCountryGoldSilverBronzeTotal
0TrueTrueTrueTrueTrueTrue
1TrueTrueTrueTrueTrueTrue
2TrueTrueTrueTrueTrueTrue
3TrueTrueTrueTrueTrueTrue
4TrueTrueTrueTrueTrueTrue
.....................
81TrueTrueTrueTrueTrueTrue
82TrueTrueTrueTrueTrueTrue
83TrueTrueTrueTrueTrueTrue
84TrueTrueTrueTrueTrueTrue
85TrueTrueTrueTrueTrueTrue
\n", "

86 rows × 6 columns

\n", "
" ], "text/plain": [ " Rank Country Gold Silver Bronze Total\n", "0 True True True True True True\n", "1 True True True True True True\n", "2 True True True True True True\n", "3 True True True True True True\n", "4 True True True True True True\n", ".. ... ... ... ... ... ...\n", "81 True True True True True True\n", "82 True True True True True True\n", "83 True True True True True True\n", "84 True True True True True True\n", "85 True True True True True True\n", "\n", "[86 rows x 6 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sanity check code: Is the data the same?\n", "\n", "Olympics_df=pd.read_sql_query(\"SELECT * FROM Olymp2016\", conn)\n", "olymp2016medals == Olympics_df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We are all good! The table entries have been added to the DB. We can now start querying the data. For example, if we want to know which all countries won more Gold medals than Silver and Bronze medals combined, we can get that easy-peasy." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RankCountryGoldSilverBronzeTotal
012Hungary (HUN)83415
116Jamaica (JAM)63211
227Argentina (ARG)3104
354Fiji (FIJ)1001
454Jordan (JOR)1001
554Kosovo (KOS)1001
654Puerto Rico (PUR)1001
754Singapore (SIN)1001
854Tajikistan (TJK)1001
\n", "
" ], "text/plain": [ " Rank Country Gold Silver Bronze Total\n", "0 12 Hungary (HUN) 8 3 4 15\n", "1 16 Jamaica (JAM) 6 3 2 11\n", "2 27 Argentina (ARG) 3 1 0 4\n", "3 54 Fiji (FIJ) 1 0 0 1\n", "4 54 Jordan (JOR) 1 0 0 1\n", "5 54 Kosovo (KOS) 1 0 0 1\n", "6 54 Puerto Rico (PUR) 1 0 0 1\n", "7 54 Singapore (SIN) 1 0 0 1\n", "8 54 Tajikistan (TJK) 1 0 0 1" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "MoreGoldThanSilver_df= pd.read_sql_query(\"SELECT * FROM Olymp2016 WHERE Gold>Silver+Bronze\", conn)\n", "MoreGoldThanSilver_df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "SQLite does not implement SQL fully, e.g., it does not support RIGHT JOIN and FULL OUTER JOIN (though it can be emulated, e.g., see here for details). But it does support a lot of features, including INNER JOIN. \n", "\n", "Here's an example of how we can create a summary of countries with their population as per the CountriesOfTheWorld table, choosing the countries where the life expectancy reported in 21st century has been less than 50 years, along with the year of the record, as per the gapminder table. \n", "\n", "To make the code readable (and reusable?), we have written the query separately first, before invoking the actual read_sql_query() function." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CountryPopulationyearlifeExp
0Afghanistan31056997200242.129
1Afghanistan31056997200743.828
2Angola12127071200241.003
3Angola12127071200742.731
4Botswana1639833200246.634
5Burundi8090068200247.360
6Burundi8090068200749.580
7Cameroon17340702200249.856
8Congo, Dem. Rep.62660551200244.966
9Congo, Dem. Rep.62660551200746.462
10Cote d'Ivoire17654843200246.832
11Cote d'Ivoire17654843200748.328
12Equatorial Guinea540109200249.348
13Guinea-Bissau1442029200245.504
14Guinea-Bissau1442029200746.388
15Lesotho2022331200244.593
16Lesotho2022331200742.592
17Liberia3042004200243.753
18Liberia3042004200745.678
19Malawi13013926200245.009
20Malawi13013926200748.303
21Mozambique19686505200244.026
22Mozambique19686505200742.082
23Nigeria131859731200246.608
24Nigeria131859731200746.859
25Rwanda8648248200243.413
26Rwanda8648248200746.242
27Sierra Leone6005250200241.012
28Sierra Leone6005250200742.568
29Somalia8863338200245.936
30Somalia8863338200748.159
31South Africa44187637200749.339
32Swaziland1136334200243.869
33Swaziland1136334200739.613
34Tanzania37445392200249.651
35Uganda28195754200247.813
36Zambia11502010200239.193
37Zambia11502010200742.384
38Zimbabwe12236805200239.989
39Zimbabwe12236805200743.487
\n", "
" ], "text/plain": [ " Country Population year lifeExp\n", "0 Afghanistan 31056997 2002 42.129\n", "1 Afghanistan 31056997 2007 43.828\n", "2 Angola 12127071 2002 41.003\n", "3 Angola 12127071 2007 42.731\n", "4 Botswana 1639833 2002 46.634\n", "5 Burundi 8090068 2002 47.360\n", "6 Burundi 8090068 2007 49.580\n", "7 Cameroon 17340702 2002 49.856\n", "8 Congo, Dem. Rep. 62660551 2002 44.966\n", "9 Congo, Dem. Rep. 62660551 2007 46.462\n", "10 Cote d'Ivoire 17654843 2002 46.832\n", "11 Cote d'Ivoire 17654843 2007 48.328\n", "12 Equatorial Guinea 540109 2002 49.348\n", "13 Guinea-Bissau 1442029 2002 45.504\n", "14 Guinea-Bissau 1442029 2007 46.388\n", "15 Lesotho 2022331 2002 44.593\n", "16 Lesotho 2022331 2007 42.592\n", "17 Liberia 3042004 2002 43.753\n", "18 Liberia 3042004 2007 45.678\n", "19 Malawi 13013926 2002 45.009\n", "20 Malawi 13013926 2007 48.303\n", "21 Mozambique 19686505 2002 44.026\n", "22 Mozambique 19686505 2007 42.082\n", "23 Nigeria 131859731 2002 46.608\n", "24 Nigeria 131859731 2007 46.859\n", "25 Rwanda 8648248 2002 43.413\n", "26 Rwanda 8648248 2007 46.242\n", "27 Sierra Leone 6005250 2002 41.012\n", "28 Sierra Leone 6005250 2007 42.568\n", "29 Somalia 8863338 2002 45.936\n", "30 Somalia 8863338 2007 48.159\n", "31 South Africa 44187637 2007 49.339\n", "32 Swaziland 1136334 2002 43.869\n", "33 Swaziland 1136334 2007 39.613\n", "34 Tanzania 37445392 2002 49.651\n", "35 Uganda 28195754 2002 47.813\n", "36 Zambia 11502010 2002 39.193\n", "37 Zambia 11502010 2007 42.384\n", "38 Zimbabwe 12236805 2002 39.989\n", "39 Zimbabwe 12236805 2007 43.487" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "SQL_quer = \"\"\"\n", " SELECT CountriesOfTheWorld.Country, CountriesOfTheWorld.Population, year, lifeExp\n", " FROM CountriesOfTheWorld\n", " INNER JOIN gapminder\n", " ON CountriesOfTheWorld.Country =gapminder.country\n", " WHERE year>1999 and lifeExp <50;\n", "\"\"\"\n", "\n", "Quer_res = pd.read_sql_query(SQL_quer, conn)\n", "Quer_res\n", "#conn.close() when you are done, and want to close the connection to the DB" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Ungraded tasks\n", "Now, try a few things out yourselves. Following are some ideas to try. But think for yourself also, what other interesting information do you want to determine from these three tables!\n", "\n", "If your Python/Pandas skills are good, you will be tempted to import the data as Pandas dataframes and then operate on those dataframes. But, let's do that some other time (next session, in fact!). Instead, try to see what all you can do purely with SQL(ite). However, you may use Python to create scripts invoking the necessary SQL commands to carry out any necessary 'data cleaning or manipulation' operations.\n", "\n", "Ungraded Task 1.1: Determine all the countries with a population of more than 20 million as per the CountriesOfTheWorld table, that got no gold medals. \n", "\n", "Ungraded Task 1.2: Determine the ratio of the number of total medals to the per capita GDP for each country as per the CountriesOfTheWorld table.\n", "\n", "If and when you manage to solve these, share and compare your solution with others. Challenge each other with other such queries! Contemplate on what difficulties you encountered in answering these questions. \n", "\n", "Later in the course, we will delve into data wrangling, and use Python/Pandas to solve such questions. You could then compare the pros and cons with your experience. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

That's it folks!

\n", "\n", "\"SQLLite" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }