{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "###### Natural Language Processing & Machine Learning\n", "

\n", "SC 4125: Developing Data Products\n", "\n", "Module-6: Statistical tools for natural language processing and machine learning\n", "\n", "by Anwitaman DATTA\n", "\n", "School of Computer Science and Engineering, NTU Singapore.\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Teaching material\n", "- .html deck of slides\n", "- .ipynb Jupyter notebook" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Disclaimer/Caveat emptor\n", "\n", "- Non-systematic and non-exhaustive review\n", "- Illustrative approaches are not necessarily the most efficient or elegant, let alone unique" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Acknowledgement & Disclaimer\n", "\n", "> This module is based on some original fragments of codes mixed with code snippets from many different sources (books, blogs, git repos, documentations, videos, ...). The narrative in the module is also influenced from the many sources consulted. They have been referenced in-place in a best effort manner. Links to several third-party videos and reading references have also been provided.\n", "\n", "> If there are any attribution omissions to be rectified, or should anything in the material need to be changed or redacted, the copyright owners are requested to contact me at anwitaman@ntu.edu.sg " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Elementary concepts" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from IPython.display import HTML" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import nltk\n", "#nltk.download(\"book\") \n", "# We may use some of the existing corpora for some of our examples\n", "# see more about nltk's corpora at https://www.nltk.org/book/ch02.html \n", "#from nltk.book import * ## Not used finally\n", "\n", "#########\n", "\n", "import spacy\n", "nlp = spacy.load(\"en_core_web_sm\")\n", "#!python -m 
spacy download en_core_web_sm\n", "# see more about spaCy models at https://spacy.io/models" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Making a first sense: What comprises the data?\n", "\n", "> Encoding: e.g. UTF-8
\n", "> - ♞ to F6 \n", "> - Smöoy Frozen Yogurt\n", "\n", "> Cleaning the data\n", "> - e.g., using RegEx: We studied this in one of the early modules " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Tokenization: Granularity - Word, Sentence\n", "\n", "
\n", "
\n", "
Image source: Spacy.io \n", "
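A minimal sketch of the two granularities, using only Python's `re` module rather than NLTK or spaCy. The patterns and the helper names `word_tokens` and `sentence_tokens` are illustrative assumptions, and the sentence rule is deliberately naive.

```python
import re

def word_tokens(text):
    # A run of letters/digits, or any other single character that is not a space
    return re.findall('[A-Za-z0-9]+|[^A-Za-z0-9 ]', text)

def sentence_tokens(text):
    # Naive rule: split on spaces that follow '.', '!' or '?'
    return [s.strip() for s in re.split('(?<=[.!?]) +', text) if s.strip()]

sample = 'Iran claims it seeks nuclear energy. The F.D.A. panel met today.'
print(word_tokens(sample))
print(sentence_tokens(sample))
```

The naive sentence rule also splits after the abbreviation F.D.A., which is the kind of corner case that trained tokenizers try to handle, not always successfully.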
\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "NYT_Headline = \"F.D.A. Panel Recommends Booster for Many Moderna Vaccine Recipients. Those eligible for the extra shot would include adults over 65 and others at high risk — the same groups now eligible for a Pfizer-BioNTech boost. By Sharon LaFraniere and Noah Weiland, Oct. 14, 2021\"\n", "\n", "EE1 =\"An alarming problem looms once more: Iran is rapidly advancing its nuclear programme; Israel is threatening military action against it; and America is seeking a diplomatic solution. Anton La Guardia, The Economist’s diplomatic editor, wrote about the international crisis this poses. Al-Monitor, a publication that covers the Middle East, has been monitoring the negotiations to revive a nuclear deal from 2015. It makes for a fiendish dilemma that the former French president, Nicolas Sarkozy, encapsulated thus: `an Iranian bomb or the bombing of Iran'. Iran doesn't yet have a nuclear weapon. But the situation is in many ways worse than in the past. The country is closer than ever to being able to make a nuke. One expert, David Albright, puts the `breakout time'—the time needed to make one bomb’s worth of highly enriched uranium—at just one month. Mr Albright’s book is one of the most detailed accounts of the Iranian programme, drawing on years of inspections by the International Atomic Energy Agency and a trove of Iranian documents obtained by Israeli intelligence. Another problem is that the credibility of American diplomacy has been damaged by Donald Trump’s repudiation of the nuclear deal. His successor, Joe Biden, is trying to revive it but Iran doesn’t seem to be interested. Robert Malley, America's special envoy on Iran, discusses the impasse in a recent interview. Iran claims it seeks nuclear energy only for civilian purposes. 
The trouble is that the technology used to make low-enriched uranium fuel for nuclear power stations is also used to make highly-enriched uranium for weapons. Whatever its ultimate aim, Iran’s regime attaches great national pride to its mastery of nuclear technology: “our moon shot” is how one Iranian official put it, according to “The Back Channel”, a memoir by William Burns, the head of the CIA who, as a diplomat, helped negotiate the nuclear deal.\"" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['An', 'alarming', 'problem', 'looms', 'once', 'more:', 'Iran', 'is', 'rapidly', 'advancing', 'its', 'nuclear', 'programme;', 'Israel', 'is', 'threatening', 'military', 'action', 'against', 'it;', 'and', 'America', 'is', 'seeking', 'a', 'diplomatic', 'solution.', 'Anton', 'La', 'Guardia,', 'The', 'Economist’s', 'diplomatic', 'editor,', 'wrote', 'about', 'the', 'international', 'crisis', 'this', 'poses.', 'Al-Monitor,', 'a', 'publication', 'that', 'covers', 'the', 'Middle', 'East,', 'has', 'been', 'monitoring', 'the', 'negotiations', 'to', 'revive', 'a', 'nuclear', 'deal', 'from', '2015.', 'It', 'makes', 'for', 'a', 'fiendish', 'dilemma', 'that', 'the', 'former', 'French', 'president,', 'Nicolas', 'Sarkozy,', 'encapsulated', 'thus:', '`an', 'Iranian', 'bomb', 'or', 'the', 'bombing', 'of', \"Iran'.\", 'Iran', \"doesn't\", 'yet', 'have', 'a', 'nuclear', 'weapon.', 'But', 'the', 'situation', 'is', 'in', 'many', 'ways', 'worse', 'than', 'in', 'the', 'past.', 'The', 'country', 'is', 'closer', 'than', 'ever', 'to', 'being', 'able', 'to', 'make', 'a', 'nuke.', 'One', 'expert,', 'David', 'Albright,', 'puts', 'the', '`breakout', \"time'—the\", 'time', 'needed', 'to', 'make', 'one', 'bomb’s', 'worth', 'of', 'highly', 'enriched', 'uranium—at', 'just', 'one', 'month.', 'Mr', 'Albright’s', 'book', 'is', 'one', 'of', 'the', 'most', 'detailed', 'accounts', 
'of', 'the', 'Iranian', 'programme,', 'drawing', 'on', 'years', 'of', 'inspections', 'by', 'the', 'International', 'Atomic', 'Energy', 'Agency', 'and', 'a', 'trove', 'of', 'Iranian', 'documents', 'obtained', 'by', 'Israeli', 'intelligence.', 'Another', 'problem', 'is', 'that', 'the', 'credibility', 'of', 'American', 'diplomacy', 'has', 'been', 'damaged', 'by', 'Donald', 'Trump’s', 'repudiation', 'of', 'the', 'nuclear', 'deal.', 'His', 'successor,', 'Joe', 'Biden,', 'is', 'trying', 'to', 'revive', 'it', 'but', 'Iran', 'doesn’t', 'seem', 'to', 'be', 'interested.', 'Robert', 'Malley,', \"America's\", 'special', 'envoy', 'on', 'Iran,', 'discusses', 'the', 'impasse', 'in', 'a', 'recent', 'interview.', 'Iran', 'claims', 'it', 'seeks', 'nuclear', 'energy', 'only', 'for', 'civilian', 'purposes.', 'The', 'trouble', 'is', 'that', 'the', 'technology', 'used', 'to', 'make', 'low-enriched', 'uranium', 'fuel', 'for', 'nuclear', 'power', 'stations', 'is', 'also', 'used', 'to', 'make', 'highly-enriched', 'uranium', 'for', 'weapons.', 'Whatever', 'its', 'ultimate', 'aim,', 'Iran’s', 'regime', 'attaches', 'great', 'national', 'pride', 'to', 'its', 'mastery', 'of', 'nuclear', 'technology:', '“our', 'moon', 'shot”', 'is', 'how', 'one', 'Iranian', 'official', 'put', 'it,', 'according', 'to', '“The', 'Back', 'Channel”,', 'a', 'memoir', 'by', 'William', 'Burns,', 'the', 'head', 'of', 'the', 'CIA', 'who,', 'as', 'a', 'diplomat,', 'helped', 'negotiate', 'the', 'nuclear', 'deal.']\n" ] } ], "source": [ "# Word tokenization\n", "# Just split by space?\n", "print(EE1.split())" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['An', 'alarming', 'problem', 'looms', 'once', 'more', ':', 'Iran', 'is', 'rapidly', 'advancing', 'its', 'nuclear', 'programme', ';', 'Israel', 'is', 'threatening', 'military', 'action', 'against', 'it', ';', 'and', 'America', 
'is', 'seeking', 'a', 'diplomatic', 'solution', '.', 'Anton', 'La', 'Guardia', ',', 'The', 'Economist', '’', 's', 'diplomatic', 'editor', ',', 'wrote', 'about', 'the', 'international', 'crisis', 'this', 'poses', '.', 'Al-Monitor', ',', 'a', 'publication', 'that', 'covers', 'the', 'Middle', 'East', ',', 'has', 'been', 'monitoring', 'the', 'negotiations', 'to', 'revive', 'a', 'nuclear', 'deal', 'from', '2015', '.', 'It', 'makes', 'for', 'a', 'fiendish', 'dilemma', 'that', 'the', 'former', 'French', 'president', ',', 'Nicolas', 'Sarkozy', ',', 'encapsulated', 'thus', ':', '`', 'an', 'Iranian', 'bomb', 'or', 'the', 'bombing', 'of', 'Iran', \"'\", '.', 'Iran', 'does', \"n't\", 'yet', 'have', 'a', 'nuclear', 'weapon', '.', 'But', 'the', 'situation', 'is', 'in', 'many', 'ways', 'worse', 'than', 'in', 'the', 'past', '.', 'The', 'country', 'is', 'closer', 'than', 'ever', 'to', 'being', 'able', 'to', 'make', 'a', 'nuke', '.', 'One', 'expert', ',', 'David', 'Albright', ',', 'puts', 'the', '`', 'breakout', \"time'—the\", 'time', 'needed', 'to', 'make', 'one', 'bomb', '’', 's', 'worth', 'of', 'highly', 'enriched', 'uranium—at', 'just', 'one', 'month', '.', 'Mr', 'Albright', '’', 's', 'book', 'is', 'one', 'of', 'the', 'most', 'detailed', 'accounts', 'of', 'the', 'Iranian', 'programme', ',', 'drawing', 'on', 'years', 'of', 'inspections', 'by', 'the', 'International', 'Atomic', 'Energy', 'Agency', 'and', 'a', 'trove', 'of', 'Iranian', 'documents', 'obtained', 'by', 'Israeli', 'intelligence', '.', 'Another', 'problem', 'is', 'that', 'the', 'credibility', 'of', 'American', 'diplomacy', 'has', 'been', 'damaged', 'by', 'Donald', 'Trump', '’', 's', 'repudiation', 'of', 'the', 'nuclear', 'deal', '.', 'His', 'successor', ',', 'Joe', 'Biden', ',', 'is', 'trying', 'to', 'revive', 'it', 'but', 'Iran', 'doesn', '’', 't', 'seem', 'to', 'be', 'interested', '.', 'Robert', 'Malley', ',', 'America', \"'s\", 'special', 'envoy', 'on', 'Iran', ',', 'discusses', 'the', 'impasse', 'in', 'a', 'recent', 
'interview', '.', 'Iran', 'claims', 'it', 'seeks', 'nuclear', 'energy', 'only', 'for', 'civilian', 'purposes', '.', 'The', 'trouble', 'is', 'that', 'the', 'technology', 'used', 'to', 'make', 'low-enriched', 'uranium', 'fuel', 'for', 'nuclear', 'power', 'stations', 'is', 'also', 'used', 'to', 'make', 'highly-enriched', 'uranium', 'for', 'weapons', '.', 'Whatever', 'its', 'ultimate', 'aim', ',', 'Iran', '’', 's', 'regime', 'attaches', 'great', 'national', 'pride', 'to', 'its', 'mastery', 'of', 'nuclear', 'technology', ':', '“', 'our', 'moon', 'shot', '”', 'is', 'how', 'one', 'Iranian', 'official', 'put', 'it', ',', 'according', 'to', '“', 'The', 'Back', 'Channel', '”', ',', 'a', 'memoir', 'by', 'William', 'Burns', ',', 'the', 'head', 'of', 'the', 'CIA', 'who', ',', 'as', 'a', 'diplomat', ',', 'helped', 'negotiate', 'the', 'nuclear', 'deal', '.']\n" ] } ], "source": [ "# word tokenization using nltk\n", "print(nltk.word_tokenize(EE1))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[An, alarming, problem, looms, once, more, :, Iran, is, rapidly, advancing, its, nuclear, programme, ;, Israel, is, threatening, military, action, against, it, ;, and, America, is, seeking, a, diplomatic, solution, ., Anton, La, Guardia, ,, The, Economist, ’s, diplomatic, editor, ,, wrote, about, the, international, crisis, this, poses, ., Al, -, Monitor, ,, a, publication, that, covers, the, Middle, East, ,, has, been, monitoring, the, negotiations, to, revive, a, nuclear, deal, from, 2015, ., It, makes, for, a, fiendish, dilemma, that, the, former, French, president, ,, Nicolas, Sarkozy, ,, encapsulated, thus, :, `, an, Iranian, bomb, or, the, bombing, of, Iran, ', ., Iran, does, n't, yet, have, a, nuclear, weapon, ., But, the, situation, is, in, many, ways, worse, than, in, the, past, ., The, country, is, closer, than, ever, to, being, able, to, make, a, 
nuke, ., One, expert, ,, David, Albright, ,, puts, the, `, breakout, time'—the, time, needed, to, make, one, bomb, ’s, worth, of, highly, enriched, uranium, —, at, just, one, month, ., Mr, Albright, ’s, book, is, one, of, the, most, detailed, accounts, of, the, Iranian, programme, ,, drawing, on, years, of, inspections, by, the, International, Atomic, Energy, Agency, and, a, trove, of, Iranian, documents, obtained, by, Israeli, intelligence, ., Another, problem, is, that, the, credibility, of, American, diplomacy, has, been, damaged, by, Donald, Trump, ’s, repudiation, of, the, nuclear, deal, ., His, successor, ,, Joe, Biden, ,, is, trying, to, revive, it, but, Iran, does, n’t, seem, to, be, interested, ., Robert, Malley, ,, America, 's, special, envoy, on, Iran, ,, discusses, the, impasse, in, a, recent, interview, ., Iran, claims, it, seeks, nuclear, energy, only, for, civilian, purposes, ., The, trouble, is, that, the, technology, used, to, make, low, -, enriched, uranium, fuel, for, nuclear, power, stations, is, also, used, to, make, highly, -, enriched, uranium, for, weapons, ., Whatever, its, ultimate, aim, ,, Iran, ’s, regime, attaches, great, national, pride, to, its, mastery, of, nuclear, technology, :, “, our, moon, shot, ”, is, how, one, Iranian, official, put, it, ,, according, to, “, The, Back, Channel, ”, ,, a, memoir, by, William, Burns, ,, the, head, of, the, CIA, who, ,, as, a, diplomat, ,, helped, negotiate, the, nuclear, deal, .]\n" ] } ], "source": [ "# word tokenization using spaCy\n", "docEE = nlp(EE1) \n", "# This simple step creates a Doc object which has lot of information, some of which we will use subsequently \n", "print([token for token in docEE])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['An alarming problem looms once more: Iran is rapidly advancing its nuclear programme; Israel is threatening military 
action against it; and America is seeking a diplomatic solution.',\n", " 'Anton La Guardia, The Economist’s diplomatic editor, wrote about the international crisis this poses.',\n", " 'Al-Monitor, a publication that covers the Middle East, has been monitoring the negotiations to revive a nuclear deal from 2015.',\n", " \"It makes for a fiendish dilemma that the former French president, Nicolas Sarkozy, encapsulated thus: `an Iranian bomb or the bombing of Iran'.\",\n", " \"Iran doesn't yet have a nuclear weapon.\",\n", " 'But the situation is in many ways worse than in the past.',\n", " 'The country is closer than ever to being able to make a nuke.',\n", " \"One expert, David Albright, puts the `breakout time'—the time needed to make one bomb’s worth of highly enriched uranium—at just one month.\",\n", " 'Mr Albright’s book is one of the most detailed accounts of the Iranian programme, drawing on years of inspections by the International Atomic Energy Agency and a trove of Iranian documents obtained by Israeli intelligence.',\n", " 'Another problem is that the credibility of American diplomacy has been damaged by Donald Trump’s repudiation of the nuclear deal.',\n", " 'His successor, Joe Biden, is trying to revive it but Iran doesn’t seem to be interested.',\n", " \"Robert Malley, America's special envoy on Iran, discusses the impasse in a recent interview.\",\n", " 'Iran claims it seeks nuclear energy only for civilian purposes.',\n", " 'The trouble is that the technology used to make low-enriched uranium fuel for nuclear power stations is also used to make highly-enriched uranium for weapons.',\n", " 'Whatever its ultimate aim, Iran’s regime attaches great national pride to its mastery of nuclear technology: “our moon shot” is how one Iranian official put it, according to “The Back Channel”, a memoir by William Burns, the head of the CIA who, as a diplomat, helped negotiate the nuclear deal.']" ] }, "execution_count": 7, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "# Sentence tokenization using nltk\n", "nltk.sent_tokenize(EE1)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[An alarming problem looms once more: Iran is rapidly advancing its nuclear programme; Israel is threatening military action against it; and America is seeking a diplomatic solution.,\n", " Anton La Guardia, The Economist’s diplomatic editor, wrote about the international crisis this poses.,\n", " Al-Monitor, a publication that covers the Middle East, has been monitoring the negotiations to revive a nuclear deal from 2015.,\n", " It makes for a fiendish dilemma that the former French president, Nicolas Sarkozy, encapsulated thus: `an Iranian bomb or the bombing of Iran'.,\n", " Iran doesn't yet have a nuclear weapon.,\n", " But the situation is in many ways worse than in the past.,\n", " The country is closer than ever to being able to make a nuke.,\n", " One expert, David Albright, puts the `breakout time'—the time needed to make one bomb’s worth of highly enriched uranium—at just one month.,\n", " Mr Albright’s book is one of the most detailed accounts of the Iranian programme, drawing on years of inspections by the International Atomic Energy Agency and a trove of Iranian documents obtained by Israeli intelligence.,\n", " Another problem is that the credibility of American diplomacy has been damaged by Donald Trump’s repudiation of the nuclear deal.,\n", " His successor, Joe Biden, is trying to revive it,\n", " but Iran doesn’t seem to be interested.,\n", " Robert Malley, America's special envoy on Iran, discusses the impasse in a recent interview.,\n", " Iran claims it seeks nuclear energy only for civilian purposes.,\n", " The trouble is that the technology used to make low-enriched uranium fuel for nuclear power stations is also used to make highly-enriched uranium for weapons.,\n", " Whatever its ultimate aim, 
Iran’s regime attaches great national pride to its mastery of nuclear technology: “our moon shot” is how one Iranian official put it, according to “The Back Channel”, a memoir by William Burns, the head of the CIA who, as a diplomat, helped negotiate the nuclear deal.]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sentence tokenization using spaCy\n", "[x for x in docEE.sents]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['F.D.A.', 'Panel Recommends Booster for Many Moderna Vaccine Recipients.', 'Those eligible for the extra shot would include adults over 65 and others at high risk — the same groups now eligible for a Pfizer-BioNTech boost.', 'By Sharon LaFraniere and Noah Weiland, Oct. 14, 2021']\n", "\n", "[F.D.A. Panel Recommends Booster for Many Moderna Vaccine Recipients., Those eligible for the extra shot would include adults over 65 and others at high risk — the same groups now eligible for a Pfizer-BioNTech boost., By Sharon LaFraniere and Noah Weiland, Oct. 14, 2021]\n" ] } ], "source": [ "# Sentence tokenization: Further example\n", "# With the NYT headline example, the nltk model treats F.D.A. as a sentence of its own!\n", "# This example must NOT be used to conclude that one toolkit is superior to the other. \n", "# More rigorous/exhaustive experiments are needed for that, and each toolkit excels at some tasks.\n", "# The only point of this example is to show the general limitations of these black-box tools!\n", "\n", "print(nltk.sent_tokenize(NYT_Headline)) # nltk \n", "print()\n", "docNYT = nlp(NYT_Headline) # spaCy \n", "print([x for x in docNYT.sents])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "> Normalization: Map text to a canonical representation. 
\n", "> - Examples:\n", "> * b4, Before, befor, before\n", "> * US, U.S.A., USA, United States, United States of America, ... \n", "> * 5th of November, 5 Nov., November 5, ...\n", "> * synonyms: Pfizer–BioNTech COVID-19 vaccine, Comirnaty, BNT162b2, ... \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "> Normalization: Map text to a canonical representation. \n", "> - Some standard tools: \n", "> * Stemming: usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. \n", "> * Lemmatization: usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.\n", "> * Definitions from Introduction to Information Retrieval by Manning et al. 
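
Stemming and lemmatization deal with inflection, but variants such as U.S.A. versus United States need an explicit mapping. The sketch below is a hypothetical lookup-table normalizer: the table `CANONICAL` and the function `normalize` are illustrative names, not part of NLTK or spaCy, and a realistic table would be far larger.

```python
# Hypothetical canonical-form table (illustrative entries only).
# Keys are lower-cased; 'us' is deliberately absent so the pronoun is left alone.
CANONICAL = {
    'b4': 'before',
    'befor': 'before',
    'u.s.a.': 'United States',
    'usa': 'United States',
    'united states of america': 'United States',
}

def normalize(term):
    # Case-insensitive lookup; unknown terms are returned unchanged
    return CANONICAL.get(term.strip().lower(), term)

for t in ['B4', 'USA', 'U.S.A.', 'us', 'Comirnaty']:
    print(t, '->', normalize(t))
```

Because the lookup lower-cases its input, this sketch cannot tell US from us, so that distinction would have to be resolved before case is discarded.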
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Stemming and Lemmatization example with a list of words\n", "input_list = \"resting restful restless being is was am goodness goods\"\n", "words_list = input_list.lower().split(' ')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NLTK PorterStemmer ['rest', 'rest', 'restless', 'be', 'is', 'wa', 'am', 'good', 'good']\n", "NLTK SnowStemmer ['rest', 'rest', 'restless', 'be', 'is', 'was', 'am', 'good', 'good']\n" ] } ], "source": [ "# nltk includes multiple stemming algorithms, e.g., Porter, Snowball, ...\n", "## see more at https://www.nltk.org/api/nltk.stem.html\n", "porter = nltk.PorterStemmer()\n", "print(\"NLTK PorterStemmer\", [porter.stem(t) for t in words_list])\n", "##\n", "snowball = nltk.SnowballStemmer(\"english\")\n", "print(\"NLTK SnowStemmer\", [snowball.stem(t) for t in words_list])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NLTK Wordnet Lemmatizer ['resting', 'restful', 'restless', 'being', 'is', 'wa', 'am', 'goodness', 'good']\n" ] } ], "source": [ "# Wordnet based lemmatizer in nltk\n", "WNlemma = nltk.WordNetLemmatizer()\n", "print(\"NLTK Wordnet Lemmatizer\", [WNlemma.lemmatize(x) for x in words_list])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(10960894369163974213, 'rest'), (12859622076218287646, 'restful'), (15810880026682126118, 'restless'), (3899131925553995529, 'being'), (10382539506755952630, 'be'), (10382539506755952630, 'be'), (10382539506755952630, 'be'), (13871556783787251893, 'goodness'), 
(5711639017775284443, 'good')]\n" ] } ], "source": [ "# spaCy also exposes the hash of each lemma (token.lemma), which is convenient, e.g., for aggregation, Map-Reduce, etc.\n", "list_variations=nlp(input_list)\n", "print([(x.lemma, x.lemma_) for x in list_variations])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Ungraded task 6.1: Write a function which takes text as input and returns a list of lemmatized tokens, with stopwords and punctuation filtered out. \n", "\n", "Example output: ['alarming', 'problem', 'loom', 'Iran', ...]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Making sense of sentence structure: Part-of-speech (POS) tagging\n", "\n", "> Note: Be careful about which normalizations are carried out in earlier stages, since some of them destroy semantics that are essential for making sense of the sentence, e.g., US versus us.\n", "\n", "\n", "> The language may need to be detected before this step, since grammar rules are language-dependent." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "F.D.A. Panel Recommends Booster for Many Moderna Vaccine Recipients. Those eligible for the extra shot would include adults over 65 and others at high risk — the same groups now eligible for a Pfizer-BioNTech boost. By Sharon LaFraniere and Noah Weiland, Oct. 
14, 2021\n", "\n", "[('F.D.A', 'NNP'), ('.', '.'), ('Panel', 'NNP'), ('Recommends', 'VBZ'), ('Booster', 'NNP'), ('for', 'IN'), ('Many', 'NNP'), ('Moderna', 'NNP'), ('Vaccine', 'NNP'), ('Recipients', 'NNP'), ('.', '.'), ('Those', 'DT'), ('eligible', 'JJ'), ('for', 'IN'), ('the', 'DT'), ('extra', 'JJ'), ('shot', 'NN'), ('would', 'MD'), ('include', 'VB'), ('adults', 'NNS'), ('over', 'IN'), ('65', 'CD'), ('and', 'CC'), ('others', 'NNS'), ('at', 'IN'), ('high', 'JJ'), ('risk', 'NN'), ('—', 'VBP'), ('the', 'DT'), ('same', 'JJ'), ('groups', 'NNS'), ('now', 'RB'), ('eligible', 'VBP'), ('for', 'IN'), ('a', 'DT'), ('Pfizer-BioNTech', 'JJ'), ('boost', 'NN'), ('.', '.'), ('By', 'IN'), ('Sharon', 'NNP'), ('LaFraniere', 'NNP'), ('and', 'CC'), ('Noah', 'NNP'), ('Weiland', 'NNP'), (',', ','), ('Oct.', 'NNP'), ('14', 'CD'), (',', ','), ('2021', 'CD')]\n" ] } ], "source": [ "print(NYT_Headline +\"\\n\")\n", "NYT_postags=nltk.pos_tag(nltk.word_tokenize(NYT_Headline))\n", "print(NYT_postags)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NNP: noun, proper, singular\n", " Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos\n", " Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA\n", " Shannon A.K.C. 
Meltex Liverpool ...\n" ] } ], "source": [ "nltk.help.upenn_tagset(NYT_postags[0][1])\n", "# For full list, use nltk.help.upenn_tagset()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[('F.D.A.', 'F.D.A.', 'PROPN', 'NNP'),\n", " ('Panel', 'Panel', 'PROPN', 'NNP'),\n", " ('Recommends', 'recommend', 'VERB', 'VBZ'),\n", " ('Booster', 'Booster', 'PROPN', 'NNP'),\n", " ('for', 'for', 'ADP', 'IN'),\n", " ('Many', 'many', 'ADJ', 'JJ'),\n", " ('Moderna', 'Moderna', 'PROPN', 'NNP'),\n", " ('Vaccine', 'Vaccine', 'PROPN', 'NNP'),\n", " ('Recipients', 'Recipients', 'PROPN', 'NNP'),\n", " ('.', '.', 'PUNCT', '.'),\n", " ('Those', 'those', 'DET', 'DT'),\n", " ('eligible', 'eligible', 'ADJ', 'JJ'),\n", " ('for', 'for', 'ADP', 'IN'),\n", " ('the', 'the', 'DET', 'DT'),\n", " ('extra', 'extra', 'ADJ', 'JJ'),\n", " ('shot', 'shot', 'NOUN', 'NN'),\n", " ('would', 'would', 'AUX', 'MD'),\n", " ('include', 'include', 'VERB', 'VB'),\n", " ('adults', 'adult', 'NOUN', 'NNS'),\n", " ('over', 'over', 'ADP', 'IN'),\n", " ('65', '65', 'NUM', 'CD'),\n", " ('and', 'and', 'CCONJ', 'CC'),\n", " ('others', 'other', 'NOUN', 'NNS'),\n", " ('at', 'at', 'ADP', 'IN'),\n", " ('high', 'high', 'ADJ', 'JJ'),\n", " ('risk', 'risk', 'NOUN', 'NN'),\n", " ('—', '—', 'PUNCT', ':'),\n", " ('the', 'the', 'DET', 'DT'),\n", " ('same', 'same', 'ADJ', 'JJ'),\n", " ('groups', 'group', 'NOUN', 'NNS'),\n", " ('now', 'now', 'ADV', 'RB'),\n", " ('eligible', 'eligible', 'ADJ', 'JJ'),\n", " ('for', 'for', 'ADP', 'IN'),\n", " ('a', 'a', 'DET', 'DT'),\n", " ('Pfizer', 'Pfizer', 'PROPN', 'NNP'),\n", " ('-', '-', 'PUNCT', 'HYPH'),\n", " ('BioNTech', 'BioNTech', 'PROPN', 'NNP'),\n", " ('boost', 'boost', 'NOUN', 'NN'),\n", " ('.', '.', 'PUNCT', '.'),\n", " ('By', 'by', 'ADP', 'IN'),\n", " ('Sharon', 'Sharon', 'PROPN', 'NNP'),\n", " ('LaFraniere', 'LaFraniere', 'PROPN', 'NNP'),\n", " ('and', 'and', 'CCONJ', 
'CC'),\n", " ('Noah', 'Noah', 'PROPN', 'NNP'),\n", " ('Weiland', 'Weiland', 'PROPN', 'NNP'),\n", " (',', ',', 'PUNCT', ','),\n", " ('Oct.', 'Oct.', 'PROPN', 'NNP'),\n", " ('14', '14', 'NUM', 'CD'),\n", " (',', ',', 'PUNCT', ','),\n", " ('2021', '2021', 'NUM', 'CD')]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# with spaCy\n", "[(token.text, token.lemma_, token.pos_, token.tag_) for token in docNYT]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Caution: Ambiguities" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "nltk: No verb!\n", "[('British', 'JJ'), ('Left', 'NNP'), ('Waffles', 'NNP'), ('on', 'IN'), ('Falkland', 'NNP'), ('Islands', 'NNP')]\n", "\n", "spaCy: The second meaning!\n", "[('British', 'NNP'), ('Left', 'VBD'), ('Waffles', 'NNP'), ('on', 'IN'), ('Falkland', 'NNP'), ('Islands', 'NNPS')]\n" ] } ], "source": [ "# POS taggers typically round up the usual suspects (i.e., they pick the most common usage of each word)\n", "\n", "guardian_headline=\"British Left Waffles on Falkland Islands\"\n", "# The headline has two possible readings.\n", "# To waffle: to speak or write at length in a vague or trivial manner, to fail to make up one's mind.\n", "# Reading 1: The British party of the left rambles indecisively about Falkland Islands policy.\n", "# Reading 2: The British forces left behind waffles (the pastry) on the Falkland Islands.\n", "\n", "print('nltk: No verb!')\n", "print(nltk.pos_tag(nltk.word_tokenize(guardian_headline)))\n", "print()\n", "print('spaCy: The second meaning!')\n", "docG=nlp(guardian_headline)\n", "print([(token.text, token.tag_) for token in docG])\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Classification & topic modeling" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { 
"slide_type": "fragment" } }, "source": [ "### Classification and topic modeling \n", "\n", "> Features \n", "> - words, word sequences, numbers, dates, which normalizations to do/not do, capitalization, POS, grammatical structure \n", ">\n", "> See an example of features used for spam detection \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Classification/Sentiment Analysis\n", "\n", "> We will use an IMDB reviews dataset from Kaggle for exploring a simple (only two classes, using balanced data) example of classification\n", "> https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "positive 25000\n", "negative 25000\n", "Name: sentiment, dtype: int64\n", " review sentiment\n", "8134 That's the worst film I saw since a long time.... 0\n", "5960 I remember this bomb coming out in the early 8... 0\n", "19706 Okay, I'll admit it--I am a goof-ball and I oc... 1\n", "12826 This film lingered and lingered at a small mov... 1\n", "39656 Two things haunt you throughout L'intrus (The ... 
1\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "imdb_df = pd.read_csv('data/IMDBPolarDataset.csv')\n", "# Data from https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews\n", "print(imdb_df['sentiment'].value_counts())\n", "#print()\n", "#print(imdb_df.sample(3))\n", "#print()\n", "imdb_df['sentiment'] = np.where(imdb_df['sentiment'] == 'positive', 1, 0)\n", "print(imdb_df.sample(5))\n", "imdb_df=imdb_df.sample(frac=0.2, random_state=0) \n", "# For scalability, it may make sense to use a smaller sample of data for the whole ML pipeline\n", "# You may have to experiment with the sample size to find a good balance" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Train and test \n", "- Cross-validation\n", " * Discussed in greater detail later in the module" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# Split data into training and test sets\n", "X_train, X_test, y_train, y_test = train_test_split(imdb_df['review'], \n", " imdb_df['sentiment'], \n", " random_state=0,test_size=0.2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### detour: CountVectorizer\n", "> Need to represent the information (features) in a manner that the ML algorithm can \"consume\"" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['and' 'document' 'first' 'fourth' 'is' 'not' 'one' 'second' 'the' 'this']\n", "[[0 1 1 0 1 0 0 0 1 1]\n", " [0 2 0 0 1 0 0 1 1 1]\n", " [1 0 0 1 1 1 1 0 1 1]\n", " [0 1 1 0 1 0 0 0 1 1]]\n", "(4, 10)\n" ] } ], "source": [ "# detour \n", "## Example code from 
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\n", "\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "corpus = [\n", " 'This is the first document.',\n", " 'This document is the second document.',\n", " 'And this is not the fourth one.',\n", " 'Is this the first document?',\n", "]\n", "vectorizer = CountVectorizer()\n", "X = vectorizer.fit_transform(corpus)\n", "print(vectorizer.get_feature_names_out())\n", "print(X.toarray())\n", "print(X.toarray().shape)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['document' 'fourth' 'second']\n", "[[1 0 0]\n", " [2 0 1]\n", " [0 1 0]\n", " [1 0 0]]\n", "(4, 3)\n" ] } ], "source": [ "# if you want to exclude stop words\n", "## be careful what you ask for: e.g., we lost 'not'\n", "vectorizerWSW = CountVectorizer(stop_words='english')\n", "X_WSW = vectorizerWSW.fit_transform(corpus)\n", "print(vectorizerWSW.get_feature_names_out())\n", "print(X_WSW.toarray())\n", "print(X_WSW.toarray().shape)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# if you want to customize your stopwords, you may want to start with the default, and see what you want to add/remove\n", "from sklearn.feature_extraction import text \n", "#print(sorted(list(text.ENGLISH_STOP_WORDS)))\n", "set_to_remove={\"no\",\"nor\",\"not\",\"none\",\"nooooo!\"}\n", "# converted to a list: recent scikit-learn versions accept only 'english', a list, or None for stop_words\n", "new_set=list(text.ENGLISH_STOP_WORDS - set_to_remove)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['document' 'fourth' 'not' 'second']\n", "[[1 0 0 0]\n", " [2 0 0 1]\n", " [0 1 1 0]\n", " [1 0 0 0]]\n", "(4, 4)\n" ] } ], "source": [ "## you can provide your own customized list\n", 
"vectorizerWSW = CountVectorizer(stop_words=new_set)\n", "X_WSW = vectorizerWSW.fit_transform(corpus)\n", "print(vectorizerWSW.get_feature_names_out())\n", "print(X_WSW.toarray())\n", "print(X_WSW.toarray().shape)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['document' 'document second' 'fourth' 'not' 'not fourth' 'second'\n", " 'second document']\n", "[[1 0 0 0 0 0 0]\n", " [2 1 0 0 0 1 1]\n", " [0 0 1 1 1 0 0]\n", " [1 0 0 0 0 0 0]]\n" ] } ], "source": [ "# if you want n-grams\n", "vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(1, 2), stop_words=new_set)\n", "# you can use 'char' and ‘char_wb’ options for analyzer for character (and respectively, characters restricted to word boundaries) based n-grams\n", "X2 = vectorizer2.fit_transform(corpus)\n", "print(vectorizer2.get_feature_names_out())\n", "print(X2.toarray())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Back to the IMDB data" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of tokens: 47536\n", "Runtime for vectorization fitting for Model 1: 1.4932281970977783 seconds\n", "Number of tokens: 660629\n", "Runtime for vectorization fitting for Model 2: 7.169947862625122 seconds\n", "Number of tokens: 30295\n", "Runtime for vectorization fitting for Model 1: 7.134037256240845 seconds\n" ] } ], "source": [ "import time\n", "start = time.time()\n", "cv = CountVectorizer().fit(X_train)\n", "# Other vectorizers are possible, e.g., TF-IDf \n", "print(f\"Number of tokens: {len(cv.get_feature_names_out())}\")\n", "end = time.time()\n", "print(f\"Runtime for vectorization fitting for Model 1: {end - start} seconds\")\n", "# Using words and word bi-grams \n", "start = 
time.time()\n", "cv2 = CountVectorizer(analyzer='word', ngram_range=(1, 2)).fit(X_train)\n", "print(f\"Number of tokens: {len(cv2.get_feature_names_out())}\")\n", "end = time.time()\n", "print(f\"Runtime for vectorization fitting for Model 2: {end - start} seconds\")\n", "# Remove tokens that appear in fewer than 5 reviews or in more than 50% of the reviews. \n", "# Use up to word tri-grams.\n", "start = time.time()\n", "cv3 = CountVectorizer(min_df=5, max_df=0.5, ngram_range=(1, 3), stop_words=new_set).fit(X_train)\n", "print(f\"Number of tokens: {len(cv3.get_feature_names_out())}\")\n", "end = time.time()\n", "print(f\"Runtime for vectorization fitting for Model 3: {end - start} seconds\")" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Runtime for transforming training data for Model 1: 2.2858214378356934 seconds\n", "Runtime for transforming training data for Model 2: 5.156646728515625 seconds\n", "Runtime for transforming training data for Model 3: 3.6959545612335205 seconds\n" ] } ], "source": [ "start = time.time()\n", "X_train_vec = cv.transform(X_train)\n", "end = time.time()\n", "print(f\"Runtime for transforming training data for Model 1: {end - start} seconds\")\n", "#\n", "start = time.time()\n", "X_train_vec2 = cv2.transform(X_train)\n", "end = time.time()\n", "print(f\"Runtime for transforming training data for Model 2: {end - start} seconds\")\n", "#\n", "start = time.time()\n", "X_train_vec3 = cv3.transform(X_train)\n", "end = time.time()\n", "print(f\"Runtime for transforming training data for Model 3: {end - start} seconds\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Detour: Logistic Regression" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "Third party content:\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LogisticRegression \n", "print(\"Third party content:\")\n", "# The StatQuest channel has many nice, easy to understand at an intuitive level videos. \n", "# Check them up for a quick introduction/brush-up of the underlying ideas (without the mathematical details)\n", "HTML('')" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Runtime for training Model 1: 9.124244213104248 seconds\n", "Runtime for training Model 2: 56.488693714141846 seconds\n", "Runtime for training Model 3: 1.4204068183898926 seconds\n" ] } ], "source": [ "# Train the models \n", "start = time.time()\n", "model1 = LogisticRegression(max_iter=1000)\n", "model1.fit(X_train_vec, y_train)\n", "end = time.time()\n", "print(f\"Runtime for training Model 1: {end - start} seconds\")\n", "###\n", "start = time.time()\n", "model2 = LogisticRegression(max_iter=1000)\n", "model2.fit(X_train_vec2, y_train)\n", "end = time.time()\n", "print(f\"Runtime for training Model 2: {end - start} seconds\")\n", "###\n", "start = time.time()\n", "model3 = LogisticRegression(max_iter=1000)\n", "model3.fit(X_train_vec3, y_train)\n", "end = time.time()\n", "print(f\"Runtime for training Model 3: {end - start} seconds\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Detour: Accuracy, Confusion Matrix and ROC/AUC" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Third party content on : AUC (Area Under The Curve) / ROC (Receiver Operating Characteristics)\n" ] }, { 
"data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(\"Third party content on : AUC (Area Under The Curve) / ROC (Receiver Operating Characteristics)\")\n", "HTML('')" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score\n", "from sklearn.metrics import roc_auc_score\n", "from sklearn.metrics import confusion_matrix \n", "from sklearn.metrics import ConfusionMatrixDisplay\n", "from sklearn.metrics import f1_score\n", "import matplotlib as mpl \n", "#import matplotlib.cm as cm \n", "import matplotlib.pyplot as plt \n", "#import itertools" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AUC (model with words): 0.8600351134074928\n", "Accuracy (model with words): 0.859\n", "Runtime for predicting with Model 1: 0.661902904510498 seconds\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Predict the transformed test documents\n", "start = time.time()\n", "predictions1 = model1.predict(cv.transform(X_test))\n", "print('AUC (model with words): ', roc_auc_score(y_test, predictions1))\n", "print(\"Accuracy (model with words): \", accuracy_score(y_test, predictions1))\n", "end = time.time()\n", "print(f\"Runtime for predicting with Model 1: {end - start} seconds\")\n", "ConfusionMatrixDisplay.from_predictions(y_test,predictions1,normalize=\"true\",cmap=plt.cm.Blues)\n", "# Absolute numbers can be used by removing attribute normalize=\"true\"" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAQwAAADzCAYAAABzPyjrAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAAaDklEQVR4nO3dd7hU1bnH8e/vnEMVpAgYrMGGXVQsWBBbxBJN1Khc21WvGmOs1yQac8USvaZojBrbtREVjY1Yo3IRYolBRRQpFq4FRUC6gggceO8fsw8eCczZAzNnn5nz++TZDzN79uz9Do95WXvttdariMDMLI2qrAMws/LhhGFmqTlhmFlqThhmlpoThpml5oRhZqnVZB2AWXNWveaGEbULUh0bC6Y/GxH9SxxSXk4YZhmK2q9ptfkxqY79evQNXUocToOcMMyyJEDKOorUnDDMsqby6Up0wjDLmlsYZpaOoKo66yBSc8Iwy5LwLYmZpSXfkphZAdzCMLPU3MIws1TkTk8zK4RvScwsHTlhmFkBqtyHYWZplNk4jPKJ1KxSSem2VKfSeZLGSRor6X5JrSV1ljRU0vvJn53qHX+RpImS3pV0QEPnd8Iwy1TylCTN1tCZpHWBs4HeEbE1UA0cA1wIDIuITYFhyXskbZl8vhXQH7hJUt4LOWGYZU1V6bZ0aoA2kmqAtsBnwGHAoOTzQcAPkteHAQ9ExMKI+BCYCOyc7+ROGGZZSns7kuKWJCImA78HJgFTgLkR8RywdkRMSY6ZAnRLvrIu8Em9U3ya7FupJtXpqZo2oZbtsw6jom3Tc/2sQ6h4n0z6mFkzZ6R/9JG+9dBF0uv13t8WEbctO02ub+IwoAcwB3hI0nH5rryCfXlLITathNGyPa16HpV1GBXtmRHXZh1Cxevfr09hX0g/NHxGRPTO8/l+wIcRMT13Wj0K7AZMk9Q9IqZI6g58nhz/KVD/X5D1yN3CrJRvScwypWL2YUwCdpXUVpKAfYEJwOPAickxJwKPJa8fB46R1EpSD2BT4NV8F2hSLQyzZkcUbS5JRIyU9DDwBlALjAZuA9oBD0o6hVxS+VFy/DhJDwLjk+PPjIgl+a7hhGGWqeIODY+IgcDA5XYvJ
NfaWNHxVwJXpj2/E4ZZ1jy93cxSK6Oh4U4YZllzC8PMUvECOmZWCLmFYWZp5ColOmGYWRpixQO0mygnDLNMyS0MM0vPCcPMUquq8jgMM0vDfRhmlpbch2FmhXDCMLPUnDDMLDUnDDNLRyBXPjOzNNzpaWYFccIws/TKJ184YZhlSuXVwiifMalmFaqqqirV1hBJPSW9WW/7QtK5LsZsViHqOj3TbA2JiHcjoldE9AJ2BL4ChuBizGYVRCm3wuwL/F9EfEwRizG7D8MsS4X1YeStrbqcY4D7k9ffKsYsqX4x5n/W+055FWM2a44KSBgN1VatO19L4FDgooYOXcG+vMWYfUtilrFi9WHUcyDwRkRMS95PS4ow42LMZmVOVUq1FWAA39yOgIsxm1WGVWg9NHS+tsD+wOn1dl+NizGbVYZiJoyI+ApYa7l9M3ExZrPKUE4jPZ0wzLJWPvnCCcMsa25hmFkqElR5AR0zS8cL6JhZAcooXzhh5HPGgL05/ge7QQTjJ37GmZffy8JFtZx61F6celRfapcsZehLYxl4Q24czFabrMO1Fw2gfbvWxNJgnxN/y8JFtRn/iqbt5795gOGvjGetju145u6fA3DWZX/mg0m5wYhfzFvAmu3a8NQdF/DWhI/55e8fAiAIzvn3Azhgz20zi71Y3MJISOoP/BGoBm6PiKtLeb1i6t61A6cfvRe7Hn0lXy9czJ1Xnczh39uRT6bM4qC9tmGPAf/NosW1dOnUDoDq6ipuvfxEfjzwz4x9fzKdOqzB4tq8Y2AMOLL/Tpzwwz244KrBy/bdMPCEZa+vvOkx2q/RGoDNenTnsVvPo6amms9nfsHBp/yefftsRU1N3hnZTZvKq4VRsqHhybz6P5Eb174lMCCZf182amqqad2qBdXVVbRt3ZKp0+dy8hF7ct2goSxanGs5zJg9D4B9dtmccRMnM/b9yQDMnjufpUvzzuMxYOftNqZj+7Yr/CwieHr4W3x/3x0AaNO65bLksHDR4rJ6HLkyAqqrlWprCko5l2RnYGJEfBARi4AHyM2/LwtTps/lhnuH8fYTV/DO367ki/kLGD7yHTbZsBt9em3M0Lsu4Mlbz2H7LTcAYOMNuxEBD19/JiPu+QVnH79fxr+g/L025gPW6tSOHut1XbbvzfEfc8C//4YDT/odvz7/yPJuXSRKMPmsZEqZMNYFPqn3vsG59k1Jh/ZtOKjvNvQ6bCBbHHgxbVu35KgDd6KmuoqO7duy/0m/55I//pW7rjoZgJrqanbdbiNO+6+7OfA/ruXgftvRd6fNMv4V5e3xYaM5NGld1Om15YY8e/cv+Out53HzfcNYuHBxRtEVSXJLkmZrCkqZMFLNtZd0mqTXJb0etQtKGE5h+u28OR9/NpOZc+ZRu2QpTwx/i5237cHkz+fwxPC3AHhj/McsjWCtju34bNocXh49kVlz57Ng4WKG/mMc2/Vcv4Gr2MrU1i7h2RfHcPDevVb4+SYbrk3b1i1598OpjRtYkQm3MOqkmmsfEbdFRO+I6K2aNiUMpzCfTp1F72160KZVCwD22qkn7344jadHjFnWcth4g260bFHDzDnzGPbP8Wy1ybq0Sfo8dt9hk7L/jzlLL496j4036Eb3bh2X7ftkykxqk47kyVNn8cEn01nvO51WcoZyUbw1PRtDKZ+SvAZsmsyzn0xuybB/K+H1imrUuI95fNhoRtz7C5YsWcqYdz9l0JCXiQhuvORY/vHAL1m0eAlnXHoPAHO/XMBNg59n2J9/DhEMfXkcz708LuNf0fSdffk9jHxzIrPnzme3Iy/jnJMO4OiDd+XJ59/k+/t8+3bk9bc/5JbBw6iprqaqSlx+7hF07tguo8iLp4nkglQUUbqefEkHAdeRe6x6ZzKVdqWq2naLVj2PKlk8Bh+MuDbrECpe/359eGv0qFRpoO26PWPz029Odd7RA/cdlWaJvlIq6TiMiHgaeLqU1zArZ3V9GOXCIz3NMlZG+cIJw
yxrbmGYWWpllC+cMMyyVG7rYbjMgFmmijsOQ1JHSQ9LekfSBEl9XIzZrIIUeWj4H4FnImJzYDtgAi7GbFY5itXCkLQm0Be4AyAiFkXEHIpYjNkJwyxLhU0+61I37yrZTlvubBsB04G7JI2WdLukNViuGDNQvxhzQRNE3elplqECB241VIy5BtgBOCsiRkr6I8ntR57LL8/FmM2asqoqpdpS+BT4NCJGJu8fJpdAXIzZrFIUqw8jIqYCn0jqmezal1zdVBdjNqsIxV8c5yzgPkktgQ+Ak8g1DFyM2azcqch1SSLiTWBF/RwuxmxWCTw03MxSqyqjjOGEYZahcptL4oRhlrEyyhcrTxiSbiDPII6IOLskEZk1M5WyHsbrjRaFWTNWRvli5QkjIgbVfy9pjYiYX/qQzJoPkXu0Wi4aHOmZzKcfT26aLJK2k3RTySMzaw4kqqvSbU1BmqHh1wEHADMBIuItclNozawIyqlUYqqnJBHxyXIdM3mHj5pZOqLyxmF8Imk3IJLx6WeT3J6Y2eoro3yR6pbkx8CZ5BbWmAz0St6bWRFUVG3ViJgBHNsIsZg1O02pfyKNNE9JNpL0hKTpkj6X9JikjRojOLPmoFpKtTUFaW5JBgMPAt2BdYCHgPtLGZRZc1JOtyRpEoYi4p6IqE22e2lg3T8zSyf3lCTd1hTkm0vSOXk5XNKFwAPkEsXRwFONEJtZ5WtCrYc08nV6jiKXIOp+zen1PgvgilIFZdaclFG+yDuXpEdjBmLWXFVKC2MZSVsDWwKt6/ZFxJ9LFZRZcyEo6jwRSR8BX5IbjV0bEb2T7oW/AN8FPgKOiojZyfEXAackx58dEc/mO3+ax6oDgRuSbW/gt8Chq/ZzzGx5SrkVYO+I6FWv6FGj1lY9ktyKw1Mj4iRyBV5bFRa/ma2IlJtLkmZbDY1aW3VBRCwFapNir5+Tq+FoZkVQ5NmqATwnaVS92quNWlv1dUkdgf8h9+RkHg1URzKz9Aro9Owiqf5KeLdFxG3LHbN7RHwmqRswVNI7+S69gn15x1ilmUvyk+TlLZKeAdaMiDENfc/MGiYKWhynoWLMRMRnyZ+fSxpC7hZjmqTuETGlZLVVJe2w/AZ0BmqS12a2ulLejqRphEhaQ1L7utfA94CxNFJt1WvyfBbAPg3+ggJtv8UGvDzyxmKf1urp1Of8rEOoeAvf+7Sg44s4DmNtYEhyvhpgcEQ8I+k1Sl1bNSL2Ls5vMLN80jx5SCMiPiD3FHP5/TNxbVWz8icqcKSnmZVOU5mJmoYThlmGpOIODS+1NEPDJek4SZck7zeQlHc0mJmlV07rYaTpb7kJ6AMMSN5/CfypZBGZNTOVVpdkl4jYQdJogIiYnZQbMLPVVIl1SRYnM9gCQFJXYGlJozJrRor1WLUxpIn1emAI0E3SlcBLwFUljcqsGamoW5KIuE/SKHIDPwT8ICJc+cysCKSmU2g5jQYThqQNgK+AJ+rvi4hJpQzMrLkoo3yRqg/jKb5ZDLg10AN4l9wqPWa2Giqu0zMitqn/PpmpevpKDjezApVRvih8pGdEvCFpp1IEY9bsNKFBWWmk6cOoPx+6CtgBmF6yiMyaEUGTqZuaRpoWRvt6r2vJ9Wk8UppwzJqfimlhJAO22kXEzxopHrNmpyKmt0uqiYhaL8dnVjp1xZjLRb4Wxqvk+ivelPQ48BAwv+7DiHi0xLGZVb4mNIozjTR9GJ2BmeTW8KwbjxGAE4ZZEVTKOIxuyROSsXy7ijs0ULvAzNLJ1VbNOor08oVaDbRLtvb1XtdtZrbaRFXKLfUZpWpJoyU9mbzvLGmopPeTPzvVO/YiSRMlvSvpgIbOna+FMSUiLk8dpZkVLLcIcNFPew4wAVgzeV9XjPlqSRcm73+xXDHmdYD/lbRZvlID+VoY5XNjZVauUi7Pl/ZJiqT1gIOB2+vtbpRizCusY2BmxVXk6u3XAT/n24tcFa0Y80oTRkTMShuhma2aXKenUm0kx
Zjrbad961zSIcDnETGqgMsvb/WKMZtZaRXQh9FQMebdgUMlHURuKYo1Jd1LYxRjNrPSE7n/E6bZGhIRF0XEehHxXXKdmc9HxHE0UjFmMys1NcpckqspdTFmM2scpUgXETECGJG8djFms0pQcUv0mVlpVcpsVTMrOVXGehhmVnp1T0nKhROGWcbcwjCz1MonXThhmGWrccZhFI0ThlmGKrHMgJmVUPmkCycMs8yVUQPDCSOfn15+L8++NJYundrzyl8uBmD23Pmc/Ms7mTRlFht078xd/30KHddsy6LFtZx31f2MnjCJqqoqrv7PI9hjx80y/gXl4Yxj+nL893eFCMb/3xTOvPIBzj1+X044bFdmzp4HwBW3PM3QVybQb6fNGPiTg2nZooZFi2u55MYneHHUxIx/warLPVYtn4xRskfAku6U9LmksaW6RqkNOGRXHr7+zG/t+8OgofTdqSejHh1I35168odBzwEwaMjLAPzjgYsZcuNP+dV1Q1i6dOm/nNO+rXvXDpz+oz3Z5+Q/sNtxv6OquorD99segJsf+Dt9T7yGvidew9BXJgAwc+58BvzsDnY/7nf85Ir7uWXgsVmGXxRSuq0pKOWYkbuB/iU8f8ntvsMmdFqz7bf2/e3vYxhwyC4ADDhkF54eMQaAdz+cSt+degLQtXN7OrRrw+gJkxo34DJVU11F61YtqK6uom3rFkydMXelx7793mSmzvgCgAkfTKV1yxpatqhurFBLIN1qW01lvknJEkZEvABU3Kpdn8/6ku906QDAd7p0YPrsLwHYetN1+dsLb1Nbu4SPJ8/gzXc+YfK02VmGWhamTJ/LDYNH8PaQ/+KdJy7li3lfM/zV9wA49cg9eOmeC7jh4qPp0L7Nv3z30L23Zcx7k1m0OO+M7Cat7pakmKuGl1Lmo1IlnVa35Nj0GeVbFP64Q/uwTreO7H3Cb7no2kfYedse1FSX8798jaND+zYctOfW9Dri12zx/Utp27olRx2wI3c++jLbH3kle55wDdNmfMGvzz70W9/bvMfaXPqTQzjvNw9lFHmRpLwdaSINjOwTRkTcFhG9I6J31y5dsw6nQd06t1/WZJ46Yy5dO+WK29fUVHPV+Ufw4uCLGHzN6cz9cgEbrd/0f0/W+u20GR9PmcXMOfOpXbKUJ/7+Njtv812mz57H0qVBRDDosX+y4xYbLPvOOl07cM/VJ3HGFYP5aPLMDKMvDieMCta/7zbc/+RIAO5/ciQH7rUtAF99vYj5CxYCMHzkBGpqqth8o+6ZxVkuPp06m95bbUibVi0A2Kv3prz70TTWXqv9smMO6bcNEz6YCsCa7Vrzl2tO5fKbn2bkmI+yCLnolPJ/TYEfq+ZxysV38fKo95k5Zx5bHfwrLjztIM47cX9OuuhO7n38FdZbuxN3X30KADNmfckRZ/2JqirRvWtHbrnsxAbObgCjxk/i8eFvMWLQ+SypXcqY9yYz6LFXuP6io9lms3WJCCZNmbXs1uPUI/egx3pr8bOT9udnJ+0PwOHn3sqM5PFruSm36u2KKE2ZVEn3A/2ALsA0YGBE3JHvOzvu2DteHvl6SeKxnE59zs86hIq3cPx9LJ0/LVUa6Ll1r7jlkWGpzrvP5l1GNbBqeMmVrIUREQNKdW6zStJUbjfScB+GWYbqbkmKUSpRUmtJr0p6S9I4SZcl+4tWjNkJwyxTabs8U7VCFgL7RMR2QC+gv6Rd+aYY86bAsOQ9yxVj7g/cJCnvWAAnDLMsFXEcRuTU9f62SLagkYoxm1kjUMqNBmqrAkiqlvQmuXKIQyNiJEUsxuzHqmYZKnABnYZqq5JULuslqSMwRNLWDVz+X06R7/xuYZhlrYAmRloRMYdc5bP+JMWYAVyM2azMFavTU1LXpGWBpDbAfsA7uBizWeUo4jyR7sCg5ElHFfBgRDwp6RVcjNmsMhQrX0TEGGD7Fex3MWazSiBcZsDM0mpCU9fTcMIwy1gZ5QsnDLPMlVHGcMIwy
1TTWRwnDScMs4y5D8PMUsk9Jck6ivScMMwy5lsSM0vNLQwzS62M8oUThlmmVmEmapacMMwy5j4MM0ul3OqSOGGYZc0Jw8zS8i2JmaXmx6pmlloZ5QsnDLMseQEdM0vPC+iYWSHKKF+4zIBZ5opUl0TS+pKGS5qQFGM+J9nvYsxmlaGoxZhrgf+MiC2AXYEzk4LLLsZsVimKWIx5SkS8kbz+EphArlaqizGbVYK6BXRSJowGizEvO6/0XXI1SlyM2aySFDDSs8FizACS2gGPAOdGxBd5Htu6GLNZuSnWLUnuXGpBLlncFxGPJrtdjNmsUhSreLtyTYk7gAkRcW29j1yM2awiFHfg1u7A8cDbkt5M9v0SuBoXYzarFMXJGBHxUp6TuRizWbnzAjqr4Y03Rs1o00IfZx1HAboAM7IOosKV49/xhoUc7LkkqygiumYdQyEkvZ7mMZetuubwd+wFdMwsvfLJF04YZlkro3zhhLGabss6gGagov+OJagqo04MJ4zVEBEV/R9zU9As/o7LJ184YZhlrYzyhYeGrypJ/ZNFRyZKujDreCqNpDslfS5pbNaxlFox55KUmhPGKkgWGfkTcCCwJTAgWYzEiuducou6VLiiLqBTck4Yq2ZnYGJEfBARi4AHyC1GYkUSES8As7KOo9QKXA8jc04Yq6bghUfMVqacEoY7PVdNwQuPmK1MU7ndSMMJY9UUvPCI2Qo1odZDGr4lWTWvAZtK6iGpJbmVlx/POCYrQ2kXz2kqOcUJYxVERC3wU+BZciszPxgR47KNqrJIuh94Begp6dNk8ZfKVEYZw7ckqygingaezjqOShURA7KOobF4aLiZpVY+6cIJwyx7ZZQxnDDMMlZOj1UV4eEDZlmR9Ay5ZQjTmBERmQ6Xd8Iws9T8WLXEJC2R9KaksZIektR2Nc51t6Qjk9e355vwJqmfpN1W4RofSfqXf/FWtn+5Y+YVeK1LJV1QaIyWHSeM0lsQEb0iYmtgEfDj+h8mM18LFhH/ERHj8xzSDyg4YZjl44TRuF4ENkn+9R8uaTC5KlXVkn4n6TVJYySdDrnSd5JulDRe0lN8U3UbSSMk9U5e95f0hqS3JA1LKnf/GDgvad3sKamrpEeSa7wmaffku2tJek7SaEm3kqLPXtJfJY2SNG75CuKSrkliGSapa7JvY0nPJN95UdLmRfnbtMYXEd5KuAHzkj9ryNW0PIPcv/7zgR7JZ6cBv0petwJeB3oAhwNDgWpgHWAOcGRy3AigN9CV3MzZunN1Tv68FLigXhyDgT2S1xuQq78JcD1wSfL6YHKT6Lqs4Hd8VLe/3jXaAGOBtZL3ARybvL4EuDF5PQzYNHm9C/D8imL01vQ3P1YtvTb16ly+SK5Y7m7AqxHxYbL/e8C2df0TQAdyhXH7AvdHrt7lZ5KeX8H5dwVeqDtXRKxsDYn9gC31zajCNSW1T65xePLdpyTNTvGbzpb0w+T1+kmsM4GlwF+S/fcCj0pql/zeh+pdu1WKa1gT5IRRegsiolf9Hcn/cebX3wWcFRHPLnfcQTQ8bV4pjoHc7WefiFiwglhSPyqT1I9c8ukTEV9JGgG0XsnhkVx3zvJ/B1ae3IfRNDwLnCGpBYCkzSStAbwAHJP0cXQH9l7Bd18B9pLUI/lu52T/l0D7esc9R27CHMlxvZKXLwDHJvsOBDo1EGsHYHaSLDYn18KpUwXUtZL+DXgpIr4APpT0o+QakrRdA9ewJsoJo2m4HRgPvJEsensrudbfEOB94G3gZuDvy38xIqaT6wN5VNJbfHNL8ATww7pOT+BsoHfSqTqeb57WXAb0lfQGuVujSQ3E+gxQI2kMcAXwz3qfzQe2kjQK2Ae4PNl/LHBKEt84vJxh2fLALTNLzS0MM0vNCcPMUnPCMLPUnDDMLDUnDDNLzQnDzFJzwjCz1JwwzCy1/weM5ZB65vBMwgAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Alternative: Plotting using a precomputed confusion matrix\n", "cm = confusion_matrix(y_test, predictions1, labels=model1.classes_) \n", "# confusion_matrix can also use attribute normalize=\"true\"\n", "fig, ax = plt.subplots(figsize=(4,4))\n", "ConfusionMatrixDisplay(confusion_matrix=cm,display_labels=model1.classes_).plot(cmap=plt.cm.Blues,ax=ax) " ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AUC (with word bigrams): 0.8801952770163418\n", "Accuracy (with word bigrams): 0.8795\n", "Runtime for predicting with Model 2: 2.022491455078125 seconds\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Using model 2 (words and word bi-grams)\n", "start = time.time()\n", "predictions2 = model2.predict(cv2.transform(X_test))\n", "print('AUC (with word bigrams): ', roc_auc_score(y_test, predictions2))\n", "print(\"Accuracy (with word bigrams): \", accuracy_score(y_test, predictions2))\n", "end = time.time()\n", "print(f\"Runtime for predicting with Model 2: {end - start} seconds\")\n", "ConfusionMatrixDisplay.from_predictions(y_test,predictions2,normalize=\"true\",cmap=plt.cm.Blues)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AUC (with filtering and upto trigrams): 0.8735486675233028\n", "Accuracy (with filtering and upto trigrams): 0.873\n", "Runtime for predicting with Model 3: 1.1680829524993896 seconds\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "### Using words bi/tri-grams and frequency based filtering \n", "start = time.time()\n", "predictions3 = model3.predict(cv3.transform(X_test))\n", "print('AUC (with filtering and upto trigrams): ', roc_auc_score(y_test, predictions3))\n", "print(\"Accuracy (with filtering and upto trigrams): \", accuracy_score(y_test, predictions3))\n", "end = time.time()\n", "print(f\"Runtime for predicting with Model 3: {end - start} seconds\")\n", "ConfusionMatrixDisplay.from_predictions(y_test,predictions3,normalize=\"true\",cmap=plt.cm.Blues)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Some other popular/baseline models for text classification" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Third party content:\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# from nltk.classify import SklearnClassifier\n", "from sklearn.naive_bayes import MultinomialNB\n", "print(\"Third party content:\")\n", "HTML('')" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Third party content:\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.svm import SVC\n", "# The following YouTube video on Suppor Verctor Machines provides a very nice high level overview\n", "# Even if you are familiar with SVM/SVC, you may gain some nice perspectives \n", "# The presenter has a sense of humor (Bam!) 
which is a bit weird, but as long as you like/ignore it, the technical coverage for the intuitions is nice.\n", "print(\"Third party content:\")\n", "HTML('')" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Runtime for fitting NB: 0.014167547225952148 seconds\n", "AUC (Naive): 0.8308722567024445\n", "Runtime for predicting with NB: 0.8845677375793457 seconds\n" ] } ], "source": [ "start = time.time()\n", "modelNB = MultinomialNB().fit(X_train_vec,y_train)\n", "end = time.time()\n", "print(f\"Runtime for fitting NB: {end - start} seconds\")\n", "#\n", "start = time.time()\n", "predictionsNB = modelNB.predict(cv.transform(X_test))\n", "print('AUC (Naive): ', roc_auc_score(y_test, predictionsNB))\n", "end = time.time()\n", "print(f\"Runtime for predicting with NB: {end - start} seconds\")\n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Runtime for fitting NB: 0.025911569595336914 seconds\n", "AUC (Naive): 0.8501495775073887\n", "Runtime for predicting with NB: 1.344125509262085 seconds\n" ] } ], "source": [ "start = time.time()\n", "modelNB = MultinomialNB().fit(X_train_vec3,y_train)\n", "end = time.time()\n", "print(f\"Runtime for fitting NB: {end - start} seconds\")\n", "start = time.time()\n", "predictionsNB = modelNB.predict(cv3.transform(X_test))\n", "print('AUC (Naive): ', roc_auc_score(y_test, predictionsNB))\n", "end = time.time()\n", "print(f\"Runtime for predicting with NB: {end - start} seconds\")" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Runtime for fitting SVC: 48.139628410339355 seconds\n", "AUC (SVC): 0.8429310982003627\n", "Runtime for predicting with SVC: 
9.894097089767456 seconds\n" ] } ], "source": [ "start = time.time()\n", "modelSVC = SVC(kernel=\"linear\").fit(X_train_vec,y_train)\n", "end = time.time()\n", "print(f\"Runtime for fitting SVC: {end - start} seconds\")\n", "#\n", "start = time.time()\n", "predictionsSVC = modelSVC.predict(cv.transform(X_test))\n", "print('AUC (SVC): ', roc_auc_score(y_test, predictionsSVC))\n", "end = time.time()\n", "print(f\"Runtime for predicting with SVC: {end - start} seconds\")" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Runtime for fitting SVC: 39.08421754837036 seconds\n", "AUC (SVC): 0.8604116861746717\n", "Runtime for predicting with SVC: 8.268787860870361 seconds\n" ] } ], "source": [ "start = time.time()\n", "modelSVC = SVC(kernel=\"linear\").fit(X_train_vec3,y_train)\n", "end = time.time()\n", "print(f\"Runtime for fitting SVC: {end - start} seconds\")\n", "start = time.time()\n", "predictionsSVC = modelSVC.predict(cv3.transform(X_test))\n", "print('AUC (SVC): ', roc_auc_score(y_test, predictionsSVC))\n", "end = time.time()\n", "print(f\"Runtime for predicting with SVC: {end - start} seconds\")" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "#from sklearn.datasets import fetch_20newsgroups\n", "#newsgroups_train = fetch_20newsgroups(subset='train')\n", "#print(newsgroups_train.target_names)\n", "# print(newsgroups_train.data[10])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### A few words on ... 
\n", "\n", "> Word embedding: Dimensionality reduction\n", "\n", "\n", "> Using pretrained models" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# The following code on using embedding for vectorization is adapted from:\n", "# https://github.com/practical-nlp\n", "\n", "import os\n", "import wget\n", "import gzip\n", "import shutil\n", "\n", "gn_vec_path = \"GoogleNews-vectors-negative300.bin\"\n", "gn_vec_zip_path = \"GoogleNews-vectors-negative300.bin.gz\"\n", "# Download the archive only if it is missing, then decompress it once\n", "if not os.path.exists(gn_vec_path):\n", "    if not os.path.exists(gn_vec_zip_path):\n", "        wget.download(\"https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz\")\n", "    with gzip.open(gn_vec_zip_path, 'rb') as f_in:\n", "        with open(gn_vec_path, 'wb') as f_out:\n", "            shutil.copyfileobj(f_in, f_out)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of words in vocabulary: 3000000\n" ] } ], "source": [ "from gensim.models import Word2Vec, KeyedVectors\n", "pretrainedpath = gn_vec_path\n", "w2v_model = KeyedVectors.load_word2vec_format(pretrainedpath, binary=True) # load the model\n", "print(\"Number of words in vocabulary: \",len(w2v_model.index_to_key)) # number of words in the vocabulary" 
] }, { "cell_type": "code", "execution_count": 44, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "300\n" ] }, { "data": { "text/plain": [ "array([ 0.06005859, -0.07763672, -0.01135254, 0.08544922, -0.08154297,\n", " -0.13671875, -0.21875 , -0.05859375, -0.06176758, 0.14257812],\n", " dtype=float32)" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(len(w2v_model['Singapore']))\n", "w2v_model['Singapore'][:10]" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('Hong_Kong', 0.7355004549026489), ('Malaysia', 0.7280135154724121), ('Kuala_Lumpur', 0.6947176456451416), ('Singaporean', 0.680225670337677), ('SIngapore', 0.665448784828186), (\"S'pore\", 0.6562163233757019), ('Malaysian', 0.6307142972946167), ('Inkster_Miyazato', 0.6058071851730347), ('Exchange_Ltd._S##.SG', 0.6021658778190613), ('GP_Pte_Ltd', 0.6005238890647888)]\n", "\n", "[('malaysia', 0.647721529006958), ('hong_kong', 0.6318016052246094), ('malaysian', 0.6025472283363342), ('australia', 0.597445011138916), ('uae', 0.5960760116577148), ('india', 0.5947676301002502), ('uk', 0.5883470773696899), ('chinese', 0.5872371792793274), ('usa', 0.583607017993927), ('simon', 0.5695350766181946)]\n" ] } ], "source": [ "print(w2v_model.most_similar('Singapore'))\n", "print()\n", "print(w2v_model.most_similar('singapore'))" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.060058594\n", "0.06933594\n" ] } ], "source": [ "print(w2v_model['Singapore'][0])\n", "print(w2v_model['singapore'][0])" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "0.546599805355072\n", "0.4534002\n" ] } ], "source": [ "print(w2v_model.distance('Singapore','Japan'))\n", "print(w2v_model.similarity('Singapore','Japan'))" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to\n", "[nltk_data] C:\\Users\\datta\\AppData\\Roaming\\nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] } ], "source": [ "# This code may need to be refactored together with other code in this notebook\n", "nltk.download('stopwords')\n", "from nltk.tokenize import word_tokenize\n", "from nltk.corpus import stopwords\n", "from string import punctuation\n", "\n", "def preprocess_corpus(texts):\n", "    mystopwords = set(stopwords.words(\"english\"))\n", "    def remove_stops_digits(tokens):\n", "        # Nested function that lowercases tokens and removes stopwords, digits and punctuation\n", "        return [token.lower() for token in tokens if token.lower() not in mystopwords and not token.isdigit()\n", "                and token not in punctuation]\n", "    # The return statement below applies the nested function to the word_tokenize output of each text\n", "    return [remove_stops_digits(word_tokenize(text)) for text in texts]" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def embedding_feats(list_of_lists):\n", "    DIMENSION = 300  # dimensionality of the pretrained embedding space\n", "    feats = []\n", "    for tokens in list_of_lists:\n", "        feat_for_this = np.zeros(DIMENSION)\n", "        count_for_this = 0 + 1e-5  # small epsilon to avoid divide-by-zero\n", "        for token in tokens:\n", "            if token in w2v_model:\n", "                feat_for_this += w2v_model[token]\n", "                count_for_this += 1\n", "        # Average of the in-vocabulary token vectors; remains all zeros if no token matched\n", "        feats.append(feat_for_this / count_for_this)\n", "    return feats" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "Xtrain_prepro = preprocess_corpus(X_train)\n", "Xtest_prepro = preprocess_corpus(X_test)\n", "train_vect = embedding_feats(Xtrain_prepro)\n", "test_vect = embedding_feats(Xtest_prepro)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AUC (with pretrained embedding): 0.8476652989196567\n", "Accuracy (with pretrained embedding): 0.847\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "model1.fit(train_vect, y_train) # Recall: It was the basic logistic regression model\n", "predictions = model1.predict(test_vect)\n", "print('AUC (with pretrained embedding): ', roc_auc_score(y_test, predictions))\n", "print(\"Accuracy (with pretrained embedding): \", accuracy_score(y_test, predictions))\n", "ConfusionMatrixDisplay.from_predictions(y_test,predictions,normalize=\"true\",cmap=plt.cm.Blues)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Topic modeling and document similarity" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# A few more toy documents\n", "EE2 =\"Colin Powell, America’s former secretary of state and chairman of the joint chiefs of staff, died from complications related to covid-19, aged 84. He was the first black man to hold either position. George H.W. Bush appointed him as the youngest person ever to serve as America’s highest-ranking military officer in 1989. Under George W. Bush he was the county’s top diplomat from 2001 to 2005.\"\n", "EE3 = \"The world’s biggest streaming service disappointed Wall Street last quarter when it revealed tepid growth in subscribers. Its earnings report today should tell a different tale. The main driver of subscriptions is new content. Though scarce earlier this year because of the disruption caused by covid-19, the pipeline is filling up again. 
Old favourites like “Money Heist” and “Sex Education” are back with new episodes, and “Squid Game”, a dystopic South Korean thriller, has proved to be Netflix’s biggest-ever hit.\"\n", "doc1 = nlp(EE1) \n", "doc2 = nlp(EE2)\n", "doc3 = nlp(EE3)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Similarity between documents 1 2 : 0.029378258437261235\n", "Similarity between documents 1 3 : 0.0\n", "Similarity between documents 2 3 : 0.024661696851234052\n" ] } ], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "corpus = [EE1,EE2,EE3]\n", "vect = TfidfVectorizer(min_df=1, stop_words=\"english\") \n", "tfidf = vect.fit_transform(corpus) \n", "pairwise_similarity = tfidf * tfidf.T \n", "\n", "import itertools\n", "for i,j in itertools.product(range(3), range(3)):\n", "    if i < j:\n", "        print(\"Similarity between documents\", i+1, j+1, \":\", pairwise_similarity[i,j])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Key ideas/intuitions: \n", "1. Each document is a mix of topics (following some \"unknown\" distribution)\n", "    * Topics are not explicitly defined, but we have an \"estimate\" of the number of topics\n", "2. 
Each topic is a mix of words (again, following some \"unknown\" distribution) " ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Third party content:\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The following YouTube video on Latent Dirichlet allocation provides a very nice high level overview\n", "# Even if you are familiar with LDA and topic modeling, you may gain some nice perspectives \n", "print(\"Third party content:\")\n", "HTML('')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![LDA Big Idea.png](pics/LDA-overview.png)\n", "Image source: Above mentioned YouTube video" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7769 total train documents\n", "3019 total test documents\n", "['training/1', 'training/10', 'training/100', 'training/1000', 'training/10000', 'training/10002', 'training/10005', 'training/10008', 'training/10011', 'training/10014']\n", "['MOST', 'EC', 'STATES', 'SAID', 'TO', 'BE', 'AGAINST', 'OILS', '/', 'FATS', 'TAX', 'A', 'majority', 'of', 'European', 'Community', '(', 'EC', ')', 'member', 'states', 'are', 'either', 'against', 'or', 'have', 'strong', 'reservations', 'over', 'a', 'tax', 'on', 'both', 'imported', 'and', 'domestically', '-', 'produced', 'oils', 'and', 'fats', 'proposed', 'by', 'the', 'European', 'Commission', ',', 'senior', 'diplomats', 'said', '.', 'They', 'said', 'a', 'special', 'committee', 'of', 'agricultural', 'experts', 'from', 'EC', 'member', 'states', 'had', 'voiced', 'strong', 'objections', 'over', 'the', 'measure', 'during', 'a', 'meeting', 'charged', 'with', 'preparing', 'the', 'ground', 'for', 
'the', 'annual', 'EC', 'farm', 'price', '-', 'fixing', 'which', 'begins', 'next', 'Monday', '.', 'They', 'added', 'that', 'only', 'France', 'and', 'Italy', 'had', 'indicated', 'they', 'would', 'support', 'the', 'Commission', 'proposal', 'which', 'would', 'lead', 'to', 'a', 'tax', 'initially', 'of', '330', 'Ecus', 'per', 'tonne', 'during', 'the', '1987', '/', '88', 'price', 'round', '.']\n" ] } ], "source": [ "from nltk.corpus import reuters\n", "documents = reuters.fileids()\n", "train_docs = [doc for doc in documents if doc.startswith(\"train\")]\n", "test_docs = [doc for doc in documents if doc.startswith(\"test\")]\n", "print(str(len(train_docs)) + \" total train documents\")\n", "print(str(len(test_docs)) + \" total test documents\")\n", "print(train_docs[0:10])\n", "# Print all tokens of one sample document\n", "print(list(reuters.words(train_docs[10])))" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Some of the code fragments below on LDA for topic modeling are adapted from https://radimrehurek.com/gensim/models/ldamodel.html\n", "\n", "from gensim.corpora.dictionary import Dictionary\n", "import gensim\n", "from gensim import corpora, models\n", "\n", "import random" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['alarming', 'problem', 'loom', 'Iran', 'rapidly', 'advance', 'nuclear', 'programme', 'Israel', 'threaten', 'military', 'action', 'America', 'seek', 'diplomatic', 'solution', 'Anton', 'La', 'Guardia', 'Economist', 'diplomatic', 'editor', 'write', 'international', 'crisis', 'pose', 'Al', 'Monitor', 'publication', 'cover', 'Middle', 'East', 'monitor', 'negotiation', 'revive', 'nuclear', 'deal', '2015', 'fiendish', 'dilemma', 'french', 'president', 'Nicolas', 'Sarkozy', 
'encapsulate', 'iranian', 'bomb', 'bombing', 'Iran', 'Iran', 'nuclear', 'weapon', 'situation', 'way', 'bad', 'past', 'country', 'close', 'able', 'nuke', 'expert', 'David', 'Albright', 'breakout', \"time'—the\", 'time', 'need', 'bomb', 'worth', 'highly', 'enrich', 'uranium', '—', 'month', 'Mr', 'Albright', 'book', 'detailed', 'account', 'iranian', 'programme', 'draw', 'year', 'inspection', 'International', 'Atomic', 'Energy', 'Agency', 'trove', 'iranian', 'document', 'obtain', 'israeli', 'intelligence', 'problem', 'credibility', 'american', 'diplomacy', 'damage', 'Donald', 'Trump', 'repudiation', 'nuclear', 'deal', 'successor', 'Joe', 'Biden', 'try', 'revive', 'Iran', 'interested', 'Robert', 'Malley', 'America', 'special', 'envoy', 'Iran', 'discuss', 'impasse', 'recent', 'interview', 'Iran', 'claim', 'seek', 'nuclear', 'energy', 'civilian', 'purpose', 'trouble', 'technology', 'use', 'low', 'enriched', 'uranium', 'fuel', 'nuclear', 'power', 'station', 'use', 'highly', 'enrich', 'uranium', 'weapon', 'ultimate', 'aim', 'Iran', 'regime', 'attach', 'great', 'national', 'pride', 'mastery', 'nuclear', 'technology', 'moon', 'shoot', 'iranian', 'official', 'accord', 'Channel', 'memoir', 'William', 'Burns', 'head', 'CIA', 'diplomat', 'helped', 'negotiate', 'nuclear', 'deal']\n" ] } ], "source": [ "# Below is a module which can be used as a possible solution for ungraded task 6.1 \n", "# Note: Most of the text_scrubber code is from https://towardsdatascience.com/text-normalization-with-spacy-and-nltk-1302ff430119\n", "# The punctuation removal in that code was inadequate for my purposes, so I replaced it with something more robust\n", "\n", "def text_scrubber(text):\n", "    import string\n", "    doc = nlp(text)\n", "\n", "    # Tokenization and lemmatization\n", "    lemma_list = [token.lemma_ for token in doc]\n", "\n", "    # Filter the stopwords\n", "    filtered_sentence = [word for word in lemma_list if not nlp.vocab[word].is_stop]\n", "\n", 
"    # Filter the punctuation marks (exact single-character matches only)\n", "    punctuation_marks = set(string.punctuation)\n", "    filtered_sentence = [word for word in filtered_sentence if word not in punctuation_marks]\n", "\n", "    return filtered_sentence\n", "\n", "print(text_scrubber(EE1))" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[['CHINA', 'SOYBEAN', 'OUTPUT', 'SLIGHTLY', 'USDA', 'REPORT', 'China', 'soybean', 'crop', 'year', 'forecast', 'mln', 'tonnes', 'slightly', 'mln', 'estiamted', 'year', 'Agriculture', 'Department', 'officer', 'Peking', 'field', 'report', 'report', 'dated', 'April', 'Chinese', 'imports', 'year', 'projected', '300', '000', 'tonnes', 'unchanged', 'year', 'level', 'Exports', 'forecast', 'increase', 'mln', 'tonnes', '800', '000', 'tonnes', 'exported', 'year', 'report', 'Imports', 'soybean', 'oil', 'estimated', '200', '000', 'tonnes', 'unchanged', 'year'], ['DATA', 'MEASUREMENT', 'CORP', 'DMCB', '4TH', 'QTR', 'Shr', 'cts', 'cts', 'Net', '516', '063', '328', '468', 'Revs', 'mln', 'mln', 'NOTE', 'Shrs', 'reflect', 'stock', 'split']]\n" ] } ], "source": [ "# Prepare the initial set of documents to be used for LDA topic modeling\n", "docs=[[y for y in reuters.words(x) if len(text_scrubber(y))>0 and len(y)>2] for x in random.sample(train_docs, 100)]\n", "print(random.sample(docs,2))" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of tokens: 3176\n", "Some random tokens: [('discipline', 2249), ('478', 2021), ('soybeans', 846), ('IMPORT', 2405), ('housing', 635), ('metals', 1891), ('Reagan', 552), ('volatily', 1862), ('holders', 1045), ('delay', 88), ('violent', 2283), ('October', 1259), ('inquiries', 2165), ('estimation', 1575), ('member', 139), ('options', 
402), ('somebody', 2593), ('Federal', 870), ('wonderful', 3157), ('Terms', 1358)]\n" ] } ], "source": [ "dct = Dictionary(docs)\n", "common_corpus = [dct.doc2bow(text) for text in docs]\n", "lda = models.ldamodel.LdaModel(common_corpus, num_topics=20)\n", "print(\"Number of tokens:\", len(dct.token2id))\n", "print(\"Some random tokens:\", random.sample(list(dct.token2id.items()),20))\n", "# dct.id2token is only populated lazily, so build the inverse (id -> token) mapping explicitly\n", "inv_map = {v: k for k, v in dct.token2id.items()}" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(1, 0.14542003), (2, 0.14289911), (4, 0.042802583), (6, 0.20522758), (8, 0.06020278), (9, 0.018474184), (10, 0.121667735), (12, 0.081962936), (13, 0.030592296), (17, 0.14429606)]\n", "[(0, 0.10591578), (1, 0.52794284), (2, 0.31601062)]\n", "[(1, 0.08655172), (8, 0.09062548), (12, 0.24896422), (13, 0.33639774), (17, 0.10311288), (19, 0.09310416)]\n" ] } ], "source": [ "# Let's look at the topic distribution of some documents\n", "new_docs=[text_scrubber(EE1),text_scrubber(EE2),text_scrubber(EE3)]\n", "new_corpus = [dct.doc2bow(text) for text in new_docs]\n", "topic_prob = [lda[bow] for bow in new_corpus]  # topic probability distribution of each document\n", "for x in topic_prob:\n", "    print(x)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('mln', 0.017560491), ('year', 0.013100874), ('bpd', 0.0078046345), ('1987', 0.0055634426), ('pct', 0.005427841), ('beet', 0.004666399), ('market', 0.0043878434), ('land', 0.0041444087), ('cane', 0.004132074), ('farmers', 0.0039663226)]\n" ] } ], "source": [ "# Pick the most probable topic of the first document and show its top words\n", "most_likely_topic = max(topic_prob[0], key=lambda x: x[1])[0]\n", "print([(inv_map[int(x[0])], x[1]) for x in lda.show_topic(most_likely_topic, 
topn=10)])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![LDA-simplex](pics/lda-space.png)\n", "Image source: Above mentioned YouTube video" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### A few practicalities to consider\n", "\n", "- The suitability of the training corpus\n", "- A good guess of the number of topics or, more pragmatically, a proper exploration of it" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Named entities " ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('Iran', 'GPE')\n", "('Israel', 'GPE')\n", "('America', 'GPE')\n", "('Anton La Guardia', 'PERSON')\n", "('Al-Monitor', 'PERSON')\n", "('the Middle East', 'LOC')\n", "('2015', 'DATE')\n", "('French', 'NORP')\n", "('Nicolas Sarkozy', 'PERSON')\n", "('Iranian', 'NORP')\n", "('Iran', 'GPE')\n", "('Iran', 'GPE')\n", "('One', 'CARDINAL')\n", "('David Albright', 'PERSON')\n", "('just one month', 'DATE')\n", "('Albright', 'PERSON')\n", "('Iranian', 'NORP')\n", "('years', 'DATE')\n", "('the International Atomic Energy Agency', 'ORG')\n", "('Iranian', 'NORP')\n", "('Israeli', 'NORP')\n", "('American', 'NORP')\n", "('Donald Trump’s', 'PERSON')\n", "('Joe Biden', 'PERSON')\n", "('Iran', 'GPE')\n", "('Robert Malley', 'PERSON')\n", "('America', 'GPE')\n", "('Iran', 'GPE')\n", "('Iran', 'GPE')\n", "('Iran', 'GPE')\n", "('one', 'CARDINAL')\n", "('Iranian', 'NORP')\n", "('The Back Channel”', 'WORK_OF_ART')\n", "('William Burns', 'PERSON')\n", "('CIA', 'ORG')\n" ] } ], "source": [ "#doc1 = nlp(EE1)\n", "for ent in doc1.ents:\n", "    print((ent.text, ent.label_))" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ 
"'Countries, cities, states'" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# label explanation\n", "spacy.explain(\"GPE\")" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
An alarming problem looms once more: \n", "\n", " Iran\n", " GPE\n", "\n", " is rapidly advancing its nuclear programme; \n", "\n", " Israel\n", " GPE\n", "\n", " is threatening military action against it; and \n", "\n", " America\n", " GPE\n", "\n", " is seeking a diplomatic solution. \n", "\n", " Anton La Guardia\n", " PERSON\n", "\n", ", The Economist’s diplomatic editor, wrote about the international crisis this poses. \n", "\n", " Al-Monitor\n", " PERSON\n", "\n", ", a publication that covers \n", "\n", " the Middle East\n", " LOC\n", "\n", ", has been monitoring the negotiations to revive a nuclear deal from \n", "\n", " 2015\n", " DATE\n", "\n", ". It makes for a fiendish dilemma that the former \n", "\n", " French\n", " NORP\n", "\n", " president, \n", "\n", " Nicolas Sarkozy\n", " PERSON\n", "\n", ", encapsulated thus: `an \n", "\n", " Iranian\n", " NORP\n", "\n", " bomb or the bombing of \n", "\n", " Iran\n", " GPE\n", "\n", "'. \n", "\n", " Iran\n", " GPE\n", "\n", " doesn't yet have a nuclear weapon. But the situation is in many ways worse than in the past. The country is closer than ever to being able to make a nuke. \n", "\n", " One\n", " CARDINAL\n", "\n", " expert, \n", "\n", " David Albright\n", " PERSON\n", "\n", ", puts the `breakout time'—the time needed to make one bomb’s worth of highly enriched uranium—at \n", "\n", " just one month\n", " DATE\n", "\n", ". Mr \n", "\n", " Albright\n", " PERSON\n", "\n", "’s book is one of the most detailed accounts of the \n", "\n", " Iranian\n", " NORP\n", "\n", " programme, drawing on \n", "\n", " years\n", " DATE\n", "\n", " of inspections by \n", "\n", " the International Atomic Energy Agency\n", " ORG\n", "\n", " and a trove of \n", "\n", " Iranian\n", " NORP\n", "\n", " documents obtained by \n", "\n", " Israeli\n", " NORP\n", "\n", " intelligence. 
Another problem is that the credibility of \n", "\n", " American\n", " NORP\n", "\n", " diplomacy has been damaged by \n", "\n", " Donald Trump’s\n", " PERSON\n", "\n", " repudiation of the nuclear deal. His successor, \n", "\n", " Joe Biden\n", " PERSON\n", "\n", ", is trying to revive it but \n", "\n", " Iran\n", " GPE\n", "\n", " doesn’t seem to be interested. \n", "\n", " Robert Malley\n", " PERSON\n", "\n", ", \n", "\n", " America\n", " GPE\n", "\n", "'s special envoy on \n", "\n", " Iran\n", " GPE\n", "\n", ", discusses the impasse in a recent interview. \n", "\n", " Iran\n", " GPE\n", "\n", " claims it seeks nuclear energy only for civilian purposes. The trouble is that the technology used to make low-enriched uranium fuel for nuclear power stations is also used to make highly-enriched uranium for weapons. Whatever its ultimate aim, \n", "\n", " Iran\n", " GPE\n", "\n", "’s regime attaches great national pride to its mastery of nuclear technology: “our moon shot” is how \n", "\n", " one\n", " CARDINAL\n", "\n", " \n", "\n", " Iranian\n", " NORP\n", "\n", " official put it, according to “\n", "\n", " The Back Channel”\n", " WORK_OF_ART\n", "\n", ", a memoir by \n", "\n", " William Burns\n", " PERSON\n", "\n", ", the head of the \n", "\n", " CIA\n", " ORG\n", "\n", " who, as a diplomat, helped negotiate the nuclear deal.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "spacy.displacy.render(doc1, style=\"ent\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### spaCy entities:\n", "\n", "|Entity type| Description.|\n", "| --- | --- |\n", "|PERSON| People, including fictional.|\n", "|NORP| Nationalities or religious or political groups.|\n", "|FAC| Buildings, airports, highways, bridges, etc.|\n", "|ORG| Companies, agencies, institutions, etc.|\n", "|GPE| Countries, cities, states.|\n", "|LOC| Non-GPE locations, mountain ranges, bodies of water.|\n", "|PRODUCT| Objects, vehicles, foods, etc. (Not services.)|\n", "|EVENT| Named hurricanes, battles, wars, sports events, etc.|\n", "|WORK_OF_ART| Titles of books, songs, etc.|\n", "|LAW| Named documents made into laws.|\n", "|LANGUAGE| Any named language.|\n", "|DATE| Absolute or relative dates or periods.|\n", "|TIME| Times smaller than a day.|\n", "|PERCENT| Percentage, including ”%“.|\n", "|MONEY| Monetary values, including unit.|\n", "|QUANTITY| Measurements, as of weight or distance.|\n", "|ORDINAL| “first”, “second”, etc.|\n", "|CARDINAL| Numerals that do not fall under another type.|\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Other associated tasks\n", "- Relationship extraction/Entity association\n", "- Co-reference resolution\n", "- Word (sense) disambiguation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Recommended reading: Sandia report surveying Entity-Relation Extraction Tools " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "##### Ungraded task 6.2: \n", "\n", "Using the Kaggle spam dataset [Enron1], build a pipeline to train and classify spam emails, and benchmark the performance of your classifier. 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Mulling over ML mechanics" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Lifecycle of ML modeling\n", "\n", "\"modelcycle\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Model training, tuning and evaluation \n", "\n", "> How well does the model do with data not seen during fitting?\n", "> - for data which has already been seen, a memory based model would be always \"perfect\"\n", " \n", "\"bigpic\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Cross validation\n", "Variants: \n", "- K-fold\n", "- Stratified: Each fold maintains proportional representation of classes\n", " * sklearn.model_selection.StratifiedKFold\n", "- Leave one out\n", "\n", "\"bigpic\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Validation curve\n", "\n", "\"validation\n", "\n", "> In this plot (see source for details and source-code) we see the training scores and validation scores of an SVM for different values of the kernel parameter $\\gamma$, from which we can idenitfy the regions of under-fitting and over-fitting. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Data leakage\n", "\n", "> Leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from. \n", "> - This may lead to very good performance during the model tuning and evaluation, and yet, lead to poor performance in production environment\n", "> * Recommeded reading: Leakage in Data Mining:\n", "Formulation, Detection, and Avoidance by Kaufman et al. 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Holdout data for evaluation\n", "Separate from the data used in cross-validation\n", "\n", "\"bigpic\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Accuracy\n", "Limitations of accuracy as a metric\n", "\n", "
\n", "\"classifier
\n", "Image source: Adapted from/edited over an example of t-SNE manifold learning for dimensionality reduction using sklearn's breast cancer dataset
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Confusion matrix\n", "Note: These confusion matrices are NOT for the breast cancer data.\n", "\n", "![Confusion matrix examples](pics/confusionmatrixexamples.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Image adapted from Wikipedia | False positive and negative\n", ":-------------------------:|:-------------------------:\n", "\"bigpic\" |
False positive: An error in binary classification, where a specific classification label is assigned to a test subject, while it in fact does not belong to said class.
e.g., a person tests positive for Covid-19 even though the person is in fact not infected.




False negative: When the classifier determines that a test subject does not have a particular label, while it in fact belongs to said class.
e.g., a fraudulent credit card transaction is deemed legitimate and is therefore not blocked.
" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Example code to get the confusion matrix using ConfusionMatrixDisplay\n", "# Using the previously used dataset and model\n", "predictions = model3.predict(cv3.transform(X_test))\n", "# We are reusing the logistic regression model trained on the 300-dimensional feature representation\n", "cm_notnorm = confusion_matrix(y_test, predictions)\n", "cm_norm = confusion_matrix(y_test, predictions, normalize=\"true\")\n", "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))\n", "fig.suptitle('Confusion matrix - Absolute numbers & Normalized respectively')\n", "ConfusionMatrixDisplay(cm_notnorm).plot(cmap=plt.cm.Blues, ax=ax1)\n", "ConfusionMatrixDisplay(cm_norm).plot(cmap=plt.cm.Greys, ax=ax2)\n", "#ConfusionMatrixDisplay.from_predictions(y_test,predictions,normalize=\"true\",cmap=plt.cm.Greys)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "array([[0.74753222, 0.25246778],\n", " [0.97268901, 0.02731099],\n", " [0.01440663, 0.98559337],\n", " ...,\n", " [0.06684778, 0.93315222],\n", " [0.99675985, 0.00324015],\n", " [0.94421511, 0.05578489]])" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Instead of hard predictions, one can obtain the probability of each sample belonging to each class\n", "# For this binary classifier, predict() effectively applies a threshold of 0.5\n", "model3.predict_proba(cv3.transform(X_test))" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# However, we can use predict_proba() to apply a custom threshold\n", "proba_pos = model3.predict_proba(cv3.transform(X_test))[:, 1]  # probability of class 1\n", "y_pred_03 = (proba_pos >= 0.3).astype(int)\n", "y_pred_07 = (proba_pos >= 0.7).astype(int)" ] }, { "cell_type": "code", 
"execution_count": 69, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Threshold 0.3\n", "               precision    recall  f1-score   support\n", "\n", "           0       0.92      0.79      0.85      1039\n", "           1       0.80      0.93      0.86       961\n", "\n", "    accuracy                           0.86      2000\n", "   macro avg       0.86      0.86      0.86      2000\n", "weighted avg       0.87      0.86      0.86      2000\n", "\n", "Threshold 0.5\n", "               precision    recall  f1-score   support\n", "\n", "           0       0.89      0.86      0.88      1039\n", "           1       0.85      0.89      0.87       961\n", "\n", "    accuracy                           0.87      2000\n", "   macro avg       0.87      0.87      0.87      2000\n", "weighted avg       0.87      0.87      0.87      2000\n", "\n", "Threshold 0.7\n", "               precision    recall  f1-score   support\n", "\n", "           0       0.85      0.91      0.88      1039\n", "           1       0.89      0.83      0.86       961\n", "\n", "    accuracy                           0.87      2000\n", "   macro avg       0.87      0.87      0.87      2000\n", "weighted avg       0.87      0.87      0.87      2000\n", "\n" ] } ], "source": [ "# Classification report for customized thresholds\n", "from sklearn.metrics import classification_report\n", "print(\"Threshold 0.3\\n\", classification_report(y_test,y_pred_03))\n", "print(\"Threshold 0.5\\n\", classification_report(y_test,predictions))\n", "print(\"Threshold 0.7\\n\", classification_report(y_test,y_pred_07))" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAzkAAAFFCAYAAADPbSOKAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAABNPElEQVR4nO3deZwU5bn3/881DKjIogioLEEERMBHUBaXYxRNZFEjGqMBjWjUo+S4PSfGo788J3uMeqJGE/AgRkSiQlwwEEXQmEPQo4YdFURBUBlQERRxB8br90fVjD3NLD1jVXdV9/f9evWL6a7qqrt7er7UVfddd5u7IyIiIiIiUizKCt0AERERERGRKKnIERERERGRoqIiR0REREREioqKHBERERERKSoqckREREREpKioyBERERERkaKiIkdEUsHM5pnZReHP55jZExFv/wAzczMrj3K7DezTzOxuM3vfzBZ8he183cxeibJthWJmXzOzj8ysWROeu6+ZzTezD83s5jja91WY2eNmdl6B2/BzM7s3D/s538yeaeJz622jmb1uZt9seutEpBSoyBERoPrA4R0z2zPjsYvMbF4Bm1Urd7/P3YcVuh0ROAY4Eeji7kOauhF3f9rde0fXrHjkcnDq7m+6eyt3r2zCLi4GNgNt3P2qJjUyRu4+0t3vydf+zGyomVXka38iIkmiIkdEMpUDV37VjYQ9FMqXhnUDXnf3jwvdkCSIoBetG7DSm/At13H24KX17yGfvZoiIlFLXeiKSKx+C/zIzPaqbaGZHW1mC83sg/DfozOWzTOz68zsf4FPgAPD4V//ZmarwyFEvzKzHmb2nJltM7MHzKxF+Py9zexRM3s3HL71qJl1qaMd1UNhzOw/wuFNVbcdZjYlXNbWzO4ys7fMbIOZ/bpqGJSZNTOzm8xss5mtBU6u740xs65mNiNs3xYzGx8+XmZm/2lmb5jZJjObamZtw2VVQ+DOM7M3w339v3DZhcAfgaPCdv+itiE+4fN7hj+fZGYrw/dyg5n9KHy8xhl7M+sT/j62mtkKMzs1Y9kUM5tgZo+F2/mnmfWo4zVXtf/7ZrY+/L2MM7PBZvZCuP3xGev3MLO/h+/PZjO7r+qzZGZ/Ar4G/DV8vf+Rsf0LzexN4O8Zj5WbWTszqzCzb4XbaGVma8xsbC1tnQKcB1R9Hr5pZruZ2a1mtjG83Wpmu2W+Z2Z2jZm9Ddydtb3dwtd3SMZjHczsUzPr2NDn1Wr/e8gcclnf52aXHhjL6AUzsyFmtsiCv6F3zOyWWt6PPYHHgU725d9Gp3Bxi3B/H4afj0FZ+7nGzF4APg5/D0ea2bPh+7HczIZmrH++ma0Nt7XOzM7JasdN4fuzzsxGZjzeycxmmdl74e/0X7NfQ8a654bv0xYL/34yljX4XohIiXJ33XTTTTeA14FvAjOAX4ePXQTMC39uB7wPnEvQ4zMmvL9PuHwe8CbQL1zeHHBgFtAmfPxz4CngQKAtsBI4L3z+PsAZQEugNfAg8JeM9s0DLgp/Ph94ppbX0BXYCJwU3v8LcAewJ9ARWABcEi4bB6wKn9MO+J+wveW1bLcZsBz4Xbit3YFjwmUXAGvC19QqfP/+FC47INzmncAeQP/wPehT2+uo7XWFz+8Z/vwW8PXw572Bw8OfhwIV4c/Nw/b8GGgBnAB8CPQOl08B3gOGhL+n+4DpdXwmqto/MXzNw4DPwve1I9AZ2AQcF67fk2D43W5AB2A+cGv2Z6yW7U8N39c9Mh4rD9cZBrwd7u9O4KF6PsNTCD+74f1fAs+Hz+0APAv8KuM92wncGLZ3j1q2Nxm4LuP+pcCcRnxes/8e5vHlZ7i+z03177O29w54Djg3/LkVcGQd70dt2/l5+Ds8ieBzfT3wfNZ+lhH8XewR/o63hOuXhb/fLeH7uSewjS8/W/sD/TI+yzuAfw338wOCv00Ll/8DuJ3gczUAeBf4RkYb7w1/7gt8BBwb/p5uCX9vjXovdNN
Nt9K7qSdHRLL9FLjczDpkPX4ysNrd/+TuO919GkGR8K2Mdaa4+4pw+Y7wsRvdfZu7rwBeAp5w97Xu/gHBmebDANx9i7s/7O6fuPuHwHXAcbk22sz2IDj4vs3dZ5vZvsBI4P+6+8fuvomgSBkdPuUsggPw9e7+HsHBXl2GAJ2Aq8NtfebuVT0u5wC3hK/pI+D/A0ZbzaE+v3D3T919OUGx1D/X15VlB9DXzNq4+/vuvqSWdY4kONi7wd23u/vfgUcJitIqM9x9gbvvJChyBjSw31+Fr/kJ4GNgmrtvcvcNwNN8+Ttc4+5Puvvn7v4uwQFpLr/Dn4fv66fZC8J9PkhQHJ8MXJLD9qqcA/wybOu7wC8IivQqXwA/C9u7y76B+6n5vp0dPpbr57W2v4fMtjX0uanLDqCnmbV394/c/fkcnpPpGXef7cF1T39i18/j78O/i0+B7wGzw/W/cPcngUUERQ8E7+EhZraHu78V/p1XecPd7wz3cw9BEbSvmXUluB7tmvBztYygVzPzd1PlO8Cj7j7f3T8HfhLuM6r3QkSKlIocEanB3V8iOCi+NmtRJ+CNrMfeIDjTW2V9LZt8J+PnT2u53wrAzFqa2R3hsJRtBL0Ae1nus2zdBbzi7jeG97sRnD1/Kxxms5WgV6djxuvJbG/2a8vUleCAbWcty7LflzcIztzvm/HY2xk/f0L4mpvgDIKDyzfM7B9mdlQd7Vnv7pkHgtm/p8a2J9ffYUczm27BULptwL1A+wa2DbV/bjJNAg4B7nb3LTlsr0ptv5tOGfffdffP6nn+34E9zOwIM+tGUAw+Ajl/Xut7Xbl8bupyIXAQsMqCYaOn5PCcTNm//92ziqvMdncDzqz6Gwr/jo4B9vfgWrLvEvSKvmXBEMiDa9uPu38S/tiK4LW/FxaHVbI/o1Vq/J2G+8z8DHzV90JEipSKHBGpzc8IhplkHnRsJDjgyfQ1YEPG/UZf8J3hKqA3cIS7tyEYngJgDT3RzK4Nn3thxsPrCYaGtXf3vcJbG3fvFy5/i6B4qfK1enaxHvhaHWfZs9+XrxEMp3mnlnUb8jHB8CcAzGy/zIXuvtDdRxEUan8BHqijPV2t5oXu2b+nuFxP8Bk4NPwdfo+av7+6Ph91fm7CouEOgiFtP7Dw+qQc1fa72ZjLfgHCQvEBgt6cswl6FKoOzHP5vNa3/fo+N9mfg2YEw8Oq2rXa3ccQfA5uBB6yjFkRc3199ch83nqCYXR7Zdz2dPcbwrbMdfcTCXppVhEMKWzIRqCdmbXOeKyuz2iNv1Mza0kwVJBw/7m+FyJSYlTkiMgu3H0N8GfgioyHZwMHmdnZ4cXI3yUYL/9oRLttTdArsNXM2hEUWg0KL2a+Ajgtc8iRu78FPAHcbGZtwgu9e5hZ1ZCiB4ArzKyLme3Nrj1XmRYQHGzdYGZ7mtnuZvYv4bJpwL+bWXczawX8BvhzHb0+DVkO9DOzAWa2O8G1CVWvs4UF3w/UNhz6tA2obZrlfxIcJP+HmTUPLxL/FjC9Ce1prNYE109sNbPOwNVZy98huAalMX4c/nsBcBMwtRG9e9OA/7RgwoD2BEMxG/sdMfcT9FacE/5cpUmf16y21fW5eZWgd+VkM2sO/CfB9SgAmNn3zKxDWIRtDR+u7bPwDrCPhRMaNNG9wLfMbLgFk3XsbsHECF0s+F6iU8Oi4nOC332DU3+7+3qC66OuD7d3KMEJivtqWf0h4BQzO8aCSUp+ScaxSyPeCxEpMSpyRKQuvyS4sBgIrkEATiE4g70F+A/gFHffHNH+biW40HkzwcXic3J83ncJznK/bF/OIjUxXDaW4OL7lQSTJDxEcMYZgjPOcwkKiyUEF37XKrym4FsEF9a/CVSE+4Xg4vQ/EQxXWkdwUfflObY9ez+vErzvfwNWA9lfpngu8Ho4PGocQU9J9ja2A6cSXI+0meDi7rHuvqopbWqkXwCHAx8Aj7Hre3o9QdGx1cKZ4epjZgOBHxK0v5L
gTL1Tf0Ga6dcE14+8ALxI8Hv+dY7PBcDdq4rGTgTXkFW5laZ9XqvU+bkJr1f7N4LrVDaE+8+cbW0EsMLMPgJuA0bXNuwu/J1PA9aG73mn7HUaEhYkowiKzXcJenauJjh+KCPIg40Ek1kcF7Y7F2MIJpnYSDAE8Gfh9T7Z+19BMOHD/QQnGt6nCe+FiJSeqllOREREREREioJ6ckREREREpKioyBERERERkaKiIkdERERERIqKihwRERERESkqKnJERERERKSoqMgREREREZGioiJHRERERESKioocEREREREpKipyRERERESkqKjIERERERGRoqIiR0REREREioqKHBERERERKSoqckREREREpKioyBERERERkaKiIkdERERERIqKihwRERERESkqKnJERERERKSoqMgREREREZGioiJHRERERESKioocEREREREpKipyRERERESkqKjIERERERGRoqIiR0REREREioqKHBERERERKSoqckREREREpGDMbISZvWJma8zs2lqW721mj5jZC2a2wMwOaWibKnJEEsjMvAm3OYVut4gkj/JERKLQxCxpME/MrBkwARgJ9AXGmFnfrNV+DCxz90OBscBtDbVXRY5IQplZo25A+0K3WUSSSXkiIlFobJbkmCdDgDXuvtbdtwPTgVFZ6/QFngJw91XAAWa2b30bVZEjklBNCBERkVopT0QkCk0schrSGVifcb8ifCzTcuDbYRuGAN2ALvVtVEWOiIiIiIjEpb2ZLcq4XZy1vLZKyLPu3wDsbWbLgMuBpcDO+nZa3tTWiki8dDZVRKKiPBGRKDQxSza7+6B6llcAXTPudwE2Zq7g7tuA74dtMGBdeKuTenJEEsjMKCsra9Qtx+1GPnuJiCRbXHkiIqWlKVmSY54sBHqZWXczawGMBmZl7XuvcBnARcD8sPCpk3pyRBIq6jOvGbOXnEhw1mShmc1y95UZq1XNXnK6mR0crv+NSBsiInmnnhwRiUIcWeLuO83sMmAu0AyY7O4rzGxcuHwi0AeYamaVwErgwoa2qyJHJKFiCJLq2UvC7VfNXpJZ5PQFrodg9hIzO8DM9nX3d6JujIjkj4ocEYlCXFni7rOB2VmPTcz4+TmgV2O2qT5pkYRKy+wlIpJ8ml1NRKIQ0+xqsVBPjkhCNSEY2pvZooz7k9x9UuYma3lObbOX3BbOXvIiOcxeIiLJp8JFRKKQpixRkSOSQE08+1GQ2UtEJNkKfTZVRIpD2rJERY5IQsUww1H17CXABoLZS87OXMHM9gI+Cb9xOKfZS0Qk+TRjmohEIU1ZoiJHJKGiPlsS1+wlIpJ8aTr7KiLJlaYsUZEjklAxTdMY+ewlIpJ8aTowEZHkSlOWqMgRSaC0jXsVkeRSnohIFNKWJSpyRBIqTUEiIsmmPBGRKKQpS1TkiCRUmoJERJJNeSIiUUhTlqjIEUmoNM1gIiLJpjwRkSikKUtU5IgkUNrGvYpIcilPRCQKacsSFTkiCZWmIBGRZFOeiEgU0pQl6elzEhERERERyYF6ckQSKk1nS0Qk2ZQnIhKFNGWJihyRhEpTkIhIsilPRCQKacoSFTkiCZWmIBGRZFOeiEgU0pQlKnJEEsjMUjVNo4gkl/JERKKQtixRkSOSUGk6WyIiyaY8EZEopClLVOSIJFSagkREkk15IiJRSFOWpKfPSaTEVH3pVq43EZG6xJEnZjbCzF4xszVmdm0ty9ua2V/NbLmZrTCz70f+wkQkrxqbJYU8PlGRUwBm9nMzuzcP+znfzJ5p4nPrbaOZvW5m32x666Q+aQoRKSzliTQkjjwxs2bABGAk0BcYY2Z9s1a7FFjp7v2BocDNZtYi2lcnUVKeSH2akiUqcoqMmX2UcfvCzD7NuH9OodsXNwvcaGZbwtt/WR2fcjPra2aLzOz98Pa3Wv6jLElpCRGJl/KkUXlygJl51nv2k3y3OYliyJMhwBp3X+vu24HpwKisdRxoHf6+WgHvATu
jfF3SOMqTRuXJOVnv1ydhvgzMd7uTREVOiXP3VlU34E3gWxmP3deYbZlZGq+buhg4DegPHAqcAlxSx7obge8A7YD2wCyC/yxLXllZWaNuUpyUJ43Kkyp7ZbxHv4q5fakQQ550BtZn3K8IH8s0HuhDkPMvAle6+xdRvB5pGuVJ7nni7vdlvV//BqwFluSprYnU2Cwp5PGJjowKp4WZTTWzDy0YqzyoaoEFXa3XmNkLwMdmVm5mR5rZs2a21YLxzUMz1j/fzNaG21qXfTbGzG4Ke0nWmdnIjMc7mdksM3vPgjHV/1pXY83sXDN7Izzz8f8aeG3nATe7e4W7bwBuBs6vbUV33+rur7u7AwZUAj0b2H5JSMuZEkkE5YnUqwl50t6CXvaq28XZm6xlN551fziwDOgEDADGm1mbiF+aRE95Uvdzp4bHKyWrsVlSyOMTFTmFcypBj8VeBL0X47OWjwFODpfvCzwG/Jqgx+NHwMNm1sHM9gR+D4x099bA0QT/qVQ5AniFoJfkv4C77MtP3DSCs2+dCHpTfmNm38huqAXDx/4bODdcdx+gSz2vrR+wPOP+8vCxOpnZVuAz4A/Ab+pbtxSkKUQkEZQnNb1hZhVmdreZtW9g3aLXxDzZ7O6DMm6TsjZbAXTNuN+FoMcm0/eBGR5YA6wDDo7rdUpklCe77qcbcCwwtaF1i1lTsiTX4xOLYSITFTmF84y7z3b3SuBPBF2nmX7v7uvd/VPge8DscP0v3P1JYBFwUrjuF8AhZraHu7/l7isytvOGu98Z7uceYH9gXzPrChwDXOPun7n7MuCPBEGR7TvAo+4+390/B34S7rMurYAPMu5/ALSyej7p7r4X0Ba4DFhaz7ZLhoocaQTlSWAzMBjoBgwEWgONGoJTrGLIk4VALzPrbsFkAqMJDogzvQl8I9z/vkBvguE+kmzKk12NBZ5293UNrFf04ihyLKaJTFTkFM7bGT9/AuxuNce3Zo517gacaUFX8Naw1+MYYH93/xj4LjAOeMvMHjOzzDNl1ftx90/CH1sRnPF4z90/zFj3DXYdU024bnV7wn1uqee1fQRkDkloA3zUUBdvuN2JwFQz61jfuqVARY40gvIk2NZH7r7I3Xe6+zsEJ02GmYZIRZ4n7r6T4P2dC7wMPODuK8xsnJmNC1f7FXC0mb0IPEVw0Lo5ppco0VGe7GosQSFW8mLqyYllIpM0XjRWKjL/4NYDf3L3WsekuvtcYK6Z7UHQZXwn8PUGtr8RaGdmrTOC5GvAhlrWfYvg4lEAzKwlQZdwXVYQnPlZEN7vHz6WizKgJUGYbcrxOUVJkwlIhEo1T6ped8mfBYgjT9x9NjA767GJGT9vBIZFvmMptJLKEzP7F4Ji6qEG2l0SYjo2qW0ikyOy1hlP0Fu8kaCX/rsNTWSio6h0uBf4lpkNN7NmZra7mQ01sy5mtq+ZnWrB2NfPCc5SVDa0QXdfDzwLXB9u71DgQmof2vEQcIqZHRN2Df6S+j87U4EfmllnM+sEXAVMqW1FMzvRzA4LX1cb4BbgfYIzgyUrpjMlIlDceXKEmfU2szIz24fgeoB57v5BbeuXCuWJxKho8yTDecDDWT1LJakpWRLmSUEmMlGRkwLhH/wo4MfAuwTV7tUEv78ygj/SjQRdd8cRTHOYizHAAeFzHwF+Fo6nzd7/CoKxkPcTnDV5n6DKrssdwF8Jpgx9ieCixDuqFlpwwVjVDCt7EVxg+AHwGsHMaiPc/bMcX4M0gukbyktekefJgcAc4MNw3c/DdolIDIo8TzCz3YGz0FC1r6ogE5lYw8MQRSTf9thjD+/evXujnvPyyy8vdvdBdS234MK+V4ETCQJlITDG3VdmrPNjoK27X2NmHQhmvtkvHCMrIikUR56ISOlpSpZATscn5QTHJ98gGJa4EDg7c6IKM/tv4B13/7kFE5ksAfrXd52frskRSagYxr1WX9gHYGZVF/atzFhH31AuUoR0jZ+IRCGm6/t2mln
VRCbNgMlVE5mEyycSTGQyxYKJTIwcJjJRkSOSUDGMi4/lwj4RST5dZyMiUYgrS+KYyERFjkgCmVlTzpa0N7NFGfcnZY17bcyFfScAPYAnzexpd9/W2MaISDI0MU9ERGpIW5aoyBFJqCacLdncwBj6XC/suyH8zoA1ZlZ1Yd8CRCS11JMjIlFIU5aoyBFJqBiCpPobygku7BsNnJ21TtU3lD9t+oZykaKRpgMTEUmuNGVJooqc8vJyb9GiRaGbUTIOOuigQjehZKxfv54tW7bknAxxdAnHdWFfEpmZp6lLPe0GDBhQ6CaUlCVLlmx29w65rp+2ISZJY2aahjaPBg4cWOgmlJTFixfnnCdpy5JEFTktWrTg4IPrnfJaIjRnzpxCN6FkDBvW+C/9juNsSal8Q3lZWRm77757oZtRMp5//vlCN6GktGjR4o3GPidNZ1+ltC1atKjhlSQyZtaoPElTliSqyBGRL6XpbImIJJvyRESikKYsUZEjklBpOlsiIsmmPBGRKKQpS1TkiCRQ2sa9ikhyKU9EJAppyxIVOSIJlaazJSKSbMoTEYlCmrJERY5IQqUpSEQk2ZQnIhKFNGWJihyRBEpbl7CIJJfyRESikLYsUZEjklBpOlsiIsmmPBGRKKQpS1TkiCRUms6WiEiyKU9EJAppyhIVOSIJlaazJSKSbMoTEYlCmrIkPeWYiIiIiIhIDtSTI5JAZpaqsyUiklzKExGJQtqyREWOSEKladyriCSb8kREopCmLFGRI5JQaTpbIiLJpjwRkSikKUtU5IgkUNrmoheR5FKeiEgU0pYlKnJEEipNZ0tEJNmUJyIShTRliYockYRKU5CISLIpT0QkCmnKEhU5IgmVpi5hEUm2OPLEzEYAtwHNgD+6+w1Zy68GzgnvlgN9gA7u/l7kjRGRvEjTsYmKHJEESts0jSKSXHHkiZk1AyYAJwIVwEIzm+XuK6vWcfffAr8N1/8W8O8qcETSK23HJipyRBIqTWdLRCTZYsiTIcAad18LYGbTgVHAyjrWHwNMi7oRIpJfaTo2UZEjklBpOlsiIskWQ550BtZn3K8Ajqhj3y2BEcBlUTdCRPIrTccmKnJEEiht0zSKSHI1MU/am9mijPuT3H1S5mZreY7Xsa1vAf+roWoi6RbnsUkc1/ipyBFJqDSdLRGRZGtCnmx290H1LK8Aumbc7wJsrGPd0WiomkhRiOPYJK5r/FTkiCSUihwRiUoMebIQ6GVm3YENBIXM2bXsty1wHPC9qBsgIvkX07FJLNf4aTyMSEJVzWKS6y3HbY4ws1fMbI2ZXVvL8qvNbFl4e8nMKs2sXeQvTkTyKuo8cfedBNfYzAVeBh5w9xVmNs7MxmWsejrwhLt/HMsLE5G8amyWhHnS3swWZdwuztpsbdf4da5j/1XX+D3cUFvVkyOSQJryVUSiEte0r+4+G5id9djErPtTgCmR71xE8u4rZElDw19jucZPRY5IQsVwUKIpX0VKlIa/ikgUYsqSWK7x03A1keJRkO5gERERka+g+ho/M2tBUMjMyl4p4xq/mblsVD05IgkVw2xImvJVpESpJ0dEohDT0NedZlZ1jV8zYHLVNX7h8qphsI26xk9FjkhCxRAkmvJVpESpyBGRKMSVJXFc46ciRyShNOWriERFRY6IRCFNWaIiRySB4vhW4bi6g0Uk2eL8lnIRKR1pyxIVOSIJpSlfRSQqaTr7KiLJlaYsUZEjklBpChIRSTbliYhEIU1ZoiJHJKHSFCQikmzKExGJQpqyREWOSALF9Q3lIlJ6lCciEoW0ZYmKHJGESlOQiEiyKU9EJAppyhIVOSIJlaYgEZFkU56ISBTSlCUqckQSKk1BIiLJpjwRkSikKUtU5IgkVJqCRESSTXkiIlFIU5ak5xt9CuDoo49mxowZzJw5k/PPP3+X5WPHjmXatGlMmzaNBx54gIULF9KmTRsAxowZwwMPPMCDDz7I2Wfv8qXyUou///3vHH300Rx
xxBH8/ve/32X5Qw89xNChQxk6dCgnn3wyK1asqF525ZVX0rdvX4499th8Njk2VRf3NeYmyXbiiSeydOlSXnjhBa666qpdlrdp04YHH3yQ559/noULF3LuuedWL2vbti333nsvS5YsYfHixQwZMiSfTU+luXPn0q9fP/r06cN//dd/7bJ81apVfP3rX6dVq1bccsstuyyvrKxk8ODBnHbaaXlobbyUJ8Vl+PDhrFq1itWrV3PNNdfssnyvvfZixowZLF++nH/+85/069evetldd93FO++8w4svvpjPJqfOnDlz6N27Nz179uSGG27YZbm7c8UVV9CzZ08OPfRQlixZUr3stttu45BDDqFfv37ceuut1Y8/+OCD9OvXj7KyMhYtWpSPlxG5pmRJIfMk1iLHzEaY2StmtsbMro1zX1ErKyvjmmuu4fLLL+eMM85gxIgRdO/evcY6U6dOZcyYMYwZM4bx48ezZMkStm3bRo8ePTj99NMZO3Yso0eP5utf/zpdu3Yt0CtJh8rKSq699lruv/9+nn76aR555BFeeeWVGut069aNv/zlL8ybN48f/vCHNQ4UR48ezfTp0/Pd7FilJUTyJe15csstt3D66aczcOBAzjzzTA4++OAa61x88cWsWrWKI488kpEjR/Kb3/yG5s2bA/Db3/6WJ598ksMPP5wjjzxyl78NqamyspIrr7ySv/71ryxfvpw///nPrFy5ssY67dq143e/+x3//u//Xus2/vCHP+zyO0oz5cmX0p4lEyZMYOTIkfTt25cxY8bQp0+fGuv8+Mc/ZtmyZfTv35+xY8dy2223VS+bMmUKI0aMyHezU6WyspJLL72Uxx9/nJUrVzJt2rRd8uPxxx9n9erVrF69mkmTJvGDH/wAgJdeeok777yTBQsWsHz5ch599FFWr14NwCGHHMKMGTNSfzJWRQ5gZs2ACcBIoC8wxsz6xrW/qB1yyCFUVFSwYcMGdu7cydy5cxk6dGid6w8fPpw5c+YA0L17d1588UU+++wzKisrWbx4MSeccEKeWp5OS5YsoXv37hxwwAG0aNGC0047rfr9rDJ48GD22msvAAYOHMhbb71Vveyoo46qXlYs0hIi+ZD2PBk0aBBr167l9ddfZ8eOHTz00EOccsopu6zXqlUrAPbcc0/ef/99du7cSevWrfmXf/kX7rnnHgB27NjBBx98kNf2p83ChQvp0aMHBx54IC1atOCss87ir3/9a411OnbsyKBBg6oLyUwVFRU8/vjjXHDBBflqcuyUJ4G0Z8mQIUNYs2YN69atY8eOHUyfPp1Ro0bVWKdv37489dRTALzyyisccMABdOzYEYCnn36a9957L+/tTpMFCxbQs2fP6vwYPXo0M2fOrLHOzJkzGTt2LGbGkUceydatW3nrrbd4+eWXOfLII2nZsiXl5eUcd9xxPPLIIwD06dOH3r17F+IlRUpFTmAIsMbd17r7dmA6MKqB5yRGhw4dePvtt6vvb9q0qToksu2+++4cffTR1aHy2muvcfjhh9O2bVt23313jjnmGPbdd9+8tDut3n77bTp16lR9v1OnTjXe/2z3339/0ReOaQmRPEl1nnTq1ImKiorq+xs2bGD//fevsc7EiRPp3bs3r732GgsWLODqq6/G3enevTubN2/mjjvu4Nlnn2XChAm0bNky3y8hVTZs2ECXLl2q73fu3JmNGzfm/PyrrrqK66+/nrKy4hnRrTypluos6dy5M+vXr6++X1FRQefOnWuss3z5cr797W8DwcnBbt261fh7kPpt2LChxuibLl26sGHDhpzWOeSQQ5g/fz5btmzhk08+Yfbs2TV+X8VARU6gM5D5m60IH0uF2n4p7l7rusceeyzLly9n27ZtAKxbt44pU6Zw++23M378eF599VUqKytjbW/a1fXe1uaZZ57h/vvv5yc/+UmMLZKEKfo8+eY3v8mLL75Ijx49OOqoo7jlllto3bo1zZo1Y8CAAdx5550cffT
RfPLJJ7Ve0yNfqi1Pcv2P9rHHHqNjx44cfvjhUTdLkqHos+SGG25g7733ZunSpVx++eUsXbqUnTt35quJqZdLftS1Tp8+fbjmmms48cQTGTFiBP3796e8XHN8FUqc73xt/6Ps8qkws4uBi4Fahw0UyqZNm9hvv/2q73fs2JF333231nWHDRu2y9CqmTNnVndvXnbZZbzzzjvxNbYI7L///jXOtG7cuLHG+19lxYoV/PCHP2TatGm0a9cun03MuyI/m9pYDeZJZpYk7b2rrWchu6fy3HPP5eabbwZg7dq1vPHGGxx00EHVw2arLlR95JFHVOQ0oEuXLg32nNXl2Wef5dFHH2XOnDl89tlnbNu2jfPOO696uGBaJe1vooAafWySJBUVFbv0IGT3Un744Yc1hlquW7eOdevW5a2NadelS5ddessyR5o0tM6FF17IhRdeCATXRxVbL1qasiTOnpwKIPNq+y7ALuMF3H2Suw9y90FJqnZXrFhB165d6dSpE+Xl5QwfPpx//OMfu6zXqlUrBg4cyLx582o8vvfeewOw3377cfzxx+9SBElNhx12WPWB3fbt2/nLX/7C8OHDa6xTUVHBBRdcwIQJE+jRo0eBWpofaeoOzpMG8yQzS5L2fixevJgePXrQrVs3mjdvzne+8x0ee+yxGuusX7+++rq/jh070qtXL15//XXeeecdKioq6NWrFwBDhw5l1apV+X4JqTJo0KDq6xa2b9/OAw88UOs1ULW57rrrWLduHatXr+bee+/l+OOPL4oCR3lSrdHHJnlrWQ4WLlxIr169OOCAA2jevDmjR49m1qxZNdZp27Zt9Unjiy66iPnz5/Phhx8WormpNHjwYFavXl2dH9OnT+fUU0+tsc6pp57K1KlTcXeef/552rZtW30iZdOmTQC8+eabzJgxgzFjxuT9NcSlKVlSyDyJs6pYCPQys+7ABmA0kJq5lCsrK7nxxhuZMGECZWVlzJo1i7Vr13LGGWcA8PDDDwNw/PHH8/zzz/PZZ5/VeP5NN91E27Zt2blzJzfeeKMCpgHl5eVcf/31jB49msrKSsaMGcPBBx9cfXBx3nnncfPNN/P+++9XT5lZXl7OE088AcAll1zCs88+y3vvvceAAQO4+uqrOeeccwr2eqJQ5AcajZX6PLnqqquYOXMmzZo1Y+rUqbz88svVZ/vuuusubrjhBiZNmsSCBQswM37yk5+wZcsWAH70ox8xefJkWrRowbp16xg3blwhX07ilZeXc+utt3LyySfzxRdfcN5559GvXz8mTZoEBDPZvf322xx11FFs27aNsrIy/vCHP7B8+fLqrwEoNsqTaqnPkssuu4y5c+fSrFkzJk+ezMqVK7nkkksAuOOOO+jTpw9Tp06lsrKSlStXVucMBNezDh06lPbt27N+/Xp+9rOfMXny5EK9nEQqLy9n/PjxDB8+nMrKSi644AL69evHxIkTARg3bhwnnXQSs2fPpmfPnrRs2ZK77767+vlnnHEGW7ZsoXnz5kyYMKH6pPcjjzzC5ZdfzrvvvsvJJ5/MgAEDmDt3bkFe41eRpiyxxlwL0eiNm50E3Ao0Aya7+3X1rd+yZUsvpik7k069S/kzbNgwli1blnMydOzY0c8888xG7eP2229fnLSzjlFqTJ40a9bMd99993w1reRt3bq10E0oKS1atGjU37rypKbGHpuYWXwHSrKLOI9LZVdmlvPfelOyBAqXJ7GOD3P32cDsOPchUqzSdLYkH5QnIk2nPPmSskSk6dKUJcm5CEZEakhTkIhIsilPRCQKacoSFTkiCVToi/VEpHgoT0QkCmnLkuL5pjORIpOW2UtEJPniyBMzG2Fmr5jZGjO7to51hprZMjNbYWa7TlEqIqmi2dVE5CtT4SIiUYk6T8ysGTABOJFgWuaFZjbL3VdmrLMXcDswwt3fNLOOkTZCRPIuTccm6skRSai0nCkRkeSLIU+GAGvcfa27bwemA6Oy1jkbmOHubwK
4+6ZIX5SI5F1cPTlx9AyryBFJqLSEiIgkXwx50hlYn3G/Inws00HA3mY2z8wWm9nYiF6OiBRIHEVORs/wSKAvMMbM+matsxdBz/Cp7t4PaHAuaw1XE0mgOHpnNLxEpDQ1MU/am9mijPuT3H1S5mZreU72F5yUAwOBbwB7AM+Z2fPu/mpjGyMihRfjyJHqnuFwP1U9wysz1ml0z7CKHJGEiiFIYgkREUm+JuTJ5ga+vK8C6JpxvwuwsZZ1Nrv7x8DHZjYf6A+oyBFJqSYemzR00qS2nuEjsrZxENDczOYBrYHb3H1qfTtVkSOSUGVlkY8mjSVERCT5YsiThUAvM+sObABGE5wkyTQTGG9m5UALgrz5XdQNEZH8aWKWNHTSJJaeYRU5Igml4SUiEpWoe4bdfaeZXQbMBZoBk919hZmNC5dPdPeXzWwO8ALwBfBHd38p0oaISF7FNFwtlp5hFTkiCdTEca8aXiIiu4hrHL27zwZmZz02Mev+b4HfRr5zEcm7GK/JiaVnWLOriZSO6hAxsxYEITIra52ZwNfNrNzMWhKEyMt5bqeIiIiUCHffCVT1DL8MPFDVM5zRO/wyUNUzvIAceobVkyOSUBpeIiJRiensq4iUmLiyJI6eYRU5Igml4SUiEhUVOSIShTRliYockYRKU5CISLIpT0QkCmnKEhU5IgmVpiARkWRTnohIFNKUJSpyRBIoxhlMRKTEKE9EJAppyxIVOSIJlaYgEZFkU56ISBTSlCV1Fjlm9gd2/aLAau5+RSwtEhEgXUHSEOWJSGEpT0QkCmnKkvp6chbVs0xEYpamIMmB8kSkgJQnIhKFNGVJnUWOu9+Ted/M9gy/BV1EYmZmlJUVz3f1Kk9ECkd5IiJRSFuWNNhSMzvKzFYSfuu5mfU3s9tjb5lIiau6wC/XWxooT0QKQ3kiIlFobJYUMk9yKcduBYYDWwDcfTlwbIxtEhGK86AE5YlIQShPRCQKaSpycppdzd3XZzWyMp7miEiVFB1oNIryRCT/lCciEoU0ZUkuRc56MzsacDNrAVxB2DUsIvFJU5A0gvJEpACUJyIShTRlSS5FzjjgNqAzsAGYC1waZ6NESl2hu3hjpDwRyTPliYhEIW1Z0mCR4+6bgXPy0BYRyZCmIMmV8kSkMJQnIhKFNGVJLrOrHWhmfzWzd81sk5nNNLMD89E4kVKWlgv7GkN5IlIYyhMRiUKaJh7IZXa1+4EHgP2BTsCDwLQ4GyUiRUt5IiJRUZ6ISJ1yKXLM3f/k7jvD272Ax90wkVKXljMljaQ8ESkA5YmIRCFNPTl1XpNjZu3CH//HzK4FphOEx3eBx/LQNpGSlqIDjQYpT0QKS3kiIlFIU5bUN/HAYoLQqHo1l2Qsc+BXcTVKpNSZGWVluXS0pobyRKRAlCciEoW0ZUmdRY67d89nQ0SkpjSdLWmI8kSksJQnIhKFNGVJLt+Tg5kdAvQFdq96zN2nxtUoEUlXkDSG8kQk/5QnIhKFNGVJg0WOmf0MGEoQIrOBkcAzgEJEJEZpCpJcKU9ECkN5IiJRSFOW5DKw7jvAN4C33f37QH9gt1hbJVLi0jR7SSMpT0TyTHkiIlFoSpYkcna1DJ+6+xdmttPM2gCbAH3ZlkjMUnSg0RjKE5ECUJ6ISBTSlCW59OQsMrO9gDsJZjRZAiyIs1EiEs/3WpjZCDN7xczWhFOvZi8famYfmNmy8PbTiF+W8kSkAJQnIhKFuHpy4siTBnty3P3fwh8nmtkcoI27v5BTi0WkyaI+W2JmzYAJwIlABbDQzGa5+8qsVZ9291Mi3XlIeSJSGMoTEYlCHD05ceVJfV8Genh9y9x9Sa47EZHGiyFIhgBr3H1tuP3pwCggO0QipzwRKSzliYhEIabharHkSX09OTfXs8yBE77KjkWkbjFdrNcZWJ9xvwI4opb1jjKz5cBG4EfuviKCfStPRApEeSIiUYhxIoFY8qS+LwM9vimt/Cr69u3LokWL8r3bkpW
mi8dKURO+Vbi9mWX+AU1y90kZ92v7hXvW/SVAN3f/yMxOAv4C9GpsQ3bZSZ7z5LDDDlOW5JGyJPmUJ0132GGHMX/+/HzusqR16dKl0E2QejQhS6BAeZLTl4GKSP414cBxs7sPqmd5BdA1434XgrMh1dx9W8bPs83sdjNr7+6bG9sYEUkO5YmIRKGJJ7UKkidNKsdEJH4xzF6yEOhlZt3NrAUwGpiVtc/9LNyYmQ0hyIgtEb80Eckz5YmIRCGm2dViyRP15IgkUBzjXt19p5ldBswFmgGT3X2FmY0Ll08k+HK9H5jZTuBTYLS7Z3cZi0iKKE9EJApxXZMTV540WOSEVdM5wIHu/ksz+xqwn7trLnqRGDVx3Gu93H02MDvrsYkZP48Hxke+45DyRKQwlCciEoU4sgTiyZNcWno7cBQwJrz/IcFc1iIijaU8EZGoKE9EpE65DFc7wt0PN7OlAO7+fjheTkRiVKQzVilPRApAeSIiUUhTluRS5Oyw4JtIHcDMOgBfxNoqEUlVkDSC8kSkAJQnIhKFNGVJLkXO74FHgI5mdh3BhT//GWurREpcjF+4VWjKE5E8U56ISBTSliUNFjnufp+ZLQa+QfBlPae5+8uxt0ykxKUpSHKlPBEpDOWJiEQhTVmSy+xqXwM+Af6a+Zi7vxlnw0RKXVwzmBSS8kSkMJQnIhKFNGVJLsPVHiMY72rA7kB34BWgX4ztEilpaesSbgTliUieKU9EJAppy5Jchqv9n8z7ZnY4cElsLRIRIF1dwrlSnogUhvJERKKQpizJpSenBndfYmaD42iMiHwpTUHSVMoTkfxQnohIFNKUJblck/PDjLtlwOHAu7G1SESAdI17zZXyRKQwlCciEoU0ZUkuPTmtM37eSTAG9uF4miMikL5xr42gPBHJM+WJiEQhbVlSb5ETfslWK3e/Ok/tEZFQmoIkF8oTkcJRnohIFNKUJXUWOWZW7u47wwv5RCTP0hQkDVGeiBSW8kREopCmLKmvJ2cBwfjWZWY2C3gQ+LhqobvPiLltIiUtTUGSA+WJSAEpT0QkCmnKklyuyWkHbAFO4Mv56B1QiIjExMxSdXFfIyhPRPJMeSIiUUhbltRX5HQMZy55iS/Do4rH2ioRSdXZkhwoT0QKSHkiIlFIU5bUV+Q0A1pRMzyqKEREYpamIMmB8kSkgJQnIhKFNGVJfUXOW+7+y7y1RESKmfJERKKiPBGRBtVX5KSnVBMpQmk6W5KDonoxImmjPBGRKKQpS+orcr6Rt1aISA1pu7gvB8oTkQJRnohIFNKWJXUWOe7+Xj4bIiI1pelsSUOUJyKFpTwRkSikKUtymUJaRAogTUEiIsmmPBGRKKQpS1TkiCRUmoJERJJNeSIiUUhTlqRnYJ1IiTGzRt1y3OYIM3vFzNaY2bX1rDfYzCrN7DuRvSARKZg48kRESk9js6SQeaIiRySBqi7ua8wth202AyYAI4G+wBgz61vHejcCcyN+WSJSAHHkiYiUnqZkSa55EsdJWA1XE0moGM5+DAHWuPvacPvTgVHAyqz1LgceBgZH3QARKQz1zohIFOLIkoyTsCcCFcBCM5vl7itrWS/nk7A6XSOSUE3oDm5vZosybhdnbbIzsD7jfkX4WOY+OwOnAxPjfG0ikl8a/ioiUYhpuFr1SVh33w5UnYTNVnUSdlMuG1VPjkhCNeFsyWZ3H1TfJmt5zLPu3wpc4+6VOvMrUjyi/nuO68yriCRbTMcGtZ2EPSJrv1UnYU8gx5EmKnJEEqhq3GvEKoCuGfe7ABuz1hkETK/qGQJOMrOd7v6XqBsjIvkRU55o+KtIifkKWdLezBZl3J/k7pMyN13Lc77ySVgVOSIJFcPZkoVALzPrDmwARgNnZ67g7t0z9j8FeFQFjkj6xZAnsZx5FZFka2KWNDTSJJaTsCpyREqEu+80s8sIho00Aya7+wozGxcu13U4IlKlIGdeRaQkxXISVkWOSELFcVDg7rOB2VmP1Vr
cuPv5kTdARAoihmv8NPxVpATFdGwSy0lYFTkiCaUznyISFQ1/FZEoxHVsEsdJWBU5IglU6G8JFpHiEUeeaPirSOlJ27GJihwRERFpNA1/FZEkU5EjklBpOlsiIsmmPBGRKKQpSyKfOD/t5syZQ+/evenZsyc33HDDLsvdnSuuuIKePXty6KGHsmTJEgBeeeUVBgwYUH1r06YNt956KwBXX301Bx98MIceeiinn346W7duzeMrSo/hw4ezatUqVq9ezTXXXLPL8r322osZM2awfPly/vnPf9KvXz8AdtttN/75z3+ybNkyXnrpJX7+85/nueXxiOMbyqVwGsqWVatWcdRRR7Hbbrtx00031Vh2wQUX0LFjRw455JB8NTf1GsqTNm3aMGvWrOrcOP/886uXXXHFFbz44ou89NJLXHnllXlsdXyUJ8XjySef5LDDDqN///7cfPPNuyx/5ZVXOOGEE9hnn3247bbbqh9/9dVXOfroo6tvnTp1YsKECflseioNHTqUf/zjHzzzzDNceumluyxv3bo1d999N0888QRPPfUUZ511FhAcmzz66KPVj1911VX5bnosGpslhcyT2IocM5tsZpvM7KW49hG1yspKLr30Uh5//HFWrlzJtGnTWLmy5veaPf7446xevZrVq1czadIkfvCDHwDQu3dvli1bxrJly1i8eDEtW7bk9NNPB+DEE0/kpZde4oUXXuCggw7i+uuvz/trS7qysjImTJjAyJEj6du3L2PGjKFPnz411vnxj3/MsmXL6N+/P2PHjq0O788//5wTTjihusAcMWIERxxxRG27SZW0hEg+pDFPMuWSLe3ateP3v/89P/rRj3Z5/vnnn8+cOXPy1dzUyyVPLr30UlauXMmAAQMYOnQoN998M82bN6dfv37867/+K0OGDKF///6ccsop9OzZs0CvJDrKky+lOU8qKyu56qqrmDFjBgsXLuShhx5i1apVNdZp164dv/3tb7niiitqPH7QQQfx7LPP8uyzz/L000+zxx578K1vfSufzU+dsrIyfv3rX3Puuedy/PHHM2rUKHr16lVjnfPOO4/Vq1czbNgwzjzzTH7605/SvHlzPv/8c8466yyGDRvG8OHDGTp0KIcffniBXkl0VOQEpgAjYtx+5BYsWEDPnj058MADadGiBaNHj2bmzJk11pk5cyZjx47FzDjyyCPZunUrb731Vo11nnrqKXr06EG3bt0AGDZsGOXlwcjAI488koqKivy8oBQZMmQIa9asYd26dezYsYPp06czatSoGuv07duXp556CgjOVB1wwAF07NgRgI8//hiA5s2b07x5c9yzv64hfdISInkyhZTlSaZcsqVjx44MHjyY5s2b7/L8Y489lnbt2uWruamXS564O61btwagVatWvPfee+zcuZM+ffrw/PPP8+mnn1JZWck//vGP6hNWaaY8qWEKKc2TRYsWceCBB9K9e3datGjBGWecwaOPPlpjnQ4dOjBw4MBas6TKvHnz6N69O1/72tfibnKqDRgwgNdff50333yTHTt2MHPmTIYNG1ZjHXdnzz33BGDPPfdk69at7Ny5E4BPPvkEgPLycsrLy0vy2KQoixx3nw+8F9f247Bhwwa6dv1y2v8uXbqwYcOGRq8zffp0xowZU+s+Jk+ezMiRIyNsdXHo3Lkz69d/+eXZFRUVdO7cucY6y5cv59vf/jYAgwcPplu3bnTp0gUIzrYsXbqUTZs28eSTT7JgwYL8NV5il8Y8yZRLbkh0csmT8ePH06dPHzZu3MiLL77IlVdeibvz0ksvVReVe+yxByeddFKN352kX5rz5K233qrxWe7cufMuJ1pz8dBDD3HmmWdG2bSitP/++9d4f99++23233//GutMmTKFXr16sXjxYv72t7/x05/+tLqYKSsrY+7cuSxfvpynn36apUuX5rX9pU7X5GSorcLOrkAbWmf79u3MmjWr1vC47rrrKC8v55xzzomgtcWltko/+72
+4YYb2HvvvVm6dCmXX345S5curT5b8sUXX3DYYYfRpUsXhgwZUn29Tlql6UyJNCyXbJHo5JInw4cPZ9myZXTq1IkBAwYwfvx4WrduzapVq7jxxht58sknmTNnDsuXL6/OmbRSnhSPKLJk+/btzJ49uyh6KAsh+3cwdOhQVqxYwcCBAxk+fDi//vWvadWqFRAcmwwfPpzBgwczYMAAevfuXYgmR6YpWVKUPTm5MrOLzWyRmS169913C9qWLl267HL2r1OnTo1a5/HHH+fwww9n3333rfG8e+65h0cffZT77rtP/4HUoqKiYpcz3Rs31vzy7A8//JALLriAww47jLFjx9KhQwfWrVtXY50PPviAefPmMWJEKkci1JCWEEmKJGVJtlyyRaKTS558//vfZ8aMGQC89tprrFu3joMPPhgIetwHDhzIcccdx3vvvcfq1avz1/iYKE8aJzNPNm/eXOjmVOvUqVONXuANGzaw3377NWobTzzxBAMGDKge7i11e+utt2r03Oy33368/fbbNdY566yzePzxxwF4/fXXWb9+/S7X8W3bto3nnnuOoUOHxt7muKnIaQR3n+Tug9x9UIcOHQralsGDB7N69WrWrVvH9u3bmT59OqeeemqNdU499VSmTp2Ku/P888/Ttm3bGn8A06ZN22Wo2pw5c7jxxhuZNWsWLVu2zMtrSZuFCxfSq1cvDjjgAJo3b87o0aOZNWtWjXXatm1bPcb4oosuYv78+Xz44Ye0b9+etm3bArD77rvzzW9+c5cLMdMoLSGSFEnKkmy5ZItEJ5c8efPNN/nGN74BBNdD9e7dm7Vr1wLBNQ0AXbt25dvf/jbTpk3L7wuIgfKkcTLzpH379oVuTrWBAwfy2muv8frrr7N9+3YefvhhTj755EZt46GHHuI73/lOTC0sLsuXL6d79+507dqV5s2bM2rUKJ588ska62zYsIFjjjkGgPbt29OjRw/eeOMN2rVrR5s2bYDg2OSYY45hzZo1eX8NUUtTkaPvyclQXl7O+PHjGT58OJWVlVxwwQX069ePiROD7zYbN24cJ510ErNnz6Znz560bNmSu+++u/r5n3zyCU8++SR33HFHje1edtllfP7555x44olAMPlA1TYlUFlZyWWXXcbcuXNp1qwZkydPZuXKlVxyySUA3HHHHfTp04epU6dSWVnJypUrufDCC4FgzOw999xDs2bNKCsr44EHHuCxxx4r5MuJhA40ikcu2fL2228zaNAgtm3bRllZGbfeeisrV66kTZs2jBkzhnnz5rF582a6dOnCL37xi+rPv+wqlzz51a9+xZQpU3jhhRcwM6655hq2bNkCwMMPP8w+++zDjh07uPTSS4ti2n/lSXEoLy/npptu4rTTTuOLL77g3HPPpU+fPtx1110AXHjhhbzzzjsce+yxfPjhh5SVlXH77bezcOFC2rRpwyeffMLf//73GlNLS90qKyv5yU9+wn333UdZWRl//vOfefXVV/ne974HwL333sttt93GLbfcwt/+9jcAfvOb3/D+++/Tp08ffve739GsWTPMjEcffbR68qQ0S1OWWFwzPZjZNGAo0B54B/iZu99V33MGDRrkixYtiqU9sqs0fVCLgbvn/Ib36dPHp06d2qjtDxkyZLG7D2p0w1KgsXmiLMkvZUneNepvXXlSU2Pz5PDDD/f58+fnqXVSNWxU8mPDhg05/603JUugcHkSW0+Ou9c+vZiINKjQXbxJozwRaTrlSU3KE5GmSVuWaLiaSEKlKUhEJNmUJyIShTRlScEnHhAREREREYmSenJEEipNZ0tEJNmUJyIShTRliYockYRKU5CISLIpT0QkCmnKEhU5IgmVpiARkWRTnohIFNKUJbomR0REREREiop6ckQSKG3TNIpIcilPRCQKacsSFTkiCZWmIBGRZFOeiEgU0pQlKnJEEipNQSIiyaY8EZEopClLdE2OSEJVdQvnestxmyPM7BUzW2Nm19ayfJSZvWB
my8xskZkdE/kLE5G8iyNPRKT0NDZLCpkn6skRSaiog8HMmgETgBOBCmChmc1y95UZqz0FzHJ3N7NDgQeAgyNtiIjknQoXEYlCmrJEPTkipWMIsMbd17r7dmA6MCpzBXf/yN09vLsn4IiIiIikjIockQRqYndw+3CIWdXt4qzNdgbWZ9yvCB/L3vfpZrYKeAy4IK7XKCL5kabhJSKSXE3JkkIOp9dwNZGEasKBxmZ3H1TfJmt5bJeeGnd/BHjEzI4FfgV8s7ENEZFkUeEiIlGII0viGk6vIkckoWIIkgqga8b9LsDGulZ29/lm1sPM2rv75qgbIyL5oyJHRKIQU5ZUD6cP91E1nL66yHH3jzLWz2k4vYariSRUDN3BC4FeZtbdzFoAo4FZWfvsaeHGzOxwoAWwJeKXJiJ5lpbhJSKSbGkaTq+eHJGEivpsibvvNLPLgLlAM2Cyu68ws3Hh8onAGcBYM9sBfAp8N2MiAhFJqajzJK7hJSKSbE3MkoIMp1eRI5JAcV386+6zgdlZj03M+PlG4MbIdywiBRNTnsQyvEREkivGiUliGU6v4WoiIiLSWJqtUUSiEstwevXkiCSULhQWkag0IU/am9mijPuT3H1S5iZreY5maxQpcjGNMollOL2KHJGEUpEjIlGJYUp6zdYoUoLiOjaJYzi9hquJiIhIY2m2RhFJNPXkiIiISKNotkYRSToVOSIJpeFqIhIVzdYoIlFI07GJihyRhEpTkIhIsilPRCQKacoSFTkiCRTjXPQiUmKUJyIShbRliSYeEBERERGRoqKeHJGEStPZEhFJNuWJiEQhTVmiIkckodIUJCKSbMoTEYlCmrJEw9VERERERKSoqCdHJKHSdLZERJJNeSIiUUhTlqjIEUmoNAWJiCSb8kREopCmLFGRI5JAaZumUUSSS3kiIlFIW5bomhwRERERESkq6skRSag0nS0RkWRTnohIFNKUJSpyRBIqTUEiIsmmPBGRKKQpS1TkiCRUmoJERJJNeSIiUUhTluiaHBERERERKSrqyRFJqDSdLRGRZFOeiEgU0pQlKnJEEiht0zSKSHIpT0QkCmnLEg1XEykhZjbCzF4xszVmdm0ty88xsxfC27Nm1r8Q7RQRERH5KhLVk7N48eLNZvZGodvRBO2BzYVuRIlI63vdrdANMLNmwATgRKACWGhms9x9ZcZq64Dj3P19MxsJTAKOyH9rvxplieQore93wfOklCxdunRz69atlSfSkLS+30WbJ4kqcty9Q6Hb0BRmtsjdBxW6HaWglN7rGLqEhwBr3H1tuP3pwCigushx92cz1n8e6BJ1I/JBWSK5KKX3O01DTJJGeSK5KJX3O01ZkqgiR0S+FEOQdAbWZ9yvoP5emguBx6NuhIjkX5oOTEQkudKUJSpyRBKqCUHS3swWZdyf5O6TMjdZy3O8jn0fT1DkHNPYRohI8qTpwEREkitNWaIiJxqTGl5FIqL3um6bG+gqrwC6ZtzvAmzMXsnMDgX+CIx09y3RNlEaoM93fun9lmKmz3d+6f3+CsxsBHAb0Az4o7vfkLX8HOCa8O5HwA/cfXl921SRE4Gss+USo1J6r2M4W7IQ6GVm3YENwGjg7Kx9fg2YAZzr7q9G3QCpXyl9vpOglN7vNJ19lWiU0uc7CUrl/Y4jS+KaGElFjkgCxTEXvbvvNLPLgLkEZ0omu/sKMxsXLp8I/BTYB7g93P/OUriQUqSYpe27LUQkmWLMklgmRtL35HwFDX3niETHzCab2SYze6nQbUkzd5/t7ge5ew93vy58bGJY4ODuF7n73u4+ILypwMkT5Un+KE+kmClL8kt5EonaJkbqXM/6OU2MpCKniTK61kYCfYExZta3sK0qalOAEYVuRD5VnTHJ9SbppTzJuykoT5QnRUhZUhBTKKE8aWyWhHnS3swWZdwuzt5sLbtqaGKka2pbnklFTtNVd625+3agqmtNYuDu84H3Ct2OfNJBSUlRnuSR8iSaPGm
ox8DMzjGzF8Lbs2bWP/IXJtmUJXlWannSxCJns7sPyrhlX7/U2ImRRuUyMZKKnKZrbNeaiEhdlCeSKjn2GFRdKHwo8Cs0+1Q+KEskjaonRjKzFgQTI83KXMGaMDGSJh5oupy71kSaQr0zJUV5IrGKIU9iuVBYvjJlicQqjmOTuCZGUpHTdDl1rYmI5EB5ImlTW49BfdO55nShsHxlyhJJJXefDczOemxixs8XARc1Zpsqcpquwe8cEWkqXWdTcpQnEpsm5kl7M1uUcX9S1jj6plwofExjGyGNpiyR2KTt2ETX5DSRu+8EqrrWXgYecPcVhW1V8TKzacBzQG8zqzCzCwvdJpGoKE/yS3mSk4JcKCxfjbIk/5QnyaWenK+gtq41iYe7jyl0G/ItTWdL5KtTnuSP8iQSDfYYNOVCYfnqlCX5VWp5kqZjExU5IiIi0ihxXSgsIhIVFTkiIiLSaHFcKCwiEhUVOSIJlaYuYRFJNuWJiEQhTVmiiQdERERERKSoqCdHJKHSdLZERJJNeSIiUUhTlqgnJ0/MrNLMlpnZS2b2oJm1/ArbmmJm3wl//qOZ9a1n3aFmdnQT9vG6mbXP9fGsdT5q5L5+bmY/amwbRUqV8qTe9ZUnIo2gPKl3feVJiqnIyZ9P3X2Aux8CbAfGZS40s2ZN2ai7X+TuK+tZZSjQ6BCRwqv60q1cb1JSlCfSKMoTqYfyRHLW2CwpZJ6oyCmMp4Ge4VmM/zGz+4EXzayZmf3WzBaa2QtmdgmABcab2UozewzoWLUhM5tnZoPCn0eY2RIzW25mT5nZAQRh9e/hWZqvm1kHM3s43MdCM/uX8Ln7mNkTZrbUzO6g9m+zrsHM/mJmi81shZldnLXs5rAtT5lZh/CxHmY2J3zO02Z2cCTvpkhpU54oT0SiojxRnhQNXZOTZ2ZWDowE5oQPDQEOcfd14R/iB+4+2Mx2A/7XzJ4ADgN6A/8H2BdYCUzO2m4H4E7g2HBb7dz9PTObCHzk7jeF690P/M7dn7Hgi9rmAn2AnwHPuPsvzexkoEYo1OGCcB97AAvN7OHwG633BJa4+1Vm9tNw25cBk4Bx7r7azI4AbgdOaMLbWPRMZ1MlB8oT5UkulCeSC+WJ8qQhacsSFTn5s4eZLQt/fhq4i6CbdoG7rwsfHwYcauF4VqAt0As4Fpjm7pXARjP7ey3bPxKYX7Utd3+vjnZ8E+ib8SFtY2atw318O3zuY2b2fg6v6QozOz38uWvY1i3AF8Cfw8fvBWaYWavw9T6Yse/dctiHiOxKeaI8EYmK8kR5UpRU5OTPp+4+IPOB8I/p48yHgMvdfW7WeicB3sD2LYd1IBiieJS7f1pLW3J5ftX6QwkC6Sh3/8TM5gG717G6h/vdmv0eiEiTKE+UJyJRUZ4oT4qSrslJlrnAD8ysOYCZHWRmewLzgdEWjIndHzi+luc+BxxnZt3D57YLH/8QaJ2x3hMEXbOE6w0If5wPnBM+NhLYu4G2tgXeDwPkYIIzNVXKgKqzPWcTdDNvA9aZ2ZnhPszM+jewj5JmKbmwTxJLeSLVlCfyFSlPBNDEA9J0fyQYz7rEzF4C7iDobXsEWA28CPw38I/sJ7r7uwTjVGeY2XK+7I79K3C6hRf2AVcAgyy4cHAlX86i8gvgWDNbQtAt/WYDbZ0DlJvZC8CvgOczln0M9DOzxQRjWn8ZPn4OcGHYvhXAqBzeExFpGuWJiERFeSKpY+459wCKSJ4MHDjQn3vuuUY9Z7fddlvs7oNiapKIpJTyRESi0JQsgcLliXpyRERERESkqKjIEUmoOMa8WvBdBa+Y2Rozu7aW5Qeb2XNm9rnpW55FikZaxtCLSLKl6Zocza4mUiIs+NbqCcCJQAXBdwfMyvpG6vcIxkWflv8WioiIiERDPTkipWMIsMbd17r7dmA6WRdXuvsmd18I7ChEA0VERESioJ4ckYSKoYu3M7A+434FcETUOxGR5NEQNBGJQpqyREWOSPFob2a
LMu5PcvdJGfdrSyZNrygiIiJFR0WOSAI18WK9zQ1M0VgBdM243wXY2NidiEi6FPriXxEpDmnLEl2TI1I6FgK9zKy7mbUARgOzCtwmERERKXFxzP6qnhyRhIr6bIm77zSzy4C5QDNgsruvMLNx4fKJZrYfsAhoA3xhZv8X6Ovu2yJtjIjkVZrOvopIcsWRJXHN/qoiR6SEuPtsYHbWYxMzfn6bYBibiIiISD5Uz/4KYGZVs79WFznuvgnYZGYn57pRFTkiCaUzryISFeWJiEQhpiyJZfZXFTkiIiIiIhKXgsz+qiJHJKF05lVEoqI8EZEoNDFLCjL7q2ZXExERERGRQoll9ldz13cBiiSNmc0B2jfyaZvdfUQc7RGR9FKeiEgUmpglkEOemNlJwK18OfvrdfXN/gp8RAOzv6rIERERERGRoqLhaiIiIiIiUlRU5IiIiIiISFFRkSMiIiIiIkVFRY6IiIiIiBQVFTkiIiIiIlJUVOSIiIiIiEhRUZEjIiIiIiJFRUWOiIiIiIgUlf8fK+dhsZX85XgAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Confusion matrices (normalized) for customized thresholds\n", "cm_norm_03 = confusion_matrix(y_test, y_pred_03, normalize=\"true\")\n", "cm_norm_07 = confusion_matrix(y_test, y_pred_07, normalize=\"true\")\n", "fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(14, 5))\n", "fig.suptitle('Normalized confusion matrix for various thresholds')\n", "ax1.title.set_text('Threshold 0.3')\n", "ax2.title.set_text('Threshold 0.5')\n", "ax3.title.set_text('Threshold 0.7')\n", "ConfusionMatrixDisplay(cm_norm_03).plot(cmap=plt.cm.Greys, ax=ax1)\n", "ConfusionMatrixDisplay(cm_norm).plot(cmap=plt.cm.Greys, ax=ax2)\n", "ConfusionMatrixDisplay(cm_norm_07).plot(cmap=plt.cm.Greys, ax=ax3)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### On prediction probabilities\n", "\n", "- Probabilistic, e.g., Naïve Bayes \n", "\n", "- Non-probabilsitic, e.g., k-Nearest Neighbors (KNN), SVM\n", " - \"probabilities\" may be inferred, e.g., based on distance from decision boundaries\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Trade-off: Precision & Recall \n", "\n", "Relationship: Selected and relevant elements | Precision and Recall\n", ":-------------------------:|:-------------------------:\n", "\"bigpic\" | \"Sampling\"\n", "Image adapted from https://en.wikipedia.org/wiki/Precision_and_recall" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- When trying to detect malignant tumors, will you prioritize precision or recall?\n", "\n", "- When trying to detect fraudulent credit-card transactions in real-time, will you prioritize precision or recall?\n", "\n", "- When making recommendations for jobs, what will give a person a better user experience?\n", " * Recall earlier discussions\n", " * Data Jujitsu" ] }, { 
"cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Precision recall curve" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "precision: [0.4805 0.45368421 0.42444444 0.39235294 0.35875 0.32133333\n", " 0.28357143 0.23769231 0.18583333 0.14454545 0.108 0.08333333\n", " 0.05875 0.05142857 0.04166667 0.032 0.0275 0.01666667\n", " 0.02 0.02 1. ] \n", "\n", " recall: [1. 0.89698231 0.7950052 0.69406868 0.59729448 0.50156087\n", " 0.41311134 0.32154006 0.23204995 0.16545265 0.11238293 0.0780437\n", " 0.04890739 0.03746098 0.02601457 0.01664932 0.01144641 0.00520291\n", " 0.00416233 0.00208117 0. ]\n" ] } ], "source": [ "# Precision recall curve\n", "from sklearn.metrics import precision_recall_curve\n", "from sklearn.metrics import PrecisionRecallDisplay\n", "predictions_prob=[x[0] for x in model3.predict_proba(cv3.transform(X_test))]\n", "precision, recall, _ = precision_recall_curve(y_test, predictions_prob)\n", "print(\"precision:\",precision[::100],\"\\n\\n\", \"recall:\",recall[::100])" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "scrolled": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "PrecisionRecallDisplay.from_estimator(model3,cv3.transform(X_test),y_test)\n", "# What's the ideal point on/shape of the precision recall curve?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### $F$-score\n", "\n", "- Harmonic mean of precision and recall: $F_1=2 \\cdot \\frac{precision \\cdot recall}{precision + recall}$\n", "- If recall is considered $\\beta$ times as important as precision: $F_{\\beta} = (1+\\beta^2) \\cdot \\frac{precision \\cdot recall}{(\\beta^2 \\cdot precision) + recall}$\n", "\n", "Recall: We used it in the Network Science module, to quantify the quality of clusters" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.89 0.86 0.88 1039\n", " 1 0.85 0.89 0.87 961\n", "\n", " accuracy 0.87 2000\n", " macro avg 0.87 0.87 0.87 2000\n", "weighted avg 0.87 0.87 0.87 2000\n", "\n" ] } ], "source": [ "print(classification_report(y_test,predictions))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Macro and Micro-averages\n", "\n", "\n", "* Macro: \n", " * Metric is computed independently for each class. \n", " * Each class is given equal weight when averaging. \n", " * Classes with small population have disproportionate influence\n", "\n", "* Micro: Each instance is given equal weight, and the metric is computed across all samples. 
\n", " * In the event of class imbalance, class sizes have proportional influence\n", " " ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.8729491796718687\n", "0.8729999999999999\n" ] } ], "source": [ "# We are using nearly balanced data (1039 vs. 961 test samples), so we don't expect to see much difference in this example\n", "# But here's how to compute the macro/micro metrics, using the average parameter.\n", "# Check the manual for a few other choices, e.g., weighted\n", "print(f1_score(y_test, predictions, average='macro'))\n", "print(f1_score(y_test, predictions, average='micro'))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Receiver operating characteristic (ROC) & Area Under Curve (AUC)\n", "Etymology: The method originated with operators of military radar receivers \n", "\n", "> The ROC curve plots the true positive rate (TPR), a.k.a. Recall or Sensitivity, along the Y-axis, and the false positive rate (FPR) along the X-axis, by varying decision thresholds. \n", "> - Determining the ROC in the multi-class case is non-trivial\n", "\n", "> ROC helps discard suboptimal models\n", "> ![ROC-zoomed.png](pics/ROCZoom.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "> AUC statistic is used for model comparison\n", "\n", "
\"ROC/AUC\"
\n", "Image source: Wikipedia article on ROC
" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from sklearn.metrics import roc_curve, auc\n", "# following code is adapted from \n", "# https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html\n", "y_score = model3.decision_function(cv3.transform(X_test))\n", "# Compute ROC curve and ROC area for each class\n", "fpr, tpr, _ = roc_curve(y_test, y_score)\n", "roc_auc = auc(fpr, tpr)" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure()\n", "plt.plot(fpr, tpr, color='dodgerblue',\n", " lw=4, label='AUC = %0.3f' % roc_auc)\n", "plt.plot([0, 1], [0, 1], color='firebrick', lw=2, linestyle='--')\n", "plt.xlim([0, 1.01])\n", "plt.ylim([0, 1.01])\n", "plt.xlabel('False Positive Rate')\n", "plt.ylabel('True Positive Rate')\n", "plt.title('Receiver operating characteristic')\n", "plt.legend(loc=\"lower right\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy', 'completeness_score', 'explained_variance', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'fowlkes_mallows_score', 'homogeneity_score', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_weighted', 'max_error', 'mutual_info_score', 'neg_brier_score', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_gamma_deviance', 'neg_mean_poisson_deviance', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_median_absolute_error', 'neg_root_mean_squared_error', 'normalized_mutual_info_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'rand_score', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc', 'roc_auc_ovo', 'roc_auc_ovo_weighted', 'roc_auc_ovr', 'roc_auc_ovr_weighted', 'top_k_accuracy', 'v_measure_score']\n" ] } ], "source": [ "# measures supported in sklearn\n", "from sklearn.metrics import SCORERS\n", "print(sorted(list(SCORERS.keys())))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Multi-class data\n", "\n", "\n", "- Some classifiers can natively support 
multiple classes\n", " - Check more details at https://scikit-learn.org/stable/modules/multiclass.html\n", "- Some classifiers can only handle binary classes natively\n", " - They can be re-purposed for multi-class classification.\n", " * One versus Rest: N classifiers \n", " * Caution: An implicit source of imbalance\n", " * One versus One: ${N \\choose 2}$ classifiers\n", "\n", "One versus Rest | One versus One\n", ":-------------------------:|:-------------------------:\n", "\"OVR\" | \"OVO\"" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "{False: 5, True: 145}" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Scikit-learn allows to choose, e.g., SVC uses decision_function_shape{‘ovo’, ‘ovr’}, default=’ovr’\n", "# An alternative is to explicitly use the multiclass library \n", "from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.svm import LinearSVC\n", "from sklearn import datasets\n", "iris_X, iris_y = datasets.load_iris(return_X_y=True)\n", "iris_X = StandardScaler().fit_transform(iris_X)\n", "# Caution: I have not used train/test splits in this example\n", "OvR=OneVsRestClassifier(LinearSVC(random_state=0)).fit(iris_X, iris_y).predict(iris_X)\n", "OvO=OneVsOneClassifier(LinearSVC(random_state=0)).fit(iris_X, iris_y).predict(iris_X)\n", "unique, counts = np.unique(OvR==OvO, return_counts=True)\n", "dict(zip(unique, counts))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Grid search\n", "Exploring the parameter space to identify good/optimal parameters\n", "- Parameter choices depend on the algorithm being used \n", "- See Scikit-learn examples for more details and sophisticated use cases. 
" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "GridSearchCV(estimator=SVC(),\n", " param_grid={'C': range(1, 11, 2), 'kernel': ('linear', 'rbf')})" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# A very basic example of Grid search \n", "from sklearn.ensemble import RandomForestClassifier \n", "from sklearn.model_selection import GridSearchCV\n", "iris = datasets.load_iris()\n", "iris_X, iris_y = datasets.load_iris(return_X_y=True)\n", "parameters = {'kernel':('linear', 'rbf'), 'C':range(1,11,2)}\n", "# Remember: Parameter choices depend on the algorithm being used\n", "svc = SVC()\n", "clf = GridSearchCV(estimator=svc, param_grid=parameters)\n", "clf.fit(iris_X, iris_y)" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_C', 'param_kernel', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score']\n" ] } ], "source": [ "results=clf.cv_results_\n", "# What all is available in the results?\n", "print(list(results.keys()))" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "array([0.98 , 0.96666667, 0.97333333, 0.98 , 0.98 ,\n", " 0.98666667, 0.97333333, 0.98666667, 0.97333333, 0.98 ])" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results['mean_test_score']" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "array([ 3, 10, 7, 3, 3, 1, 7, 1, 7, 3])" ] }, 
"execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results['rank_test_score']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Model tuning and evaluation wrap-up\n", "\n", "- Our discussion was anchored around supervised learning, and for Classification task in particular\n", " * Metrics for Regression are simpler, e.g., r2 score, mean absolute error, etc\n", " * For unsupervised learning, e.g., Clustering, some of the measures we saw here, e.g., F-score can be repurposed, if (a big IF!) the \"ground truth\" is known. \n", "- While many measures exist and several/all of them can be used to understand the general behaviour of the models, final evaluation and optimization may need to be use-case driven.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Transformations\n", "- Revisting representation phase of the ML-lifecycle\n", " - We have already discussed some basics: Data cleaning, removal of samples with missing values, syntactic conversion of information so that the algorithms can \"consume\" the data (e.g., CountVectorizer) \n", " - Standardization: Normalization, scaling, etc.\n", " * See sklearn.preprocessing\n", " * StandardScaler: Standardize features by removing the mean and scaling to unit variance.\n", " * MinMaxScaler: Transform features by scaling each feature to a given range." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Transformations\n", "- Revisting representation\n", " - Data embedding/dimensionality reduction: Principal Component Analysis (PCA), Linear discriminant analysis (LDA), manifold, etc.\n", " \n", "Recommended reading: Blog showcasing multiple dimensionality reduction techniques. 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Philosophical tangent: PCA vs LDA \n", "PCA | LDA\n", ":-------------------------:|:-------------------------:\n", "\"bigpic\" | \"Sampling\"\n", "Image source https://sebastianraschka.com/. See also scikit-learn PCA & LDA usage example." ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# The following example of using PCA, Standardization and make_pipeline is adapted from \n", "# https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py\n", "# It has been made available under License: BSD 3 clause by\n", "# Tyler Lanigan and Sebastian Raschka \n", "\n", "from sklearn.preprocessing import StandardScaler, MinMaxScaler\n", "from sklearn.decomposition import PCA\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn import metrics\n", "from sklearn.datasets import load_wine\n", "from sklearn.pipeline import make_pipeline\n", "RANDOM_STATE = 42\n", "FIG_SIZE = (10, 4)" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "features, target = load_wine(return_X_y=True)\n", "# 2:1 Train-Test split\n", "X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(features, target,\n", " test_size=0.33,\n", " random_state=RANDOM_STATE) \n", "# Fit the raw data and predict using GNB following PCA.\n", "unscaled_clf = make_pipeline(PCA(n_components=2), GaussianNB())\n", "unscaled_clf.fit(X_train_wine, y_train_wine)\n", "pred_test2 = unscaled_clf.predict(X_test_wine)\n", "# Using PCA with 3 components\n", "unscaled_clf = make_pipeline(PCA(n_components=3), GaussianNB())\n", "unscaled_clf.fit(X_train_wine, y_train_wine)\n", "pred_test3 = unscaled_clf.predict(X_test_wine)" ] }, { 
"cell_type": "code", "execution_count": 85, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PCA (2 components) + GNB \n", " precision recall f1-score support\n", "\n", " 0 0.95 1.00 0.98 20\n", " 1 0.78 0.75 0.77 24\n", " 2 0.60 0.60 0.60 15\n", "\n", " accuracy 0.80 59\n", " macro avg 0.78 0.78 0.78 59\n", "weighted avg 0.79 0.80 0.79 59\n", "\n", "PCA (3 components) + GNB \n", " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 20\n", " 1 0.88 0.92 0.90 24\n", " 2 0.86 0.80 0.83 15\n", "\n", " accuracy 0.92 59\n", " macro avg 0.91 0.91 0.91 59\n", "weighted avg 0.91 0.92 0.91 59\n", "\n" ] } ], "source": [ "print(\"PCA (2 components) + GNB \\n\", metrics.classification_report(y_test_wine, pred_test2))\n", "print(\"PCA (3 components) + GNB \\n\", metrics.classification_report(y_test_wine, pred_test3))" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Standardization + PCA (2 components) + GNB \n", " precision recall f1-score support\n", "\n", " 0 1.00 0.95 0.97 20\n", " 1 0.96 1.00 0.98 24\n", " 2 1.00 1.00 1.00 15\n", "\n", " accuracy 0.98 59\n", " macro avg 0.99 0.98 0.98 59\n", "weighted avg 0.98 0.98 0.98 59\n", "\n" ] } ], "source": [ "# Fit to data and predict using a pipeline of scaling/standardization, PCA and GNB.\n", "std_clf = make_pipeline(StandardScaler(), PCA(n_components=2), GaussianNB())\n", "# try using MinMaxScaler and see what you get!\n", "std_clf.fit(X_train_wine, y_train_wine)\n", "pred_test_std = std_clf.predict(X_test_wine)\n", "print(\"Standardization + PCA (2 components) + GNB \\n\", metrics.classification_report(y_test_wine, pred_test_std))" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"\n", "PC 1 without scaling:\n", " [ 1.77738063e-03 -8.13532033e-04 1.41672133e-04 -5.43218358e-03\n", " 1.96719103e-02 1.04099788e-03 1.54504422e-03 -1.15638097e-04\n", " 6.52261761e-04 2.26846944e-03 1.61421529e-04 7.81011292e-04\n", " 9.99784964e-01]\n", "\n", "PC 1 with scaling:\n", " [ 0.14733424 -0.25027499 -0.01252058 -0.23440896 0.15738948 0.39369045\n", " 0.41565632 -0.27414911 0.33265958 -0.10517746 0.29234204 0.38195327\n", " 0.28245765]\n" ] } ], "source": [ "# Extract PCA from pipeline\n", "pca = unscaled_clf.named_steps['pca']\n", "pca_std = std_clf.named_steps['pca']\n", "\n", "# Show first principal components\n", "print('\\nPC 1 without scaling:\\n', pca.components_[0])\n", "print('\\nPC 1 with scaling:\\n', pca_std.components_[0])\n", "\n", "# Use PCA without and with scale on X_train data for visualization.\n", "X_train_transformed = pca.transform(X_train_wine)\n", "scaler = std_clf.named_steps['standardscaler']\n", "X_train_std_transformed = pca_std.transform(scaler.transform(X_train_wine))" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=FIG_SIZE)\n", "for l, c, m in zip(range(0, 3), ('blue', 'red', 'green'), ('^', 's', 'o')):\n", " ax1.scatter(X_train_transformed[y_train_wine == l, 0],\n", " X_train_transformed[y_train_wine == l, 1],\n", " color=c, label='class %s' % l, alpha=0.5, marker=m)\n", "for l, c, m in zip(range(0, 3), ('blue', 'red', 'green'), ('^', 's', 'o')):\n", " ax2.scatter(X_train_std_transformed[y_train_wine == l, 0],\n", " X_train_std_transformed[y_train_wine == l, 1],\n", " color=c, label='class %s' % l, alpha=0.5, marker=m) \n", "ax1.set_title('Training dataset after PCA')\n", "ax2.set_title('Standardized training dataset after PCA')\n", "for ax in (ax1, ax2):\n", " ax.set_xlabel('1st principal component')\n", " ax.set_ylabel('2nd principal component')\n", " ax.legend(loc='upper right')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Ungraded task 6.3\n", "\n", "Consider the Iris dataset from sklearn. \n", "Build and evaluate classifiers by following the various principles discussed in this module, e.g., taking into account class imbalance (if any), difference in scales of features, number of dimensions, choice of classifier algorithms and parameters there-in. \n", "\n", "Share your code along with a summary of the classification report (and any other measures and tools, e.g. plotting the results from grid search) achieved by the \"best\" model you can train. \n", "\n", "> To get started: from sklearn.datasets import load_iris " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", " That's it folks!
\n", "\"Do\n", "
\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }