# Scrape & extract data, Store documents 
<br><br>
<div style="background-color:rgba(128, 128, 0, 0.1); text-align:left; vertical-align: middle; padding:20px 0;">
<p style="font-size:134%;color:Deep Teal;">SC 4125: Developing Data Products</p>
    <p style="font-size:100%;color:Olive;">Module-3: Data from crawls & APIs; NoSQL/MongoDB Data Store</p><br>

    
<br> 
by <a href="https://personal.ntu.edu.sg/anwitaman/" style="font-size:100%;color:Deep Teal;">Anwitaman DATTA</a><br>
School of Computer Science and Engineering, NTU Singapore.        
</div>

#### Teaching material
- <a href="M3-SoupMong.slides.html">.html</a> deck of slides
- <a href="M3-SoupMong.ipynb">.ipynb</a> Jupyter notebook

### Disclaimer/Caveat emptor

- Non-systematic and non-exhaustive review
- Example solutions are not necessarily the most efficient or elegant, let alone unique

### Positioning this module in the big picture

<img src="pics/Module-3-In-BigPic.png" alt="Big picture" width="500"/><br>

#### BeautifulSoup

<img src="pics/souplogo.png" alt="BeautifulSoup" width="300"/><br>

* BeautifulSoup (BS4) is a Python library for pulling data out of HTML and XML files.
* It is modular: it relies on other libraries, such as `requests`, to fetch pages over HTTP (alternatively, you can fetch the data whichever way you like and then process the locally stored copy with BS4), and it can work with different parsers, e.g., lxml or html5lib.
* Useful resources:
    - Doc: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
    - A nice online tutorial: https://www.youtube.com/watch?v=ng2o98k983k 
* Contrast with Scrapy: https://scrapy.org/ (Check out Scrapy on your own)
    - Scrapy is an application framework for writing "web spiders" that crawl websites and extract data from them. It has its own built-in data extraction mechanism (<a href="https://docs.scrapy.org/en/latest/topics/selectors.html">selectors</a>), but you may also use BeautifulSoup for the extraction instead, after having carried out the crawl and download.
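
As a minimal offline sketch of the "fetch first, parse later" workflow mentioned above, the snippet below parses an HTML string that is already in memory, so no HTTP request is involved. The markup, the `id` value and the variable names are made up for illustration. `html.parser` is Python's built-in parser, chosen here only so that the sketch does not depend on lxml or html5lib being installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML standing in for a page saved to disk earlier
local_html = """
<html>
  <head><title>Demo profile</title></head>
  <body>
    <div id="projectsDiv">
      <ul><li>Project A</li><li>Project B</li></ul>
    </div>
  </body>
</html>
"""

# 'html.parser' ships with Python; 'lxml' or 'html5lib' can be swapped in if installed
demo_soup = BeautifulSoup(local_html, "html.parser")

# Page title, and the text of every list item inside the (made-up) projects div
demo_title = demo_soup.title.text
demo_projects = [li.text for li in demo_soup.find("div", id="projectsDiv").find_all("li")]

print(demo_title)
print(demo_projects)
```

The same calls (`title`, `find`, `find_all`) are used on a live page in the cells that follow.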


In [1]:
### BeautifulSoup: Install & import libraries as needed
#!pip install --upgrade beautifulsoup4
#! pip install lxml 
#! pip install html5lib # an alternate parser
#! pip install requests
from bs4 import BeautifulSoup
import requests
#import bs4; print(bs4.__version__)  # the bs4 name itself must be imported to check the version

In [2]:
# Since the individual project will use DR-NTU academic profile as one of the data sources,
# let's see some examples from there. 
# As an example, I am using the data from the profile of SCSE's chair as of August 2021.

soup_URL="https://dr.ntu.edu.sg/cris/rp/rp00084"
soup_source = requests.get(soup_URL).text
soup = BeautifulSoup(soup_source,'lxml')
print(soup)

<!DOCTYPE html>
<html>
<head>
<title>Prof Miao Chun Yan | Nanyang Technological University (DR-NTU)</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="DSpace CRIS-6.3.0-SNAPSHOT" name="Generator"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="Dr. Jaclyn, Chunyan Miao is a Full Professor in the School of Computer Engineering at Nanyang Technological University (NTU). Her research focus is on ..." name="description"/>
<link href="https://dr.ntu.edu.sg/rs/resourcesync.xml" rel="resourcesync sitemap" type="application/xml"/>
<link href="/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="/static/css/jquery-ui-1.10.3.custom/redmond/jquery-ui-1.10.3.custom.css" rel="stylesheet" type="text/css"/>
<link href="/css/researcher.css" rel="stylesheet" type="text/css"/>
<link href="/css/jdyna.css" rel="stylesheet" type="text/css"/>
<link href="/sta

In [3]:
# prettify() helps print with indentation
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Prof Miao Chun Yan | Nanyang Technological University (DR-NTU)
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="DSpace CRIS-6.3.0-SNAPSHOT" name="Generator"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="Dr. Jaclyn, Chunyan Miao is a Full Professor in the School of Computer Engineering at Nanyang Technological University (NTU). Her research focus is on ..." name="description"/>
  <link href="https://dr.ntu.edu.sg/rs/resourcesync.xml" rel="resourcesync sitemap" type="application/xml"/>
  <link href="/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <link href="/static/css/jquery-ui-1.10.3.custom/redmond/jquery-ui-1.10.3.custom.css" rel="stylesheet" type="text/css"/>
  <link href="/css/researcher.css" rel="stylesheet" type="text/css"/>
  <link href="/css/jdyna.css" rel="stylesheet" type

In [4]:
# Extract the HTML page title from the soup object
match_title=soup.title
print(match_title)

<title>Prof Miao Chun Yan | Nanyang Technological University (DR-NTU)</title>


In [5]:
match_title_txt=soup.title.text
print(match_title_txt)

Prof Miao Chun Yan | Nanyang Technological University (DR-NTU)


In [6]:
print(soup.find('div').prettify())
# soup.div==soup.find('div') ## these are equivalent
# find() returns the first matching instance

<div class="container">
 <br/>
 <div class="navbar-header">
  <button class="navbar-toggle" data-target=".navbar-collapse" data-toggle="collapse" type="button">
   <span class="icon-bar">
   </span>
   <span class="icon-bar">
   </span>
   <span class="icon-bar">
   </span>
  </button>
 </div>
 <div class="col-sm-5 col-md-5 hidden-xs">
  <div class="col-sm-12" id="brandname">
   <a href="https://www.ntu.edu.sg" style="text-decoration: none">
    <img class="hidden-xs hidden-sm" src="/image/hires_logo_bw_school.jpg" style="max-width:181px;"/>
   </a>
  </div>
 </div>
 <div class="col-sm-7 col-md-7 hidden-xs">
  <div class="pull-right nowrap">
   <a class="home_name" href="/" target="_self">
    DR-NTU (Digital Repository of NTU)
   </a>
  </div>
 </div>
 <div class="hidden-md hidden-lg hidden-sm visible-xs">
  <div class="row">
   <div class="col-xs-12" id="brandname">
    <a href="https://www.ntu.edu.sg" style="text-decoration: none">
     <img class="center-block" src="/image/hires_lo

In [7]:
soup.find('div', class_="dynaFieldValue")
# class is a reserved keyword in Python, so BeautifulSoup uses class_ to distinguish
# You can use .div.text to access only the text

<div class="dynaFieldValue" id="biographyDiv">
<div>Dr. Jaclyn, Chunyan Miao is a Full Professor in the School of Computer Engineering at Nanyang Technological University (NTU). Her research focus is on infusing intelligent agents into interactive new media (virtual, mixed, mobile and pervasive media) to create novel experiences and dimensions in game design, interactive narrative and other real world agent systems. She has done significant research work her research areas and published over 30 top quality international conference and journal papers. She believes that intelligence/agent augmentation will have a major impact on future new media systems.</div>
</div>

In [8]:
soup.find('div', class_="dynaFieldValue").text 
# Recall: you can use .strip() to clean-up the \n 

'\nDr. Jaclyn, Chunyan Miao is a Full Professor in the School of Computer Engineering at Nanyang Technological University (NTU). Her research focus is on infusing intelligent agents into interactive new media (virtual, mixed, mobile and pervasive media) to create novel experiences and dimensions in game design, interactive narrative and other real world agent systems. She has done significant research work her research areas and published over 30 top quality international conference and journal papers. She believes that intelligence/agent augmentation will have a major impact on future new media systems.\n'

In [9]:
# Let's find all the projects this faculty is involved in
soup.find('div', id="currentprojectsDiv")

<div class="dynaFieldValue" id="currentprojectsDiv">
<ul>
<li>ADL+: A Digital Toolkit For Cognitive Assessment And Intervention</li>
<br/>
<li>Alibaba-NTU Singapore Joint Research Institute</li>
<br/>
<li>An End-to-end Adaptive AI-Assisted 3H Care (A3C) System</li>
<br/>
<li>Joint NTU-Alibaba Research Institute</li>
<br/>
<li>Joint NTU-WeBank Research Centre on Fintech</li>
<br/>
<li>Joint SDU-NTU Centre for Artificial Intelligence Research(C-FAIR) - Smart Community Research and Talent Programme</li>
<br/>
<li>Monetary Academic Resources for Research</li>
<br/>
<li>Senior-friendly Persuasive AI Companions for an Aging Population</li>
<br/>
<li>The Joint NTU-WeBank Research Centre Of Eco-Intelligent Applications ("THEIA")</li>
<br/>
<li>TrustFUL: Trustworthy Federated Ubiquitous Learning</li>
<br/>
<li>TrustFUL: Trustworthy Federated Ubiquitous Learning (SCSE)</li>
</ul>
</div>

In [10]:
soup.find('div', id="currentprojectsDiv").find_all('li')

[<li>ADL+: A Digital Toolkit For Cognitive Assessment And Intervention</li>,
 <li>Alibaba-NTU Singapore Joint Research Institute</li>,
 <li>An End-to-end Adaptive AI-Assisted 3H Care (A3C) System</li>,
 <li>Joint NTU-Alibaba Research Institute</li>,
 <li>Joint NTU-WeBank Research Centre on Fintech</li>,
 <li>Joint SDU-NTU Centre for Artificial Intelligence Research(C-FAIR) - Smart Community Research and Talent Programme</li>,
 <li>Monetary Academic Resources for Research</li>,
 <li>Senior-friendly Persuasive AI Companions for an Aging Population</li>,
 <li>The Joint NTU-WeBank Research Centre Of Eco-Intelligent Applications ("THEIA")</li>,
 <li>TrustFUL: Trustworthy Federated Ubiquitous Learning</li>,
 <li>TrustFUL: Trustworthy Federated Ubiquitous Learning (SCSE)</li>]

In [11]:
# Chain it with a find_all over 'li' list-item tag, and extract the text of the result to create a list
[x.text for x in soup.find('div', id="currentprojectsDiv").find_all('li')]

['ADL+: A Digital Toolkit For Cognitive Assessment And Intervention',
 'Alibaba-NTU Singapore Joint Research Institute',
 'An End-to-end Adaptive AI-Assisted 3H Care (A3C) System',
 'Joint NTU-Alibaba Research Institute',
 'Joint NTU-WeBank Research Centre on Fintech',
 'Joint SDU-NTU Centre for Artificial Intelligence Research(C-FAIR) - Smart Community Research and Talent Programme',
 'Monetary Academic Resources for Research',
 'Senior-friendly Persuasive AI Companions for an Aging Population',
 'The Joint NTU-WeBank Research Centre Of Eco-Intelligent Applications ("THEIA")',
 'TrustFUL: Trustworthy Federated Ubiquitous Learning',
 'TrustFUL: Trustworthy Federated Ubiquitous Learning (SCSE)']

In [12]:
# What's the faculty member's email address?
soup.find('div', id="emailDiv").text.strip()

'ascymiao@ntu.edu.sg'

In [13]:
# What's the faculty member's personal webpage URL?
# Not everyone maintains a homepage,
# and not all who do host it at the canonical NTU personal page address
soup.find('div', id="personalsiteDiv")

<div class="dynaFieldValue" id="personalsiteDiv">
<a href="https://personal.ntu.edu.sg/ascymiao" target="_blank">
<span style="min-width: 40em;">Website</span>
</a>
</div>

In [14]:
soup.find('div', id="personalsiteDiv").a['href'] 
# If you are looking at a single instance, you can access an attribute of a tag
# in a manner analogous to looking up a key in a Python dictionary

'https://personal.ntu.edu.sg/ascymiao'
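
Since not every profile has a personal site, `find()` may return `None`, and chaining `.a['href']` onto it would then raise an exception. The sketch below shows one defensive way to write the lookup. The helper name `extract_href` and the sample markup are hypothetical and not part of the original notebook.

```python
from bs4 import BeautifulSoup

def extract_href(soup, div_id):
    """Return the href of the first link inside <div id=div_id>, or None if absent."""
    div = soup.find("div", id=div_id)
    if div is None or div.a is None:
        return None
    # .get() returns None for a missing attribute, whereas div.a['href'] raises KeyError
    return div.a.get("href")

# Hypothetical markup: one profile with a personal site, one without
with_site = BeautifulSoup(
    '<div id="personalsiteDiv"><a href="https://example.org/home">Website</a></div>',
    "html.parser",
)
without_site = BeautifulSoup('<div id="emailDiv">someone@example.org</div>', "html.parser")

site_found = extract_href(with_site, "personalsiteDiv")
site_missing = extract_href(without_site, "personalsiteDiv")
print(site_found)
print(site_missing)
```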

In [15]:
# How to extract the URLs for all the faculty members listed on a page
# Note that you may need to do some clean-up after this extraction
# Note also that I am scoping the search, so this won't locate all the links on the page
# In particular, in this specific example, we won't find the link to the next page

SCSE_list_url="https://dr.ntu.edu.sg/simple-search?filterquery=ou00030&filtername=school&filtertype=authority&location=researcherprofiles"
new_soup_source = requests.get(SCSE_list_url).text
new_soup = BeautifulSoup(new_soup_source,'lxml')
#print(new_soup)
#new_soup.find("div", class_="discovery-result-results").find_all('a')
[(x.get('href'),x.text) for x in new_soup.find("div", class_="discovery-result-results").find_all('a')]

[('#', 'Full Name'),
 ('/cris/rp/rp01023', 'Guan Cuntai'),
 ('/cris/rp/rp00345', 'Seah Hock Soon'),
 ('/cris/rp/rp00531', 'Lee Bu Sung'),
 ('/cris/rp/rp00707', 'Quek Hiok Chai'),
 ('/cris/rp/rp00700', 'Hui Siu Cheung'),
 ('/cris/rp/rp00693', 'Goh Wooi Boon'),
 ('/cris/rp/rp00691', 'Chan Syin'),
 ('/cris/rp/rp00670', 'Lau Chiew Tong'),
 ('/cris/rp/rp00643', 'Huang Shell Ying'),
 ('/cris/rp/rp00839', 'Vun Chan Hua, Nicholas'),
 ('/cris/rp/rp00841', 'Thambipillai Srikanthan'),
 ('/cris/rp/rp00799', 'Kwoh Chee Keong'),
 ('/cris/rp/rp00964', 'Wentong Cai'),
 ('/cris/rp/rp00963', 'Yeo Chai Kiat'),
 ('/cris/rp/rp00958', 'Lin Feng'),
 ('/cris/rp/rp00991', 'Chia Liang Tien'),
 ('/cris/rp/rp01094', 'Wee Keong NG'),
 ('/cris/rp/rp00169', 'Jagath Chandana Rajapakse'),
 ('/cris/rp/rp00552', 'Sun Aixin'),
 ('/cris/rp/rp00503', 'He Ying'),
 ('/cris/rp/rp00706', 'Anwitaman Datta'),
 ('/cris/rp/rp00683', 'Lin Weisi'),
 ('/cris/rp/rp00834', 'Wai Kin Adams Kong'),
 ('/cris/rp/rp00274', 'Alexei Sourin'),


#### RESTful APIs

Various ways to obtain/exchange data:
- EDI (electronic data interchange) is a somewhat generic term.
    * Encompasses both direct point-to-point communications, as well as through third-party managed data transmissions.
    * Can be implemented through a wide range of protocols, e.g., SFTP (Secure File Transfer Protocol), HTTPS, SOAP (Simple Object Access Protocol), etc. 
- Web Services using SOAP: standardized (XML based) format, but extensible, decoupled from the transport layer protocol and underlying programming model; amenable to distributed enterprise environments.
- <b>REST</b> (Representational State Transfer): simpler client-server architecture, stateless & cacheable, closely aligned with web technologies (using HTTP requests: POST, GET, PUT, DELETE, PATCH), supports multiple and flexible formats (XML, JSON, ...).
    * with or without authentication
- Streaming APIs 

An example (Reddit) with authenticated REST API:<br>
<img src="pics/reddit_merlion.webp" alt="Reddit" width="100"/><br>


In [16]:
#! pip install --upgrade praw
# Python Reddit API Wrapper(PRAW) 
# https://praw.readthedocs.io/en/stable/index.html 
# For a quick start, check "Working with PRAW's models" 

import praw
import numpy as np
import pandas as pd

In [17]:
# Uncomment and replace the dummy "XYZ"s with your own token information. 
# red_client_id="XYZ"
# red_client_secret="XYZ"
# red_user_agent="XYZ"
# red_username="XYZ" # Don't need these for just reading data
# red_password="XYZ" # Don't need these for just reading data

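# Comment out the lines below if you fill in the credentials above
# (the name `str` is avoided, since it would shadow Python's built-in type)
with open("reddit-credentials.txt", "r") as fo:
    cred_lines = fo.readlines()
# Keep the third whitespace-separated token of each line
cred_values = [x.split()[2] for x in cred_lines]
red_client_id = cred_values[0]
red_client_secret = cred_values[1]
red_password = cred_values[2]  # Don't need these for just reading data
red_user_agent = cred_values[3]
red_username = cred_values[4]  # Don't need these for just reading data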
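
The cell above depends on the order of the lines in the credentials file. A slightly more forgiving sketch is to read the file into a dictionary keyed by name. The file layout is not shown in this notebook, so the `key = value` form below is an assumption inferred from the `split()[2]` call, and the helper name and sample values are hypothetical.

```python
def parse_credentials(lines):
    """Parse lines of the assumed form 'key = value' into a dict."""
    creds = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        creds[key.strip()] = value.strip()
    return creds

# Hypothetical file content; real secrets should never be committed to a repository
sample_lines = [
    "client_id = XYZ\n",
    "client_secret = XYZ\n",
    "user_agent = my-demo-agent\n",
]
creds = parse_credentials(sample_lines)
print(sorted(creds))
```

With a real file, `parse_credentials(open("reddit-credentials.txt"))` would work the same way, since a file object iterates over its lines.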

In [18]:
reddit = praw.Reddit(
    client_id=red_client_id,
    client_secret=red_client_secret,
    user_agent=red_user_agent,
#    username=red_username,
#    password=red_password,    
)
#print(reddit.user.me())

In [19]:
#dib_subreddit = reddit.subreddit('dataisbeautiful')
#hot_posts = dib_subreddit.hot(limit=30)
#for post in hot_posts:
#    print(post.title)

In [20]:
sg_subreddit = reddit.subreddit('singapore')
posts = []
for post in sg_subreddit.hot(limit=10):
    posts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
posts_df = pd.DataFrame(posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])
posts_df

Unnamed: 0,title,score,id,subreddit,url,num_comments,body,created
0,/r/singapore random discussion and small quest...,5,paxd13,singapore,https://www.reddit.com/r/singapore/comments/pa...,193,"Talk about your day. Anything goes, but subre...",1629842000.0
1,Yip Pin Xiu will defend her crown in the Tokyo...,174,pb16e0,singapore,https://www.channelnewsasia.com/sport/tokyo-pa...,5,,1629855000.0
2,Is This The New Normal For Employment Prospects?,791,pajh8r,singapore,https://i.redd.it/wbndejths9j71.png,202,,1629795000.0
3,lightnight strike in bb,2674,pae0xj,singapore,https://v.redd.it/u44597f8r7j71,256,,1629771000.0
4,"Taman Jurong hawker, 82, happy to share Chwee ...",160,par30f,singapore,https://mothership.sg/2021/08/chwee-kueh-taman...,6,,1629823000.0
5,Update: Footpath near Little Guilin got worse ...,21,pb2ljv,singapore,https://v.redd.it/rsagps7g6fj71,5,,1629861000.0
6,ComfortDelgro cabby tells passenger he need no...,30,pb074f,singapore,https://mothership.sg/2021/08/comfort-delgro-t...,20,,1629852000.0
7,"Weekly Vaccination Update - Total: 8,765,866 /...",27,pb0g4o,singapore,https://www.reddit.com/r/singapore/comments/pb...,15,**Singapore Vaccination Data (as of 24. Aug 20...,1629853000.0
8,Woman makes police report alleging safe distan...,20,pb22d8,singapore,https://www.straitstimes.com/singapore/woman-m...,11,,1629859000.0
9,How to mic test like a boss,1564,paebbh,singapore,https://v.redd.it/ttj65lggu7j71,85,,1629772000.0


In [21]:
# Let's identify the submission with the largest score
submission_id=posts_df[posts_df['score']==posts_df['score'].max()]['id'].iloc[0]
submission_id

'pae0xj'

In [22]:
submission = reddit.submission(id=submission_id)
for top_level_comment in submission.comments:
    print(top_level_comment.body)

[close up photo](https://imgur.com/a/dPC39Xs)
Ex building engineer here, 

The lightning certainly would have hit lightning protection copper tape or air terminal at the top of the building beside, and down conductors taking it to earth. Designs are based on what is known as the rolling sphere method (BCA requirement dictate a min of 45m radius), such that lightning will go to a favorable lightning protection air terminal in lieu of directly hitting ground, thus I’m sure the lightning didn’t manage to hit directly to the manhole cover, but hit the building.

One possibility (dont pofma me pls) is I guess lightning protection system might be be wrongly connected to sanitary pipes, which exploded gases in the inspection chamber. (the close up photos are metal covers for the sewer inspection chamber). Another possibility is there was a gas leak..
Alright I’mma WFH today. Can’t argue with god of thunder
But why the lightning hit the ground? There are so many tall buildings in the surroundi

In [23]:
comments_lst = []
submission.comments.replace_more(limit=None) 
# See more on the CommentForest structure in the documentation. 
# https://praw.readthedocs.io/en/stable/code_overview/other/commentforest.html 
# The below code using list() provides a BFS based flattening of the comments 
for comment in submission.comments.list():
   comments_lst.append([comment.id, comment.score,comment.author, comment.body, comment.parent_id])
comments_df=pd.DataFrame(comments_lst,columns=['id','score','author','comment','parent']).sort_values('score', ascending=False)
# parent_id: The ID of the parent comment (prefixed with t1_). 
# If it is a top-level comment, this returns the submission ID instead (prefixed with t3_).
comments_df

Unnamed: 0,id,score,author,comment,parent
1,ha4gfi9,345,butilikewaffles,"Ex building engineer here, \n\nThe lightning c...",t3_pae0xj
0,ha42wsj,301,spartacurse,[close up photo](https://imgur.com/a/dPC39Xs),t3_pae0xj
131,ha43mwd,265,NotSiaoOn,Now I know what 3 damage in magic the gatherin...,t1_ha42wsj
140,ha497uo,248,Cleftbutt,"It didnt, it hit the lightning protection and ...",t1_ha41b5a
3,ha41b5a,244,k34t0n,But why the lightning hit the ground? There ar...,t3_pae0xj
...,...,...,...,...,...
112,ha4465v,-4,Helmet_Politician,A bit confused why there would be an explosion...,t3_pae0xj
113,ha47wkg,-4,nikhoftime,WHY NOT HD,t3_pae0xj
164,ha4j8gc,-5,Dynosmite,Because it was made in blender by someone who ...,t1_ha45rse
155,ha4j4d3,-17,Dynosmite,Because it is,t1_ha41vs8
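
The `parent` column is enough to rebuild the reply structure. As a small sketch using only the standard library, the snippet below takes a few `(id, parent)` pairs copied from the table above, separates top-level comments (parent prefixed with `t3_`) from replies (parent prefixed with `t1_`), and groups the replies by parent. The helper name `split_parent` is hypothetical.

```python
from collections import defaultdict

# (id, parent) pairs copied from the table above
comment_rows = [
    ("ha4gfi9", "t3_pae0xj"),
    ("ha42wsj", "t3_pae0xj"),
    ("ha43mwd", "t1_ha42wsj"),
    ("ha497uo", "t1_ha41b5a"),
    ("ha41b5a", "t3_pae0xj"),
]

def split_parent(parent_id):
    """Split a prefixed ID such as 't1_ha42wsj' into (kind, bare_id)."""
    prefix, _, bare_id = parent_id.partition("_")
    kind = {"t1": "comment", "t3": "submission"}.get(prefix, "unknown")
    return kind, bare_id

top_level = []
replies = defaultdict(list)  # parent comment id -> list of child comment ids
for comment_id, parent_id in comment_rows:
    kind, bare_id = split_parent(parent_id)
    if kind == "submission":
        top_level.append(comment_id)
    else:
        replies[bare_id].append(comment_id)

print(top_level)
print(dict(replies))
```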


In the previous example:
- We used a ready-made wrapper instead of calling the API endpoints directly.
- Given the nature of the access granted, we also needed an authentication token.
    * Next, let's look at another example, where we call the API endpoints directly.
        + Let's check some data from Data.gov.sg.

In [24]:
sg_env_url="https://api.data.gov.sg/v1/environment/psi"
sg_env_dat = requests.get(sg_env_url).text
print(sg_env_dat)

{"region_metadata":[{"name":"west","label_location":{"latitude":1.35735,"longitude":103.7}},{"name":"national","label_location":{"latitude":0,"longitude":0}},{"name":"east","label_location":{"latitude":1.35735,"longitude":103.94}},{"name":"central","label_location":{"latitude":1.35735,"longitude":103.82}},{"name":"south","label_location":{"latitude":1.29587,"longitude":103.82}},{"name":"north","label_location":{"latitude":1.41803,"longitude":103.82}}],"items":[{"timestamp":"2021-08-25T11:00:00+08:00","update_timestamp":"2021-08-25T11:08:52+08:00","readings":{"o3_sub_index":{"west":3,"national":4,"east":4,"central":3,"south":3,"north":1},"pm10_twenty_four_hourly":{"west":22,"national":24,"east":13,"central":21,"south":20,"north":24},"pm10_sub_index":{"west":22,"national":24,"east":13,"central":21,"south":20,"north":24},"co_sub_index":{"west":6,"national":6,"east":5,"central":3,"south":5,"north":4},"pm25_twenty_four_hourly":{"west":10,"national":11,"east":4,"central":11,"south":8,"north"

In [25]:
import json
# The is_json code below is adapted from 
# https://stackoverflow.com/questions/5508509/how-do-i-check-if-a-string-is-valid-json-in-python

def is_json(myjson):
    """Return True if myjson is a string containing valid JSON."""
    try:
        json.loads(myjson)
    except ValueError:
        return False
    return True

is_json(sg_env_dat)

True
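
Once the string is known to be valid JSON, `json.loads` turns it into nested Python dictionaries and lists that can be navigated with ordinary indexing. The sketch below uses a trimmed-down string with the same nesting as the PSI response printed earlier (a handful of values copied from it), so it runs without a network call. The variable names are illustrative.

```python
import json

# A trimmed-down string with the same nesting as the PSI response printed above
sample_psi = '''
{"items": [{"timestamp": "2021-08-25T11:00:00+08:00",
            "readings": {"o3_sub_index": {"west": 3, "national": 4, "east": 4},
                         "pm10_sub_index": {"west": 22, "national": 24, "east": 13}}}]}
'''

parsed = json.loads(sample_psi)   # str -> nested dicts and lists
latest = parsed["items"][0]       # "items" is a list; take its first entry

# Flatten the nested readings into (timestamp, indicator, region, value) rows
rows = [
    (latest["timestamp"], indicator, region, value)
    for indicator, by_region in latest["readings"].items()
    for region, value in by_region.items()
]
for row in rows:
    print(row)
```

Rows in this shape can be passed to `pd.DataFrame(rows, columns=[...])` if a tabular view is preferred.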

In [26]:
import pprint
pprint.pprint(json.loads(sg_env_dat))

{'api_info': {'status': 'healthy'},
 'items': [{'readings': {'co_eight_hour_max': {'central': 0.34,
                                               'east': 0.47,
                                               'national': 0.59,
                                               'north': 0.35,
                                               'south': 0.53,
                                               'west': 0.59},
                         'co_sub_index': {'central': 3,
                                          'east': 5,
                                          'national': 6,
                                          'north': 4,
                                          'south': 5,
                                          'west': 6},
                         'no2_one_hour_max': {'central': 21,
                                              'east': 20,
                                              'national': 48,
                                              'north': 29,
                    

In [27]:
# We can grab historical data for a specific date and/or time window
sg_env_url_dated="https://api.data.gov.sg/v1/environment/psi?date=2020-02-20"
#sg_env_url_dated="https://api.data.gov.sg/v1/environment/psi?date_time=2020-02-20T03:20:15"
sg_env_dat_dated = requests.get(sg_env_url_dated).text
pprint.pprint(json.loads(sg_env_dat_dated))

{'api_info': {'status': 'healthy'},
 'items': [{'readings': {'co_eight_hour_max': {'central': 0.43,
                                               'east': 0.51,
                                               'national': 0.57,
                                               'north': 0.43,
                                               'south': 0.5,
                                               'west': 0.57},
                         'co_sub_index': {'central': 4,
                                          'east': 5,
                                          'national': 6,
                                          'north': 4,
                                          'south': 5,
                                          'west': 6},
                         'no2_one_hour_max': {'central': 7,
                                              'east': 8,
                                              'national': 11,
                                              'north': 4,
                        

                                                    'east': 51,
                                                    'national': 51,
                                                    'north': 40,
                                                    'south': 40,
                                                    'west': 35},
                         'so2_sub_index': {'central': 2,
                                           'east': 1,
                                           'national': 4,
                                           'north': 2,
                                           'south': 1,
                                           'west': 4},
                         'so2_twenty_four_hourly': {'central': 3,
                                                    'east': 2,
                                                    'national': 6,
                                                    'north': 4,
                                                    'south': 2,
               

In [28]:
# sg_airtemp_dated="https://api.data.gov.sg/v1/environment/air-temperature?date=2020-03-21"
# sg_airtemp_dat_dated = requests.get(sg_airtemp_dated).text
# pprint.pprint(json.loads(sg_airtemp_dat_dated))
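
The dated queries above append the parameters to the URL by hand. As a sketch of an alternative, the query string can be built from keyword arguments with the standard library, which also takes care of percent-encoding. The helper name `build_query_url` is hypothetical; the endpoint and the `date` / `date_time` parameters are the ones used in the cells above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

PSI_ENDPOINT = "https://api.data.gov.sg/v1/environment/psi"

def build_query_url(endpoint, **params):
    """Append URL-encoded query parameters to an endpoint."""
    return f"{endpoint}?{urlencode(params)}" if params else endpoint

url_by_date = build_query_url(PSI_ENDPOINT, date="2020-02-20")
url_by_time = build_query_url(PSI_ENDPOINT, date_time="2020-02-20T03:20:15")

print(url_by_date)
print(url_by_time)  # the ':' characters are percent-encoded as %3A
```

With `requests`, the same effect is obtained by passing a dictionary through the `params` argument, e.g. `requests.get(PSI_ENDPOINT, params={"date": "2020-02-20"})`.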

#### What's this JSON (JavaScript Object Notation) format?
<br>
<img src="pics/JSON_Example_MongoDB.png" alt="JSON example" width="500"/><br>
<center>Image source: <a href="https://www.mongodb.com/blog/post/getting-started-with-python-and-mongodb">MongoDB Blog</a></center>
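
To make the format concrete, here is a small sketch showing how each JSON construct maps onto a Python type when parsed with the standard `json` module. The document itself is invented for illustration and does not come from any of the data sources used in this module.

```python
import json

# Hypothetical document, loosely modelled on a researcher profile
json_text = '''
{
  "name": "A. Researcher",
  "active": true,
  "homepage": null,
  "h_index": 12,
  "projects": ["Project A", "Project B"],
  "affiliation": {"school": "SCSE", "country": "SG"}
}
'''

doc = json.loads(json_text)

# JSON object -> dict, array -> list, string -> str,
# number -> int/float, true/false -> bool, null -> None
for key, value in doc.items():
    print(f"{key!r}: {type(value).__name__}")

# Serialising and parsing again gives back an equal Python object
round_trip = json.loads(json.dumps(doc))
```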


In [29]:
# JSON is everywhere; for example, you may encounter it when/if you try to grab data from DBLP
from urllib.request import urlopen
from pandas import json_normalize  # the older pandas.io.json.json_normalize import is deprecated
import urllib.request 

DBLP_URL = 'https://dblp.org/search/publ/api?q=andrew+ng&format=json&h=1000'
with urllib.request.urlopen(DBLP_URL) as url:
    dblp_data_instance = json.loads(url.read().decode())
pprint.pprint(dblp_data_instance)     

{'result': {'completions': {'@computed': '1',
                            '@sent': '1',
                            '@total': '1',
                            'c': {'@dc': '812',
                                  '@id': '27437870',
                                  '@oc': '1772',
                                  '@sc': '1772',
                                  'text': 'ng'}},
            'hits': {'@computed': '812',
                     '@first': '0',
                     '@sent': '812',
                     '@total': '812',
                     'hit': [{'@id': '195628',
                              '@score': '9',
                              'info': {'authors': {'author': [{'@pid': '263/3159',
                                                               'text': 'Akshay '
                                                                       'Smit'},
                                                              {'@pid': '257/3291',
                                                 

                              'info': {'authors': {'author': [{'@pid': '161/3744',
                                                               'text': 'Pranav '
                                                                       'Rajpurkar'},
                                                              {'@pid': '206/8409',
                                                               'text': 'Anirudh '
                                                                       'Joshi'},
                                                              {'@pid': '259/3231',
                                                               'text': 'Anuj '
                                                                       'Pareek'},
                                                              {'@pid': '259/3229',
                                                               'text': 'Phil '
                                                                       'Chen'},
                   

                                       'year': '2016'},
                              'url': 'URL#1895542'},
                             {'@id': '1937857',
                              '@score': '7',
                              'info': {'authors': {'author': [{'@pid': '155/3328',
                                                               'text': 'Dario '
                                                                       'Amodei'},
                                                              {'@pid': '80/10926',
                                                               'text': 'Sundaram '
                                                                       'Ananthanarayanan'},
                                                              {'@pid': '119/7727',
                                                               'text': 'Rishita '
                                                                       'Anubhai'},
                                               

                                                               'text': 'Juhan '
                                                                       'Nam'},
                                                              {'@pid': '58/2562',
                                                               'text': 'Honglak '
                                                                       'Lee'},
                                                              {'@pid': 'n/AndrewYNg',
                                                               'text': 'Andrew '
                                                                       'Y. '
                                                                       'Ng'}]},
                                       'ee': 'https://icml.cc/2011/papers/399_icmlpaper.pdf',
                                       'key': 'conf/icml/NgiamKKNLN11',
                                       'pages': '689-696',
                                       'title': '

                                                               'text': 'Adam '
                                                                       'Coates'},
                                                              {'@pid': 'n/AndrewYNg',
                                                               'text': 'Andrew '
                                                                       'Y. Ng'},
                                                              {'@pid': '83/5894',
                                                               'text': 'Yi Gu'},
                                                              {'@pid': '00/5982',
                                                               'text': 'Charles '
                                                                       'DuHadway'}]},
                                       'doi': '10.1145/1390156.1390218',
                                       'ee': 'https://doi.org/10.1145/1390156.1390218',
                 

                              '@score': '7',
                              'info': {'authors': {'author': [{'@pid': '95/2750',
                                                               'text': 'Shai '
                                                                       'Shalev-Shwartz'},
                                                              {'@pid': 's/YoramSinger',
                                                               'text': 'Yoram '
                                                                       'Singer'},
                                                              {'@pid': 'n/AndrewYNg',
                                                               'text': 'Andrew '
                                                                       'Y. '
                                                                       'Ng'}]},
                                       'doi': '10.1145/1015330.1015376',
                                       'ee': 'https://doi

                             {'@id': '749657',
                              '@score': '4',
                              'info': {'authors': {'author': [{'@pid': '213/6879',
                                                               'text': 'Floris '
                                                                       'P. '
                                                                       'Barthel'},
                                                              {'@pid': '184/4954',
                                                               'text': 'Kevin '
                                                                       'C. '
                                                                       'Johnson'},
                                                              {'@pid': '163/6674',
                                                               'text': 'Frederick '
                                                                       'S. '
                 

                                                               'text': 'Mariela '
                                                                       'Soto-Berelov'},
                                                              {'@pid': '97/4491',
                                                               'text': 'Andrew '
                                                                       'K. '
                                                                       'Skidmore'},
                                                              {'@pid': '217/4094',
                                                               'text': 'Trung '
                                                                       'H. '
                                                                       'Nguyen'}]},
                                       'doi': '10.1016/J.JAG.2019.102034',
                                       'ee': 'https://doi.org/10.1016/j.jag.2019.102034',
                 

                                                              {'@pid': '01/5245',
                                                               'text': 'Osama '
                                                                       'Khan'},
                                                              {'@pid': '159/7345',
                                                               'text': 'Sahar '
                                                                       'M. '
                                                                       'Mesri'},
                                                              {'@pid': '146/6243',
                                                               'text': 'Ioana '
                                                                       'Suciu'},
                                                              {'@pid': '246/0588',
                                                               'text': 'Lydia '
                              

                             {'@id': '2777842',
                              '@score': '3',
                              'info': {'authors': {'author': [{'@pid': '16/11513',
                                                               'text': 'Andrew '
                                                                       'Keong '
                                                                       'Ng'},
                                                              {'@pid': '08/1924',
                                                               'text': 'Kai '
                                                                       'Keng '
                                                                       'Ang'},
                                                              {'@pid': '94/581',
                                                               'text': 'Keng '
                                                                       'Peng '
                            

                                                              {'@pid': 't/AndrewTeohBengJin',
                                                               'text': 'Andrew '
                                                                       'Teoh '
                                                                       'Beng '
                                                                       'Jin'},
                                                              {'@pid': 'n/DavidChekLingNgo',
                                                               'text': 'David '
                                                                       'Ngo '
                                                                       'Chek '
                                                                       'Ling'}]},
                                       'doi': '10.1587/ELEX.2.70',
                                       'ee': 'https://doi.org/10.1587/elex.2.70',
                           

                                                               'text': 'Chris '
                                                                       'Yeung'},
                                                              {'@pid': '39/6118',
                                                               'text': 'Lih-Wee '
                                                                       'Chew'}]},
                                       'ee': 'http://www.aaai.org/Library/IAAI/1997/iaai97-188.php',
                                       'key': 'conf/aaai/NgGCKYC97',
                                       'pages': '913-918',
                                       'title': 'SunRay V - An Intelligent '
                                                'Container Trucking Operations '
                                                'Management and Control '
                                                'System.',
                                       'type': 'Conference and Workshop P

                             {'@id': '129689',
                              '@score': '2',
                              'info': {'authors': {'author': [{'@pid': '76/5973',
                                                               'text': 'Ling '
                                                                       'Zhang'},
                                                              {'@pid': '254/6674',
                                                               'text': 'Matthew '
                                                                       'Butrovich'},
                                                              {'@pid': '92/9835',
                                                               'text': 'Tianyu '
                                                                       'Li'},
                                                              {'@pid': '58/4127',
                                                               'text': 'Andrew '
            

                                       'volume': 'abs/2009.14825',
                                       'year': '2020'},
                              'url': 'URL#619261'},
                             {'@id': '620148',
                              '@score': '2',
                              'info': {'authors': {'author': [{'@pid': '41/2483',
                                                               'text': 'Bao '
                                                                       'Nguyen'},
                                                              {'@pid': '48/726',
                                                               'text': 'Adam '
                                                                       'Feldman'},
                                                              {'@pid': '275/8043',
                                                               'text': 'Sarath '
                                                                       'Bethapudi'},

                             {'@id': '1395759',
                              '@score': '2',
                              'info': {'authors': {'author': [{'@pid': '21/2107',
                                                               'text': 'Jonathan '
                                                                       'Tuke'},
                                                              {'@pid': '128/5665',
                                                               'text': 'Andrew '
                                                                       'Nguyen'},
                                                              {'@pid': '91/8376',
                                                               'text': 'Mehwish '
                                                                       'Nasim'},
                                                              {'@pid': '13/2439',
                                                               'text': 'Drew '
          

                              'info': {'authors': {'author': [{'@pid': '96/5223',
                                                               'text': 'Yun '
                                                                       'Zheng'},
                                                              {'@pid': '191/0737',
                                                               'text': 'Vandana '
                                                                       'Hivrale'},
                                                              {'@pid': '44/6434',
                                                               'text': 'Xiaotuo '
                                                                       'Zhang'},
                                                              {'@pid': '21/9078',
                                                               'text': 'Babu '
                                                                       'Valliyodan'},
                   

                                                              {'@pid': '70/10957',
                                                               'text': 'Andrew '
                                                                       'Gontarek'},
                                                              {'@pid': '165/1250',
                                                               'text': 'Aaron '
                                                                       'Vose'},
                                                              {'@pid': '48/8524',
                                                               'text': 'Robert '
                                                                       'Moench'},
                                                              {'@pid': '09/6203',
                                                               'text': 'David '
                                                                       'Abramson'},
                  

                                                                       'Musasizi'},
                                                              {'@pid': '138/1221',
                                                               'text': 'Catherine '
                                                                       'Nassimbwa'},
                                                              {'@pid': '86/7470',
                                                               'text': 'Sandy '
                                                                       'Stevens '
                                                                       'Tickodri-Togboa'},
                                                              {'@pid': '138/1210',
                                                               'text': 'Edward '
                                                                       'Kale '
                                                                       'Kayihura'},
  

                                                                       'Vu'}]},
                                       'doi': '10.1109/LCN.2011.6115165',
                                       'ee': 'https://doi.org/10.1109/LCN.2011.6115165',
                                       'key': 'conf/lcn/NguyenAV11',
                                       'pages': '109-116',
                                       'title': 'Service differentiation '
                                                'without prioritization in '
                                                'IEEE 802.11 WLANs.',
                                       'type': 'Conference and Workshop Papers',
                                       'url': 'https://dblp.org/rec/conf/lcn/NguyenAV11',
                                       'venue': 'LCN',
                                       'year': '2011'},
                              'url': 'URL#3360170'},
                             {'@id': '3372564',
                        

                                                               'text': 'Andrew '
                                                                       'S. '
                                                                       'Grimshaw'},
                                                              {'@pid': '06/6439',
                                                               'text': 'Anh '
                                                                       'Nguyen-Tuong'}]},
                                       'doi': '10.1109/HPDC.2000.868647',
                                       'ee': 'https://doi.org/10.1109/HPDC.2000.868647',
                                       'key': 'conf/hpdc/WhiteGN00',
                                       'pages': '165-174',
                                       'title': 'Grid-based File Access - The '
                                                'Legion I/O Model.',
                                       'type': 'Conference and Works
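
The printed response is deeply nested (`result` → `hits` → `hit` → `info` → `authors` → `author`), and fields such as `doi` do not appear in every record. The following is a minimal sketch of how such a structure could be flattened into simple rows. The `sample` dictionary is a hand-made miniature that only mimics the shape of `dblp_data_instance`; the `@id` values and the whole second record are made up for illustration, and the helper names `author_names` and `summarize_hits` are hypothetical. It also assumes that a record with a single author may carry a dict instead of a list under `author`, so that case is normalised defensively.

```python
# Hand-made miniature of the response structure (NOT real API output).
# The first record reuses values visible in the notebook output; the '@id'
# values and the entire second record are invented for illustration.
sample = {
    "result": {
        "hits": {
            "hit": [
                {
                    "@id": "1",
                    "@score": "7",
                    "info": {
                        "authors": {"author": [
                            {"@pid": "60/5515", "text": "Adam Coates"},
                            {"@pid": "n/AndrewYNg", "text": "Andrew Y. Ng"},
                        ]},
                        "title": "Selecting Receptive Fields in Deep Networks.",
                        "venue": "NIPS",
                        "year": "2011",
                    },
                },
                {
                    "@id": "2",
                    "@score": "3",
                    "info": {
                        # Assumption: a lone author may be a dict rather than a list.
                        "authors": {"author": {"@pid": "x/Example",
                                               "text": "Example Author"}},
                        "title": "A hypothetical single-author record.",
                        "year": "2020",
                        "doi": "10.0000/example",
                    },
                },
            ]
        }
    }
}


def author_names(info):
    """Return the author names of one 'info' record as a list of strings."""
    authors = info.get("authors", {}).get("author", [])
    if isinstance(authors, dict):  # normalise the single-author case
        authors = [authors]
    return [a["text"] for a in authors]


def summarize_hits(data):
    """Flatten the nested hits into a list of small dictionaries."""
    hits = data.get("result", {}).get("hits", {}).get("hit", [])
    return [
        {
            "title": h["info"].get("title"),
            "year": int(h["info"]["year"]) if "year" in h["info"] else None,
            "authors": author_names(h["info"]),
            "doi": h["info"].get("doi"),  # None when the record has no DOI
        }
        for h in hits
    ]


summary = summarize_hits(sample)
for row in summary:
    print(row["year"], row["title"], "|", ", ".join(row["authors"]))
```

Under the same assumptions, calling `summarize_hits(dblp_data_instance)` should yield one row per hit. Using `.get()` with defaults keeps the traversal from raising a `KeyError` when an optional field is absent.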

In [30]:
# Navigate to a sub-part of the document: the list of individual hits
pprint.pprint(dblp_data_instance['result']['hits']['hit'])

[{'@id': '195628',
  '@score': '9',
  'info': {'authors': {'author': [{'@pid': '263/3159', 'text': 'Akshay Smit'},
                                  {'@pid': '257/3291', 'text': 'Damir Vrabac'},
                                  {'@pid': '140/9459', 'text': 'Yujie He'},
                                  {'@pid': 'n/AndrewYNg',
                                   'text': 'Andrew Y. Ng'},
                                  {'@pid': '154/3831',
                                   'text': 'Andrew L. Beam'},
                                  {'@pid': '161/3744',
                                   'text': 'Pranav Rajpurkar'}]},
           'ee': 'https://arxiv.org/abs/2103.14339',
           'key': 'journals/corr/abs-2103-14339',
           'title': 'MedSelect - Selective Labeling for Medical Image '
                    'Classification Combining Meta-Learning with Deep '
                    'Reinforcement Learning.',
           'type': 'Informal Publications',
           'url': 'https://dblp.org

                                   'text': 'Pranav Rajpurkar'},
                                  {'@pid': '177/9122', 'text': 'Anand Avati'},
                                  {'@pid': 'n/AndrewYNg',
                                   'text': 'Andrew Yan-Tak Ng'},
                                  {'@pid': '32/6623', 'text': 'Sanjay Basu'},
                                  {'@pid': 's/NHShah',
                                   'text': 'Nigam H. Shah'}]},
           'doi': '10.1016/J.JBI.2021.103826',
           'ee': 'https://doi.org/10.1016/j.jbi.2021.103826',
           'key': 'journals/jbi/KoCARANBS21',
           'pages': '103826',
           'title': 'Improving hospital readmission prediction using '
                    'individualized utility analysis.',
           'type': 'Journal Articles',
           'url': 'https://dblp.org/rec/journals/jbi/KoCARANBS21',
           'venue': 'J. Biomed. Informatics',
           'volume': '119',
           'year': '2021'},
  'url': 'URL#5442

                                  {'@pid': '234/7825',
                                   'text': 'Silviana Ciurea-Ilcus'},
                                  {'@pid': '234/7903', 'text': 'Chris Chute'},
                                  {'@pid': '234/7535',
                                   'text': 'Henrik Marklund'},
                                  {'@pid': '234/7782',
                                   'text': 'Behzad Haghgoo'},
                                  {'@pid': '19/11264', 'text': 'Robyn L. Ball'},
                                  {'@pid': '115/9205',
                                   'text': 'Katie S. Shpanskaya'},
                                  {'@pid': '234/8025', 'text': 'Jayne Seekins'},
                                  {'@pid': '191/4960', 'text': 'David A. Mong'},
                                  {'@pid': '181/0313',
                                   'text': 'Safwan S. Halabi'},
                                  {'@pid': '09/7425',
                        

                                  {'@pid': '115/9205',
                                   'text': 'Katie S. Shpanskaya'},
                                  {'@pid': '209/9732',
                                   'text': 'Matthew P. Lungren'},
                                  {'@pid': 'n/AndrewYNg',
                                   'text': 'Andrew Y. Ng'}]},
           'ee': 'http://arxiv.org/abs/1712.06957',
           'key': 'journals/corr/abs-1712-06957',
           'title': 'MURA Dataset - Towards Radiologist-Level Abnormality '
                    'Detection in Musculoskeletal Radiographs.',
           'type': 'Informal Publications',
           'url': 'https://dblp.org/rec/journals/corr/abs-1712-06957',
           'venue': 'CoRR',
           'volume': 'abs/1712.06957',
           'year': '2017'},
  'url': 'URL#1738696'},
 {'@id': '1852471',
  '@score': '7',
  'info': {'authors': {'author': [{'@pid': '25/6501',
                                   'text': 'Michiel Kallenberg'},
  

  'info': {'authors': {'author': [{'@pid': '60/5515', 'text': 'Adam Coates'},
                                  {'@pid': 'n/AndrewYNg',
                                   'text': 'Andrew Y. Ng'}]},
           'ee': 'https://proceedings.neurips.cc/paper/2011/hash/6c1da886822c67822bcf3679d04369fa-Abstract.html',
           'key': 'conf/nips/CoatesN11',
           'pages': '2528-2536',
           'title': 'Selecting Receptive Fields in Deep Networks.',
           'type': 'Conference and Workshop Papers',
           'url': 'https://dblp.org/rec/conf/nips/CoatesN11',
           'venue': 'NIPS',
           'year': '2011'},
  'url': 'URL#3368222'},
 {'@id': '3368336',
  '@score': '7',
  'info': {'authors': {'author': [{'@pid': '29/6166', 'text': 'Quoc V. Le'},
                                  {'@pid': '27/3000',
                                   'text': 'Alexandre Karpenko'},
                                  {'@pid': '72/8781', 'text': 'Jiquan Ngiam'},
                                  {'@

                                  {'@pid': 'n/AndrewYNg',
                                   'text': 'Andrew Y. Ng'}]},
           'ee': 'https://proceedings.neurips.cc/paper/2010/hash/01f78be6f7cad02658508fe4616098a9-Abstract.html',
           'key': 'conf/nips/LeNCCKN10',
           'pages': '1279-1287',
           'title': 'Tiled convolutional neural networks.',
           'type': 'Conference and Workshop Papers',
           'url': 'https://dblp.org/rec/conf/nips/LeNCCKN10',
           'venue': 'NIPS',
           'year': '2010'},
  'url': 'URL#3598244'},
 {'@id': '3631460',
  '@score': '7',
  'info': {'authors': {'author': [{'@pid': 'a/PieterAbbeel',
                                   'text': 'Pieter Abbeel'},
                                  {'@pid': 'n/AndrewYNg',
                                   'text': 'Andrew Y. Ng'}]},
           'doi': '10.1007/978-0-387-30164-8_417',
           'ee': 'https://doi.org/10.1007/978-0-387-30164-8_417',
           'key': 'reference/ml/AbbeelN1

 {'@id': '4343889',
  '@score': '7',
  'info': {'authors': {'author': [{'@pid': '16/5422',
                                   'text': 'Jenny Rose Finkel'},
                                  {'@pid': 'm/ChristopherDManning',
                                   'text': 'Christopher D. Manning'},
                                  {'@pid': 'n/AndrewYNg',
                                   'text': 'Andrew Y. Ng'}]},
           'ee': 'https://aclanthology.org/W06-1673/',
           'key': 'conf/emnlp/FinkelMN06',
           'pages': '618-626',
           'title': 'Solving the Problem of Cascading Errors - Approximate '
                    'Bayesian Inference for Linguistic Annotation Pipelines.',
           'type': 'Conference and Workshop Papers',
           'url': 'https://dblp.org/rec/conf/emnlp/FinkelMN06',
           'venue': 'EMNLP',
           'year': '2006'},
  'url': 'URL#4343889'},
 {'@id': '4366907',
  '@score': '7',
  'info': {'authors': {'author': [{'@pid': 'a/PieterAbbeel',
    

                                  {'@pid': 'e/DREngler',
                                   'text': 'Dawson R. Engler'}]},
           'ee': 'http://www.usenix.org/events/osdi06/tech/kremenek.html',
           'key': 'conf/osdi/KremenekTBNE06',
           'pages': '161-176',
           'title': 'From Uncertainty to Belief - Inferring the Specification '
                    'Within.',
           'type': 'Conference and Workshop Papers',
           'url': 'https://dblp.org/rec/conf/osdi/KremenekTBNE06',
           'venue': 'OSDI',
           'year': '2006'},
  'url': 'URL#4396757'},
 {'@id': '4407448',
  '@score': '7',
  'info': {'authors': {'author': [{'@pid': '25/559', 'text': 'Einat Minkov'},
                                  {'@pid': 'c/WWCohen',
                                   'text': 'William W. Cohen'},
                                  {'@pid': 'n/AndrewYNg',
                                   'text': 'Andrew Y. Ng'}]},
           'doi': '10.1145/1148170.1148179',
           'e

  'info': {'authors': {'author': [{'@pid': '58/6491', 'text': 'Sin Chun Ng'},
                                  {'@pid': '40/3667',
                                   'text': 'Chi-Chung Cheung'},
                                  {'@pid': '61/489',
                                   'text': 'Andrew Kwok-Fai Lui'},
                                  {'@pid': '117/3062',
                                   'text': 'Hau-Ting Tse'}]},
           'doi': '10.1007/978-3-642-31346-2_24',
           'ee': 'https://doi.org/10.1007/978-3-642-31346-2_24',
           'key': 'conf/isnn/NgCLT12',
           'pages': '206-216',
           'title': 'Addressing the Local Minima Problem by Output Monitoring '
                    'and Modification Algorithms.',
           'type': 'Conference and Workshop Papers',
           'url': 'https://dblp.org/rec/conf/isnn/NgCLT12',
           'venue': 'ISNN',
           'year': '2012'},
  'url': 'URL#3100198'},
 {'@id': '3100199',
  '@score': '5',
  'info': {'authors

                    'Data Sharing Framework.',
           'type': 'Conference and Workshop Papers',
           'url': 'https://dblp.org/rec/conf/hicss/ChowdhuryKWSN20',
           'venue': 'HICSS',
           'year': '2020'},
  'url': 'URL#465242'},
 {'@id': '481577',
  '@score': '3',
  'info': {'authors': {'author': [{'@pid': '116/7137',
                                   'text': 'A. S. M. Kayes'},
                                  {'@pid': '39/1032',
                                   'text': 'Mohammad Hammoudeh'},
                                  {'@pid': '122/3123',
                                   'text': 'Shahriar Badsha'},
                                  {'@pid': '85/10911',
                                   'text': 'Paul A. Watters'},
                                  {'@pid': '16/3523', 'text': 'Alex Ng'},
                                  {'@pid': '265/7792',
                                   'text': 'Fatma Mohammed'},
                                  {'@pid': '96/147

                                   'text': 'Arjun Nagendran'},
                                  {'@pid': '07/6819', 'text': 'Jason Liu'},
                                  {'@pid': '128/7080',
                                   'text': 'William Mayfield'},
                                  {'@pid': '128/7145', 'text': 'Mehdi Tayoubi'},
                                  {'@pid': '128/7057',
                                   'text': 'Richard Breitner'}]},
           'doi': '10.1002/ROB.21451',
           'ee': 'https://doi.org/10.1002/rob.21451',
           'key': 'journals/jfr/RichardsonWNHPRGHNLMTB13',
           'number': '3',
           'pages': '323-348',
           'title': 'The &quot;Djedi&quot; Robot Exploration of the Southern '
                    'Shaft of the Queen&apos;s Chamber in the Great Pyramid of '
                    'Giza, Egypt.',
           'type': 'Journal Articles',
           'url': 'https://dblp.org/rec/journals/jfr/RichardsonWNHPRGHNLMTB13',
           'venu

           'number': '11',
           'pages': '2245-2255',
           'title': 'Biohashing - two factor authentication featuring '
                    'fingerprint data and tokenised random number.',
           'type': 'Journal Articles',
           'url': 'https://dblp.org/rec/journals/pr/JinLG04',
           'venue': 'Pattern Recognit.',
           'volume': '37',
           'year': '2004'},
  'url': 'URL#4614607'},
 {'@id': '4628222',
  '@score': '3',
  'info': {'authors': {'author': [{'@pid': '65/4628', 'text': 'Han Foon Neo'},
                                  {'@pid': '42/2141', 'text': 'Ying-Han Pang'},
                                  {'@pid': 't/AndrewTeohBengJin',
                                   'text': 'Andrew Teoh Beng Jin'},
                                  {'@pid': 'n/DavidChekLingNgo',
                                   'text': 'David Ngo Chek Ling'}]},
           'doi': '10.1109/CGIV.2004.1323962',
           'ee': 'https://doi.org/10.1109/CGIV.2004.1323962',
    

                                  {'@pid': '299/4773',
                                   'text': 'Metin Yesiltepe'},
                                  {'@pid': '294/1437',
                                   'text': 'Naohiro Yonemoto'},
                                  {'@pid': '299/4698', 'text': 'Chuanhua Yu'},
                                  {'@pid': '299/5364',
                                   'text': 'Mikhail Sergeevich Zastrozhin'},
                                  {'@pid': '299/5124',
                                   'text': 'Anasthasia Zastrozhina'},
                                  {'@pid': '98/4146',
                                   'text': 'Zhi-Jiang Zhang'},
                                  {'@pid': '270/4099',
                                   'text': 'Christopher J. L. Murray'},
                                  {'@pid': '205/8414', 'text': 'Theo Vos'}]},
           'doi': '10.1186/S12911-021-01590-Y',
           'ee': 'https://doi.org/10.1186/s12911-021-0159

  'info': {'authors': {'author': [{'@pid': '208/9303',
                                   'text': 'Pedro P. Vergara'},
                                  {'@pid': '253/8225', 'text': 'Tam T. Mai'},
                                  {'@pid': '253/8135',
                                   'text': 'Andrew Burstein'},
                                  {'@pid': '26/4389',
                                   'text': 'Phuong H. Nguyen'}]},
           'doi': '10.1109/ISGTEUROPE.2019.8905499',
           'ee': 'https://doi.org/10.1109/ISGTEurope.2019.8905499',
           'key': 'conf/isgteurope/VergaraMBN19',
           'pages': '1-5',
           'title': 'Feasibility and Performance Assessment of Commercial PV '
                    'Inverters Operating with Droop Control for Providing '
                    'Voltage Support Services.',
           'type': 'Conference and Workshop Papers',
           'url': 'https://dblp.org/rec/conf/isgteurope/VergaraMBN19',
           'venue': 'ISGT Europe',
    

                                  {'@pid': '208/1492',
                                   'text': 'Vinh Q. Nguyen'},
                                  {'@pid': '208/1684', 'text': 'Ulvi Baspinar'},
                                  {'@pid': '76/6763', 'text': 'Michael White'},
                                  {'@pid': '42/1595', 'text': 'Frank C. Sup'}]},
           'doi': '10.1109/ICORR.2017.8009416',
           'ee': 'https://doi.org/10.1109/ICORR.2017.8009416',
           'key': 'conf/icorr/LaPreNBWS17',
           'pages': '1221-1226',
           'title': 'Capturing prosthetic socket fitment - Preliminary results '
                    'using an ultrasound-based device.',
           'type': 'Conference and Workshop Papers',
           'url': 'https://dblp.org/rec/conf/icorr/LaPreNBWS17',
           'venue': 'ICORR',
           'year': '2017'},
  'url': 'URL#1622988'},
 {'@id': '1625753',
  '@score': '2',
  'info': {'authors': {'author': [{'@pid': 'n/AnneHHNgu',
                    

           'title': 'Supporting Relative Debugging for Large-scale UPC '
                    'Programs.',
           'type': 'Conference and Workshop Papers',
           'url': 'https://dblp.org/rec/conf/iccS/DinhAJDMG14',
           'venue': 'ICCS',
           'year': '2014'},
  'url': 'URL#2521900'},
 {'@id': '2521918',
  '@score': '2',
  'info': {'authors': {'author': [{'@pid': '130/5927', 'text': 'Damian Goik'},
                                  {'@pid': '73/10809', 'text': 'Konrad Jopek'},
                                  {'@pid': '84/4953',
                                   'text': 'Maciej Paszynski'},
                                  {'@pid': '15/2621',
                                   'text': 'Andrew Lenharth'},
                                  {'@pid': '48/4701', 'text': 'Donald Nguyen'},
                                  {'@pid': '71/5735',
                                   'text': 'Keshav Pingali'}]},
           'doi': '10.1016/J.PROCS.2014.05.086',
           'ee': '

                                   'text': 'Andrew S. Brunker'},
                                  {'@pid': '30/4876',
                                   'text': 'Quang Vinh Nguyen'},
                                  {'@pid': '21/3309',
                                   'text': 'Anthony J. Maeder'},
                                  {'@pid': '36/8738', 'text': 'Rhys Tague'},
                                  {'@pid': '151/5666',
                                   'text': 'Gregory S. Kolt'},
                                  {'@pid': '151/5614',
                                   'text': 'Trevor N. Savage'},
                                  {'@pid': '151/5581',
                                   'text': 'Corneel Vandelanotte'},
                                  {'@pid': '151/5657',
                                   'text': 'Mitch J. Duncan'},
                                  {'@pid': '151/5535',
                                   'text': 'Cristina M. Caperchione'},
                

                                   'text': 'Yeshaiahu Fainman'},
                                  {'@pid': '23/5856',
                                   'text': 'Truong Q. Nguyen'}]},
           'doi': '10.1109/GLOCOM.2011.6133623',
           'ee': 'https://doi.org/10.1109/GLOCOM.2011.6133623',
           'key': 'conf/globecom/WangGSRFN11',
           'pages': '1-5',
           'title': 'Stochastic Model on the Post-Fabrication Error for a '
                    'Bragg Reflectors Based Photonic Allpass Filter.',
           'type': 'Conference and Workshop Papers',
           'url': 'https://dblp.org/rec/conf/globecom/WangGSRFN11',
           'venue': 'GLOBECOM',
           'year': '2011'},
  'url': 'URL#3310887'},
 {'@id': '3314163',
  '@score': '2',
  'info': {'authors': {'author': [{'@pid': '72/9188',
                                   'text': 'Raghavendra Rajkumar'},
                                  {'@pid': '58/5765-2',
                                   'text': 'Andrew Wang 0002

  'info': {'authors': {'author': [{'@pid': '38/234',
                                   'text': 'Brian S. Leibowitz'},
                                  {'@pid': '78/2056', 'text': 'Robert Palmer'},
                                  {'@pid': '86/5771', 'text': 'John Poulton'},
                                  {'@pid': '28/10712', 'text': 'Yohan Frans'},
                                  {'@pid': '96/1774', 'text': 'Simon Li'},
                                  {'@pid': '05/4137-2',
                                   'text': 'John M. Wilson 0002'},
                                  {'@pid': '06/10711',
                                   'text': 'Michael Bucher'},
                                  {'@pid': '05/10710',
                                   'text': 'Andrew M. Fuller'},
                                  {'@pid': '12/3316', 'text': 'John G. Eyles'},
                                  {'@pid': '87/687', 'text': 'Marko Aleksic'},
                                  {'@pid': '94/499

  'info': {'authors': {'author': [{'@pid': '93/5342',
                                   'text': 'Mysore Y. Jaisimha'},
                                  {'@pid': '36/6603',
                                   'text': 'Andrew G. Bruce'},
                                  {'@pid': '76/4829', 'text': 'Thien Nguyen'}]},
           'doi': '10.1117/12.234774',
           'ee': 'https://doi.org/10.1117/12.234774',
           'key': 'conf/spieSR/JaisimhaBN96',
           'pages': '350-361',
           'title': 'DocBrowse - A System for Information Retrieval from '
                    'Document Image Data.',
           'type': 'Conference and Workshop Papers',
           'url': 'https://dblp.org/rec/conf/spieSR/JaisimhaBN96',
           'venue': 'Storage and Retrieval for Image and Video Databases',
           'year': '1996'},
  'url': 'URL#5313594'},
 {'@id': '5313670',
  '@score': '2',
  'info': {'authors': {'author': [{'@pid': '06/6439',
                                   'text': 'Anh Nguyen

#### NoSQL: MongoDB document store

vs SQL:
- document store: no fixed schema, more flexible
    * new records with extra fields!
- encapsulates hierarchical information
    * no need for flattening/data normalization, etc.
- naturally amenable to distributed storage and processing
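
To make the contrast with a fixed schema concrete, here is a minimal sketch using plain Python dictionaries (no database involved). The two records and their field names are made up for illustration; they only mimic the shape of documents that a document store accepts side by side.

```python
import json

# Two hypothetical records destined for the same collection.
# The second one carries extra fields and a nested sub-document,
# which a document store accepts without any schema migration.
listing_a = {"name": "Cosy studio", "bedrooms": 1}
listing_b = {
    "name": "Family duplex",
    "bedrooms": 3,
    "amenities": ["Wifi", "Kitchen"],                        # list-valued field
    "address": {"country": "Portugal", "market": "Porto"},   # hierarchical information
}

collection = [listing_a, listing_b]

# Fields that appear in some documents but not in all of them
all_fields = set().union(*(doc.keys() for doc in collection))
shared_fields = set.intersection(*(set(doc) for doc in collection))
print(sorted(all_fields - shared_fields))

# The nested structure survives a round trip through JSON unchanged
assert json.loads(json.dumps(listing_b)) == listing_b
```

In a relational design, `amenities` and `address` would typically end up in separate tables joined by keys; here they simply travel with the record.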

<img src="pics/InkedMongoDBAtlasFreeTier.jpg" alt="MongoDB Atlas free tier" width="500"/><br>
<center>MongoDB Atlas: Let's use their free tier cloud offering to try it out</center>

#### MongoDB storage model

- Projects
    - Databases
        - Collections
            * Data records: stored as BSON documents (BSON: Binary JSON, for storage efficiency)
- You can access and manipulate the data through the GUI, scripts/shell, as well as from programming environments
    * Here, we shall see how to do so with Python using pymongo 
- Other benefits (not the focus of this course): automatic, fault-tolerant, load-balanced, elastic storage

Useful resources: https://university.mongodb.com/ 

In [31]:
#!pip3 install pymongo -U
#!pip3 install "pymongo[srv]" -U
#!pip3 install dnspython
# You would need to restart runtime after installations. 
# Useful resource: https://pymongo.readthedocs.io/en/stable/tutorial.html
import pymongo
import dns

In [32]:
### COMMENT THIS OUT if you set conn_str directly below
# Read the connection string from a local file (kept out of the notebook on purpose)
with open("mongodb-uri.txt", "r") as fo:
    uri = fo.readline().strip()  # strip the trailing newline; avoid shadowing the built-in str

### Put in the connection string (URI) of your own MongoDB deployment.
# Typically, it will be something like
# conn_str = "mongodb+srv://USER:TOKEN@cluster0.5tnpm.mongodb.net/sample_airbnb?retryWrites=true&w=majority"
conn_str = uri
# set a 5-second server selection timeout
client = pymongo.MongoClient(conn_str, serverSelectionTimeoutMS=5000)
print(client.server_info()) # just a sanity check
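
An alternative to keeping the URI in a text file is to read it from an environment variable. The helper below is only a sketch: the variable name `MONGODB_URI` and the function name `load_conn_str` are my own choices for illustration, not something pymongo or Atlas prescribes.

```python
import os
from pathlib import Path

def load_conn_str(env_var="MONGODB_URI", fallback_file="mongodb-uri.txt"):
    """Return the connection string from an environment variable,
    falling back to the first line of a local file.
    Both default names are assumptions made for this sketch."""
    uri = os.environ.get(env_var)
    if uri and uri.strip():
        return uri.strip()
    path = Path(fallback_file)
    if path.exists():
        lines = path.read_text().splitlines()
        if lines and lines[0].strip():
            return lines[0].strip()
    raise RuntimeError(f"No connection string found in ${env_var} or {fallback_file}")
```

It would then be used as `client = pymongo.MongoClient(load_conn_str(), serverSelectionTimeoutMS=5000)`, which keeps the credentials out of the notebook either way.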



In [33]:
# I will use some default datasets provided by MongoDB 
# for illustrative examples, so that you can reproduce/rerun them
# More about this default dataset https://docs.atlas.mongodb.com/sample-data/available-sample-datasets/  

client.list_database_names()

['dblpMongo',
 'sample_airbnb',
 'sample_analytics',
 'sample_geospatial',
 'sample_mflix',
 'sample_restaurants',
 'sample_supplies',
 'sample_training',
 'sample_weatherdata',
 'admin',
 'local']

<br>
<img src="pics/MongoDBSampleDBs.png" alt="MongoDB Atlas sample databases" width="500"/><br>
<center>MongoDB Atlas DB viewer</center>

In [34]:
db1 = client['sample_airbnb']
db2 = client['sample_analytics']

In [35]:
list(db1.list_collections())

[{'name': 'listingsAndReviews',
  'type': 'collection',
  'options': {},
  'info': {'readOnly': False,
   'uuid': UUID('3b44bd30-808a-4d19-b60a-44b523195391')},
  'idIndex': {'v': 2, 'key': {'_id': 1}, 'name': '_id_'}}]

In [36]:
pprint.pprint(list(db2.list_collections()))

[{'idIndex': {'key': {'_id': 1}, 'name': '_id_', 'v': 2},
  'info': {'readOnly': False,
           'uuid': UUID('4b510301-29cd-41d2-85b8-74cb147d26ab')},
  'name': 'accounts',
  'options': {},
  'type': 'collection'},
 {'idIndex': {'key': {'_id': 1}, 'name': '_id_', 'v': 2},
  'info': {'readOnly': False,
           'uuid': UUID('ea93b891-039c-40a2-90a6-e643c9b6ca2a')},
  'name': 'transactions',
  'options': {},
  'type': 'collection'},
 {'idIndex': {'key': {'_id': 1}, 'name': '_id_', 'v': 2},
  'info': {'readOnly': False,
           'uuid': UUID('efcfd80e-e94d-43c0-9689-794ce2c7ed16')},
  'name': 'customers',
  'options': {},
  'type': 'collection'}]
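
The databases → collections hierarchy can also be walked programmatically. The helper below is a rough sketch (the name `summarize_deployment` is mine); it relies only on `list_database_names()`, `list_collection_names()` and `estimated_document_count()`, and skips the internal `admin` and `local` databases, which your user may not be allowed to read.

```python
def summarize_deployment(client, skip=("admin", "local")):
    """Map each database name to {collection name: estimated document count}."""
    summary = {}
    for db_name in client.list_database_names():
        if db_name in skip:
            continue  # internal databases, skipped by assumption
        db = client[db_name]
        summary[db_name] = {
            coll_name: db[coll_name].estimated_document_count()
            for coll_name in db.list_collection_names()
        }
    return summary
```

With the `client` created earlier, `pprint.pprint(summarize_deployment(client))` would print one entry per sample database. The counts are estimates taken from collection metadata, so they are cheap but not guaranteed to be exact.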


In [37]:
# Let's access one of these collections
listings=db1.listingsAndReviews
# Let's access one of the records in that collection
pprint.pprint(listings.find_one()) 
# find_one returns the first matching document in natural (storage) order, so which one you get may vary

{'_id': '10006546',
 'access': 'We are always available to help guests. The house is fully '
           'available to guests. We are always ready to assist guests. when '
           'possible we pick the guests at the airport.  This service transfer '
           'have a cost per person. We will also have service "meal at home" '
           'with a diverse menu and the taste of each. Enjoy the moment!',
 'accommodates': 8,
 'address': {'country': 'Portugal',
             'country_code': 'PT',
             'government_area': 'Cedofeita, Ildefonso, Sé, Miragaia, Nicolau, '
                                'Vitória',
             'location': {'coordinates': [-8.61308, 41.1413],
                          'is_location_exact': False,
                          'type': 'Point'},
             'market': 'Porto',
             'street': 'Porto, Porto, Portugal',
             'suburb': ''},
 'amenities': ['TV',
               'Cable TV',
               'Wifi',
               'Kitchen',
              

In [38]:
pprint.pprint(listings.find_one({'_id':"10009999"})) 
# I had checked manually using the GUI DB browser that there is a record with this id

{'_id': '10009999',
 'access': '',
 'accommodates': 4,
 'address': {'country': 'Brazil',
             'country_code': 'BR',
             'government_area': 'Jardim Botânico',
             'location': {'coordinates': [-43.23074991429229,
                                          -22.966253551739655],
                          'is_location_exact': True,
                          'type': 'Point'},
             'market': 'Rio De Janeiro',
             'street': 'Rio de Janeiro, Rio de Janeiro, Brazil',
             'suburb': 'Jardim Botânico'},
 'amenities': ['Wifi',
               'Wheelchair accessible',
               'Kitchen',
               'Free parking on premises',
               'Smoking allowed',
               'Hot tub',
               'Buzzer/wireless intercom',
               'Family/kid friendly',
               'Washer',
               'First aid kit',
               'Essentials',
               'Hangers',
               'Hair dryer',
               'Iron',
               '

In [39]:
# We can refine the search with conditions 
listings.find_one({"bedrooms": 3, 'minimum_nights': '2'}) 

{'_id': '10006546',
 'listing_url': 'https://www.airbnb.com/rooms/10006546',
 'name': 'Ribeira Charming Duplex',
 'summary': 'Fantastic duplex apartment with three bedrooms, located in the historic area of Porto, Ribeira (Cube) - UNESCO World Heritage Site. Centenary building fully rehabilitated, without losing their original character.',
 'space': 'Privileged views of the Douro River and Ribeira square, our apartment offers the perfect conditions to discover the history and the charm of Porto. Apartment comfortable, charming, romantic and cozy in the heart of Ribeira. Within walking distance of all the most emblematic places of the city of Porto. The apartment is fully equipped to host 8 people, with cooker, oven, washing machine, dishwasher, microwave, coffee machine (Nespresso) and kettle. The apartment is located in a very typical area of the city that allows to cross with the most picturesque population of the city, welcoming, genuine and happy people that fills the streets with h

In [40]:
# accessing sub-fields (fields of the sub-documents) using dotted namespace
listings.find_one({"bedrooms": 3, 'minimum_nights': '2', 'host.host_has_profile_pic': False}) 

{'_id': '4025482',
 'listing_url': 'https://www.airbnb.com/rooms/4025482',
 'name': '3 Bedrooms charmante sur le plateau',
 'summary': 'warm and bright apartment with terrace, wi-fi, fully equipped, in the heart of the plateau trendy Montreal, 10 min. from downtown center, metro and all services. This appartement is a place for quiet and respectful people (No party, no celebration, no visitors) Price depending on the number of bedrooms you will used, go to ; "Description + / Accès des voyageurs" to see all the option and tell me your choice. You have access to the bedroom reserved according to your number of traveler.',
 'space': 'bright space in a very peaceful and quiet area. This appartement is a place for quiet and respectful people. (Visitors are not allowed.)',
 'description': 'warm and bright apartment with terrace, wi-fi, fully equipped, in the heart of the plateau trendy Montreal, 10 min. from downtown center, metro and all services. This appartement is a place for quiet and r
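
To see what the dotted namespace means, here is a plain-Python sketch of equality matching on nested fields. It is a simplification that I wrote for illustration (no arrays, no operators), not how MongoDB evaluates queries internally.

```python
def get_path(doc, dotted):
    """Follow a dotted field path through nested dicts; return None if any step is missing."""
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

def matches(doc, query):
    """True if every dotted path in `query` equals the given value (equality filters only)."""
    return all(get_path(doc, path) == value for path, value in query.items())

# A cut-down document with the same field names as the listings above
doc = {"bedrooms": 3, "minimum_nights": "2",
       "host": {"host_has_profile_pic": False}}

print(matches(doc, {"bedrooms": 3, "host.host_has_profile_pic": False}))
print(matches(doc, {"minimum_nights": 2}))  # the stored value is the string '2'
```

The second check fails because `minimum_nights` is stored as a string in this dataset, which is why the queries above pass `'2'` rather than `2`.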

In [41]:
# What if I do not want the whole record, but only parts of it?
# Use *** projections ***
# https://docs.mongodb.com/manual/tutorial/project-fields-from-query-results/

listings.find_one({"bedrooms": 3}, {'number_of_reviews':1,'review_scores':1}) 

{'_id': '10006546',
 'number_of_reviews': 51,
 'review_scores': {'review_scores_accuracy': 9,
  'review_scores_cleanliness': 9,
  'review_scores_checkin': 10,
  'review_scores_communication': 10,
  'review_scores_location': 10,
  'review_scores_value': 9,
  'review_scores_rating': 89}}
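
Note that `_id` shows up in the result although the projection did not ask for it: it is included by default, unless you explicitly pass `'_id': 0`. The sketch below mimics that rule for top-level inclusion projections in plain Python; the function `project` is mine and handles nothing beyond this simple case.

```python
def project(doc, projection):
    """Keep the fields flagged with 1, plus `_id` unless it is explicitly set to 0."""
    wanted = {field for field, flag in projection.items() if flag}
    if projection.get("_id", 1):
        wanted.add("_id")
    return {field: value for field, value in doc.items() if field in wanted}

# A cut-down version of the record shown above
doc = {"_id": "10006546",
       "name": "Ribeira Charming Duplex",
       "number_of_reviews": 51,
       "review_scores": {"review_scores_rating": 89}}

print(project(doc, {"number_of_reviews": 1, "review_scores": 1}))
print(project(doc, {"number_of_reviews": 1, "_id": 0}))
```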

In [42]:
# If we want to find all the entries matching certain conditions!
# Use find: Since there are too many results, I only show the size instead
len(list(listings.find({"bedrooms": 3}, {'number_of_reviews':1,'review_scores':1})))

427

In [43]:
# If you just want the size, older PyMongo versions offered a native count() method on the cursor.
# It was deprecated in PyMongo 3.7 and removed in 4.0, so this cell only runs on PyMongo 3.x.
listings.find({"bedrooms": 4}).count()

161

In [44]:
# The new recommended way to do so
listings.count_documents({"bedrooms": 4})

161

In [45]:
# What if you need other query filters than just equality?  
listings.count_documents({"bedrooms": {"$gte": 4}})

229
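
Several operators can be combined on the same field, e.g. `{"bedrooms": {"$gte": 4, "$lt": 6}}`. The plain-Python sketch below shows what such a filter computes for a handful of comparison operators. It is an illustration under my own simplifications (top-level fields only, made-up documents), not MongoDB's implementation.

```python
import operator

# A handful of MongoDB comparison operators mapped to Python functions
OPERATORS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
}

def field_matches(value, condition):
    """Check one field value against a plain value or an {operator: operand} dict."""
    if not isinstance(condition, dict):
        return value == condition
    for op, operand in condition.items():
        if value is None and op not in ("$eq", "$ne"):
            return False  # a missing field cannot satisfy an ordering comparison
        if not OPERATORS[op](value, operand):
            return False
    return True

def count_matching(docs, query):
    """Count the documents for which every field condition in `query` holds."""
    return sum(
        all(field_matches(doc.get(field), cond) for field, cond in query.items())
        for doc in docs
    )

docs = [{"bedrooms": 2}, {"bedrooms": 4}, {"bedrooms": 6}, {"name": "no bedrooms field"}]
print(count_matching(docs, {"bedrooms": {"$gte": 4}}))
print(count_matching(docs, {"bedrooms": {"$gte": 4, "$lt": 6}}))
```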

In [46]:
# If you want full-text search, you will need to index it first (only once, if your data itself is static)
# Here, we are indexing the description text.
try:
    listings.create_index([('description', 'text')])
    print('Indexed')
except pymongo.errors.PyMongoError as e:
    print('Index creation reported an error (sometimes a WriteConcernError):', e)
# Very likely a replica sync issue. In any case, the index is already created.

Indexed


In [47]:
listings.count_documents( { "$text": { "$search": "Botanical Garden" } } )

478
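
A word of caution on the number above: a `$text` search with several words matches documents containing any of the terms (after stemming), not necessarily the exact phrase. For a phrase search, the phrase is wrapped in escaped double quotes, as in `{"$text": {"$search": "\"Botanical Garden\""}}`. The sketch below contrasts the two behaviours in plain Python on three made-up descriptions; it ignores stemming and stop words, so treat it as a rough approximation only.

```python
import re

def tokens(text):
    """Lower-cased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def any_term_match(description, search):
    """OR semantics: at least one search term occurs as a word."""
    return bool(tokens(search) & tokens(description))

def phrase_match(description, phrase):
    """The exact phrase occurs, ignoring case."""
    return phrase.lower() in description.lower()

# Made-up descriptions, for illustration only
descriptions = [
    "Five minutes from the Botanical Garden.",
    "Garden level apartment in a brownstone building.",
    "Bright loft close to the metro.",
]

print(sum(any_term_match(d, "Botanical Garden") for d in descriptions))
print(sum(phrase_match(d, "Botanical Garden") for d in descriptions))
```

This may explain why listings that merely mention a garden can appear among the results.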

In [48]:
#botanicgarden_listings=listings.find( { "$text": { "$search": "Botanical Garden" } } ,{'number_of_reviews':1,'review_scores':1, 'description':1 }  )

In [49]:
botanicgarden_listings=listings.find( { "$text": { "$search": "Botanical Garden" } } ,{'number_of_reviews':1,'review_scores':1, 'reviews':1, 'description':1 } )

In [50]:
pprint.pprint(botanicgarden_listings[470])

{'_id': '846854',
 'description': 'Large duplex on prime Park block in West 80ies. Bedrooms '
                'sleep 4 adults and 1 or 2 toddlers plus a couch for 2 adults. '
                'Lower level with office and TV room. Easy subway/bus access '
                'and Citi bikes. Ask us for chromebook or baby supplies. Large '
                'duplex on prime block on UWS in 80ies between CPW and '
                'Columbus. 2 bedrooms sleep 4 adults and 2 toddlers. There is '
                'a sleeping couch for 2 more adults. Ask us for chromebook or '
                'baby supplies. We have it.  Upper West Side garden level '
                'apartment in brownstone building on 80ies between Central '
                'Park West and Columbus Ave. Walk in less than 5 minutes to '
                'Central Park. Great to discover the Upper West Side (3 blocks '
                'from Natural History Museum), Central park (close to the '
                'jogging track at the Jacky 

In [51]:
botanicgarden_listings[470]['reviews']

[{'_id': '4462736',
  'date': datetime.datetime(2013, 5, 6, 4, 0),
  'listing_id': '846854',
  'reviewer_id': '6108625',
  'reviewer_name': 'Sara And Louis',
  'comments': 'We stayed there for one week with our 1 1/2 old daughter. Lovely little place with a garden - perfect for her to run around and play. It was such a nice and quiet street - you could even hear birds chirping in the morning. Amazing location - one block from the park, and one block from the nicest part of the upper westside.\r\nWe would definitely stay there again.'},
 {'_id': '4843213',
  'date': datetime.datetime(2013, 5, 28, 4, 0),
  'listing_id': '846854',
  'reviewer_id': '5646285',
  'reviewer_name': 'Kat',
  'comments': "We loved staying here, it's perfect or a family and a beautiful, convenient location. Everything you could possibly need at your doorstep.\r\n\r\nDirk was a great host letting us in and showing us around when we arrived, he even helped us with our bags. Debra-Jo who assists with the property wh

In [52]:
#botanicgarden_listings=listings.find( { "$text": { "$search": "Botanical Garden" }, "bedrooms": {"$gte": 4} } ,{'number_of_reviews':1,'review_scores':1, 'reviews':1, 'description':1 } )
#list(botanicgarden_listings)

In [53]:
botanic_listings_df=pd.DataFrame(list(botanicgarden_listings)) 
botanic_listings_df

| | _id | description | number_of_reviews | review_scores | reviews |
|---|---|---|---|---|---|
| 0 | 6171211 | Large 1br in a 3br. available. Apartment is lo... | 0 | {} | [] |
| 1 | 15100883 | Fordham university is a walking distance away,... | 1 | {'review_scores_accuracy': 6, 'review_scores_c... | [{'_id': '327815723', 'date': 2018-09-24 04:00... |
| 2 | 6541214 | Cute and huge one bedroom apartment only 5 min... | 1 | {'review_scores_accuracy': 10, 'review_scores_... | [{'_id': '43940821', 'date': 2015-08-23 04:00:... |
| 3 | 16976284 | My accommodation is within a 10 minute walk fr... | 87 | {'review_scores_accuracy': 10, 'review_scores_... | [{'_id': '137800254', 'date': 2017-03-17 04:00... |
| 4 | 10009999 | One bedroom + sofa-bed in quiet and bucolic ne... | 0 | {} | [] |
| ... | ... | ... | ... | ... | ... |
| 473 | 13748617 | My place is close to jean coutu, Metro super m... | 0 | {} | [] |
| 474 | 20744948 | We're close to JFK airport,3 groceries (Walmar... | 33 | {'review_scores_accuracy': 8, 'review_scores_c... | [{'_id': '191218025', 'date': 2017-09-06 04:00... |
| 475 | 15116002 | Newly renovated, open and airy, 900 sq foot, f... | 71 | {'review_scores_accuracy': 10, 'review_scores_... | [{'_id': '106902696', 'date': 2016-10-08 04:00... |
| 476 | 5650901 | Suite avaibable for the time you need Great lo... | 56 | {'review_scores_accuracy': 10, 'review_scores_... | [{'_id': '31045600', 'date': 2015-05-01 04:00:... |
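
The `review_scores` column still holds nested dictionaries, which is awkward for tabular analysis. One possible way out is to flatten each record into dotted column names before building the DataFrame. The helper below is a plain-Python sketch of that idea (the name `flatten` is mine); pandas also ships `pd.json_normalize` for the same purpose.

```python
def flatten(doc, parent="", sep="."):
    """Turn nested dicts into a single-level dict with dotted keys.
    Lists and empty dicts are kept as values."""
    flat = {}
    for key, value in doc.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict) and value:
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

# A cut-down record with the same field names as above
record = {"_id": "10006546",
          "number_of_reviews": 51,
          "review_scores": {"review_scores_accuracy": 9, "review_scores_rating": 89}}

print(flatten(record))
```

Something like `pd.DataFrame([flatten(r) for r in records])` would then give one column per score, with `records` being the list of documents fetched from the cursor.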


#### Aggregation 
- process data records and return computed results
    - aggregation pipeline
        * example follows
    - map-reduce
    - single-purpose aggregations 
        * we already saw an example for this, namely counting  

Check more at
* https://docs.mongodb.com/manual/aggregation/ 
* https://pymongo.readthedocs.io/en/stable/examples/aggregation.html

In [54]:
# Find average price depending on number of bedrooms, for the US listings in the AirBNB sample dataset
# sorted in increasing order of avgPrice
# See https://www.youtube.com/watch?v=0MZFTiKIPnU to see how to build this aggregation pipeline
# using the GUI on MongoDB Atlas, and export the Python code

agg_pipeline=[{'$match': {'address.country_code': 'US'}},
 {'$group': {'_id': '$bedrooms', 'avgPrice': {'$avg': '$price'}}},
 {'$sort': {'avgPrice': 1}}] 

pprint.pprint(list(listings.aggregate(agg_pipeline)))

[{'_id': None, 'avgPrice': Decimal128('79.00')},
 {'_id': 1, 'avgPrice': Decimal128('127.1636863823933975240715268225585')},
 {'_id': 0, 'avgPrice': Decimal128('139.2867647058823529411764705882353')},
 {'_id': 2, 'avgPrice': Decimal128('258.3797468354430379746835443037975')},
 {'_id': 3, 'avgPrice': Decimal128('360.8481012658227848101265822784810')},
 {'_id': 4, 'avgPrice': Decimal128('513.9166666666666666666666666666667')},
 {'_id': 6, 'avgPrice': Decimal128('668.6666666666666666666666666666667')},
 {'_id': 5, 'avgPrice': Decimal128('1763.333333333333333333333333333333')}]
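
For intuition, the three pipeline stages correspond to a filter, a group-by with averaging, and a sort. Below is a plain-Python sketch of the same computation on a few made-up documents. It assumes every document has a `price`, and it is meant only to show what the pipeline computes, not how the server executes it.

```python
from collections import defaultdict
from decimal import Decimal

def avg_price_by_bedrooms(docs, country_code="US"):
    """Plain-Python counterpart of the $match / $group / $sort stages."""
    groups = defaultdict(list)
    for doc in docs:
        # $match on the nested field address.country_code
        if doc.get("address", {}).get("country_code") != country_code:
            continue
        # $group by bedrooms; documents without that field fall into the None group
        groups[doc.get("bedrooms")].append(doc["price"])
    averages = [{"_id": key, "avgPrice": sum(prices) / len(prices)}
                for key, prices in groups.items()]
    # $sort by avgPrice in increasing order
    return sorted(averages, key=lambda row: row["avgPrice"])

# Made-up documents, for illustration only
docs = [
    {"bedrooms": 1, "price": Decimal("100"), "address": {"country_code": "US"}},
    {"bedrooms": 1, "price": Decimal("150"), "address": {"country_code": "US"}},
    {"bedrooms": 2, "price": Decimal("90"), "address": {"country_code": "US"}},
    {"price": Decimal("79"), "address": {"country_code": "US"}},
    {"bedrooms": 3, "price": Decimal("500"), "address": {"country_code": "PT"}},
]

for row in avg_price_by_bedrooms(docs):
    print(row)
```

The group with `_id` equal to `None` collects the listings that have no `bedrooms` value, which is also where the `None` row in the real output above comes from.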


#### CRUD operations: create, read, update, and delete
- Let's look at creation of a new DB/collection, and data insertion

In [55]:
dblpMongo = client["dblpMongo"]
print(client.list_database_names())
# The DB is not created until it is populated
# (it is already listed below only because I have run this notebook, and inserted data, before)

['dblpMongo', 'sample_airbnb', 'sample_analytics', 'sample_geospatial', 'sample_mflix', 'sample_restaurants', 'sample_supplies', 'sample_training', 'sample_weatherdata', 'admin', 'local']


In [56]:
dblpCollect = dblpMongo["Researchers"]
print(client.list_database_names())
# The collection is not created until it is populated

['dblpMongo', 'sample_airbnb', 'sample_analytics', 'sample_geospatial', 'sample_mflix', 'sample_restaurants', 'sample_supplies', 'sample_training', 'sample_weatherdata', 'admin', 'local']


In [57]:
from bson import ObjectId
#myquery = {"_id": ObjectId('61259526792654840aee635a')} 
#list(dblpCollect.find(myquery))

In [58]:
### Note that I have hard-coded the ObjectId here. 
### You will need to change this to reflect the correct id.
try:
    myquery = {"_id": ObjectId('61259526792654840aee635a')} # change id
    #pprint.pprint(dblpCollect.find_one(myquery))
    # dblpCollect already is the Researchers collection, so delete_one is called on it directly
    dblpCollect.delete_one(myquery)
    #print('Deleted')
except Exception as e:
    print("An exception occurred:", e)
# Not fully checked: It is possible that deletions take some time to take effect.

In [59]:
x = dblpCollect.insert_one(dblp_data_instance)
print(x.inserted_id) 
# To insert several documents at once, PyMongo offers insert_many(); 
# see also Bulk.insert() https://docs.mongodb.com/manual/reference/method/Bulk.insert/
print(client.list_database_names())
pprint.pprint(list(dblpMongo.list_collections()))
# Et voila! The DB and the collection are also created now!!

6125b905bd2b079502748003
['dblpMongo', 'sample_airbnb', 'sample_analytics', 'sample_geospatial', 'sample_mflix', 'sample_restaurants', 'sample_supplies', 'sample_training', 'sample_weatherdata', 'admin', 'local']
[{'idIndex': {'key': {'_id': 1}, 'name': '_id_', 'v': 2},
  'info': {'readOnly': False,
           'uuid': UUID('ddd84d74-3f75-4321-8a65-e4ff0416b764')},
  'name': 'Researchers',
  'options': {},
  'type': 'collection'}]


<img src="pics/InsertedDataInstance.png" alt="Inserted data instance" width="500"/>
<br> Note: I have deleted/inserted the same record multiple times. The id shown in the screenshot may not match the id created the last time this notebook has been executed.
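
Instead of deleting a hard-coded id and inserting the record again, one option is an upsert: replace the stored document if one with the same key exists, and insert it otherwise. The sketch below wraps pymongo's `replace_one(..., upsert=True)`. The helper name `upsert_by_key` and the idea of using one field of the record as a stable key are my assumptions; which field qualifies depends on your data.

```python
def upsert_by_key(collection, document, key_field):
    """Insert `document`, or replace the stored one with the same `key_field` value.
    Returns True when a new document was created."""
    result = collection.replace_one(
        {key_field: document[key_field]},  # filter: match on the chosen key
        document,                          # full replacement document
        upsert=True,                       # insert when nothing matches
    )
    return result.upserted_id is not None
```

Running it twice with the same record would leave a single document in the collection. One caveat: the replacement should not carry an `_id` that differs from the stored one, since `_id` cannot be changed by a replace.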

#### Wrap up

- This module:
    * Data acquisition from various sources
        + Scraping webpages and extracting data (BeautifulSoup)
        + Using API wrappers and endpoints
    * JSON format
    * NoSQL datastore (MongoDB)   

#### Wrap up


<table><tr>
<td> <img src="pics/3ModulesSummary.png" alt="Drawing" style="width: 500px;"/> </td>
<td> So far in this course<br>
<ul>
    <li>Data Products & Pipeline: The big picture</li>
    <li>Module 1: Traditional data storage <ul><li>RDBMS (SQLite)</li></ul></li>
    <li>Module 2: Basic data manipulation & cleaning <ul><li>Pandas</li><li>RegEX</li></ul></li>
    <li>Module 3: Data acquisition & NoSQL <ul><li>BeautifulSoup</li><li>APIs</li><li>MongoDB</li></ul></li>
</ul>
</td></tr></table>

<p style="font-size:134%;color:Deep Teal;">That's it folks!</p>

<img src="pics/NoSQLBoom.jpg" alt="NoSQL!" width="300"/>

In [60]:
#!jupyter nbconvert M3-Examples.ipynb --to slides --post serve