{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "###### Data Privacy Enhancing Techniques\n", "

\n", "SC 4125: Developing Data Products\n", "\n", "Module-8: Data Governance/Privacy Issues\n", "\n", "by Anwitaman DATTA\n", "\n", "School of Computer Science and Engineering, NTU Singapore.\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Teaching material\n", "- .pdf deck of slides (complements the html slides)\n", "- .html deck of slides\n", "- .ipynb Jupyter notebook" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Disclaimer/Caveat emptor\n", "\n", "- Non-systematic and non-exhaustive review\n", "- Illustrative approaches are not necessarily the most efficient or elegant, let alone unique\n", "- This Jupyter notebook is accompanied by a deck of slides discussing Data Governance in general, as well as specific privacy enhancing techniques." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Acknowledgement\n", "\n", "> This module is adapted from workshop training material created by Andreas Dewes and Katherine Jarmul from Kiprotect, which they have made available under MIT License. Here's an accompanying talk on Data Science Meets Data Protection. 
\n", ">\n", "> If anyone reuses the material in the current format, as readapted and provided here, please also attribute the original creators of the content.\n", ">\n", "> If there are any attribution omissions to be rectified, or should anything in the material need to be changed or redacted, the copyright owners are requested to contact me at anwitaman@ntu.edu.sg " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# k-Anonymity" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "External data (voter registration information) can be used to deanonymize | 2-anonymized patient data\n", ":-------------------------:|:-------------------------:\n", "\"bigpic\" | \"Sampling\"\n", "Example from Mondrian Multidimensional K-Anonymity by LeFevre et al." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# this is a list of the column names in our dataset (the file doesn't contain headers)\n", "names = (\n", "    'age',\n", "    'workclass', # Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.\n", "    'fnlwgt', # final weight. In other words, this is the number of people the census believes the entry represents.\n", "    'education',\n", "    'education-num',\n", "    'marital-status',\n", "    'occupation',\n", "    'relationship',\n", "    'race',\n", "    'sex',\n", "    'capital-gain',\n", "    'capital-loss',\n", "    'hours-per-week',\n", "    'native-country',\n", "    'income',\n", ")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# some fields are categorical and need to be treated accordingly\n", "# note that integers are used to represent some of the categorical data\n", "categorical = set((\n", "    'workclass',\n", "    'education',\n", "    'marital-status',\n", "    'occupation',\n", "    'relationship',\n", "    'sex',\n", "    'native-country',\n", "    'race',\n", "    'income',\n", "))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(48842, 15)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " age workclass fnlwgt education education-num \\\n", "38515 37 Private 285637 HS-grad 9 \n", "19744 57 Private 29375 HS-grad 9 \n", "9584 19 Private 201743 Some-college 10 \n", "18273 23 State-gov 35633 Some-college 10 \n", "14838 32 Self-emp-inc 161153 HS-grad 9 \n", "16585 44 Private 192381 Bachelors 13 \n", "22570 43 Self-emp-inc 602513 Masters 14 \n", "25297 69 Federal-gov 143849 11th 7 \n", "39942 35 Private 351772 Some-college 10 \n", "4431 21 Private 174503 HS-grad 9 \n", "\n", " marital-status occupation relationship \\\n", "38515 Never-married Transport-moving Not-in-family \n", "19744 Separated Sales Not-in-family \n", "9584 Never-married Other-service Own-child \n", "18273 Never-married Other-service Not-in-family \n", "14838 Married-civ-spouse Farming-fishing Husband \n", "16585 Married-civ-spouse Prof-specialty Husband \n", "22570 Married-civ-spouse Exec-managerial Husband \n", "25297 Widowed Adm-clerical Not-in-family \n", "39942 Married-civ-spouse Sales Husband \n", "4431 Never-married Adm-clerical Not-in-family \n", "\n", " race sex capital-gain capital-loss hours-per-week \\\n", "38515 Black Male 0 0 50 \n", "19744 Amer-Indian-Eskimo Female 0 0 35 \n", "9584 White Female 0 0 26 \n", "18273 White Male 0 0 50 \n", "14838 White Male 0 1902 55 \n", "16585 White Male 0 1848 40 \n", "22570 White Male 0 0 50 \n", "25297 White Female 0 0 20 \n", "39942 White Male 0 0 60 \n", "4431 White Female 0 0 30 \n", "\n", " native-country income \n", "38515 United-States <=50k \n", "19744 United-States <=50k \n", "9584 United-States <=50k \n", "18273 United-States <=50k \n", "14838 United-States >50k \n", "16585 United-States >50k \n", "22570 United-States >50k \n", "25297 United-States <=50k \n", "39942 United-States >50k \n", "4431 United-States <=50k " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "datapath ='data/kiprotectdata/' # change this to adjust relative path\n", 
"df = pd.read_csv(datapath+\"adult.all.txt\", sep=\", \", header=None, names=names, index_col=False, engine='python')\n", "for name in categorical:\n", " df[name] = df[name].astype('category')\n", "print(df.shape)\n", "df.sample(10)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Greedy heuristic" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Helper function: spans of columns of a dataframe\n", "\n", "> We first need a function that returns the spans (max-min for numerical columns, number of different values for categorical columns) of all columns for a partition of a dataframe." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def get_spans(df, partition, scale=None):\n", " \"\"\"\n", " :param df: the dataframe for which to calculate the spans\n", " :param partition: the partition for which to calculate the spans\n", " :param scale: if given, the spans of each column will be divided\n", " by the value in `scale` for that column\n", " : returns: The spans of all columns in the partition\n", " \"\"\" \n", " spans = {}\n", " for column in df.columns:\n", " if column in categorical:\n", " span = len(df[column][partition].unique())\n", " else:\n", " span = df[column][partition].max()-df[column][partition].min()\n", " if scale is not None:\n", " span = span/scale[column]\n", " spans[column] = span\n", " return spans" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "{'age': 73,\n", " 'workclass': 9,\n", " 'fnlwgt': 1478115,\n", " 'education': 16,\n", " 'education-num': 15,\n", " 'marital-status': 7,\n", " 'occupation': 15,\n", " 'relationship': 6,\n", " 'race': 5,\n", " 'sex': 2,\n", " 'capital-gain': 99999,\n", " 'capital-loss': 4356,\n", " 'hours-per-week': 
98,\n", " 'native-country': 42,\n", " 'income': 2}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "full_spans = get_spans(df, df.index)\n", "full_spans" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Helper function: split a dataframe into two partitions based on a `column`\n", "\n", "> Partition the dataframe, returning two partitions such that\n", "> - all rows with values of the chosen column `column` below the median are in one partition\n", "> - all rows with values above or equal to the median are in the other\n", "> - for categorical data, the categories are divided into two disjoint sets of (roughly) equal size" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def split(df, partition, column):\n", "    \"\"\"\n", "    :param df: The dataframe to split\n", "    :param partition: The partition to split\n", "    :param column: The column along which to split\n", "    :returns: A tuple containing a split of the original partition\n", "    \"\"\"\n", "    dfp = df[column][partition]\n", "    if column in categorical:\n", "        values = dfp.unique()\n", "        lv = set(values[:len(values)//2])\n", "        rv = set(values[len(values)//2:])\n", "        return dfp.index[dfp.isin(lv)], dfp.index[dfp.isin(rv)]\n", "    else:\n", "        median = dfp.median()\n", "        dfl = dfp.index[dfp < median]\n", "        dfr = dfp.index[dfp >= median]\n", "        return (dfl, dfr)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Helper function: determine whether a partition has at least k entries" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def is_k_anonymous(df, partition, sensitive_column, k=5):\n", "    \"\"\"\n", "    :param df: The dataframe on which to check the partition.\n", "    :param partition: The partition of the dataframe to check.\n", "    :param sensitive_column: The name of the sensitive column\n", "    :param k: The desired k\n", "    :returns: True if the partition is valid according to our k-anonymity criteria, False otherwise.\n", "    \"\"\"\n", "    if len(partition) < k:\n", "        return False\n", "    return True" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Greedy partitioning" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def partition_dataset(df, feature_columns, sensitive_column, scale, is_valid):\n", "    \"\"\"\n", "    :param df: The dataframe to be partitioned.\n", "    :param feature_columns: A list of column names along which to partition the dataset.\n", "    :param sensitive_column: The name of the sensitive column (to be passed on to the `is_valid` function)\n", "    :param scale: The column spans as generated before.\n", "    :param is_valid: A function that takes a dataframe, a partition and the sensitive column name, and returns True if the partition is valid.\n", "    :returns: A list of valid partitions that cover the entire dataframe.\n", "    \"\"\"\n", "    finished_partitions = []\n", "    partitions = [df.index]\n", "    while partitions:\n", "        partition = partitions.pop(0)\n", "        spans = get_spans(df[feature_columns], partition, scale)\n", "        # try the columns in order of decreasing (scaled) span\n", "        for column, span in sorted(spans.items(), key=lambda x:-x[1]):\n", "            # we try to split this partition along the given column\n", "            lp, rp = split(df, partition, column)\n", "            if not is_valid(df, lp, sensitive_column) or not is_valid(df, rp, sensitive_column):\n", "                continue\n", "            # the split is valid, we put the new partitions on the list and continue\n", "            partitions.extend((lp, rp))\n", "            break\n", "        else:\n", "            # no split was possible, we add the partition to the finished partitions\n", "            finished_partitions.append(partition)\n", "    return finished_partitions" ] }, { "cell_type": "code", 
"execution_count": 9, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "460\n" ] } ], "source": [ "# we apply the partitioning method to two columns of the dataset, using \"income\" as the sensitive attribute\n", "feature_columns = ['age', 'education-num']\n", "sensitive_column = 'income'\n", "finished_partitions = partition_dataset(df, feature_columns, sensitive_column, full_spans, is_k_anonymous)\n", "print(len(finished_partitions))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Visualizing the partitioning achieved" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEGCAYAAACKB4k+AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAAVYUlEQVR4nO3df7RlZX3f8ffHAcSYOIkwpjqAF4RgUaPArZUaU0VqQBwwBgIoLkALgaoYVwJrsDYmtlmSHyUWIqREEKIERNTCODSYpRiaSihgpII4hiLqCBEwdWIREci3f5x9N4fr/XHmzt3nzLnn/VrrrnP2s8/Z57v3zL2fs389T6oKSZIAnjLqAiRJ2w9DQZLUMhQkSS1DQZLUMhQkSa0dRl3Atth1111rampq1GVI0li59dZbH6yqNXPNG+tQmJqa4pZbbhl1GZI0VpJ8Y755Hj6SJLW2q1BI8vokf5rk6iSvGXU9kjRpOg+FJBcnuT/J7bPaD02yKcldSdYDVNV/q6qTgROBY7quTZL0ZMPYU7gEOLS/Ickq4IPAYcB+wHFJ9ut7yXua+ZKkIeo8FKrqBuAfZjW/FLirqu6uqh8BVwBHpuf3gP9eVV+ca3lJTklyS5JbHnjggW6Ll6QJM6pzCmuBb/VNb27a3gEcAhyV5NS53lhVF1bVdFVNr1kz5xVVkqQlGtUlqZmjrarqXODcYRcjSeoZ1Z7CZmD3vundgHtHVIskqTGqULgZ2CfJnkl2Ao4FrhlmAVPrNzK1fuMwP1KStnvDuCT1cuBGYN8km5O8taoeA94OXAfcCVxZVXdsxTLXJblwy5Yt3RQtSROq83MKVXXcPO3XAtcucZkbgA3T09Mnb0ttkqQn267uaJYkjZahIElqGQqSpNZYhoInmiWpG2MZClW1oapOWb169ahLkaQVZSxDQZLUjYkPBW9gk6QnTHwoSJKeYChIklpjGQpefSRJ3RjLUPDqI0nqxliGgiSpG4aCJKllKEiSWoaCJKk1lqHg1UeS1I2xDAWvPpKkboxlKEiSumEoSJJahoIkqWUoSJJahoIkqWUoSJJaYxkK3qcgSd0Yy1DwPgVJ6sZYhoIkqRuGgiSpZShIklqGgiSpZShIklqGgiSpZShIklqGgiSpNZahsNx3NE+t38jU+o3LsixJGm
djGQre0SxJ3RjLUJAkdcNQkCS1DAVJUstQkCS1DAVJUstQkCS1DAVJUstQkCS1DAVJUstQkCS1DAVJUmssQ2G5O8STJPWMZSjYIZ4kdWMsQ0GS1I2BQiHJzyR5QZK9kkxEkDi+gqRJtMN8M5KsBt4GHAfsBDwA7Az8bJK/Ac6vquuHUqUkaSjmDQXgKuDPgFdU1ff6ZyQ5EHhzkr2q6qIO65MkDdG8oVBV/2aBebcCt3ZSkSRpZBbaUwAgyQFzNG8BvlFVjy1/SZKkUVk0FIDzgQOA/w0EeGHzfJckp1bVZzqsT5I0RINcSXQPsH9VTVfVgcD+wO3AIcDvd1ibJGnIBtlTeH5V3TEzUVVfSbJ/Vd2dpMPSRsNLUSVNskFCYVOSC4ArmuljgK8leSrwaGeVSZKGbpDDRycCdwG/DrwLuLtpexR4VUd1jYR7CZIm3aJ7ClX1cJLzgM8ABWyqqpk9hP/XZXGSpOEa5JLUVwKX0jvhHGD3JCdU1Q2dViZJGrpBzin8Z+A1VbUJIMnPAZcDB3ZZmCRp+AY5p7DjTCAAVNXXgB27K0mSNCqD7CnckuQi4CPN9JuwiwtJWpEG2VM4DbgDOB14J/AV4NQui1qMI69JUjcWDYWqeqSqzqmqN1TVL1fVH1XVI8MoboGaHHlNkjqw0HgKX6Z3CeqcqurnO6lIkjQyC51TeN3QqpAkbRcWCoVvVtW8ewoASbLYayRJ42OhcwrXJ3lHkj36G5PslOTgJJcCJ3RbniRpmBbaUzgUeAtweZI9ge/RG6N5Fb0uL/6oqr7UdYGSpOFZaDjOH9IbYOf8JDsCuwIPzx6vWZK0cgxy8xpNB3j3dVyLJGnEBrl5TZI0IQwFSVJr0VBI8vQkT2me/1ySI5pzDJKkFWaQPYUbgJ2TrAU+C5wEXNJlUZKk0RgkFFJVPwDeAJxXVb8M7NdtWZKkURgoFJIcRK/L7JlBjAe6akmSNF4GCYV3AmcBn6qqO5LsBVzfbVmSpFFY9Bt/MxbzDX3Td9MbW0GStMIsGgrNmMy/CUz1v76qDu6uLEnSKAxybuDjwJ8AHwIe77YcSdIoDRIKj1XVBZ1Xsh2aWt87r37P2YePuBJJGo5BTjRvSPLvkjw7yTNnfjqvTJI0dIPsKcyMmXBGX1sBey1/OZKkURrk6qM9h1GIJGn0Brn6aEfgNOAXm6bPA/+16U5bkrSCDHL46AJgR3oD7gC8uWn7t10VJUkajUFC4V9U1Yv7pj+X5LauCpIkjc4gVx89nuR5MxNNNxferyBJK9AgewpnANcnuRsI8Fx63WcvqyZs/j2wuqqOWu7lS5IWt+ieQlV9FtiHXn9HpwP7VtVAHeIluTjJ/Ulun9V+aJJNSe5Ksr75nLur6q1bvwqSpOUybygkObh5fANwOLA38Dzg8KZtEJcAh85a7irgg8Bh9MZlOC6J4zNI0nZgocNH/xr4HLBujnkFfHKxhVfVDUmmZjW/FLir6W2VJFcARwJfGaTgJKcApwDsscceg7xFkjSgeUOhqt7bPH1fVX29f16SbbmhbS3wrb7pzcC/TLIL8LvA/knOqqr3z1PXhcCFANPT07UNdUiSZhnkRPMngANmtV0FHLjEz8wcbVVV3wVOXeIyJUnLYN5QSPJ84AXA6lnnEJ4B7LwNn7kZ2L1vejfg3m1YniRpmSy0p7Av8Drgp3nyeYXvAydvw2feDOzTHIL6NnAs8MZtWJ4kaZksdE7hauDqJAdV1Y1LWXiSy4FXArsm2Qy8t6ouSvJ24DpgFXBxVd2xlctdB6zbe++9l1LWNnOcBUkr1UKHj86sqt8H3pjkuNnzq2rRcZqr6sfe17RfC1y7NYXOev8GYMP09PS27LFIkmZZ6PDRnc3jLcMoRJI0egsdPtrQPP1BVX28f16SozutSpI0EoN0iHfWgG2SpDG30DmFw4DXAmuTnNs36xnAY10XtpBRn2iWpJVqoT2Fe+mdT/ghcGvfzzXAL3Vf2vyqakNVnb
J69epRliFJK85C5xRua3o3fU1VXTrEmiRJI7LgOYWqehzYJclOQ6pHkjRCg/R99A3gfya5BnhoprGqzumsKknSSAwSCvc2P08BfqrbciRJo7RoKFTV7wyjkK0x7KuPptZvtEsLSRNh0fsUkqxJ8gdJrk3yuZmfYRQ3H68+kqRuDHLz2mXAV4E9gd8B7qHX06kkaYUZJBR2qaqLgEer6q+q6i3AyzquS5I0AoOcaH60ebwvyeH0Tjrv1l1JkqRRGSQU/lOS1cBvAOfR6+biXZ1WJUkaiUGuPvp083QL8KpuyxmMfR9JUjcGufporyQbkjyY5P4kVyfZaxjFzcerjySpG4OcaP5z4ErgnwHPAT4OXN5lUZKk0RgkFFJVH6mqx5qfjwLVdWGSpOEb5ETz9UnWA1fQC4NjgI1JnglQVf/QYX2SpCEaJBSOaR5/bVb7W+iFxEjPL0iSls8gVx/tOYxCJEmjN8g5BUnShBjk8NF2ZxT3KUyt37hsy7DH1W7Zq620dPPuKSR5efP41OGVMxjvU5Ckbix0+Ojc5vHGYRQiSRq9hQ4fPZrkw8DaJOfOnllVp3dXliRpFBYKhdcBhwAHA7cOpxxJ0ijNGwpV9SBwRZI7q+q2IdYkSRqRQS5J/W6STzWd4X0nySeSOJ6CJK1Ag4TCh4Fr6HWGtxbY0LRJklaYQULhWVX14b4O8S4B1nRclyRpBAYJhQeSHJ9kVfNzPPDdrguTJA3fIKHwFuBXgb8H7gOOatpGJsm6JBdu2bJllGW0tvVu56n1G+dcxnztk8ZtIA3PoqFQVd+sqiOqak1VPauqXl9V3xhGcQvU5B3NktQBO8STJLUMBUlSy1CQJLUWDYUk7+l7vt31mCpJWj4LdZ19ZpKD6F1tNMMeUyVpBVuoQ7xNwNHAXkn+B3AnsEuSfatq01CqkyQN1UKHj/4v8G7gLuCVPDG+wvokX+i4LknSCCy0p3Ao8F7gecA5wG3AQ1V10jAKkyQN37x7ClX17qp6NXAP8FF6AbImyV8n2TCk+iRJQ7TQnsKM66rqZuDmJKdV1S8k2bXrwiRJwzdINxdn9k2e2LQ92FVBkqTR2aqb17aXEdi2tw7xwM7rRsHtLS2/sbyj2Q7xJKkbYxkKkqRuGAqSpJahIElqGQqSpJahIElqGQqSpJahIElqGQqSpJahIElqGQqSpJahIElqGQqSpJahIElqGQqSpJahIElqGQqSpNYgYzRvd5KsA9btvffeI61jrpG/ptZv5J6zD1/SexdqX6qZ5d1z9uFPWvZ8Nfa/vr9tZnqQ54stb5D2+ZYzn62dv9j6D/KaQf6dtzfjXLuGYyz3FBx5TZK6MZahIEnqhqEgSWoZCpKklqEgSWoZCpKklqEgSWoZCpKklqEgSWoZCpKklqEgSWoZCpKklqEgSWoZCpKklqEgSWoZCpKklqEgSWoZCpKklqEgSWoZCpKklqEgSWoZCpKklqEgSWoZCpKklqEgSWoZCpKk1g6jLmBGkqcD5wM/Aj5fVZeNuCRJmjid7ikkuTjJ/Ulun9V+aJJNSe5Ksr5pfgNwVVWdDBzRZV2SpLl1ffjoEuDQ/oYkq4APAocB+wHHJdkP2A34VvOyxzuuS5I0h04PH1XVDUmmZjW/FLirqu4GSHIFcCSwmV4wfIkFwirJKcApAHvsscfyF70MptZv/LG2e84+fMH5C7UvtMyZ9rmWP+hnzrxu0M+f7/linzVILVtrW5e1lM/sf2//tptvO85u7/93658/+zVLqWfQ9873f2Sx985X99aa67MWWvZc225bPn9rjeIz59N1LaM40byWJ/YIoBcGa4FPAr+S5AJgw3xvrqoLq2q6qqbXrFnTbaWSNGFGcaI5c7RVVT0EnDTsYiRJTxjFnsJmYPe+6d2Ae0dQhyRpllGEws3APkn2TLITcCxwzQjqkCTN0vUlqZcDNwL7Jtmc5K1V9RjwduA64E7gyqq6YyuXuy7JhVu2bFn+oiVpgnV99d
Fx87RfC1y7DcvdAGyYnp4+eanLkCT9OLu5kCS1DAVJUstQkCS1UlWjrmGrJVkHrAOOAf5uCYvYFXhwWYsaL5O+/uA2mPT1h8neBs+tqjnv/h3LUNhWSW6pqulR1zEqk77+4DaY9PUHt8F8PHwkSWoZCpKk1qSGwoWjLmDEJn39wW0w6esPboM5TeQ5BUnS3CZ1T0GSNAdDQZLUmqhQmGds6BUlye5Jrk9yZ5I7kryzaX9mkr9M8nfN48/0veesZptsSvJLo6t+eSVZleRvk3y6mZ6YbZDkp5NcleSrzf+FgyZp/QGSvKv5Hbg9yeVJdp60bbAUExMKC4wNvdI8BvxGVf1z4GXA25r1XA98tqr2AT7bTNPMOxZ4Ab3xtM9vttVK8E56PfHOmKRt8F+Av6iq5wMvprcdJmb9k6wFTgemq+qFwCp66zgx22CpJiYU6Bsbuqp+BMyMDb2iVNV9VfXF5vn36f0xWEtvXS9tXnYp8Prm+ZHAFVX1SFV9HbiL3rYaa0l2Aw4HPtTXPBHbIMkzgF8ELgKoqh9V1feYkPXvswPwtCQ7AD9BbzCvSdsGW22SQmG+saFXrCRTwP7ATcDPVtV90AsO4FnNy1bqdvkAcCbwT31tk7IN9gIeAD7cHD77UJKnMznrT1V9G/hD4JvAfcCWqvoME7QNlmqSQmHOsaGHXsWQJPlJ4BPAr1fVPy700jnaxnq7JHkdcH9V3TroW+ZoG+dtsANwAHBBVe0PPERzmGQeK239ac4VHAnsCTwHeHqS4xd6yxxtY70NlmqSQmFixoZOsiO9QLisqj7ZNH8nybOb+c8G7m/aV+J2eTlwRJJ76B0mPDjJR5mcbbAZ2FxVNzXTV9ELiUlZf4BDgK9X1QNV9SjwSeBfMVnbYEkmKRQmYmzoJKF3LPnOqjqnb9Y1wAnN8xOAq/vaj03y1CR7AvsA/2tY9Xahqs6qqt2qaorev/Pnqup4JmQbVNXfA99Ksm/T9GrgK0zI+je+CbwsyU80vxOvpnd+bZK2wZJ0Ohzn9qSqHksyMzb0KuDirR0beky8HHgz8OUkX2ra3g2cDVyZ5K30fmGOBqiqO5JcSe+PxmPA26rq8aFXPRyTtA3eAVzWfAG6GziJ3pfAiVj/qropyVXAF+mt09/S69biJ5mQbbBUdnMhSWpN0uEjSdIiDAVJUstQkCS1DAVJUstQkCS1DAWtSE3XDp10eJhkTZKbmi4kXrHMy35Jktf2TR8x06Nvktf3r1OS9yU5ZDk/X/KSVGkrJTkWOKyqTlj0xVu33B2A4+n17Pn2OeZfAny6qq5azs+V+hkKGmtNR29X0uuWYBXwH6vqY0k+D/wmvX5v3te8/GnATlW1Z5IDgXPo3cz0IHDiTEdpfct+LnAxsIZeB3MnAc+kd/fr04BvAwdV1cN977kH+BjwqqbpjVV1V5J1wHuAnYDvAm+qqu8k+e2mxqmmjl/oW/b7m+fTwJ8Dnwa2ND+/AvwHmpBI8mp6HcDtQO/u/dOq6pGmnkuBdcCOwNFV9dWt3c6aHB4+0rg7FLi3ql7c9Jv/F/0zq+qaqnpJVb0EuA34w6ZvqPOAo6rqQHp/+H93jmX/MfBnVfXzwGXAuVX1JeC3gI81y314jvf9Y1W9tHn/B5q2vwZe1nRQdwW9HlxnHAgcWVVvnLXsj/WtxxfohdEZzbz/MzMvyc7AJcAxVfUiesFwWt/yH6yqA4AL6AWlNC9DQePuy8AhSX4vySuqastcL0pyJvBwVX0Q2Bd4IfCXTVcg76G3pzHbQfS+oQN8hN63+EFc3vd4UPN8N+C6JF8GzqA3mMuMa+YJl0HtS6/zt68105fSG09hxkyniLfS2yOR5mUoaKw1fwgPpBcO70/yW7Nf0xxaORo4daYJuGNmD6KqXlRVrxnk4wYta47n5wF/3HyT/zVg577XPDTgcuczV7fP/R5pHh9ngvo709IYChprSZ4D/KCqPkrvmPoBs+Y/Fzgf+NW+b+ObgDVJDm
pes2OS/m/uM75Ar5dVgDfROwQ0iGP6Hm9snq+md54Anuilcy7fB35qK+d9FZhKsncz/WbgrwasVXoSvzVo3L0I+IMk/wQ8ypOPpQOcCOwCfKrXgzL3VtVrkxwFnJtkNb3fgw8As3vNPR24OMkZPHGieRBPTXITvS9dxzVtvw18PMm3gb+hN/jLXK4H1jeHtd4/a94VwJ8mOR04aqaxqn6Y5KRm+TMnmv9kwFqlJ/HqI2kZNVf7TFfVg6OuRVoKDx9JklruKUiSWu4pSJJahoIkqWUoSJJahoIkqWUoSJJa/x/lVChjLgBXLwAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "import matplotlib.patches as patches\n", "part_sizes=[len(x) for x in finished_partitions]\n", "plt.hist(part_sizes,bins=200,log=True)\n", "plt.ylabel('# of partitions (log)')\n", "plt.xlabel('size of partition');" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# supporting functions and pre-computations for plotting 2D-partitions \n", "\n", "def build_indexes(df):\n", " indexes = {}\n", " for column in categorical:\n", " values = sorted(df[column].unique())\n", " indexes[column] = { x : y for x, y in zip(values, range(len(values)))}\n", " return indexes\n", "\n", "def get_coords(df, column, partition, indexes, offset=0.1):\n", " if column in categorical:\n", " sv = df[column][partition].sort_values()\n", " l, r = indexes[column][sv[sv.index[0]]], indexes[column][sv[sv.index[-1]]]+1.0\n", " else:\n", " sv = df[column][partition].sort_values()\n", " next_value = sv[sv.index[-1]]\n", " larger_values = df[df[column] > next_value][column]\n", " if len(larger_values) > 0:\n", " next_value = larger_values.min()\n", " l = sv[sv.index[0]]\n", " r = next_value\n", " # we add some offset to make the partitions more easily visible\n", " l -= offset\n", " r += offset\n", " return l, r\n", "\n", "def get_partition_rects(df, partitions, column_x, column_y, indexes, offsets=[0.1, 0.1]):\n", " rects = []\n", " for partition in partitions:\n", " xl, xr = get_coords(df, column_x, partition, indexes, offset=offsets[0])\n", " yl, yr = get_coords(df, column_y, partition, indexes, offset=offsets[1])\n", " rects.append(((xl, yl),(xr, yr)))\n", " return rects\n", "\n", "def get_bounds(df, column, indexes, offset=1.0):\n", " if column in categorical:\n", " return 0-offset, len(indexes[column])+offset\n", " return df[column].min()-offset, df[column].max()+offset\n", 
"\n", "def plot_rects(df, ax, rects, column_x, column_y, edgecolor='grey', facecolor='none'):\n", " for (xl, yl),(xr, yr) in rects:\n", " ax.add_patch(patches.Rectangle((xl,yl),xr-xl,yr-yl,linewidth=1,edgecolor=edgecolor,facecolor=facecolor, alpha=0.5))\n", " ax.set_xlim(*get_bounds(df, column_x, indexes))\n", " ax.set_ylim(*get_bounds(df, column_y, indexes))\n", " ax.set_xlabel(column_x)\n", " ax.set_ylabel(column_y)\n", " \n", "indexes = build_indexes(df)\n", "column_x, column_y = feature_columns[:2]\n", "rects = get_partition_rects(df, finished_partitions, column_x, column_y, indexes, offsets=[0.0, 0.0])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Details of plotting function plot_rects has been skipped in the slide: Check the Jupyter notebook\n", "plt.figure(figsize=(15,8))\n", "ax = plt.subplot()\n", "plot_rects(df, ax, rects, column_x, column_y, facecolor='r')\n", "# we can also plot the datapoints themselves\n", "plt.scatter(df[column_x], df[column_y])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Generating the k-anonymous dataset\n", "\n", "> So far we have identified the partition boundaries, along which to group the data items.\n", "> * We need to aggregate data within the partitions, to create the k-anonymous dataset." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def agg_categorical_column(series):\n", " return [','.join(set(series))]\n", "\n", "def agg_numerical_column(series):\n", " return [series.mean()]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Caution: I have tested it with other feature_columns choices and the code is brittle and it breaks.\n", "# It serves the pedagogical purpose, but you may have to fix/reimplement for general purpose use.\n", "def build_anonymized_dataset(df, partitions, feature_columns, sensitive_column, max_partitions=None):\n", " aggregations = {}\n", " for column in feature_columns:\n", " if column in categorical:\n", " aggregations[column] = agg_categorical_column\n", " else:\n", " aggregations[column] = agg_numerical_column \n", " rows = []\n", " for i, partition in enumerate(partitions):\n", " #if i % 100 == 1:\n", " # print(\"Finished {} partitions...\".format(i))\n", " if max_partitions is not None and i > max_partitions:\n", " break\n", " grouped_columns = df.loc[partition].agg(aggregations, squeeze=False)\n", " 
sensitive_counts = df.loc[partition].groupby(sensitive_column).agg({sensitive_column : 'count'})\n", " values = grouped_columns.iloc[0].to_dict()\n", " for sensitive_value, count in sensitive_counts[sensitive_column].items():\n", " if count == 0:\n", " continue\n", " values.update({\n", " sensitive_column : sensitive_value,\n", " 'count' : count,\n", " })\n", " rows.append(values.copy())\n", " \n", " return pd.DataFrame(rows)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age education-num income count\n", "605 17.0 4.000000 <=50k 5\n", "110 17.0 5.000000 <=50k 36\n", "111 17.0 6.000000 <=50k 198\n", "0 17.0 7.200599 <=50k 334\n", "120 17.0 9.000000 <=50k 14\n", ".. ... ... ... ...\n", "726 90.0 9.000000 >50k 4\n", "736 90.0 10.545455 <=50k 9\n", "737 90.0 10.545455 >50k 2\n", "772 90.0 14.000000 <=50k 2\n", "773 90.0 14.000000 >50k 3\n", "\n", "[776 rows x 4 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we sort the resulting dataframe using the feature columns and the sensitive attribute\n", "dfn=build_anonymized_dataset(df, finished_partitions, feature_columns, sensitive_column)\\\n", " .sort_values(feature_columns+[sensitive_column])\n", "dfn" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Ungraded task \n", "\n", "Ungraded Task 8.1.i: Generate k-anonymous, l-diverse dataset. \n", "\n", "Hint: You need minor modification to how/when partitioning is done.\n", "\n", "Ungraded Task 8.1.ii: For a dataset of your choice, train a ML model (of your choice) with the raw data versus the k-anonymyzed data and compare their performances (you may use default parameter choices, or try to optimize for each case). \n", "\n", "Note: When using k-anonymyzed data to train, the test data could still be the way the original \"raw\" data was. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Ungraded Task 8.2: Generate differentially private data from the census dataset, using the same set of features and sensitive columns. Benchmark the performance of models trained with different parameter choices, comparing with the model built using the original (non-noisy) data. \n", "\n", "Hint: There are only two categories for the sensitive (income) data." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![anon](pics/anon.png)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }