BI Tech CP303 - Data Mining

Erin Shellman - shellman@uw.edu

TA: Bryan Mayer - mayerbry@uw.edu

Welcome! These are lecture notes that accompany an applied course on data mining. Read the syllabus to find details about assignments and due dates.

Textbooks

Everything you need for this course is in these notes and R tutorials, but there are some great textbooks available for free online.

Course Philosophy

Data mining is an applied skill and this course focuses on applications. Each class is a blend of concepts and applications using the programming language R. There are three projects, each consisting of a data mining analysis to solve a business problem and a brief written report of your findings.

By the end of this course, you'll be able to approach unfamiliar data and quickly find and communicate results.

Navigating the notes

The notes progress from left to right, and for those interested in exploring more about a topic sometimes there's bonus material in the up, down direction.

Just like this!

Traverse the topics

You can also locate slides by topic using the navigation bar at the top.

Bookmark and check back often!

Data technology is a rapidly growing and changing industry. I chose a web-based format for course notes so the slides could be easily maintained and up-to-date. As you apply these skills outside of class, I hope these notes can be a valuable reference.

Notes improved with the support of viewers like you!

These slides are a work in progress and I need your help to make them rule. If you see an error or think something is unclear or omitted, please report it!

I'll do my best to keep the slides as updated and accurate as possible.

Introduction to Data Mining with R

BI Tech CP303 - Data Mining

R Tutorial

We are inundated with data

Governments, corporations, scientists, and consumers are creating and collecting more data than ever before. The unprecendented availability of data has transformed the modern economy and, for many, the human condition. Using clickstream technology, the modern advertiser can follow would-be customers as they browse online. These marketers are searching for clues about your preferences and shopping habits, evidence of your family composition, and even upcoming events in your life, all for the purposes of personalization. On an individual level, the Internet of Things allows us to continually monitor our bodies with Fitbits, Fuel bands, and Jawbone, and our homes with products like Nest.

Not only do consumers personally generate data, but they share it with others on social media. Similarly the biological and earth sciences are exploding with data as the cost of collecting it has plummetted. And the government, of course, wants all the data.

We are inundated with data

But that's just it, we're collecting data in faster than ever in history, but our growth of understanding of these data has not kept pace.

Data representation and storage is more heterogeneous than ever

Not only are we generating more data faster, we're developing more ways to represent and store data. Marketing personalization combines disperate data about a single person or household: demographics, transactional histories, social media activities and clickstream. Personalization efforts require the ability to ingest and combine data from many sources across many formats. Graph databases like Titan or neo4j represent complex interactions as network graphs and analysts query the network with traversal patterns. Algorithms like mapreduce on the other hand, revolutionized the way we query unstructured data, like text and weblogs.

We need new tools

In the context of big, heterogeneous data, the number of candidate features is too large to explore with traditional analysis techniques. Further, the volume of data creates new challenges that analyists historically haven't dealt with such as

computational efficiency, i.e. speed
memory use and allocation, i.e. computing resources
data pipelines and integrity
the ethical consequences of findings

Data mining is a process of knowledge discovery

Data Mining is the process of identifying patterns in large volumes of data for the purposes of prediction. Realistically, data cleaning and exploratory analysis will occupy the vast majority of your time with the rest left over for model discovery. Model deployment is optional and depends on the result of the first three steps. After completing the analysis you might find that long-term adoption of your model isn't worthwhile.

Additional definitions

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. Witten, Frank and Hall

The most commonly accepted definition of “data mining” is the discovery of “models” for data." Where 'model' could mean a statistical model, and machine learning model or an algorithm like mapreduce. Leskovec, Rajaraman, Ullman

Data Mining is an analytic process designed to explore data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns. Hill & Lewicki

Same techniques, different goal

You'll find that many of the techniques you've used in an analysis setting are identical to those used in machine learning, for example, linear and logistic regression. The key difference is that in machine learning, we'll judge our model quality by assessing its ability to predict on hold-out data, rather than through goodness-of-fit metrics like $ R^2 $ or AIC.

Lets look at some examples

Data mining problems generally fall into one of three classes and we're going to get some hands-on experience with all three:

Prediction of continuous outcomes
Classification (or supervised learning)
Clustering and unsupervised learning

Forest fires

In the first unit, you'll use a collection of meteorological data and regression techniques to predict the burn area of a nature park in Portugal.

Credit card default

In the second unit you will build classification models with logistic regresson and decision trees to identify customers who will default on their next credit card payment.

College ranking

Photo by DAVID ILIFF. License: CC-BY-SA 3.0

In the third unit you'll use the unsupervised learning approaches of association rule mining and clustering to help prospective students identify cheaper alternatives to the US News and World Reports college rankings.

Limitations of data mining

Like any analytical technique, and arguably more so than traditional analysis, data mining has limitations:

many patterns are uninteresting
patterns can be coincidental or a result of sampling bias
data are almost always untidy or missing

How do these limitations influence the interpretation and application of the results of a data mining process? What are our standards for belief?

Ethical data mining

As a result of the volume and speed at which businesses can collect data, conversations about the ethical use of these data are more important than ever. We're used to reading headlines about ethical violations in big data, but often these incidents can be anticipated and avoided.

Here's a fairly horrifying list of articles about data mining and potential ethical violoations.

Ethics case study: Target

Take 20 minutes and read the following piece from the New York Times Magazine which describes how Target used transactional data to target expectant mothers for marketing campaigns: How Companies Learn Your Secrets

The business problem:

Specifically, the marketers said they wanted to send specially designed ads to women in their second trimester, which is when most expectant mothers begin buying all sorts of new things, like prenatal vitamins and maternity clothing. “Can you give us a list?” the marketers asked.

Discussion questions

Put yourself in the role of an analyst at Target.

Why did people get upset?
What data do you need to build a model to predict if a woman is pregnant?
Is it ethical to merge data from outside of Target's scope in customers' lives (e.g. public birth records).
Does the cost of backlash outweigh the benefits of the knowledge?
How might Target have met the business goals, without losing customer credibility?
To what degree should Andrew Pole be held accountable for how his analyses were used?

How should we think about ethics?

When we are faced with a data mining application, what framework should we use to think about ethics?

Ethical decision points

Ethical decision points provide a framework for exploring the relationship between what values you hold as individuals—and as members of a common enterprise—and aligning those values with the actions you take in building and managing products and services utilizing big data technologies.

Inquiry: what are are my organization's values? What are my values?
Analysis: what are the practice standards in my industry?
Articulation: where are there alignments and gaps between our values and proposed data practices?
Action: what must we do to close value alignment gaps?

Inquiry: what are my values?

Target has a values page on their website, but more relevant to this discussion is the privacy policy.

At Target, we want you to know how we collect, use, share and protect information about you. By interacting with Target, you consent to use of information that is collected or submitted as described in this privacy policy. We may change or add to this privacy policy, so we encourage you to review it periodically. To help you track the changes, we include a history of changes below.

Analysis: what are the standards of my industry?

In marketing a great place to start is the Digital Marketing Association's Ethics and Compliance Program and Guidelines for ethical business practice the latter of which covers privacy and marketing to children, something Target did whether they knew it or not, in depth.

Articulation: are there value gaps?

If we send someone a catalog and say, ‘Congratulations on your first child!’ and they’ve never told us they’re pregnant, that’s going to make some people uncomfortable,” Pole told me. “We are very conservative about compliance with all privacy laws. But even if you’re following the law, you can do things where people get queasy.

The girl was high school age, a fact that, presumably, Target knew. Why then did they choose to proceed with the mailer?

Action: how do we close the value gaps?

With the pregnancy products, though, we learned that some women react badly. Then we started mixing in all these ads for things we knew pregnant women would never buy, so the baby ads looked random. We’d put an ad for a lawn mower next to diapers. We’d put a coupon for wineglasses next to infant clothes. That way, it looked like all the products were chosen by chance.

... and then of course they say this dehumanizing thing.

And we found out that as long as a pregnant woman thinks she hasn’t been spied on, she’ll use the coupons. She just assumes that everyone else on her block got the same mailer for diapers and cribs. As long as we don’t spook her, it works.

Ethical behavior requires constant diligence

The ethical consequences of data mining isn't a one-time consideration. Everytime we sit down to conduct and analysis we should think about the potential for harm. We'll return to the discussion of ethics repeatedly throughout the course.

From O'Reilly's 2014 survey of data science salaries.

R for statistical computing

R is a leading tool in the industry for statistics and data science.

is free and open-source
has an active community, which means you've got tons of analysis options.
is in demand!

Installation

Install R
Install R Studio

Introductory Resources

Data Mining with R: Learning with Case Studies
Ethics of Big Data: Balancing Risk and Innovation
Coursera course on R
Exploratory Data Analysis from Coursera
Exploratory Data Analysis from Udacity
Mining of Massive Datasets
The Elements of Statistical Learning. Trevor Hastie, Robert Tibshirani and Jerome Friedman. This is a classic text available from their website
Hands-On Data Sciece with R
Data mining resources

Predicting numerical outcomes with linear regression

BI Tech CP303 - Data Mining

R Tutorial

What makes a successful station?

Your first project is to help a hypothetical bike-share program expand operations by identifying characteristics of successful bike stations. You'll do that by predicting the expected number of rentals per station as a function of the surrounding landscape.

Linear Regression is a technique for estimating numerical outcomes such as number of rentals per day or the ride time in minutes.

What is linear regression?

Linear regression is a method to describe an outcome variable in terms of a linear combination of predictor variables. Regression with multiple predictors allows the us to answer the question "what is the best predictor of the outcome of interest?"

In data mining, linear regression is a technique that enables us to predict the value of a continuous variable, like ride duration or number of rentals, in terms of the values of other variables, like the number of crosswalks or museums nearby.

Why linear regression?

Regression approaches are great to have in your toolbelt due to their ease of

interpretability
implementation
flexibility
popularity

Interpretability

Regression coefficients have an explanatory interpretation. In the case of a regression with a single predictor, the coefficients are interpreted as the average change in the outcome for a unit change in an explanatory variable. Say for example we had the following regression equation:

$ duration = 15 + 30 * museums $

The interpretation is that the average ride duration is 15 minutes when there are no nearby museums ($museums = 0$), and the average duration increases by 30 minutes for each museum within a quarter mile radius.

Ease of implementation

Most programming languages have out-of-the-box implementations of linear regression, so they can be applied across many applications.

Examples of linear regression implementations in Python, Scala, and R.

Powerful

Predicted rentals times based on the number of nearby musuems (duration = 15 + 30 * museums).

In prediction settings, linear models can outperform more complex nonlinear techniques, particularly when working with small training sets or sparse data. This is because regression is an averaging technique, making it less sensitive to missingness and extreme values.

Flexible

Even in situations where linearity assumptions fail, regression can be applied on transformations of variables such as logarithms.

Mathematical foundations

Let's walk through the foundations of regression with just one explanatory variable/covariate. Regression with just one covariate is called simple linear regression (SLR). SLR has a simpler framework than multiple regression and is fundamental to understanding more complex models.

Mathematical foundations

SLR describes the relationship between an outcome and a single predictor variable. The regession equation expresses how the average of the outcome variable changes over the range of the predictor variable.

e.g. How does the expected number of rentals change as the number of nearby crosswalks changes?

Mathematical foundations

Given $n$ observations of $x$ and $y$,

$ (x_1, y_1), (x_2, y_2), \ldots , (x_n, y_n) $

the model for a simple linear regression is expressed as

$ y_i = \beta_{0} + \beta_{1}x_i + \epsilon_i $

Hey, kinda looks like $y = mx + b$ where $\beta_0$ is the intercept and $\beta_1$ is the slope of the line.

Which line is the best?

How do we choose $\beta_0$ and $\beta_1$? With visual inspection alone, there seems to be several lines that could be drawn through the center of the points. We need an algorithm for standardizing and automating the process of finding the best one.

Errors

Data points will not fall exactly on any straight line we choose, there will always be some error. The distance of a point from the regression line is called the residual error, $ \epsilon_i $. The errors are assumed to be independent and normally distributed with mean 0 and standard deviation $\sigma$.

Ordinary least squares ftw

The method of least squares selects the line that minimizes the sum of the squares of the errors. $\epsilon_i$ is the error of the prediction line for each observation.

$ \epsilon_i = (y_i - \beta_{0} - \beta_{1}x_i) $

$ \epsilon_i^2 = (y_i - \beta_{0} - \beta_{1}x_i)^2 $

Choose $\beta_0$ and $\beta_1$ so that $\sum\limits_{i=1}^n\epsilon_i^2$ is minimized.

Intuition

Each $y_i$ is an observed value of the outcome variable. $\beta_{0} + \beta_{1}x_i$ is the estimated, or expected, $y_i$ value. So when we minimize the sum of squared errors we're making $(observed - expected)^2$ as close to 0 as possible.

Assumptions

To safely apply linear regression without fear of weird results, some assumptions must be met:

Linearity between the outcome and coefficients ($\beta$'s)
Independence of the errors (no autocorrelation)
Constant variance (homoscedasticity) of the errors
Normality of errors

We might violate some assumptions and it's application-dependent whether we should be concerned about those violations.

Diagnostic plots

In R, if you plot a linear model object, it'll give you a nice collection of diagnostic plots for assessing model assumptions.

model = lm(rentals ~ crossing, data = data)
par(mfrow = c(2, 2)) # make 4 plots in 2 rows, 2 cols
plot(model)

Interpretation

Regression equations have a nice interpretation. The intercept $\beta_0$ is the mean value of $y$ when $x = 0$. The slope of the regression line, $\beta_{1}$, is the change in the mean of y for each unit change in x.

R regression output

model = lm(rentals ~ crossing, data = data)
summary(model)

Call:
lm(formula = rentals ~ crossing, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-51.815 -21.176  -8.394  13.765 126.969 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 27.53163    2.59354  10.615  < 2e-16 ***
crossing     0.49159    0.07031   6.992 4.92e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 29.54 on 183 degrees of freedom
Multiple R-squared:  0.2108,	Adjusted R-squared:  0.2065 
F-statistic: 48.88 on 1 and 183 DF,  p-value: 4.92e-11

Assessing model quality

We've discussed how regression is a tool for numerical prediction. When we use a model to make predictions, how do we know if the model is any good? To assess model quality, we quantify its predictive accuracy.

Hold-out sets

We measure prediction quality by simulating new data coming in. We do that by dividing the original data into a training set and a testing set. The training set is used to fit and tune the predictive model. Then, the testing set is used to validate the predictions. Generally construction of the train and test sets should be done completely at random.

Quick heuristics for study design

60% train
20% test
20% validation

60% train
40% test

Do cross validation

Measuring model quality for continuous outcomes
$\text{mean absolute error (MAE)} = \\ \frac{1}{n} \sum\limits_{i=1}^n |predicted_i - observed_i|$
$\text{root mean squared error (RMSE)} = \\ \sqrt{\frac{1}{n} \sum\limits_{i=1}^n (predicted_i - observed_i)^2}$

Which metric should I use?

Just try them all, each metric has strengths and weaknesses.

MSE and RMSE are sensitive to outliers.
Median absolute deviation is often more sensitive to outliers.

Linear Regression Resources

OLS regression Explained Visually
Practical Machine Learning is a great Coursera course and is part of the data science specialization from Johns Hopkins.
A Short Introduction to the caret Package
Building Predictive Models in R Using the caret Package

Model Selection

BI Tech CP303 - Data Mining

R Tutorial

The full model

We'll refer to the regression with all available predictors as a full model. The full model of the number of rentals contains 71 predictors, which is too many to meaningfully interpret and may be challenging to apply in practice.

Further, models with many predictors risk being over-parameterized, or overfit.

Overfitting

Overfitting occurs when a model is trained too closely to a particular training set and as a result fails to generalize to new data.

Underspecified models like the one on the left are said to exhibit 'high bias.' That is, they are opinionated about the nature of the relationship between x and y, specifically that y changes linearly as a function of x. Overspecified models like the one on the right are said to exhibit 'high variance' that is, their predictive accuracy is highly variable because the model is overtrained to a specific data context.

Simplifying the full model

This week we'll discuss methods for reducing model size while maintaining prediction accuracy that will make the final model practical to apply and that are not overfit.

Reduce predictors
- combination of features
- feature selection
Regularization - keep all predictors, but shrink the parameter values
- ridge regression
- lasso

The curse of dimensionality

Linear regression works well when the number of observations, $n$, is larger than the number of predictors, $p$. In a data mining context we sometimes have data where $p >> n$. An example of this from the biomedical science is microarray technology. One tissue sample can produce over 50,000 observations making $p$ orders of magnitude larger than $n$.

The "curse of dimensionality" is not a problem of high-dimensional data, but a joint problem of the data and the algorithm being applied. It arises when the algorithm does not scale well to high-dimensional data, typically due to needing an amount of time or memory that is exponential in the number of dimensions of the data.

Combining like features

The bikeshare data set includes information about nearby amenities like the number of restaurants, statues, parking lots and cafes. Since we're mostly working with counts, one simple way to reduce the number of variables is to combine like features.

nightlife = bar + club + pub + nightclub 
tourism = tourism_artwork + tourism_hotel + tourism_information

Subset selection

Subset selection methods use OLS regression, but limit the number of predictors $k$ allowed to remain in the model. They iteratively add dor remove predictors to the model until $k$ predictors are included.

Forward and backward selection

Forward selection starts with the intercept and sequentially adds predictors to the model.
Backward selection starts with the full model and sequentially deletes predictors that have the least contribution to model fit.

How should we set parameter values?

Both forward and backward selection requires that we specify a parameter value, $k$. but how do we choose? Cross validation!

Use cross validation to determine the best parameter values

Accurately Measuring Model Prediction Error

Cross-validation works by splitting the data up into sets of $n$ folds. Your model building and error estimation procedure is then repeated $n$ times. After completing a 5-fold cross-validation you'd have 5 error estimates and can average to obtain a robust estimate of the prediction error.

Cross Validation with caret

This plot shows how the RMSE changes as the parameter value $k$ changes.

Subset selection summary

After reducing the full model using subset selection, the final model is interpretable and potentially has better predictive accuracy than the full model. However we got there by throwing predictors out of the model.

Selection methods do not guarantee that the subsets from the stepwise procedures contain the same variables or even be the "best" performing subset. There's also the possibility of multiple, equally good models.

Multi-collinearity

In contexts where there are many predictors, some of them are likely correlated. Fitting regression with correlated predictors causes the model to be highly variable and produces unstable estimates of the coefficients.

When predictors are correlated they are said to exhibit multicollinearity, and in general it's something we want to avoid.

Variance inflation

Take, for example, the regression model: $\text{num_rentals} = \beta_0 + \beta_1*\text{embassy} + \beta_2*\text{diplomatic_embassy}$.

These two predictors are virtually the same, have a correlation of 0.99, and models that include both of them display highly unstable coefficient values.

Variance inflation

summary(lm(num_rentals ~ embassy + diplomatic_embassy, data = bikeshare))

Call:
lm(formula = num_rentals ~ embassy + diplomatic_embassy, data = bikeshare)

Residuals:
    Min      1Q  Median      3Q     Max 
-5857.3 -2112.1  -919.1  1118.9 24867.9 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2668.14      99.85  26.721   <2e-16 ***
embassy             2168.61    1102.25   1.967   0.0494 *  
diplomatic_embassy -1260.87    1113.48  -1.132   0.2578    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3105 on 998 degrees of freedom
Multiple R-squared:  0.0297,	Adjusted R-squared:  0.02775 
F-statistic: 15.27 on 2 and 998 DF,  p-value: 2.931e-07

Shrinkage methods

Coefficient shrinkage methods penalize coefficient values so they do not exceed some threshold.

Ridge regression

Ridge regression shrinks regression coefficients by penalizing based on the coefficient size. In ordinary least squares, we picked $\beta$s that minimize our squared errors. Now we do that, but include a penalty parameter.

$ \epsilon_i^2 = \sum\limits_{i} (y_i - x^T_i\beta)^2 + \lambda\sum\limits_{j = 1}^p \beta_j^2 $

or, OLS + penalty

Choosing $\lambda$

$\lambda$ is a complexity parameter that controls the amount of coefficient shrinkage. The penalty parameter $\lambda$ controls the level of shrinking. $\lambda = 0$ will give you exactly the same coefficients as OLS. Beyond that, the choice of $\lambda$ will depend on the data and model, and we'll use cross validation to determine the best value.

Lasso

Like ridge regression, lasso is a shrinkage method with a major difference.

$\epsilon_i^2 = \sum\limits_{i} (y_i - x^T_i\beta)^2$ subject to $\sum\limits_{j = 1}^p |{\beta_j}| \leq t$

Lasso

Making t sufficiently small will cause some of the coefficients to be exactly zero, making the lasso method similar to a subset selection.

Choosing $t$

If $t_0 \geq \sum\limits_{j = 1}^p |\beta_j|$, the lasso estimates and the OLS estimates are identical. If $t = t_0/2$ then the least squares coefficients are shrunk by about 50% on average. Once again we can pick the $t$ that gives the lowest test error, and we'll learn how to do that in caret.

Which method should I use?

We discussed 3 methods for model selection, forward/backward selection, ridge regression, and the lasso, but which should you use and why? As usual, your choice in method is typically informed from your application.

If model size is a practical concern then forward/backward selection with a sensible $k$ could be a good choice. For example, if the data you're working with is quite expensive to collect like human surveys then in the future you'd like to reduce the number of variables to record.

If predictors come cheap and you want to tune the regression for better performance, then consider ridge regression or the lasso. For example maybe your data are performance metrics from webserver logs and data collection is simple.

Constructing an analysis plan

Model Selection Summary

Full models can be impractical and lead to overfitting.
Subset selection methods add, remove, or both add and remove predictors sequentially depending on the improvements in fit.
Skrinkage methods penalize coefficient size to
Subset selection and shrinkage methods all require additional parameters to work, e.g. $k$, $\lambda$, and $t$. Cross-validation can be used to identify optimal values of required parameters.

Feature Selection Resources

Modeling training and tuning with caret

Classification with Logistic Regression

BI Tech CP303 - Data Mining

R tutorial

Classification

We use classification techniques when we're trying to assign something to a category.

Spam or not spam
Diseased or not diseased
Fraud or not fraud

In our case, the outcome $y$ is either 0 or 1: $y \in{\{0, 1\}}$

Identifying spam

We could build a spam classifier with a linear regression predicting whether a message is spam. The predictor is the number of occurrences of the character '$' in a message.

Identifying spam

Then to predict the class, create a threshold on $y$:

if $x \geq 8$, spam
if $x < 8$, not spam

Messages with more than eight '$'s, will be labeled as spam.

Identifying spam

What if an extreme value is observed later? The current rule isn't violated, but the refit model begins to misclassify.

We need a new tool

In binary classification the outcome, $y$, can only take the values $0$ or $1$, but linear regression produces predictions for $y$ that could be much larger than $1$ and much smaller than $0$. To make binary predictions in the last example, we imposed a threshold at $y=0.50$, but that threshold is arbitrary. Our previous prediction tool, linear regression, won't work well for classification problems. Instead we can use another type of regression called, logistic regression.

History of GLM

Articulated in Nelder and Wedderburn in 1972 paper in Journal of the Royal Statistical Society.

GLMs are composed of three parts:

An exponential family model for the response
A systematic component via a linear predictor
A link function that connects the means of the response to the linear predictor

Logistic regression GLM parts

Exponential family model:
Linear predictor: $ \eta_{i} = \sum_{k = 1}^{p} X_{ik}\beta_k $
Link: $ g(\mu) = \eta = \log{\mu \over 1-\mu} $

Deriving logistic regression
Goal: $0 \leq predictions \leq 1$
$y = \beta_0 + \beta_{1}x$ ➟ Simple Linear Regression
How can we transform the right hand size to produce values that are between 0 and 1, like the outcome?
$g(z) = \frac{1}{1+e^{-z}}$ ➟ Logistic (or sigmoid) function
$z = \beta_0 + \beta_{1}x$
$y = \frac{1}{1 + e^{-(\beta_0 + \beta_{1}x)}}$ ➟ Logistic Regression

The logistic function $g(z) = \frac{1}{1+e^{-z}}$

The logistic function $g(z) = \frac{1}{1+e^{-z}}$

How to interpret predictions?

Model predictions represent the probability that $y = 1$ given the data $x$.

e.g. The probability of a message being spam, given the number of $\$'s$. If I get a value of 0.70 from my model, then $p(\text{message is spam}\mid 10 \text{\$'s}) = 0.70$.

How do we decide the class?

The decision boundary separates the region where the model predicts $y = 1$ from the region where the model predicts $y = 0$.

If $y \geq 0.5$, then the message is spam
If $y < 0.5$, then the message is not spam

How do we decide the class?

$p(\text{spam}) = -0.80 + 0.10*\text{number of \$s}$

predict y = 1, if: $-0.80 + 0.10x \geq 0$

$0.10x \geq 0.80$

$x \geq 8$

The line at $x = 8$ is called the decision boundary.

Bot or Not?

Like many Internet giants Twitter makes money by selling ads, but they’ve got an insidious infestation eroding their advertising credibility: bots. More than 23 million of them. Twitter bots are automatons living in the Twittersphere and ranging wildly in capability. At their most complex they use speech patterns that can, at times, fool humans.

When advertisers pay for engagement, they aren’t interested in a four-hour flame war between a gamergate bot and a Kanye bot. When advertisers analyze social data they want to be sure the numbers are the result of human activity.

Bot or Not?

For project 2, we'll use Twitter user data to build a classifier for identifying bots.

Not a bot...

Spot a bot!

Logistic Regression Resources

Classification with Trees

BI Tech CP303 - Data Mining

R Tutorial

Classification with trees

Tree-based methods are a conceptually simple and powerful method for classification (and regression). They work well in non-linear settings where regression can underperform.

Classification with trees

Besides working well in non-linear settings, an advantage of decision trees is their interpretability.

For example in a clinical setting, they mirror the way that a clinician might think when assessing patient disease risk.

The construction of trees is iterative

Visualization of a series of splits in two-dimensions. Source.

Decisions trees are created by interatively splitting predictors to create increasingly homogenous groups. The basic algorithm can be summarized by the following steps:

Start with all predictors and repeat:

Find the variable and split that best "separates" the outcome, i.e. creates homogenous groups
Divide the data into two groups on that split
Stop when the groups are too small or sufficiently homogenous

How to determine the splitting point?

The first step in our algorithm is to determine the predictor that best splits the data into homogenous groups (sometimes refered to as node purity). There are many measures of node purity, but two common ones we'll see used in caret:

Misclassification
Gini Purity

Misclassification purity

This measure of node purity estimates the probability

$misclassification = 1 - p(y = 1), \text{if y = 1 is the reference class.}$

A misclassification error of 0 implies perfect purity and 0.5 implies an even mixture of both classes (i.e. bot and not) and no 'purity.'

Gini Impurity

Gini impurity is a measure of how often an observation would be incorrectly classified if it was randomly labeled according to the distribution of classes in the split. Gini impurity is computed by summing the probability of each item being chosen times the probability of a classification mistake, i.e. $1 - prob(\text{random guess is correct})$.

$\text{Gini Impurity} = \sum_{i=1}^{k} p_i(1-p_i) = 1 - \sum_{i=1}^{k} p_i^2$ where $k$ is the class (e.g. 'bot', 'not')

A gini impurity of 0 indicates all observations in the node fall into a single category.

Example splits

Suppose we had two candidate splits:

$misclassification = 1 / 16 = 0.06$ $gini = 1 - [(1 / 16)^2 + (15 / 16)^2] = 0.12$

$misclassification = 8 / 16 = 0.50$ $gini = 1 - [(8 / 16)^2 + (8 / 16)^2] = 0.50$

Controlling tree complexity

Tree methods can suffer from overfitting, which occurs when our model is too tuned to the training data. When models are overfit they generalize poorly and exhibit high variance and poor prediction accuracy on new data.

Pruning

When we were doing linear regression, we started by fitting the full model, which sometimes had the lowest RMSE, but was also prohibitively complex. Tree methods can suffer a similar weakness in that resulting trees tend to be large and too complex. Just like regression, a smaller tree with fewer splits often leads to lower variance, easier interpretation and lower test errors, at the cost of a little bias.

Complexity penalty, $Cp$

The complexity penality ranges from 0 to 1 and penalizes the number of terminal nodes in the tree. A high value of $Cp$ corresponds to a decreased likelihood of a branch split and lower $Cp$ penalizes less dramatically. In caret, the train() function will identify an optimal $Cp$ for us.

Reducing variance

Single trees are simple to compute and pretty easy to understand, but they are susceptible to high variance when tested on new data. Methods such as boosting, bagging and random forests have been developed to overcome this problem.

Boosting

Boosting works by combining the output of many potentially weak classifiers to produce a powerful consensus tree. A weak classifier is one whose error rate is only slightly better than random guessing.

Let's use this toy example to walk-through the process.

AdaBoost

Short for adaptive boosting, AdaBoost is one of the most popular boosting algorithms. AdaBoost is adaptive in the sense that subsequent weak learners are boosted in favor of the cases previously misclassified.

Bootstrap aggregating (bagging)

Bagging procedures work by randomly resampling data and refitting the model multiple times. The final model is determined by either averaging or taking a majority vote to form a single model.

The goal of bagging is to reduce variance by averaging many models.

Random Forest

Random forest is a special type of bagging algorithm that resamples both the variables and the data used to fit the model. It's one of the best performing classifiers.

Assume you've got a data set with 5 rows labeled: [1, 2, 3, 4, 5] and three predictors: [A, B, C]. The diagram above shows 4 hypothetical trees that could be constructed from those inputs. Note that the observations are sampled with replacement so some rows occur multiple times. The number of predictors to sample in each iteration is 2, rather than the 3 available in total. Source.

Tree Resources

Tree-based Methods. A great tutorial from Ryan Kelly.
Decision Tree Learning
Johns Hopkins Data Mining course notes. Great intro to decision trees.
Boosting Tutorial by Ron Meir.
Classification Trees overview from Statsoft

Association rule mining

BI Tech CP303 - Data Mining

R Tutorial

Farewell to supervised learning

At this point you've built models to predict numerical and binary outcomes. Your models were based on a training set and predictive accuracy was assessed with a testing set. This class of problems is called supervised learning because we observe the true value of the outcome, and those observations serve as a "teacher" for that model, providing it the correct answer and allowing us to compute measures of accuracy. e.g. RMSE and accuracy measurements.

Unsupervised Learning

In unsupervised learning applications, the goal is to find hidden structure in data when we do not observe an outcome. Since we have no observed outcome to compare our predictions with, we cannot compute an error to measure the predictive accuracy of our models. This distinguishes unsupervised from supervised learning.

Association Rule Mining

Association rule mining is a powerful and popular tool for discovering patterns between variables in large transactional databases. The goal is to find recurring associations.

Market basket analysis

In a retail context, association rule mining is often referred to as "market basket" analysis. In this context the observations are sales transactions, such as those occurring at the checkout counter of a store.

The resulting association rules can be very valuable for use in cross-marketing in promotions, catalog design, consumer segmentation based and product recommendation.

Formalizing market basket analysis

$I = \{item_1, item_2, \dots, item_n\}$ is a collection of binary attributes called items.

$D = \{t_1, t_2, \dots, t_m\}$ is a set of uniquely labeled transactions in the dataset.

Rules take the form $X \Rightarrow Y$ where $X, Y \subseteq I$

Grocery store example

The pool of items is $I = \{milk, bread, butter, beer\}$

The transaction dataset:

Transaction Id	Itemsets
1	milk, bread
2	bread, butter
3	beer
4	milk, bread, butter
5	bread, butter

Rules

$X \Rightarrow Y$

The itemsets on the left-hand-side (LHS) are called the antecedent and the itemsets on the right-hand-side (RHS) are called the consequent of the rule.

Filtering rules

To identify interesting rules from all possible combinations, we need to construct a measure of significance. The most common constraints are imposed with the support, confidence, and lift.

Association rules are rules which surpass a user-specified minimum support and minimum confidence threshold.

Support

The support of an itemset is the proportion of transactions in the data which contain the itemset.

In the grocery example, the itemset $\{milk, bread\}$ occurs in 2 of 5 transactions, so it has a support of 2/5 = 0.4 = 40%.

What is the support for $\{bread, butter\}$?

Confidence

The confidence of a rule, $X \Rightarrow Y$, answers the question "for what proportion of the transactions does this rule apply?", and is an estimate of the conditional probability $p(Y|X)$.

$confidence(X \Rightarrow Y) = \frac{\text{proportion of transactions containing the itemsets in X and Y}}{\text{Proportion of transactions containg the itemset X}}$

Confidence

The rule $\{milk, bread\} \Rightarrow \{butter\}$ has a confidence of 0.2/0.4 = 0.5, meaning that for 50% of the transactions containing $\{milk, bread\}$ the rule holds.

Constraining rules

When association rules are constructed, they must satisfy both minimal support and confidence thresholds, however even with these requirements we're often left with a long list of candidate rules. We can further reduce the pool of rules using the lift metric.

Lift

$lift(A \Rightarrow B) = \frac{support(A \cup B)}{support(A) \times support(B)}$

Lift is the deviation in the support of the whole rule from the support expected under independence given the supports of the LHS and the RHS. Larger lift indicates stronger associations.

Interpreting lift

A lift of 1 indicates that the association between the LHS and RHS is exactly what we'd expect if there is no relationship between the itemsets on either side, i.e. the itemsets are independent. Imagine you're at the grocery story buying your bread, butter and milk, and you remember that you need vacuum cleaner bags.

Lift values greater than 1 indicate that there's some depedence between the itemsets. For example chips and salsa, or beer and pretzels.

The apriori algorithm

The apriori algorithm is an iterative, breadth-first approach to finding frequent itemsets and deriving association rules from them.

College Scorecard Data

In project 3 we'll use a dataset released from the White House last year called the College Scorecard. The purpose of the scorecard is to give students and parents more information for making college decisions and includes data about colleges (e.g. size, composition of student body, common majors) and data about labor market outcomes (e.g. median earnings after graduation, unemployment rates, and debt).

Association Rules Resources

Clustering

BI Tech CP303 - Data Mining

R Tutorial

Clustering

The goal of cluster analysis (or segmentation) is to group observations into subsets (clusters) such that observations within a cluster are more closely related to one another than observations in different clusters.

Clustering is accomplished by grouping observations according to a distance metric.

Discovering categories

Often cluster analysis is used to identify distinct subgroups in data. Retailers often use shopper "personas" to represent patterns of shopping, e.g. Spendy Sally and Cheap Chad. Clustering techniques could be applied in that context to identify categories of shoppers.

Defining similarity is the key

The essential component of all clustering techniqiues is a measure of similarity (or dissimilarity) between observations. Once the measure of similarity is defined, the clustering method groups the observations based on the measure of similarity.

$k$-means Clustering

One of the most popular clustering algorithms, the goal of $k$-means clustering is to group $n$ observations into $k$ clusters. $k$-means has essentially two steps:

Cluster assignment step - assign all observations to one of the centroids based on which is closest.
Centroid shift step - move the centroids to the average value of the points within the cluster.
Stop when the centroids stop moving.

$k$-means Algorithm

1. Randomly initialize K = 3 cluster centroids (red, green, and blue).

2. Assign each observation to a cluster based on which centroid they're nearest.

3. Find the mean of the clusters and move the K centriods to those means.

4. Repeat steps 2 and 3 until the centroids stop moving.

Selecting $k$

The number of centroids, $k$, is a required input for the $k$-means algorithm. The choice of the size of $k$ depends on the goal. For segmentation applications, $k$ is typically defined by the context of the problem. For example, a salesteam may have $k$ members, and the goal is to assign each salesperson a set of leads such that the customers assigned to each are similar as possible. But often $k$ is unknown.

Check out this cool video of $k$-means simulations with varying $k$s and $n$s.

Hierarchical Clustering

The results of $k$-means depend on the value of $k$ and starting position of the centroids, but another clustering method, hierarchical clustering, does not require us to specificy the number of clusters. Instead we specify a distance metric between observations.

This method produces hierarchical representations in which the clusters at each level of the hierarchy are created by merging clustering at the next lower level. At the lowest level each cluster contains a single observation. At the highest level there is only one cluster containing all the data.

Methods of hierarchical clustering

There are two main types of hierarchical clustering:

Agglomerative - a 'bottom up' approach in which each observation is initially its own cluster and pairs of observations are iteratively combined.
Divisive - a 'top down' approach in which all observations start in a single cluster and recursively split.

Agglomerative clustering

To construct the hierarchy, we'll repeat the following steps until there's only one cluster left:

Identify the closest set of clusters
Merge them into one cluster
Remove their unclustered nodes

Distance

To decide which clusters to combine or divide, a distance metric (also called a dissimilarity) between observations is required. How can we measure the distance between two observations?

Euclidean Distance

Look familiar? It's an application of the Pythagorean theorem!

You already know one distance metric: Euclidean distance. Distances are often referred to as similarities or dissimilarities.

Dendrograms

A common tool for representing hierarchical clusters is the dendrogram. The distance between observations is represented on a dendrogram by the length of the edge.

Clustering Resources

Nice presentation by Pham Cuong about clustering techniques for recommendations.
Clustering and Data Mining in R
Data Clustering with R, from RDataMining.com.

You've come a long way, baby

Prediction of a continuous outcome
- Linear regression
- Ridge regression
- Lasso regression
Prediction of a binary outcome
- Logistic regression
- Decision trees
  - Bagging
  - Boosting
  - Random Forest
Unsupervised learning
- Association rule mining
- K-means clustering
- Hierarchical clustering

GREAT JOB!!!

And now... share your work!

You've learned a ton about data mining, problem solving and writing technical reports. To make the most of your efforts, let's create digital portfolios of work with RPubs and Markdown.

Markdown

Markdown is a text-to-HTML conversion tool. It allows you to write your text using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid HTML. Markdown is

a plain text formatting syntax and
a software tool, written in Perl, that converts the plain text formatting to HTML

An analysis and communication tool combined

Markdown gives us a powerful communication tool that can be embedded with R code so that we can document analyses as we go. This is a powerful workflow because it keeps communication and clarity of thought top of mind. Ultimately our goal is to report and communicate findings, why not do that with a single tool?

Reproducibility

Tools like Excel can be great for doing quick explorations, but as your analyses become complex, your work becomes increasingly unreproducable. That lack of reproducibility isn't just a pain for the consumers of your analyses, but it's a huge pain for you.

"Prior to shifting to R, my basic approach was to enter data in Excel, import the data into Systat, and then use the command line to do a bunch of manipulations in there. I generally viewed those as one-off manipulations (e.g., calculating log density), and, while I could have saved a command file for those things, I didn’t. This meant that, if I discovered an error in the original Excel file, I needed to redo all that manually." Meghan Duffy

Interactivity

Another advantage of using a combination of R and Markdown as both your analysis and communication tool is that it enables interactivity, which can help engage your audience and allow them to explore the data themselves. As an example, check out this benefits calculator that a co-worker and I built to help companies understand the pros and cons of offering paid leave to employees.

R Markdown

R Markdown is a custom flavor of Markdown and an authoring format that enables easy creation of dynamic documents, presentations, and reports from R. It combines the core syntax of Markdown with embedded,runnable R code. R Markdown documents are fully reproducible, so that they can be automatically regenerated whenever underlying R code or data changes.

KnitR

The knitr package was designed to be a transparent engine for dynamic report generation with R. KnitR is the library that actually compiles the Markdown.

"When I was first starting out, I’d create a bunch of figures and tables and email them to my collaborator with a description of the findings in the body of the email. That was cumbersome for me and for the collaborator. ('Which figure are we talking about, again?')" Karl Broman

A better way to communicate

"...Now, I deliver my informal reports to collaborators as html documents that can be viewed in a browser. A big advantage to this is that I don’t have to worry about page breaks. For example, I can have very tall figures, with say 30 panels. That makes it easy to show the results in detail, and you don’t have to worry about how to get figures to fit nicely into a page." Karl Broman

Make a new R Markdown doc

Once you select a new Markdown document you can choose from documents, presentations, Shiny application or from a template.

Denoting code

Backticks ` and ```{r chunk_name, options}. If you want to put code snippets in line with your text just wrap them in `the ticks`. To make bigger code chunks, then do this:

```{r linear regression}
require(ggplot2)
data(diamonds)
model = lm(carat ~ depth, data = diamonds)
```

Output your report to a file

Export your Markdown to a format of your choice

You can save as HTML, PDF or a Word Document.

Or, share your notebooks online!

RPubs is a free service from RStudio where you can upload your R Markdown files and view them online.

Sharing Resources

But who is Erin Shellman?

I’m a statistician, programmer, and senior data scientist at Zymergen. I’ve done research and data science in a broad range of industries including retail, cloud computing, and biotechnology. Along the way, I built product recommendations, web scrapers, interactive visualizations, and analyzed terabytes of data.

BI Tech CP303 - Data Mining

Erin Shellman - shellman@uw.edu

TA: Bryan Mayer - mayerbry@uw.edu

Textbooks

Course Philosophy

Navigating the notes

Just like this!

Traverse the topics

Bookmark and check back often!

Notes improved with the support of viewers like you!

Introduction to Data Mining with R

BI Tech CP303 - Data Mining

We are inundated with data

We are inundated with data

Data representation and storage is more heterogeneous than ever

We need new tools

Data mining is a process of knowledge discovery

Additional definitions

Same techniques, different goal

Lets look at some examples

Forest fires

Credit card default

College ranking

Limitations of data mining

Ethical data mining

Ethics case study: Target

The business problem:

Discussion questions

How should we think about ethics?

Ethical decision points

Inquiry: what are my values?

Analysis: what are the standards of my industry?

Articulation: are there value gaps?

Action: how do we close the value gaps?

Ethical behavior requires constant diligence

R for statistical computing

Installation

Introductory Resources

Predicting numerical outcomes with linear regression

BI Tech CP303 - Data Mining

What makes a successful station?

What is linear regression?

Why linear regression?

Interpretability

Ease of implementation

Powerful

Flexible

Mathematical foundations

Mathematical foundations

Mathematical foundations

Which line is the best?

Errors

Ordinary least squares ftw

Intuition

Assumptions

Diagnostic plots

Interpretation

R regression output

Assessing model quality

Hold-out sets

Quick heuristics for study design

Measuring model quality for continuous outcomes$\text{mean absolute error (MAE)} = \\ \frac{1}{n} \sum\limits_{i=1}^n |predicted_i - observed_i|$$\text{root mean squared error (RMSE)} = \\ \sqrt{\frac{1}{n} \sum\limits_{i=1}^n (predicted_i - observed_i)^2}$

Which metric should I use?

Linear Regression Resources

Model Selection

BI Tech CP303 - Data Mining

The full model

Simplifying the full model

Combining like features

Subset selection

Forward and backward selection

How should we set parameter values?

Use cross validation to determine the best parameter values

Cross Validation with caret

Subset selection summary

Multi-collinearity

Variance inflation

Variance inflation

Shrinkage methods

Ridge regression

Measuring model quality for continuous outcomes
$\text{mean absolute error (MAE)} = \\ \frac{1}{n} \sum\limits_{i=1}^n |predicted_i - observed_i|$
$\text{root mean squared error (RMSE)} = \\ \sqrt{\frac{1}{n} \sum\limits_{i=1}^n (predicted_i - observed_i)^2}$