Welcome! These are lecture notes that accompany an applied course on data mining. Read the syllabus to find details about assignments and due dates.

Everything you need for this course is in these notes and R tutorials, but there are some great textbooks available for free online.

Data mining is an applied skill and this course focuses on applications. Each class is a blend of concepts and applications using the programming language R. There are three projects, each consisting of a data mining analysis to solve a business problem and a brief written report of your findings.

By the end of this course, you'll be able to approach unfamiliar data and quickly find and communicate results.

The notes progress from left to right, and for those interested in exploring a topic further, there's sometimes bonus material in the up/down direction.

You can also locate slides by topic using the navigation bar at the top.

Data technology is a rapidly growing and changing industry. I chose a web-based format for course notes so the slides could be easily maintained and up-to-date. As you apply these skills outside of class, I hope these notes can be a valuable reference.

These slides are a work in progress and I need your help to improve them. If you see an error, or think something is unclear or omitted, please report it!

I'll do my best to keep the slides as updated and accurate as possible.

Governments, corporations, scientists, and consumers are creating and collecting more data than ever before. The unprecedented availability of data has transformed the modern economy and, for many, the human condition. Using clickstream technology, the modern advertiser can follow would-be customers as they browse online. These marketers are searching for clues about your preferences and shopping habits, evidence of your family composition, and even upcoming events in your life, all for the purposes of personalization. On an individual level, the Internet of Things allows us to continually monitor our bodies with Fitbits, Fuel bands, and Jawbone, and our homes with products like Nest.

Not only do consumers personally generate data, but they *share* it with others on social media. Similarly, the biological and earth sciences are exploding with data as the cost of collecting it has plummeted. And the government, of course, wants all the data.

But that's just it: we're *collecting* data faster than ever in history, but our *understanding* of these data has not kept pace.

Not only are we generating more data faster, we're developing more ways to represent and store data. Marketing personalization combines disparate data about a single person or household: demographics, transactional histories, social media activity, and clickstream data. Personalization efforts require the ability to ingest and combine data from many sources across many formats. Graph databases like Titan or neo4j represent complex interactions as network graphs, and analysts query the network with traversal patterns. Algorithms like MapReduce, on the other hand, revolutionized the way we query unstructured data, like text and weblogs.

In the context of big, heterogeneous data, the number of candidate features is too large to explore with traditional analysis techniques. Further, the volume of data creates new challenges that analysts historically haven't dealt with, such as:

- computational efficiency, *i.e.* speed
- memory use and allocation, *i.e.* computing resources
- data pipelines and integrity
- the ethical consequences of findings

Data Mining is the process of identifying patterns in large volumes of data __for the purposes of prediction__. Realistically, data cleaning and exploratory analysis will occupy the vast majority of your time with the rest left over for model discovery. Model deployment is optional and depends on the result of the first three steps. After completing the analysis you might find that long-term adoption of your model isn't worthwhile.

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. (Witten, Frank, and Hall)

The most commonly accepted definition of "data mining" is the discovery of "models" for data, where a 'model' could mean a statistical model, a machine learning model, or an algorithm like MapReduce. (Leskovec, Rajaraman, and Ullman)

Data Mining is an analytic process designed to explore data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns. (Hill & Lewicki)

You'll find that many of the techniques you've used in an analysis setting are identical to those used in machine learning, for example, linear and logistic regression. The key difference is that in machine learning, we'll judge our model quality by assessing its ability to predict on hold-out data, rather than through goodness-of-fit metrics like $ R^2 $ or AIC.

Data mining problems generally fall into one of three classes and we're going to get some hands-on experience with all three:

- Prediction of continuous outcomes
- Classification (or supervised learning)
- Clustering and unsupervised learning

In the first unit, you'll use a collection of meteorological data and regression techniques to predict the burn area of a nature park in Portugal.

In the second unit you will build classification models with logistic regression and decision trees to identify customers who will default on their next credit card payment.

In the third unit you'll use the unsupervised learning approaches of association rule mining and clustering to help prospective students identify cheaper alternatives to the US News and World Reports college rankings.

Like any analytical technique, and arguably more so than traditional analysis, data mining has limitations:

- many patterns are uninteresting
- patterns can be coincidental or a result of sampling bias
- data are almost always untidy or missing

How do these limitations influence the interpretation and application of the results of a data mining process? What are our standards for belief?

As a result of the volume and speed at which businesses can collect data, conversations about the ethical use of these data are more important than ever. We're used to reading headlines about ethical violations in big data, but often these incidents can be anticipated and avoided.

Here's a fairly horrifying list of articles about data mining and potential ethical violations.

Take 20 minutes and read the following piece from the New York Times Magazine which describes how Target used transactional data to target expectant mothers for marketing campaigns: How Companies Learn Your Secrets

Specifically, the marketers said they wanted to send specially designed ads to women in their second trimester, which is when most expectant mothers begin buying all sorts of new things, like prenatal vitamins and maternity clothing. “Can you give us a list?” the marketers asked.

Put yourself in the role of an analyst at Target.

- Why did people get upset?
- What data do you need to build a model to predict if a woman is pregnant?
- Is it ethical to merge data from outside of Target's scope in customers' lives (*e.g.* public birth records)?
- Does the cost of backlash outweigh the benefits of the knowledge?
- How might Target have met the business goals, without losing customer credibility?
- To what degree should Andrew Pole be held accountable for how his analyses were used?

When we are faced with a data mining application, what framework should we use to think about ethics?

Ethical decision points provide a framework for exploring the relationship between what values you hold as individuals—and as members of a common enterprise—and aligning those values with the actions you take in building and managing products and services utilizing big data technologies.

- **Inquiry**: what are my organization's values? What are my values?
- **Analysis**: what are the practice standards in my industry?
- **Articulation**: where are there alignments and gaps between our values and proposed data practices?
- **Action**: what must we do to close value alignment gaps?

Target has a values page on their website, but more relevant to this discussion is the privacy policy.

At Target, we want you to know how we collect, use, share and protect information about you. By interacting with Target, you consent to use of information that is collected or submitted as described in this privacy policy. We may change or add to this privacy policy, so we encourage you to review it periodically. To help you track the changes, we include a history of changes below.

In marketing, a great place to start is the Digital Marketing Association's Ethics and Compliance Program and its Guidelines for ethical business practice, the latter of which covers in depth both privacy and marketing to children, something Target did whether they knew it or not.

“If we send someone a catalog and say, ‘Congratulations on your first child!’ and they’ve never told us they’re pregnant, that’s going to make some people uncomfortable,” Pole told me. “We are very conservative about compliance with all privacy laws. But even if you’re following the law, you can do things where people get queasy.”

The girl was high school age, a fact that, presumably, Target knew. Why then did they choose to proceed with the mailer?

With the pregnancy products, though, we learned that some women react badly. Then we started mixing in all these ads for things we knew pregnant women would never buy, so the baby ads looked random. We’d put an ad for a lawn mower next to diapers. We’d put a coupon for wineglasses next to infant clothes. That way, it looked like all the products were chosen by chance.

... and then of course they say this dehumanizing thing.

And we found out that as long as a pregnant woman thinks she hasn’t been spied on, she’ll use the coupons. She just assumes that everyone else on her block got the same mailer for diapers and cribs. As long as we don’t spook her, it works.

The ethical consequences of data mining aren't a one-time consideration. Every time we sit down to conduct an analysis we should think about the potential for harm. We'll return to the discussion of ethics repeatedly throughout the course.

R is a leading tool in the industry for statistics and data science.

- is free and open-source
- has an active community, which means you've got tons of analysis options.
- is in demand!

- Data Mining with R: Learning with Case Studies
- Ethics of Big Data: Balancing Risk and Innovation
- Coursera course on R
- Exploratory Data Analysis from Coursera
- Exploratory Data Analysis from Udacity
- Mining of Massive Datasets
- The Elements of Statistical Learning. Trevor Hastie, Robert Tibshirani and Jerome Friedman. This is a classic text available from their website
- Hands-On Data Science with R
- Data mining resources

Your first project is to help a hypothetical bike-share program expand operations by identifying characteristics of successful bike stations. You'll do that by predicting the expected number of rentals per station as a function of the surrounding landscape.

Linear Regression is a technique for estimating numerical outcomes such as number of rentals per day or the ride time in minutes.

Linear regression is a method to describe an outcome variable in terms of a linear combination of predictor variables. Regression with multiple predictors allows us to answer the question "what is the best predictor of the outcome of interest?"

In data mining, linear regression is a technique that enables us to predict the value of a continuous variable, like ride duration or number of rentals, in terms of the values of other variables, like the number of crosswalks or museums nearby.

Regression approaches are great to have in your toolbelt due to their ease of

- interpretability
- implementation
- flexibility
- popularity

Regression coefficients have an explanatory interpretation. In the case of a regression with a single predictor, the coefficients are interpreted as the average change in the outcome for a unit change in an explanatory variable. Say for example we had the following regression equation:

$ duration = 15 + 30 * museums $

The interpretation is that the average ride duration is 15 minutes when there are no nearby museums ($museums = 0$), and the average duration increases by 30 minutes for each museum within a quarter mile radius.

Most programming languages have out-of-the-box implementations of linear regression, so they can be applied across many applications.

In prediction settings, linear models can outperform more complex nonlinear techniques, particularly when working with small training sets or sparse data. This is because regression is an averaging technique, making it less sensitive to missingness and extreme values.

Even in situations where linearity assumptions fail, regression can be applied on transformations of variables such as logarithms.
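For example, if ride durations are right-skewed, we might model the log of the outcome instead. A minimal sketch (the `rides` data frame and its columns are hypothetical names following the earlier example):

```
# fit the regression on the log scale, then back-transform predictions
model_log <- lm(log(duration) ~ museums, data = rides)

# predicted average duration for a station with 2 nearby museums
exp(predict(model_log, newdata = data.frame(museums = 2)))
```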

Let's walk through the foundations of regression with just one explanatory variable/covariate. Regression with just one covariate is called simple linear regression (SLR). SLR has a simpler framework than multiple regression and is fundamental to understanding more complex models.

SLR describes the relationship between an outcome and a single predictor variable. The regression equation expresses how the average of the outcome variable changes over the range of the predictor variable.

*e.g.* How does the expected number of rentals change as the number of nearby crosswalks changes?

Given $n$ observations of $x$ and $y$,

$ (x_1, y_1), (x_2, y_2), \ldots , (x_n, y_n) $

the model for a simple linear regression is expressed as

$ y_i = \beta_{0} + \beta_{1}x_i + \epsilon_i $

Hey, kinda looks like $y = mx + b$, where $\beta_0$ is the intercept and $\beta_1$ is the slope of the line.

How do we choose $\beta_0$ and $\beta_1$? With visual inspection alone, there seem to be several lines that could be drawn through the center of the points. We need an algorithm for standardizing and automating the process of finding the best one.

Data points will not fall exactly on any straight line we choose; there will always be some error. The distance of a point from the regression line is called the residual error, $ \epsilon_i $. The errors are assumed to be independent and normally distributed with mean 0 and standard deviation $\sigma$.

The method of least squares selects the line that minimizes the sum of the squares of the errors. $\epsilon_i$ is the error of the prediction line for each observation.

$ \epsilon_i = y_i - \beta_{0} - \beta_{1}x_i $

$ \epsilon_i^2 = (y_i - \beta_{0} - \beta_{1}x_i)^2 $

Choose $\beta_0$ and $\beta_1$ so that $\sum\limits_{i=1}^n\epsilon_i^2$ is minimized. Each $y_i$ is an observed value of the outcome variable, and $\beta_{0} + \beta_{1}x_i$ is the estimated, or expected, value of $y_i$. So when we minimize the sum of squared errors, we're making $(observed - expected)^2$ as close to 0 as possible.
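For SLR, the least-squares minimization has a closed-form solution, and we can verify it against R's `lm()` on simulated data (the true line below is made up for illustration):

```
# simulate data from a known line: y = 2 + 3x + noise
set.seed(42)
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50, sd = 1)

# closed-form least-squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# compare with lm(); the two sets of coefficients should match
coef(lm(y ~ x))
c(b0, b1)
```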

To safely apply linear regression without fear of weird results, some assumptions must be met:

- Linearity between the outcome and coefficients ($\beta$'s)
- Independence of the errors (no autocorrelation)
- Constant variance (homoscedasticity) of the errors
- Normality of errors

We might violate some assumptions and it's application-dependent whether we should be concerned about those violations.

In R, if you plot a linear model object, it'll give you a nice collection of diagnostic plots for assessing model assumptions.

```
model = lm(rentals ~ crossing, data = data)
par(mfrow = c(2, 2)) # make 4 plots in 2 rows, 2 cols
plot(model)
```

Regression equations have a nice interpretation. The intercept $\beta_0$ is the mean value of $y$ when $x = 0$. The slope of the regression line, $\beta_{1}$, is the change in the mean of y for each unit change in x.

```
model = lm(rentals ~ crossing, data = data)
summary(model)
Call:
lm(formula = rentals ~ crossing, data = data)
Residuals:
Min 1Q Median 3Q Max
-51.815 -21.176 -8.394 13.765 126.969
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.53163 2.59354 10.615 < 2e-16 ***
crossing 0.49159 0.07031 6.992 4.92e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 29.54 on 183 degrees of freedom
Multiple R-squared: 0.2108, Adjusted R-squared: 0.2065
F-statistic: 48.88 on 1 and 183 DF, p-value: 4.92e-11
```

We've discussed how regression is a tool for numerical prediction. When we use a model to make predictions, how do we know if the model is any good? To assess model quality, we quantify its *predictive accuracy*.

We measure prediction quality by simulating new data coming in. We do that by dividing the original data into a *training set* and a *testing set*. The *training set* is used to fit and tune the predictive model. Then, the *testing set* is used to validate the predictions. Generally construction of the train and test sets should be done completely at random.

- large sample size
  - 60% train
  - 20% test
  - 20% validation
- medium sample size
  - 60% train
  - 40% test
- small sample size
  - do cross-validation
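A minimal sketch of a random 60/20/20 split in R (assuming the observations live in a data frame called `data`, a stand-in name):

```
set.seed(1)
n <- nrow(data)

# randomly assign each row to a set with 60/20/20 probabilities
split <- sample(c("train", "test", "validation"), size = n,
                replace = TRUE, prob = c(0.6, 0.2, 0.2))

train      <- data[split == "train", ]
test       <- data[split == "test", ]
validation <- data[split == "validation", ]
```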

$\text{mean absolute error (MAE)} = \\ \frac{1}{n} \sum\limits_{i=1}^n |predicted_i - observed_i|$

$\text{root mean squared error (RMSE)} = \\ \sqrt{\frac{1}{n} \sum\limits_{i=1}^n (predicted_i - observed_i)^2}$
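Both metrics are one-liners in R. A sketch, where `model` and `test` are assumed to come from an earlier fit and split:

```
mae  <- function(predicted, observed) mean(abs(predicted - observed))
rmse <- function(predicted, observed) sqrt(mean((predicted - observed)^2))

# e.g. evaluate a fitted model on the held-out test set
# preds <- predict(model, newdata = test)
# rmse(preds, test$rentals)
```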

Just try them all, each metric has strengths and weaknesses.

- MSE and RMSE are sensitive to outliers.
- Median absolute deviation is more robust to outliers.

- OLS regression Explained Visually
- Practical Machine Learning is a great Coursera course and is part of the data science specialization from Johns Hopkins.
- A Short Introduction to the caret Package
- Building Predictive Models in R Using the caret Package

We'll refer to the regression with all available predictors as a *full model*. The full model of the number of rentals contains 71 predictors, which is too many to meaningfully interpret and may be challenging to apply in practice.

Further, models with many predictors risk being over-parameterized, or *overfit*.

Overfitting occurs when a model is trained too closely to a particular training set and as a result fails to generalize to new data.

This week we'll discuss methods for reducing model size while maintaining prediction accuracy that will make the final model practical to apply and that are not overfit.

*Reduce predictors*

- combination of features
- feature selection

*Regularization*

- keep all predictors, but shrink the parameter values
- ridge regression
- lasso

Linear regression works well when the number of observations, $n$, is larger than the number of predictors, $p$. In a data mining context we sometimes have data where $p >> n$. An example of this from the biomedical sciences is microarray technology. One tissue sample can produce over 50,000 measurements, making $p$ orders of magnitude larger than $n$.

The "curse of dimensionality" is not a problem of high-dimensional data, but a joint problem of the data and the algorithm being applied. It arises when the algorithm does not scale well to high-dimensional data, typically due to needing an amount of time or memory that is exponential in the number of dimensions of the data.

The bikeshare data set includes information about nearby amenities like the number of restaurants, statues, parking lots and cafes. Since we're mostly working with counts, one simple way to reduce the number of variables is to combine like features.

```
# combine related count variables into broader categories
nightlife <- bar + club + pub + nightclub
tourism   <- tourism_artwork + tourism_hotel + tourism_information
```

Subset selection methods use OLS regression, but limit the number of predictors, $k$, allowed to remain in the model. They iteratively add or remove predictors until $k$ predictors are included.

*Forward selection* starts with the intercept and sequentially adds predictors to the model.

*Backward selection* starts with the full model and sequentially deletes the predictors that contribute least to model fit.

Both forward and backward selection require that we specify a parameter value, $k$. But how do we choose it? Cross-validation!

Cross-validation works by splitting the data into $n$ folds. Your model building and error estimation procedure is then repeated $n$ times, each time holding out a different fold. After completing a 5-fold cross-validation you'd have 5 error estimates, which you can average to obtain a robust estimate of the prediction error.
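Here's a hedged sketch of cross-validated forward selection with *caret* (assuming the caret and leaps packages are installed and a `bikeshare` data frame exists; `leapForward` is one of caret's forward-selection methods, with `nvmax` playing the role of $k$):

```
library(caret)

# 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# forward selection, trying subset sizes 1 through 10
fit <- train(num_rentals ~ ., data = bikeshare,
             method = "leapForward",
             tuneGrid = data.frame(nvmax = 1:10),
             trControl = ctrl)

fit$bestTune  # the subset size with the lowest cross-validated RMSE
```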

After reducing the full model using subset selection, the final model is interpretable and potentially has better predictive accuracy than the full model. However we got there by throwing predictors out of the model.

Stepwise selection methods are not guaranteed to return the same variables for the forward and backward procedures, or even the "best" performing subset of size $k$. There's also the possibility of multiple, equally good models.

In contexts where there are many predictors, some of them are likely correlated. Fitting a regression with correlated predictors makes the model highly variable and produces unstable estimates of the coefficients.

When predictors are correlated they are said to exhibit multicollinearity, and in general it's something we want to avoid.

Take, for example, the regression model: $\text{num_rentals} = \beta_0 + \beta_1*\text{embassy} + \beta_2*\text{diplomatic_embassy}$.

These two predictors are virtually the same, have a correlation of 0.99, and models that include both of them display highly unstable coefficient values.

```
summary(lm(num_rentals ~ embassy + diplomatic_embassy, data = bikeshare))
Call:
lm(formula = num_rentals ~ embassy + diplomatic_embassy, data = bikeshare)
Residuals:
Min 1Q Median 3Q Max
-5857.3 -2112.1 -919.1 1118.9 24867.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2668.14 99.85 26.721 <2e-16 ***
embassy 2168.61 1102.25 1.967 0.0494 *
diplomatic_embassy -1260.87 1113.48 -1.132 0.2578
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3105 on 998 degrees of freedom
Multiple R-squared: 0.0297, Adjusted R-squared: 0.02775
F-statistic: 15.27 on 2 and 998 DF, p-value: 2.931e-07
```

Coefficient shrinkage methods penalize coefficient values so they do not exceed some threshold.

Ridge regression *shrinks* regression coefficients by penalizing based on the coefficient size. In ordinary least squares, we picked the $\beta$s that minimize our squared errors. Now we do the same, but include a penalty term, choosing the $\beta$s that minimize

$ \sum\limits_{i} (y_i - x^T_i\beta)^2 + \lambda\sum\limits_{j = 1}^p \beta_j^2 $

or, OLS + penalty

$\lambda$ is a complexity parameter that controls the amount of coefficient shrinkage. Setting $\lambda = 0$ gives exactly the same coefficients as OLS. Beyond that, the best choice of $\lambda$ depends on the data and model, and we'll use cross-validation to determine it.

Like ridge regression, the lasso is a shrinkage method, but with a major difference.

Minimize $\sum\limits_{i} (y_i - x^T_i\beta)^2$ subject to $\sum\limits_{j = 1}^p |{\beta_j}| \leq t$

Making $t$ sufficiently small will force some of the coefficients to be exactly zero, which makes the lasso a kind of subset selection.

If $t_0 \geq \sum\limits_{j = 1}^p |\beta_j|$, the lasso estimates and the OLS estimates are identical. If $t = t_0/2$ then the least squares coefficients are shrunk by about 50% on average. Once again we can pick the $t$ that gives the lowest test error, and we'll learn how to do that in *caret*.
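Both penalized fits can be sketched with the *glmnet* package (an assumption; the course's *caret* workflow can wrap it too). `x` must be a numeric predictor matrix and `y` the outcome vector; `bikeshare` and `num_rentals` follow the earlier example:

```
library(glmnet)

x <- model.matrix(num_rentals ~ ., data = bikeshare)[, -1]  # drop intercept column
y <- bikeshare$num_rentals

# alpha = 0 is ridge, alpha = 1 is the lasso;
# cv.glmnet cross-validates over a grid of penalty values
ridge_cv <- cv.glmnet(x, y, alpha = 0)
lasso_cv <- cv.glmnet(x, y, alpha = 1)

ridge_cv$lambda.min               # penalty with the lowest CV error
coef(lasso_cv, s = "lambda.min")  # note: some lasso coefficients are exactly 0
```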

We discussed three methods for model selection: forward/backward selection, ridge regression, and the lasso. But which should you use, and why? As usual, the choice of method is typically informed by your application.

If model size is a practical concern, then forward/backward selection with a sensible $k$ could be a good choice. For example, if the data you're working with are expensive to collect, like human surveys, then you'd like to reduce the number of variables recorded in the future.

If predictors come cheap and you want to tune the regression for better performance, then consider ridge regression or the lasso. For example maybe your data are performance metrics from webserver logs and data collection is simple.

- *Full models* can be impractical and lead to overfitting.
- Subset selection methods add, remove, or both add and remove predictors sequentially depending on the improvements in fit.
- Shrinkage methods penalize coefficient size to produce more stable estimates.
- Subset selection and shrinkage methods all require additional parameters, *e.g.* $k$, $\lambda$, and $t$. Cross-validation can be used to identify optimal values of these parameters.

We use classification techniques when we're trying to assign something to a category.

- Spam or not spam
- Diseased or not diseased
- Fraud or not fraud

In our case, the outcome $y$ is either 0 or 1: $y \in{\{0, 1\}}$

We could build a spam classifier with a linear regression predicting whether a message is spam. The predictor is the number of occurrences of the character '$' in a message.

Then to predict the class, create a threshold on the predicted value (here, equivalent to a rule on $x$):

- if $x \geq 8$, spam
- if $x < 8$, not spam

Messages with eight or more '$'s will be labeled as spam.

What if an extreme value is observed later? The current rule isn't violated, but the refit model begins to misclassify.

In binary classification the outcome, $y$, can only take the values $0$ or $1$, but linear regression produces predictions for $y$ that could be much larger than $1$ and much smaller than $0$. To make binary predictions in the last example, we imposed a threshold at $y=0.50$, but that threshold is arbitrary. Our previous prediction tool, linear regression, won't work well for classification problems. Instead we can use another type of regression called, logistic regression.

Generalized linear models (GLMs) were articulated by Nelder and Wedderburn in a 1972 paper in the Journal of the Royal Statistical Society.

GLMs are composed of three parts:

- An exponential family model for the response
- A systematic component via a linear predictor
- A link function that connects the means of the response to the linear predictor

- Exponential family model: the binomial (Bernoulli) distribution for a binary response
- Linear predictor: $ \eta_{i} = \sum_{k = 1}^{p} X_{ik}\beta_k $
- Link: $ g(\mu) = \eta = \log{\mu \over 1-\mu} $

**Goal:** $0 \leq predictions \leq 1$

$y = \beta_0 + \beta_{1}x$ ➟ Simple Linear Regression

How can we transform the right-hand side to produce values that are between 0 and 1, like the outcome?

$g(z) = \frac{1}{1+e^{-z}}$ ➟ Logistic (or sigmoid) function

$z = \beta_0 + \beta_{1}x$

$y = \frac{1}{1 + e^{-(\beta_0 + \beta_{1}x)}}$ ➟ Logistic Regression
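The logistic function squashes any real number into (0, 1), and fitting the model is a one-liner with `glm()`. A sketch (the `spam` data frame and `dollars` column are hypothetical names):

```
g <- function(z) 1 / (1 + exp(-z))  # the logistic (sigmoid) function

g(-10); g(0); g(10)  # ~0, 0.5, ~1: always between 0 and 1

# fitting a logistic regression: binomial family with the default logit link
# model <- glm(is_spam ~ dollars, data = spam, family = binomial)
```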

Model predictions represent the probability that $y = 1$ given the data $x$.

*e.g.* The probability of a message being spam, given the number of $\$'s$. If I get a value of 0.70 from my model, then $p(\text{message is spam}\mid 10 \text{\$'s}) = 0.70$.

The *decision boundary* separates the region where the model predicts $y = 1$ from the region where the model predicts $y = 0$.

- If $y \geq 0.5$, then the message is spam
- If $y < 0.5$, then the message is not spam

Suppose our fitted model is $p(\text{spam}) = g(z)$ with $z = -0.80 + 0.10 \cdot \text{number of \$s}$. Since $g(z) \geq 0.5$ exactly when $z \geq 0$, we

predict y = 1, if: $-0.80 + 0.10x \geq 0$

$0.10x \geq 0.80$

$x \geq 8$

The line at $x = 8$ is called the *decision boundary*.
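We can check this boundary numerically in base R, using the coefficients from the example above (`plogis()` is base R's logistic function):

```
p_spam <- function(x) plogis(-0.80 + 0.10 * x)

p_spam(8)   # exactly 0.5: the decision boundary
p_spam(7)   # below 0.5 -> predict not spam
p_spam(20)  # above 0.5 -> predict spam
```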

Like many Internet giants, Twitter makes money by selling ads, but they’ve got an insidious infestation eroding their advertising credibility: bots. More than 23 million of them. Twitter bots are automatons living in the Twittersphere and ranging wildly in capability. At their most complex they use speech patterns that can, at times, fool humans.

When advertisers pay for engagement, they aren’t interested in a four-hour flame war between a gamergate bot and a Kanye bot. When advertisers analyze social data they want to be sure the numbers are the result of human activity.

For project 2, we'll use Twitter user data to build a classifier for identifying bots.

Tree-based methods are a conceptually simple and powerful method for classification (and regression). They work well in non-linear settings where regression can underperform.

Besides working well in non-linear settings, an advantage of decision trees is their interpretability.

For example in a clinical setting, they mirror the way that a clinician might think when assessing patient disease risk.

Decision trees are created by iteratively splitting predictors to create increasingly homogeneous groups. The basic algorithm can be summarized by the following steps:

Start with all predictors and repeat:

- Find the variable and split that best "separates" the outcome, *i.e.* creates homogeneous groups
- Divide the data into two groups on that split
- Stop when the groups are too small or sufficiently homogeneous

The first step in our algorithm is to determine the predictor that best splits the data into homogeneous groups (sometimes referred to as *node purity*). There are many measures of node purity, but two common ones we'll see used in caret are:

- Misclassification
- Gini impurity

This measure of node purity estimates the probability that an observation in a node would be misclassified:

$\text{misclassification} = 1 - p(y = 1), \text{ if } y = 1 \text{ is the reference class.}$

A misclassification error of 0 implies perfect purity and 0.5 implies an even mixture of both classes (*i.e.* bot and not) and no 'purity.'

Gini impurity is a measure of how often an observation would be incorrectly classified if it was randomly labeled according to the distribution of classes in the split. Gini impurity is computed by summing the probability of each item being chosen times the probability of a classification mistake, *i.e.* $1 - prob(\text{random guess is correct})$.

$\text{Gini Impurity} = \sum_{i=1}^{k} p_i(1-p_i) = 1 - \sum_{i=1}^{k} p_i^2$ where $k$ is the number of classes (*e.g.* 'bot', 'not') and $p_i$ is the proportion of observations in class $i$.

A Gini impurity of 0 indicates all observations in the node fall into a single category.
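A tiny sketch of the Gini computation in R, using made-up class proportions:

```
gini <- function(p) 1 - sum(p^2)  # p: vector of class proportions in a node

gini(c(1, 0))      # 0: perfectly pure node
gini(c(0.5, 0.5))  # 0.5: worst case for two classes
gini(c(0.9, 0.1))  # 0.18: mostly one class
```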

Suppose we had two candidate splits:

Tree methods can suffer from overfitting, which occurs when our model is too tuned to the training data. When models are overfit they generalize poorly and exhibit high variance and poor prediction accuracy on new data.

When we were doing linear regression, we started by fitting the *full model*, which sometimes had the lowest RMSE, but was also prohibitively complex. Tree methods can suffer a similar weakness in that resulting trees tend to be large and too complex. Just like regression, a smaller tree with fewer splits often leads to lower variance, easier interpretation and lower test errors, at the cost of a little bias.

The complexity penalty $Cp$ ranges from 0 to 1 and penalizes the number of terminal nodes in the tree. A high value of $Cp$ makes a branch split less likely, and a lower $Cp$ penalizes less dramatically. In *caret*, the *train()* function will identify an optimal $Cp$ for us.
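A hedged sketch of tuning $Cp$ with *caret* (assuming the caret and rpart packages are installed; `twitter_users` and the factor outcome `bot` are hypothetical names for the project 2 data):

```
library(caret)

# tune the complexity parameter cp over a grid, via 5-fold cross-validation
fit <- train(bot ~ ., data = twitter_users, method = "rpart",
             tuneLength = 10,
             trControl = trainControl(method = "cv", number = 5))

fit$bestTune  # the cp value with the best cross-validated accuracy
```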

Single trees are simple to compute and pretty easy to understand, but they are susceptible to high variance when tested on new data. Methods such as boosting, bagging and random forests have been developed to overcome this problem.

Boosting works by combining the output of many potentially weak classifiers to produce a powerful consensus tree. A weak classifier is one whose error rate is only slightly better than random guessing.

Let's use this toy example to walk through the process.

Short for adaptive boosting, AdaBoost is one of the most popular boosting algorithms. AdaBoost is adaptive in the sense that subsequent weak learners are boosted in favor of the cases previously misclassified.

Bagging procedures work by randomly resampling data and refitting the model multiple times. The final model is determined by either averaging or taking a *majority vote* to form a single model.

The goal of bagging is to reduce variance by averaging many models.
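A minimal illustration of the bagging idea with *rpart* trees (package and data frame names are assumptions; in practice we'd use caret's `treebag` method):

```
library(rpart)

set.seed(7)
n_trees <- 25
trees <- lapply(1:n_trees, function(i) {
  # resample the rows with replacement, then refit the tree
  boot <- twitter_users[sample(nrow(twitter_users), replace = TRUE), ]
  rpart(bot ~ ., data = boot, method = "class")
})

# majority vote across the 25 trees for each new observation
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, newdata = new_users, type = "class"))
})
prediction <- apply(votes, 1, function(v) names(which.max(table(v))))
```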

Random forest is a special type of bagging algorithm that resamples the observations and, at each split, considers only a random subset of the variables. It's one of the best-performing classifiers.

- Tree-based Methods. A great tutorial from Ryan Kelly.
- Decision Tree Learning
- Johns Hopkins Data Mining course notes. Great intro to decision trees.
- Boosting Tutorial by Ron Meir.
- Classification Trees overview from Statsoft

At this point you've built models to predict numerical and binary outcomes. Your models were based on a training set and predictive accuracy was assessed with a testing set. This class of problems is called *supervised learning* because we observe the true value of the outcome: those observations serve as a "teacher" for the model, providing it the correct answer and allowing us to compute measures of accuracy such as RMSE and classification accuracy.

In *unsupervised learning* applications, the goal is to find hidden structure in data when we do not observe an outcome. Since we have no observed outcome to compare our predictions with, we cannot compute an error to measure the predictive accuracy of our models. This distinguishes unsupervised from supervised learning.

Association rule mining is a powerful and popular tool for discovering patterns between variables in large transactional databases. The goal is to find recurring associations.

In a retail context, association rule mining is often referred to as "market basket" analysis. In this context the observations are sales transactions, such as those occurring at the checkout counter of a store.

The resulting association rules can be very valuable for cross-marketing promotions, catalog design, consumer segmentation, and product recommendation.

$I = \{item_1, item_2, \dots, item_n\}$ is a collection of binary attributes called *items*.

$D = \{t_1, t_2, \dots, t_m\}$ is a set of uniquely labeled transactions in the dataset.

*Rules* take the form $X \Rightarrow Y$ where $X, Y \subseteq I$

The pool of items is $I = \{milk, bread, butter, beer\}$

The transaction dataset:

Transaction ID | Itemset
---|---
1 | milk, bread
2 | bread, butter
3 | beer
4 | milk, bread, butter
5 | bread, butter

$X \Rightarrow Y$

The itemsets on the left-hand-side (LHS) are called the *antecedent* and the itemsets on the right-hand-side (RHS) are called the *consequent* of the rule.

To identify interesting rules among all possible combinations, we need measures of significance. The most common constraints are imposed with *support*, *confidence*, and *lift*.

Association rules are rules which surpass a user-specified minimum support and minimum confidence threshold.

The *support* of an itemset is the proportion of transactions in the data which contain the itemset.

In the grocery example, the itemset $\{milk, bread\}$ occurs in 2 of 5 transactions, so it has a support of 2/5 = 0.4 = 40%.

What is the support for $\{bread, butter\}$?
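You can check your answer in base R by encoding the toy transactions directly (a minimal sketch):

```r
# The five toy transactions from the table above
transactions <- list(
  c("milk", "bread"),
  c("bread", "butter"),
  c("beer"),
  c("milk", "bread", "butter"),
  c("bread", "butter")
)

# Support: proportion of transactions containing every item in the itemset
support <- function(itemset, trans) {
  mean(sapply(trans, function(t) all(itemset %in% t)))
}

support(c("milk", "bread"), transactions)    # 0.4
support(c("bread", "butter"), transactions)  # 0.6
```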

The *confidence* of a rule, $X \Rightarrow Y$, answers the question "for what proportion of the transactions containing $X$ does the rule hold?", and is an estimate of the conditional probability $p(Y|X)$.

$confidence(X \Rightarrow Y) = \frac{\text{proportion of transactions containing both } X \text{ and } Y}{\text{proportion of transactions containing } X}$

The rule $\{milk, bread\} \Rightarrow \{butter\}$ has a confidence of 0.2/0.4 = 0.5, meaning that for 50% of the transactions containing $\{milk, bread\}$ the rule holds.

When association rules are constructed, they must satisfy both minimum support and minimum confidence thresholds; however, even with these requirements we're often left with a long list of candidate rules. We can further reduce the pool of rules using the *lift* metric.

$lift(X \Rightarrow Y) = \frac{support(X \cup Y)}{support(X) \times support(Y)}$

Lift measures how far the observed support of the whole rule deviates from the support we'd expect under independence, given the supports of the LHS and the RHS. Larger lift indicates a stronger association.

A lift of 1 indicates that the association between the LHS and RHS is exactly what we'd expect if there were no relationship between the itemsets, *i.e.* the itemsets are independent. Imagine you're at the grocery store buying your bread, butter, and milk, and you remember that you need vacuum cleaner bags. Needing vacuum bags has nothing to do with your breakfast shopping, so we'd expect that rule to have a lift near 1.

Lift values greater than 1 indicate that there's some dependence between the itemsets, for example chips and salsa, or beer and pretzels.
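Continuing the toy example in base R (a sketch; this redefines the same transactions and support helper used for the support calculation):

```r
transactions <- list(
  c("milk", "bread"), c("bread", "butter"), c("beer"),
  c("milk", "bread", "butter"), c("bread", "butter")
)
support <- function(itemset, trans) mean(sapply(trans, function(t) all(itemset %in% t)))

# Confidence: support of the whole rule over support of the LHS
confidence <- function(lhs, rhs, trans) {
  support(c(lhs, rhs), trans) / support(lhs, trans)
}

# Lift: support of the whole rule over the product of LHS and RHS supports
lift <- function(lhs, rhs, trans) {
  support(c(lhs, rhs), trans) / (support(lhs, trans) * support(rhs, trans))
}

confidence(c("milk", "bread"), "butter", transactions)  # 0.2 / 0.4 = 0.5
lift(c("milk", "bread"), "butter", transactions)        # 0.2 / (0.4 * 0.6) = 0.83
```

Note that here the lift is below 1: in this tiny dataset, $\{milk, bread\}$ and $\{butter\}$ co-occur slightly less often than independence would predict.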

The *apriori* algorithm is an iterative, breadth-first approach to finding frequent itemsets and deriving association rules from them.
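To illustrate the breadth-first idea, here's a minimal level-wise frequent-itemset search in base R, in the spirit of apriori (in practice you'd use the *arules* package; this sketch and its helper names are for illustration only):

```r
# Toy transactions and support helper, as in the grocery example
transactions <- list(
  c("milk", "bread"), c("bread", "butter"), c("beer"),
  c("milk", "bread", "butter"), c("bread", "butter")
)
support <- function(itemset, trans) mean(sapply(trans, function(t) all(itemset %in% t)))

frequent_itemsets <- function(trans, minsup = 0.4) {
  items <- sort(unique(unlist(trans)))
  level <- as.list(items)        # level 1: all single-item candidates
  frequent <- list()
  while (length(level) > 0) {
    # keep only the candidates that meet the minimum support
    keep <- Filter(function(s) support(s, trans) >= minsup, level)
    frequent <- c(frequent, keep)
    # next level: extend each frequent set by one later item
    # (infrequent sets are never extended -- the apriori pruning idea)
    level <- unique(do.call(c, lapply(keep, function(s)
      lapply(setdiff(items[items > max(s)], s), function(i) c(s, i)))))
  }
  frequent
}

frequent_itemsets(transactions, minsup = 0.4)
```

On the toy data this finds five frequent itemsets: {bread}, {butter}, {milk}, {bread, butter}, and {bread, milk}; {beer} is pruned at level 1 and never extended.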

In project 3 we'll use a dataset released by the White House last year called the College Scorecard. The purpose of the scorecard is to give students and parents more information for making college decisions and includes data about colleges (*e.g.* size, composition of student body, common majors) and data about labor market outcomes (*e.g.* median earnings after graduation, unemployment rates, and debt).

The goal of cluster analysis (or segmentation) is to group observations into subsets (clusters) such that observations within a cluster are more closely related to one another than observations in different clusters.

Clustering is accomplished by grouping observations according to a distance metric.

Often cluster analysis is used to identify distinct subgroups in data. Retailers often use shopper "personas" to represent patterns of shopping, *e.g.* Spendy Sally and Cheap Chad. Clustering techniques could be applied in that context to identify categories of shoppers.

The essential component of all clustering techniques is a measure of similarity (or dissimilarity) between observations. Once that measure is defined, the clustering method groups the observations accordingly.

One of the most popular clustering algorithms, the goal of $k$-means clustering is to group $n$ observations into $k$ clusters. $k$-means has essentially two steps:

- *Cluster assignment step*: assign each observation to the closest centroid.
- *Centroid shift step*: move each centroid to the average position of the points assigned to it.
- Stop when the centroids stop moving.
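These steps are iterated for us by R's built-in `kmeans()` function. A quick sketch with simulated data (the two groups are hypothetical):

```r
set.seed(42)
# two simulated, well-separated groups of 2-D points (25 points each)
pts <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
             matrix(rnorm(50, mean = 5), ncol = 2))

fit <- kmeans(pts, centers = 2)   # k = 2
fit$centers                        # final centroid positions
table(fit$cluster)                 # cluster sizes
```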

The number of centroids, $k$, is a required input for the $k$-means algorithm. The choice of $k$ depends on the goal. For segmentation applications, $k$ is typically defined by the context of the problem. For example, a sales team may have $k$ members, and the goal is to assign each salesperson a set of leads such that the customers assigned to each are as similar as possible. But often $k$ is unknown.

Check out this cool video of $k$-means simulations with varying $k$s and $n$s.

The results of $k$-means depend on the value of $k$ and the starting positions of the centroids, but another clustering method, hierarchical clustering, does not require us to specify the number of clusters. Instead we specify a distance metric between observations.

This method produces hierarchical representations in which the clusters at each level of the hierarchy are created by merging clusters at the next lower level. At the lowest level each cluster contains a single observation. At the highest level there is only one cluster containing all the data.

There are two main types of hierarchical clustering:

- *Agglomerative*: a 'bottom up' approach in which each observation starts as its own cluster and pairs of clusters are iteratively merged.
- *Divisive*: a 'top down' approach in which all observations start in a single cluster that is recursively split.

To construct the hierarchy, we'll repeat the following steps until there's only one cluster left:

- Identify the two closest clusters
- Merge them into one cluster
- Replace the merged pair with the new cluster

To decide which clusters to combine or divide, a distance metric (also called a dissimilarity) between observations is required. How can we measure the distance between two observations?

You already know one distance metric: Euclidean distance. In the clustering literature, distances are often referred to as *dissimilarities* (and their inverses as *similarities*).

A common tool for representing hierarchical clusters is the dendrogram. The distance at which two clusters merge is represented on a dendrogram by the height of the join.
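In base R, hierarchical clustering and its dendrogram take only a few lines (simulated data for illustration):

```r
set.seed(7)
# two simulated groups of 2-D points (10 points each)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 6), ncol = 2))

hc <- hclust(dist(pts))  # agglomerative; complete linkage by default
plot(hc)                 # draws the dendrogram
cutree(hc, k = 2)        # cut the tree into two clusters
```

`cutree()` slices the dendrogram at the height that yields `k` clusters, giving a cluster label for each observation.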

- Nice presentation by Pham Cuong about clustering techniques for recommendations.
- Clustering and Data Mining in R
- Data Clustering with R, from RDataMining.com.

- What you've learned
- Prediction of a continuous outcome
- Linear regression
- Ridge regression
- Lasso regression

- Prediction of a binary outcome
- Logistic regression
- Decision trees
- Bagging
- Boosting
- Random Forest

- Unsupervised learning
- Association rule mining
- K-means clustering
- Hierarchical clustering

You've learned a ton about data mining, problem solving and writing technical reports. To make the most of your efforts, let's create digital portfolios of work with RPubs and Markdown.

Markdown is a text-to-HTML conversion tool. It allows you to write your text using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid HTML. Markdown is

- a plain text formatting syntax and
- a software tool, written in Perl, that converts the plain text formatting to HTML

Markdown gives us a powerful communication tool that can be embedded with R code so that we can document analyses as we go. This is a powerful workflow because it keeps communication and clarity of thought top of mind. Ultimately our goal is to report and communicate findings, why not do that with a single tool?

Tools like Excel can be great for doing quick explorations, but as your analyses become complex, your work becomes increasingly irreproducible. That lack of reproducibility isn't just a pain for the consumers of your analyses; it's a huge pain for you.

"Prior to shifting to R, my basic approach was to enter data in Excel, import the data into Systat, and then use the command line to do a bunch of manipulations in there. I generally viewed those as one-off manipulations (e.g., calculating log density), and, while I could have saved a command file for those things, I didn’t. This meant that, if I discovered an error in the original Excel file, I needed to redo all that manually." Meghan Duffy

Another advantage of using a combination of R and Markdown as both your analysis and communication tool is that it enables interactivity, which can help engage your audience and allow them to explore the data themselves. As an example, check out this benefits calculator that a co-worker and I built to help companies understand the pros and cons of offering paid leave to employees.

R Markdown is a custom flavor of Markdown and an authoring format that enables easy creation of dynamic documents, presentations, and reports from R. It combines the core syntax of Markdown with embedded, runnable R code. R Markdown documents are fully reproducible, so they can be automatically regenerated whenever the underlying R code or data changes.

The *knitr* package was designed to be a transparent engine for dynamic report generation with R; *knitr* is the library that actually compiles the R Markdown document.

"When I was first starting out, I’d create a bunch of figures and tables and email them to my collaborator with a description of the findings in the body of the email. That was cumbersome for me and for the collaborator. ('Which figure are we talking about, again?')" Karl Broman

"...Now, I deliver my informal reports to collaborators as html documents that can be viewed in a browser. A big advantage to this is that I don’t have to worry about page breaks. For example, I can have very tall figures, with say 30 panels. That makes it easy to show the results in detail, and you don’t have to worry about how to get figures to fit nicely into a page." Karl Broman

Once you create a new R Markdown document you can choose from documents, presentations, Shiny applications, or templates.

R code is marked with backticks. To put code snippets in line with your text, just wrap them in `single backticks`. Bigger code chunks are fenced with triple backticks plus a chunk header, `{r chunk_name, options}`, like this:

````
```{r linear_regression}
require(ggplot2)
data(diamonds)
model <- lm(carat ~ depth, data = diamonds)
```
````

You can save as HTML, PDF or a Word Document.

RPubs is a free service from RStudio where you can upload your R Markdown files and view them online.

I’m a statistician, programmer, and senior data scientist at Zymergen. I’ve done research and data science in a broad range of industries including retail, cloud computing, and biotechnology. Along the way, I built product recommendations, web scrapers, interactive visualizations, and analyzed terabytes of data.