BI TECH CP303 - Data Mining

BI Tech CP303 - Data Mining

TA: Bryan Mayer - mayerbry@uw.edu

Welcome! These are lecture notes that accompany an applied course on data mining. Read the syllabus to find details about assignments and due dates.

Textbooks

Everything you need for this course is in these notes and R tutorials, but there are some great textbooks available for free online.

Course Philosophy

Data mining is an applied skill and this course focuses on applications. Each class is a blend of concepts and applications using the programming language R. There are three projects, each consisting of a data mining analysis to solve a business problem and a brief written report of your findings.

By the end of this course, you'll be able to approach unfamiliar data and quickly find and communicate results.

Navigating the notes

The notes progress from left to right, and for those interested in exploring more about a topic sometimes there's bonus material in the up, down direction.

Traverse the topics

You can also locate slides by topic using the navigation bar at the top.

Bookmark and check back often!

Data technology is a rapidly growing and changing industry. I chose a web-based format for course notes so the slides could be easily maintained and up-to-date. As you apply these skills outside of class, I hope these notes can be a valuable reference.

Notes improved with the support of viewers like you!

These slides are a work in progress and I need your help to make them rule. If you see an error or think something is unclear or omitted, please report it!

I'll do my best to keep the slides as updated and accurate as possible.

We are inundated with data

Governments, corporations, scientists, and consumers are creating and collecting more data than ever before. The unprecendented availability of data has transformed the modern economy and, for many, the human condition. Using clickstream technology, the modern advertiser can follow would-be customers as they browse online. These marketers are searching for clues about your preferences and shopping habits, evidence of your family composition, and even upcoming events in your life, all for the purposes of personalization. On an individual level, the Internet of Things allows us to continually monitor our bodies with Fitbits, Fuel bands, and Jawbone, and our homes with products like Nest.

Not only do consumers personally generate data, but they share it with others on social media. Similarly the biological and earth sciences are exploding with data as the cost of collecting it has plummetted. And the government, of course, wants all the data.

We are inundated with data

But that's just it, we're collecting data in faster than ever in history, but our growth of understanding of these data has not kept pace.

Data representation and storage is more heterogeneous than ever

Not only are we generating more data faster, we're developing more ways to represent and store data. Marketing personalization combines disperate data about a single person or household: demographics, transactional histories, social media activities and clickstream. Personalization efforts require the ability to ingest and combine data from many sources across many formats. Graph databases like Titan or neo4j represent complex interactions as network graphs and analysts query the network with traversal patterns. Algorithms like mapreduce on the other hand, revolutionized the way we query unstructured data, like text and weblogs.

We need new tools

In the context of big, heterogeneous data, the number of candidate features is too large to explore with traditional analysis techniques. Further, the volume of data creates new challenges that analyists historically haven't dealt with such as

• computational efficiency, i.e. speed
• memory use and allocation, i.e. computing resources
• data pipelines and integrity
• the ethical consequences of findings

Data mining is a process of knowledge discovery

Data Mining is the process of identifying patterns in large volumes of data for the purposes of prediction. Realistically, data cleaning and exploratory analysis will occupy the vast majority of your time with the rest left over for model discovery. Model deployment is optional and depends on the result of the first three steps. After completing the analysis you might find that long-term adoption of your model isn't worthwhile.

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. Witten, Frank and Hall
The most commonly accepted definition of “data mining” is the discovery of “models” for data." Where 'model' could mean a statistical model, and machine learning model or an algorithm like mapreduce. Leskovec, Rajaraman, Ullman
Data Mining is an analytic process designed to explore data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns. Hill & Lewicki

Same techniques, different goal

You'll find that many of the techniques you've used in an analysis setting are identical to those used in machine learning, for example, linear and logistic regression. The key difference is that in machine learning, we'll judge our model quality by assessing its ability to predict on hold-out data, rather than through goodness-of-fit metrics like $R^2$ or AIC.

Lets look at some examples

Data mining problems generally fall into one of three classes and we're going to get some hands-on experience with all three:

1. Prediction of continuous outcomes
2. Classification (or supervised learning)
3. Clustering and unsupervised learning

Forest fires

In the first unit, you'll use a collection of meteorological data and regression techniques to predict the burn area of a nature park in Portugal.

Credit card default

In the second unit you will build classification models with logistic regresson and decision trees to identify customers who will default on their next credit card payment.

College ranking

In the third unit you'll use the unsupervised learning approaches of association rule mining and clustering to help prospective students identify cheaper alternatives to the US News and World Reports college rankings.

Limitations of data mining

Like any analytical technique, and arguably more so than traditional analysis, data mining has limitations:

• many patterns are uninteresting
• patterns can be coincidental or a result of sampling bias
• data are almost always untidy or missing

How do these limitations influence the interpretation and application of the results of a data mining process? What are our standards for belief?

Ethical data mining

As a result of the volume and speed at which businesses can collect data, conversations about the ethical use of these data are more important than ever. We're used to reading headlines about ethical violations in big data, but often these incidents can be anticipated and avoided.

Here's a fairly horrifying list of articles about data mining and potential ethical violoations.

Ethics case study: Target

Take 20 minutes and read the following piece from the New York Times Magazine which describes how Target used transactional data to target expectant mothers for marketing campaigns: How Companies Learn Your Secrets

Specifically, the marketers said they wanted to send specially designed ads to women in their second trimester, which is when most expectant mothers begin buying all sorts of new things, like prenatal vitamins and maternity clothing. “Can you give us a list?” the marketers asked.

Discussion questions

Put yourself in the role of an analyst at Target.

• Why did people get upset?
• What data do you need to build a model to predict if a woman is pregnant?
• Is it ethical to merge data from outside of Target's scope in customers' lives (e.g. public birth records).
• Does the cost of backlash outweigh the benefits of the knowledge?
• How might Target have met the business goals, without losing customer credibility?
• To what degree should Andrew Pole be held accountable for how his analyses were used?

How should we think about ethics?

When we are faced with a data mining application, what framework should we use to think about ethics?

Ethical decision points

Ethical decision points provide a framework for exploring the relationship between what values you hold as individuals—and as members of a common enterprise—and aligning those values with the actions you take in building and managing products and services utilizing big data technologies.

1. Inquiry: what are are my organization's values? What are my values?
2. Analysis: what are the practice standards in my industry?
3. Articulation: where are there alignments and gaps between our values and proposed data practices?
4. Action: what must we do to close value alignment gaps?

Inquiry: what are my values?

Target has a values page on their website, but more relevant to this discussion is the privacy policy.

At Target, we want you to know how we collect, use, share and protect information about you. By interacting with Target, you consent to use of information that is collected or submitted as described in this privacy policy. We may change or add to this privacy policy, so we encourage you to review it periodically. To help you track the changes, we include a history of changes below.

Analysis: what are the standards of my industry?

In marketing a great place to start is the Digital Marketing Association's Ethics and Compliance Program and Guidelines for ethical business practice the latter of which covers privacy and marketing to children, something Target did whether they knew it or not, in depth.

Articulation: are there value gaps?

If we send someone a catalog and say, ‘Congratulations on your first child!’ and they’ve never told us they’re pregnant, that’s going to make some people uncomfortable,” Pole told me. “We are very conservative about compliance with all privacy laws. But even if you’re following the law, you can do things where people get queasy.

The girl was high school age, a fact that, presumably, Target knew. Why then did they choose to proceed with the mailer?

Action: how do we close the value gaps?

With the pregnancy products, though, we learned that some women react badly. Then we started mixing in all these ads for things we knew pregnant women would never buy, so the baby ads looked random. We’d put an ad for a lawn mower next to diapers. We’d put a coupon for wineglasses next to infant clothes. That way, it looked like all the products were chosen by chance.

... and then of course they say this dehumanizing thing.

And we found out that as long as a pregnant woman thinks she hasn’t been spied on, she’ll use the coupons. She just assumes that everyone else on her block got the same mailer for diapers and cribs. As long as we don’t spook her, it works.

Ethical behavior requires constant diligence

The ethical consequences of data mining isn't a one-time consideration. Everytime we sit down to conduct and analysis we should think about the potential for harm. We'll return to the discussion of ethics repeatedly throughout the course.

R for statistical computing

R is a leading tool in the industry for statistics and data science.

• is free and open-source
• has an active community, which means you've got tons of analysis options.
• is in demand!

Installation

1. Install R
2. Install R Studio

Introductory Resources

1. Data Mining with R: Learning with Case Studies
2. Ethics of Big Data: Balancing Risk and Innovation
3. Coursera course on R
4. Exploratory Data Analysis from Coursera
5. Exploratory Data Analysis from Udacity
6. Mining of Massive Datasets
7. The Elements of Statistical Learning. Trevor Hastie, Robert Tibshirani and Jerome Friedman. This is a classic text available from their website
8. Hands-On Data Sciece with R
9. Data mining resources

What makes a successful station?

Your first project is to help a hypothetical bike-share program expand operations by identifying characteristics of successful bike stations. You'll do that by predicting the expected number of rentals per station as a function of the surrounding landscape.

Linear Regression is a technique for estimating numerical outcomes such as number of rentals per day or the ride time in minutes.

What is linear regression?

Linear regression is a method to describe an outcome variable in terms of a linear combination of predictor variables. Regression with multiple predictors allows the us to answer the question "what is the best predictor of the outcome of interest?"

In data mining, linear regression is a technique that enables us to predict the value of a continuous variable, like ride duration or number of rentals, in terms of the values of other variables, like the number of crosswalks or museums nearby.

Why linear regression?

Regression approaches are great to have in your toolbelt due to their ease of

1. interpretability
2. implementation
3. flexibility
4. popularity

Interpretability

Regression coefficients have an explanatory interpretation. In the case of a regression with a single predictor, the coefficients are interpreted as the average change in the outcome for a unit change in an explanatory variable. Say for example we had the following regression equation:

$duration = 15 + 30 * museums$

The interpretation is that the average ride duration is 15 minutes when there are no nearby museums ($museums = 0$), and the average duration increases by 30 minutes for each museum within a quarter mile radius.

Ease of implementation

Most programming languages have out-of-the-box implementations of linear regression, so they can be applied across many applications.

Powerful

In prediction settings, linear models can outperform more complex nonlinear techniques, particularly when working with small training sets or sparse data. This is because regression is an averaging technique, making it less sensitive to missingness and extreme values.

Flexible

Even in situations where linearity assumptions fail, regression can be applied on transformations of variables such as logarithms.

Mathematical foundations

Let's walk through the foundations of regression with just one explanatory variable/covariate. Regression with just one covariate is called simple linear regression (SLR). SLR has a simpler framework than multiple regression and is fundamental to understanding more complex models.

Mathematical foundations

SLR describes the relationship between an outcome and a single predictor variable. The regession equation expresses how the average of the outcome variable changes over the range of the predictor variable.

e.g. How does the expected number of rentals change as the number of nearby crosswalks changes?

Mathematical foundations

Given $n$ observations of $x$ and $y$,

$(x_1, y_1), (x_2, y_2), \ldots , (x_n, y_n)$

the model for a simple linear regression is expressed as

$y_i = \beta_{0} + \beta_{1}x_i + \epsilon_i$

Hey, kinda looks like $y = mx + b$ where $\beta_0$ is the intercept and $\beta_1$ is the slope of the line.

Which line is the best?

How do we choose $\beta_0$ and $\beta_1$? With visual inspection alone, there seems to be several lines that could be drawn through the center of the points. We need an algorithm for standardizing and automating the process of finding the best one.

Errors

Data points will not fall exactly on any straight line we choose, there will always be some error. The distance of a point from the regression line is called the residual error, $\epsilon_i$. The errors are assumed to be independent and normally distributed with mean 0 and standard deviation $\sigma$.

Ordinary least squares ftw

The method of least squares selects the line that minimizes the sum of the squares of the errors. $\epsilon_i$ is the error of the prediction line for each observation.

$\epsilon_i = (y_i - \beta_{0} - \beta_{1}x_i)$

$\epsilon_i^2 = (y_i - \beta_{0} - \beta_{1}x_i)^2$

Choose $\beta_0$ and $\beta_1$ so that $\sum\limits_{i=1}^n\epsilon_i^2$ is minimized.

Intuition

Each $y_i$ is an observed value of the outcome variable. $\beta_{0} + \beta_{1}x_i$ is the estimated, or expected, $y_i$ value. So when we minimize the sum of squared errors we're making $(observed - expected)^2$ as close to 0 as possible.

Assumptions

To safely apply linear regression without fear of weird results, some assumptions must be met:

1. Linearity between the outcome and coefficients ($\beta$'s)
2. Independence of the errors (no autocorrelation)
3. Constant variance (homoscedasticity) of the errors
4. Normality of errors

We might violate some assumptions and it's application-dependent whether we should be concerned about those violations.

Diagnostic plots

In R, if you plot a linear model object, it'll give you a nice collection of diagnostic plots for assessing model assumptions.

model = lm(rentals ~ crossing, data = data)
par(mfrow = c(2, 2)) # make 4 plots in 2 rows, 2 cols
plot(model)

Interpretation

Regression equations have a nice interpretation. The intercept $\beta_0$ is the mean value of $y$ when $x = 0$. The slope of the regression line, $\beta_{1}$, is the change in the mean of y for each unit change in x.

R regression output

model = lm(rentals ~ crossing, data = data)
summary(model)

Call:
lm(formula = rentals ~ crossing, data = data)

Residuals:
Min      1Q  Median      3Q     Max
-51.815 -21.176  -8.394  13.765 126.969

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.53163    2.59354  10.615  < 2e-16 ***
crossing     0.49159    0.07031   6.992 4.92e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 29.54 on 183 degrees of freedom
Multiple R-squared:  0.2108,	Adjusted R-squared:  0.2065
F-statistic: 48.88 on 1 and 183 DF,  p-value: 4.92e-11


Assessing model quality

We've discussed how regression is a tool for numerical prediction. When we use a model to make predictions, how do we know if the model is any good? To assess model quality, we quantify its predictive accuracy.

Hold-out sets

We measure prediction quality by simulating new data coming in. We do that by dividing the original data into a training set and a testing set. The training set is used to fit and tune the predictive model. Then, the testing set is used to validate the predictions. Generally construction of the train and test sets should be done completely at random.

Quick heuristics for study design

large sample size
1. 60% train
2. 20% test
3. 20% validation
medium sample size
1. 60% train
2. 40% test
small sample size
1. Do cross validation

How to interpret predictions?

Model predictions represent the probability that $y = 1$ given the data $x$.

e.g. The probability of a message being spam, given the number of $\$'s$. If I get a value of 0.70 from my model, then$p(\text{message is spam}\mid 10 \text{\$'s}) = 0.70$.

How do we decide the class?

The decision boundary separates the region where the model predicts $y = 1$ from the region where the model predicts $y = 0$.

• If $y \geq 0.5$, then the message is spam
• If $y < 0.5$, then the message is not spam

How do we decide the class?

$p(\text{spam}) = -0.80 + 0.10*\text{number of \$s}$predict y = 1, if:$-0.80 + 0.10x \geq 00.10x \geq 0.80x \geq 8$The line at$x = 8$is called the decision boundary. Bot or Not? Like many Internet giants Twitter makes money by selling ads, but they’ve got an insidious infestation eroding their advertising credibility: bots. More than 23 million of them. Twitter bots are automatons living in the Twittersphere and ranging wildly in capability. At their most complex they use speech patterns that can, at times, fool humans. When advertisers pay for engagement, they aren’t interested in a four-hour flame war between a gamergate bot and a Kanye bot. When advertisers analyze social data they want to be sure the numbers are the result of human activity. Bot or Not? For project 2, we'll use Twitter user data to build a classifier for identifying bots. Not a bot... Spot a bot! Logistic Regression Resources Classification with Trees BI Tech CP303 - Data Mining R Tutorial Classification with trees Tree-based methods are a conceptually simple and powerful method for classification (and regression). They work well in non-linear settings where regression can underperform. Classification with trees Besides working well in non-linear settings, an advantage of decision trees is their interpretability. For example in a clinical setting, they mirror the way that a clinician might think when assessing patient disease risk. The construction of trees is iterative Decisions trees are created by interatively splitting predictors to create increasingly homogenous groups. The basic algorithm can be summarized by the following steps: Start with all predictors and repeat: 1. Find the variable and split that best "separates" the outcome, i.e. creates homogenous groups 2. Divide the data into two groups on that split 3. Stop when the groups are too small or sufficiently homogenous How to determine the splitting point? The first step in our algorithm is to determine the predictor that best splits the data into homogenous groups (sometimes refered to as node purity). There are many measures of node purity, but two common ones we'll see used in caret: • Misclassification • Gini Purity Misclassification purity This measure of node purity estimates the probability$misclassification = 1 - p(y = 1), \text{if y = 1 is the reference class.}$A misclassification error of 0 implies perfect purity and 0.5 implies an even mixture of both classes (i.e. bot and not) and no 'purity.' Gini Impurity Gini impurity is a measure of how often an observation would be incorrectly classified if it was randomly labeled according to the distribution of classes in the split. Gini impurity is computed by summing the probability of each item being chosen times the probability of a classification mistake, i.e.$1 - prob(\text{random guess is correct})$.$\text{Gini Impurity} = \sum_{i=1}^{k} p_i(1-p_i) = 1 - \sum_{i=1}^{k} p_i^2$where$k$is the class (e.g. 'bot', 'not') A gini impurity of 0 indicates all observations in the node fall into a single category. Example splits Suppose we had two candidate splits: Controlling tree complexity Tree methods can suffer from overfitting, which occurs when our model is too tuned to the training data. When models are overfit they generalize poorly and exhibit high variance and poor prediction accuracy on new data. Pruning When we were doing linear regression, we started by fitting the full model, which sometimes had the lowest RMSE, but was also prohibitively complex. Tree methods can suffer a similar weakness in that resulting trees tend to be large and too complex. Just like regression, a smaller tree with fewer splits often leads to lower variance, easier interpretation and lower test errors, at the cost of a little bias. Complexity penalty,$Cp$The complexity penality ranges from 0 to 1 and penalizes the number of terminal nodes in the tree. A high value of$Cp$corresponds to a decreased likelihood of a branch split and lower$Cp$penalizes less dramatically. In caret, the train() function will identify an optimal$Cp$for us. Reducing variance Single trees are simple to compute and pretty easy to understand, but they are susceptible to high variance when tested on new data. Methods such as boosting, bagging and random forests have been developed to overcome this problem. Boosting Boosting works by combining the output of many potentially weak classifiers to produce a powerful consensus tree. A weak classifier is one whose error rate is only slightly better than random guessing. Let's use this toy example to walk-through the process. AdaBoost Short for adaptive boosting, AdaBoost is one of the most popular boosting algorithms. AdaBoost is adaptive in the sense that subsequent weak learners are boosted in favor of the cases previously misclassified. Bootstrap aggregating (bagging) Bagging procedures work by randomly resampling data and refitting the model multiple times. The final model is determined by either averaging or taking a majority vote to form a single model. The goal of bagging is to reduce variance by averaging many models. Random Forest Random forest is a special type of bagging algorithm that resamples both the variables and the data used to fit the model. It's one of the best performing classifiers. Tree Resources 1. Tree-based Methods. A great tutorial from Ryan Kelly. 2. Decision Tree Learning 3. Johns Hopkins Data Mining course notes. Great intro to decision trees. 4. Boosting Tutorial by Ron Meir. 5. Classification Trees overview from Statsoft Association rule mining BI Tech CP303 - Data Mining R Tutorial Farewell to supervised learning At this point you've built models to predict numerical and binary outcomes. Your models were based on a training set and predictive accuracy was assessed with a testing set. This class of problems is called supervised learning because we observe the true value of the outcome, and those observations serve as a "teacher" for that model, providing it the correct answer and allowing us to compute measures of accuracy. e.g. RMSE and accuracy measurements. Unsupervised Learning In unsupervised learning applications, the goal is to find hidden structure in data when we do not observe an outcome. Since we have no observed outcome to compare our predictions with, we cannot compute an error to measure the predictive accuracy of our models. This distinguishes unsupervised from supervised learning. Association Rule Mining Association rule mining is a powerful and popular tool for discovering patterns between variables in large transactional databases. The goal is to find recurring associations. Market basket analysis In a retail context, association rule mining is often referred to as "market basket" analysis. In this context the observations are sales transactions, such as those occurring at the checkout counter of a store. The resulting association rules can be very valuable for use in cross-marketing in promotions, catalog design, consumer segmentation based and product recommendation. Formalizing market basket analysis$I = \{item_1, item_2, \dots, item_n\}$is a collection of binary attributes called items.$D = \{t_1, t_2, \dots, t_m\}$is a set of uniquely labeled transactions in the dataset. Rules take the form$X \Rightarrow Y$where$X, Y \subseteq I$Grocery store example The pool of items is$I = \{milk, bread, butter, beer\}$The transaction dataset: Transaction IdItemsets 1milk, bread 2bread, butter 3beer 4 milk, bread, butter 5 bread, butter Rules$X \Rightarrow Y$The itemsets on the left-hand-side (LHS) are called the antecedent and the itemsets on the right-hand-side (RHS) are called the consequent of the rule. Filtering rules To identify interesting rules from all possible combinations, we need to construct a measure of significance. The most common constraints are imposed with the support, confidence, and lift. Association rules are rules which surpass a user-specified minimum support and minimum confidence threshold. Support The support of an itemset is the proportion of transactions in the data which contain the itemset. In the grocery example, the itemset$\{milk, bread\}$occurs in 2 of 5 transactions, so it has a support of 2/5 = 0.4 = 40%. What is the support for$\{bread, butter\}$? Confidence The confidence of a rule,$X \Rightarrow Y$, answers the question "for what proportion of the transactions does this rule apply?", and is an estimate of the conditional probability$p(Y|X)$.$confidence(X \Rightarrow Y) = \frac{\text{proportion of transactions containing the itemsets in X and Y}}{\text{Proportion of transactions containg the itemset X}}$Confidence The rule$\{milk, bread\} \Rightarrow \{butter\}$has a confidence of 0.2/0.4 = 0.5, meaning that for 50% of the transactions containing$\{milk, bread\}$the rule holds. Constraining rules When association rules are constructed, they must satisfy both minimal support and confidence thresholds, however even with these requirements we're often left with a long list of candidate rules. We can further reduce the pool of rules using the lift metric. Lift$lift(A \Rightarrow B) = \frac{support(A \cup B)}{support(A) \times support(B)}$Lift is the deviation in the support of the whole rule from the support expected under independence given the supports of the LHS and the RHS. Larger lift indicates stronger associations. Interpreting lift A lift of 1 indicates that the association between the LHS and RHS is exactly what we'd expect if there is no relationship between the itemsets on either side, i.e. the itemsets are independent. Imagine you're at the grocery story buying your bread, butter and milk, and you remember that you need vacuum cleaner bags. Lift values greater than 1 indicate that there's some depedence between the itemsets. For example chips and salsa, or beer and pretzels. The apriori algorithm The apriori algorithm is an iterative, breadth-first approach to finding frequent itemsets and deriving association rules from them. College Scorecard Data In project 3 we'll use a dataset released from the White House last year called the College Scorecard. The purpose of the scorecard is to give students and parents more information for making college decisions and includes data about colleges (e.g. size, composition of student body, common majors) and data about labor market outcomes (e.g. median earnings after graduation, unemployment rates, and debt). Association Rules Resources Clustering BI Tech CP303 - Data Mining R Tutorial Clustering The goal of cluster analysis (or segmentation) is to group observations into subsets (clusters) such that observations within a cluster are more closely related to one another than observations in different clusters. Clustering is accomplished by grouping observations according to a distance metric. Discovering categories Often cluster analysis is used to identify distinct subgroups in data. Retailers often use shopper "personas" to represent patterns of shopping, e.g. Spendy Sally and Cheap Chad. Clustering techniques could be applied in that context to identify categories of shoppers. Defining similarity is the key The essential component of all clustering techniqiues is a measure of similarity (or dissimilarity) between observations. Once the measure of similarity is defined, the clustering method groups the observations based on the measure of similarity. $k$-means Clustering One of the most popular clustering algorithms, the goal of$k$-means clustering is to group$n$observations into$k$clusters.$k$-means has essentially two steps: 1. Cluster assignment step - assign all observations to one of the centroids based on which is closest. 2. Centroid shift step - move the centroids to the average value of the points within the cluster. 3. Stop when the centroids stop moving. $k$-means Algorithm Selecting$k$The number of centroids,$k$, is a required input for the$k$-means algorithm. The choice of the size of$k$depends on the goal. For segmentation applications,$k$is typically defined by the context of the problem. For example, a salesteam may have$k$members, and the goal is to assign each salesperson a set of leads such that the customers assigned to each are similar as possible. But often$k$is unknown. Check out this cool video of$k$-means simulations with varying$k$s and$n$s. Hierarchical Clustering The results of$k$-means depend on the value of$k\$ and starting position of the centroids, but another clustering method, hierarchical clustering, does not require us to specificy the number of clusters. Instead we specify a distance metric between observations.

This method produces hierarchical representations in which the clusters at each level of the hierarchy are created by merging clustering at the next lower level. At the lowest level each cluster contains a single observation. At the highest level there is only one cluster containing all the data.

Methods of hierarchical clustering

There are two main types of hierarchical clustering:

1. Agglomerative - a 'bottom up' approach in which each observation is initially its own cluster and pairs of observations are iteratively combined.
2. Divisive - a 'top down' approach in which all observations start in a single cluster and recursively split.

Agglomerative clustering

To construct the hierarchy, we'll repeat the following steps until there's only one cluster left:

1. Identify the closest set of clusters
2. Merge them into one cluster
3. Remove their unclustered nodes

Distance

To decide which clusters to combine or divide, a distance metric (also called a dissimilarity) between observations is required. How can we measure the distance between two observations?

Euclidean Distance

You already know one distance metric: Euclidean distance. Distances are often referred to as similarities or dissimilarities.

Dendrograms

A common tool for representing hierarchical clusters is the dendrogram. The distance between observations is represented on a dendrogram by the length of the edge.

Clustering Resources

1. Nice presentation by Pham Cuong about clustering techniques for recommendations.
2. Clustering and Data Mining in R
3. Data Clustering with R, from RDataMining.com.

Sharing is caring!

You've come a long way, baby

What you've learned
1. Prediction of a continuous outcome
• Linear regression
• Ridge regression
• Lasso regression
2. Prediction of a binary outcome
• Logistic regression
• Decision trees
• Bagging
• Boosting
• Random Forest
3. Unsupervised learning
• Association rule mining
• K-means clustering
• Hierarchical clustering

GREAT JOB!!!

You've learned a ton about data mining, problem solving and writing technical reports. To make the most of your efforts, let's create digital portfolios of work with RPubs and Markdown.

Markdown

Markdown is a text-to-HTML conversion tool. It allows you to write your text using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid HTML. Markdown is

1. a plain text formatting syntax and
2. a software tool, written in Perl, that converts the plain text formatting to HTML

An analysis and communication tool combined

Markdown gives us a powerful communication tool that can be embedded with R code so that we can document analyses as we go. This is a powerful workflow because it keeps communication and clarity of thought top of mind. Ultimately our goal is to report and communicate findings, why not do that with a single tool?

Reproducibility

Tools like Excel can be great for doing quick explorations, but as your analyses become complex, your work becomes increasingly unreproducable. That lack of reproducibility isn't just a pain for the consumers of your analyses, but it's a huge pain for you.

"Prior to shifting to R, my basic approach was to enter data in Excel, import the data into Systat, and then use the command line to do a bunch of manipulations in there. I generally viewed those as one-off manipulations (e.g., calculating log density), and, while I could have saved a command file for those things, I didn’t. This meant that, if I discovered an error in the original Excel file, I needed to redo all that manually." Meghan Duffy

Interactivity

Another advantage of using a combination of R and Markdown as both your analysis and communication tool is that it enables interactivity, which can help engage your audience and allow them to explore the data themselves. As an example, check out this benefits calculator that a co-worker and I built to help companies understand the pros and cons of offering paid leave to employees.

R Markdown

R Markdown is a custom flavor of Markdown and an authoring format that enables easy creation of dynamic documents, presentations, and reports from R. It combines the core syntax of Markdown with embedded,runnable R code. R Markdown documents are fully reproducible, so that they can be automatically regenerated whenever underlying R code or data changes.

KnitR

The knitr package was designed to be a transparent engine for dynamic report generation with R. KnitR is the library that actually compiles the Markdown.

"When I was first starting out, I’d create a bunch of figures and tables and email them to my collaborator with a description of the findings in the body of the email. That was cumbersome for me and for the collaborator. ('Which figure are we talking about, again?')" Karl Broman

A better way to communicate

"...Now, I deliver my informal reports to collaborators as html documents that can be viewed in a browser. A big advantage to this is that I don’t have to worry about page breaks. For example, I can have very tall figures, with say 30 panels. That makes it easy to show the results in detail, and you don’t have to worry about how to get figures to fit nicely into a page." Karl Broman

Make a new R Markdown doc

Once you select a new Markdown document you can choose from documents, presentations, Shiny application or from a template.

Denoting code

Backticks  and {r chunk_name, options}. If you want to put code snippets in line with your text just wrap them in the ticks. To make bigger code chunks, then do this:

{r linear regression}
require(ggplot2)
data(diamonds)
model = lm(carat ~ depth, data = diamonds)


Output your report to a file

You can save as HTML, PDF or a Word Document.