- BoardGameGeek (BGG): a website for board game geeks
- Getting data out of the site
- rvest for simple scraping and for accessing the BGG XML API2
- SelectorGadget
- dplyr

In [1]:

```
# Start of a web scraping tutorial, following
# https://rpubs.com/Radcliffe/superbowl
library(rvest)
library(stringr)
library(tidyr)
```

In [2]:

```
url <- 'http://espn.go.com/nfl/superbowl/history/winners'
webpage <- read_html(url)
```

In [3]:

```
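# Select every <table> element on the page
sb_table <- html_nodes(webpage, 'table')
# Parse the tables into data frames and keep the first one
sb <- html_table(sb_table)[[1]]
head(sb)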
```
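SelectorGadget returns a CSS selector, which can be passed to rvest in the same way as `'table'` above. The sketch below uses a small hand-written HTML fragment in place of a real page, so the class names (`game`, `name`, `rating`) and the values are hypothetical placeholders, not the markup of any actual site.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a downloaded page
page <- minimal_html('
  <ul>
    <li class="game"><span class="name">Game A</span> <span class="rating">7.5</span></li>
    <li class="game"><span class="name">Game B</span> <span class="rating">8.1</span></li>
  </ul>')

# CSS selectors of the kind SelectorGadget suggests
game_names <- html_text2(html_elements(page, ".game .name"))
ratings <- as.numeric(html_text2(html_elements(page, ".game .rating")))

# Collect the scraped fields into a data frame
games <- data.frame(name = game_names, rating = ratings)
games
```

On a real page, `minimal_html()` would be replaced by `read_html(url)` and the selectors by whatever SelectorGadget reports for the elements of interest.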

- Choosing variables
- Signal/noise problem
- Data Visualization is hard
- The more predictors there are, the more likely some of them are correlated
- Clustering is hard in more than two or three dimensions

- Map from higher dimensions to 2D
- Transform higher dimensional space to a new low dimensional space
- New space: linear/nonlinear transformation of original data
- Visualization/analysis can be performed on new space (transformed data)

- PCA: reduce p features to a smaller number of components
- Find a hyperplane that captures most of the variation
- Best linear dimension reduction method in the least-squares sense
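As a minimal sketch of the idea, base R's `prcomp` can project a dataset onto its first two principal components. The built-in `USArrests` data is used here only as an assumed example; any numeric data frame would do.

```r
# PCA on the built-in USArrests data (an assumed example dataset).
# Variables are centered and scaled so that no single variable dominates.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Scores: each observation projected onto the first two components
scores <- pca$x[, 1:2]
head(scores)

# Plot the 2D projection of the original 4-dimensional data
plot(scores, xlab = "PC1", ylab = "PC2",
     main = "USArrests projected onto PC1 and PC2")
```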

- Linearity: assumes the data are linear combinations of the variables
- Mean & covariance: no guarantee that the directions of maximum variance contain good features for discrimination
- Large variances = important: assumes directions with larger variance are interesting and those with lower variance are noise

- Orthogonality: the principal components are orthogonal to each other
- "Explaining" X amount of variance
- "Component Loadings"
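These three ideas can be inspected directly on a `prcomp` result. The sketch below again assumes the built-in `USArrests` data; note that `prcomp` stores the loadings under the name `rotation`.

```r
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of variance "explained" by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Cumulative proportion, useful for deciding how many components to keep
round(cumsum(var_explained), 3)

# Component loadings: the weight of each original variable in each component
pca$rotation

# Orthogonality check: the cross-product of the loadings is the identity matrix
round(crossprod(pca$rotation), 10)
```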

- Visually represent proximities between predictors
- Input is matrix of distances
- Goal: find a lower-dimensional projection that preserves the original distances in the input matrix
- Distances are preserved by minimizing a stress function
- Non-linear
- More on MDS

- Options for calculating distance vary; Euclidean is the default
- Functions: dist and cmdscale (both in the stats package)
- Mostly framed in terms of Euclidean distance
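A minimal sketch of classical MDS with these two functions, assuming the built-in `USArrests` data as the example input:

```r
# Distance matrix between observations; "euclidean" is the default method
d <- dist(scale(USArrests), method = "euclidean")

# Classical (metric) MDS: find a 2D configuration whose distances
# approximate the original ones
mds <- cmdscale(d, k = 2)
head(mds)

# Plot the configuration, labelling each point with its row name
plot(mds, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(mds, labels = rownames(mds), cex = 0.7)
```

Other distance measures can be tried by changing `method`, for example `"manhattan"`.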

- Non-metric (isoMDS, sammon in the MASS package)
- Stress functions
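Both MASS functions take a distance matrix and report the final value of their stress function. The sketch below assumes the built-in `USArrests` data; `isoMDS` implements Kruskal's non-metric MDS and `sammon` implements Sammon's non-linear mapping, and the two stress values are defined differently, so they are not directly comparable.

```r
library(MASS)

# Distance matrix on scaled data (an assumed example input)
d <- dist(scale(USArrests))

# Kruskal's non-metric MDS; stress is reported in percent
nm <- isoMDS(d, k = 2, trace = FALSE)
nm$stress

# Sammon's non-linear mapping; uses its own stress definition
sm <- sammon(d, k = 2, trace = FALSE)
sm$stress

# Both return the fitted 2D configuration in $points
head(nm$points)
head(sm$points)
```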

- Use cmdscale (classical MDS) for continuous data
- Use non-metric for categorical/ordinal data

- Works on dissimilarities
- Good for proximities
- Can start with distance, not raw data
- Does not assume anything about nature of data

- Gives arbitrary maps
- Slow
- Numerical optimization
- Not good in high-dimensional settings
- Picking the right stress function

- The number of output dimensions for MDS depends on what the data will be used for
- PCA goal = finding the components that explain maximal variance
- MDS goal = breaking down many dimensions into a manageable number
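Although the goals differ, the two methods coincide in one special case: classical MDS on Euclidean distances gives the same coordinates as the PCA scores, up to the sign of each axis. A small sketch that checks this, assuming the built-in `USArrests` data:

```r
x <- scale(USArrests)

# First two principal component scores
pca_scores <- prcomp(x)$x[, 1:2]

# Classical MDS on Euclidean distances between the same observations
mds_coords <- cmdscale(dist(x), k = 2)

# Largest difference between the two after ignoring axis sign;
# this should be numerically zero
max_diff <- max(abs(abs(pca_scores) - abs(mds_coords)))
max_diff
```

With a non-Euclidean distance or a non-metric method such as `isoMDS`, the two configurations generally differ.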
