Lightning Talk: Board Games + R

Board Game Geek

  • Website for board game geeks
  • Getting data out of the site

My commentary

In [1]:
# the beginnings of me doing a webscraping tutorial
# https://rpubs.com/Radcliffe/superbowl
library(rvest)
library(stringr)
library(tidyr)
Loading required package: xml2
In [2]:
url <- 'http://espn.go.com/nfl/superbowl/history/winners'
webpage <- read_html(url)
In [3]:
sb_table <- html_nodes(webpage, 'table')
sb <- html_table(sb_table)[[1]]
head(sb)
  X1   X2              X3                             X4
  Super Bowl Winners and Results (title row, repeated across all four columns)
  NO.  DATE            SITE                           RESULT
  I    Jan. 15, 1967   Los Angeles Memorial Coliseum  Green Bay 35, Kansas City 10
  II   Jan. 14, 1968   Orange Bowl (Miami)            Green Bay 33, Oakland 14
  III  Jan. 12, 1969   Orange Bowl (Miami)            New York Jets 16, Baltimore 7
  IV   Jan. 11, 1970   Tulane Stadium (New Orleans)   Kansas City 23, Minnesota 7
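A plausible next step (not shown in the original tutorial) is tidying the scraped table: the site's title and header rows come through as data, and the result column bundles both teams and scores. The snippet below sketches that cleanup with `tidyr::separate`; the real `sb` comes from `html_table()`, so a small stand-in data frame is used here to keep the example self-contained.

```r
library(tidyr)

# Stand-in for the scraped `sb` (after dropping the title/header rows):
sb <- data.frame(
  X1 = c("I", "II"),
  X2 = c("Jan. 15, 1967", "Jan. 14, 1968"),
  X3 = c("Los Angeles Memorial Coliseum", "Orange Bowl (Miami)"),
  X4 = c("Green Bay 35, Kansas City 10", "Green Bay 33, Oakland 14"),
  stringsAsFactors = FALSE
)

# Replace R's default X1..X4 names with descriptive ones
names(sb) <- c("number", "date", "site", "result")

# Split "Green Bay 35, Kansas City 10" into team and score columns;
# the lookahead " (?=[0-9]+$)" splits at the space before the trailing score
sb <- separate(sb, result, c("winner", "loser"), sep = ", ")
sb <- separate(sb, winner, c("winner", "winner_pts"), sep = " (?=[0-9]+$)")
sb <- separate(sb, loser,  c("loser",  "loser_pts"),  sep = " (?=[0-9]+$)")
```

With the full scraped table, the same two `separate` calls would give one column per team and per score, ready for grouping or plotting.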

PCA & MDS

  • Choosing variables
  • Signal/noise problem
  • Data Visualization is hard
  • More predictors, more chance they are correlated
  • Clustering is hard to visualize in more than 2 or 3 dimensions

What is dimension reduction?

  • Map from higher dimensions to 2D
  • Transform higher dimensional space to a new low dimensional space
  • New space: linear/nonlinear transformation of original data
  • Visualization/analysis can be performed on new space (transformed data)

PCA: Principal Component Analysis

  • Reduce p features to a smaller number of components
  • Find a hyperplane (low-dimensional linear subspace) that captures most of the variation
  • The optimal linear dimension reduction method: no other linear projection of the same dimension captures more variance

PCA Assumptions

  • Linearity: assumes data to be linear combinations of variables
  • Mean & Covariance: No guarantee that directions of maximum variance will contain good features for discrimination
  • Large variances = important: Assumes larger variance = interesting and lower variance = noise

Choosing Assumptions

  • Orthogonality: principal components are orthogonal to each other
  • "Explaining" X amount of variance
  • "Component Loadings"
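The points above map directly onto `prcomp()`. A minimal sketch using the built-in USArrests data (an illustrative choice, not from the talk); scaling matters because PCA's "large variance = important" assumption is unit-dependent:

```r
# PCA on USArrests, standardizing variables first
pca <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance each component "explains"
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Component loadings: each PC as a linear combination of the original variables
pca$rotation

# summary(pca) reports the same cumulative-variance breakdown
summary(pca)
```

Choosing how many components to keep usually comes down to the cumulative `var_explained`, e.g. retaining enough PCs to explain 80–90% of the variance.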

Batch effects on RNA sequencing data

MDS: Multidimensional Scaling

  • Visually represent proximities between predictors
  • Input is matrix of distances
  • Goal: find projections that preserve original distances in input matrix in lower dimensional space
  • Distances are preserved by optimizing a stress function
  • Non-linear
  • More on MDS

MDS in R

  • Options for calculating distance vary; Euclidean is the default
  • Classical/metric MDS: the dist() and cmdscale() functions in base R
    • mostly in terms of Euclidean distance
  • Non-metric MDS: isoMDS() and sammon() in the MASS package
    • minimize stress functions
  • Use cmdscale for continuous data
    • Use non-metric MDS for categorical/ordinal data
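A short sketch of both routes, again on USArrests (an illustrative dataset, not from the talk). The input to either function is a distance matrix, not the raw data:

```r
library(MASS)

# dist() defaults to Euclidean distance; scale first so no variable dominates
d <- dist(scale(USArrests))

# Classical (metric) MDS: project into k = 2 dimensions
fit <- cmdscale(d, k = 2)

# Non-metric MDS: isoMDS() optimizes a stress function instead
nm <- isoMDS(d, k = 2, trace = FALSE)
nm$stress   # stress: lower means the original distances are better preserved
```

Plotting the two columns of `fit` (or `nm$points`) gives the 2D map; comparing the maps and the stress value is a quick check on how much structure the reduction kept.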

Strengths of MDS

  • Works on dissimilarities
  • Good for proximities
  • Can start with distance, not raw data
  • Does not assume anything about nature of data

Weaknesses of MDS

  • Gives arbitrary maps
  • Slow
  • Numerical optimization
  • Not good with high dimensional settings
  • Picking the right stress function

Discussion

  • Output number of dimensions for MDS depends on what the data will be used for
  • PCA goal = trying to get at what components explain maximal variance
  • MDS goal = trying to break multiple dimensions down into a manageable number while preserving the original distances