# An Introduction to Machine Learning in Python¶

PyData 2016 Scikit-Learn tutorial github

## Intro to Machine Learning¶

• Supervised: imitating previous classification
• Regresssion: continuous input predicts continuous output
• Classification: predicting categorical
• Unsupervised: finding/discovery of patterns
• Clustering: finding groups of similar items
• Compression: finding the structure to simplify the analysis
• Reinforcement: learning by doing

## Linear Regression¶

• Explanatory and response variable
• Machine learning used when matrix solving is not feasible
• scikit-learn: numpy for machine learning/algorithms
• X array = features (columns), data points (rows)
• Y array = store the continuous target variable corresponding to the data points

### Workflow¶

1) Split into training/test (70/30)

2) Hyperparameter values (meta-values about the algorithm) + Learning algorithm

* Hyperparemeter = values of the algorithm, not the model- how many iterations, etc.



3) Test the model

* Make predictions using the model
* Compare prediction to actual output labels (model output to true labels)
* Compute performance metrics



4) Re-create the model with all data

* generally, the more data = better algorithm

### Code example of regression¶

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


Source: R.J. Gladstone (1905). "A Study of the Relations of the Brain to to the Size of the Head", Biometrika, Vol. 4, pp105-123

Description: Brain weight (grams) and head size (cubic cm) for 237 adults classified by gender and age group.

Variables/Columns

• Gender (1=Male, 2=Female)
• Age Range (1=20-46, 2=46+)
• Brain weight (grams)
In [4]:
#load in the data with pandas and look at what it seems like
encoding='utf-8',
comment='#',
sep='\s+')
df.tail()

Out[4]:
232 2 2 3214 1110
233 2 2 3394 1215
234 2 2 3233 1104
235 2 2 3352 1170
236 2 2 3391 1120

Notes:

• Head size = continuous variable
• Brain-weight = continuous variable

Can the head size predict the brain-weight?

In [8]:
#look at whether or not it makes sense to build a model
plt.ylabel('Brain weight (grams)');


### Preparing the dataset¶

In [11]:
#creating numpy arrays using the pandas array: y first- what we want to predict
y = df['brain-weight'].values
y.shape

Out[11]:
(237,)
In [14]:
# another numpy array from pandas
# we need to add a second axis to the array- scikit-learn expects it
X = X[:, np.newaxis]
X.shape

Out[14]:
(237, 1)

Random state- the same random state will give you the same split. This is helpful for reproducibility.

In [15]:
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=123)

In [16]:
X_train.shape

Out[16]:
(165, 1)
In [17]:
plt.scatter(X_train, y_train, c='blue', marker='o')
plt.scatter(X_test, y_test, c='red', marker='s')
plt.ylabel('Brain weight (grams)');


### Fitting the model¶

In [24]:
from sklearn.linear_model import LinearRegression

#initialize linear regression object
lr = LinearRegression()
#provide both x and y, because it's supervised learning
lr.fit(X_train, y_train)
#making predictings using the 'predict' method
y_pred = lr.predict(X_test)


### Evaluating the Model¶

We can use r2 score to evaluate the model: done manually

In [22]:
sum_of_squares = ((y_test - y_pred) ** 2).sum()
res_sum_of_squares = ((y_test - y_test.mean()) ** 2).sum()
r2_score = 1 - (sum_of_squares / res_sum_of_squares)
print('R2 score: %.3f' % r2_score)

R2 score: 0.635

In [32]:
#check the the scikit-learn R2 with the one manually computed above
print('R2 score: %.3f' % lr.score(X_test, y_test))

R2 score: 0.635


sklearn.metrics has a number of different ways to evaluate the fit of a model.

In [30]:
lr.coef_, lr.intercept_

Out[30]:
(array([ 0.271117]), 302.03033196088086)

Use y = mx + b to plot the line below

In [28]:
min_pred = X_train.min() * lr.coef_ + lr.intercept_
max_pred = X_train.max() * lr.coef_ + lr.intercept_

plt.scatter(X_train, y_train, c='blue', marker='o')
plt.plot([X_train.min(), X_train.max()],
[min_pred, max_pred],
color='red',
linewidth=4)
plt.ylabel('Brain weight (grams)');


## Understanding Classification¶

If we use train/test split funtion, we may not get a training set with the same proportion of things that are classified. We need to stratify the split, which is a relatively new feature of scikit-learn.

In logistic regression, the activation function is a predicted probability, which is then converted to a unit step function, which then predicts the class label.

Logistic regression is a generalized linear model that gets weight coefficients that we use to make predictions on new data.

K-nearest-neighbors looks up the samples in the neighborhood and makes the classification prediction based on the neareast neighbors.

One thing to keep in mind: all features must be at the same scale.

### The Iris dataset¶

In [5]:
df = pd.read_csv('../../pydata-chicago2016-ml-tutorial/code/dataset_iris.txt',
encoding='utf-8',
comment='#',
sep=',')
df.tail()

Out[5]:
sepal_length sepal_width petal_length petal_width class
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
In [37]:
X = df.iloc[:, :4].values
y = df['class'].values
#these are the unique values within y
np.unique(y)

Out[37]:
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

### Class label encoding¶

In [51]:
from sklearn.preprocessing import LabelEncoder
# we want to convert the labels to integers (string to int)
l_encoder = LabelEncoder()
l_encoder.fit(y)
l_encoder.classes_

Out[51]:
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
In [49]:
y_enc = l_encoder.transform(y)
np.unique(y_enc)

Out[49]:
array([0, 1, 2])
In [50]:
np.unique(l_encoder.inverse_transform(y_enc))

Out[50]:
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

### Scikit-learn's in-build datasets¶

In [53]:
from sklearn.datasets import load_iris

print(iris['DESCR'])

Iris Plants Database

Notes
-----
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:

============== ==== ==== ======= ===== ====================
Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
- Fisher,R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments".  IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...



### Test/train splits¶

In [60]:
X, y = iris.data[:, :2], iris.target
# ! We only use 2 features for visual purposes (sepal length, width)

print('Class labels:', np.unique(y))
print('Class proportions:', np.bincount(y))

Class labels: [0 1 2]
Class proportions: [50 50 50]

In [64]:
#what happens if we just split without stratifying?
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=123)

print('Class labels:', np.unique(y_train))
print('Class proportions:', np.bincount(y_train))

Class labels: [0 1 2]
Class proportions: [32 40 33]

In [66]:
# the stratify=y option makes sure that the data remains in a good distribution
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=123,
stratify=y)

print('Class labels:', np.unique(y_train))
print('Class proportions:', np.bincount(y_train))

Class labels: [0 1 2]
Class proportions: [35 35 35]


### Logistic Regression¶

If you use the default setting, it will train three binary logistic regression classifiers, each one comparing 'one-vs-rest'. We use the softmax function to do a multinomial classification. If the classes are independent, it's reasonable to use softmax.

In [68]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(solver='newton-cg',
multi_class='multinomial',
random_state=1)

lr.fit(X_train, y_train)
print('Test accuracy %.2f' % lr.score(X_test, y_test))

Test accuracy 0.80

In [71]:
from mlxtend.evaluate import plot_decision_regions

plot_decision_regions

plot_decision_regions(X=X, y=y, clf=lr, X_highlight=X_test)
plt.xlabel('sepal length [cm]')
plt.xlabel('sepal width [cm]');


### K-Nearest Neighbors¶

In [81]:
from sklearn.neighbors import KNeighborsClassifier

#how do you pick the number of neighbors?
#how to break a tie
kn = KNeighborsClassifier(n_neighbors=4)

kn.fit(X_train, y_train)
print('Test accuracy %.2f' % kn.score(X_test, y_test))

Test accuracy 0.82

In [80]:
plot_decision_regions(X=X, y=y, clf=kn, X_highlight=X_test)
plt.xlabel('sepal length [cm]')
plt.xlabel('sepal width [cm]');


## Feature Preprocessing & scikit-learn Pipelines¶

### Categorical features: nominal vs ordinal¶

Nominal variables have no intrinsic ordering, so we can binarize them. Ordinal variables can be coded with numbers.

Types of feature normalization: minmax just rescales from 0-1, and z-score centers things around the mean. Z-score standardization for simple optimization like gradient descent: weights are updated more equally. Z-score can optimize in both directions.

Re-scaling is done to to make sure that the coefficients the model puts out are representative of how important input variables are, rather than their magnitudes.

In [83]:
import pandas as pd

df = pd.DataFrame([
['green', 'M', 10.0],
['red', 'L', 13.5],
['blue', 'XL', 15.3]])

df.columns = ['color', 'size', 'prize']
df

Out[83]:
color size prize
0 green M 10.0
1 red L 13.5
2 blue XL 15.3

Note to self: Look up what DictVectorizer does (QQQ)

In [84]:
from sklearn.feature_extraction import DictVectorizer

dvec = DictVectorizer(sparse=False)

X = dvec.fit_transform(df.transpose().to_dict().values())
X

Out[84]:
array([[  0. ,   1. ,   0. ,  10. ,   0. ,   1. ,   0. ],
[  0. ,   0. ,   1. ,  13.5,   1. ,   0. ,   0. ],
[  1. ,   0. ,   0. ,  15.3,   0. ,   0. ,   1. ]])
In [85]:
size_mapping = {
'XL': 3,
'L': 2,
'M': 1}

df['size'] = df['size'].map(size_mapping)
df

Out[85]:
color size prize
0 green 1 10.0
1 red 2 13.5
2 blue 3 15.3
In [86]:
X = dvec.fit_transform(df.transpose().to_dict().values())
X

Out[86]:
array([[  0. ,   1. ,   0. ,  10. ,   1. ],
[  0. ,   0. ,   1. ,  13.5,   2. ],
[  1. ,   0. ,   0. ,  15.3,   3. ]])

### Normalization¶

In [87]:
df = pd.DataFrame([1., 2., 3., 4., 5., 6.], columns=['feature'])
df

Out[87]:
feature
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
In [88]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

mmxsc = MinMaxScaler()
stdsc = StandardScaler()

X = df['feature'].values[:, np.newaxis]

df['minmax'] = mmxsc.fit_transform(X)
df['z-score'] = stdsc.fit_transform(X)

df

Out[88]:
feature minmax z-score
0 1.0 0.0 -1.46385
1 2.0 0.2 -0.87831
2 3.0 0.4 -0.29277
3 4.0 0.6 0.29277
4 5.0 0.8 0.87831
5 6.0 1.0 1.46385

### Pipeline¶

In [89]:
from sklearn.pipeline import make_pipeline
from sklearn.cross_validation import train_test_split

X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=123,
stratify=y)

lr = LogisticRegression(solver='newton-cg',
multi_class='multinomial',
random_state=1)

lr_pipe = make_pipeline(StandardScaler(), lr)

lr_pipe.fit(X_train, y_train)
lr_pipe.score(X_test, y_test)

Out[89]:
0.9555555555555556
In [90]:
lr_pipe.named_steps

Out[90]:
{'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='multinomial',
n_jobs=1, penalty='l2', random_state=1, solver='newton-cg',
tol=0.0001, verbose=0, warm_start=False),
'standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True)}
In [91]:
lr_pipe.named_steps['standardscaler'].transform(X[:5])

Out[91]:
array([[-0.92257899,  1.04365194, -1.36204734, -1.33302226],
[-1.17919036, -0.09310588, -1.36204734, -1.33302226],
[-1.43580174,  0.36159725, -1.42034941, -1.33302226],
[-1.56410743,  0.13424568, -1.30374528, -1.33302226],
[-1.05088468,  1.2710035 , -1.36204734, -1.33302226]])

Things to look up (QQQ):

Model evaluation

• Repeated Holdout
• Bootstrapping
• Cross-validation

Visualization Classification results:

• It's hard to visualize things in higher dimensions

# Recommendation Engine with Neo4j¶

Things are done instead of indexing but through relationships.

• Uniqueness constraints also auto-index (label, property)
CREATE CONSTRAINT ON (t:Topic) ASSRT t.id IS UNIQUE
• Indexes are only used to find the starting point for queries
• Relational databases use index scans to look up rows in tables and join them with rows from other tables

How to run the neo4j guide from this tutorial

:play http://guides.neo4j.com/pydatachi

• Indexes will be the starting point for recommendation traverses
• Need to install APOC (awesome procedures on Cypher)
• Can write stored procedures and drop into neo4j as a plugin
• There are implicit group by's in Cypher queries
• Assuming that things with the thing that has 'has_topic' relationship are topics
• Clustering topics

### Periodic Commit¶

• Cypher keeps all transaction state in memory when running a query
• When using LOAD CSV, we can get an outofmemory exception
USING PERIODIC COMMIT [insert number here; default 1000]
LOAD CSV
• Happy medium tends to be 5K - 10K rows

The WITH clause allows chaining of query parts that are then passed to other MATCH clauses.

• Filter on aggregated values

Topic Similarity Jupyter notebook

Look at this later (QQQ)

• Neo4j defaults to understanding everything as a string
• Imagine if we included event attendance in that algorithm
• Neo4j load csv is a transactional import; if something goes wrong, tie things together later

# PyMC3¶

• Set up models using with statmeent
• Can set up a burn-in preiod
• look for betas to be sharp
• what to look for in the plots?

## Linear Regression¶

• don't use Pandas dataframes with PyMC3
• Cast boolean as integer - PyMC3 doesn't like booleans
• PyMC3 needs to be numpy arrays
• if you don't specify a sampler, pymc3 will pick one for you
• betas look like they're in the right direction
• Cast all the variables with errors
• Instead of coming up with one single value for parameters, coming up with a value and know the error
• What if the beta changes over time?
• What is probabilistic programming

## Hierarchical Linear/Logistic Regression¶

• black box that does parameter estimatino
• make sure that there are no NaNs and Infinities
• Can get an estimate of the distributions
• Can use a NUTS sampler with the standard deviations

## API-fying¶

• can use posterier samples to predict
• even if you put in the wrong actual value, it'll predict something

# Hacking the Jupyter Kernels¶

1. What is a kernel?
• A program that runs and introspects the user's code
• Tracks the state of the kernel itself
• Essentially a communicator
• Single kernel can be connected to multiple front-ends
1. What does jupyter notebook do?

• Starts a kernel + passes a connection file that defines how it should communicate to the front-end
• Kernel + front-end communicate through JSON payloads
2. What does a connetion file look like?

{
"control_port": 50160,
"shell_port": 57503,
"transport": "tcp",
"signature_scheme": "hmac-sha256",
"stdin_port": 52597,
"hb_port": 42540,
"ip": "127.0.0.1",
"iopub_port": 40885,
"key": "a0436f6c-1916-498b-8eb9-e81ab9368e84"
}


### How do we write a kernel for Jupyter?¶

1. Write it in Python
• ipykernel provides all objects that you need to overwrite all standard python functions
• There has to be a repl (repl console system) using pexpect
2. Write the kernel in the native language

Jupyter messaging protocol

1. Defines the spec for how messages look between front-end and kernel
{
'msg_id' : uuid,
'session' : uuid,
'date': str,
'msg_type' : str,
'version' : '5.0',
},
'content' : dict,
}

1. Types of kernel sockets
• Shell
• Managing communication with all of the frontends
• sending to that port to that socket to the kernel, parsing the message, and sending the message back.
{
'msg_id' : uuid,
'session' : uuid,
'date': str,
'msg_type' : ‘is_complete_request’,
'version' : '5.0',
},
'content' : {
‘code’: ‘prin’
}
}


example: tab-complete

    - <b>use casess</b>
- statement completeness
- tab completion
- information about the connected kernel
- send out requests to execute code
* IO Publication
- any kind of image rendering
- streams output that are side effects of execution
* Standard Input
- when the kernel wants to request input from the front end
- executing python code asking for raw_input
* Control
- Does everything that the shell can do
- Can trump things in the queue
- restarting the kernel in the middle of execution is done by control socket
* Heartbeat
- Sends bytestring back and forth
- Used to make sure the kernel is still alive

### What makes this magic happen?¶

0MQ is a messaging library that provides an API that has bindings across programming langauges to open up communication.

An agreement on how messages should be sent.

Random notes:

• Banner: what is printed out when your kernel is connected in console mode.
• code_mirror- what type of syntax highlighting
• Execution count makes a call to another function called "start bash"
• the do_execute tells what to do with the code
• kernel base class only handles do_execute. can write do_complete

A fill-in-the-blank version of how to build a kernel

• Parallel numpy and pandas
• Responsive
• Parallel programming + Task scheduling
• It's a data structure on top of a bunch of numpy arrays
• Does opportunistic cacheing
• Number of chunks is arbitrary
• dask.get() is an alias for the synchronous backend for tracebacks
• Thread is the smallest unit
• io tasks are not blocked by the GIL
• Can do pandas dataframes distributed
• Can build up arbitrarily complex things
• h5py (hierarchical data format)
• fast lookups
• bag take is like head
• don't use groupby unless you have to
• There's a map_blocks for dask arrays

# JUSTIN'S NOTES¶

## Building Recommendation Enginers With Neo4j:¶

-Use link above in Neo4j for full tutorial -Some links might have file instead of url address, change those! -“WITH” clause allows chaining of arguments in Neo4J

## Luigi and Data Piplines¶

github.com/spotify/luigi

• Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
• Works with slack and other chat: https://github.com/houzz/hubot-luigi

• Basic structure is built around Tasks & Targets Tasks: atomic unit of work. Takes a target(s), produces targets) Targets: ..... slide moved too fast

• It seems like LUIGI is made to tie together different functions, tasks, jobs, etc. : it is a framework for scripts tied together
• Has both command line interface and GUI interface
• Speaker used building permit data from City of Chicago: https://data.cityofchicago.org

## JULIA¶

#### Tutorial: clone git@github.com:nassarhuda/PyDataChi2016.git¶

• run code in the cloud, including code from github https://juliabox.com

• Need to correct this error to load some datasets https://github.com/JuliaLang/julia/issues/14746

• Able to select options to do things in parallel

• Able to change methods to increase speed
• Typed like python for the most part
• “Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library”
• Julia's goal: write like Python, run like C, easily access existing libraries and legacy code. (more here)
• Julie packages - http://pkg.julialang.org
• JULIA interface with AWS - https://github.com/JuliaCloud/AWS.jl
• Benefit of JULIA = easy to bring in packages from other languages
• Benefit of JULIA = easy to change type and sometimes possible to change assumptions of what types funtions or modules need
• “Don’t be afraid of doing a for loop in Julia, they can be faster than vectorized operations that you might go to for speed in python”
• Something about self-generated types… .that you can change on the fly with short-cuts?
• There was a short-cut to show you the code behind the functions being called, down to maybe machine code? Or at least C?
• PyCall is one of the most popular packages in Julia as it can run any python code or almost any python code in Julia.
• Interesting data frames syntax in Julia that, at least in places, seem a bit more straight forward than Pandas? See DataFrames_example Jupyter Notebook
In [ ]: