1) Split into training/test (70/30)
2) Hyperparameter values (meta-values about the algorithm) + Learning algorithm
* Hyperparameter = a value of the algorithm, not the model (e.g., how many iterations)
3) Test the model
* Make predictions using the model
* Compare prediction to actual output labels (model output to true labels)
* Compute performance metrics
4) Re-create the model with all data
* generally, more data leads to a better model (see the condensed sketch of these steps below)
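A condensed sketch of the four steps above using scikit-learn (note: newer scikit-learn versions expose train_test_split in sklearn.model_selection rather than sklearn.cross_validation; the estimator here is just a placeholder):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
def fit_and_evaluate(X, y):
    # 1) split into training/test (70/30)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=123)
    # 2) pick hyperparameter values + learning algorithm
    model = LinearRegression()
    model.fit(X_train, y_train)
    # 3) test the model: predict and compute a performance metric
    test_score = model.score(X_test, y_test)
    # 4) re-create the model with all the data
    model.fit(X, y)
    return model, test_score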
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Source: R.J. Gladstone (1905). "A Study of the Relations of the Brain to the Size of the Head", Biometrika, Vol. 4, pp. 105-123
Description: Brain weight (grams) and head size (cubic cm) for 237 adults classified by gender and age group.
Variables/Columns
# load the data with pandas and take a look at it
df = pd.read_csv('../../pydata-chicago2016-ml-tutorial/code/dataset_brain.txt',
                 encoding='utf-8',
                 comment='#',
                 sep=r'\s+')
df.tail()
Notes:
Can head size predict brain weight?
#look at whether or not it makes sense to build a model
plt.scatter(df['head-size'], df['brain-weight'])
plt.xlabel('Head size (cm^3)')
plt.ylabel('Brain weight (grams)');
# create a numpy array from the pandas column: y first - what we want to predict
y = df['brain-weight'].values
y.shape
# another numpy array from pandas
X = df['head-size'].values
# we need to add a second axis to the array - scikit-learn expects a 2D feature matrix
X = X[:, np.newaxis]
X.shape
Random state: the same random state will give you the same split, which is helpful for reproducibility.
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)
X_train.shape
plt.scatter(X_train, y_train, c='blue', marker='o')
plt.scatter(X_test, y_test, c='red', marker='s')
plt.xlabel('Head size (cm^3)')
plt.ylabel('Brain weight (grams)');
from sklearn.linear_model import LinearRegression
#initialize linear regression object
lr = LinearRegression()
#provide both x and y, because it's supervised learning
lr.fit(X_train, y_train)
# make predictions using the 'predict' method
y_pred = lr.predict(X_test)
We can use the R2 score to evaluate the model; here it is computed manually:
ss_res = ((y_test - y_pred) ** 2).sum()
ss_tot = ((y_test - y_test.mean()) ** 2).sum()
r2_score = 1 - (ss_res / ss_tot)
print('R2 score: %.3f' % r2_score)
# check the scikit-learn R2 against the one computed manually above
print('R2 score: %.3f' % lr.score(X_test, y_test))
sklearn.metrics has a number of different ways to evaluate the fit of a model.
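For example (a sketch reusing the fitted lr, X_test, and y_test from above):
from sklearn import metrics
y_pred = lr.predict(X_test)
# same R2 as above, plus mean squared error, via sklearn.metrics
print('R2 score: %.3f' % metrics.r2_score(y_test, y_pred))
print('MSE: %.3f' % metrics.mean_squared_error(y_test, y_pred))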
lr.coef_, lr.intercept_
Use y = mx + b to plot the line below
min_pred = X_train.min() * lr.coef_ + lr.intercept_
max_pred = X_train.max() * lr.coef_ + lr.intercept_
plt.scatter(X_train, y_train, c='blue', marker='o')
plt.plot([X_train.min(), X_train.max()],
         [min_pred, max_pred],
         color='red',
         linewidth=4)
plt.xlabel('Head size (cm^3)')
plt.ylabel('Brain weight (grams)');
If we use the train/test split function, we may not get a training set with the same class proportions as the full dataset. We need to stratify the split, which is a relatively new feature of scikit-learn.
In logistic regression, the activation function produces a predicted probability, which is then passed through a unit step function to predict the class label.
Logistic regression is a generalized linear model that learns weight coefficients, which we then use to make predictions on new data.
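A minimal numpy sketch of that idea (hypothetical weights w and bias b, not something scikit-learn exposes this way):
import numpy as np
def sigmoid(z):
    # logistic activation: squashes the net input into a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))
def predict_binary(x, w, b):
    # net input -> predicted probability -> unit step -> class label (0 or 1)
    proba = sigmoid(np.dot(x, w) + b)
    return np.where(proba >= 0.5, 1, 0)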
K-nearest-neighbors looks up the samples in the neighborhood and makes the classification prediction based on the nearest neighbors.
One thing to keep in mind: all features must be on the same scale.
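A toy illustration (made-up numbers) of why scale matters for a distance-based method like k-nearest-neighbors:
import numpy as np
# two features on very different scales
a = np.array([1000.0, 1.0])
b = np.array([1100.0, 2.0])
# the Euclidean distance is dominated almost entirely by the first feature
print(np.linalg.norm(a - b))  # ~100.0, the second feature barely contributes
# after rescaling both features to comparable ranges, each contributes fairly
a_scaled = np.array([0.10, 0.1])
b_scaled = np.array([0.11, 0.2])
print(np.linalg.norm(a_scaled - b_scaled))  # ~0.1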
df = pd.read_csv('../../pydata-chicago2016-ml-tutorial/code/dataset_iris.txt',
                 encoding='utf-8',
                 comment='#',
                 sep=',')
df.tail()
X = df.iloc[:, :4].values
y = df['class'].values
#these are the unique values within y
np.unique(y)
from sklearn.preprocessing import LabelEncoder
# we want to convert the labels to integers (string to int)
l_encoder = LabelEncoder()
l_encoder.fit(y)
l_encoder.classes_
y_enc = l_encoder.transform(y)
np.unique(y_enc)
np.unique(l_encoder.inverse_transform(y_enc))
from sklearn.datasets import load_iris
iris = load_iris()
print(iris['DESCR'])
X, y = iris.data[:, :2], iris.target
# ! We only use 2 features for visual purposes (sepal length, width)
print('Class labels:', np.unique(y))
print('Class proportions:', np.bincount(y))
#what happens if we just split without stratifying?
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)
print('Class labels:', np.unique(y_train))
print('Class proportions:', np.bincount(y_train))
# the stratify=y option makes sure the class proportions are preserved in both the training and test splits
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123,
    stratify=y)
print('Class labels:', np.unique(y_train))
print('Class proportions:', np.bincount(y_train))
If you use the default setting, it will train three binary logistic regression classifiers, each one comparing one class against the rest ('one-vs-rest'). Alternatively, we can use the softmax function to do multinomial classification. If the classes are mutually exclusive, it's reasonable to use softmax.
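A minimal numpy sketch of the softmax function (not scikit-learn's internal implementation):
import numpy as np
def softmax(z):
    # exponentiate and normalize so the class scores sum to 1
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()
print(softmax(np.array([2.0, 1.0, 0.1])))  # roughly [0.66, 0.24, 0.10]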
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='newton-cg',
                        multi_class='multinomial',
                        random_state=1)
lr.fit(X_train, y_train)
print('Test accuracy %.2f' % lr.score(X_test, y_test))
from mlxtend.evaluate import plot_decision_regions
plot_decision_regions
plot_decision_regions(X=X, y=y, clf=lr, X_highlight=X_test)
plt.xlabel('sepal length [cm]')
plt.ylabel('sepal width [cm]');
from sklearn.neighbors import KNeighborsClassifier
#how do you pick the number of neighbors?
#how to break a tie
kn = KNeighborsClassifier(n_neighbors=4)
kn.fit(X_train, y_train)
print('Test accuracy %.2f' % kn.score(X_test, y_test))
plot_decision_regions(X=X, y=y, clf=kn, X_highlight=X_test)
plt.xlabel('sepal length [cm]')
plt.ylabel('sepal width [cm]');
Nominal variables have no intrinsic ordering, so we can binarize (one-hot encode) them. Ordinal variables can be coded with numbers that preserve their order.
Types of feature normalization: min-max scaling rescales each feature to the range [0, 1], while z-score standardization centers each feature at zero with unit variance. Z-score standardization helps simple optimization methods like gradient descent, because the weights are updated on comparable scales and the standardized values can move in both directions around zero.
Re-scaling is done to make sure that the coefficients the model puts out reflect how important the input variables are, rather than their magnitudes.
import pandas as pd
df = pd.DataFrame([
    ['green', 'M', 10.0],
    ['red', 'L', 13.5],
    ['blue', 'XL', 15.3]])
df.columns = ['color', 'size', 'prize']
df
Note to self: Look up what DictVectorizer does (QQQ)
from sklearn.feature_extraction import DictVectorizer
dvec = DictVectorizer(sparse=False)
X = dvec.fit_transform(df.transpose().to_dict().values())
X
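(Answering the QQQ note above.) DictVectorizer turns a list of feature dicts into a numeric array, one-hot encoding string values and passing numeric values through. A small standalone sketch (get_feature_names() is the method name in this scikit-learn version; newer releases use get_feature_names_out()):
from sklearn.feature_extraction import DictVectorizer
rows = [{'color': 'green', 'size': 'M', 'prize': 10.0},
        {'color': 'red', 'size': 'L', 'prize': 13.5}]
dv = DictVectorizer(sparse=False)
print(dv.fit_transform(rows))
print(dv.get_feature_names())  # ['color=green', 'color=red', 'prize', 'size=L', 'size=M']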
size_mapping = {
    'XL': 3,
    'L': 2,
    'M': 1}
df['size'] = df['size'].map(size_mapping)
df
X = dvec.fit_transform(df.transpose().to_dict().values())
X
df = pd.DataFrame([1., 2., 3., 4., 5., 6.], columns=['feature'])
df
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
mmxsc = MinMaxScaler()
stdsc = StandardScaler()
X = df['feature'].values[:, np.newaxis]
df['minmax'] = mmxsc.fit_transform(X)
df['z-score'] = stdsc.fit_transform(X)
df
from sklearn.pipeline import make_pipeline
from sklearn.cross_validation import train_test_split
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123,
    stratify=y)
lr = LogisticRegression(solver='newton-cg',
                        multi_class='multinomial',
                        random_state=1)
lr_pipe = make_pipeline(StandardScaler(), lr)
lr_pipe.fit(X_train, y_train)
lr_pipe.score(X_test, y_test)
lr_pipe.named_steps
lr_pipe.named_steps['standardscaler'].transform(X[:5])
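Roughly, the pipeline above is equivalent to running the two steps by hand (a sketch reusing the fitted pipeline's own steps):
scaler = lr_pipe.named_steps['standardscaler']
clf = lr_pipe.named_steps['logisticregression']
X_test_std = scaler.transform(X_test)  # step 1: standardize using training-set statistics
print(clf.score(X_test_std, y_test))   # step 2: score the classifier on the scaled data; should match lr_pipe.score above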
Things to look up (QQQ):
- Model evaluation
- Visualizing classification results
In Neo4j, things are done not through indexing but through relationships.
CREATE CONSTRAINT ON (t:Topic) ASSERT t.id IS UNIQUE
How to run the neo4j guide from this tutorial
:play http://guides.neo4j.com/pydatachi
USING PERIODIC COMMIT [insert number here; default 1000]
LOAD CSV
The WITH clause allows chaining of query parts that are then passed to other MATCH clauses.
Topic Similarity Jupyter notebook
Look at this later (QQQ)
What does jupyter notebook do?
What does a connection file look like?
{
"control_port": 50160,
"shell_port": 57503,
"transport": "tcp",
"signature_scheme": "hmac-sha256",
"stdin_port": 52597,
"hb_port": 42540,
"ip": "127.0.0.1",
"iopub_port": 40885,
"key": "a0436f6c-1916-498b-8eb9-e81ab9368e84"
}
Jupyter messaging protocol
{
    'header' : {
        'msg_id' : uuid,
        'username' : str,
        'session' : uuid,
        'date': str,
        'msg_type' : str,
        'version' : '5.0',
    },
    'parent_header' : dict,
    'metadata' : dict,
    'content' : dict,
}
{
    'header' : {
        'msg_id' : uuid,
        'username' : str,
        'session' : uuid,
        'date': str,
        'msg_type' : 'is_complete_request',
        'version' : '5.0',
    },
    'parent_header' : dict,
    'metadata' : dict,
    'content' : {
        'code': 'prin'
    }
}
example: tab-complete
* Shell
- use cases:
- statement completeness
- tab completion
- information about the connected kernel
- send out requests to execute code
* IO Publication
- any kind of image rendering
- provides information about current kernel state (live/dead)
- streams output that are side effects of execution
* Standard Input
- when the kernel wants to request input from the front end
- executing python code asking for raw_input
* Control
- Does everything that the shell can do
- Can trump things in the queue
- restarting the kernel in the middle of execution is done via the control socket
* Heartbeat
- Sends bytestring back and forth
- Used to make sure the kernel is still alive
0MQ is a messaging library that provides an API with bindings across programming languages to open up communication. The messaging protocol itself is the agreement on how messages should be sent.
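A minimal pyzmq request/reply sketch, just to show the flavor of 0MQ (the port number is arbitrary; this is not the Jupyter protocol itself):
import zmq
context = zmq.Context()
# "reply" side: bind a socket and wait for requests
server = context.socket(zmq.REP)
server.bind('tcp://127.0.0.1:5555')
# "request" side: connect and send a message
client = context.socket(zmq.REQ)
client.connect('tcp://127.0.0.1:5555')
client.send(b'ping')
print(server.recv())  # b'ping'
server.send(b'pong')
print(client.recv())  # b'pong'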
Random notes:
Dask:
:play http://guides.neo4j.com/pydatachi
- Use link above in Neo4j for full tutorial
- Some links might have file instead of url address, change those!
- "WITH" clause allows chaining of arguments in Neo4j
github.com/spotify/luigi
Works with slack and other chat: https://github.com/houzz/hubot-luigi
Basic structure is built around Tasks & Targets. Tasks: atomic unit of work; takes target(s), produces target(s). Targets: ..... (slide moved too fast)
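A hypothetical minimal Luigi task, just to illustrate the Task/Target structure (names are made up):
import luigi
class WordCount(luigi.Task):
    # a Task is an atomic unit of work, parameterized by its inputs
    input_path = luigi.Parameter()
    def output(self):
        # the Target this task produces
        return luigi.LocalTarget(self.input_path + '.count')
    def run(self):
        with open(self.input_path) as f, self.output().open('w') as out:
            out.write(str(len(f.read().split())))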
Run code in the cloud, including code from GitHub: https://juliabox.com
Need to correct this error to load some datasets https://github.com/JuliaLang/julia/issues/14746
Able to select options to do things in parallel