Why do it?
What should we do?
Future of feather
How do we find the least terrible line using gradient descent?
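A minimal sketch of that idea (my own illustration, not from the talk): fit a line by gradient descent on mean squared error, where "terribleness" is the squared distance from each point to the line.

```python
import numpy as np

# Toy data: points that roughly follow a line
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.1, 1.9, 3.2, 3.9])

m, b = 0.0, 0.0   # slope and intercept of the candidate line
lr = 0.01         # learning rate
for _ in range(10_000):
    err = (m * xs + b) - ys              # pointwise 'terribleness'
    m -= lr * 2 * np.mean(err * xs)      # d(MSE)/dm
    b -= lr * 2 * np.mean(err)           # d(MSE)/db
print(m, b)   # should land near the least-squares fit
```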
```python
import numpy as np
import sklearn

x = np.array([[0, 0.1], [3, 0.2], [5, 0.1]])
y = np.array([1, 2, 1])

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
model = LinearDiscriminantAnalysis()
model.fit(x, y)
```
```
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
              solver='svd', store_covariance=False, tol=0.0001)
```
```python
new_point = np.array([1, .3])
# sklearn expects a 2D array (one row per sample): passing a 1d array is
# deprecated in 0.17 and raises a ValueError from 0.19 on, so reshape first.
print(model.predict(new_point.reshape(1, -1)))
```
Most of the time, we don't use a Linear Discriminant Classifier. One common alternative is logistic regression, which fits when the gradient between the two classes can be described with the logistic function.
```python
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
```
SVM = a better definition of 'terrible' (maximize the margin around the decision boundary)
```python
from sklearn.svm import LinearSVC
model = LinearSVC()
```
```python
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
```
```python
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
```
Visualize the fitted tree: `sklearn.tree.export_graphviz()` + `pydot`.
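A minimal sketch of that visualization (file names are my own choice), assuming `model` is a fitted tree and Graphviz is installed:

```python
import pydot
from sklearn.tree import export_graphviz

# Dump the fitted tree to Graphviz .dot, then render it with pydot
export_graphviz(model, out_file='tree.dot')
(graph,) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('tree.png')
```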
Ensemble models: deal with overfitting problems
from sklearn.ensemble import [insert model here]
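One concrete choice, purely as an illustration (a random forest, reusing the toy `x`, `y` from above):

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
model.fit(x, y)
```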
How do I pick?
| | Nonlinear boundary | Probability estimate | Feature importance |
| --- | --- | --- | --- |
| Logistic regression | Yes | Not really | No |
| KNN | Yes | Sort of (% nearby points) | No |
| Ensemble | Yes | Kinda (% agreement) | Kinda |
| | Can I update? | Easy to parallelize? |
| --- | --- | --- |
| SVM | Kinda, depending on kernel | Yes for some kernels, no for others |
| Decision tree | No | No (but it's really fast) |
| Ensemble | Kinda, by adding new models | Yes |
| Boosted | Kinda, by adding new models | No |
SVM: Pick a kernel
KNN: need to define what 'similarity' is in a good way. Fast to train, slow to classify
Naive Bayes: have to choose the distribution. Can deal with missing data
Decision Tree: Can provide literal flow charts, sensitive to outliers
Ensemble: Less prone to overfitting
Boosted: more parameters to tweak, more prone to overfitting than normal ensembles
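A toy sketch of fitting each of these in sklearn (my own illustration, reusing the `x`, `y`, `new_point` defined earlier; `GaussianNB` stands in for "Naive Bayes"):

```python
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    'SVM (RBF kernel)': SVC(kernel='rbf'),
    'Naive Bayes': GaussianNB(),
    'Decision tree': DecisionTreeClassifier(),
    'Ensemble (random forest)': RandomForestClassifier(),
    'Boosted (gradient boosting)': GradientBoostingClassifier(),
}
for name, m in models.items():
    m.fit(x, y)
    print(name, m.predict(new_point.reshape(1, -1)))
```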
* Can I detect ____ early?
* Can I determine which ____ to prioritize?
* Which policies do I modify to improve ____?
* How much impact is ____ having?
* Can I get data that helps me?
"Write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”."
```python
import numpy as np

def x(i):
    # Hand-engineered features: a bias term, plus divisible-by-3
    # and divisible-by-5 flags
    return np.array([1, i % 3 == 0, i % 5 == 0])
```
What if we aren't that clever?
Encode the numbers in binary: say 10 digits, which covers everything up to 1023.
So train on 101-1023, then use 1-100 as the test set.
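A minimal sketch of that setup (function names are my own):

```python
import numpy as np

NUM_DIGITS = 10  # 10 bits covers 1..1023

def binary_encode(i, num_digits=NUM_DIGITS):
    # Little-endian binary representation of i
    return np.array([(i >> d) & 1 for d in range(num_digits)])

def fizzbuzz_label(i):
    # 0 = the number itself, 1 = Fizz, 2 = Buzz, 3 = FizzBuzz
    if i % 15 == 0: return 3
    if i % 5 == 0:  return 2
    if i % 3 == 0:  return 1
    return 0

# Train on 101-1023, hold out 1-100 as the test set
train_X = np.array([binary_encode(i) for i in range(101, 1024)])
train_y = np.array([fizzbuzz_label(i) for i in range(101, 1024)])
test_X  = np.array([binary_encode(i) for i in range(1, 101)])
```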
Linear regression is a neural network with no hidden layers
Logistic regression is a neural net with no hidden layers and a sigmoid activation function
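In code, that correspondence is just this (my own illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# A 'network' with no hidden layer: inputs -> weights -> output
def linear_regression(x, w, b):
    return x @ w + b                # identity activation

def logistic_regression(x, w, b):
    return sigmoid(x @ w + b)       # same net, sigmoid activation
```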
Decimal encoding is really good at 'divisible by 5' and terrible at everything else. Back to binary.
Train the neural network, varying the number of hidden units.
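Sketching that with scikit-learn's `MLPClassifier` (my choice of library here; the talk may have used something else), reusing the train/test split above:

```python
from sklearn.neural_network import MLPClassifier

test_y = np.array([fizzbuzz_label(i) for i in range(1, 101)])
for n_hidden in (10, 50, 100):
    net = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=2000)
    net.fit(train_X, train_y)
    print(n_hidden, 'hidden units -> accuracy', net.score(test_X, test_y))
```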
How does this work?
Example: Poisson Regression
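For reference (the standard definition, not from the talk notes): Poisson regression models counts as

$$y_i \sim \mathrm{Poisson}(\lambda_i), \qquad \log \lambda_i = x_i^\top \beta,$$

so the negative log-likelihood is, up to a constant,

$$\mathrm{NLL}(\beta) = \sum_i \left( e^{x_i^\top \beta} - y_i\, x_i^\top \beta \right).$$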
Elastic net takes the best of both worlds: the L1 penalty's sparsity and the L2 penalty's stability with correlated features!
Elastic net regression
Optimize the penalized NLL with gradient descent.
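A bare-bones sketch of that optimization (my own, with `np.sign` as a subgradient for the non-differentiable L1 term):

```python
import numpy as np

def penalized_nll_grad(beta, X, y, alpha=0.1, rho=0.5):
    lam = np.exp(X @ beta)              # Poisson rate for each row
    grad = X.T @ (lam - y)              # gradient of the Poisson NLL
    # Elastic net penalty: rho weights L1 vs L2
    grad += alpha * (rho * np.sign(beta) + (1 - rho) * beta)
    return grad

def fit_poisson_elastic_net(X, y, lr=1e-3, steps=5000):
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        beta -= lr * penalized_nll_grad(beta, X, y)
    return beta
```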
What are pipelines?
Building blocks of sklearn pipelines:

* `Pipeline` routes the output of each transformer as input into the next step
* `FeatureUnion` joins the results of both pipeline outputs
Why pipelines?
What is a transformer?
* `transform(X, [y])`
* `fit(X, [y])`
Calling transform on the transformer is identical to manually subselecting the target columns.
What is an estimator?
Any sklearn object that makes predictions
What is a pipeline?
Pipeline of transforms with a final estimator
What's a feature union?
Ensemblage = combining the output of multiple pipelines
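A minimal sketch of both pieces together (my own example, reusing the toy `x`, `y`, `new_point` from earlier):

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# FeatureUnion: run two transformers side by side, concatenate their outputs
features = FeatureUnion([
    ('scaled', StandardScaler()),
    ('pca', PCA(n_components=1)),
])

# Pipeline: route each step's output into the next, ending in an estimator
pipe = Pipeline([
    ('features', features),
    ('clf', LogisticRegression()),
])
pipe.fit(x, y)
print(pipe.predict(new_point.reshape(1, -1)))
```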
conda-forge easter eggs
```
conda install -c conda-forge tensorflow
```
Safia and nteract
Christy Comp Bio @ Field Museum
Problems with Jupyter
```python
from multiprocessing.pool import ThreadPool

pool = ThreadPool(4)
result = pool.apply_async(func, args)  # func/args: whatever long-running call you need
```
`apply_async` has a callback argument.
from slack import slack_msg
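Putting those together (a sketch; `slack_msg` is the speaker's own helper, and its signature here is my guess):

```python
from multiprocessing.pool import ThreadPool
from slack import slack_msg   # speaker's own helper module

def long_job():
    ...  # the slow computation

pool = ThreadPool(4)
# Ping Slack when the job finishes; the callback receives the return value
pool.apply_async(long_job, callback=lambda result: slack_msg("long_job finished"))
```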
Excel is real
Jose: figuring out a file format
I'm not crazy