Nuclear Engineer @ UIUC
Tools
Why do it?
What should we do?
Idea
Simplicity
feather.read_dataframe(path, columns=None)
Future of feather
Use case
Caveat
How do we find the least terrible line using gradient descent?
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

x = np.array([[0, 0.1], [3, 0.2], [5, 0.1]])
y = np.array([1, 2, 1])

model = LinearDiscriminantAnalysis()
model.fit(x, y)
new_point = np.array([[1, 0.3]])  # predict expects a 2D array
print(model.predict(new_point))
Most of the time, we don't use a Linear Discriminant Classifier. Other models include logistic regression, which fits when the gradient between the two classes can be described with the logistic function.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
SVM = better definition of 'terrible'
from sklearn.svm import LinearSVC
model = LinearSVC()
KNN
from sklearn.neighbors import KNeighborsClassifier  # NearestNeighbors is the unsupervised variant
Decision Tree learners
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
import pydot
sklearn.tree.export_graphviz() + pydot
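A minimal sketch of that visualization workflow (dataset, `max_depth`, and variable names are my own choices): with `out_file=None`, `export_graphviz` returns the Graphviz DOT source as a string, which pydot can then render.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# out_file=None makes export_graphviz return the DOT source as a string
# instead of writing it to disk.
dot_source = export_graphviz(model, out_file=None)
print(dot_source.splitlines()[0])
```

To get the literal flow chart as an image: `graph, = pydot.graph_from_dot_data(dot_source)` then `graph.write_png('tree.png')`.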
Ensemble Model: deals with overfitting problems
Bagging:
Random Forest:
Extra trees:
Voting:
Boosting:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier, VotingClassifier, AdaBoostClassifier  # pick the one you need
How do I pick?
Model | Nonlinear boundary | Probability estimate | Feature importance
---|---|---|---
Logistic Regression | Yes | Not really | No
KNN | Yes | Sort of (% nearby points) | No
Naive Bayes | Yes | Yes | No
Decision Tree | Yes | No | Kinda
Ensemble | Yes | Kinda (% agreement) | Kinda
Model | Can I update? | Easy to parallelize?
---|---|---
Logistic Regression | Kinda | Kinda
SVM | Kinda, depending on kernel | Yes for some kernels, no for others
KNN | Yes | Yes
Naive Bayes | Yes | Yes
Decision Tree | No | No (but it's really fast)
Ensemble | Kinda, by adding new models | Yes
Boosted | Kinda, by adding new models | No
Other quirks:
SVM: Pick a kernel
KNN: need to define what 'similarity' is in a good way. Fast to train, slow to classify
Naive Bayes: have to choose the distribution. Can deal with missing data
Decision Tree: can provide literal flow charts; sensitive to outliers
Ensemble: Less prone to overfitting
Boosted: more parameters to tweak; more prone to overfitting than normal ensembles
* Can I detect ____ early?
* Can I determine which ___ to prioritize?
* Which policies do I modify to improve ____?
* How much impact is ________ having?
* Can I get data that helps me?
"Write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”."
import numpy as np

def x(i):
    return np.array([1, i % 3 == 0, i % 5 == 0])
What if we aren't that clever?
Encode the numbers in binary: say 10 binary digits, covering values up to 1023.
So train on 101-1023, then use 1-100 as test set.
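A sketch of that encoding and split in numpy (the helper names and the 0-3 label scheme are my own):

```python
import numpy as np

NUM_DIGITS = 10  # 10 binary digits covers 1..1023

def binary_encode(i, num_digits=NUM_DIGITS):
    # Least-significant bit first: 5 -> [1, 0, 1, 0, ...]
    return np.array([(i >> d) & 1 for d in range(num_digits)])

def fizzbuzz_label(i):
    # 0: the number itself, 1: "Fizz", 2: "Buzz", 3: "FizzBuzz"
    if i % 15 == 0:
        return 3
    if i % 5 == 0:
        return 2
    if i % 3 == 0:
        return 1
    return 0

# Train on 101-1023; hold out 1-100 as the test set.
X_train = np.array([binary_encode(i) for i in range(101, 1024)])
y_train = np.array([fizzbuzz_label(i) for i in range(101, 1024)])
X_test = np.array([binary_encode(i) for i in range(1, 101)])
```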
Neural nets:
Solving
Using Keras
Linear regression is a neural network with no hidden layers
Logistic regression is a neural net with no hidden layers and a sigmoid activation function
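That claim can be made concrete: one dense layer plus a sigmoid, trained on cross-entropy, is exactly logistic regression. A minimal numpy sketch (toy data and hyperparameters are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # linearly separable labels

# "Network": one weight vector, one bias, sigmoid activation, no hidden layer.
w, b = np.zeros(2), 0.0
lr = 0.5
for _ in range(500):
    p = sigmoid(X @ w + b)
    # Gradient of the cross-entropy loss: the same update rule
    # logistic regression uses.
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

preds = sigmoid(X @ w + b) > 0.5
print((preds == y).mean())
```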
Decimal encoding is really good at 'divisible by 5' and terrible at everything else. Back to binary.
Train neural networks with varying numbers of hidden units
Deep Learning
How does this work?
Lessons Learned
Go look at:
Example: Poisson Regression
We assume:
Elastic net takes the best of both worlds!
Elastic net regression
Optimize the penalized NLL with gradient descent.
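A sketch of that optimization in numpy, assuming counts drawn as y ~ Poisson(exp(X @ beta)); the penalty knob names `alpha` (overall strength) and `rho` (L1/L2 mix) are my own:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with a known coefficient vector.
X = rng.normal(size=(500, 2))
beta_true = np.array([0.8, -0.5])
y = rng.poisson(np.exp(X @ beta_true))

alpha, rho = 0.01, 0.5  # elastic-net strength and L1/L2 mix

def grad(beta):
    mu = np.exp(X @ beta)
    # Gradient of the per-sample Poisson NLL, sum(mu - y * X @ beta) / n,
    # plus the elastic-net terms: rho weights L1, (1 - rho) weights L2.
    g = X.T @ (mu - y) / len(y)
    g += alpha * (rho * np.sign(beta) + (1 - rho) * beta)
    return g

beta = np.zeros(2)
for _ in range(2000):
    beta -= 0.1 * grad(beta)

print(beta)  # close to beta_true, slightly shrunk by the penalty
```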
Kevin Goetsch
import sklearn
What are pipelines?
Building blocks of sklearn pipelines
Pipeline
routes the output of the transformer as input into the estimator
FeatureUnion
joins the results of both pipeline outputs
Why pipelines?
What is a transformer?
transform(x, [y])
fit(x, [y])
Calling transform on the transformer is identical to manually subselecting the target factor.
What is an estimator?
Any sklearn objects which make predictions
What is a pipeline?
Pipeline of transforms with a final estimator
What's a feature union ?
Horizontal pipeline.
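Putting the two building blocks together, a minimal sketch (dataset, step names, and transformers are my own choices): the FeatureUnion runs two transformers side by side and concatenates their columns, and the Pipeline routes that into a final estimator.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Horizontal: both transformers see X; their outputs are joined column-wise.
features = FeatureUnion([
    ("scaled", StandardScaler()),
    ("pca", PCA(n_components=3)),
])

# Vertical: transformer output is routed into the final estimator.
pipe = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

pipe.fit(X, y)
print(pipe.predict(X[:5]))
```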
DS Lifecycle
Feature Engineering
Ensemblage = combining the output of multiple pipelines
Model Selection
joblib.dump(pipeline, 'file.pkl')
Reloading: pipeline = joblib.load('file.pkl')
RMarkdown
Corn
Dicompyler
conda-forge easter eggs
conda install -c conda-forge tensorflow
Data Ethics
Safia and nteract
Christy Comp Bio @ Field Museum
Problems with Jupyter
from multiprocessing.pool import ThreadPool
pool = ThreadPool(4)
pool.apply_async(func, args, callback=...)
apply_async has a callback function, e.g. to send a Slack message or email when a long-running task finishes:
from slack import slack_msg
import smtplib
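A runnable stdlib-only sketch of that pattern (the `notify` callback here just records the result; the talk plugged `slack_msg` or smtplib in at that point):

```python
from multiprocessing.pool import ThreadPool

notifications = []

def notify(result):
    # Called when the task completes; swap in slack_msg or an smtplib
    # email here to get pinged about long-running work.
    notifications.append(result)

pool = ThreadPool(4)
async_result = pool.apply_async(pow, (2, 10), callback=notify)
pool.close()
pool.join()
print(async_result.get())  # 1024
```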
Bob: Geopandas
Excel is real
Jose: figuring out a file format
I'm not crazy