PyData 2016 - Chicago - Day 2


Keynote: Scaling Human Learning

Katy Huff

Nuclear Engineer @ UIUC

  • The Hacker Within

    • Models that have worked
    • Models that have failed
      • single point-of-failure
  • Software Carpentry

  • Data Carpentry

  • Tools

  • Why do it?

    • community
    • travel
    • teaching experience
    • felt need
  • What should we do?

    • Lower the barrier, not the standards
    • New Dev instructions
    • Document well
    • Curate low hanging fruit
    • Targeted sprints
    • Appoint ambassador
    • Consider Users Conferences

Feather

  • Exchange tabular data between Python, R, and others
  • Fast read/write
  • Represent categorical features
  • it's about the metadata
  • memory access cost depends on both location + predictability
  • sequential access FTW

Idea

  • on-disk representation should be similar to in-memory representation
  • columnar layout is good fit for analytic workflows
    • columnar layout from arrow

Simplicity

feather.read_dataframe(path, columns=None)
  • each column of the dataframe is serialized separately
  • bitmask of nulls
  • values
  • looks like a data frame in R
  • no current in-place concatenations
  • how does it handle escape characters?

Future of feather

  • in-place operations
  • zero parsing or copying to Pandas memory
  • mmap the feather file
  • input to sklearn or statsmodels
  • output from PostgreSQL

Use case

  • passing data between R and Python in computation
  • As part of a Luigi Pipeline
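
A minimal hedged round-trip sketch with the feather-format package (write_dataframe as the counterpart of the read_dataframe call above; the R side reads the same file with feather::read_feather):

    import feather
    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
    feather.write_dataframe(df, 'data.feather')    # fast columnar write
    df2 = feather.read_dataframe('data.feather')   # same table back, no parsing
    # in R: df <- feather::read_feather('data.feather')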

Microsoft Cognitive Services

Examples

Cognitive Services API

  • Usage limits: calls per month
  • Usage limits: calls per minute

Explaining classification algorithms

  • Sorting Hat
  • Spam filter

Notes

  • Need labelled data for training
  • feature = dimension = column = attribute
  • class = category

Caveat

  • Choosing good features or getting more data will help more than changing algorithms

How do we find the least terrible line using gradient descent?
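
A minimal sketch of the idea on made-up data: gradient descent nudges the line's slope and intercept downhill on the mean squared error until the line is least terrible.

    import numpy as np

    xs = np.array([0.0, 1.0, 2.0, 3.0])
    ys = np.array([1.0, 3.0, 5.0, 7.0])   # the true line is y = 2x + 1

    w, b, lr = 0.0, 0.0, 0.05
    for _ in range(1000):
        err = (w * xs + b) - ys           # how terrible the line is, pointwise
        w -= lr * 2 * (err * xs).mean()   # gradient of MSE w.r.t. the slope
        b -= lr * 2 * err.mean()          # gradient of MSE w.r.t. the intercept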

Implementing a spam filter

In [15]:
import numpy as np
import sklearn
In [16]:
x = np.array([[0, 0.1], [3, 0.2], [5, 0.1]])
y = np.array([1, 2, 1])
In [17]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
model = LinearDiscriminantAnalysis()
model.fit(x, y)
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
Out[17]:
LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
              solver='svd', store_covariance=False, tol=0.0001)
In [18]:
new_point = np.array([[1, 0.3]])  # 2-D shape (one sample, two features) avoids the 0.17 deprecation warning
print(model.predict(new_point))
[1]

Most of the time, we don't use a linear discriminant classifier. Other models include logistic regression, which fits when there's a gradient between the two classes that can be described with the logistic function.

In [26]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

SVM = better definition of 'terrible'

  • lines can turn into non-linear shapes if you transform your data
  • the kernel trick: take the square of each number (sketch after the cell below)
  • RBF SVM: radial-basis-function SVM. Creates more complex shapes. The most popular kernel
  • SVM also tries to maximize the margins
In [29]:
from sklearn.svm import LinearSVC
model = LinearSVC()
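
A hedged sketch of the kernel-trick bullet above, on made-up toy data: a 1-D problem that no threshold separates becomes linearly separable after adding the squared feature, and the RBF kernel gets there without manual feature engineering:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[-2.0], [-0.5], [0.5], [2.0]])
    y = np.array([1, 0, 0, 1])           # not separable by any 1-D threshold

    X_sq = np.hstack([X, X ** 2])        # 'take the square of each number'
    SVC(kernel='linear').fit(X_sq, y)    # a line separates the transformed data

    SVC(kernel='rbf').fit(X, y)          # or let the most popular kernel do it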

KNN

  • What do similar cases look like?
  • k = how many?
  • Tie-breaking
In [27]:
from sklearn.neighbors import NearestNeighbors
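
The transcript imports NearestNeighbors; for the classifier behavior described above (k neighbors voting, ties broken among them), KNeighborsClassifier is the usual choice. A hedged sketch reusing the toy x and y from the cells above:

    from sklearn.neighbors import KNeighborsClassifier

    model = KNeighborsClassifier(n_neighbors=3)  # k = how many neighbors vote
    model.fit(x, y)
    print(model.predict(np.array([[1, 0.3]])))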

Decision Tree learners

  • Make a flow chart of it
  • In higher dimensions
  • Prone to overfitting
  • Use Pydot
In [30]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import pydot

model = DecisionTreeClassifier()
model.fit(x, y)
# write the tree as DOT, then render the literal flow chart with pydot
export_graphviz(model, out_file='tree.dot')
(graph,) = pydot.graph_from_dot_file('tree.dot')  # pydot >= 1.2 returns a list
graph.write_png('tree.png')

Ensemble Model: deals with overfitting problems

Bagging:

  • Split training set
  • Train one model each
  • Models 'vote'
  • Sum of the decision boundaries of its components

Random Forest:

  • Like bagging
  • At each split randomly constrain features to choose from

Extra trees:

  • For each split, make it random, non-optimally
  • Compensate by making a ton of trees

Voting:

  • Combine a bunch of different models of your design, have them 'vote' on the correct answer
  • For example (KNN, SVM, Decision Tree)
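
A minimal hedged sketch of that trio with scikit-learn's VotingClassifier (present since 0.17); the estimator settings are illustrative:

    from sklearn.ensemble import VotingClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # three models of your design 'vote' on the correct answer
    model = VotingClassifier(estimators=[
        ('knn', KNeighborsClassifier(n_neighbors=3)),
        ('svm', SVC()),
        ('tree', DecisionTreeClassifier()),
    ], voting='hard')  # 'hard' = majority vote on predicted labels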

Boosting:

  • Train models in order; make the later ones focus on the points the earlier ones missed

    # the models above all live in sklearn.ensemble, e.g.:
    from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
        ExtraTreesClassifier, VotingClassifier, AdaBoostClassifier, GradientBoostingClassifier)

How do I pick?

  • Nonlinear decision boundary
  • Providing probability estimates
  • Telling how important a feature is to the model

Model                 Nonlinear boundary   Probability estimate        Feature importance
Logistic Regression   Yes                  Not really                  No
KNN                   Yes                  Sort of (% nearby points)   No
Naive Bayes           Yes                  Yes                         No
Decision Tree         Yes                  No                          Kinda
Ensemble              Yes                  Kinda (% agreement)         Kinda

Model                 Can I update?                 Easy to parallelize?
Logistic Regression   Kinda                         Kinda
SVM                   Kinda, depending on kernel    Yes for some kernels, no for others
KNN                   Yes                           Yes
Naive Bayes           Yes                           Yes
Decision Tree         No                            No (but it's really fast)
Ensemble              Kinda, by adding new models   Yes
Boosted               Kinda, by adding new models   No

Other quirks:

SVM: Pick a kernel

KNN: need to define what 'similarity' is in a good way. Fast to train, slow to classify

Naive Bayes: have to choose the distribution. Can deal with missing data

Decision Tree: Can provide literal flow charts, sensitive to outliers

Ensemble: Less prone to overfitting

Boosted: More parameters to tweak, more prone to overfitting than normal ensembles

Data Science for Social Good

Police brutality

  • Early Intervention System
  • Currently: Inaccurate and unreliable
  • Arrests/dispatches
  • Match data + interventions
  • Can we predict which dispatches can become adverse?

Lead Poisoning

  • Our de facto policy is that kids are lead detectors
  • Combine blood tests + lead inspections + open data about buildings
  • Predict a kid's chance of lead poisoning in the window from -6 to 12 months (i.e., starting before birth)
  • Can get inspectors to go before birth
  • Trying to implement this into the electronic medical records system

High School Dropouts

  • Schools don't know how to prioritize the kids that are at risk
  • Identifying kids not in 9th grade, but in 7th grade
  • Can start designing interventions that are targeted towards high-risk kids

EPA Hazardous Waste

  • Find out who is likely to violate in the future
  • The goal isn't really to find violations, it's actually to deter and change
  • You can be "very efficient, but completely useless"

Home Inspections

  • Can we find out the code violations that lead to blight?

Recidivism in the Criminal Justice system

  • Criminal Justice + ER + Mental Health intersections
  • Cycles happen at the intersections, and this happens before long stays in jail
  • Can we identify people early, through homeless shelters/ERs, so that we can do preventative work?

Problem Templates

  • Can I detect ____ early?
  • Can I determine which ___ to prioritize?
  • Which policies do I modify to improve ____?
  • How much impact is ________ having?
  • Can I get data that helps me?

Common Challenges

  • Privacy
  • Security
  • Interpretability
  • Transparency
  • Fairness and Ethics

What we need

  • Problem formulation
  • Programming
  • Stats and ML
  • Econometrics & Social Science Methods
  • Experimental Design
  • Ethics and Legal Issues
  • Communication

FizzBuzz with TensorFlow

"Write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”."

Outputs

  1. Output number
  2. Output fizz
  3. Output buzz
  4. Output fizzbuzz

Feature selection:

  1. Divisible by 3

  2. Divisible by 5

In [36]:
import numpy as np

def x(i):
    return np.array([1, i % 3 == 0, i % 5 == 0])

What if we aren't that clever?

Set of numbers in binary encoding, say 10 digits up to 1023

So train on 101-1023, then use 1-100 as test set.
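
A minimal sketch of that encoding and split (the helper name is mine):

    import numpy as np

    NUM_DIGITS = 10                       # 2**10 - 1 = 1023

    def binary_encode(i, num_digits=NUM_DIGITS):
        # least-significant bit first
        return np.array([(i >> d) & 1 for d in range(num_digits)])

    X_train = np.array([binary_encode(i) for i in range(101, 1024)])  # train: 101-1023
    X_test  = np.array([binary_encode(i) for i in range(1, 101)])     # test: 1-100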

Neural nets:

  • Inputs (multiplied by weights) -> hidden layer 1 (do computation, apply an activation) -> output

Solving

  • Using Tensorflow
  • Using Keras (see the sketch after this list)

    • Standard import
    • List layer
    • Compile model
    • Train the model
  • Linear regression is a neural network with no hidden layers

  • A "dense layer" is just linear regression
  • Binary input isn't good enough; decimal encoding makes it easy to tell whether a number is divisible by 5
    • gets buzz right
  • Logistic regression is a neural net with no hidden layers and a sigmoid activation function

  • Decimal encoding is really good at 'divisible by 5' and terrible at everything else. Back to binary.

  • Train the neural network, varying the number of hidden units

    • at 25 hidden units, it gets pretty good, then starts overfitting to the training set
  • Deep Learning

    • add another hidden layer with some dropouts
    • with 2000X2000 layers, it mostly works
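
A minimal hedged sketch of the Keras steps above (standard import, list of layers, compile, train) on the binary encoding; the 25-unit hidden layer follows the shallow model discussed below, while the optimizer, epoch count, and helper names are my assumptions:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    def fizz_buzz_encode(i):
        # one-hot over the four outputs: number, fizz, buzz, fizzbuzz
        if i % 15 == 0: return [0, 0, 0, 1]
        if i % 5 == 0:  return [0, 0, 1, 0]
        if i % 3 == 0:  return [0, 1, 0, 0]
        return [1, 0, 0, 0]

    # binary_encode as in the earlier sketch; train on 101-1023
    X = np.array([binary_encode(i) for i in range(101, 1024)])
    y = np.array([fizz_buzz_encode(i) for i in range(101, 1024)])

    model = Sequential([
        Dense(25, activation='relu', input_dim=10),  # one hidden layer, 25 units
        Dense(4, activation='softmax'),              # number / fizz / buzz / fizzbuzz
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(X, y, epochs=1000, verbose=0)          # then score on 1-100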

How does this work?

  • The 25-hidden-unit shallow net is the simplest interesting model. It gets all the divisible-by-15 cases right.
    • Which inputs produce largest "fizz buzz" values?
    • Last column only needs to be larger than other columns
    • Pairs of numbers that differ by 120 produce similar outputs
    • If two numbers differ by a multiple of 15, same output
    • If a network could ignore differences that are multiples of 15, that would be a good start
    • Then only have to learn each equivalence class
  • Which outputs are closest to the output for 450?
    • Look at the binary representations
    • There's a lot of bit flips that end up being multiples of 15
    • If a network treats those bits the same, it will produce the same output on those inputs
    • Output of last dense layer (+8 -128)
    • Output of first dense layer (+8 -128) learning to ignore 120 differences
    • Also suggests why binary encoding does better

Lessons Learned

  • Feature selection is important
  • Stupid problems sometimes contain subtleties
  • Sometimes 'black-box models' can reveal such subtleties

Github

Go look at:

Pyglmnet

  • Generalized linear models encompass a broad class of models
  • Elastic net is an excellent algorithm for regularization
  • Scikit-learn only has implementations of linear and logistic models
  • Often we want to model spike counts (modeled as a Poisson process)
  • To fit the parameters in a linear regression, we minimize the mean squared loss
  • The negative log-likelihood (under Gaussian noise) is identical to the mean squared loss
  • To go from linear regression to GLM (replace normal distribution with exponential family + pointwise nonlinearity)

Example: Poisson Regression

We assume:

  1. Nonlinearity
  2. Noise distribution
  • Regularization
  • Ridge Regression
    • Good for problems with lots of parameters
    • Doesn't work well when only a few features are predictive
  • Lasso Regression

Elastic net takes the best of both worlds!

  • Elastic net regression

    • Active set
    • Cyclic coordinate descent with Newton update
    • Theano
    • Tensorflow
  • Optimize the penalized NLL with gradient descent.

  • Optimize better with active set + coordinate descent + Newton update
  • Use sympy for calculus or use theano
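
A hedged sketch of fitting a Poisson GLM with an elastic-net penalty in pyglmnet; the constructor arguments (distr, alpha as the lasso/ridge mix, reg_lambda as the penalty strength) reflect my reading of its API, and the data is simulated:

    import numpy as np
    from pyglmnet import GLM

    # simulate spike counts: only a few features are predictive
    rng = np.random.RandomState(0)
    X = rng.randn(1000, 10)
    beta = np.zeros(10)
    beta[:3] = [0.5, -0.4, 0.3]
    y = rng.poisson(np.exp(X @ beta))        # exp() is the pointwise nonlinearity

    glm = GLM(distr='poisson', alpha=0.5, reg_lambda=0.01)  # alpha=1 lasso, 0 ridge
    glm.fit(X, y)
    y_hat = glm.predict(X)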

ML using Scikit-learn pipelines

Kevin Goetsch

In [45]:
import sklearn

What are pipelines?

  • Container of steps
    • Transformer
    • Estimator
    • Pipeline
    • FeatureUnion
  • Used to package a model

Building blocks of sklearn pipelines

  • A Pipeline routes the output of a transformer as input into an estimator
  • A FeatureUnion joins the results of multiple pipeline outputs

Why pipelines ?

  • transformations written out at once
  • easy to swap out pieces
  • readability
  • keeps intermediate steps together

What is a transformer?

transform(X, [y])
  • applies transformations to X
fit(X, [y])
  • applies fit logic

Calling transform on the transformer is identical to manually subselecting the target column.
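
A minimal hedged sketch of a custom transformer that does exactly that subselection (the class name is mine, not the talk's):

    from sklearn.base import BaseEstimator, TransformerMixin

    class ColumnSelector(BaseEstimator, TransformerMixin):
        """Select a subset of columns from a DataFrame."""
        def __init__(self, columns):
            self.columns = columns

        def fit(self, X, y=None):
            return self              # no fit logic needed

        def transform(self, X):
            return X[self.columns]   # identical to manual subselection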

What is an estimator?

Any sklearn object which makes predictions

What is a pipeline?

Pipeline of transforms with a final estimator

What's a feature union?

Horizontal pipeline.
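
Putting the building blocks together, a minimal hedged sketch (step names and transformers are illustrative, not the talk's):

    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression

    features = FeatureUnion([          # 'horizontal': joins outputs side by side
        ('scaled', StandardScaler()),
        ('pca', PCA(n_components=2)),
    ])

    pipeline = Pipeline([              # routes transformer output into the estimator
        ('features', features),
        ('clf', LogisticRegression()),
    ])
    # pipeline.fit(X_train, y_train); pipeline.predict(X_test)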

DS Lifecycle

Feature Engineering

  • Reuse transformers
  • Apply identical transformations to training and test

Ensemblage = combining the output of multiple pipelines

Model Selection

  • Track interactions + hierarchy
  • Easy model stacking
  • No tracking intermediate data
joblib.dump(pipeline, 'file.pkl')

Reloading

  • can see what parameters went into each step
  • GridSearchCV can tune hyperparameters inside a FeatureUnion (sketch below)
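
A hedged sketch of both points, assuming the pipeline from the sketch above and scikit-learn 0.18+ module paths:

    import joblib
    from sklearn.model_selection import GridSearchCV

    pipeline = joblib.load('file.pkl')     # parameters of every step are visible
    print(pipeline.get_params())

    grid = GridSearchCV(pipeline, param_grid={
        'features__pca__n_components': [2, 5],   # reaches inside the FeatureUnion
        'clf__C': [0.1, 1.0, 10.0],
    })
    # grid.fit(X_train, y_train)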

Lightning talks

RMarkdown

  • Like Jupyter notebooks for R, but better

Rcloud is cool

Corn

  • Crop data + weather data

Dicompyler

conda-forge easter eggs

  • Jonathan J. Helmus
  • Collection of recipes, build infrastructure, and packages
  • builds are done on CI services
    conda install -c conda-forge tensorflow
    
  • Python 1.5 is packaged as an easter egg
  • True and False weren't added until Python 2.3
  • No exit command

Data Ethics

  • @herdingbats
  • tmcgovern@oreilly.com
  • Not law and not privacy
  • Ethics is about the design, law is about the application
  • We have imperfect data
  • If you're not making things better, you're making things worse
  • Power/Surveillance
  • Privacy: "the right to be let alone" ~ Louis Brandeis and Samuel Warren
  • either hypervisible or ignored: the poor/rich
  • Ethical problems are revealed in the way we treat the most vulnerable.
  • Ongoing fine-grained consent
  • Designing for fairness in the age of algorithms

Safia and nteract

  • nteract
  • Tenets
    • composability
    • simplicity
    • transparency
    • kind community
  • Open notebooks from the file explorer
  • React components
  • Git integration
  • Real-time collaboration

Christy Comp Bio @ Field Museum

  • R package 'ape'
  • currently working on interactive visualization of phylogenetic tree for python
  • Ivy

Problems with jupyter

  • Long running processes lock up Jupyter
  • Use async calls
    from multiprocessing.pool import ThreadPool
    pool = ThreadPool(4)
    pool.apply_async(func)   # func: placeholder for the long-running call
    
  • The point isn't performance; it's being able to run the process asynchronously
  • apply_async has a callback function
    from slack import slack_msg
    
  • for the callback, you can pass a lambda that sends a Slack message
    import smtplib
    
  • Send yourself an email or text
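
A minimal sketch tying those pieces together; notify() just prints, but the talk's slack_msg or an smtplib email would slot into the same callback:

    import time
    from multiprocessing.pool import ThreadPool

    def long_job():
        time.sleep(60)                  # stands in for the long-running work
        return 'done'

    def notify(result):                 # runs when the job finishes
        print('job finished:', result)  # swap in slack_msg(...) or an email here

    pool = ThreadPool(4)
    pool.apply_async(long_job, callback=notify)
    # the notebook stays responsive while the job runs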

Bob: Geopandas

Elizabeth Wickes

  • Rules of crap
    • know where it is
    • know how your crap works
    • make metacrap to understand it
    • preserve your crap
    • name your crap with meaningful crap

Excel is real

  • There are many people who still love and use Excel

Jose: figuring out a file format

  • 2009-2015 trading data
  • 6 million messages/day in 2009
  • 60 million messages/day in 2015
  • fixtools

I'm not crazy

  • Building everything from source
  • Keeping source separate from build
  • Can see all the build artefacts
  • Writing own virtualenv for zsh