PyData 2016 - Chicago - Day 2

Keynote: Scaling Human Learning

Katy Huff

Nuclear Engineer @ UIUC

  • The Hacker Within

    • Models that have worked
    • Models that have failed
      • single point-of-failure
  • Software Carpentry

  • Data Carpentry

  • Tools

  • Why do it?

    • community
    • travel
    • teaching experience
    • felt need
  • What we should do?

    • Lower the barrier, not the standards
    • New Dev instructions
    • Document well
    • Curate low hanging fruit
    • Targeted sprints
    • Appoint ambassador
    • Consider Users Conferences


Feather

  • Exchange tabular data between Python, R, and others
  • Fast read/write
  • Represent categorical features
  • it's about the metadata
  • memory access cost depends on both location + predictability
  • sequential access FTW


  • on-disk representation should be similar to in-memory representation
  • columnar layout is good fit for analytic workflows
    • columnar layout from arrow


feather.read_dataframe(path, columns=None)
  • each column is serialized into a dataframe
  • bitmask of nulls
  • values
  • looks like a data frame in R
  • no current in-place concatenations
  • how does it handle escape characters?
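
The column layout in the list above (a bitmask of nulls next to a contiguous run of values) can be sketched in plain Python. This is an illustration of the idea only, not Feather's actual on-disk format; `encode_column` and `decode_column` are hypothetical helper names.

```python
from array import array


def encode_column(values):
    """Split a list that may contain None into (null bitmask, packed values)."""
    bitmask = bytearray((len(values) + 7) // 8)  # one bit per row
    packed = array("d")  # contiguous float64 buffer, good for sequential access
    for i, value in enumerate(values):
        if value is None:
            bitmask[i // 8] |= 1 << (i % 8)  # bit i set means row i is null
            packed.append(0.0)  # placeholder so row offsets stay fixed
        else:
            packed.append(float(value))
    return bytes(bitmask), packed


def decode_column(bitmask, packed):
    """Rebuild the original list from the bitmask and the packed values."""
    return [
        None if (bitmask[i // 8] >> (i % 8)) & 1 else packed[i]
        for i in range(len(packed))
    ]


column = [1.5, None, 3.0, None, 5.0]
mask, data = encode_column(column)
restored = decode_column(mask, data)
```

Because the values sit in one fixed-width buffer, a reader can scan them sequentially and only consult the bitmask to decide which slots to ignore.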

Future of feather

  • in-place operations
  • zero parsing or copying to Pandas memory
  • mmap the feather file
  • input to sklearn or statsmodels
  • output from PostgreSQL

Use case

  • passing data between R and Python in computation
  • As part of a Luigi Pipeline

Microsoft Cognitive Services


Cognitive Services API

  • Times per month
  • Times per minute

Explaining classification algorithms

  • Sorting Hat
  • Spam filter


  • Need labelled data for training
  • feature = dimension = column = attribute
  • class = categorization


  • Choosing good features or getting more data will help more than changing algorithms

How do we find the least terrible line using gradient descent?
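
A minimal sketch of the idea, assuming "terrible" is measured by mean squared error (the notes do not record which loss the talk used). The function name `fit_line`, the learning rate, and the toy data are all illustrative choices.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of mean((w*x + b - y)^2) with respect to w and b
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step downhill, against the gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1, so the best line is known in advance
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
```

Each step nudges the slope and intercept in the direction that reduces the error most; too large a learning rate would make the updates overshoot and diverge.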

Implementing a spam filter

In [15]:
import numpy as np
import sklearn
In [16]:
x = np.array([[0 , 0.1], [3, 0.2], [5, 0.1]])
y = np.array([1,2,1])
In [17]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
model = LinearDiscriminantAnalysis()
model.fit(x, y)
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/ UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
              solver='svd', store_covariance=False, tol=0.0001)
In [18]:
new_point = np.array([1, .3])
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/utils/ DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

Most of the time, we don't use a linear discriminant classifier. Other models include logistic regression, which fits well when there's a gradient between the two classes that can be described with the logistic function.

In [26]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

SVM = better definition of 'terrible'

  • lines can turn into non-linear shapes if you transform your data
  • the kernel trick: take the square of each number
  • RBF SVM: radial basis function SVM. Creates more complex shapes. Most popular kernel
  • SVM also tries to maximize the margins
In [29]:
from sklearn.svm import LinearSVC
model = LinearSVC()
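
The "take the square of each number" idea can be shown with an explicit feature transform. This is a sketch on made-up 1-D data; a real kernel SVM gets the same effect implicitly rather than by building the squared features, and `separable_by_threshold` is a hypothetical helper.

```python
# Class 1 lies on both sides of class 0, so no single cut on the raw axis works
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, 0, 0, 0, 1, 1]


def separable_by_threshold(features, labels):
    """True if one cut point puts every 0 on one side and every 1 on the other."""
    zeros = [f for f, label in zip(features, labels) if label == 0]
    ones = [f for f, label in zip(features, labels) if label == 1]
    return max(zeros) < min(ones) or max(ones) < min(zeros)


raw_ok = separable_by_threshold(points, labels)

# After squaring, class 0 collapses near zero and class 1 moves far away
squared = [p * p for p in points]
squared_ok = separable_by_threshold(squared, labels)
```

A straight cut in the squared space corresponds to a non-linear boundary (two cut points) in the original space.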


K-Nearest Neighbors

  • What do similar cases look like?
  • k = how many?
  • Tie-breaking
In [27]:
from sklearn.neighbors import NearestNeighbors
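
A plain-Python sketch of the three questions above: similarity is taken to be Euclidean distance, `k` is a parameter, and ties go to the class of the closest tied neighbour. All three choices are assumptions for illustration, and `knn_predict` is a hypothetical helper rather than a scikit-learn function.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest = [label for _, label in ranked[:k]]
    counts = Counter(nearest)
    top = max(counts.values())
    tied = {label for label, count in counts.items() if count == top}
    # Tie-break: walk outward from the query and take the first tied class
    for label in nearest:
        if label in tied:
            return label


# Same toy data as the linear discriminant cell above
train_x = [[0, 0.1], [3, 0.2], [5, 0.1]]
train_y = [1, 2, 1]
query = [2.5, 0.2]

pred_k1 = knn_predict(train_x, train_y, query, k=1)
pred_k2 = knn_predict(train_x, train_y, query, k=2)
pred_k3 = knn_predict(train_x, train_y, query, k=3)
```

With `k=2` the vote is one against one, so the tie-break rule decides; with `k=3` the two farther points outvote the closest one.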

Decision Tree learners

  • Make a flow chart of it
  • In higher dimensions
  • Prone to overfitting
  • Use Pydot
In [30]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
import pydot
# sklearn.tree.export_graphviz() + pydot: export the fitted tree as Graphviz dot, then render it
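
A hedged sketch of the flow-chart export, assuming the iris dataset as stand-in data and a shallow `max_depth` to limit overfitting (both are choices made here, not taken from the talk).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()

# A shallow tree keeps the flow chart readable and is less prone to overfitting
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(iris.data, iris.target)

# With out_file=None, export_graphviz returns the Graphviz dot source as a string
dot_source = export_graphviz(
    model,
    out_file=None,
    feature_names=iris.feature_names,
)
```

The resulting string can be handed to pydot or to the Graphviz `dot` command-line tool to draw the tree.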

Ensemble Model: deals with overfitting problems


Bagging:

  • Split training set
  • Train one model each
  • Models 'vote'
  • Sum of the decision boundaries of its components

Random Forest:

  • Like bagging
  • At each split randomly constrain features to choose from

Extra trees:

  • For each split, make it random, non-optimally
  • Compensate by making a ton of trees


Voting:

  • Combine a bunch of different models of your design, have them 'vote' on the correct answer
  • For example (KNN, SVM, Decision Tree)
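
A sketch of the KNN + SVM + decision tree combination using scikit-learn's `VotingClassifier`. The iris dataset, the train/test split, and the hyperparameters are assumptions chosen only to make the example runnable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# voting="hard" means each model casts one vote and the majority class wins
ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("svm", SVC()),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
accuracy = ensemble.score(X_test, y_test)
```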


Boosting:

  • Train models in order, make the later ones focus on the points the earliest ones missed
from sklearn.ensemble import [insert model here]

How Do I pick?

  • Nonlinear decision boundary
  • Providing probability estimates
  • Tell how important a feature is to the model
Model               | Nonlinear boundary | Probability estimate      | Feature importance
Logistic Regression | Yes                | Not really                | No
KNN                 | Yes                | Sort of (% nearby points) | No
Naive Bayes         | Yes                | Yes                       | No
Decision Tree       | Yes                | No                        | Kinda
Ensemble            | Yes                | Kinda (% agreement)       | Kinda

Model               | Can I update?               | Easy to parallelize?
Logistic Regression | Kinda                       | Kinda
SVM                 | Kinda, depending on kernel  | Yes for some kernels, no for others
KNN                 | Yes                         | Yes
Naive Bayes         | Yes                         | Yes
Decision Tree       | No                          | No (but it's really fast)
Ensemble            | Kinda, by adding new models | Yes
Boosted             | Kinda, by adding new models | No

Other quirks:

SVM: Pick a kernel

KNN: need to define what 'similarity' is in a good way. Fast to train, slow to classify

Naive bayes: Have to choose the distribution. Can deal with missing data

Decision Tree: Can provide literal flow charts, sensitive to outliers

Ensemble: Less prone to overfitting

Boosted: More parameters to tweak, more prone to overfitting than normal ensembles

Data Science for Social Good

Police brutality

  • Early Intervention System
  • Currently: Inaccurate and unreliable
  • Arrests/dispatches
  • Match data + interventions
  • Can we predict which dispatches can become adverse?

Lead Poisoning

  • Our de facto policy is that kids are lead detectors
  • Combine blood tests + lead inspections + open data about buildings
  • From 6 months before birth to 12 months of age, predict a kid's chance of lead poisoning
  • Can get inspectors to go before birth
  • Trying to implement this into the electronic medical records system

High School Dropouts

  • Schools don't know how to prioritize the kids that are at risk
  • Identifying kids not in 9th grade, but in 7th grade
  • Can start designing interventions that are targeted towards high-risk kids

EPA Hazardous Waste

  • Find out who is likely to violate in the future
  • The goal isn't really to find violations, it's actually to deter and change
  • You can be "very efficient, but completely useless"

Home Inspections

  • Can we find out the code violations that lead to blight?

Recidivism in the Criminal Justice system

  • Criminal Justice + ER + Mental Health intersections
  • Cycles happen at the intersections; and this happens before long stays in jail
  • Can we identify people early through homeless shelters/ERs so that we can do preventative work?

Problem Templates

* Can I detect ____ early?
* Can I determine which ___ to prioritize?
* Which policies do I modify to improve ____?
* How much impact is ________ having?
* Can I get data that helps me ____?

Common Challenges

  • Privacy
  • Security
  • Interpretability
  • Transparency
  • Fairness and Ethics

What we need

  • Problem formulation
  • Programming
  • Stats and ML
  • Econometrics & Social Science Methods
  • Experimental Design
  • Ethics and Legal Issues
  • Communication

FizzBuzz with TensorFlow

"Write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”."


  1. Output number
  2. Output fizz
  3. Output buzz
  4. Output fizzbuzz

Feature selection:

  1. Divisible by 3

  2. Divisible by 5

In [36]:
import numpy as np
def x(i):
    return np.array([1, i% 3 ==0, i%5 ==0])

What if we aren't that clever?

Set of numbers in binary encoding, say 10 digits up to 1023

So train on 101-1023, then use 1-100 as test set.
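
A plain-Python sketch of that setup: 10-bit binary inputs, four output classes in the order listed earlier (number, fizz, buzz, fizzbuzz), and the 101-1023 / 1-100 split. The helper names and the least-significant-bit-first ordering are assumptions made for illustration.

```python
NUM_DIGITS = 10  # 10 binary digits cover 0..1023


def binary_encode(i, num_digits=NUM_DIGITS):
    """Return the bits of i as a list, least significant bit first."""
    return [(i >> d) & 1 for d in range(num_digits)]


def fizz_buzz_class(i):
    """Class index: 0 = number, 1 = fizz, 2 = buzz, 3 = fizzbuzz."""
    if i % 15 == 0:
        return 3
    if i % 5 == 0:
        return 2
    if i % 3 == 0:
        return 1
    return 0


# Train on 101..1023 so that 1..100 stays unseen for testing
train_numbers = range(101, 2 ** NUM_DIGITS)
test_numbers = range(1, 101)

train_x = [binary_encode(i) for i in train_numbers]
train_y = [fizz_buzz_class(i) for i in train_numbers]
test_x = [binary_encode(i) for i in test_numbers]
test_y = [fizz_buzz_class(i) for i in test_numbers]
```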

Neural nets:

  • Inputs (multiply by weights) -> Hidden layer 1 (do computation) (apply an activation) -> output


  • Using Tensorflow
  • Using Keras

    • Standard import
    • List layer
    • Compile model
    • Train the model
  • Linear regression is a neural network with no hidden layers

  • A "dense layer" is just linear regression
  • Binary input isn't good enough; decimal encoding makes it easy to tell whether or not a number is divisible by 5
    • gets buzz right
  • Logistic regression is a neural net with no hidden layers and a sigmoid activation function

  • Decimal encoding is really good at 'divisible by 5' and terrible at everything else. Back to binary.

  • Train neural network by # of hidden units

    • at 25 hidden units, it gets pretty good, then starts overfitting to the training set
  • Deep Learning

    • add another hidden layer with some dropouts
    • with 2000X2000 layers, it mostly works

How does this work?

  • 25-hidden-neuron shallow net has simplest interesting model. Gets all the divisible by 15 right.
    • Which inputs produce largest "fizz buzz" values?
    • Last column only needs to be larger than other columns
    • Pairs of numbers that differ by 120 produce similar outputs
    • If two numbers differ by a multiple of 15, same output
    • A network that could ignore differences that are multiples of 15 would be a good start
    • Then only have to learn each equivalence class
  • Which outputs are closest to the output for 450?
    • Look at the binary representations
    • There's a lot of bit flips that end up being multiples of 15
    • If a network treats those bits the same, it will produce the same output for those numbers
    • Output of last dense layer (+8 -128)
    • Output of first dense layer (+8 -128) learning to ignore 120 differences
    • Also suggests why binary encoding does better
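
One reading of the "+8 -128" note, offered as an assumption: turning the 8s bit on and the 128s bit off changes a number by -120, a multiple of 15, so the correct label cannot change. The check below verifies that arithmetic for every eligible number up to 1023; `flip_bits` is a hypothetical helper.

```python
def fizz_buzz_class(i):
    """Class index: 0 = number, 1 = fizz, 2 = buzz, 3 = fizzbuzz."""
    if i % 15 == 0:
        return 3
    if i % 5 == 0:
        return 2
    if i % 3 == 0:
        return 1
    return 0


def flip_bits(i):
    """Toggle the 8s bit (bit 3) and the 128s bit (bit 7)."""
    return i ^ (1 << 3) ^ (1 << 7)


# Numbers whose 8s bit is off and whose 128s bit is on, e.g. 450 = 0b111000010
candidates = [i for i in range(1, 1024) if not (i & 8) and (i & 128)]

# For these numbers the two flips amount to +8 - 128 = -120
always_minus_120 = all(flip_bits(i) == i - 120 for i in candidates)

# 120 is a multiple of 15, so the label is unchanged by the flips
same_class = all(
    fizz_buzz_class(i) == fizz_buzz_class(flip_bits(i)) for i in candidates
)
example = flip_bits(450)
```

A network that learns to treat this pair of bit flips as a no-op gets a whole family of inputs right for free, which a decimal encoding does not offer as directly.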

Lessons Learned

  • Feature selection is important
  • Stupid problems sometimes contain subtleties
  • Sometimes 'blackbox models' can reveal such subtleties


Go look at:


  • Generalized linear models encompass a broad class of models
  • Elastic net is an excellent algorithm for regularization
  • Scikit-learn only has implementation of linear and logistic models
  • Often want to model spike counts (modelled as poisson process)
  • To fit the parameters in a linear regression, we minimize the mean squared loss
  • With Gaussian noise, the negative log-likelihood is identical to the mean squared loss (up to constants)
  • To go from linear regression to GLM (replace normal distribution with exponential family + pointwise nonlinearity)

Example: Poisson Regression

We assume:

  1. Nonlinearity
  2. Noise distribution
  • Regularization
  • Ridge Regression
    • Good for problems with lots of parameters
    • Doesn't work well when only a few features are predictive
  • Lasso Regression

Elastic net takes the best of both worlds!
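
Written out in one common textbook form (an assumption; the notes do not record the speaker's notation or choice of nonlinearity), Poisson regression with an elastic net penalty looks like this. Here $\rho$ is the overall regularization strength and $\alpha \in [0, 1]$ mixes the two penalties.

```latex
\begin{aligned}
\lambda_i &= \exp\!\left(\beta_0 + \beta^\top x_i\right)
  && \text{(nonlinearity)} \\
y_i &\sim \mathrm{Poisson}(\lambda_i)
  && \text{(noise distribution)} \\
\mathcal{L}(\beta_0, \beta) &=
  \frac{1}{n}\sum_{i=1}^{n}\left(\lambda_i - y_i \log \lambda_i\right)
  + \rho\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
  + \alpha\,\lVert\beta\rVert_1\right]
  && \text{(penalized negative log-likelihood)}
\end{aligned}
```

Setting $\alpha = 0$ leaves only the ridge term and $\alpha = 1$ leaves only the lasso term; the $\log y_i!$ term of the Poisson likelihood is dropped because it does not depend on the parameters.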

  • Elastic net regression

    • Active set
    • Cyclic coordinate descent with Newton update
    • Theano
    • Tensorflow
  • Optimize penalized NLL with gradient descent.

  • Optimize better with active set + coordinate descent + Newton update
  • Use sympy for calculus or use theano

ML using Scikit-learn pipelines

Kevin Goetsch

In [45]:
import sklearn

What are pipelines?

  • Container of steps
    • Transformer
    • Estimator
    • Pipeline
    • FeatureUnion
  • Used to package a model

Building blocks of sklearn pipelines

  • Pipeline routes output of transformer as input into estimator
  • FeatureUnion joins the results of both pipeline outputs

Why pipelines ?

  • transformations written out at once
  • easy to swap out pieces
  • readability
  • keeps intermediate steps together

What is a transformer?

transform(x, [y])
  • applies transformations on X
fit(x, [y])
  • applies fit logic

Calling transform on the transformer is identical to manually subselecting target factor.
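
A minimal sketch of a custom transformer that subselects columns. The class name `ColumnSelector` and the toy array are hypothetical; only `BaseEstimator` and `TransformerMixin` come from scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Keep only the requested columns of a 2-D array."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn for a pure column selection
        return self

    def transform(self, X, y=None):
        return np.asarray(X)[:, self.columns]


X = np.array([[1, 10, 100], [2, 20, 200]])

# fit_transform is provided by TransformerMixin: it calls fit, then transform
selected = ColumnSelector(columns=[0, 2]).fit_transform(X)
manual = X[:, [0, 2]]
```

The transformer gives the same result as slicing by hand, but it can be dropped into a pipeline and reapplied unchanged to test data.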

What is an estimator?

Any sklearn objects which make predictions

What is a pipeline?

Pipeline of transforms with a final estimator

What's a feature union ?

Horizontal pipeline.
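
A sketch of how the pieces fit together: a `FeatureUnion` runs two transformers side by side and concatenates their columns, and a `Pipeline` chains that into a final estimator. The dataset, step names, and choice of transformers are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# "Horizontal" step: 2 PCA components plus the single best original feature
features = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("kbest", SelectKBest(k=1)),
])

# "Vertical" chain: scale, build features, then classify
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)

# Inspect what the estimator actually sees: 2 + 1 = 3 columns
scaled = pipeline.named_steps["scale"].transform(X)
combined = pipeline.named_steps["features"].transform(scaled)
train_accuracy = pipeline.score(X, y)
```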

DS Lifecycle

Feature Engineering

  • Reuse transformers
  • Apply identical transformations to training and test

Ensemblage = combining the output of multiple pipelines

Model Selection

  • Track interactions + hierarchy
  • Easy model stacking
  • No tracking intermediate data
joblib.dump(pipeline, 'file.pkl')


  • can see what parameters went into each step
  • GridSearchCV can tune hyperparameters inside a FeatureUnion
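
A sketch of tuning nested parameters, assuming the iris dataset and illustrative step names. Parameters are addressed as `step__substep__parameter`, with double underscores separating the levels.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("features", FeatureUnion([("pca", PCA()), ("kbest", SelectKBest())])),
    ("svm", SVC(kernel="linear")),
])

# Keys reach through the pipeline into the FeatureUnion's transformers
param_grid = {
    "features__pca__n_components": [1, 2, 3],
    "features__kbest__k": [1, 2],
    "svm__C": [0.1, 1, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

# get_params lists every tunable parameter of every step
all_params = pipeline.get_params()
```

`search.best_params_` then holds the winning combination, and `search.best_estimator_` is the refitted pipeline.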

Lightning talks


  • Like Rjupyter but better

Rcloud is cool


  • Crop data + weather data


conda-forge easter eggs

  • Jonathan J. Helmus
  • Collection of recipes, build infrastructure, and packages
  • builds are done on CI services
    conda install -c conda-forge tensorflow
  • Conda 1.5
  • True and False weren't added to Python until 2.3
  • No exit command

Data Ethics

  • @herdingbats
  • Not law and not privacy
  • Ethics is about the design, law is about the application
  • We have imperfect data
  • If you're not making things better, you're making things worse
  • Power/Surveillance
  • Privacy "the right to be ignored" ~ Louis Brandeis and Samuel Warren
  • either hypervisible or ignored: the poor/rich
  • Ethical problems are revealed in the way we treat the most vulnerable.
  • Ongoing fine-grained consent
  • Designing for fairness in the age of algorithms

Safia and nteract

  • nteract
  • Tenets
    • composability
    • simplicity
    • transparency
    • kind community
  • Open notebooks from the file explorer
  • React components
  • Git integration
  • Real-time collaboration

Christy Comp Bio @ Field Museum

  • R package 'ape'
  • currently working on interactive visualization of phylogenetic tree for python
  • Ivy

Problems with jupyter

  • Long running processes lock up Jupyter
  • Use async calls
    from multiprocessing.pool import ThreadPool
    pool = ThreadPool(4)
  • Not for performance gains; the point is being able to run a process asynchronously
  • apply_async has a callback function
    from slack import slack_msg
  • for callback - can do a lambda function that calls a slack message
    import smtplib
  • Send yourself an email or text
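
A stdlib-only sketch of the pattern in the list above. The `notify` callback here just appends to a list; in practice it would be replaced by whatever sends the Slack message, email, or text. `long_running_job` and `notify` are hypothetical names.

```python
import time
from multiprocessing.pool import ThreadPool


def long_running_job(seconds):
    """Stand-in for a slow computation that would otherwise block the notebook."""
    time.sleep(seconds)
    return "slept for {} s".format(seconds)


notifications = []


def notify(result):
    """Callback run when the job finishes; swap in a Slack or email call here."""
    notifications.append(result)


pool = ThreadPool(4)

# apply_async returns immediately, so the notebook stays responsive
async_result = pool.apply_async(long_running_job, (0.1,), callback=notify)

# Only needed in this example so the result can be inspected deterministically
value = async_result.get(timeout=5)
pool.close()
pool.join()
```

In a notebook the `get` call would usually be skipped, leaving the callback to announce completion while other cells keep running.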

Bob: Geopandas

Elizabeth Wickes

  • Rules of crap
    • know where it is
    • know how your crap works
    • make metacrap to understand it
    • preserve your crap
    • name your crap with meaningful crap

Excel is real

  • There are many people who still love and use Excel

Jose: figuring out a file format

  • 2009-2015 trading data
  • 6 million messages/day in 2009
  • 60 million messages/day in 2015
  • fixtools

I'm not crazy

  • Building everything from source
  • Keeping source separate from build
  • Can see all the build artefacts
  • Writing own virtualenv for zsh