PyData 2016 - Chicago - Day 3

Developing Communities to Develop Themselves

Ingestion | Representation | Analysis | Visualization

Phases:

  • Mostly developers, few non-dev users
  • Increasing non-dev users, many drive-by devs
  • Now there's some non-user devs

Sustaining Software

  • Actually, "sustaining inquiry"
  • "Conducting inquiry via software"

Do we actually need to sustain specific software?

  • We want to sustain users, devs, their careers, and their futures as a whole

What is sustainability?

  • Keep up with bug reports?
  • Add new features?
  • Compatible with new hardware?
  • People can learn to use it?
  • Grows new features?
  • Produces same results?
  • Transitions between people?

$S(N) = \frac{1}{(1-P) + P/N}$

Amdahl's Law as a way of quantifying bottlenecks: a representation of the bottlenecking in a community, where P is the fraction of work that can be spread across contributors and N is the number of contributors.

Challenges of software maintenance:

  • bitrot
  • burnout
  • boringness

Empowering:

Empowerment sets up a power dynamic

Community

  • How do we increase diversity?
  • How can we foster careers?
  • How can we lower barriers?
    • Acquire self-determination
    • Leadership
    • Usage of software
    • Foster self-determination

Improv Science

Values

  • Two axes: technically challenging vs. socially challenging
  • Detrimental/humanitarian vs. Functional/Problematic
  • Product vs. project
    • Project has a bidirectional flow
    • Product is dictated to you "the thing"
  • Early-stage researchers are the ones who are most damaged by a product that suddenly changes direction.

Models

  • Funded
  • Productized
  • Volunteer

Ashe Dryden's blog post

Barriers to entry (technical & social)

Engagement (methods & modes)

Investment

Credit is not a zero-sum game.

Code is not everything

Request for Commits podcast

Scale

  • Metcalfe's Law of scale (network value grows with the square of the number of connected users)

How does this look technically?

  • Provide clear mechanisms for feedback, credit, and contribution
  • Don't bottleneck on one person who can describe all the contributions

Combining open-source tech to implement an ML workflow

Custom Machine Learning

  • Focus: supervised, discriminative deep learning with gradient-based optimization
    • Alternatives: tree ensembles, probabilistic graphical models (PyMC3, Stan)
  • Goal:
    • find a function $f(y \mid X, W)$ that:
      • parameters: weights W
      • takes input: features X
      • produces output: y
  • Procedures for finding f

    • define an objective function $g(X, y, W)$
      • find a setting for W that optimizes g
      • often g is the difference between predicted y and actual y
    • Define a function to compute the derivative of g with respect to W
      • Often duplicates engineering effort from implementing f and g
      • For DL models, g and its derivative can be complicated
    • Iterate
  • Graph-based numerical computation (see the sketch after this list)

    • step 1 (model + objective implementation) is easy
    • step 2 (differentiation) is automatic
    • step 3 (iterative learning) is fast
  • TensorFlow + alternatives

    • theano
    • keras (wraps TF and theano): mostly for neural nets and DL
    • mxnet
    • torch
    • caffe
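
A minimal sketch of the three steps in TensorFlow's graph style, assuming a TF 1.x-era API; the linear model, shapes, and learning rate are illustrative, not from the talk:

import numpy as np
import tensorflow as tf

# Step 1: define the model f and objective g as a graph
X = tf.placeholder(tf.float32, [None, 3])
y = tf.placeholder(tf.float32, [None, 1])
W = tf.Variable(tf.zeros([3, 1]))
y_hat = tf.matmul(X, W)                    # f: predictions from X and W
g = tf.reduce_mean(tf.square(y_hat - y))   # g: squared-error objective

# Step 2: differentiation is automatic; the optimizer applies dg/dW
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(g)

# Step 3: iterate
X_data = np.random.randn(100, 3).astype(np.float32)
y_data = X_data @ np.array([[1.0], [2.0], [3.0]], np.float32)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(200):
        sess.run(train_op, {X: X_data, y: y_data})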

Multilayer Perceptrons

  • feed-forward neural network
    • no recurrence, no convolutions
    • input layer, 1+ hidden layers, output layer
  • model interactions between features
    • can approximate any function
  • old idea (Rosenblatt, 1961), but can be combined with newer methods
    • ReLUs, dropout, the Adam learning algorithm (see the sketch below)
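
A minimal MLP sketch using Keras (one of the libraries listed earlier); the layer sizes, dropout rate, and binary task are illustrative assumptions:

from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(64, activation='relu', input_dim=20),   # hidden layer with ReLUs
    Dropout(0.5),                                 # dropout regularization
    Dense(64, activation='relu'),                 # second hidden layer
    Dropout(0.5),
    Dense(1, activation='sigmoid'),               # output layer (binary)
])
model.compile(optimizer='adam',                   # the Adam learning algorithm
              loss='binary_crossentropy',
              metrics=['accuracy'])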

Standard Interface for ML

  • Key feature: sklearn.utils.estimator_checks.check_estimator to check API conformity of a custom estimator
  • Facilitates:

    • pipelines of transformations
    • gridsearch over hyperparameters
  • Custom models need

    • fit
    • predict
    • predict_proba
    • __init__ should just attach args (fit does most of the work)
    • everything should be serializable with pickle
      • joblib expects this
  • Civis MLP implementation

  • Pickle
    • TF doesn't support pickling very well
    • TF has a Saver class
    • one can override __getstate__/__setstate__ (see the sketch below)
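
One possible pattern, sketched under assumptions: a hypothetical estimator keeping a TF 1.x graph in self.graph and a live session in self.session, with a helper that can rebuild the graph from the remaining attributes:

import tensorflow as tf

class PickleableTFModel:
    def __getstate__(self):
        state = self.__dict__.copy()
        with self.graph.as_default():
            # pull weights out as plain numpy arrays, which pickle happily
            state['weights_'] = self.session.run(tf.trainable_variables())
        # drop the unpicklable graph/session objects
        for key in ('graph', 'session'):
            state.pop(key, None)
        return state

    def __setstate__(self, state):
        weights = state.pop('weights_')
        self.__dict__.update(state)
        self._build_graph()   # hypothetical helper: recreates graph + session
        with self.graph.as_default():
            for var, value in zip(tf.trainable_variables(), weights):
                self.session.run(var.assign(value))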

Simple Distributed Computing

  • ML models have lots of hyperparameters
    • deep learning: numbers and sizes of hidden layers
    • probabilistic graphical models: parameters for priors
    • tree ensembles: depth, learning rate, samples per split
  • grid search
    • try lots of settings
    • often combined with k-fold CV
    • crude but parallel
  • JobLib

    • originally focused on single-machine parallelism
  • scikit-learn uses joblib to parallelize hyperparameter evaluation in GridSearchCV

  • if joblib uses a custom backend (like dask), sklearn will use that (see the sketch below)
  • Distributed grid search > distributed model fitting
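
A sketch of plugging dask in as the joblib backend for GridSearchCV; the scheduler address and toy data are assumptions:

import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

client = Client('tcp://scheduler:8786')   # hypothetical scheduler address

X, y = make_classification(n_samples=1000, random_state=0)
search = GridSearchCV(LogisticRegression(), {'C': [0.01, 0.1, 1.0, 10.0]}, cv=5)

# each hyperparameter evaluation becomes a task on the dask cluster
with joblib.parallel_backend('dask'):
    search.fit(X, y)
print(search.best_params_)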

WFC (Water for commerce)

  • Investment fund and short-term lending platform for small-to-medium businesses
  • Over 5 years of daily invoices and adjustments
  • 40 day duration $\rightarrow$ 6.25% yield

Why bother with SMB lending?

  • Champion of the supplier
  • All risk is not created equal
  • Possible to do it without
    • rate gouging borrowers
    • misleading investors
  • Great data/tools

Risk

Concentration Risk: measuring diversity

  • More diversity of accounts receivable is better
  • Less concentration with 'junk' buyers

Default risk- forecasting accounts receivable

  • Want to forecast in R (use rpy2; see the sketch at the end of this section)
  • Model types

    • ARMA/ARIMA/SARIMA
    • Exponential smoothing (Holt-Winters)
    • Bayesian Structural Time Series
    • Regression (OLS, polynomial): R's lm function
    • Currently evaluating pyflux
  • Best model seeks to minimize the mean absolute percentage error (MAPE)

  • plots using matplotlib and seaborn
  • use statsmodels for seasonal decomposition
    • visually decouple trend, seasonal, and residual components
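
A sketch of driving R's forecasting from Python via rpy2, assuming R and its forecast package are installed; the series below is illustrative:

from rpy2.robjects import r, FloatVector
from rpy2.robjects.packages import importr

stats = importr('stats')
forecast = importr('forecast')

# toy receivables series (illustrative numbers)
receivables = FloatVector([102.0, 98.5, 110.2, 105.7, 99.1, 108.4,
                           101.3, 97.8, 112.0, 104.2, 100.5, 109.9])
series = stats.ts(receivables, frequency=12)

fit = forecast.auto_arima(series)      # R's auto.arima via rpy2
fc = forecast.forecast(fit, h=40)      # 40-step-ahead forecast
print(r['summary'](fc))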

Default risk- predicting AR discontinuation

"AR Discontinuation": supplier AR dropping to zero with all C2FO buyers

Challenges:

  • Data Leakage
    • do not use anything that would not have been observable at the time of prediction
    • establish a prediction cutoff
    • remove all history after the cutoff date
    • explore different cutoff windows
  • Engineering features
  • Training
    • use scikit-learn for feature engineering
      • encoding categoricals
      • creating polynomials
      • scaling features
      • dimensionality reduction/feature selection
    • use xgboost to train gradient-boosted trees, in conjunction with hyperopt (see the sketch after this list)
      • currently evaluating spearmint
  • Primarily concerned with model recall & not overfitting
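
A sketch of the xgboost + hyperopt combination; the search space, data, and recall scoring follow the notes above, but the specifics are assumptions:

import xgboost as xgb
from hyperopt import fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# imbalanced toy data standing in for the real AR features
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

space = {
    'max_depth': hp.choice('max_depth', [3, 4, 5, 6]),
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.3),
}

def objective(params):
    clf = xgb.XGBClassifier(**params)
    # recall is the primary concern, so optimize (negated) CV recall
    return -cross_val_score(clf, X, y, scoring='recall', cv=3).mean()

best = fmin(objective, space, algo=tpe.suggest, max_evals=25)
print(best)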

Default risk- predicting bankruptcy

  • Different than AR discontinuation
  • Prediction labels are different
  • Daily feeds from national bankruptcy database
  • Matching process
  • Perform data truncation and feature engineering
  • Enrich with macroeconomic data from the right point in time
  • Address class imbalances
  • Train models
    • Clean data
    • Match on 'unique values'
      • TAX IDs & Phone numbers
    • Use string matching on company names (see the sketch after this list)
      • Levenshtein distance, Jaro-Winkler distance, Jaccard distance
    • geographical distance between known addresses
      • Haversine distance
    • tinker with weighting strategy
  • Tips
    • use Cython or Numba
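
A sketch of the string-matching step with the jellyfish library; the names and weights are illustrative, and the jaro_winkler_similarity name assumes a newer jellyfish (older releases called it jaro_winkler):

import jellyfish

def jaccard(a, b):
    # Jaccard distance over word tokens
    sa, sb = set(a.split()), set(b.split())
    return 1 - len(sa & sb) / len(sa | sb)

name1 = 'ACME MANUFACTURING CO'
name2 = 'ACME MFG COMPANY'

lev = jellyfish.levenshtein_distance(name1, name2)
jw = jellyfish.jaro_winkler_similarity(name1, name2)
jac = jaccard(name1, name2)

# tinker with the weighting strategy: blend the signals into one score
score = 0.6 * jw + 0.4 * (1 - jac) - 0.01 * lev
print(lev, jw, jac, score)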

Fraud risk - Screening calls

  • Use NLP to transcribe and mine calls
  • spaCy makes tokenization/lemmatization fast (see the sketch below)
  • identify conversations with red flags
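
A sketch of the spaCy step, assuming the en_core_web_sm model is installed; the transcript snippet and red-flag terms are illustrative:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('They said the invoices were already paid out last month.')

# lemmatize the call transcript, dropping stop words
lemmas = {tok.lemma_.lower() for tok in doc if not tok.is_stop and tok.is_alpha}

RED_FLAGS = {'pay', 'dispute', 'bankruptcy'}   # illustrative list
print(lemmas & RED_FLAGS)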

Fraud risk - Analyze invoice congruency

  • Compute a buyer-supplier score, then roll it up to a supplier-level score

Who to lend to?

  • Compute WFC scores and train a classifier
  • loan duration
  • higher decile $\rightarrow$ greater % of n-day forecast cumulative sum
  • Rates are calculated
    • by observing suppliers' rates in C2FO markets
    • adjusting for additional risk when applicable

Who to continue lending to?

  • Triggers
    • Level shifts in AR patterns
    • C2FO bid changes
    • WFC score changes
    • Adjustments
    • Buyer reserves

Miscellaneous tools

  • anaconda
  • luigi
  • dask
  • spyre

So what?

  • Objectivity often gives way to innovation
  • Tradeoffs must be evaluated in light of constraints
    • would I mind learning and/or maintaining the code?
    • do you need a beautiful front end?
    • how fast is fast enough?

Pyxley

Inspired by Shiny

  • ui.R and server.R
  • Abstraction
    • components handle the Flask plumbing (specify JSON output format for charts)
  • Decided on a JS pattern, wrote python wrappers, and didn't write documentation or tests

Lessons

  1. Do not plan for "Zero interest"
  2. Read some guides first: python open source basics
  3. Make straightforward examples (and documentation)
    • Make MVP examples instead of showing as much functionality as possible at once
  4. Make sure it's something you love

Builtin superheroes

Secret weapon

  • builtin types
    • tuple, list, set, dict
    • collections module
    • various builtin operations

Everywhere

  • the built-in types are always available
  • built-in types are fast -- for coding, i.e., developer time

Fun

  • built-in types are fun to use
  • Cleverness is rewarded

Tuple

In [12]:
row = ('Dave', 'Beazley', '4312 N Clark ST')
In [13]:
row[1]
Out[13]:
'Beazley'
In [14]:
row[2]
Out[14]:
'4312 N Clark ST'
In [15]:
from collections import namedtuple
In [16]:
Person = namedtuple('Person', ['first', 'last', 'address'])
In [17]:
row = Person('Dave', 'Beazley', 'address')
In [18]:
row.first
Out[18]:
'Dave'

List (enforcing order)

In [19]:
names = ['Dave', 'Thomas', 'Paula', 'Dave']
names
Out[19]:
['Dave', 'Thomas', 'Paula', 'Dave']

Set (uniqueness, membership)

In [20]:
names = set(['Dave', 'Thomas', 'Paula', 'Dave'])
names
Out[20]:
{'Dave', 'Paula', 'Thomas'}

Dictionary: mapping

In [22]:
prices = {
    'ACME': 94.23,
    'YOW':45.2
}
prices['ACME']
Out[22]:
94.23

Other types

In [25]:
from collections import Counter
c = Counter('xyzzy')
c
Out[25]:
Counter({'x': 1, 'y': 2, 'z': 2})
In [26]:
c['a'] +=10
c['b'] +=13
c
Out[26]:
Counter({'a': 10, 'b': 13, 'x': 1, 'y': 2, 'z': 2})
In [28]:
#one to many relationships, grouping, multidicts
from collections import defaultdict
In [29]:
d = defaultdict(list)
d['spam'].append(42)
d['blah'].append(13)
d['spam'].append(10)
d
Out[29]:
defaultdict(list, {'blah': [13], 'spam': [42, 10]})

Basic Powers

In [ ]:
## loops
## iteration
## reductions (sum, min, max, any, all)
## variants (enumerate, zip)

Superpowers

In [30]:
## List comprehensions
## set comprehensions
## Dict comprehensions
nums = [1,2,3,4,5,6]
squares = []
for x in nums:
    squares.append(x*x)
squares
Out[30]:
[1, 4, 9, 16, 25, 36]
In [31]:
squares = [x*x for x in nums]
squares
Out[31]:
[1, 4, 9, 16, 25, 36]

Iterpowers

In [32]:
# generator expressions + reductions
squares = (x*x for x in nums)
squares
Out[32]:
<generator object <genexpr> at 0x107630990>
In [33]:
for n in squares:
    print(n)
1
4
9
16
25
36

Some Data Fun

In [3]:
import csv
In [4]:
food = list(csv.DictReader(open('Food_Inspections.csv')))
In [5]:
type(food)
Out[5]:
list
In [38]:
food[1]
Out[38]:
{'AKA Name': 'JIMMY BEANS, A LOGAN SQUARE ROASTER',
 'Address': '2553 W FULLERTON AVE ',
 'City': 'CHICAGO',
 'DBA Name': 'JIMMY BEANS, A LOGAN SQUARE ROASTER',
 'Facility Type': 'Restaurant',
 'Inspection Date': '08/18/2016',
 'Inspection ID': '1950659',
 'Inspection Type': 'License',
 'Latitude': '41.92475050821695',
 'License #': '2483159',
 'Location': '(41.92475050821695, -87.69188517704096)',
 'Longitude': '-87.69188517704096',
 'Results': 'Pass',
 'Risk': 'Risk 2 (Medium)',
 'State': 'IL',
 'Violations': '32. FOOD AND NON-FOOD CONTACT SURFACES PROPERLY DESIGNED, CONSTRUCTED AND MAINTAINED - Comments: NOTED NO SPLASH GUARD AT THE HAND WASH SINK OF THE FRONT PREP AREA BY THE PREP TABLE. INSTRUCTED TO PROVIDE A SPLASH GUARD.',
 'Zip': '60647'}
In [40]:
#all possible outcomes because sets are unique
{ row['Results'] for row in food}
Out[40]:
{'Business Not Located',
 'Fail',
 'No Entry',
 'Not Ready',
 'Out of Business',
 'Pass',
 'Pass w/ Conditions'}
In [41]:
fail = [row for row in food if row['Results']=='Fail']
In [42]:
len(fail)
Out[42]:
25328
In [43]:
fail[0]
Out[43]:
{'AKA Name': 'YUM DUM',
 'Address': '2300 S THROOP ST ',
 'City': 'CHICAGO',
 'DBA Name': 'YUM DUM TRUCK',
 'Facility Type': 'Mobile Food Preparer',
 'Inspection Date': '08/17/2016',
 'Inspection ID': '1950614',
 'Inspection Type': 'License',
 'Latitude': '41.85045102427',
 'License #': '2483952',
 'Location': '(41.85045102427, -87.65879785567869)',
 'Longitude': '-87.65879785567869',
 'Results': 'Fail',
 'Risk': 'Risk 1 (High)',
 'State': 'IL',
 'Violations': '34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING METHODS USED - Comments: OBSERVED FLOORS UNDER PREP AND HOT HOLDING TABLES WITH EXCESSIVE GREASE AND FOOD DEBRIS. INSTRUCTED TO CLEAN AND MAINTAIN. | 35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTRUCTED PER CODE: GOOD REPAIR, SURFACES CLEAN AND DUST-LESS CLEANING METHODS - Comments: OBSERVED EXCESSIVE ACCUMULATED DUST ON THE VENT FANS LOCATED ON THE CEILING. INSTRUCTED TO CLEAN AND MAINTAIN. | 38. VENTILATION: ROOMS AND EQUIPMENT VENTED AS REQUIRED: PLUMBING: INSTALLED AND MAINTAINED - Comments: OBSERVED A LEAK UNDER THE EXPOSED HAND SINK LOCATED IN THE FOOD PREP AREA. INSTRUCTED TO REPAIR AND MAINTAIN. | 30. FOOD IN ORIGINAL CONTAINER, PROPERLY LABELED: CUSTOMER ADVISORY POSTED AS NEEDED - Comments: OBSERVED NEW LICENSE NUMBER MUST BE PRINTED ON BOTH SIDES OF THE TRUCK. REMOVE THE OLD LICENSE NUMBER.',
 'Zip': '60608'}
In [44]:
worst = Counter(row['DBA Name'] for row in fail)
In [45]:
worst.most_common(5)
Out[45]:
[('SUBWAY', 205),
 ('DUNKIN DONUTS', 132),
 ("MCDONALD'S", 89),
 ('7-ELEVEN', 44),
 ('MCDONALDS', 39)]
In [46]:
worst.most_common(15)
Out[46]:
[('SUBWAY', 205),
 ('DUNKIN DONUTS', 132),
 ("MCDONALD'S", 89),
 ('7-ELEVEN', 44),
 ('MCDONALDS', 39),
 ('CHIPOTLE MEXICAN GRILL', 34),
 ('POTBELLY SANDWICH WORKS LLC', 34),
 ("HAROLD'S CHICKEN SHACK", 32),
 ('CITGO', 30),
 ("PAPA JOHN'S PIZZA", 27),
 ("McDONALD'S", 27),
 ('Subway', 23),
 ('MARATHON', 22),
 ('LAS ISLAS MARIAS', 22),
 ("DOMINO'S PIZZA", 22)]
In [47]:
# taking a dictionary row and making new things
fail = [{**row, 'DBA Name': row['DBA Name'].replace("'", '').upper()}
       for row in fail]
In [49]:
worst = Counter(row['DBA Name'] for row in fail)
In [50]:
worst.most_common(5)
Out[50]:
[('SUBWAY', 228),
 ('MCDONALDS', 168),
 ('DUNKIN DONUTS', 144),
 ('CHIPOTLE MEXICAN GRILL', 48),
 ('7-ELEVEN', 47)]
In [51]:
worst.most_common(20)
Out[51]:
[('SUBWAY', 228),
 ('MCDONALDS', 168),
 ('DUNKIN DONUTS', 144),
 ('CHIPOTLE MEXICAN GRILL', 48),
 ('7-ELEVEN', 47),
 ('POTBELLY SANDWICH WORKS LLC', 34),
 ('HAROLDS CHICKEN SHACK', 33),
 ('CITGO', 30),
 ('PAPA JOHNS PIZZA', 30),
 ('JIMMY JOHNS', 30),
 ('DOMINOS PIZZA', 23),
 ('MC DONALDS', 22),
 ('AU BON PAIN', 22),
 ('SUBWAY SANDWICHES', 22),
 ('MARATHON', 22),
 ('LAS ISLAS MARIAS', 22),
 ('KFC', 22),
 ('FOREVER YOGURT', 22),
 ('DUNKIN DONUTS/BASKIN ROBBINS', 22),
 ('SHARKS FISH & CHICKEN', 22)]
In [52]:
bad = Counter(row['Address'] for row in fail)
In [53]:
bad.most_common(5)
Out[53]:
[('11601 W TOUHY AVE ', 180),
 ('324 N LEAVITT ST ', 59),
 ('500 W MADISON ST ', 58),
 ('2300 S THROOP ST ', 33),
 ('2637 S THROOP ST ', 33)]
In [54]:
by_year = defaultdict(Counter)
In [57]:
for row in fail:
    by_year[row['Inspection Date'][-4:]][row['Address']]+=1
In [58]:
by_year['2015'].most_common(5)
Out[58]:
[('11601 W TOUHY AVE ', 39),
 ('500 W MADISON ST ', 12),
 ('324 N LEAVITT ST ', 9),
 ('307 S KEDZIE AVE ', 9),
 ('12 S MICHIGAN AVE ', 8)]
In [59]:
by_year['2014'].most_common(5)
Out[59]:
[('11601 W TOUHY AVE ', 32),
 ('500 W MADISON ST ', 17),
 ('324 N LEAVITT ST ', 15),
 ('113-125 N GREEN ST ', 12),
 ('131 N CLINTON ST ', 10)]
In [60]:
by_year['2013'].most_common(5)
Out[60]:
[('11601 W TOUHY AVE ', 37),
 ('700 E GRAND AVE ', 10),
 ('2300 S THROOP ST ', 10),
 ('301 E NORTH WATER ST ', 9),
 ('12760 S HALSTED ST ', 8)]
In [61]:
bad.most_common(5)
Out[61]:
[('11601 W TOUHY AVE ', 180),
 ('324 N LEAVITT ST ', 59),
 ('500 W MADISON ST ', 58),
 ('2300 S THROOP ST ', 33),
 ('2637 S THROOP ST ', 33)]
In [62]:
_[0][0]
Out[62]:
'11601 W TOUHY AVE '
In [67]:
ohare = [row for row in fail if row['Address'].startswith('11601 W TOUHY')]
In [68]:
len(ohare)
Out[68]:
181
In [69]:
{row['Address'] for row in ohare}
Out[69]:
{'11601 W TOUHY AVE ', '11601 W TOUHY AVE T2 F12'}
In [71]:
{row['DBA Name'] for row in ohare}
Out[71]:
{'AMERICAN AIRLINES',
 'AMERICAS DOG',
 'ANDIAMOS OHARE, LLC',
 'ARAMARK AT UNITED AIRLINES',
 'ARGO TEA',
 'ARGO TEA CAFE-OHARE T2',
 'AUNTIE ANNES',
 'AUNTIE ANNES PRETZELS',
 'B JS  MARKET',
 'BRITISH AIRWAYS',
 'BURRITO BEACH',
 'CAFFE  MERCATO',
 'CHICAGO BLACKHAWKS STANLEYS T2 BAR',
 'CHICAGO NEWS & GIFTS',
 'CHILIS T - 3',
 'CHILIS T-I',
 'CHILIS- G CONCOURSE',
 'CNN',
 'EFIES CANTEEN INC',
 'ELIS CHEESECAKE',
 'FARMERS FRIDGE',
 'FRESH ON THE FLY',
 'FRONTERA TORTAS  BY RICK BAYLESS GATE K4 T3',
 'FRONTERA TORTAS BY RICK  BAYLESS',
 'GARRETT POPCORN SHOPS',
 'GATEGOURMET',
 'GOLD COAST DOGS',
 'GREEN MARKET',
 'HILTON OHARE',
 'HOST INTERNATIONAL B05',
 'HOST INTERNATIONAL INC',
 'HOST INTERNATIONAL INC, CHILIS T-2',
 'HOST INTERNATIONAL INC-GOOSE ISLAND T3',
 'HOST INTERNATIONAL INC-PRAIRIE TAP',
 'HOST INTERNATIONAL INC.',
 'HOT DOG EXPRESS',
 'HUDSON',
 'HUDSON NEWS',
 'HUDSON NEWS OHARE JOINT VENTURE',
 'ICE BAR',
 'INTELLIGENTSIA',
 'JAMBA JUICE',
 'KOREAN AIR LOUNGE',
 'LA TAPENADES GATE H14',
 'LOU MITCHELLS EXPRESS INC',
 'MACARONI GRILL',
 'MCDONALDS',
 'MCDONALDS RESTAURANT',
 'NATURAL BREAK',
 'NUTS ON CLARK',
 'OHARE BAR',
 'OHARE HILTON HOTEL',
 'PARADES A CHICAGO BAR',
 'PUBLICAN TAVERN K1',
 'REGGIOS PIZZA EXPRESS',
 'ROCKY MOUNTAIN CHOCOLATE FACTORY',
 'RUSH STREET',
 'SALAD WORKS',
 'SARAHS CANDIES',
 'SKYBRIDGE RESTAURANT & BAR',
 'STARBUCKS',
 'STARBUCKS HK APEX',
 'STARBUCKS L03',
 'SUBWAY SANDWICH',
 'THE GODDESS & GROCER',
 'THE GREAT AMERICAN BAGEL',
 'TOCCO',
 'TORTAS FRONTERA',
 'TRAVEL TRADERS #3081 @ HILTON OHARE',
 'TUSCANY CAFE',
 'UNITED CLUB',
 'UNITED CLUB ,T-1  CONCOURSE C',
 'UNITED CLUB, TERMINAL 2 CONCOURSE F',
 'UNITED CLUB,TERMINAL 1 CONCOURSE B SOUTH',
 'UNITED FIRST INTERNATIONAL LOUNGE T1,CONCOURSE C',
 'WOLFGANG EXPRESS',
 'WOLFGANG PUCK, T-3',
 'ZOOTS'}
In [72]:
ohare[0]
Out[72]:
{'AKA Name': "REGGIO'S PIZZA EXPRESS (T3 G3)",
 'Address': '11601 W TOUHY AVE ',
 'City': 'CHICAGO',
 'DBA Name': 'REGGIOS PIZZA EXPRESS',
 'Facility Type': 'Restaurant',
 'Inspection Date': '08/16/2016',
 'Inspection ID': '1950494',
 'Inspection Type': 'License',
 'Latitude': '42.008536400868735',
 'License #': '2428080',
 'Location': '(42.008536400868735, -87.91442843927047)',
 'Longitude': '-87.91442843927047',
 'Results': 'Fail',
 'Risk': 'Risk 1 (High)',
 'State': 'IL',
 'Violations': '',
 'Zip': '60666'}
In [74]:
# find the worst location in o'hare to eat
c = Counter(row['AKA Name'] for row in ohare)
In [76]:
c.most_common(10)
Out[76]:
[('MACARONI GRILL (T3-K2)', 6),
 ('Gategourmet (BLDG 741)', 5),
 ('ADMIRALS CLUB/AMERICAN AIRLINES (T3/H&K)', 5),
 ('United Employee Cafeteria (T1 C LL)', 4),
 ("CHILI'S TOO (T2  F4)", 4),
 ('HUDSON NEWS', 4),
 ('ARGO TEA  (T3 ROTUNDA)', 4),
 ("CHILI'S  TOO (T3-H2)", 4),
 ('UNITED CLUB (T1 B6)', 4),
 ('WOLFGANG PUCK (T3 K1)', 4)]
In [77]:
ohare[0]
Out[77]:
{'AKA Name': "REGGIO'S PIZZA EXPRESS (T3 G3)",
 'Address': '11601 W TOUHY AVE ',
 'City': 'CHICAGO',
 'DBA Name': 'REGGIOS PIZZA EXPRESS',
 'Facility Type': 'Restaurant',
 'Inspection Date': '08/16/2016',
 'Inspection ID': '1950494',
 'Inspection Type': 'License',
 'Latitude': '42.008536400868735',
 'License #': '2428080',
 'Location': '(42.008536400868735, -87.91442843927047)',
 'Longitude': '-87.91442843927047',
 'Results': 'Fail',
 'Risk': 'Risk 1 (High)',
 'State': 'IL',
 'Violations': '',
 'Zip': '60666'}
In [79]:
inspections = defaultdict(list)
In [80]:
for row in ohare:
    inspections[row['License #']].append(row)
In [81]:
inspections['2428080']
Out[81]:
[{'AKA Name': "REGGIO'S PIZZA EXPRESS (T3 G3)",
  'Address': '11601 W TOUHY AVE ',
  'City': 'CHICAGO',
  'DBA Name': 'REGGIOS PIZZA EXPRESS',
  'Facility Type': 'Restaurant',
  'Inspection Date': '08/16/2016',
  'Inspection ID': '1950494',
  'Inspection Type': 'License',
  'Latitude': '42.008536400868735',
  'License #': '2428080',
  'Location': '(42.008536400868735, -87.91442843927047)',
  'Longitude': '-87.91442843927047',
  'Results': 'Fail',
  'Risk': 'Risk 1 (High)',
  'State': 'IL',
  'Violations': '',
  'Zip': '60666'}]
In [82]:
inspections.keys()
Out[82]:
dict_keys(['34154', '34201', '34229', '2363771', '56366', '1885160', '2289520', '2124574', '2016732', '29570', '2428079', '2277363', '1142116', '34183', '1909532', '1333098', '2447055', '1909539', '2124567', '2289525', '2289531', '85188', '2232034', '34203', '2261733', '2192969', '1621425', '2141979', '1884293', '34211', '34205', '2289511', '2021757', '1333242', '1141505', '34173', '2192968', '1927556', '34142', '2192963', '1069382', '2114331', '23894', '2289495', '2451545', '34224', '1947515', '34192', '2289515', '2204037', '2103989', '34199', '1120626', '2109577', '1879167', '2289527', '2428080', '2016727', '2363760', '2017724', '1879166', '2363762', '34190', '34169', '2289524', '34234', '1333092', '2289084', '34212', '2261728', '2069938', '1069379', '64032', '34167', '1879164', '2125489', '0', '2232035', '1974743', '2009092', '2277391', '1333235', '1884292', '15531', '1898075', '1224624', '1916161', '34146', '2009095', '1381615', '1916219', '1888807', '2184012', '2125246', '1878675', '56367', '37170', '2463991', '34139', '2013208', '1926528', '1942304', '2016729', '51206', '1718776', '1675026', '2284294', '34220', '2284027', '1042895', '2299087'])
In [83]:
#finding failing inspections date
[row['Inspection Date'] for row in inspections['34192']]
Out[83]:
['04/07/2016', '09/04/2014', '09/20/2011', '01/26/2010']
In [86]:
#what is the most common way that a place at o'hare fails the inspection
#numeric codes and comments
ohare[1]
Out[86]:
{'AKA Name': "REGGIO'S PIZZA EXPRESS (T3 G3)",
 'Address': '11601 W TOUHY AVE ',
 'City': 'CHICAGO',
 'DBA Name': 'REGGIOS PIZZA EXPRESS',
 'Facility Type': 'Restaurant',
 'Inspection Date': '08/16/2016',
 'Inspection ID': '1950490',
 'Inspection Type': 'License',
 'Latitude': '42.008536400868735',
 'License #': '2428079',
 'Location': '(42.008536400868735, -87.91442843927047)',
 'Longitude': '-87.91442843927047',
 'Results': 'Fail',
 'Risk': 'Risk 1 (High)',
 'State': 'IL',
 'Violations': '40. REFRIGERATION AND METAL STEM THERMOMETERS PROVIDED AND CONSPICUOUS - Comments: INSTRUCTED TO PROVIDE THERMOMETER VISIBLE AND ACCURATE INSIDE PIZZA HOT HOLDING UNIT. | 33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS - Comments: INSTRUCTED TO DETAIL CLEAN AND MAINTAIN INTERIOR SURFACES OF 2 DOOR PREP COOLER (BY EXPOSED HAND SINK). | 16. FOOD PROTECTED DURING STORAGE, PREPARATION, DISPLAY, SERVICE AND TRANSPORTATION - Comments: FOUND INTERIOR SURFACES OF ICE MACHINE IN REAR NOT CLEAN WITH PINK AND BLACK MOLD LIKE BUILD-UP. INSTRUCTED TO WASH, RINSE AND SANITIZE THE AFFECTED AREAS. \nSERIOUS VIOLATION 7-38-005 (A)',
 'Zip': '60666'}
In [87]:
ohare[1]['Violations'].split('|')
Out[87]:
['40. REFRIGERATION AND METAL STEM THERMOMETERS PROVIDED AND CONSPICUOUS - Comments: INSTRUCTED TO PROVIDE THERMOMETER VISIBLE AND ACCURATE INSIDE PIZZA HOT HOLDING UNIT. ',
 ' 33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS - Comments: INSTRUCTED TO DETAIL CLEAN AND MAINTAIN INTERIOR SURFACES OF 2 DOOR PREP COOLER (BY EXPOSED HAND SINK). ',
 ' 16. FOOD PROTECTED DURING STORAGE, PREPARATION, DISPLAY, SERVICE AND TRANSPORTATION - Comments: FOUND INTERIOR SURFACES OF ICE MACHINE IN REAR NOT CLEAN WITH PINK AND BLACK MOLD LIKE BUILD-UP. INSTRUCTED TO WASH, RINSE AND SANITIZE THE AFFECTED AREAS. \nSERIOUS VIOLATION 7-38-005 (A)']
In [88]:
violations = _
In [89]:
violations
Out[89]:
['40. REFRIGERATION AND METAL STEM THERMOMETERS PROVIDED AND CONSPICUOUS - Comments: INSTRUCTED TO PROVIDE THERMOMETER VISIBLE AND ACCURATE INSIDE PIZZA HOT HOLDING UNIT. ',
 ' 33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS - Comments: INSTRUCTED TO DETAIL CLEAN AND MAINTAIN INTERIOR SURFACES OF 2 DOOR PREP COOLER (BY EXPOSED HAND SINK). ',
 ' 16. FOOD PROTECTED DURING STORAGE, PREPARATION, DISPLAY, SERVICE AND TRANSPORTATION - Comments: FOUND INTERIOR SURFACES OF ICE MACHINE IN REAR NOT CLEAN WITH PINK AND BLACK MOLD LIKE BUILD-UP. INSTRUCTED TO WASH, RINSE AND SANITIZE THE AFFECTED AREAS. \nSERIOUS VIOLATION 7-38-005 (A)']
In [92]:
[v[:v.find('- Comments:')].strip() for v in violations] 
Out[92]:
['40. REFRIGERATION AND METAL STEM THERMOMETERS PROVIDED AND CONSPICUOUS',
 '33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS',
 '16. FOOD PROTECTED DURING STORAGE, PREPARATION, DISPLAY, SERVICE AND TRANSPORTATION']
In [93]:
all_violations = [row['Violations'].split('|') for row in ohare]
In [94]:
c = Counter()
In [97]:
for violations in all_violations:
    for v in violations:
        c[v[:v.find('- Comments:')].strip()]+=1
In [98]:
c.most_common(5)
Out[98]:
[('33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS',
  107),
 ('34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING METHODS USED',
  104),
 ('35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTRUCTED PER CODE: GOOD REPAIR, SURFACES CLEAN AND DUST-LESS CLEANING METHODS',
  89),
 ('18. NO EVIDENCE OF RODENT OR INSECT OUTER OPENINGS PROTECTED/RODENT PROOFED, A WRITTEN LOG SHALL BE MAINTAINED AVAILABLE TO THE INSPECTORS',
  80),
 ('32. FOOD AND NON-FOOD CONTACT SURFACES PROPERLY DESIGNED, CONSTRUCTED AND MAINTAINED',
  69)]

When to use pandas?

  • CSVs
  • Production/corner cases

Built-in stuff is useful for very unstructured/messy data

Productionalizing a Data Science Model

Tudor Radoaca: Software Engineer

Nicole Carlson: Data Scientist

Scoring specialists/Developing Model

  • Quantifies reliability of a specialist based on retention and completion metrics
  • Checking score distribution for workers

Handoff

  • Jupyter notebook with queries and weights and flexibility to change them

Implementation

  • Specialists can view the score
  • Persistent Score History
  • Scalable implementation
  • Flexible architecture
  • Accessibility to the Data Scientist

Celery

  • asynchronous task queue using multi-processing
  • grouped tasks returning results to a common parent task (Celery's chord pattern; see the sketch below)
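
A sketch of that pattern using Celery's chord primitive; the broker URL and task bodies are illustrative assumptions, not the production code:

from celery import Celery, chord

app = Celery('scores',
             broker='redis://localhost:6379/0',    # hypothetical broker
             backend='redis://localhost:6379/0')

@app.task
def score_specialist(specialist_id):
    # placeholder: compute one specialist's score
    return {'id': specialist_id, 'score': 0.5}

@app.task
def persist_scores(results):
    # common parent task: receives the grouped results together
    return len(results)

# run scoring tasks in parallel, then hand all results to the parent
chord(score_specialist.s(i) for i in range(100))(persist_scores.s())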

Final solution: Bolero

  • Implementing Score Features
  • Software engineers changed the model

Takeaways

  1. Talk in person if things get heated
  2. Document everything
  3. Ask questions
  4. Involve stakeholders every step of the way
  5. We can build something amazing together

sean@shiftgig.com for data careers

Using Exploratory Data Analysis to Discover Patterns in Image and Document Collections

Two problems

  • How do we know we've engineered our data to be ML-friendly? How do we know we engineered it the right way?
  • We have lots of data: now what?

What to do with large amounts of unlabeled data?

Lev Manovich

Pyimageplot

The simple graph has brought more information to the data analyst's mind than any other device ~ John Tukey

EDA: two main approaches

  • Statistical modeling (summary statistics)
    • Scalable
    • Cons: models are based on assumptions that may be wrong
  • Visualization

    • not scalable
    • no assumptions are made
  • EDA on a handful of variables is straightforward

    • just plot every pair!
  • Otherwise

    • PCA
    • Multidimensional scaling
    • t-SNE (see the sketch below)
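
A sketch of the t-SNE route with scikit-learn, using the bundled digits images as stand-in data:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
# embed 64-dimensional pixel vectors into 2-D for plotting
coords = TSNE(n_components=2, random_state=0).fit_transform(digits.data)

plt.scatter(coords[:, 0], coords[:, 1], c=digits.target, cmap='tab10', s=8)
plt.title('t-SNE projection of digit images')
plt.show()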

Feature engineering

  • Manually designing the input x

Pipeline

Data $\rightarrow$ Extract Features $\rightarrow$ Visualize

  • Montage
  • Image histogram

  • Pretrained neural networks can be used as feature extractors

  • get features from a pre-trained neural net in Caffe
  • SkiCaffe: a scikit-learn wrapper for extracting features

Visualizing collections of documents

  • generate term frequencies for each document
  • visualize with word clouds
  • generate tf-idf (see the sketch after this list)
    • term frequency-inverse document frequency
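
A sketch of the tf-idf + word cloud pipeline, assuming the wordcloud package; the three-document corpus is illustrative:

import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud

docs = ['the cat sat on the mat',
        'the dog ate my homework',
        'my cat and my dog play']

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)

# weight the first document's terms by tf-idf instead of raw counts
# (get_feature_names_out assumes scikit-learn >= 1.0)
weights = dict(zip(vec.get_feature_names_out(), tfidf[0].toarray().ravel()))
cloud = WordCloud().generate_from_frequencies(weights)

plt.imshow(cloud)
plt.axis('off')
plt.show()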

Genotype-Phenotype Associations and Machine Learning

ML Problems in Medicine

Drug Design

  • Prediction problem: predict whether a particular molecule will bind to a part of the bacterium tightly enough to immobilize it
  • Extract features from bacterial protein + molecule, train a classifier, and look for a binary outcome

Vaccine discovery

  • Predict which parts of the infectious disease are useful for developing a vaccine

Disease prediction

  • Given someone's DNA, predict whether this person has a disease

Diseases

Background

  • DNA: a raw feature vector representation for life
  • genes: the subset of DNA that codes for proteins

Classification Formulation

  • Standard supervised ML task
  • Training pairs, feature engineering, train model

Problems

  • Representing sequential data
  • Missing genetic data
  • High dimensional space
  • Typically imbalanced/skewed class distribution

Genome feature vector representations

  • by definition, a feature vector is fixed length
  • SNP Vectors
    • Single nucleotide polymorphisms
    • Random insertions/deletions or substitutions
    • Align with a reference genome (use an alignment algorithm)
    • record SNPs: record the differences
    • problem: genomes have 3.3 billion nucleotides
  • Gene presence/Absence Vectors
    • apply a clustering algorithm on the genes
    • cluster membership determines new features
    • Has or has not gene
    • does not account for mutations
    • DBSCAN: similarity metric = edit distance (usually Hamming distance)
  • Character n-grams (see the sketch after this list)
    • K-mers
    • as K -> 1, a dense, low-dimensional feature space
    • as K -> m (sequence length), a sparse, binary feature space
    • sliding-window representation
    • alignment-based approaches are computationally expensive
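
A sketch of the k-mer sliding window on a toy sequence; k = 3 is an arbitrary choice:

from collections import Counter

def kmer_counts(seq, k=3):
    # slide a width-k window across the sequence and count each k-mer
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

print(kmer_counts('ATGGCATGAC'))
# Counter({'ATG': 2, 'TGG': 1, 'GGC': 1, 'GCA': 1, 'CAT': 1, 'TGA': 1, 'GAC': 1})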

Gene Feature Selection

  • Human genome has 3.3 billion nucleotides, 20K genes
  • How to perform feature selection?
    • computational limitations
    • constraints on hypothesis space
    • interpret important features
  • Two generic FS approaches
    • univariate feature selection
    • multivariate selection
  • Univariate feature selection
    • compute a metric between each feature and the label: accuracy, correlation
  • Multivariate feature selection
    • construct all models with feature subsets of size K
    • evaluate accuracy for all models
    • keep all features above a certain accuracy

Gene feature imputation

  • Missing data problem
    • missing at random
    • missing by design
    • missing completely at random
  • Mean/deletion strategies
  • Model-based imputation (Python library fancyimpute; see the sketch below)
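
A sketch of model-based imputation with fancyimpute, assuming a recent version where solvers expose fit_transform; the matrix is illustrative, with np.nan marking missing entries:

import numpy as np
from fancyimpute import KNN

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [2.0, 3.0, 4.0]])

# fill each missing value from its k nearest rows
X_filled = KNN(k=2).fit_transform(X)
print(X_filled)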

Machine Learning for Cleaning Data

Sources of Ambiguity

  • Data entered by humans
  • Data involving free text
  • Data without unique identifiers

Machine Learning

  • Work at scale
  • leave an audit trail
  • because regex is often painful

Libraries

  • usaddress
    • a Python library for inferring structure in addresses
    • uses conditional random fields
      • learns features of individual components
      • learns the relative order of components
    • Put together a lot of training data (real, parsed addresses)
    • feed through a model which pays attention to features and order
  • probablepeople
    • Campaign finance data
  • Parserator

More data cleaning problems

De-duplicating data

  • What is similarity within a column?
  • What is similarity across columns?
  • How can those decisions be made quickly with lots of data?
  • Dedupe package (see the sketch below)
    • record similarity
    • learns from the dataset
    • Smart comparisons
      • only compare records that share the first 5 characters
  • Can handle ~ 1 million records in a couple hours
    • 90% precision & 90% recall on test datasets
    • dedupe.io as service
      • data review & validation steps with active learning
      • distributed tasks
      • built-in record linkage
      • API access
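
A sketch of the dedupe workflow (a dedupe 2.x-style API; the records and field definition are illustrative):

import dedupe

records = {
    0: {'name': 'ACME Manufacturing', 'address': '123 Main St'},
    1: {'name': 'ACME Mfg', 'address': '123 Main Street'},
    2: {'name': 'Widget Corp', 'address': '9 Elm Ave'},
}

fields = [{'field': 'name', 'type': 'String'},
          {'field': 'address', 'type': 'String'}]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(records)
dedupe.console_label(deduper)     # interactive labeling = active learning
deduper.train()

# cluster records into groups of likely duplicates
clusters = deduper.partition(records, threshold=0.5)
print(clusters)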

Resources from talks I didn't attend

Data Cleaning Tool