PyData 2016 - Chicago - Day 3

Developing Communities to Develop Themselves

Ingestion | Representation | Analysis | Visualization

Phases:

  • Mostly developers, few non-dev users
  • Increasing non-dev users, many drive-by devs
  • Now there's some non-user devs

Sustaining Software

  • Actually, "sustaining inquiry"
  • "Conducting inquiry via software"

Do we actually need to sustain specific software?

  • We want to sustain users, devs, their careers, and their futures as a whole

What is sustainability?

  • Keep up with bug reports?
  • Add new features?
  • Compatible with new hardware?
  • People can learn to use it?
  • Grows new features?
  • Produces same results?
  • Transitions between people?

$S(N) = \frac{1}{(1-P) + P/N}$

Amdahl's Law as a way of quantifying bottlenecks: a representation of the bottlenecking in a community, where P is the fraction of work that can be spread across contributors and N is the number of contributors.

Challenges of software maintenance:

  • bitrot
  • burnout
  • boringness

Empowering:

Empowerment sets up a power dynamic

Community

  • How do we increase diversity?
  • How can we foster careers?
  • How can we lower barriers?
    • Acquire self-determination
    • Leadership
    • Usage of software
    • Foster self-determination

Improv Science

Values

  • Two axes: technically challenging vs. socially challenging
  • Detrimental/humanitarian vs. Functional/Problematic
  • Product vs. project
    • Project has a bidirectional flow
    • Product is dictated to you "the thing"
  • Early-stage researchers are the ones who are most damaged by a product that suddenly changes direction.

Models

  • Funded
  • Productized
  • Volunteer

Ashe Dryden's blog post

Barriers to entry (technical & social)

Engagement (methods & modes)

Investment

Credit is not a zero-sum game.

Code is not everything

Request for Commits podcast

Scale

  • Metcalfe's Law of scale (network value grows with the square of the number of connected users)

How does this look technically?

  • Provide clear mechanisms for feedback, credit, and contribution
  • Don't bottleneck on one person who can describe all the contributions

Combining open-source tech to implement an ML workflow

Custom Machine Learning

  • Focus: supervised, discriminative deep learning with gradient-based optimization
    • Alternatives: tree ensembles, probabilistic graphical models (PyMC3, Stan)
  • Goal:
    • find a function $f(y \mid X, W)$ that:
      • parameters: weights W
      • takes input: features X
      • produces output: y
  • Procedures for finding f

    • define an objective function $g(X, y, W)$
      • find a setting for W that optimizes g
      • often g is the difference between predicted y and actual y
    • Define a function to compute the derivative of g with respect to W
      • Often duplicates engineering effort from implementing f and g
      • For DL models, g and its derivative can be complicated
    • Iterate
  • Graph-based numerical computation (see the sketch after this list)

    • step 1 (model + objective implementation) is easy
    • step 2 (differentiation) is automatic
    • step 3 (iterative learning) is fast
  • TensorFlow + alternatives

    • theano
    • keras (wraps TF and theano): mostly for neural nets and DL
    • mxnet
    • torch
    • caffe
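
A minimal sketch of the three steps in TensorFlow's graph style, assuming a TF 1.x-era API; the linear model, shapes, and learning rate are illustrative, not from the talk:

import numpy as np
import tensorflow as tf

# Step 1: define the model f and objective g as a graph
X = tf.placeholder(tf.float32, [None, 3])
y = tf.placeholder(tf.float32, [None, 1])
W = tf.Variable(tf.zeros([3, 1]))
y_hat = tf.matmul(X, W)                    # f: predictions from X and W
g = tf.reduce_mean(tf.square(y_hat - y))   # g: squared-error objective

# Step 2: differentiation is automatic; the optimizer applies dg/dW
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(g)

# Step 3: iterate
X_data = np.random.randn(100, 3).astype(np.float32)
y_data = X_data @ np.array([[1.0], [2.0], [3.0]], np.float32)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(200):
        sess.run(train_op, {X: X_data, y: y_data})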

Multilayer Perceptrons

  • feed-forward neural network
    • no recurrence, no convolutions
    • input layer, 1+ hidden layers, output layer
  • model interactions between features
    • can approximate any function
  • old idea (Rosenblatt, 1961), but can be combined with newer methods
    • ReLUs, dropout, the Adam learning algorithm (see the sketch below)
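
A minimal MLP sketch using Keras (one of the libraries listed earlier); the layer sizes, dropout rate, and binary task are illustrative assumptions:

from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(64, activation='relu', input_dim=20),   # hidden layer with ReLUs
    Dropout(0.5),                                 # dropout regularization
    Dense(64, activation='relu'),                 # second hidden layer
    Dropout(0.5),
    Dense(1, activation='sigmoid'),               # output layer (binary)
])
model.compile(optimizer='adam',                   # the Adam learning algorithm
              loss='binary_crossentropy',
              metrics=['accuracy'])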

Standard Interface for ML

  • Key feature: sklearn.utils.estimator_checks.check_estimator to check API conformity of a custom estimator
  • Facilitates:

    • pipelines of transformations
    • gridsearch over hyperparameters
  • Custom models need

    • fit
    • predict
    • predict_proba
    • __init__ should just attach args (fit does most of the work)
    • everything should be serializable with pickle
      • joblib expects this
  • Civis MLP implementation

  • Pickle
    • TF doesn't support pickling very well
    • TF has a Saver class
    • one can override __getstate__/__setstate__ (see the sketch below)
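
One possible pattern, sketched under assumptions: a hypothetical estimator keeping a TF 1.x graph in self.graph and a live session in self.session, with a helper that can rebuild the graph from the remaining attributes:

import tensorflow as tf

class PickleableTFModel:
    def __getstate__(self):
        state = self.__dict__.copy()
        with self.graph.as_default():
            # pull weights out as plain numpy arrays, which pickle happily
            state['weights_'] = self.session.run(tf.trainable_variables())
        # drop the unpicklable graph/session objects
        for key in ('graph', 'session'):
            state.pop(key, None)
        return state

    def __setstate__(self, state):
        weights = state.pop('weights_')
        self.__dict__.update(state)
        self._build_graph()   # hypothetical helper: recreates graph + session
        with self.graph.as_default():
            for var, value in zip(tf.trainable_variables(), weights):
                self.session.run(var.assign(value))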

Simple Distributed Computing

  • ML models have lots of hyperparameters
    • deep learning: numbers and sizes of hidden layers
    • probabilistic graphical models: parameters for priors
    • tree ensembles: depth, learning rate, samples per split
  • grid search
    • try lots of settings
    • often combined with k-fold CV
    • crude but parallel
  • JobLib

    • originally focused on single-machine parallelism
  • scikit-learn uses joblib to parallelize hyperparameter evaluation in GridSearchCV

  • if joblib uses a custom backend (like dask), sklearn will use that (see the sketch below)
  • Distributed grid search > distributed model fitting
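
A sketch of plugging dask in as the joblib backend for GridSearchCV; the scheduler address and toy data are assumptions:

import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

client = Client('tcp://scheduler:8786')   # hypothetical scheduler address

X, y = make_classification(n_samples=1000, random_state=0)
search = GridSearchCV(LogisticRegression(), {'C': [0.01, 0.1, 1.0, 10.0]}, cv=5)

# each hyperparameter evaluation becomes a task on the dask cluster
with joblib.parallel_backend('dask'):
    search.fit(X, y)
print(search.best_params_)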

WFC (Water for commerce)

  • Investment fund and short-term lending platform for small-to-medium businesses
  • Over 5 years of daily invoices and adjustments
  • 40 day duration $\rightarrow$ 6.25% yield

Why bother with SMB lending?

  • Champion of the supplier
  • All risk is not created equal
  • Possible to do it without
    • rate gouging borrowers
    • misleading investors
  • Great data/tools

Risk

Concentration Risk: measuring diversity

  • More diversity of accounts receivable is better
  • Less concentration with 'junk' buyers

Default risk- forecasting accounts receivable

  • Want to forecast in R (use rpy2; see the sketch at the end of this section)
  • Model types

    • ARMA/ARIMA/SARIMA
    • Exponential smoothing (Holt-Winters)
    • Bayesian Structural Time Series
    • Regression (OLS, polynomial): R's lm function
    • Currently evaluating pyflux
  • Best model seeks to minimize the mean absolute percentage error (MAPE)

  • plots using matplotlib and seaborn
  • use statsmodels for seasonal decomposition
    • visually decouple trend, seasonal, and residual components
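
A sketch of driving R's forecasting from Python via rpy2, assuming R and its forecast package are installed; the series below is illustrative:

from rpy2.robjects import r, FloatVector
from rpy2.robjects.packages import importr

stats = importr('stats')
forecast = importr('forecast')

# toy receivables series (illustrative numbers)
receivables = FloatVector([102.0, 98.5, 110.2, 105.7, 99.1, 108.4,
                           101.3, 97.8, 112.0, 104.2, 100.5, 109.9])
series = stats.ts(receivables, frequency=12)

fit = forecast.auto_arima(series)      # R's auto.arima via rpy2
fc = forecast.forecast(fit, h=40)      # 40-step-ahead forecast
print(r['summary'](fc))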

Default risk- predicting AR discontinuation

"AR Discontinuation": supplier AR dropping to zero with all C2FO buyers

Challenges:

  • Data Leakage
    • do not use anything that would not have been observable at the time of prediction
    • establish a prediction cutoff
    • remove all history after the cutoff date
    • explore different cutoff windows
  • Engineering features
  • Training
    • use scikit-learn for feature engineering
      • encoding categoricals
      • creating polynomials
      • scaling features
      • dimensionality reduction/feature selection
    • use xgboost to train gradient-boosted trees, in conjunction with hyperopt (see the sketch after this list)
      • currently evaluating spearmint
  • Primarily concerned with model recall & not overfitting
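
A sketch of the xgboost + hyperopt combination; the search space, data, and recall scoring follow the notes above, but the specifics are assumptions:

import xgboost as xgb
from hyperopt import fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# imbalanced toy data standing in for the real AR features
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

space = {
    'max_depth': hp.choice('max_depth', [3, 4, 5, 6]),
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.3),
}

def objective(params):
    clf = xgb.XGBClassifier(**params)
    # recall is the primary concern, so optimize (negated) CV recall
    return -cross_val_score(clf, X, y, scoring='recall', cv=3).mean()

best = fmin(objective, space, algo=tpe.suggest, max_evals=25)
print(best)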

Default risk- predicting bankruptcy

  • Different than AR discontinuation
  • Prediction labels are different
  • Daily feeds from national bankruptcy database
  • Matching process
  • Perform data truncation and feature engineering
  • Enrich with macroeconomic data from the right point in time
  • Address class imbalances
  • Train models
    • Clean data
    • Match on 'unique values'
      • TAX IDs & Phone numbers
    • Use string matching on company names (see the sketch after this list)
      • Levenshtein distance, Jaro-Winkler distance, Jaccard distance
    • geographical distance between known addresses
      • Haversine distance
    • tinker with weighting strategy
  • Tips
    • use Cython or Numba
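
A sketch of the string-matching step with the jellyfish library; the names and weights are illustrative, and the jaro_winkler_similarity name assumes a newer jellyfish (older releases called it jaro_winkler):

import jellyfish

def jaccard(a, b):
    # Jaccard distance over word tokens
    sa, sb = set(a.split()), set(b.split())
    return 1 - len(sa & sb) / len(sa | sb)

name1 = 'ACME MANUFACTURING CO'
name2 = 'ACME MFG COMPANY'

lev = jellyfish.levenshtein_distance(name1, name2)
jw = jellyfish.jaro_winkler_similarity(name1, name2)
jac = jaccard(name1, name2)

# tinker with the weighting strategy: blend the signals into one score
score = 0.6 * jw + 0.4 * (1 - jac) - 0.01 * lev
print(lev, jw, jac, score)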

Fraud risk - Screening calls

  • Use NLP to transcribe and mine calls
  • spaCy makes tokenization/lemmatization fast (see the sketch below)
  • identify conversations with red flags
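
A sketch of the spaCy step, assuming the en_core_web_sm model is installed; the transcript snippet and red-flag terms are illustrative:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('They said the invoices were already paid out last month.')

# lemmatize the call transcript, dropping stop words
lemmas = {tok.lemma_.lower() for tok in doc if not tok.is_stop and tok.is_alpha}

RED_FLAGS = {'pay', 'dispute', 'bankruptcy'}   # illustrative list
print(lemmas & RED_FLAGS)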

Fraud risk - Analyze invoice congruency

  • Compute a buyer-supplier score, then roll it up to a supplier-level score

Who to lend to?

  • Compute WFC scores and train a classifier
  • loan duration
  • higher decile $\rightarrow$ greater % of n-day forecast cumulative sum
  • Rates are calculated
    • by observing suppliers' rates in C2FO markets
    • adjusting for additional risk when applicable

Who to continue lending to?

  • Triggers
    • Level shifts in AR patterns
    • C2FO bid changes
    • WFC score changes
    • Adjustments
    • Buyer reserves

Miscellaneous tools

  • anaconda
  • luigi
  • dask
  • spyre

So what?

  • Objectivity often gives way to innovation
  • Tradeoffs must be evaluated in light of constraints
    • would I mind learning and/or maintaining the code?
    • do you need a beautiful front end?
    • how fast is fast enough?

Pyxley

Inspired by Shiny

  • ui.R and server.R
  • Abstraction
    • components handle the Flask plumbing (specify JSON output format for charts)
  • Decided on a JS pattern, wrote python wrappers, and didn't write documentation or tests

Lessons

  1. Do not plan for "Zero interest"
  2. Read some guides first: python open source basics
  3. Make straightforward examples (and documentation)
    • Make MVP examples instead of showing as much functionality as possible at once
  4. Make sure it's something you love

Builtin superheroes

Secret weapon

  • builtin types
    • tuple, list, set, dict
    • collections module
    • various builtin operations

Everywhere

  • the built-in types are always available
  • built-in types are fast -- for coding, i.e., developer time

Fun

  • built-in types are fun to use
  • Cleverness is rewarded

Tuple

In [12]:
row = ('Dave', 'Beazley', '4312 N Clark ST')
In [13]:
row[1]
Out[13]:
'Beazley'
In [14]:
row[2]
Out[14]:
'4312 N Clark ST'
In [15]:
from collections import namedtuple
In [16]:
Person = namedtuple('Person', ['first', 'last', 'address'])
In [17]:
row = Person('Dave', 'Beazley', 'address')
In [18]:
row.first
Out[18]:
'Dave'

List (enforcing order)

In [19]:
names = ['Dave', 'Thomas', 'Paula', 'Dave']
names
Out[19]:
['Dave', 'Thomas', 'Paula', 'Dave']

Set (uniqueness, membership)

In [20]:
names = set(['Dave', 'Thomas', 'Paula', 'Dave'])
names
Out[20]:
{'Dave', 'Paula', 'Thomas'}

Dictionary: mapping

In [22]:
prices = {
    'ACME': 94.23,
    'YOW':45.2
}
prices['ACME']
Out[22]:
94.23

Other types

In [25]:
from collections import Counter
c = Counter('xyzzy')
c
Out[25]:
Counter({'x': 1, 'y': 2, 'z': 2})
In [26]:
c['a'] +=10
c['b'] +=13
c
Out[26]:
Counter({'a': 10, 'b': 13, 'x': 1, 'y': 2, 'z': 2})
In [28]:
#one to many relationships, grouping, multidicts
from collections import defaultdict
In [29]:
d = defaultdict(list)
d['spam'].append(42)
d['blah'].append(13)
d['spam'].append(10)
d
Out[29]:
defaultdict(list, {'blah': [13], 'spam': [42, 10]})

Basic Powers

In [ ]:
## loops
## iteration
## reductions (sum, min, max, any, all)
## variants (enumerate, zip)

Superpowers

In [30]:
## List comprehensions
## set comprehensions
## Dict comprehensions
nums = [1,2,3,4,5,6]
squares = []
for x in nums:
    squares.append(x*x)
squares
Out[30]:
[1, 4, 9, 16, 25, 36]
In [31]:
squares = [x*x for x in nums]
squares
Out[31]:
[1, 4, 9, 16, 25, 36]

Iterpowers

In [32]:
# generator expressions + reductions
squares = (x*x for x in nums)
squares
Out[32]:
<generator object <genexpr> at 0x107630990>
In [33]:
for n in squares:
    print(n)
1
4
9
16
25
36

Some Data Fun

In [3]:
import csv
In [4]:
food = list(csv.DictReader(open('Food_Inspections.csv')))
In [5]:
type(food)
Out[5]:
list
In [38]:
food[1]
Out[38]:
{'AKA Name': 'JIMMY BEANS, A LOGAN SQUARE ROASTER',
 'Address': '2553 W FULLERTON AVE ',
 'City': 'CHICAGO',
 'DBA Name': 'JIMMY BEANS, A LOGAN SQUARE ROASTER',
 'Facility Type': 'Restaurant',
 'Inspection Date': '08/18/2016',
 'Inspection ID': '1950659',
 'Inspection Type': 'License',
 'Latitude': '41.92475050821695',
 'License #': '2483159',
 'Location': '(41.92475050821695, -87.69188517704096)',
 'Longitude': '-87.69188517704096',
 'Results': 'Pass',
 'Risk': 'Risk 2 (Medium)',
 'State': 'IL',
 'Violations': '32. FOOD AND NON-FOOD CONTACT SURFACES PROPERLY DESIGNED, CONSTRUCTED AND MAINTAINED - Comments: NOTED NO SPLASH GUARD AT THE HAND WASH SINK OF THE FRONT PREP AREA BY THE PREP TABLE. INSTRUCTED TO PROVIDE A SPLASH GUARD.',
 'Zip': '60647'}
In [40]:
#all possible outcomes because sets are unique
{ row['Results'] for row in food}
Out[40]:
{'Business Not Located',
 'Fail',
 'No Entry',
 'Not Ready',
 'Out of Business',
 'Pass',
 'Pass w/ Conditions'}
In [41]:
fail = [row for row in food if row['Results']=='Fail']
In [42]:
len(fail)
Out[42]:
25328
In [43]:
fail[0]
Out[43]:
{'AKA Name': 'YUM DUM',
 'Address': '2300 S THROOP ST ',
 'City': 'CHICAGO',
 'DBA Name': 'YUM DUM TRUCK',
 'Facility Type': 'Mobile Food Preparer',
 'Inspection Date': '08/17/2016',
 'Inspection ID': '1950614',
 'Inspection Type': 'License',
 'Latitude': '41.85045102427',
 'License #': '2483952',
 'Location': '(41.85045102427, -87.65879785567869)',
 'Longitude': '-87.65879785567869',
 'Results': 'Fail',
 'Risk': 'Risk 1 (High)',
 'State': 'IL',
 'Violations': '34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING METHODS USED - Comments: OBSERVED FLOORS UNDER PREP AND HOT HOLDING TABLES WITH EXCESSIVE GREASE AND FOOD DEBRIS. INSTRUCTED TO CLEAN AND MAINTAIN. | 35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTRUCTED PER CODE: GOOD REPAIR, SURFACES CLEAN AND DUST-LESS CLEANING METHODS - Comments: OBSERVED EXCESSIVE ACCUMULATED DUST ON THE VENT FANS LOCATED ON THE CEILING. INSTRUCTED TO CLEAN AND MAINTAIN. | 38. VENTILATION: ROOMS AND EQUIPMENT VENTED AS REQUIRED: PLUMBING: INSTALLED AND MAINTAINED - Comments: OBSERVED A LEAK UNDER THE EXPOSED HAND SINK LOCATED IN THE FOOD PREP AREA. INSTRUCTED TO REPAIR AND MAINTAIN. | 30. FOOD IN ORIGINAL CONTAINER, PROPERLY LABELED: CUSTOMER ADVISORY POSTED AS NEEDED - Comments: OBSERVED NEW LICENSE NUMBER MUST BE PRINTED ON BOTH SIDES OF THE TRUCK. REMOVE THE OLD LICENSE NUMBER.',
 'Zip': '60608'}
In [44]:
worst = Counter(row['DBA Name'] for row in fail)
In [45]:
worst.most_common(5)
Out[45]:
[('SUBWAY', 205),
 ('DUNKIN DONUTS', 132),
 ("MCDONALD'S", 89),
 ('7-ELEVEN', 44),
 ('MCDONALDS', 39)]
In [46]:
worst.most_common(15)
Out[46]:
[('SUBWAY', 205),
 ('DUNKIN DONUTS', 132),
 ("MCDONALD'S", 89),
 ('7-ELEVEN', 44),
 ('MCDONALDS', 39),
 ('CHIPOTLE MEXICAN GRILL', 34),
 ('POTBELLY SANDWICH WORKS LLC', 34),
 ("HAROLD'S CHICKEN SHACK", 32),
 ('CITGO', 30),
 ("PAPA JOHN'S PIZZA", 27),
 ("McDONALD'S", 27),
 ('Subway', 23),
 ('MARATHON', 22),
 ('LAS ISLAS MARIAS', 22),
 ("DOMINO'S PIZZA", 22)]
In [47]:
# taking a dictionary row and making new things
fail = [{**row, 'DBA Name': row['DBA Name'].replace("'", '').upper()}
       for row in fail]
In [49]:
worst = Counter(row['DBA Name'] for row in fail)
In [50]:
worst.most_common(5)
Out[50]:
[('SUBWAY', 228),
 ('MCDONALDS', 168),
 ('DUNKIN DONUTS', 144),
 ('CHIPOTLE MEXICAN GRILL', 48),
 ('7-ELEVEN', 47)]
In [51]:
worst.most_common(20)
Out[51]:
[('SUBWAY', 228),
 ('MCDONALDS', 168),
 ('DUNKIN DONUTS', 144),
 ('CHIPOTLE MEXICAN GRILL', 48),
 ('7-ELEVEN', 47),
 ('POTBELLY SANDWICH WORKS LLC', 34),
 ('HAROLDS CHICKEN SHACK', 33),
 ('CITGO', 30),
 ('PAPA JOHNS PIZZA', 30),
 ('JIMMY JOHNS', 30),
 ('DOMINOS PIZZA', 23),
 ('MC DONALDS', 22),
 ('AU BON PAIN', 22),
 ('SUBWAY SANDWICHES', 22),
 ('MARATHON', 22),
 ('LAS ISLAS MARIAS', 22),
 ('KFC', 22),
 ('FOREVER YOGURT', 22),
 ('DUNKIN DONUTS/BASKIN ROBBINS', 22),
 ('SHARKS FISH & CHICKEN', 22)]
In [52]:
bad = Counter(row['Address'] for row in fail)
In [53]:
bad.most_common(5)
Out[53]:
[('11601 W TOUHY AVE ', 180),
 ('324 N LEAVITT ST ', 59),
 ('500 W MADISON ST ', 58),
 ('2300 S THROOP ST ', 33),
 ('2637 S THROOP ST ', 33)]
In [54]:
by_year = defaultdict(Counter)
In [57]:
for row in fail:
    by_year[row['Inspection Date'][-4:]][row['Address']]+=1
In [58]:
by_year['2015'].most_common(5)
Out[58]:
[('11601 W TOUHY AVE ', 39),
 ('500 W MADISON ST ', 12),
 ('324 N LEAVITT ST ', 9),
 ('307 S KEDZIE AVE ', 9),
 ('12 S MICHIGAN AVE ', 8)]
In [59]:
by_year['2014'].most_common(5)
Out[59]:
[('11601 W TOUHY AVE ', 32),
 ('500 W MADISON ST ', 17),
 ('324 N LEAVITT ST ', 15),
 ('113-125 N GREEN ST ', 12),
 ('131 N CLINTON ST ', 10)]
In [60]:
by_year['2013'].most_common(5)
Out[60]:
[('11601 W TOUHY AVE ', 37),
 ('700 E GRAND AVE ', 10),
 ('2300 S THROOP ST ', 10),
 ('301 E NORTH WATER ST ', 9),
 ('12760 S HALSTED ST ', 8)]
In [61]:
bad.most_common(5)
Out[61]:
[('11601 W TOUHY AVE ', 180),
 ('324 N LEAVITT ST ', 59),
 ('500 W MADISON ST ', 58),
 ('2300 S THROOP ST ', 33),
 ('2637 S THROOP ST ', 33)]
In [62]:
_[0][0]
Out[62]:
'11601 W TOUHY AVE '
In [67]:
ohare = [row for row in fail if row['Address'].startswith('11601 W TOUHY')]
In [68]:
len(ohare)
Out[68]:
181
In [69]:
{row['Address'] for row in ohare}
Out[69]:
{'11601 W TOUHY AVE ', '11601 W TOUHY AVE T2 F12'}
In [71]:
{row['DBA Name'] for row in ohare}
Out[71]:
{'AMERICAN AIRLINES',
 'AMERICAS DOG',
 'ANDIAMOS OHARE, LLC',
 'ARAMARK AT UNITED AIRLINES',
 'ARGO TEA',
 'ARGO TEA CAFE-OHARE T2',
 'AUNTIE ANNES',
 'AUNTIE ANNES PRETZELS',
 'B JS  MARKET',
 'BRITISH AIRWAYS',
 'BURRITO BEACH',
 'CAFFE  MERCATO',
 'CHICAGO BLACKHAWKS STANLEYS T2 BAR',
 'CHICAGO NEWS & GIFTS',
 'CHILIS T - 3',
 'CHILIS T-I',
 'CHILIS- G CONCOURSE',
 'CNN',
 'EFIES CANTEEN INC',
 'ELIS CHEESECAKE',
 'FARMERS FRIDGE',
 'FRESH ON THE FLY',
 'FRONTERA TORTAS  BY RICK BAYLESS GATE K4 T3',
 'FRONTERA TORTAS BY RICK  BAYLESS',
 'GARRETT POPCORN SHOPS',
 'GATEGOURMET',
 'GOLD COAST DOGS',
 'GREEN MARKET',
 'HILTON OHARE',
 'HOST INTERNATIONAL B05',
 'HOST INTERNATIONAL INC',
 'HOST INTERNATIONAL INC, CHILIS T-2',
 'HOST INTERNATIONAL INC-GOOSE ISLAND T3',
 'HOST INTERNATIONAL INC-PRAIRIE TAP',
 'HOST INTERNATIONAL INC.',
 'HOT DOG EXPRESS',
 'HUDSON',
 'HUDSON NEWS',
 'HUDSON NEWS OHARE JOINT VENTURE',
 'ICE BAR',
 'INTELLIGENTSIA',
 'JAMBA JUICE',
 'KOREAN AIR LOUNGE',
 'LA TAPENADES GATE H14',
 'LOU MITCHELLS EXPRESS INC',
 'MACARONI GRILL',
 'MCDONALDS',
 'MCDONALDS RESTAURANT',
 'NATURAL BREAK',
 'NUTS ON CLARK',
 'OHARE BAR',
 'OHARE HILTON HOTEL',
 'PARADES A CHICAGO BAR',
 'PUBLICAN TAVERN K1',
 'REGGIOS PIZZA EXPRESS',
 'ROCKY MOUNTAIN CHOCOLATE FACTORY',
 'RUSH STREET',
 'SALAD WORKS',
 'SARAHS CANDIES',
 'SKYBRIDGE RESTAURANT & BAR',
 'STARBUCKS',
 'STARBUCKS HK APEX',
 'STARBUCKS L03',
 'SUBWAY SANDWICH',
 'THE GODDESS & GROCER',
 'THE GREAT AMERICAN BAGEL',
 'TOCCO',
 'TORTAS FRONTERA',
 'TRAVEL TRADERS #3081 @ HILTON OHARE',
 'TUSCANY CAFE',
 'UNITED CLUB',
 'UNITED CLUB ,T-1  CONCOURSE C',
 'UNITED CLUB, TERMINAL 2 CONCOURSE F',
 'UNITED CLUB,TERMINAL 1 CONCOURSE B SOUTH',
 'UNITED FIRST INTERNATIONAL LOUNGE T1,CONCOURSE C',
 'WOLFGANG EXPRESS',
 'WOLFGANG PUCK, T-3',
 'ZOOTS'}
In [72]:
ohare[0]
Out[72]:
{'AKA Name': "REGGIO'S PIZZA EXPRESS (T3 G3)",
 'Address': '11601 W TOUHY AVE ',
 'City': 'CHICAGO',
 'DBA Name': 'REGGIOS PIZZA EXPRESS',
 'Facility Type': 'Restaurant',
 'Inspection Date': '08/16/2016',
 'Inspection ID': '1950494',
 'Inspection Type': 'License',
 'Latitude': '42.008536400868735',
 'License #': '2428080',
 'Location': '(42.008536400868735, -87.91442843927047)',
 'Longitude': '-87.91442843927047',
 'Results': 'Fail',
 'Risk': 'Risk 1 (High)',
 'State': 'IL',
 'Violations': '',
 'Zip': '60666'}
In [74]:
# find the worst location in o'hare to eat
c = Counter(row['AKA Name'] for row in ohare)
In [76]:
c.most_common(10)
Out[76]:
[('MACARONI GRILL (T3-K2)', 6),
 ('Gategourmet (BLDG 741)', 5),
 ('ADMIRALS CLUB/AMERICAN AIRLINES (T3/H&K)', 5),
 ('United Employee Cafeteria (T1 C LL)', 4),
 ("CHILI'S TOO (T2  F4)", 4),
 ('HUDSON NEWS', 4),
 ('ARGO TEA  (T3 ROTUNDA)', 4),
 ("CHILI'S  TOO (T3-H2)", 4),
 ('UNITED CLUB (T1 B6)', 4),
 ('WOLFGANG PUCK (T3 K1)', 4)]
In [77]:
ohare[0]
Out[77]:
{'AKA Name': "REGGIO'S PIZZA EXPRESS (T3 G3)",
 'Address': '11601 W TOUHY AVE ',
 'City': 'CHICAGO',
 'DBA Name': 'REGGIOS PIZZA EXPRESS',
 'Facility Type': 'Restaurant',
 'Inspection Date': '08/16/2016',
 'Inspection ID': '1950494',
 'Inspection Type': 'License',
 'Latitude': '42.008536400868735',
 'License #': '2428080',
 'Location': '(42.008536400868735, -87.91442843927047)',
 'Longitude': '-87.91442843927047',
 'Results': 'Fail',
 'Risk': 'Risk 1 (High)',
 'State': 'IL',
 'Violations': '',
 'Zip': '60666'}
In [79]:
inspections = defaultdict(list)
In [80]:
for row in ohare:
    inspections[row['License #']].append(row)
In [81]:
inspections['2428080']
Out[81]:
[{'AKA Name': "REGGIO'S PIZZA EXPRESS (T3 G3)",
  'Address': '11601 W TOUHY AVE ',
  'City': 'CHICAGO',
  'DBA Name': 'REGGIOS PIZZA EXPRESS',
  'Facility Type': 'Restaurant',
  'Inspection Date': '08/16/2016',
  'Inspection ID': '1950494',
  'Inspection Type': 'License',
  'Latitude': '42.008536400868735',
  'License #': '2428080',
  'Location': '(42.008536400868735, -87.91442843927047)',
  'Longitude': '-87.91442843927047',
  'Results': 'Fail',
  'Risk': 'Risk 1 (High)',
  'State': 'IL',
  'Violations': '',
  'Zip': '60666'}]
In [82]:
inspections.keys()
Out[82]:
dict_keys(['34154', '34201', '34229', '2363771', '56366', '1885160', '2289520', '2124574', '2016732', '29570', '2428079', '2277363', '1142116', '34183', '1909532', '1333098', '2447055', '1909539', '2124567', '2289525', '2289531', '85188', '2232034', '34203', '2261733', '2192969', '1621425', '2141979', '1884293', '34211', '34205', '2289511', '2021757', '1333242', '1141505', '34173', '2192968', '1927556', '34142', '2192963', '1069382', '2114331', '23894', '2289495', '2451545', '34224', '1947515', '34192', '2289515', '2204037', '2103989', '34199', '1120626', '2109577', '1879167', '2289527', '2428080', '2016727', '2363760', '2017724', '1879166', '2363762', '34190', '34169', '2289524', '34234', '1333092', '2289084', '34212', '2261728', '2069938', '1069379', '64032', '34167', '1879164', '2125489', '0', '2232035', '1974743', '2009092', '2277391', '1333235', '1884292', '15531', '1898075', '1224624', '1916161', '34146', '2009095', '1381615', '1916219', '1888807', '2184012', '2125246', '1878675', '56367', '37170', '2463991', '34139', '2013208', '1926528', '1942304', '2016729', '51206', '1718776', '1675026', '2284294', '34220', '2284027', '1042895', '2299087'])
In [83]:
#finding failing inspections date
[row['Inspection Date'] for row in inspections['34192']]
Out[83]:
['04/07/2016', '09/04/2014', '09/20/2011', '01/26/2010']
In [86]:
#what is the most common way that a place at o'hare fails the inspection
#numeric codes and comments
ohare[1]
Out[86]:
{'AKA Name': "REGGIO'S PIZZA EXPRESS (T3 G3)",
 'Address': '11601 W TOUHY AVE ',
 'City': 'CHICAGO',
 'DBA Name': 'REGGIOS PIZZA EXPRESS',
 'Facility Type': 'Restaurant',
 'Inspection Date': '08/16/2016',
 'Inspection ID': '1950490',
 'Inspection Type': 'License',
 'Latitude': '42.008536400868735',
 'License #': '2428079',
 'Location': '(42.008536400868735, -87.91442843927047)',
 'Longitude': '-87.91442843927047',
 'Results': 'Fail',
 'Risk': 'Risk 1 (High)',
 'State': 'IL',
 'Violations': '40. REFRIGERATION AND METAL STEM THERMOMETERS PROVIDED AND CONSPICUOUS - Comments: INSTRUCTED TO PROVIDE THERMOMETER VISIBLE AND ACCURATE INSIDE PIZZA HOT HOLDING UNIT. | 33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS - Comments: INSTRUCTED TO DETAIL CLEAN AND MAINTAIN INTERIOR SURFACES OF 2 DOOR PREP COOLER (BY EXPOSED HAND SINK). | 16. FOOD PROTECTED DURING STORAGE, PREPARATION, DISPLAY, SERVICE AND TRANSPORTATION - Comments: FOUND INTERIOR SURFACES OF ICE MACHINE IN REAR NOT CLEAN WITH PINK AND BLACK MOLD LIKE BUILD-UP. INSTRUCTED TO WASH, RINSE AND SANITIZE THE AFFECTED AREAS. \nSERIOUS VIOLATION 7-38-005 (A)',
 'Zip': '60666'}
In [87]:
ohare[1]['Violations'].split('|')
Out[87]:
['40. REFRIGERATION AND METAL STEM THERMOMETERS PROVIDED AND CONSPICUOUS - Comments: INSTRUCTED TO PROVIDE THERMOMETER VISIBLE AND ACCURATE INSIDE PIZZA HOT HOLDING UNIT. ',
 ' 33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS - Comments: INSTRUCTED TO DETAIL CLEAN AND MAINTAIN INTERIOR SURFACES OF 2 DOOR PREP COOLER (BY EXPOSED HAND SINK). ',
 ' 16. FOOD PROTECTED DURING STORAGE, PREPARATION, DISPLAY, SERVICE AND TRANSPORTATION - Comments: FOUND INTERIOR SURFACES OF ICE MACHINE IN REAR NOT CLEAN WITH PINK AND BLACK MOLD LIKE BUILD-UP. INSTRUCTED TO WASH, RINSE AND SANITIZE THE AFFECTED AREAS. \nSERIOUS VIOLATION 7-38-005 (A)']
In [88]:
violations = _
In [89]:
violations
Out[89]:
['40. REFRIGERATION AND METAL STEM THERMOMETERS PROVIDED AND CONSPICUOUS - Comments: INSTRUCTED TO PROVIDE THERMOMETER VISIBLE AND ACCURATE INSIDE PIZZA HOT HOLDING UNIT. ',
 ' 33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS - Comments: INSTRUCTED TO DETAIL CLEAN AND MAINTAIN INTERIOR SURFACES OF 2 DOOR PREP COOLER (BY EXPOSED HAND SINK). ',
 ' 16. FOOD PROTECTED DURING STORAGE, PREPARATION, DISPLAY, SERVICE AND TRANSPORTATION - Comments: FOUND INTERIOR SURFACES OF ICE MACHINE IN REAR NOT CLEAN WITH PINK AND BLACK MOLD LIKE BUILD-UP. INSTRUCTED TO WASH, RINSE AND SANITIZE THE AFFECTED AREAS. \nSERIOUS VIOLATION 7-38-005 (A)']
In [92]:
[v[:v.find('- Comments:')].strip() for v in violations] 
Out[92]:
['40. REFRIGERATION AND METAL STEM THERMOMETERS PROVIDED AND CONSPICUOUS',
 '33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS',
 '16. FOOD PROTECTED DURING STORAGE, PREPARATION, DISPLAY, SERVICE AND TRANSPORTATION']
In [93]:
all_violations = [row['Violations'].split('|') for row in ohare]
In [94]:
c = Counter()
In [97]:
for violations in all_violations:
    for v in violations:
        c[v[:v.find('- Comments:')].strip()]+=1
In [98]:
c.most_common(5)
Out[98]:
[('33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS',
  107),
 ('34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING METHODS USED',
  104),
 ('35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTRUCTED PER CODE: GOOD REPAIR, SURFACES CLEAN AND DUST-LESS CLEANING METHODS',
  89),
 ('18. NO EVIDENCE OF RODENT OR INSECT OUTER OPENINGS PROTECTED/RODENT PROOFED, A WRITTEN LOG SHALL BE MAINTAINED AVAILABLE TO THE INSPECTORS',
  80),
 ('32. FOOD AND NON-FOOD CONTACT SURFACES PROPERLY DESIGNED, CONSTRUCTED AND MAINTAINED',
  69)]

When to use pandas?

  • CSVs
  • Production/corner cases

Built-in stuff is useful for very unstructured/messy data

Productionalizing a Data Science Model

Tudor Radoaca: Software Engineer

Nicole Carlson: Data Scientist

Scoring specialists/Developing Model

  • Quantifies reliability of a specialist based on retention and completion metrics
  • Checking score distribution for workers

Handoff

  • Jupyter notebook with queries and weights and flexibility to change them

Implementation

  • Specialists can view the score
  • Persistent Score History
  • Scalable implementation
  • Flexible architecture
  • Accessibility to the Data Scientist

Celery

  • asynchronous task queue using multi-processing
  • grouped tasks returning results to a common parent task (Celery's chord pattern; see the sketch below)
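
A sketch of that pattern using Celery's chord primitive; the broker URL and task bodies are illustrative assumptions, not the production code:

from celery import Celery, chord

app = Celery('scores',
             broker='redis://localhost:6379/0',    # hypothetical broker
             backend='redis://localhost:6379/0')

@app.task
def score_specialist(specialist_id):
    # placeholder: compute one specialist's score
    return {'id': specialist_id, 'score': 0.5}

@app.task
def persist_scores(results):
    # common parent task: receives the grouped results together
    return len(results)

# run scoring tasks in parallel, then hand all results to the parent
chord(score_specialist.s(i) for i in range(100))(persist_scores.s())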

Final solution: Bolero

  • Implementing Score Features
  • Software engineers changed the model

Takeaways

  1. Talk in person if things get heated
  2. Document everything
  3. Ask questions
  4. Involve stakeholders every step of the way
  5. We can build something amazing together

sean@shiftgig.com for data careers

Using Exploratory Data Analysis to Discover Patterns in Image and Document Collections

Two problems

  • How do we know we've engineered our data to be ML-friendly? How do we know we engineered it the right way?
  • We have lots of data: now what?

What to do with large amounts of unlabeled data?

Lev Manovich

Pyimageplot

The simple graph has brought more information to the data analyst's mind than any other device ~ John Tukey

EDA: two main approaches

  • Statistical modeling (summary statistics)
    • Scalable
    • Cons: models are based on assumptions that may be wrong
  • Visualization

    • not scalable
    • no assumptions are made
  • EDA on a handful of variables is straightforward

    • just plot every pair!
  • Otherwise

    • PCA
    • Multidimensional scaling
    • t-SNE (see the sketch below)
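
A sketch of the t-SNE route with scikit-learn, using the bundled digits images as stand-in data:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
# embed 64-dimensional pixel vectors into 2-D for plotting
coords = TSNE(n_components=2, random_state=0).fit_transform(digits.data)

plt.scatter(coords[:, 0], coords[:, 1], c=digits.target, cmap='tab10', s=8)
plt.title('t-SNE projection of digit images')
plt.show()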

Feature engineering

  • Manually designing the input x

Pipeline

Data $\rightarrow$ Extract Features $\rightarrow$ Visualize

  • Montage
  • Image histogram

  • Pretrained neural networks can be used as feature extractors

  • get features from a pre-trained neural net in Caffe
  • SkiCaffe: a scikit-learn wrapper for extracting features

Visualizing collections of documents

  • generate term frequencies for each document
  • visualize with word clouds
  • generate tf-idf (see the sketch after this list)
    • term frequency-inverse document frequency
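
A sketch of the tf-idf + word cloud pipeline, assuming the wordcloud package; the three-document corpus is illustrative:

import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud

docs = ['the cat sat on the mat',
        'the dog ate my homework',
        'my cat and my dog play']

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)

# weight the first document's terms by tf-idf instead of raw counts
# (get_feature_names_out assumes scikit-learn >= 1.0)
weights = dict(zip(vec.get_feature_names_out(), tfidf[0].toarray().ravel()))
cloud = WordCloud().generate_from_frequencies(weights)

plt.imshow(cloud)
plt.axis('off')
plt.show()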

Genotype-Phenotype Associations and Machine Learning

ML Problems in Medicine

Drug Design

  • Prediction problem: predict whether a particular molecule will bind to a part of the bacterium tightly enough to immobilize it
  • Extract features from bacterial protein + molecule, train a classifier, and look for a binary outcome

Vaccine discovery

  • Predict which parts of the infectious disease are useful for developing a vaccine

Disease prediction

  • Given someone's DNA, predict whether this person has a disease

Diseases

Background

  • DNA: a raw feature vector representation for life
  • genes: the subset of DNA that codes for proteins

Classification Formulation

  • Standard supervised ML task
  • Training pairs, feature engineering, train model

Problems

  • Representing sequential data
  • Missing genetic data
  • High dimensional space
  • Typically imbalanced/skewed class distribution

Genome feature vector representations

  • by definition, a feature vector is fixed length
  • SNP Vectors
    • Single nucleotide polymorphisms
    • Random insertions/deletions or substitutions
    • Align with a reference genome (use an alignment algorithm)
    • record SNPs: record the differences
    • problem: genomes have 3.3 billion nucleotides
  • Gene presence/Absence Vectors
    • apply a clustering algorithm on the genes
    • cluster membership determines new features
    • Has or has not gene
    • does not account for mutations
    • DBSCAN: similarity metric = edit distance (usually Hamming distance)
  • Character n-grams (see the sketch after this list)
    • K-mers
    • as K -> 1, a dense, low-dimensional feature space
    • as K -> m (sequence length), a sparse, binary feature space
    • sliding-window representation
    • alignment-based approaches are computationally expensive
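
A sketch of the k-mer sliding window on a toy sequence; k = 3 is an arbitrary choice:

from collections import Counter

def kmer_counts(seq, k=3):
    # slide a width-k window across the sequence and count each k-mer
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

print(kmer_counts('ATGGCATGAC'))
# Counter({'ATG': 2, 'TGG': 1, 'GGC': 1, 'GCA': 1, 'CAT': 1, 'TGA': 1, 'GAC': 1})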

Gene Feature Selection

  • Human genome has 3.3 billion nucleotides, 20K genes
  • How to perform feature selection?
    • computational limitations
    • constraints on hypothesis space
    • interpret important features
  • Two generic FS approaches
    • univariate feature selection
    • multivariate selection
  • Univariate feature selection
    • compute a metric between each feature and the label: accuracy, correlation
  • Multivariate feature selection
    • construct all models with feature subsets of size K
    • evaluate accuracy for all models
    • keep all features above a certain accuracy

Gene feature imputation

  • Missing data problem
    • missing at random
    • missing by design
    • missing completely at random
  • Mean/deletion strategies
  • Model-based imputation (Python library fancyimpute; see the sketch below)
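
A sketch of model-based imputation with fancyimpute, assuming a recent version where solvers expose fit_transform; the matrix is illustrative, with np.nan marking missing entries:

import numpy as np
from fancyimpute import KNN

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [2.0, 3.0, 4.0]])

# fill each missing value from its k nearest rows
X_filled = KNN(k=2).fit_transform(X)
print(X_filled)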

Machine Learning for Cleaning Data

Sources of Ambiguity

  • Data entered by humans
  • Data involving free text
  • Data without unique identifiers

Machine Learning

  • Work at scale
  • leave an audit trail
  • because regex is often painful

Libraries

  • usaddress
    • a Python library for inferring structure in addresses
    • uses conditional random fields
      • learns features of individual components
      • learns the relative order of components
    • Put together a lot of training data (real, parsed addresses)
    • feed through a model which pays attention to features and order
  • probablepeople
    • Campaign finance data
  • Parserator

More data cleaning problems

De-duplicating data

  • What is similarity within a column?
  • What is similarity across columns?
  • How can those decisions be made quickly with lots of data?
  • Dedupe package (see the sketch below)
    • record similarity
    • learns from the dataset
    • Smart comparisons
      • only compare records that share the first 5 characters
  • Can handle ~ 1 million records in a couple hours
    • 90% precision & 90% recall on test datasets
    • dedupe.io as service
      • data review & validation steps with active learning
      • distributed tasks
      • built-in record linkage
      • API access
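
A sketch of the dedupe workflow (a dedupe 2.x-style API; the records and field definition are illustrative):

import dedupe

records = {
    0: {'name': 'ACME Manufacturing', 'address': '123 Main St'},
    1: {'name': 'ACME Mfg', 'address': '123 Main Street'},
    2: {'name': 'Widget Corp', 'address': '9 Elm Ave'},
}

fields = [{'field': 'name', 'type': 'String'},
          {'field': 'address', 'type': 'String'}]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(records)
dedupe.console_label(deduper)     # interactive labeling = active learning
deduper.train()

# cluster records into groups of likely duplicates
clusters = deduper.partition(records, threshold=0.5)
print(clusters)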

Resources from talks I didn't attend

Data Cleaning Tool