PyData 2016 - Chicago - Day 1

Table of Contents

An Introduction to Machine Learning in Python

PyData 2016 Scikit-Learn tutorial github

Intro to Machine Learning

  • Supervised: imitating previous classification
    • Regression: continuous input predicts continuous output
    • Classification: predicting categorical labels
  • Unsupervised: finding/discovery of patterns
    • Clustering: finding groups of similar items
    • Compression: finding the structure to simplify the analysis
  • Reinforcement: learning by doing

Linear Regression

  • Explanatory and response variable
  • Machine learning is used when solving the problem directly with matrix methods is not feasible
  • scikit-learn: the "NumPy of machine learning" - machine-learning algorithms built on top of NumPy
    • X array = features (columns), data points (rows)
    • Y array = stores the continuous target variable corresponding to the data points

Workflow

1) Split into training/test (70/30)

2) Hyperparameter values (meta-values about the algorithm) + Learning algorithm

* Hyperparameter = a setting of the algorithm, not of the model (e.g., how many iterations)

3) Test the model

* Make predictions using the model
* Compare prediction to actual output labels (model output to true labels)
* Compute performance metrics

4) Re-create the model with all data

* generally, more training data yields a better model (see the sketch below)
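
A minimal sketch of these four steps on made-up data (my own example using the standard scikit-learn fit/score interface, not any specific tutorial code):

    # hypothetical end-to-end workflow sketch
    import numpy as np
    from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older releases
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    X = rng.rand(100, 1)                 # made-up feature matrix
    y = 3 * X.ravel() + rng.rand(100)    # made-up continuous target

    # 1) split into training/test (70/30)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # 2) choose hyperparameter values + learning algorithm, then fit
    model = LinearRegression()
    model.fit(X_train, y_train)

    # 3) test the model: predict, compare to the true labels, compute a metric
    print('R^2 on test set: %.3f' % model.score(X_test, y_test))

    # 4) re-create the model with all the data
    model.fit(X, y)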

Code example of regression

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Source: R.J. Gladstone (1905). "A Study of the Relations of the Brain to the Size of the Head", Biometrika, Vol. 4, pp. 105-123

Description: Brain weight (grams) and head size (cubic cm) for 237 adults classified by gender and age group.

Variables/Columns

  • Gender (1=Male, 2=Female)
  • Age Range (1=20-46, 2=46+)
  • Head size (cm^3)
  • Brain weight (grams)
In [4]:
# load the data with pandas and take a look at it
df = pd.read_csv('../../pydata-chicago2016-ml-tutorial/code/dataset_brain.txt', 
                 encoding='utf-8', 
                 comment='#',
                 sep='\s+')
df.tail()
Out[4]:
gender age-group head-size brain-weight
232 2 2 3214 1110
233 2 2 3394 1215
234 2 2 3233 1104
235 2 2 3352 1170
236 2 2 3391 1120

Notes:

  • Head size = continuous variable
  • Brain-weight = continuous variable

Can the head size predict the brain-weight?

In [8]:
#look at whether or not it makes sense to build a model
plt.scatter(df['head-size'], df['brain-weight'])
plt.xlabel('Head size (cm^3)')
plt.ylabel('Brain weight (grams)');

Preparing the dataset

In [11]:
# create numpy arrays from the pandas DataFrame: y first - what we want to predict
y = df['brain-weight'].values
y.shape
Out[11]:
(237,)
In [14]:
# another numpy array from pandas
X = df['head-size'].values
# we need to add a second axis to the array- scikit-learn expects it
X = X[:, np.newaxis]
X.shape
Out[14]:
(237, 1)

Random state- the same random state will give you the same split. This is helpful for reproducibility.

In [15]:
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=123)
In [16]:
X_train.shape
Out[16]:
(165, 1)
In [17]:
plt.scatter(X_train, y_train, c='blue', marker='o')
plt.scatter(X_test, y_test, c='red', marker='s')
plt.xlabel('Head size (cm^3)')
plt.ylabel('Brain weight (grams)');

Fitting the model

In [24]:
from sklearn.linear_model import LinearRegression

#initialize linear regression object
lr = LinearRegression()
#provide both x and y, because it's supervised learning
lr.fit(X_train, y_train)
# make predictions using the 'predict' method
y_pred = lr.predict(X_test)

Evaluating the Model

We can use the R2 score to evaluate the model, computed manually here:

In [22]:
res_sum_of_squares = ((y_test - y_pred) ** 2).sum()         # residual sum of squares
tot_sum_of_squares = ((y_test - y_test.mean()) ** 2).sum()  # total sum of squares
r2_score = 1 - (res_sum_of_squares / tot_sum_of_squares)
print('R2 score: %.3f' % r2_score)
R2 score: 0.635
In [32]:
# check the scikit-learn R2 against the one computed manually above
print('R2 score: %.3f' % lr.score(X_test, y_test))
R2 score: 0.635

sklearn.metrics has a number of different ways to evaluate the fit of a model.
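
For example (a quick sketch reusing the y_test and y_pred arrays from the cells above):

    # a few ready-made regression metrics from sklearn.metrics
    from sklearn import metrics

    print('R2:  %.3f' % metrics.r2_score(y_test, y_pred))
    print('MSE: %.3f' % metrics.mean_squared_error(y_test, y_pred))
    print('MAE: %.3f' % metrics.mean_absolute_error(y_test, y_pred))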

In [30]:
lr.coef_, lr.intercept_
Out[30]:
(array([ 0.271117]), 302.03033196088086)

Use y = mx + b to plot the line below

In [28]:
min_pred = X_train.min() * lr.coef_ + lr.intercept_
max_pred = X_train.max() * lr.coef_ + lr.intercept_

plt.scatter(X_train, y_train, c='blue', marker='o')
plt.plot([X_train.min(), X_train.max()],
         [min_pred, max_pred],
         color='red',
         linewidth=4)
plt.xlabel('Head size (cm^3)')
plt.ylabel('Brain weight (grams)');

Understanding Classification

If we use the train/test split function naively, the training set may not have the same class proportions as the full dataset. We need to stratify the split, which is a relatively new feature of scikit-learn.

In logistic regression, the activation function outputs a predicted probability, which is then passed through a unit step function to predict the class label.

Logistic regression is a generalized linear model that gets weight coefficients that we use to make predictions on new data.
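
A rough sketch of that mechanism (my own illustration with made-up weights, not the tutorial's code): the net input computed from the weight coefficients goes through the logistic (sigmoid) activation to give a probability, and a unit step on that probability gives the class label.

    import numpy as np

    def sigmoid(z):
        """Logistic activation: maps the net input to a probability in (0, 1)."""
        return 1.0 / (1.0 + np.exp(-z))

    def predict(X, w, b):
        """Net input -> predicted probability -> unit step -> class label."""
        proba = sigmoid(np.dot(X, w) + b)
        return np.where(proba >= 0.5, 1, 0)

    w, b = np.array([0.8, -0.4]), 0.1        # made-up weights and bias
    X_demo = np.array([[-1.0, 2.0],
                       [3.0, 0.5]])
    print(predict(X_demo, w, b))             # [0 1] for these made-up numbers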

K-nearest-neighbors looks up the samples in the neighborhood and makes the classification prediction based on the nearest neighbors.

One thing to keep in mind: all features must be at the same scale.

The Iris dataset

In [5]:
df = pd.read_csv('../../pydata-chicago2016-ml-tutorial/code/dataset_iris.txt', 
                 encoding='utf-8', 
                 comment='#',
                 sep=',')
df.tail()
Out[5]:
sepal_length sepal_width petal_length petal_width class
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
In [37]:
X = df.iloc[:, :4].values 
y = df['class'].values
#these are the unique values within y
np.unique(y)
Out[37]:
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

Class label encoding

In [51]:
from sklearn.preprocessing import LabelEncoder
# we want to convert the labels to integers (string to int)
l_encoder = LabelEncoder()
l_encoder.fit(y)
l_encoder.classes_
Out[51]:
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
In [49]:
y_enc = l_encoder.transform(y)
np.unique(y_enc)
Out[49]:
array([0, 1, 2])
In [50]:
np.unique(l_encoder.inverse_transform(y_enc))
Out[50]:
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

Scikit-learn's built-in datasets

In [53]:
from sklearn.datasets import load_iris

iris = load_iris()
print(iris['DESCR'])
Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

Test/train splits

In [60]:
X, y = iris.data[:, :2], iris.target
# ! We only use 2 features for visual purposes (sepal length, width)

print('Class labels:', np.unique(y))
print('Class proportions:', np.bincount(y))
Class labels: [0 1 2]
Class proportions: [50 50 50]
In [64]:
#what happens if we just split without stratifying?
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=123)

print('Class labels:', np.unique(y_train))
print('Class proportions:', np.bincount(y_train))
Class labels: [0 1 2]
Class proportions: [32 40 33]
In [66]:
# the stratify=y option makes sure that the data remains in a good distribution
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=123,
        stratify=y)

print('Class labels:', np.unique(y_train))
print('Class proportions:', np.bincount(y_train))
Class labels: [0 1 2]
Class proportions: [35 35 35]

Logistic Regression

If you use the default setting, it will train three binary logistic regression classifiers, each one comparing 'one-vs-rest'. With multi_class='multinomial', the softmax function is used instead to do multinomial classification. If the classes are mutually exclusive, it's reasonable to use softmax.
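
As a quick illustration of the softmax function mentioned above (my own sketch, not tutorial code), it turns a vector of class scores into probabilities that sum to one:

    import numpy as np

    def softmax(z):
        """Convert class scores into probabilities that sum to 1."""
        e = np.exp(z - np.max(z))   # subtract the max for numerical stability
        return e / e.sum()

    scores = np.array([2.0, 1.0, 0.1])   # made-up scores for three classes
    print(softmax(scores))                # approx. [0.659 0.242 0.099]
    print(softmax(scores).sum())          # 1.0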

In [68]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(solver='newton-cg', 
                        multi_class='multinomial', 
                        random_state=1)

lr.fit(X_train, y_train)
print('Test accuracy %.2f' % lr.score(X_test, y_test))
Test accuracy 0.80
In [71]:
from mlxtend.evaluate import plot_decision_regions

plot_decision_regions(X=X, y=y, clf=lr, X_highlight=X_test)
plt.xlabel('sepal length [cm]')
plt.ylabel('sepal width [cm]');

K-Nearest Neighbors

In [81]:
from sklearn.neighbors import KNeighborsClassifier

#how do you pick the number of neighbors?
#how to break a tie
kn = KNeighborsClassifier(n_neighbors=4)

kn.fit(X_train, y_train)
print('Test accuracy %.2f' % kn.score(X_test, y_test))
Test accuracy 0.82
In [80]:
plot_decision_regions(X=X, y=y, clf=kn, X_highlight=X_test)
plt.xlabel('sepal length [cm]')
plt.ylabel('sepal width [cm]');

Feature Preprocessing & scikit-learn Pipelines

Categorical features: nominal vs ordinal

Nominal variables have no intrinsic ordering, so we can binarize them (one-hot encode them). Ordinal variables can be coded with numbers.

Types of feature normalization: min-max scaling just rescales each feature to the range 0-1, while z-score standardization centers each feature around its mean (with unit variance). Z-score standardization helps simple optimizers like gradient descent update the weights more evenly, since standardized features take values on both sides of zero.

Re-scaling is done to make sure that the coefficients the model puts out are representative of how important the input variables are, rather than of their magnitudes.

In [83]:
import pandas as pd

df = pd.DataFrame([
            ['green', 'M', 10.0], 
            ['red', 'L', 13.5], 
            ['blue', 'XL', 15.3]])

df.columns = ['color', 'size', 'prize']
df
Out[83]:
color size prize
0 green M 10.0
1 red L 13.5
2 blue XL 15.3

Note to self: Look up what DictVectorizer does (QQQ)

In [84]:
from sklearn.feature_extraction import DictVectorizer

dvec = DictVectorizer(sparse=False)

X = dvec.fit_transform(df.transpose().to_dict().values())
X
Out[84]:
array([[  0. ,   1. ,   0. ,  10. ,   0. ,   1. ,   0. ],
       [  0. ,   0. ,   1. ,  13.5,   1. ,   0. ,   0. ],
       [  1. ,   0. ,   0. ,  15.3,   0. ,   0. ,   1. ]])
In [85]:
size_mapping = {
           'XL': 3,
           'L': 2,
           'M': 1}

df['size'] = df['size'].map(size_mapping)
df
Out[85]:
color size prize
0 green 1 10.0
1 red 2 13.5
2 blue 3 15.3
In [86]:
X = dvec.fit_transform(df.transpose().to_dict().values())
X
Out[86]:
array([[  0. ,   1. ,   0. ,  10. ,   1. ],
       [  0. ,   0. ,   1. ,  13.5,   2. ],
       [  1. ,   0. ,   0. ,  15.3,   3. ]])

Normalization

In [87]:
df = pd.DataFrame([1., 2., 3., 4., 5., 6.], columns=['feature'])
df
Out[87]:
feature
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
In [88]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

mmxsc = MinMaxScaler()
stdsc = StandardScaler()

X = df['feature'].values[:, np.newaxis]

df['minmax'] = mmxsc.fit_transform(X)
df['z-score'] = stdsc.fit_transform(X)

df
Out[88]:
feature minmax z-score
0 1.0 0.0 -1.46385
1 2.0 0.2 -0.87831
2 3.0 0.4 -0.29277
3 4.0 0.6 0.29277
4 5.0 0.8 0.87831
5 6.0 1.0 1.46385

Pipeline

In [89]:
from sklearn.pipeline import make_pipeline
from sklearn.cross_validation import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=123,
        stratify=y)

lr = LogisticRegression(solver='newton-cg', 
                        multi_class='multinomial', 
                        random_state=1)

lr_pipe = make_pipeline(StandardScaler(), lr)

lr_pipe.fit(X_train, y_train)
lr_pipe.score(X_test, y_test)
Out[89]:
0.9555555555555556
In [90]:
lr_pipe.named_steps
Out[90]:
{'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='multinomial',
           n_jobs=1, penalty='l2', random_state=1, solver='newton-cg',
           tol=0.0001, verbose=0, warm_start=False),
 'standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True)}
In [91]:
lr_pipe.named_steps['standardscaler'].transform(X[:5])
Out[91]:
array([[-0.92257899,  1.04365194, -1.36204734, -1.33302226],
       [-1.17919036, -0.09310588, -1.36204734, -1.33302226],
       [-1.43580174,  0.36159725, -1.42034941, -1.33302226],
       [-1.56410743,  0.13424568, -1.30374528, -1.33302226],
       [-1.05088468,  1.2710035 , -1.36204734, -1.33302226]])

Things to look up (QQQ):

Model evaluation

  • Repeated Holdout
  • Bootstrapping
  • Cross-validation
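
Of these, cross-validation is the most direct to try in scikit-learn; a hedged sketch using the iris X, y and the lr estimator from the cells above (the import lived in sklearn.cross_validation at the time of this tutorial; it's sklearn.model_selection in newer releases):

    # k-fold cross-validation: k different train/test splits, one score per fold
    from sklearn.model_selection import cross_val_score

    scores = cross_val_score(lr, X, y, cv=5)
    print('Fold scores:', scores)
    print('Mean CV accuracy: %.3f' % scores.mean())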

Visualizing classification results:

  • It's hard to visualize things in higher dimensions

Recommendation Engine with Neo4j

In Neo4j, queries traverse relationships rather than relying on index scans.

  • Uniqueness constraints also auto-index (label, property)
CREATE CONSTRAINT ON (t:Topic) ASSERT t.id IS UNIQUE
  • Indexes are only used to find the starting point for queries
  • Relational databases use index scans to look up rows in tables and join them with rows from other tables

How to run the neo4j guide from this tutorial

:play http://guides.neo4j.com/pydatachi
  • Indexes will be the starting point for recommendation traversals
  • Need to install APOC (awesome procedures on Cypher)
  • Can write stored procedures and drop into neo4j as a plugin
  • There are implicit group by's in Cypher queries
  • Assumes that nodes on the other end of a 'has_topic' relationship are topics
  • Clustering topics

Periodic Commit

  • Cypher keeps all transaction state in memory when running a query
  • When using LOAD CSV, we can get an out-of-memory exception, so use:
    USING PERIODIC COMMIT [insert number here; default 1000]
    LOAD CSV
  • Happy medium tends to be 5K - 10K rows

The WITH clause allows chaining of query parts that are then passed to other MATCH clauses.

  • Filter on aggregated values
  • Separate reading from updating

Topic Similarity Jupyter notebook

Look at this later (QQQ)

  • Neo4j defaults to understanding everything as a string
  • Imagine if we included event attendance in that algorithm
  • Neo4j LOAD CSV is a transactional import; if something goes wrong, tie things together later

PyMC3

  • Set up models using a with statement
  • Can set up a burn-in period
  • Look for betas to be sharp
  • What to look for in the plots?

Linear Regression

  • Don't use pandas DataFrames with PyMC3
  • Cast booleans as integers - PyMC3 doesn't like booleans
  • Data passed to PyMC3 needs to be NumPy arrays
  • If you don't specify a sampler, PyMC3 will pick one for you
  • Betas look like they're in the right direction
  • Cast all the variables with errors
  • Instead of coming up with one single point value for each parameter, you come up with a value and also know its error
  • What if the beta changes over time?
  • What is probabilistic programming?
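
A minimal PyMC3 linear-regression sketch along the lines of these notes (my own made-up data, not the presenter's notebook; keyword names like sd match the PyMC3-era API):

    import numpy as np
    import pymc3 as pm

    # made-up data: y = 2x + 1 plus noise, as plain NumPy arrays (not pandas)
    x = np.linspace(0, 1, 100)
    y_obs = 2 * x + 1 + np.random.normal(0, 0.1, size=100)

    # models are set up inside a `with` statement
    with pm.Model() as model:
        alpha = pm.Normal('alpha', mu=0, sd=10)   # intercept prior
        beta = pm.Normal('beta', mu=0, sd=10)     # slope prior
        sigma = pm.HalfNormal('sigma', sd=1)      # noise prior
        mu = alpha + beta * x
        y = pm.Normal('y', mu=mu, sd=sigma, observed=y_obs)

        # if you don't specify a sampler, PyMC3 picks one for you
        trace = pm.sample(2000)

    pm.traceplot(trace);   # look for sharp, well-behaved betas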

Hierarchical Linear/Logistic Regression

ADVI (variational inference technique)

  • Black box that does parameter estimation
  • make sure that there are no NaNs and Infinities
  • Can get an estimate of the distributions
  • Can use a NUTS sampler with the standard deviations
  • ADVI vs. MCMC
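
A hedged sketch of running ADVI on the same model (assuming a PyMC3 version that exposes pm.fit; the variational API has shifted across releases):

    # ADVI: approximate the posterior with variational inference instead of MCMC
    with model:                                   # the model from the sketch above
        approx = pm.fit(n=30000, method='advi')   # black-box parameter estimation
        advi_trace = approx.sample(1000)          # draw samples from the fitted approximation

    # inspect the approximate posterior the same way as an MCMC trace
    pm.traceplot(advi_trace);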

API-fying

  • Can use posterior samples to predict
  • even if you put in the wrong actual value, it'll predict something

Hacking the Jupyter Kernels

  1. What is a kernel?
    • A program that runs and introspects the user's code
    • Tracks the state of the kernel itself
    • Essentially a communicator
    • Single kernel can be connected to multiple front-ends
  2. What does Jupyter Notebook do?

    • Starts a kernel and passes it a connection file that defines how it should communicate with the front-end
    • Kernel + front-end communicate through JSON payloads
  3. What does a connection file look like?

    {
    "control_port": 50160,
    "shell_port": 57503,
    "transport": "tcp",
    "signature_scheme": "hmac-sha256",
    "stdin_port": 52597,
    "hb_port": 42540,
    "ip": "127.0.0.1",
    "iopub_port": 40885,
    "key": "a0436f6c-1916-498b-8eb9-e81ab9368e84"
    }
    

How do we write a kernel for Jupyter?

  1. Write it in Python
    • ipykernel provides the base classes you need; you override the standard methods (see the echo-kernel sketch after this list)
    • To wrap an existing language, there has to be a REPL (console) that the kernel can drive, e.g. with pexpect
  2. Write the kernel in the native language
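
A minimal sketch of option 1, following the ipykernel wrapper/echo-kernel pattern: this kernel simply streams the submitted code back to the front-end instead of evaluating it.

    from ipykernel.kernelbase import Kernel

    class EchoKernel(Kernel):
        implementation = 'Echo'
        implementation_version = '0.1'
        language = 'no-op'
        language_version = '0.1'
        language_info = {'name': 'echo',
                         'mimetype': 'text/plain',
                         'file_extension': '.txt'}
        banner = "Echo kernel - prints back whatever you send it"

        def do_execute(self, code, silent, store_history=True,
                       user_expressions=None, allow_stdin=False):
            if not silent:
                # publish the "result" as a stream message on the IOPub socket
                self.send_response(self.iopub_socket, 'stream',
                                   {'name': 'stdout', 'text': code})
            return {'status': 'ok',
                    'execution_count': self.execution_count,
                    'payload': [],
                    'user_expressions': {}}

    if __name__ == '__main__':
        from ipykernel.kernelapp import IPKernelApp
        IPKernelApp.launch_instance(kernel_class=EchoKernel)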

Jupyter messaging protocol

  1. Defines the spec for how messages look between front-end and kernel
    {
      'header' : {
        'msg_id' : uuid,
        'username' : str,
        'session' : uuid,
        'date' : str,
        'msg_type' : str,
        'version' : '5.0',
      },
      'parent_header' : dict,
      'metadata' : dict,
      'content' : dict,
    }
  2. Types of kernel sockets
    • Shell
      • Manages communication with all of the front-ends
      • Request-reply pattern
      • The front-end sends a message to the socket, the kernel parses it and sends a reply back, e.g. an is_complete_request:
        {
          'header' : {
            'msg_id' : uuid,
            'username' : str,
            'session' : uuid,
            'date' : str,
            'msg_type' : 'is_complete_request',
            'version' : '5.0',
          },
          'parent_header' : dict,
          'metadata' : dict,
          'content' : {
            'code' : 'prin'
          }
        }
        

example: tab-complete

* Shell use cases:
    - statement completeness
    - tab completion
    - information about the connected kernel
    - send out requests to execute code
* IO Publication
    - any kind of image rendering
    - provides information about the current kernel state (e.g. busy/idle)
    - streams output that are side effects of execution
* Standard Input
    - when the kernel wants to request input from the front end
    - executing python code asking for raw_input
* Control
    - Does everything that the shell can do
    - Can take priority over things in the queue
    - Restarting the kernel in the middle of execution is done via the Control socket
* Heartbeat
    - Sends bytestring back and forth
    - Used to make sure the kernel is still alive

What makes this magic happen?

0MQ is a messaging library that provides an API with bindings across programming languages to open up communication.

An agreement on how messages should be sent.
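
As a tiny illustration of the 0MQ request-reply pattern those sockets build on (a generic pyzmq sketch, not Jupyter's actual socket setup):

    import threading
    import zmq

    def kernel_side():
        # "kernel" side: bind a REP socket, read one request, send one reply
        sock = zmq.Context.instance().socket(zmq.REP)
        sock.bind("tcp://127.0.0.1:5555")
        msg = sock.recv()
        sock.send(b"echo: " + msg)
        sock.close()

    threading.Thread(target=kernel_side).start()

    # "front-end" side: connect a REQ socket and do one request/reply round trip
    req = zmq.Context.instance().socket(zmq.REQ)
    req.connect("tcp://127.0.0.1:5555")
    req.send(b"hello kernel")
    print(req.recv())   # b'echo: hello kernel'
    req.close()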

Random notes:

  • Banner: what is printed out when your kernel is connected in console mode.
  • codemirror_mode - what type of syntax highlighting to use
  • Execution count makes a call to another function called "start bash"
  • the do_execute method tells the kernel what to do with the code
  • the kernel base class only requires do_execute; you can also write do_complete

A fill-in-the-blank version of how to build a kernel

Parallel Computing with Dask

Dask:

  • Parallel numpy and pandas
  • Responsive
  • Parallel programming + Task scheduling
  • It's a data structure on top of a bunch of numpy arrays
  • Does opportunistic caching
  • Number of chunks is arbitrary
  • dask.get() is an alias for the synchronous backend for tracebacks
  • Thread is the smallest unit
  • I/O tasks are not blocked by the GIL
  • Learn more about Tornado in python
  • Can do pandas dataframes distributed
  • Can build up arbitrarily complex things
  • h5py (hierarchical data format)
    • fast lookups
  • Dask bag
    • bag take is like head
    • don't use groupby unless you have to
  • head is not lazy for a df in dask
  • There's a map_blocks for dask arrays
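
A small dask.array / dask.bag sketch of a few of those points (my own example, not the tutorial notebook):

    import dask.array as da
    import dask.bag as db

    # dask array: a chunked structure on top of many NumPy arrays
    x = da.random.random((10000, 10000), chunks=(1000, 1000))  # number/size of chunks is up to you
    result = (x + x.T).mean(axis=0)   # builds a task graph lazily
    print(result.compute())           # .compute() actually runs the graph

    # dask bag: take() is like head()
    b = db.from_sequence(range(100), npartitions=4)
    print(b.map(lambda n: n ** 2).take(5))   # (0, 1, 4, 9, 16)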

Talks I didn't attend (but things I want to look into)

JUSTIN'S NOTES

Building Recommendation Engines With Neo4j:

:play http://guides.neo4j.com/pydatachi

  • Use link above in Neo4j for the full tutorial
  • Some links might have a file path instead of a URL address; change those!
  • The “WITH” clause allows chaining of arguments in Neo4j

Luigi and Data Pipelines

github.com/spotify/luigi

  • Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
  • Works with slack and other chat: https://github.com/houzz/hubot-luigi

  • Basic structure is built around Tasks & Targets. Tasks: the atomic unit of work; takes target(s), produces target(s). Targets: ..... (slide moved too fast)

  • It seems like Luigi is made to tie together different functions, tasks, jobs, etc.: it is a framework for tying scripts together
  • Has both a command-line interface and a web GUI
  • Speaker used building permit data from City of Chicago: https://data.cityofchicago.org
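
A minimal Luigi sketch of the Task/Target structure described above (hypothetical file names, loosely inspired by the permit-data example; not the speaker's pipeline):

    import luigi

    class ExtractPermits(luigi.Task):
        """Atomic unit of work: produces a Target."""
        def output(self):
            return luigi.LocalTarget('permits_raw.csv')   # hypothetical output file

        def run(self):
            with self.output().open('w') as f:
                f.write('permit_id,issued\n')             # stand-in for a real download

    class CountPermits(luigi.Task):
        """Requires ExtractPermits; Luigi resolves the dependency."""
        def requires(self):
            return ExtractPermits()

        def output(self):
            return luigi.LocalTarget('permit_count.txt')

        def run(self):
            with self.input().open() as fin, self.output().open('w') as fout:
                fout.write(str(sum(1 for _ in fin) - 1))

    if __name__ == '__main__':
        luigi.run()   # e.g. `python pipeline.py CountPermits --local-scheduler`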

JULIA

talk by: Huda Nassar @nassarhuda

Tutorial: clone git@github.com:nassarhuda/PyDataChi2016.git

  • download the language here: http://julialang.org/downloads/
  • run code in the cloud, including code from github https://juliabox.com

  • Need to correct this error to load some datasets https://github.com/JuliaLang/julia/issues/14746

  • Able to select options to do things in parallel

  • Able to change methods to increase speed
  • Typed like python for the most part
  • “Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library”
  • Julia's goal: write like Python, run like C, easily access existing libraries and legacy code. (more here)
  • Julia packages - http://pkg.julialang.org
  • JULIA interface with AWS - https://github.com/JuliaCloud/AWS.jl
  • Benefit of JULIA = easy to bring in packages from other languages
  • Benefit of JULIA = easy to change type and sometimes possible to change assumptions of what types functions or modules need
  • “Don’t be afraid of doing a for loop in Julia, they can be faster than vectorized operations that you might go to for speed in python”
  • Something about self-generated types… that you can change on the fly with short-cuts?
  • There was a short-cut to show you the code behind the functions being called, down to maybe machine code? Or at least C?
  • PyCall is one of the most popular packages in Julia as it can run any python code or almost any python code in Julia.
  • Interesting data frame syntax in Julia that, at least in places, seems a bit more straightforward than Pandas? See the DataFrames_example Jupyter Notebook