Phases: Ingestion | Representation | Analysis | Visualization
Sustaining Software
Do we actually need to sustain specific software?
What is sustainability?
$S(N) = \frac{1}{(1-P)+(P/N)}$
Quantifying bottlenecks. A representation of the bottlenecking in a community.
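The formula above is Amdahl's law: P is the fraction of the work that can be parallelized and N the number of workers. Read as an analogy for a community (my reading, not stated in the talk), a project bottlenecked on a single maintainer has a large serial fraction. A quick sketch:

```python
def speedup(P, N):
    # Amdahl's law: P = parallelizable fraction of the work, N = number of workers
    return 1.0 / ((1 - P) + P / N)

speedup(0.9, 10)   # ~5.26: even 90% parallel work caps out well below 10x
speedup(1.0, 4)    # 4.0: perfectly parallel work scales linearly
```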
Challenges of software maintenance:
Empowering:
Empowerment sets up a power dynamic
Community
Values
Models
Barriers to entry (technical & social)
Engagement (methods & modes)
Investment
Credit is not a zero-sum game.
Code is not everything
Scale
How does this look technically?
Procedures for finding f
Graph-based numerical computation
TensorFlow + alternatives
Multilayer Perceptrons
sklearn.utils.estimator_checks.check_estimator
to check API conformity of a custom estimator
Facilitates:
Custom models need
Civis MLP implementation
__getstate__
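The actual Civis MLP implementation isn't reproduced here, but the reason a custom estimator defines `__getstate__` can be sketched: a model holding unpicklable runtime state (a TensorFlow session, a lock, ...) drops that state when pickled and rebuilds it on load. A toy stand-in:

```python
import pickle
import threading

class Model:
    """Toy stand-in for an estimator holding unpicklable runtime state."""
    def __init__(self):
        self.weights = [0.5, -1.2]      # serializable parameters
        self._lock = threading.Lock()   # unpicklable runtime object

    def __getstate__(self):
        # Copy the instance dict and drop the attribute pickle cannot handle.
        state = self.__dict__.copy()
        del state['_lock']
        return state

    def __setstate__(self, state):
        # Restore serializable attributes, then rebuild the runtime state.
        self.__dict__.update(state)
        self._lock = threading.Lock()

m = pickle.loads(pickle.dumps(Model()))
m.weights  # round-trips: [0.5, -1.2]
```

Without `__getstate__`, `pickle.dumps(Model())` would raise a `TypeError` on the lock.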
scikit-learn uses joblib to parallelize hyperparameter evaluation in GridSearchCV
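A minimal sketch of that parallelism, assuming scikit-learn is installed (the estimator and parameter grid are made up): `n_jobs` tells joblib to fan the per-candidate, per-fold fits out across processes.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.random((120, 3))
y = X.sum(axis=1)  # toy target

# Each (parameter, CV fold) fit is an independent task, so joblib
# can run them in parallel; n_jobs=-1 uses all available cores.
search = GridSearchCV(KNeighborsRegressor(),
                      {'n_neighbors': [1, 3, 5]},
                      cv=3, n_jobs=-1)
search.fit(X, y)
search.best_params_
```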
Model types
Best model seeks to minimize the mean absolute percentage error
"AR Discontinuation": supplier AR dropping to zero with all C2FO buyers
Challenges:
row = ('Dave', 'Beazley', '4312 N Clark ST')
row[1]
row[2]
from collections import namedtuple
Person = namedtuple('Person', ['first', 'last', 'address'])
row = Person('Dave', 'Beazley', '4312 N Clark ST')
row.first
names = ['Dave', 'Thomas', 'Paula', 'Dave']
names
names = set(['Dave', 'Thomas', 'Paula', 'Dave'])
names
prices = {
'ACME': 94.23,
'YOW':45.2
}
prices['ACME']
from collections import Counter
c = Counter('xyzzy')
c
c['a'] += 10
c['b'] += 13
c
#one to many relationships, grouping, multidicts
from collections import defaultdict
d = defaultdict(list)
d['spam'].append(42)
d['blah'].append(13)
d['spam'].append(10)
d
## loops
## iteration
## reductions (sum, min, max, any, all)
## variants (enumerate, zip)
## List comprehensions
## set comprehensions
## Dict comprehensions
nums = [1,2,3,4,5,6]
squares = []
for x in nums:
    squares.append(x*x)
squares
squares = [x*x for x in nums]
squares
# generator expressions + reductions
squares = (x*x for x in nums)
squares
for n in squares:
    print(n)
import csv
food = list(csv.DictReader(open('Food_Inspections.csv')))
type(food)
food[1]
#all possible outcomes because sets are unique
{row['Results'] for row in food}
fail = [row for row in food if row['Results'] == 'Fail']
len(fail)
fail[0]
worst = Counter(row['DBA Name'] for row in fail)
worst.most_common(5)
worst.most_common(15)
# taking a dictionary row and making new things
fail = [{**row, 'DBA Name': row['DBA Name'].replace("'", '').upper()}
for row in fail]
worst = Counter(row['DBA Name'] for row in fail)
worst.most_common(5)
worst.most_common(20)
bad = Counter(row['Address'] for row in fail)
bad.most_common(5)
by_year = defaultdict(Counter)
for row in fail:
    by_year[row['Inspection Date'][-4:]][row['Address']] += 1
by_year['2015'].most_common(5)
by_year['2014'].most_common(5)
by_year['2013'].most_common(5)
bad.most_common(5)
_[0][0]
ohare = [row for row in fail if row['Address'].startswith('11601 W TOUHY')]
len(ohare)
{row['Address'] for row in ohare}
{row['DBA Name'] for row in ohare}
ohare[0]
# find the worst location in o'hare to eat
c = Counter(row['AKA Name'] for row in ohare)
c.most_common(10)
ohare[0]
inspections = defaultdict(list)
for row in ohare:
    inspections[row['License #']].append(row)
inspections['2428080']
inspections.keys()
#finding failing inspections date
[row['Inspection Date'] for row in inspections['34192']]
#what is the most common way that a place at o'hare fails the inspection
#numeric codes and comments
ohare[1]
ohare[1]['Violations'].split('|')
violations = _
violations
[v[:v.find('- Comments:')].strip() for v in violations]
all_violations = [row['Violations'].split('|') for row in ohare]
c = Counter()
for violations in all_violations:
    for v in violations:
        c[v[:v.find('- Comments:')].strip()] += 1
c.most_common(5)
When to use pandas?
Built-in stuff is useful for very unstructured/messy data
Tudor Radoaca: Software Engineer
Nicole Carlson: Data Scientist
sean@shiftgig.com for data careers
What to do with large amounts of unlabeled data?
The simple graph has brought more information to the data analyst's mind than any other device ~ John Tukey
Visualization
EDA on a handful of variables is straightforward; otherwise:
$\text{Data} \rightarrow \text{Extract Features} \rightarrow \text{Visualize}$
Image histogram
Pretrained neural networks can be used as feature extractors
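As a concrete stand-in for the extract-features step, an image histogram turns a grayscale image into a fixed-length feature vector that can be visualized or compared (a random array stands in for a real image here):

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.integers(0, 256, size=(32, 32))  # stand-in 8-bit grayscale image

# Bucket pixel intensities into 16 bins, then normalize so the
# vector sums to 1 and is comparable across image sizes.
hist, _ = np.histogram(image, bins=16, range=(0, 256))
features = hist / hist.sum()  # 16-dimensional feature vector
```

A pretrained network plays the same role, just with richer features: run the image through the network and take an intermediate layer's activations as the feature vector.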