Data Science Map.

Introduction
This is a work in progress toward organizing and mapping out the toolset I need for my career and various projects. This year's focus has been filling gaps in data visualization and significantly strengthening my data modeling skills. Some work remains on network and dynamic systems. Decision modeling is new to me, and my understanding of it still needs to mature.

Overview.

  1. Gathering.
  2. Describing.
  3. Modeling.
  4. Deciding.
  5. Visualizing.

Gathering.

  • Manual: Notebooks, observations, etc
  • Pulling: Database, log files, etc
  • Scraping (i.e., from the web): beautifulsoup, wptools, dbpedia
  • Cleaning: filter, drop, replace, regex (sheets, python, pandas, R)
  • Parsing: filter, sort, pivot, SQL query (sheets, python, pandas, R)
  • Merge: (sheets, python, pandas, R)
  • Store: Notebook, sheets, csv, database, etc
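
As a minimal sketch of the cleaning, merge, and parsing steps listed above, the following uses only the Python standard library; pandas would express the same operations more compactly. The records, field names, and the `clean_name` helper are hypothetical, invented for illustration.

```python
import re

# Hypothetical raw records, as might come from a log file or a scrape.
raw_rows = [
    {"id": "1", "name": "  Alice  ", "score": "90"},
    {"id": "2", "name": "BOB!!", "score": "n/a"},
    {"id": "3", "name": "carol", "score": "75"},
]
# Hypothetical second table to merge on the shared "id" key.
teams = {"1": "red", "3": "blue"}


def clean_name(text):
    """Strip non-letter characters and normalize capitalization."""
    return re.sub(r"[^A-Za-z]", "", text).capitalize()


# Cleaning: drop rows whose score is not numeric, then fix names and types.
cleaned = [
    {"id": row["id"], "name": clean_name(row["name"]), "score": int(row["score"])}
    for row in raw_rows
    if row["score"].isdigit()
]

# Merge: inner join on "id" (rows without a matching team are dropped).
merged = [dict(row, team=teams[row["id"]]) for row in cleaned if row["id"] in teams]

# Parsing: sort by score, highest first.
merged.sort(key=lambda row: row["score"], reverse=True)
```

Dropping the non-numeric row is only one possible choice here; replacing the value with a default or flagging it for review may suit other data sets better.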

Describing.

  • Numerical: simple retrieval, min, max, average, percentage of whole, etc (python, numpy, scipy, R, sheets)
  • Function: class, fit, derivative, integration, limits, monotonicity (python, numpy, scipy, R, sheets)
  • Statistical: mu, sigma, pdf, cdf, confidence, etc (python, numpy, scipy, R, sheets)
  • Dynamic: coefficients, gain, steady-state, overshoot, settling time, etc
  • Network: path, distance, depth, #nodes, #edges, internal composition and organization (networkx, gephi, cytoscape)
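
The numerical and statistical descriptions above could be sketched as follows, using the standard-library `statistics` module rather than numpy or scipy. The sample values are hypothetical, and the confidence interval assumes a normal approximation (a t-distribution would be more appropriate for a sample this small).

```python
import statistics

# Hypothetical measurements.
samples = [9.8, 10.1, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0]

# Numerical description: simple retrieval and aggregates.
low, high = min(samples), max(samples)

# Statistical description: sample mean (mu) and sample standard deviation (sigma).
mu = statistics.mean(samples)
sigma = statistics.stdev(samples)

# Fit a normal model and read values off its pdf and cdf.
model = statistics.NormalDist(mu, sigma)
density_at_mean = model.pdf(mu)
p_below = model.cdf(10.2)  # estimated probability that a value falls below 10.2

# Approximate 95% confidence interval for the mean (normal approximation).
half_width = 1.96 * sigma / len(samples) ** 0.5
confidence_interval = (mu - half_width, mu + half_width)
```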

Modeling.

  • Numerical: algebraic, geometrical, etc
  • Regression: curve fitting (i.e., function class and coefficients), correlation, error bars
  • Limits: tolerance analysis, sensitivity analysis, threshold analysis
  • Optimization: min, max, rate of change, etc (python, numpy, scipy, R, sheets)
  • Statistical: manufacturing limits, ANOVA, Monte Carlo (time series and aggregate)
  • Dynamic: system diagramming and equations, impulse response, transient response
  • Network: network diagramming and equations, game theory, simulated annealing, search algorithms
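
Two of the modeling approaches above, regression and Monte Carlo, can be illustrated with a short standard-library sketch. The function names `fit_line` and `monte_carlo_stack`, the data points, and the part dimensions are all hypothetical; the regression is restricted to a straight line, and the Monte Carlo run assumes independent, normally distributed part sizes.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def monte_carlo_stack(nominals, sigmas, trials=10_000, seed=0):
    """Simulate the total length of a stack of parts with random sizes."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    return [
        sum(rng.gauss(mu, sigma) for mu, sigma in zip(nominals, sigmas))
        for _ in range(trials)
    ]


# Regression: recover the coefficients of a line from noisy, made-up data.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
slope, intercept = fit_line(xs, ys)

# Monte Carlo (aggregate): distribution of the total of three parts.
totals = monte_carlo_stack([10.0, 20.0, 5.0], [0.1, 0.2, 0.05])
total_mean = statistics.mean(totals)
total_sigma = statistics.stdev(totals)
```

For independent normal parts the simulated spread should land near the root-sum-square of the individual sigmas, which gives a quick sanity check on the simulation.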

Deciding.

  • Forecasting/Risk Analysis: regression analysis, Monte Carlo (time series and aggregate), game theory
  • Set Point/Tuning: system inputs sweeping, hypothesis testing, necessary condition (threshold) analysis
  • Decision Matrix, Confusion Matrix, ROC plots, Decision Theory
  • Data Classification, Machine Learning, Deep Learning
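
The confusion matrix and ROC plot mentioned above can be computed from scratch in a few lines. This is a sketch for a binary classifier only; the labels, scores, and function names are hypothetical, and `roc_points` assumes both classes are present (otherwise a rate would divide by zero).

```python
def confusion_matrix(labels, scores, threshold):
    """Count TP, FP, TN, FN for binary labels at a given score threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        cm = confusion_matrix(labels, scores, threshold)
        tpr = cm["tp"] / (cm["tp"] + cm["fn"])
        fpr = cm["fp"] / (cm["fp"] + cm["tn"])
        points.append((fpr, tpr))
    return points


# Hypothetical ground-truth labels (1 = positive) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

cm = confusion_matrix(labels, scores, threshold=0.65)
roc = roc_points(labels, scores)
```

Sweeping the threshold this way is also a simple form of set point tuning: each point on the curve is one candidate operating point, and the choice between them depends on the relative cost of false positives and false negatives.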

Visualizing.

  • Time: 1-D amplitude, SPC plots, strip (seaborn, matplotlib, sheets)
  • Relationship (Causal): XY scatter/bubble, pairs plots, line, vector (seaborn, matplotlib, sheets)
  • Matrix (Area): heat [w/ clustering], geographical, tree/other maps (seaborn, matplotlib, sheets)
  • Aggregate (Composition): distribution, columnar [w/ stacking], pie, box, strip (seaborn, matplotlib, sheets)
  • Network: trees (directed), undirected, flow charts (e.g. sankey) (graphviz, pydot, gephi, cytoscape)

Notes.
Things to consider when choosing a model:

  1. Initial Conditions
  2. Boundary Conditions
  3. Mathematical vs Logical
  4. Deterministic vs Stochastic
  5. Linear vs Nonlinear
  6. Static vs Dynamic
  7. Explicit vs Implicit
  8. Discrete vs Continuous
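
Several of these considerations show up together even in a very small model. The sketch below is a hypothetical discrete-time, linear, dynamic model with an explicit initial condition, run once deterministically and once with added noise. The function name `simulate`, the coefficients, and the noise level are assumptions chosen only for illustration.

```python
import random


def simulate(x0, steps, a=0.5, b=1.0, u=1.0, noise_sigma=0.0, seed=0):
    """Discrete-time first-order model: x[k+1] = a * x[k] + b * u + noise."""
    rng = random.Random(seed)  # seeded so the stochastic run is repeatable
    xs = [x0]  # x0 is the initial condition
    for _ in range(steps):
        noise = rng.gauss(0.0, noise_sigma) if noise_sigma > 0 else 0.0
        xs.append(a * xs[-1] + b * u + noise)
    return xs


# Deterministic run: the same inputs always give the same trajectory.
deterministic = simulate(x0=0.0, steps=20)

# Stochastic run: same model with a random disturbance at each step.
stochastic = simulate(x0=0.0, steps=20, noise_sigma=0.1)

# For |a| < 1 the deterministic model settles at b * u / (1 - a).
steady_state = 1.0 * 1.0 / (1 - 0.5)
```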