A list of several sources to learn data science in a hands-on format

https://www.coursera.org/course/ml – The most approachable machine learning course available. And it’s free.

https://www.kaggle.com/wiki/Tutorials – Provides data sources, forums, scenarios, and real-world competitions to teach data mining

http://deeplearning.net/tutorial/ – Tutorial on Deep Learning – introduction to machine learning image analysis algorithms

http://tryr.codeschool.com/ – Interactive introduction to R Language

The elements of big data analytics has roots in statistics, knowledge management, and computer science. Many of the data mining terms below appear in these disciplines but may have different connotation or specialized meaning when applied to our problems. The problems of massive parallel processing and the specialized algorithms employed to perform analysis in a distributed computing environment are enough to require specialized treatment.

Data Mining Terms


Accuracy A measure of a predictive model that reflects the proportionate number of times that the model is correct when applied to data
Bias Difference between expected value and actual value
Cardinality Data mining terms indicating the number of different values a categorical predictor or OLAP dimension can have. High cardinality predictors and dimensions have large numbers of different values (e.g. zip codes), low cardinality fields have few different values (e.g. eye color).
CART Classification and Regression Trees. A type of decision tree algorithm that automates the pruning process through cross validation and other techniques.
CHAID Chi-Square Automatic Interaction Detector. A decision tree that uses contingency tables and the chi-square test to create the tree. Classification. The process of learning to distinguish and discriminate between different input patterns using a supervised training algorithm. Classification is the process of determining that a record belongs to a group
Cluster Centroid most typical case in a cluster.  The centroid is a prototype. It does not necessarily describe any given case assigned to the cluster.
Clustering The technique of grouping records together based on their locality and connectivity within the n-dimensional space. This is an unsupervised learning technique.
Collinearity The property of two predictors showing significant correlation without a causal relationship between them
concentration of measure any set of positive probability can be expanded very slightly to contain most of the probability the average of bounded independent random variables is tightly concentrated around its expectation
Conditional Probability The probability of an event happening given that some event has already occurred. For example the chance of a person committing fraud is much greater given that the person had previously committed fraud
Confidence The likelihood of the predicted outcome, given that the rule has been satisfied.
convergence of random variables a sequence of essentially random or unpredictable events can sometimes be expected to settle down into a behaviour that is essentially unchanging when items far enough into the sequence are studied
correlation number that describes the degree of relationship between two variables
Coverage A number that represents either the number of times that a rule can be applied or the percentage of times that it can be applied
Cross-validation The process of holding aside some training data which is not used to build a predictive model and to later use that data to estimate the accuracy of the model on unseen data simulating the real world deployment of the model.
Data Mining Process Define the problem. Select the data. Prepare the data. Mine the data. Deploy the model. Take business action.
Discrete Fourier Transform Concentrates energy in first few coefficients
Entropy A measure often used in data mining algorithms that measures the disorder of a set of data
Error Rate A number that reflects the rate of errors made by a predictive model. It is one minus the accuracy
Expectation–maximization algorithm for estimating parameters where there exist significant missing or inferred values
Expectation-Maximization (EM) Solves estimation with incomplete data. Iteratively use estimates for missing data and continue until convergence
Expert System A data processing system comprising a knowledge base (rules), an inference (rules) engine, and a working memory
Exploratory Data Analysis The processes and techniques for general exploration of data for patterns in preparation for more directed analysis of the data
Factor Analysis A statistical technique which seeks to reduce the number of total predictors from a large number to only a few “factors” that have the majority of the impact on the predicted outcome.
Fuzzy Logic A system of logic based on the fuzzy set theory
Fuzzy Set A set of items whose degree of membership in the set may range from 0 to 1
Fuzzy System A set of rules using fuzzy linguistic variables described by fuzzy sets and processed using fuzzy logic operations
Genetic Algorithm Optimization techniques that use processes such as generic combination, mutation, and natural selection in a design based on the concepts of  revolution
Genetic Operator An operation on the population member strings in a genetic algorithm which are used to produce new strings
Gini Index A measure of the disorder reduction caused by the splitting of data in a decision tree algorithm. Gini and the entropy metric are the most popular ways of selected predictors in the CART decision tree algorithm
Hebbian Learning One of the simplest and oldest forms of training a neural network. It is loosely based on observations of the human brain. The neural net link weights are strengthened between any nodes that are active at the same time.
Hill Climbing A simple optimization technique that modifies a proposed solution by a small amount and then accepts it if it is better than the previous solution. The technique can be slow and suffers from being caught in local optima
Hypothesis Testing The statistical process of proposing a hypothesis to explain the existing data and then testing to see the likelihood of that hypothesis being the explanation
ID3 Decision Tree algorithm
Intelligent Agent A software application which assists a system or a user by automating a task. Intelligent agents must recognize events and use domain knowledge to take appropriate actions based on those events.
Itemset An itemset is any combination of two or more items in a transaction
Jackknife Estimate estimate of parameter is obtained by omitting one value from the set of observed values. Allows you to examine the impact of outliers.
Kernel a function that transforms the input data to a high-dimensional space where the problem is solved
k-Nearest Neighbor A data mining technique that performs prediction by finding the prediction value of records (near neighbors) similar to the record to be predicted
Kohonen Network A type of neural network where locality of the nodes learn as local neighborhoods and locality of the nodes is important in the training process. They are often used for clustering
Latent variable variables inferred from a model rather than observed
Lift A number representing the increase in responses from a targeted marketing application using a predictive model over the response rate achieved when no model is used
Machine Learning A field of science and technology concerned with building machines that learn. In general it differs from Artificial Intelligence in that learning is considered to be just one of a number of ways of creating an artificial intelligence
maximum likelihood method for estimating the parameters of a model
Maximum Likelihood Estimate (MLE) Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model. Joint probability for observing the sample data by multiplying the individual probabilities.
Mean Absolute Error AVG(ABS(predicted_value – actual_value))
Mean Squared Error (MSE) expected value of the squared difference between the estimate and the actual value
Memory-Based Reasoning (MBR) A technique for classifying records in a database by comparing them with similar records that are already classified. A form of nearest neighbor classification.
Minimum Description Length (MDL) Principle The idea that the least complex predictive model (with acceptable accuracy) will be the one that best reflects the true underlying model and performs most accurately on new data.
Model A description that adequately explains and predicts relevant data but is generally much smaller than the data itself
Neural Network A computing model based on the architecture of the brain. A neural network consists of multiple simple processing units connected by adaptive weights
Nominal Categorical Predictor A predictor that is categorical (finite cardinality) but where the values of the predictor have no particular order. For example, red, green, blue as values for the predictor “eye color”.
Ordinal Categorical Predictor A categorical predictor (i.e. has finite number of values) where the values have order but do not convey meaningful intervals or distances between them. For example the values high, middle and low for the income predictor
Outlier Analysis A type of data analysis that seeks to determine and report on records in the database that are significantly different from expectations. The technique is used for data cleansing, spotting emerging trends and recognizing unusually good or bad performers
overfitting The effect in data analysis, data mining and biological learning of training too closely on limited available data and building models that do not generalize well to new unseen data. At the limit, overfitting is synonymous with rote memorization where no generalized model of future situations is built
Point Estimation estimate a population parameter. May be made by calculating the parameter for a sample. May be used to predict value for missing data.
Predictive model model created or used to perform prediction. In contrast to models created solely for pattern detection, exploration or general organization of the data
Predictor The column or field in a database that could be used to build a predictive model to predict the values in another field or column. Also called variable, independent variable, dimension, or feature.
Principle Component Analysis A data analysis technique that seeks to weight the importance of a variety of predictors so that they optimally discriminate between various possible predicted outcomes
Prior Probability The probability of an event occurring without dependence on (conditional to) some other event. In contrast to conditional probability
Purity/Homogeneity the degree to which the resulting child nodes are made up of cases with the same target value
Radial Basis Function Networks Neural networks that combine some of the advantages of neural networks with those of nearest neighbor techniques. In radial basis functions the hidden layer is made up of nodes that represent prototypes or clusters of records
Receiver Operating Characteristic (ROC) The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with unbalanced target distribution (one target class dominates the other).
Regression A data analysis technique classically used in statistics for building predictive models for continuous prediction fields. The technique automatically determines a mathematical equation that minimizes some measure of the error between the prediction from the regression model and the actual data
Reinforcement Learning A training model where an intelligence engine (e.g. neural network) is presented with a sequence of input data followed by a reinforcement signal
Root Mean Squared Error SQRT(AVG((predicted_value – actual_value) * (predicted_value – actual_value)))
Sampling The process by which only a fraction of all available data is used to build a model or perform exploratory analysis. Sampling can provide relatively good models at much less computational expense than using the entire database
Segmentation The process or result of the process that creates mutually exclusive collections of records that share similar attributes either in unsupervised learning (such as clustering) or in supervised learning for a particular prediction field
Sensitivity Analysis The process which determines the sensitivity of a predictive model to small fluctuations in predictor value. Through this technique end users can gauge the effects of noise and environmental change on the accuracy of the model
Simulated Annealing An optimization algorithm loosely based on the physical process of annealing metals through controlled heating and cooling
Sparsity This means that a high proportion of the nested rows are not populated.
Statistical Independence The property of two events displaying no causality or relationship of any kind. This can be quantitatively defined as occurring when the product of the probabilities of each event is equal to the probability of the both events occurring
Stepwise Regression Automated Regressions to identify most predictive variables.  1st regression finds most predictive, 2nd regression finds most predictive given 1st regression.
Supervised Algorithm A class of data mining and machine learning applications and techniques where the system builds a model based on the prediction of a well defined prediction field. This is in contrast to unsupervised learning where there is no particular goal aside from pattern detection.
Support The relative frequency or number of times a rule produced by a rule induction system occurs within the database. The higher the support the better the chance of the rule capturing a statistically significant pattern.
Term Definition
Time-Series Prediction The process of using a data mining tool (e.g., neural networks) to learn to predict temporal sequences of patterns, so that, given a set of patterns, it can predict a future value
Unsupervised Algorithm A data analysis technique whereby a model is built without a well defined goal or prediction field. The systems are used for exploration and general data organization. Clustering is an example of an unsupervised learning system
Visualization Graphical display of data and models which helps the user in understanding the structure and meaning of the information contained in them


This overview of data mining terms is part of a publication, “Dictionary of Data Mining Terms” due out in publication in November 2013 by Don Krapohl.  This post does not use any content from, but acknowledges a similar work by Dr. Vincent Granville at http://www.analyticbridge.com/profiles/blogs/2004291:BlogPost:223153, also containing a significant number of data mining terms.

To foster the study of the structure and dynamics of Web traffic networks, Indiana University has made available a large dataset (‘Click Dataset’) of about 53.5 billion HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network rather than relying on server logs and browser instrumentation allows one to examine large volumes of traffic data while minimizing biases associated with other data sources. It also provides one with valuable referrer information to reconstruct the subset of the Web graph actually traversed by users. The goal is to develop a better understanding of user behavior online and create more realistic models of Web traffic. The potential applications of this data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

The data was generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University. This filter matched all traffic destined for TCP port 80. A long-running collection process used the pcap library to gather these packets, then applied a small set of regular expressions to their payloads to determine whether they contained HTTP GET requests. If a packet did contain a request, the collection system logged a record with the following fields:

  • a timestamp
  • the requested URL
  • the referring URL
  • a boolean classification of the user agent (browser or bot)
  • a boolean flag for whether the request was generated inside or outside IU.

Some important notes:

  1. Traffic generated outside IU only includes requests from outside IU for pages inside IU. Traffic generated inside IU only includes requests from people at IU (about 100,000 users) for resources outside IU. These two sets of requests have very different sampling biases.
  2. No distinguishing information about the client system was retained: no MAC or IP addresses nor any unique index were ever recorded.
  3. There was no attempt at stream reassembly, and server responses were not analyzed.

During collection, the system generated data at a rate of about 60 million requests per day, or about 30 GB/day of raw data. The data was collected between Sep 2006 and May 2010. Data is missing for about 275 days. The dataset has two collections:

  1. raw: About 25 billion requests, where  only the host name of the referrer is retained. Collected between 26 Sep 2006 and 3 Mar 2008; missing 98 days of data, including the entire month of Jun 2007. Approximately 0.85 TB, compressed.
  2. raw-url: About 28.6 billion requests, where the full referrer URL is retained. Collected between 3 Mar 2008 and 31 May 2010; missing 179 days of data, including the entire months of Dec 2008, Jan 2009, and Feb 2009. Approximately 1.5 TB, compressed.

The dataset is broken into hourly files. The initial line of each file has a set of flags that can be ignored. Each record looks like this:

 XXXXADreferrer  host  path

where XXXX is the timestamp (32-bit Unix epoch in seconds, in little endian order), A is the user-agent flag (“B” for browser or “?” for other, including bots), D is the direction flag (“O” for external traffic to IU, “I” for internal traffic to outside IU), referrer is the referrer hostname or URL (terminated by newline), host is the target hostname (terminated by newline), and path is the target path (terminated by newline).

The Click Dataset is large (~2.5 TB compressed), which requires that it be transferred on a physical hard drive. You will have to provide the drive as well as pre-paid return shipment. Additionally,  the dataset might potentially contain bits of stray personal data. Therefore you will have to sign a data security agreement. Indiana University require that you follow these instructions to request the data.

Citation information and FAQs are available on the team’s page at http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset .

I’ve started a meetup for local professionals in the decision science field around the Tampa Bay area to come together and learn about what’s happening in our area.  If you are a data science professional, come join us and be a part of making the Tampa-St. Petersburg metro area the southeast center of excellence in big data and analytics.  Visit http://www.meetup.com/Analytics-Professionals-of-Tampa/ to find events and to join.  I hope to see you there.

View agency activity clustering on geography in Excel using Excel Data Mining Add-ins

By Don Krapohl

1.       Ensure you have downloaded the Excel Data Mining Add-ins from Microsoft at http://www.microsoft.com/en-us/download/details.aspx?id=35578 .  The article assumes you have a working version of the DM Addins and a default Analysis Services (SSAS) instance defined.  Search for getting started with SQL Server Data Mining Add-ins for Excel if you are not familiar with this process.

2.       Open the Excel sample file for Federal contract acquisitions in Wyoming (2012) from https://augmentedintel.com/content/datasets/government_contracts_data_mining_addins.xlsx

3.       On the wy_data_feed tab, select all the data.

4.       In the Home tab on the ribbon in the Styles section select “Format as Table”.  Pick any format you wish.

5.       A new tab will appear on the ribbon for Table Tools with menus for Analyze and Design as below.

Microsoft Excel Table Tools menu
Table tools to format data as a table in Excel


6.       On the Analyze menu, select “Detect Categories”.  This is will group (cluster) your information on common attributes, particular commonalities that are not obvious or immediately observable.

7.       Deselect all checkboxes except the following:

a.       Dollars Obligated

b.      Award Type

c.       Contract Pricing

d.      Funding Agency

e.      Product Or Service Code

f.        Category

8.       Click ‘run’

9.       The output will show you categories of information showing strong affinities.  Explore the model by filtering the charts and tables by the category/ies generated.  Do this by selecting the filter icon (funnel) next to Category on the table or the Category label at the lower left of the graph.

10.   Interesting information may be derived from the groups with fewer rows that may show particularly interesting correlations for a targeted campaign.  For example, filter the table and chart on Category 6.  This group indicates a group affinity for the attribute values ProductOrServiceCode = “REFRIGERATION AND AIR CONDITIONING COMPONENTS”, fundingAgency = “Veterans Affairs, Department Of”, and a contract award value of $61,148 to $1,173,695 as shown below:


Importance of data categories in Excel Data Mining Add-ins
Factor Analysis in Microsoft Excel

For my organization’s business development activities, if I am in the heating and air business I may elect to focus efforts on medium-sized contracts with Veterans Affairs.

My Google+

Artificial Intelligence for the Creation of Competitive Intelligence Tools


Often in prioritizing business development activities it is helpful to determine who is able to influence a decision and how they are related to those in the market space.  To make a defensible and actionable strategy it is useful to perform Influence Analysis and Network Analysis, which can form the kernel of a competitive intelligence analysis strategy.  The data required for analysis must be obtained by identifying and extracting target attribute values in unstructured and often very large (multi-terabyte or petabyte) data stores.  This necessitates a scalable infrastructure, distributed parallel computing capability, and fit-for-use natural language processing algorithms.  Herein I will demonstrate a target logical architecture and methodology for accomplishing the task.  Influence and Network analysis by machine learning algorithm (naïve bayes or perceptron for example) will be covered in a later supporting article.

Recognizing Significance

Named-Entity Recognition is required for unstructured content extraction in this scenario.  This identification scheme may or may not employ stemming but will always require tokenizing, part-of-speech tagging, and the acquisition of a predefined model of attribute patterns to properly recognize and extract required metadata.  A powerful platform with these built-in capabilities is the Apache openNLP project, which includes typed attribute models for the name finder, an extensible name finder algorithm, an API that exposes a Lucene index consumer, and a scalable, distributed architecture.  The Apache Stanbol project in the incubator (http://stanbol.apache.org/) shows promise at semantic-based extraction and content enhancement but hasn’t been promoted outside the incubator yet.

Apache openNLP attribute recognition models are available in only a few languages with the original and largest being English.  The community publishes models in English for the Name Finder interface for dates, location, money, organization, percentage, person, and time (date).  Each is an appropriate candidate for term extraction for competitive intelligence analysis.

Logical Architecture

Natural Language Processing for Competitive Intelligence
openNLP in four node Hadoop cluster

The controlling requirement for the task of metadata extraction from massive datasources is the processing of massive datasets to extract information.  For this Hadoop provides a flexible, fault-tolerant framework and processing model that readily supports the natural language processing needs.  The logical architecture for a small (<1TB) 4-node clustered Hadoop solution is as follows:


Process Flow

As below, the process to execute is standardized on the map/reduce patterns Distributed Task Execution, Union, Selection, and Intersection.  Pre-processing using a Graph Processing pattern in a distinctly separate map phase would likely hasten any Influence Analysis to be performed post-process.


Operations Sequence Diagram of openNLP with Map Reduce on Hadoop for Competitive Intelligence
Multi-node Sequence Diagram for openNLP with Map Reduce on Hadoop

The primary namenode initiates work and passes the data and map/reduce execution program to the task trackers, who in turn distribute it among worker nodes.  The worker nodes execute the map on HDFS-stored data, provide health and status to the task tracker, who reports it to the primary namenode.  On node map completion the primary namenode may redistribute map work to the worker node or order the reduce task, each by way of the task tracker.  The reduce task selects data from the HDFS interim resultset, aggregates, and streams to a result file.  The result file is then used later for analysis by the machine learning algorithm of choice.

File Structures

The input file is of a machine-readable ASCII text type and is unstructured.  Example:


From: Amir Soofi

Sent: Thursday, December 06, 2012 2:37 AM

To: Aaron Macarthur; Hugo Cruz

Cc: Donald Krapohl

Subject: RE: Language Comparison




FYI, Rick Marshall unofficially approved a 3-day trip for one person from the Enterprise team down to Jacksonville, FL to assist in the catalog reinstall.


I’ll be placing it in the travel portal soon for the official process, so that the option becomes officially available to us.


I think together we’ll be able to push through the environment differences better in person than over the phone.


Let us know whether your site can even accommodate a visitor, and when you’d like to exercise this option.




Amir Soofi


Principal Software Engineer, Enterprise



The output of the openNLP Name Find algorithm map task on this input:

From: <namefind/person>Amir Soofi</namefind/person>

Sent: <namefind/date>Thursday, December 06, 2012 2:37 AM</namefind/date >

To: <namefind/person>Aaron Macarthur</namefind/person>; <namefind/person>Hugo Cruz</namefind/person>

Cc: <namefind/person>Donald Krapohl</namefind/person>

Subject: RE: Language Comparison




FYI, <namefind/person>Rick Marshall</namefind/person> unofficially approved a 3-day trip starting <namefind/date>14 November</namefind/date> for one person from the Enterprise team down to <namefind/location>Jacksonville, FL</namefind/location> to assist in the catalog reinstall.


I’ll be placing it in the travel portal soon for the official process, so that the option becomes officially available to us.


I think together we’ll be able to push through the environment differences better in person than over the phone.


Let us know whether your site can even accommodate a visitor, and when you’d like to exercise this option.




<namefind/person>Amir Soofi</namefind/person>


Principal Software Engineer, Enterprise


The output of an example reduce task on this output:

{DocumentUniqueID, EntityKey, EntityType}

{234cba3231, Amir Soofi, Person}

{234cba3231, Thursday, December 06, 2012 2:37 AM, Date}

{234cba3231, Aaron Macarthur, Person}

{234cba3231, Hugo Cruz, Person}

{234cba3231, Donald Krapohl, Person}

{234cba3231, Rick Marshall, Person}

{234cba3231, 14 November, Date}

{234cba3231, Jacksonville/,FL, Location}

{234cba3231, Amir Soofi, Person}


A second reduce pass might yield combinations for network analysis (link strength below being calculated on instances of co-existence across unique documents):

{EntityKey, LinkedEntity, LinkStrength}

{Amir Soofi, Donald Krapohl, 6}

{Amir Soofi, Aaron Macarthur, 15}

{Amir Soofi, Jacksonville/, FL, 1}


The data may then be consumed into the analysis tool of choice, such as RapidMiner, WEKA, PowerPivot, or SQL Server/SQL Server Analysis Services for further analysis.


openNLP on Hadoop can provides good metadata extraction for key information in unstructured data.  The information may be retrieved from competitor websites, SEC filings, Twitter activity, employee social network activity, or many other sources.  The data pre-processing and preparation steps in metadata extraction for competitive intelligence applications can be low relative to that of other analytical problems (contract semantic analysis, social analysis trending, etc.).  The steps outlined in this paper demonstrate a very high-level overview of a logical architecture and key execution activities required to gather metadata for Influence Analysis and Network Analysis for competitive advantage.

My Google+