Below is an example topology for a Big Data development cluster that I’ve actually used for some customers.  It’s composed of 6 Amazon Web Services (AWS) servers, each with a particular purpose.  We have been able to run a full lambda architecture using this topology along with Teiid (for data abstraction) on terabytes of data.  It’s not sufficient for a production cluster but is a good starting point for a development group.  The total cost of this cluster as configured (less storage) is under $6/hour.

Here’s a link to this dev_topology in Excel.


Service Category Server1 Server2 Server3 Server4 Server5 Server6
Cloudera Mgr Cluster Mgt Alert pub Server Host mon Svc Mon Event Svr Act Mon
Zookeeper Infra Server Server Server
YARN Infra Node Mgr Node Mgr JobHist Node Mgr RM/NM
Redis Infra Master Slave Slave
Hive Data Hive server Metastore Hcat
Impala Data App Master Cat Svr Daemon Daemon Daemon
Storm Data Nimbus/UI Supervisor Supervisor Supervisor
Hue UI Server
Pentaho BI UI BI Server
AWS details | Server1 | Server2 | Server3 | Server4 | Server5 | Server6
Name | m3.2xlarge | m3.2xlarge | m3.2xlarge | r3.4xlarge | r3.4xlarge | r3.4xlarge
vCPU | 8 | 8 | 8 | 16 | 16 | 16
Memory (GB) | 30.0 | 30.0 | 30.0 | 122.0 | 122.0 | 122.0
Instance storage (GB) | SSD 2 x 80 | SSD 2 x 80 | SSD 2 x 80 | SSD 1 x 320 | SSD 1 x 320 | SSD 1 x 320
I/O | High | High | High | High | High | High
EBS option | Yes | Yes | Yes | Yes | Yes | Yes
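
To get a feel for the overall capacity of this topology, the per-server figures in the AWS details can be added up.  The short Python sketch below does only that arithmetic; the dictionary layout and the function name are my own illustration and are not part of any AWS tooling.

```python
# Per-instance figures copied from the AWS details above.
SPECS = {
    "m3.2xlarge": {"vcpu": 8, "memory_gb": 30.0, "ssd_gb": 2 * 80},
    "r3.4xlarge": {"vcpu": 16, "memory_gb": 122.0, "ssd_gb": 1 * 320},
}

# Server1-Server3 are m3.2xlarge, Server4-Server6 are r3.4xlarge.
CLUSTER = ["m3.2xlarge"] * 3 + ["r3.4xlarge"] * 3


def cluster_totals(instance_types):
    """Sum vCPU, memory, and instance storage over a list of instance types."""
    totals = {"vcpu": 0, "memory_gb": 0.0, "ssd_gb": 0}
    for instance_type in instance_types:
        for key in totals:
            totals[key] += SPECS[instance_type][key]
    return totals


totals = cluster_totals(CLUSTER)
print(totals)
```

Under these figures the six servers together offer 72 vCPUs, 456 GB of memory, and 1,440 GB of instance SSD storage.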


A list of several sources to learn data science in a hands-on format:

  • The most approachable machine learning course available. And it’s free.
  • Provides data sources, forums, scenarios, and real-world competitions to teach data mining
  • Tutorial on Deep Learning
  • Introduction to machine learning image analysis algorithms
  • Interactive introduction to R Language

Refreshed from my June 2013 post.


A simple but powerful method for structured, transparent decision making within a group is demonstrated using a supplied template and the supporting process.  The approach employs a weighted decision matrix with authoritative attributes, leads to an individual decision outcome, and composes a group decision using basic statistical methods.  The goal is a simple-to-convey means by which to discover factors, results, agreement, and influencers in group decision making.

Download the group decision making template (filled with example data) in:

Excel 2010 format (*.xlsx)

OpenDocument format (*.odf)

Decision Making

A common dilemma in organizations is the need for rapid decision making on complex subjects.  The default unstructured open-dialog approach appears to respect the judgment of the individual while actually putting the possibility of a quick, fit, and defensible solution beyond reach.  Complicating factors such as decision makers’ distribution across multiple locations, differences in controlling standards among group members, and competing organizational interests add further noise to an unstructured process.  Focused analysis of a 10-factor decision with 3 options, for example, requires each individual to evaluate 30 combinations in real time while debating on- and off-topic aspects.

The use of a weighted decision matrix and rudimentary analysis provides a simple tool set for rapid group decision making on complex subjects.  Including strategic-level experts in each area affected, documenting the aspects that impact the decision, and approaching the problem methodically will reduce the time to complete the exercise.  The steps to complete the process also enable instant identification of differences and commonality of opinion for targeted debate.  Use of this method is particularly prescribed when analyzing multiple options and their measures of value across decision spaces such as (not exhaustive):

  • finance, emphasizing ROI, NPV, FV, and Opportunity Cost
  • engineering, focused on technical obstacles, multi-year plans, legacy technology issues, cost of skilled resources, and operational complexity
  • sales, stressing time to market, product features, product relevance, and “seize-the-moment” opportunity capture support
  • business capture decision making and negotiations
  • and portfolio strategy, concerned with SWOT analysis, market trends, and integration into the enterprise ecosystem

Weighted Decision Matrix Concept

A tool commonly used to achieve a quantifiable (but still subjective) answer to any type of question from “What university should I attend?” to “Should I invest in a business intelligence services department for my multi-national company?” is the weighted decision matrix.  Only a few elements are required for analysis:

Factor | Weight | Option 1 Score on Factor | … | Option n Score on Factor
Factor 1 | | | |
Factor 2 | | | |
… | | | |
Factor n | | | |


The required elements to complete the matrix are the list of factors involved in the decision, the weights to apply to each factor, a column for each option under study, and a number representing how well the option positively supports each factor.


Factor – An attribute that supports a positive outcome in the final product.  Examples might be “Uses only open-source software” for a software platform acquisition, “Close to home” for selection of a college, “Regular work hours” for a new job.

Fitness – A simple measure of weight times score indicating the relative strength of an option when measured against a factor.

Weight – The absolute importance of each factor individually.  It is not a ranking or other direct competition between factors.

Option – A potential outcome in the decision.  Options for selecting a university might be “Middle Tennessee State University, University of Tennessee, and University of Virginia”.  Options for a statistical analysis desktop application might be “WEKA, Excel, RapidMiner, and Statistica”.

Option score – The measure of how well the option satisfies the factor.  A factor “Software is open source” might score 0 for Microsoft Excel but 7 for RapidMiner.  The score is determined by the analysis and expert opinion of each decision maker.
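
The definitions above can be tied together in a few lines of Python.  This is only a minimal sketch of one rater's matrix: fitness is weight times score, and an option's total is the sum of its fitness values.  The factor “Runs on the analyst desktop”, all of the weights, and the scores other than the open-source scores mentioned above are made-up values for illustration.

```python
def fitness_matrix(weights, scores):
    """Return {option: {factor: weight * score}} for one rater.

    weights: {factor: weight from 0 to 10}
    scores:  {option: {factor: score from 0 to 10}}
    """
    return {
        option: {factor: weights[factor] * by_factor[factor] for factor in weights}
        for option, by_factor in scores.items()
    }


def option_totals(weights, scores):
    """Sum the fitness values of each option across all factors."""
    matrix = fitness_matrix(weights, scores)
    return {option: sum(row.values()) for option, row in matrix.items()}


# Hypothetical example data for a single rater.
weights = {"Software is open source": 8, "Runs on the analyst desktop": 5}
scores = {
    "Microsoft Excel": {"Software is open source": 0, "Runs on the analyst desktop": 9},
    "RapidMiner": {"Software is open source": 7, "Runs on the analyst desktop": 6},
}

totals = option_totals(weights, scores)
best_option = max(totals, key=totals.get)
print(totals, best_option)
```

With these example numbers Excel totals 45 and RapidMiner totals 86, so this rater's matrix favors RapidMiner.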

The Decision Making Process

It is important to follow the process as closely as possible as it reduces the possibility of biasing the final decision and optimizes the ability to make quick decisions.

Phase I – Group Factor Identification

  1. Identify a single facilitator.  This individual is responsible for keeping the conversations on-point, negotiating the phrasing of the factors when disagreements arise, distributing the final factor list, collecting everyone’s results, and publishing the final results.
  2. Call all required individuals together at one time to define the factors in the decision.  Consider no other aspect of the decision at this point and do not pre-suppose any options.
  3. Make a list of the factors involved in the decision.  Phrase the factors in a way that supports a positive result and is definitive.  “Must add value” for example is not definitive and would not be a good factor as it is too relative.  Likewise, “Does not integrate with our current technology” is negative and will undermine the measurements.
  4. Define the options or courses of action that could be selected.  The options should be well-understood and not vague.  If selecting a school for example a good option might be “Middle Tennessee State University” instead of “A college in Tennessee” as it will not be realistic to rate a broad and non-specific option.
  5. Write the finalized options and factors on the decision matrix template and create one matrix per rater.  The remainder will be performed individually.

Phase II – Individual Scoring

  1. Cover or hide the Options columns.  This is done so the process is not skewed by previewing the results.
  2. Rate the factors beginning with the first factor.  Rate the factor’s importance from zero (low) to ten (high).  Zero indicates the factor is absolutely not important in your individual estimation, ten indicates this factor is absolutely vital.  Any whole number between zero and ten is valid.
  3. Hide the weight column and uncover the score column for the first option only.
  4. Score the option against the factor, zero (low) to ten (high).  If the option is a perfect fit for the factor it may score a ten; if the option provides no support at all for the factor it may score zero.  Any whole number between zero and ten is valid.
  5. Show all the columns and view the results.  A sample result appears in Figure 1.
Individual Weighted Decision Matrix Example for Group Decision making
Figure 1. Weighted Decision Matrix-Individual

Phase III – Facilitator Compiles Results

Copy each individual matrix and paste it into a new tab in the template.  Adjust the formulas in the Results tab to reflect the locations and numbers of the factors and results in each matrix.  A sample Result appears in Figure 2.


Group Decision Making Results
Figure 2. Group Decision Results



The decision making result measures shown in the template are not a complete list of those that could be created.  Further analysis could be performed to determine if one idea such as geographic location or known-concept bias is skewing the results (through language processing and/or a dependency parser if data mining resources are available).  Most of the agreement/disagreement measures rely on standard deviation to show how broadly the data are distributed as well as the data skew to demonstrate if there were individual strong opinions that affected the outcome significantly.

The Final Result is a Compromise

The consensus result is the option with the highest final score.  In the example above option 2 received the highest final score at 632.  This value represents how the group (composed of individual assessments) determined the option fit the requirements (the factors).
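
As a rough illustration of how a final score could be composed, the sketch below simply adds up each rater's option totals and picks the highest.  Summing is an assumption on my part for this example; the formulas in the template's Results tab are the reference.  The rater totals shown are invented numbers, not the figures behind Figure 2.

```python
def group_scores(rater_totals):
    """Add up per-option totals across raters (assumed aggregation: simple sum)."""
    combined = {}
    for totals in rater_totals:
        for option, value in totals.items():
            combined[option] = combined.get(option, 0) + value
    return combined


# Hypothetical option totals from three individual matrices.
rater_totals = [
    {"Option 1": 180, "Option 2": 215, "Option 3": 150},
    {"Option 1": 200, "Option 2": 190, "Option 3": 170},
    {"Option 1": 160, "Option 2": 225, "Option 3": 140},
]

final = group_scores(rater_totals)
consensus = max(final, key=final.get)
print(final, consensus)
```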

Relative Strength of Disagreement

This metric uses the standard deviation of the population (STDEV.P in Excel) to determine how widely distributed the individual scores are.  The notion is that with perfect agreement the standard deviation is zero and the more widely the scores vary the larger this figure and the bar in the Excel cell will be.
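
Outside of Excel the same figure can be computed with Python's standard library, where `statistics.pstdev` is the population standard deviation (the counterpart of STDEV.P).  The sketch below assumes the scores have been gathered per factor and option across raters; the cell keys and score values are hypothetical.

```python
from statistics import pstdev


def disagreement(cells):
    """Population standard deviation of the raters' scores for each cell.

    cells: {(factor, option): [score from each rater]}
    """
    return {key: pstdev(values) for key, values in cells.items()}


# Hypothetical scores from three raters.
cells = {
    ("Cost", "Option 1"): [7, 7, 7],   # perfect agreement
    ("Cost", "Option 2"): [2, 5, 8],   # widely spread opinions
}

spread = disagreement(cells)
print(spread)
```

Perfect agreement gives 0.0, and the spread grows as the raters' scores move apart.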

Disagreement Heat Map

This section displays the degree of contention within the results and uses the values from the Relative Strength of Disagreement section.  The higher the intensity of the red coloration, the greater the degree of disagreement on that element.  The color intensity and range are configurable in the Excel template through the Conditional Formatting tab on the Home ribbon.

Points of Contention

Points of contention show only the few points that are the least agreed upon.
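
One simple way to pick out those points, sketched here in Python, is to take the few cells with the largest disagreement values.  The cutoff of two and the disagreement values themselves are arbitrary examples; the template may select its points differently.

```python
import heapq


def points_of_contention(disagreement, count=2):
    """Return the `count` cells with the largest disagreement, largest first."""
    return heapq.nlargest(count, disagreement.items(), key=lambda item: item[1])


# Hypothetical disagreement values (population standard deviations).
disagreement = {
    ("Cost", "Option 1"): 0.0,
    ("Cost", "Option 2"): 2.4,
    ("Time to market", "Option 1"): 3.1,
    ("Time to market", "Option 2"): 0.8,
}

contention = points_of_contention(disagreement)
print(contention)
```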

Agreement Heat Map

Agreement in this instance is denoted by using the inverse of the disagreement calculation (STDEV.P), or 1/(1+STDEV.P).  The agreement heat map shows the points on which the individual scores most agree and can be set aside in negotiation.
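
The agreement measure described above, 1/(1+STDEV.P), is easy to reproduce.  In this short Python sketch `statistics.pstdev` stands in for Excel's STDEV.P and the score lists are invented.

```python
from statistics import pstdev


def agreement(values):
    """1 / (1 + population standard deviation): 1.0 means perfect agreement."""
    return 1.0 / (1.0 + pstdev(values))


full_agreement = agreement([7, 7, 7])
weak_agreement = agreement([2, 5, 8])
print(full_agreement, weak_agreement)
```

Adding 1 to the denominator keeps the measure defined when the standard deviation is zero and bounds it between 0 and 1.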

Optimistic/Pessimistic Disagreement

The degree of optimism or pessimism in this case is based on the skew (non-parametric skew to be accurate) of the data.  That is, if the mean (average) is higher than the median the data is skewed RIGHT and a few individual, strong positive opinions weighed heavily on the outcome.  Likewise, if the skew is LEFT (the mean is lower than the median) there existed strong negative sentiment that had a disproportionate influence on the outcome.
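
The non-parametric skew is commonly defined as (mean − median) / standard deviation.  The Python sketch below uses that definition with the population standard deviation; returning 0.0 when every score is identical is my own convention to avoid dividing by zero, and the score lists are made up.

```python
from statistics import mean, median, pstdev


def nonparametric_skew(values):
    """(mean - median) / population standard deviation.

    Positive: the mean sits above the median (right skew).
    Negative: the mean sits below the median (left skew).
    """
    spread = pstdev(values)
    if spread == 0:
        return 0.0  # assumed convention: identical scores have no skew
    return (mean(values) - median(values)) / spread


right_skew = nonparametric_skew([2, 2, 3, 3, 10])  # mean 4, median 3
left_skew = nonparametric_skew([1, 8, 8, 9, 9])    # mean 7, median 8
no_skew = nonparametric_skew([5, 5, 5])
print(right_skew, left_skew, no_skew)
```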

Optimistic/Pessimistic support of the final score

As with the individual item optimism or pessimism, negative opinion dragging on the consensus is indicated by a down arrow; positive support for an opinion overinflating the result for that option is indicated by an up arrow.


Decision making by a group may be reached quickly and transparently through a structured process of analysis.  Individual weighted decision matrices, coalesced and analyzed with a simple process, can quantify group assessment of approaches to a problem, provide a means by which to discover and de-conflict individual interests, and demonstrate when individually-held strong opinions are influencing a decision.  Further advancement of this technique through workflow automation to gather inputs, master factor lists for global factor analysis, decision trending, and term extraction for content analysis would add dimensions for the broader Enterprise and provide data for a supervised model of the success or failure rates of decision outcomes.

Keywords: decision making, group decision making, group consensus building, structured decision making

Named Entity Models

Research labs and product teams intent on building upon openNLP and SOLR (which can consume an openNLP NameFinder model) frequently find it important to generate their own model parser or model builder classes.  openNLP has in-built capabilities for this but in the case of custom parsers the structure of the openNLP NameFinder model must be known.

The NameFinder model is defined by the GISModel class, which extends AbstractModel; the definition and the interfaces exposed can be found in the openNLP API docs on the Apache site.  The structure, shown below, is composed of an indicator of model type, a correction constant, model outcomes, and model predicates.  Models for NameFinder can be downloaded free from the openNLP project and are trained against generic corpora.

openNLP NameFinder Model Structure

  1. The type identifier, GIS (literal)
  2. The model correction constant (int)
  3. Model correction constant parameter (double)
  4. Outcomes
    1. The number of outcomes (int)
    2. The outcome names (string array, length of which is specified in 4.1. above)
  5. Predicates
    1. Outcome patterns
      1. The number of outcome patterns (int)
      2. The outcome pattern values (each stored in a space delimited string)
    2. The predicate labels
      1. The number of predicates (int)
      2. The predicate names (string array, length of which is specified in 5.2.1. above)
    3. Predicate parameters (double values)
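
To make the layout above concrete, here is a minimal Python sketch of a reader for it.  It rests on several assumptions that go beyond what is listed: the model is available as plain text with one value per line (the downloadable models are packaged binary files, so this does not read them directly), the first integer of each outcome pattern is the number of predicates that share the pattern, the remaining integers are outcome indexes, and the parameters are stored predicate by predicate in pattern order.  The class and function names are hypothetical and are not part of the openNLP API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GisModel:
    correction_constant: int
    correction_param: float
    outcomes: List[str]
    outcome_patterns: List[List[int]]
    predicates: List[str]
    parameters: Dict[str, Dict[str, float]]  # predicate -> outcome -> value


def parse_gis_lines(lines: List[str]) -> GisModel:
    """Parse the structure listed above from one-value-per-line text."""
    values = iter(line.strip() for line in lines)

    if next(values) != "GIS":                        # 1. type identifier
        raise ValueError("not a GIS model")
    correction_constant = int(next(values))          # 2. correction constant
    correction_param = float(next(values))           # 3. correction parameter

    outcome_count = int(next(values))                # 4.1
    outcomes = [next(values) for _ in range(outcome_count)]       # 4.2

    pattern_count = int(next(values))                # 5.1.1
    patterns = [[int(token) for token in next(values).split()]   # 5.1.2
                for _ in range(pattern_count)]

    predicate_count = int(next(values))              # 5.2.1
    predicates = [next(values) for _ in range(predicate_count)]  # 5.2.2

    # 5.3: assumed layout, one value per (predicate, outcome in its pattern).
    parameters: Dict[str, Dict[str, float]] = {}
    remaining = iter(predicates)
    for pattern in patterns:
        shared_by, outcome_ids = pattern[0], pattern[1:]
        for _ in range(shared_by):
            predicate = next(remaining)
            parameters[predicate] = {
                outcomes[index]: float(next(values)) for index in outcome_ids
            }

    return GisModel(correction_constant, correction_param, outcomes,
                    patterns, predicates, parameters)


# Tiny synthetic model, invented for illustration only.
SAMPLE = ["GIS", "1", "0.0",
          "2", "person-start", "other",
          "2", "1 0 1", "1 1",
          "2", "w=john", "w=the",
          "0.9", "-0.4", "0.7"]

model = parse_gis_lines(SAMPLE)
print(model.parameters)
```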

Many interesting questions came up in the NIST Definitions and Taxonomy Big Data group meeting today. Brilliant minds are hard at work to stabilize the language around Big Data, but some fundamental questions have been posed that the marketplace seems to believe we have already solved.

  • How do we differentiate Big Data from traditional big data like sensor feeds, credit card processing, and financial transactions? What makes it different? One noted professional taxonomist asserted that a basic differentiator may exist in the variability and variety of data.
  • Has data lifecycle changed with BD? The subgroup lead Nancy Grady made a compelling argument that the position of storage in the workstream may be of interest. She pointed out that traditional decision support transforms and stores the data prior to analysis, whereas the Big Data paradigm frequently stores data raw and applies structure later (schema on read).
  • Should there be an obsolescence characteristic attached to data definitions? Ubiquitous sensors (The Internet of Things) may present disposable data with immediate obsolescence, while climate monitoring sensors only provide value at a future date.
  • Data cleanliness may be less important than in traditional BI.
  • Are there certain enablers to Big Data that should be assumed in planning such as (perhaps) cloud computing?

At this point it is obvious there is no consensus on these questions, but what do we as a community of practitioners think about these questions?


The working group meetings are highly compelling and I encourage anyone who wishes to become involved to go to the group site,

In 2013-2014 I was honored to serve as a co-chair of the National Institute of Standards and Technology (NIST) Big Data Working Group’s Big Data Reference Architecture sub-group.  The artifacts we produced are at .  These include extensive use cases that cross all industry verticals and technical domains and Enterprise Reference Architectures relevant to any large organization.

The US National Institute of Standards and Technology (NIST) kicked off their Big Data Working Group on June 19th 2013.  The sessions have now been broken down into subgroups for Definitions, Taxonomies, Reference Architecture, and Technology Roadmap.  The charter for the working group:

NIST is leading the development of a Big Data Technology Roadmap. This roadmap will define and prioritize requirements for interoperability, reusability, and extendibility for big data analytic techniques and technology infrastructure in order to support secure and effective adoption of Big Data. To help develop the ideas in the Big Data Technology Roadmap, NIST is creating the Public Working Group for Big Data.

Scope: The focus of the NBD-WG is to form a community of interest from industry, academia, and government, with the goal of developing a consensus definitions, taxonomies, reference architectures, and technology roadmap which would enable breakthrough discoveries and innovation by advancing measurement science, standards, and technology in ways that enhance economic security and improve quality of life. Deliverables:

  • Develop Big Data Definitions
  • Develop Big Data Taxonomies
  • Develop Big Data Reference Architectures
  • Develop Big Data Technology Roadmap

Target Date: The goal for completion of INITIAL DRAFTs is Friday, September 27, 2013. Further milestones will be developed once the WG has initiated its regular meetings.

Participants: The NBD-WG is open to everyone. We hope to bring together stakeholder communities across industry, academic, and government sectors representing all of those with interests in Big Data techniques, technologies, and applications. The group needs your input to meet its goals so please join us for the kick-off meeting and contribute your ideas and insights.

Meetings: The NBD-WG will hold weekly meetings on Wednesdays from 1300 – 1500 EDT (unless announced otherwise) by teleconference. Please click here for the virtual meeting information. Questions: General questions to the NBD-WG can be addressed to


To participate in helping the US Government in their efforts, sign up at

Predictive analytics tells you what will happen; prescriptive analytics tells you what to do about it.

prescriptive analytics and big data

Decision Support and Analytics have traditionally addressed Descriptive Analytics and Predictive Analytics. Jeff Bertolucci highlights prescriptive analytics, a domain founded on the methods of Operations Research and called by IBM “the final phase” in business analytics.

Link:  Prescriptive Analytics And Big Data: Next Big Thing?

The elements of big data analytics have roots in statistics, knowledge management, and computer science. Many of the data mining terms below appear in these disciplines but may have a different connotation or specialized meaning when applied to our problems. The problems of massive parallel processing and the specialized algorithms employed to perform analysis in a distributed computing environment are enough to require specialized treatment.

Data Mining Terms


Accuracy A measure of a predictive model that reflects the proportionate number of times that the model is correct when applied to data
Bias Difference between expected value and actual value
Cardinality The number of different values a categorical predictor or OLAP dimension can have. High cardinality predictors and dimensions have large numbers of different values (e.g. zip codes), low cardinality fields have few different values (e.g. eye color).
CART Classification and Regression Trees. A type of decision tree algorithm that automates the pruning process through cross validation and other techniques.
CHAID Chi-Square Automatic Interaction Detector. A decision tree that uses contingency tables and the chi-square test to create the tree.
Classification The process of learning to distinguish and discriminate between different input patterns using a supervised training algorithm. Classification is the process of determining that a record belongs to a group.
Cluster Centroid The most typical case in a cluster.  The centroid is a prototype. It does not necessarily describe any given case assigned to the cluster.
Clustering The technique of grouping records together based on their locality and connectivity within the n-dimensional space. This is an unsupervised learning technique.
Collinearity The property of two predictors showing significant correlation without a causal relationship between them
Concentration of Measure The principle that any set of positive probability can be expanded very slightly to contain most of the probability. For example, the average of bounded independent random variables is tightly concentrated around its expectation.
Conditional Probability The probability of an event happening given that some event has already occurred. For example the chance of a person committing fraud is much greater given that the person had previously committed fraud
Confidence The likelihood of the predicted outcome, given that the rule has been satisfied.
Convergence of Random Variables The idea that a sequence of essentially random or unpredictable events can sometimes be expected to settle down into a behaviour that is essentially unchanging when items far enough into the sequence are studied
Correlation A number that describes the degree of relationship between two variables
Coverage A number that represents either the number of times that a rule can be applied or the percentage of times that it can be applied
Cross-validation The process of holding aside some training data which is not used to build a predictive model and to later use that data to estimate the accuracy of the model on unseen data simulating the real world deployment of the model.
Data Mining Process Define the problem. Select the data. Prepare the data. Mine the data. Deploy the model. Take business action.
Discrete Fourier Transform Concentrates energy in first few coefficients
Entropy A measure often used in data mining algorithms that measures the disorder of a set of data
Error Rate A number that reflects the rate of errors made by a predictive model. It is one minus the accuracy
Expectation-Maximization (EM) An algorithm for estimating parameters where there exist significant missing or inferred values. It solves estimation with incomplete data by iteratively using estimates for the missing data and continuing until convergence
Expert System A data processing system comprising a knowledge base (rules), an inference (rules) engine, and a working memory
Exploratory Data Analysis The processes and techniques for general exploration of data for patterns in preparation for more directed analysis of the data
Factor Analysis A statistical technique which seeks to reduce the number of total predictors from a large number to only a few “factors” that have the majority of the impact on the predicted outcome.
Fuzzy Logic A system of logic based on the fuzzy set theory
Fuzzy Set A set of items whose degree of membership in the set may range from 0 to 1
Fuzzy System A set of rules using fuzzy linguistic variables described by fuzzy sets and processed using fuzzy logic operations
Genetic Algorithm Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution
Genetic Operator An operation on the population member strings in a genetic algorithm which are used to produce new strings
Gini Index A measure of the disorder reduction caused by the splitting of data in a decision tree algorithm. Gini and the entropy metric are the most popular ways of selecting predictors in the CART decision tree algorithm
Hebbian Learning One of the simplest and oldest forms of training a neural network. It is loosely based on observations of the human brain. The neural net link weights are strengthened between any nodes that are active at the same time.
Hill Climbing A simple optimization technique that modifies a proposed solution by a small amount and then accepts it if it is better than the previous solution. The technique can be slow and suffers from being caught in local optima
Hypothesis Testing The statistical process of proposing a hypothesis to explain the existing data and then testing to see the likelihood of that hypothesis being the explanation
ID3 Decision Tree algorithm
Intelligent Agent A software application which assists a system or a user by automating a task. Intelligent agents must recognize events and use domain knowledge to take appropriate actions based on those events.
Itemset An itemset is any combination of two or more items in a transaction
Jackknife Estimate An estimate of a parameter obtained by omitting one value from the set of observed values. Allows you to examine the impact of outliers.
Kernel A function that transforms the input data to a high-dimensional space where the problem is solved
k-Nearest Neighbor A data mining technique that performs prediction by finding the prediction value of records (near neighbors) similar to the record to be predicted
Kohonen Network A type of neural network where the nodes learn as local neighborhoods and the locality of the nodes is important in the training process. They are often used for clustering
Latent Variable A variable inferred from a model rather than observed
Lift A number representing the increase in responses from a targeted marketing application using a predictive model over the response rate achieved when no model is used
Machine Learning A field of science and technology concerned with building machines that learn. In general it differs from Artificial Intelligence in that learning is considered to be just one of a number of ways of creating an artificial intelligence
Maximum Likelihood A method for estimating the parameters of a model
Maximum Likelihood Estimate (MLE) The parameter estimates that maximize the probability that the sample data occurs for the specific model. The joint probability of observing the sample data is obtained by multiplying the individual probabilities.
Mean Absolute Error AVG(ABS(predicted_value – actual_value))
Mean Squared Error (MSE) expected value of the squared difference between the estimate and the actual value
Memory-Based Reasoning (MBR) A technique for classifying records in a database by comparing them with similar records that are already classified. A form of nearest neighbor classification.
Minimum Description Length (MDL) Principle The idea that the least complex predictive model (with acceptable accuracy) will be the one that best reflects the true underlying model and performs most accurately on new data.
Model A description that adequately explains and predicts relevant data but is generally much smaller than the data itself
Neural Network A computing model based on the architecture of the brain. A neural network consists of multiple simple processing units connected by adaptive weights
Nominal Categorical Predictor A predictor that is categorical (finite cardinality) but where the values of the predictor have no particular order. For example, red, green, blue as values for the predictor “eye color”.
Ordinal Categorical Predictor A categorical predictor (i.e. has finite number of values) where the values have order but do not convey meaningful intervals or distances between them. For example the values high, middle and low for the income predictor
Outlier Analysis A type of data analysis that seeks to determine and report on records in the database that are significantly different from expectations. The technique is used for data cleansing, spotting emerging trends and recognizing unusually good or bad performers
Overfitting The effect in data analysis, data mining and biological learning of training too closely on limited available data and building models that do not generalize well to new unseen data. At the limit, overfitting is synonymous with rote memorization where no generalized model of future situations is built
Point Estimation An estimate of a population parameter. May be made by calculating the parameter for a sample. May be used to predict a value for missing data.
Predictive Model A model created or used to perform prediction. In contrast to models created solely for pattern detection, exploration or general organization of the data
Predictor The column or field in a database that could be used to build a predictive model to predict the values in another field or column. Also called variable, independent variable, dimension, or feature.
Principal Component Analysis A data analysis technique that seeks to weight the importance of a variety of predictors so that they optimally discriminate between various possible predicted outcomes
Prior Probability The probability of an event occurring without dependence on (conditional to) some other event. In contrast to conditional probability
Purity/Homogeneity The degree to which the resulting child nodes are made up of cases with the same target value
Radial Basis Function Networks Neural networks that combine some of the advantages of neural networks with those of nearest neighbor techniques. In radial basis functions the hidden layer is made up of nodes that represent prototypes or clusters of records
Receiver Operating Characteristic (ROC) The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with unbalanced target distribution (one target class dominates the other).
Regression A data analysis technique classically used in statistics for building predictive models for continuous prediction fields. The technique automatically determines a mathematical equation that minimizes some measure of the error between the prediction from the regression model and the actual data
Reinforcement Learning A training model where an intelligence engine (e.g. neural network) is presented with a sequence of input data followed by a reinforcement signal
Root Mean Squared Error SQRT(AVG((predicted_value – actual_value) * (predicted_value – actual_value)))
Sampling The process by which only a fraction of all available data is used to build a model or perform exploratory analysis. Sampling can provide relatively good models at much less computational expense than using the entire database
Segmentation The process or result of the process that creates mutually exclusive collections of records that share similar attributes either in unsupervised learning (such as clustering) or in supervised learning for a particular prediction field
Sensitivity Analysis The process which determines the sensitivity of a predictive model to small fluctuations in predictor value. Through this technique end users can gauge the effects of noise and environmental change on the accuracy of the model
Simulated Annealing An optimization algorithm loosely based on the physical process of annealing metals through controlled heating and cooling
Sparsity This means that a high proportion of the nested rows are not populated.
Statistical Independence The property of two events displaying no causality or relationship of any kind. This can be quantitatively defined as occurring when the product of the probabilities of each event is equal to the probability of both events occurring
Stepwise Regression Automated Regressions to identify most predictive variables.  1st regression finds most predictive, 2nd regression finds most predictive given 1st regression.
Supervised Algorithm A class of data mining and machine learning applications and techniques where the system builds a model based on the prediction of a well-defined prediction field. This is in contrast to unsupervised learning, where there is no particular goal aside from pattern detection.
Support The relative frequency or number of times a rule produced by a rule induction system occurs within the database. The higher the support, the better the chance of the rule capturing a statistically significant pattern.
Time-Series Prediction The process of using a data mining tool (e.g., neural networks) to learn to predict temporal sequences of patterns, so that, given a set of patterns, it can predict a future value.
Unsupervised Algorithm A data analysis technique whereby a model is built without a well-defined goal or prediction field. The systems are used for exploration and general data organization. Clustering is an example of an unsupervised learning system.
Visualization Graphical display of data and models, which helps the user understand the structure and meaning of the information contained in them.
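
Two of the measures defined in this list, Root Mean Squared Error and the area under the ROC curve, are easy to compute by hand on small data. The following is a minimal sketch in Python, not a reference implementation. The function names `rmse` and `auc` are illustrative, and the AUC is computed directly from its pairwise interpretation (the fraction of positive/negative pairs in which the positive case scores higher). Counting a tied pair as half a win is a common convention and an assumption here, not something the definitions above specify.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """SQRT(AVG((predicted_value - actual_value) squared))."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) * (p - a) for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Labels are 1 for an actual positive and 0 for an actual negative.
    Ties count as half a win (assumed convention).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(rmse([3.0, 3.0], [0.0, 0.0]))
    print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration; for large data sets a rank-based calculation would be the usual choice.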


This overview of data mining terms is part of a publication, “Dictionary of Data Mining Terms”, by Don Krapohl, due out in November 2013.  This post does not use any content from, but acknowledges, a similar work by Dr. Vincent Granville that also contains a significant number of data mining terms.

To foster the study of the structure and dynamics of Web traffic networks, Indiana University has made available a large dataset (‘Click Dataset’) of about 53.5 billion HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network rather than relying on server logs and browser instrumentation allows one to examine large volumes of traffic data while minimizing biases associated with other data sources. It also provides one with valuable referrer information to reconstruct the subset of the Web graph actually traversed by users. The goal is to develop a better understanding of user behavior online and create more realistic models of Web traffic. The potential applications of this data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

The data was generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University. This filter matched all traffic destined for TCP port 80. A long-running collection process used the pcap library to gather these packets, then applied a small set of regular expressions to their payloads to determine whether they contained HTTP GET requests. If a packet did contain a request, the collection system logged a record with the following fields:

  • a timestamp
  • the requested URL
  • the referring URL
  • a boolean classification of the user agent (browser or bot)
  • a boolean flag for whether the request was generated inside or outside IU.
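
To make the matching step concrete, here is a minimal sketch of how a single packet payload might be tested for an HTTP GET request with regular expressions. It is not the collection system's actual code: the patterns, the `GetRequest` type, and the `match_get_request` function are all hypothetical, and the timestamp, user-agent classification, and inside/outside flag are left out because the description above does not say how they were derived.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical patterns; the real collection system's expressions are not published here.
REQUEST_LINE = re.compile(rb"^GET (\S+) HTTP/1\.[01]\r\n")
HOST_HEADER = re.compile(rb"\r\nHost:[ \t]*([^\r\n]*)", re.IGNORECASE)
REFERER_HEADER = re.compile(rb"\r\nReferer:[ \t]*([^\r\n]*)", re.IGNORECASE)


class GetRequest(NamedTuple):
    url: bytes                  # host name followed by the requested path
    referrer: Optional[bytes]   # None when no Referer header is present


def match_get_request(payload: bytes) -> Optional[GetRequest]:
    """Return the request fields if the payload starts with an HTTP GET, else None."""
    request = REQUEST_LINE.match(payload)
    if request is None:
        return None
    host = HOST_HEADER.search(payload)
    referer = REFERER_HEADER.search(payload)
    host_name = host.group(1).strip() if host else b""
    referrer = referer.group(1).strip() if referer else None
    return GetRequest(url=host_name + request.group(1), referrer=referrer)


if __name__ == "__main__":
    packet = (
        b"GET /index.html HTTP/1.1\r\n"
        b"Host: example.com\r\n"
        b"Referer: http://example.org/\r\n\r\n"
    )
    print(match_get_request(packet))
```

Because each payload is examined on its own, a request whose headers spill into a second packet would be only partially captured by a matcher like this one.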

Some important notes:

  1. Traffic generated outside IU only includes requests from outside IU for pages inside IU. Traffic generated inside IU only includes requests from people at IU (about 100,000 users) for resources outside IU. These two sets of requests have very different sampling biases.
  2. No distinguishing information about the client system was retained: no MAC or IP addresses nor any unique index were ever recorded.
  3. There was no attempt at stream reassembly, and server responses were not analyzed.

During collection, the system generated data at a rate of about 60 million requests per day, or about 30 GB/day of raw data. The data was collected between Sep 2006 and May 2010. Data is missing for about 275 days. The dataset has two collections:

  1. raw: About 25 billion requests, where only the host name of the referrer is retained. Collected between 26 Sep 2006 and 3 Mar 2008; missing 98 days of data, including the entire month of Jun 2007. Approximately 0.85 TB, compressed.
  2. raw-url: About 28.6 billion requests, where the full referrer URL is retained. Collected between 3 Mar 2008 and 31 May 2010; missing 179 days of data, including the entire months of Dec 2008, Jan 2009, and Feb 2009. Approximately 1.5 TB, compressed.

The dataset is broken into hourly files. The initial line of each file has a set of flags that can be ignored. Each record looks like this:

 XXXXADreferrer
 host
 path

where XXXX is the timestamp (32-bit Unix epoch in seconds, in little endian order), A is the user-agent flag (“B” for browser or “?” for other, including bots), D is the direction flag (“O” for external traffic to IU, “I” for internal traffic to outside IU), referrer is the referrer hostname or URL (terminated by newline), host is the target hostname (terminated by newline), and path is the target path (terminated by newline).
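
The record layout described above can be read with a few lines of Python. This is a minimal sketch under stated assumptions, not code supplied with the dataset: it expects an already-decompressed binary stream, treats the timestamp as unsigned, and decodes text as Latin-1 so that no byte sequence raises an error. The names `ClickRecord` and `read_records` are illustrative.

```python
import io
import struct
from typing import BinaryIO, Iterator, NamedTuple


class ClickRecord(NamedTuple):
    timestamp: int   # Unix epoch, seconds
    agent: str       # "B" = browser, "?" = other (including bots)
    direction: str   # "O" = external traffic to IU, "I" = internal traffic to outside IU
    referrer: str
    host: str
    path: str


def read_records(stream: BinaryIO) -> Iterator[ClickRecord]:
    """Yield records from one decompressed hourly file."""
    stream.readline()  # the initial line holds flags that can be ignored
    while True:
        header = stream.read(6)  # 4 timestamp bytes + agent flag + direction flag
        if len(header) < 6:
            return  # end of file (or a truncated trailing record)
        # "<I" = little-endian unsigned 32-bit integer (unsigned is an assumption)
        timestamp, agent, direction = struct.unpack("<Icc", header)
        # Each text field is terminated by a newline.
        referrer, host, path = (
            stream.readline().rstrip(b"\n").decode("latin-1") for _ in range(3)
        )
        yield ClickRecord(
            timestamp,
            agent.decode("latin-1"),
            direction.decode("latin-1"),
            referrer,
            host,
            path,
        )


if __name__ == "__main__":
    # Synthetic one-record file for illustration only.
    sample = (
        b"flags\n"
        + struct.pack("<I", 1204502400)
        + b"BI"
        + b"example.org\nexample.com\n/index.html\n"
    )
    for record in read_records(io.BytesIO(sample)):
        print(record)
```

The fixed six-byte header is read by length rather than by line because a binary timestamp can itself contain a newline byte, so splitting the whole file on newlines would occasionally cut a record in the wrong place.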

The Click Dataset is large (~2.5 TB compressed), so it must be transferred on a physical hard drive. You will have to provide the drive as well as pre-paid return shipment. Additionally, the dataset may contain bits of stray personal data, so you will have to sign a data security agreement. Indiana University requires that you follow these instructions to request the data.

Citation information and FAQs are available on the team’s page.