The elements of big data analytics have roots in statistics, knowledge management, and computer science. Many of the data mining terms below appear in these disciplines but may have different connotations or specialized meanings when applied to our problems. The problems of massive parallel processing and the specialized algorithms employed to perform analysis in a distributed computing environment are enough to require specialized treatment.

Machine Learning Terms

Term Definition
Accuracy A measure of a predictive model that reflects the proportionate number of times that the model is correct when applied to data
Bias Difference between expected value and actual value
Cardinality The number of different values a categorical predictor or OLAP dimension can have. High cardinality predictors and dimensions have large numbers of different values (e.g. zip codes), low cardinality fields have few different values (e.g. eye color).
CART Classification and Regression Trees. A type of decision tree algorithm that automates the pruning process through cross validation and other techniques.
CHAID Chi-Square Automatic Interaction Detector. A decision tree that uses contingency tables and the chi-square test to create the tree.
Classification The process of learning to distinguish and discriminate between different input patterns using a supervised training algorithm. Classification is the process of determining that a record belongs to a group
Cluster Centroid The most typical case in a cluster. The centroid is a prototype. It does not necessarily describe any given case assigned to the cluster.
Clustering The technique of grouping records together based on their locality and connectivity within the n-dimensional space. This is an unsupervised learning technique.
Collinearity The property of two predictors showing significant correlation without a causal relationship between them
Concentration of Measure The phenomenon that any set of positive probability can be expanded very slightly to contain most of the probability. For example, the average of bounded independent random variables is tightly concentrated around its expectation
Conditional Probability The probability of an event happening given that some event has already occurred. For example the chance of a person committing fraud is much greater given that the person had previously committed fraud
Confidence The likelihood of the predicted outcome, given that the rule has been satisfied.
convergence of random variables a sequence of essentially random or unpredictable events can sometimes be expected to settle down into a behaviour that is essentially unchanging when items far enough into the sequence are studied
correlation number that describes the degree of relationship between two variables
Coverage A number that represents either the number of times that a rule can be applied or the percentage of times that it can be applied
Cross-validation The process of holding aside some training data that is not used to build a predictive model and later using that data to estimate the accuracy of the model on unseen data, simulating the real-world deployment of the model.
Data Mining Process Define the problem. Select the data. Prepare the data. Mine the data. Deploy the model. Take business action.
Discrete Fourier Transform Concentrates energy in first few coefficients
Entropy A measure often used in data mining algorithms that measures the disorder of a set of data
Error Rate A number that reflects the rate of errors made by a predictive model. It is one minus the accuracy
Expectation–maximization algorithm for estimating parameters where there exist significant missing or inferred values
Expectation-Maximization (EM) Solves estimation with incomplete data. Iteratively use estimates for missing data and continue until convergence
Expert System A data processing system comprising a knowledge base (rules), an inference (rules) engine, and a working memory
Exploratory Data Analysis The processes and techniques for general exploration of data for patterns in preparation for more directed analysis of the data
Factor Analysis A statistical technique which seeks to reduce the number of total predictors from a large number to only a few “factors” that have the majority of the impact on the predicted outcome.
Fuzzy Logic A system of logic based on the fuzzy set theory
Fuzzy Set A set of items whose degree of membership in the set may range from 0 to 1
Fuzzy System A set of rules using fuzzy linguistic variables described by fuzzy sets and processed using fuzzy logic operations
Genetic Algorithm Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution
Genetic Operator An operation on the population member strings in a genetic algorithm which are used to produce new strings
Gini Index A measure of the disorder reduction caused by the splitting of data in a decision tree algorithm. Gini and the entropy metric are the most popular ways of selecting predictors in the CART decision tree algorithm
Hebbian Learning One of the simplest and oldest forms of training a neural network. It is loosely based on observations of the human brain. The neural net link weights are strengthened between any nodes that are active at the same time.
Hill Climbing A simple optimization technique that modifies a proposed solution by a small amount and then accepts it if it is better than the previous solution. The technique can be slow and suffers from being caught in local optima
Hypothesis Testing The statistical process of proposing a hypothesis to explain the existing data and then testing to see the likelihood of that hypothesis being the explanation
ID3 Decision Tree algorithm
Intelligent Agent A software application which assists a system or a user by automating a task. Intelligent agents must recognize events and use domain knowledge to take appropriate actions based on those events.
Itemset An itemset is any combination of two or more items in a transaction
Jackknife Estimate estimate of parameter is obtained by omitting one value from the set of observed values. Allows you to examine the impact of outliers.
Kernel a function that transforms the input data to a high-dimensional space where the problem is solved
k-Nearest Neighbor A data mining technique that performs prediction by finding the prediction value of records (near neighbors) similar to the record to be predicted
Kohonen Network A type of neural network in which nodes learn as local neighborhoods, so the locality of the nodes is important in the training process. They are often used for clustering
Latent variable variables inferred from a model rather than observed
Lift A number representing the increase in responses from a targeted marketing application using a predictive model over the response rate achieved when no model is used
Machine Learning A field of science and technology concerned with building machines that learn. In general it differs from Artificial Intelligence in that learning is considered to be just one of a number of ways of creating an artificial intelligence
maximum likelihood method for estimating the parameters of a model
Maximum Likelihood Estimate (MLE) Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model. Joint probability for observing the sample data by multiplying the individual probabilities.
Mean Absolute Error AVG(ABS(predicted_value - actual_value))
Mean Squared Error (MSE) expected value of the squared difference between the estimate and the actual value
Memory-Based Reasoning (MBR) A technique for classifying records in a database by comparing them with similar records that are already classified. A form of nearest neighbor classification.
Minimum Description Length (MDL) Principle The idea that the least complex predictive model (with acceptable accuracy) will be the one that best reflects the true underlying model and performs most accurately on new data.
Model A description that adequately explains and predicts relevant data but is generally much smaller than the data itself
Neural Network A computing model based on the architecture of the brain. A neural network consists of multiple simple processing units connected by adaptive weights
Nominal Categorical Predictor A predictor that is categorical (finite cardinality) but where the values of the predictor have no particular order. For example, red, green, blue as values for the predictor “eye color”.
Ordinal Categorical Predictor A categorical predictor (i.e. has finite number of values) where the values have order but do not convey meaningful intervals or distances between them. For example the values high, middle and low for the income predictor
Outlier Analysis A type of data analysis that seeks to determine and report on records in the database that are significantly different from expectations. The technique is used for data cleansing, spotting emerging trends and recognizing unusually good or bad performers
overfitting The effect in data analysis, data mining and biological learning of training too closely on limited available data and building models that do not generalize well to new unseen data. At the limit, overfitting is synonymous with rote memorization where no generalized model of future situations is built
Point Estimation estimate a population parameter. May be made by calculating the parameter for a sample. May be used to predict value for missing data.
Predictive model model created or used to perform prediction. In contrast to models created solely for pattern detection, exploration or general organization of the data
Predictor The column or field in a database that could be used to build a predictive model to predict the values in another field or column. Also called variable, independent variable, dimension, or feature.
Principal Component Analysis A data analysis technique that transforms a set of possibly correlated predictors into a smaller set of uncorrelated components that capture most of the variance in the data
Prior Probability The probability of an event occurring without dependence on (conditional to) some other event. In contrast to conditional probability
Purity/Homogeneity the degree to which the resulting child nodes are made up of cases with the same target value
Radial Basis Function Networks Neural networks that combine some of the advantages of neural networks with those of nearest neighbor techniques. In radial basis functions the hidden layer is made up of nodes that represent prototypes or clusters of records
Receiver Operating Characteristic (ROC) The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with unbalanced target distribution (one target class dominates the other).
Regression A data analysis technique classically used in statistics for building predictive models for continuous prediction fields. The technique automatically determines a mathematical equation that minimizes some measure of the error between the prediction from the regression model and the actual data
Reinforcement Learning A training model where an intelligence engine (e.g. neural network) is presented with a sequence of input data followed by a reinforcement signal
Root Mean Squared Error SQRT(AVG((predicted_value - actual_value) * (predicted_value - actual_value)))
Sampling The process by which only a fraction of all available data is used to build a model or perform exploratory analysis. Sampling can provide relatively good models at much less computational expense than using the entire database
Segmentation The process or result of the process that creates mutually exclusive collections of records that share similar attributes either in unsupervised learning (such as clustering) or in supervised learning for a particular prediction field
Sensitivity Analysis The process which determines the sensitivity of a predictive model to small fluctuations in predictor value. Through this technique end users can gauge the effects of noise and environmental change on the accuracy of the model
Simulated Annealing An optimization algorithm loosely based on the physical process of annealing metals through controlled heating and cooling
Sparsity This means that a high proportion of the nested rows are not populated.
Statistical Independence The property of two events displaying no causality or relationship of any kind. This can be quantitatively defined as occurring when the product of the probabilities of each event is equal to the probability of both events occurring
Stepwise Regression Automated Regressions to identify most predictive variables.  1st regression finds most predictive, 2nd regression finds most predictive given 1st regression.
Supervised Algorithm A class of data mining and machine learning applications and techniques where the system builds a model based on the prediction of a well defined prediction field. This is in contrast to unsupervised learning where there is no particular goal aside from pattern detection.
Support The relative frequency or number of times a rule produced by a rule induction system occurs within the database. The higher the support the better the chance of the rule capturing a statistically significant pattern.
Time-Series Prediction The process of using a data mining tool (e.g., neural networks) to learn to predict temporal sequences of patterns, so that, given a set of patterns, it can predict a future value
Unsupervised Algorithm A data analysis technique whereby a model is built without a well defined goal or prediction field. The systems are used for exploration and general data organization. Clustering is an example of an unsupervised learning system
Visualization Graphical display of data and models which helps the user in understanding the structure and meaning of the information contained in them
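
The error measures defined in the table (Accuracy, Error Rate, Mean Absolute Error, Mean Squared Error, Root Mean Squared Error) can be sketched in a few lines. This is a minimal illustration that assumes plain Python lists of predicted and actual values; the function names are mine and do not come from any particular library.

```python
import math

def accuracy(predicted, actual):
    """Proportion of predictions that match the actual value."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

def error_rate(predicted, actual):
    """One minus the accuracy."""
    return 1 - accuracy(predicted, actual)

def mean_absolute_error(predicted, actual):
    """AVG(ABS(predicted_value - actual_value))."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def mean_squared_error(predicted, actual):
    """Average of the squared differences."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def root_mean_squared_error(predicted, actual):
    """SQRT of the mean squared error."""
    return math.sqrt(mean_squared_error(predicted, actual))
```

Accuracy and error rate apply to categorical predictions, while the three error measures apply to numeric ones.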


A “Runbook” or “Run Book” by any other name…

We use a runbook template to document what processes occur automatically or on demand. Typically these events are either date-triggered or event-triggered. An example of a date-triggered process might be a scheduled email sent nightly at midnight. An example of an event-triggered process might be to send an email when any disk on a server is 90% full.

Configuration Management

Use this template as a starting point for building your own. It should be a key element of your Configuration Management Database (CMDB) used to document and operate your enterprise.


The simple runbook template below features three tabs. The first is to list all your automated processes–those that occur without assistance. The second is operational process, which captures tasks your team may perform either on demand or based on some event. The last is a lookup tab for dropdown values in the other tabs. You can ignore this.

The tabs are pretty self-explanatory. Fill in each task that your group performs, how often, how long it normally takes, and any accompanying notes. Feel free to add more columns to capture other configuration items such as who to contact in case of failure, email distribution groups who support the process, organization the process supports, etc.


Please note that you are free to use this for any purposes, public or private.  I submit it as a (hopefully) helpful tool back to the community of professionals who have given me the advice and feedback and trialed it in real-world operations.

computer vision training images

Below are many downloadable free machine learning datasets. They cover click data, air traffic control data, surveys, temporal datasets of various types, crime data, employee pay data, map data, law data, and many other types.


I am a huge fan of SPSS, scikit-learn, opennlp, and other mainstream libraries but for quick analysis and visualization don’t forget about Pentaho data mining, based on University of Waikato’s Weka. It can also be used with Pentaho Kettle to submit to a hadoop cluster and perform advanced multi-step analysis.

Public data resources: research-quality, free data mining data sets

All datasets with keywords

Search entries:

UCI Machine Learning Repository


Gene Expression Omnibus (GEO) Main page


Social Science Data

social science

IMDB dataset


Stanford Large Network Dataset Collection

network analysis

Google Books n-gram dataset

books, text mining

Million Song Dataset | scaling MIR research

audio analysis

Belly Button Biodiversity 2.0

health informatics, bioinformatics

Datasets – Modeling Online Auctions


2gb of photos of cats

image analysis, pets, cats

Click Dataset | Center for Complex Networks and Systems Research

web analytics

The Electric Rice Cooker — One year of deleted weibos archive

text mining

Registered meteorites that has impacted on Earth visualized – AnalyticBridge


GeoJSON files for real-time Virginia transportation data.


NYPD Crash Data Band-Aid


State Department of Education datasets

student performance, school demographics, standardized test performance, school quality, education

Open Source Virtual Data Lake

Before data virtualization was well known, I was presented with the scenario of a large company that had purchased several other companies and needed to fuse their data. I had already presumed that I would be introducing the Hadoop ecosystem into the portfolio, because that was a large part of why they had hired me. Rather than take the now-common approach of building a data lake, I proposed data virtualization along with a custom lightweight layer that performed the major functions of a Multidimensional Online Analytical Processing (MOLAP) engine. This let us provide immediate business benefit by using their data in place, while providing an abstraction that allowed us to evolve the systems that held the data. The key requirements of this platform were that it must be easily changed, provide query abilities as good as or better than the underlying data stores, and scale linearly. I recommended an architecture that is still unique within our global domain: the concept of a virtual data lake.

Virtual Online Analytical Processing (VOLAP) design

The product we built in 2014 and continue to evolve is unique in the marketplace with the exception of perhaps Denodo. The key difference is that it uses entirely free and open source technologies that can be swapped for others as needed. Currently we have a small java app, Jboss Teiid (swappable for Apache Calcite), and the many data technologies the company uses–(many built based on my reference architecture template) HBase, MySQL, Oracle, Impala, Hive, MongoDB, Cassandra, SalesForce, and external REST APIs.

Teiid architecture

We also designed our approach to operate as a virtual OLAP (VOLAP) system, an approach that doesn’t currently exist, which adds cost-based routing to make each query run as fast as possible by selecting the fastest data source automatically and executing all possible operations at the data source.  The large tech company now leverages this platform as their enterprise data layer: a virtual data lake that performs as a virtualized hybrid OLAP “database” (VOLAP).

The basic premise
REST APIs are arguably the most universal of data interchange methods, and for this reason we chose a REST API as our interface for any application that needed data. We support JDBC and ODBC as well but discourage their use because we can provide better service through REST (a Swagger UI for exploration, traffic smoothing of queries, circuit breakers, security injection, complex in-flight transformations, etc).

I hold that a business expert (mathematical analyst, account manager, HR manager) need not know the location, context, or formulas to have basic utility of data. They need simple access to data, and they add their expertise over that data to provide business value. To that end, data virtualization provides a universal adapter, and the java application adds a universal language and the business knowledge to the data at query time, automatically, in a way that is easily governed.

How it works — the request sequence
With more in-depth explanation below, here is how a request for data proceeds in its simplest form:
Java app:
REST API -> request parser -> logical field replacement -> data source selection -> query writer ->
Data virtualization:
query splits into multiples if going to multiple data stores -> data store queried -> rows buffered and, if necessary, aggregated -> data streamed back

Injection points exist in our java application that you may consider for such options as plugging in data source selection methods other than cost-based (semantic, quality of service, or another algorithmic approach), circuit breakers (fail if SQL injection is attempted or some business rule is not passed), and traffic smoothing algorithms such as token bucket for DDoS protection.
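
To make the traffic smoothing idea concrete, here is a minimal token bucket sketch. It is written in Python for brevity and is not the code from our java application; the class name, the parameters, and the injectable clock are all assumptions made for illustration.

```python
import time

class TokenBucket:
    """Allow short bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self._clock = clock           # injectable so the bucket can be tested
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if the request may proceed."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A request handler would call `allow()` once per incoming query and reject or queue the request when it returns False. Giving expensive queries a higher `cost` is one way to smooth heavy traffic more aggressively than light traffic.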

Applied example
Assume a web application or spreadsheet wants to determine which customer had the most sales this month in Canada. The application sends a simplified form of the request to a URL that looks very much like a web page using a simplified SQL language:
http://myserver?query=SELECT customer, sales where sale_date between 01/01/2019 and 02/03/2019

  1. receive the request
    This request goes to the java application
  2. extract the logical fields we’ll need to return–customer, sales, and sale_date. These are LOGICAL fields and may or may not correspond to the name of the columns in any of the underlying data storage.
  3. get the formulas for each field. Assume in your underlying data sources you have identified and mapped a series of sources that contain these fields. In this case assume in the data virtualization layer we have set that customer is an attribute column that is just the column “customer_name”, sales is a metric column that corresponds to the formula “sum(amount_spent - tax_collected)” and sale_date is called “date_of_transaction”.
  4. use the metadata from the data virtualization layer to identify all data sources with all of these columns: customer_name, amount_spent, tax_collected, and date_of_transaction. There are four because it requires four columns to get the required data. Use the cost-based (or other plug-in) rule to identify which will be quickest. Lowest cardinality of the underlying fact table is usually a good first guess at the lowest cost (fastest) data source.
  5. write a query to be sent to the virtualization layer that will retrieve these data: select customer_name as customer, sum(amount_spent - tax_collected) as sales from myvirtualtable where date_of_transaction between '01/01/2019' and '02/03/2019' group by customer_name
  6. send query to virtualization layer, which splits the query into several queries if there are multiple data sources involved (data federation).
  7. underlying data store is queried and rows sent back to data virtualization layer. If the underlying data provider can’t perform some part of the query natively (called a pushdown operation), such as summing values, the values are returned to the virtualization layer and performed there.
  8. data are streamed back from the virtualization layer to the java application, which sends them back to the caller
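
Steps 2 through 5 above can be sketched as follows. This is an illustration in Python, not our java implementation: the `LOGICAL_FIELDS` catalog, the `DataSource` class, the use of row count as a stand-in for cost, and the function names are all hypothetical. The GROUP BY clause is my assumption about what a SQL engine would need for the aggregate.

```python
from dataclasses import dataclass

# Hypothetical catalog: logical field -> physical expression and the columns it needs.
LOGICAL_FIELDS = {
    "customer": {"expr": "customer_name", "kind": "attribute",
                 "columns": {"customer_name"}},
    "sales": {"expr": "sum(amount_spent - tax_collected)", "kind": "metric",
              "columns": {"amount_spent", "tax_collected"}},
    "sale_date": {"expr": "date_of_transaction", "kind": "attribute",
                  "columns": {"date_of_transaction"}},
}

@dataclass
class DataSource:
    name: str
    columns: frozenset
    row_count: int  # assumed proxy for query cost

def pick_source(sources, needed):
    """Cheapest source (fewest rows) that holds every needed column."""
    candidates = [s for s in sources if needed <= s.columns]
    if not candidates:
        raise LookupError("no data source has all required columns")
    return min(candidates, key=lambda s: s.row_count)

def write_query(select_fields, filter_field, start, end, sources):
    """Replace logical fields with formulas and target the cheapest source."""
    fields = [LOGICAL_FIELDS[name] for name in select_fields]
    filter_expr = LOGICAL_FIELDS[filter_field]["expr"]
    needed = set(LOGICAL_FIELDS[filter_field]["columns"])
    for field in fields:
        needed |= field["columns"]
    source = pick_source(sources, needed)
    select_list = ", ".join(
        field["expr"] + " as " + name for name, field in zip(select_fields, fields)
    )
    # Real code should bind the dates as parameters instead of concatenating them.
    sql = ("select " + select_list + " from " + source.name
           + " where " + filter_expr + " between '" + start + "' and '" + end + "'")
    attributes = [f["expr"] for f in fields if f["kind"] == "attribute"]
    if attributes and any(f["kind"] == "metric" for f in fields):
        sql += " group by " + ", ".join(attributes)
    return sql
```

Because the selection rule lives in `pick_source`, swapping cost-based routing for a semantic or quality-of-service rule only means replacing that one function.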

Next, architecture and code.

Planning and Communicating Your Cluster Design

When creating a new Amazon Web Services (AWS) hadoop cluster it is overwhelming for most people to put together a configuration plan or topology.  Below is a Hadoop reference architecture template I’ve built that can be filled in that addresses the key aspects of planning, building, configuring, and communicating your hadoop cluster on AWS.

I’ve done this many times and as part of my focus on tools and templates thought I’d add a template you can use as a basic guideline for planning your Cloudera big data cluster.  The template includes configurations for:

  • instance basics
  • instance list
  • storage
  • operating system
  • CDH version
  • the cluster topology
  • metastore detail for hive, YARN, hue, impala, sqoop, oozie, and Cloudera Manager
  • high-availability
  • resource management
  • and additional detail for custom service descriptors (CSD) for Storm and Redis

No Warranty Expressed or Implied

It’s not meant to be exhaustive as there are many items not covered (AWS security groups, network optimization, dockerization, continuous integration, monitors, etc.) but it is an example of a real-world cluster in AWS (details of instance and AZ changed for security).

Screenshot of the roles and services in the big data design template

Example list of EC2 instances for the cluster plan

Cloudera hadoop reference architecture  configuration template for Amazon Web Services (AWS)


Please feel free to let me know how it works for you and if you have any improvements for it.

NOTE: This content is for archive purposes only.  With generation 4+ EBS volumes big data IO performance no longer requires volume prewarming.

Fresh Elastic Block Storage volumes have first-write overhead

At my employer I architect Big Data hybrid cloud platforms for a global audience that have to be FAST.  In our cluster provisioning I find we frequently overlook doing an initial write across our volumes to reduce write time during production compute workloads (called pre-warming the EBS volumes).  Per Amazon, failure to pre-warm EBS volumes incurs a 5-50% loss in effective IOPS.  Worst case, that means you could DOUBLE the IO portion of your HDFS writes until each sector has been touched by the kernel.  Amazon asserts that this performance loss, amortized over the life of a disk, is inconsequential to most applications.  For one of our current clusters we have a portion with eight 1TB drives in each of 10 compute nodes as a baseline.  Our estimated pre-warm time is 30 hours on each mount point, so if done sequentially that’s 2,400 hours (8 drives x 10 nodes x 30 hours) to touch each drive block.

What does this imply?  Without pre-warming we would have added as much as 2,400 additional hours of write latency during initial HDFS writes and that latency could appear in many different places in the stack (HDFS direct writes, Hive postgresql/mysql metadata writes and management, log writes, etc.)
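
The arithmetic behind these figures is sketched below. The drive, node, and hour counts come from the text; the assumption that IO time grows in inverse proportion to effective IOPS is mine and is only a rough model.

```python
# Figures taken from the cluster described above.
DRIVES_PER_NODE = 8
NODES = 10
HOURS_PER_DRIVE = 30

# Total hours if every drive in the cluster were pre-warmed one after another.
sequential_hours = DRIVES_PER_NODE * NODES * HOURS_PER_DRIVE

def write_time_multiplier(iops_loss):
    """Factor by which IO time grows when a fraction of IOPS is lost.

    Assumes IO time is inversely proportional to effective IOPS.
    """
    if not 0 <= iops_loss < 1:
        raise ValueError("iops_loss must be in [0, 1)")
    return 1.0 / (1.0 - iops_loss)

print(sequential_hours)             # 2400
print(write_time_multiplier(0.50))  # 2.0
```

A 50% loss in IOPS is what doubles the IO portion of a write in the worst case. If each drive is warmed by its own process at the same time, the wall-clock time approaches the 30 hours needed for a single drive, provided the instances have no shared throughput bottleneck.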

Steps to optimize your EBS writes

Read the AWS document above carefully as it will ERASE EVERYTHING ON THE DISK if you use the first method in their article.  The steps below will execute this safely on disks with existing content.

To pre-warm the drives on your cluster:

    1. stop the cluster services
    2. ssh into each server
    3. execute lsblk and note the mount points (they likely start from /dev/xvdf and go down from there increasing the letter at the end, such as /dev/xvdg, /dev/xvdh, etc.)
    4. unmount each one at a time with sudo umount /ONEMOUNTPOINT
    5. Continue until all mount points are unmounted, meaning lsblk shows nothing after the ‘disk’ column
    6. execute the following, changing the if= and of= to the same mount point:
         sudo dd if=/YOURMOUNTPOINT of=/YOURSAMEMOUNTPOINT conv=notrunc bs=1M
       Example:
         sudo dd if=/dev/xvdf of=/dev/xvdf conv=notrunc bs=1M
    7. Wait.  It’ll be a few minutes for a 32GB drive as shown in the Amazon write-up above or 1 day+ for a 1TB drive.
    8. After ALL the processes on the server complete, reboot the server

If you’d like to check the process or if your ssh session has expired and you want to ensure you’re still warming execute ps aux|grep YOURMOUNTPOINT , example:  ps aux |grep /dev/xvdf

A far better approach, of course, would be to automate this as part of your cluster deployment process using Chef or equivalent infrastructure automation tool.
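
As a starting point for that automation, here is a small dry-run sketch that only builds the dd command lines for a list of devices so they can be reviewed before anything runs. It is plain Python, not a Chef recipe, and the function names are hypothetical.

```python
import shlex

def prewarm_command(device, block_size="1M"):
    """Build the read-and-write-back dd command for one block device."""
    if not device.startswith("/dev/"):
        raise ValueError("expected a block device path such as /dev/xvdf")
    return ["sudo", "dd", "if=" + device, "of=" + device,
            "conv=notrunc", "bs=" + block_size]

def prewarm_plan(devices):
    """Return one printable shell command per device, without executing any."""
    return [shlex.join(prewarm_command(device)) for device in devices]

for line in prewarm_plan(["/dev/xvdf", "/dev/xvdg"]):
    print(line)
```

Printing the plan first keeps a human in the loop. The same argument lists could later be handed to `subprocess.run` once the volumes are confirmed to be unmounted.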


Scenario 1:  There was data, the logs say Namenode not formatted, the (check your config to see where it is) is empty

Cause:  The data was emptied out of your namenode directory.

Things to try (in order):

    1. FSCK (see scenario 2 below)
    2. recover the namenode
      1. hadoop namenode -recover
      2. If the output says some directories are missing, create them, chgrp to hadoop, chown to hdfs, chmod 755, then run again
    3. Import the fsimage from a non-corrupt secondary namenode
      1. hadoop namenode -importCheckpoint
      2. If the output says some directories are missing, create them, chgrp to hadoop, chown to hdfs, chmod 755, then run again
    4. Brute force it
      1. Find out in the config where the snn checkpoint is kept (fs.checkpoint.dir)
      2. SCP down ALL the files in the fs.checkpoint.dir to your local machine
      3. SCP up ALL the files you just downloaded to the
      4. For all those files chgrp to hadoop, chown to hdfs, chmod 755
      5. Start your HDFS service as usual through the cluster manager and think optimistic thoughts.

Scenario 2:  There was data, the logs point to corrupt blocks

Cause:  Probably a bad termination signal during copy or high volume data movement with bad network

Things to try (in order):

    1. FSCK
      You can use

        hadoop fsck /

      to determine which files are having problems. Look through the output for missing or corrupt blocks (ignore under-replicated blocks for now). This command is really verbose especially on a large HDFS filesystem so I normally get down to the meaningful output with

        hadoop fsck / | egrep -v '^\.+$' | grep -v eplica

      which ignores lines with nothing but dots and lines talking about replication.

      Once you find a file that is corrupt

        hadoop fsck /path/to/corrupt/file -locations -blocks -files

      Use that output to determine where blocks might live. If the file is larger than your block size it might have multiple blocks.

      You can use the reported block numbers to go around to the datanodes and the namenode logs searching for the machine or machines on which the blocks lived. Try looking for filesystem errors on those machines. Missing mount points, datanode not running, file system reformatted/reprovisioned. If you can find a problem in that way and bring the block back online that file will be healthy again.

      Lather rinse and repeat until all files are healthy or you exhaust all alternatives looking for the blocks.

      Once you determine what happened and you cannot recover any more blocks, just use the

        hadoop fs -rm /path/to/file/with/permanently/missing/blocks

      command to get your HDFS filesystem back to healthy so you can start tracking new errors as they occur.
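
If you would rather post-process saved fsck output in a script, the egrep/grep filter above can be sketched in Python like this. The function name is hypothetical, and the filter makes no assumptions about the fsck output format beyond the two rules already described.

```python
import re

# Same rule as egrep -v '^\.+$': lines made up of nothing but dots.
DOTS_ONLY = re.compile(r"^\.+$")

def meaningful_lines(lines):
    """Yield fsck output lines, skipping progress dots and replication chatter."""
    for line in lines:
        text = line.rstrip("\n")
        if DOTS_ONLY.match(text):
            continue
        # Same rule as grep -v eplica: catches both "Replica" and "replica".
        if "eplica" in text:
            continue
        yield text
```

Pass it any iterable of lines, for example an open file holding saved fsck output, or `sys.stdin` inside a small wrapper script at the end of a pipe.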

Scenario 3: Secondary Namenode can’t checkpoint the namenode

The SNN logs show checkpoint failed, probably with missing txid=####

  1. Change /etc/fstab and set the mount point to allow fsck on boot
    1. vi /etc/fstab as root
    2. Change the last zero in the first line to one, so change:
         LABEL=cloudimg-rootfs / ext4 defaults 0 0
       to:
         LABEL=cloudimg-rootfs / ext4 defaults 0 1
    3. Save the file and exit
  2. Change the FSCKFIX in /etc/default/rcS to yes
    1. vi /etc/default/rcS as root
    2. Find the line that says #FSCKFIX=no
    3. Change it to FSCKFIX=yes (make sure you remove the commenting # at the beginning)
    4. Save and exit
  3. Check and record the last FSCK run
    1. execute and record the output of
      sudo tune2fs -l /dev/xvda1 | grep "Last checked"
  4. Reboot (use AWS instance reboot or do it from ssh)
  5. Check that FSCK ran on boot
    1. execute and verify that the date changed using
      sudo tune2fs -l /dev/xvda1 | grep "Last checked"
  6. Reverse the changes you made in steps 1 and 2
  7. Reboot
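
If you need to make the /etc/fstab change from step 1 on many machines, the edit can be scripted. The sketch below only rewrites a single fstab line held in a string. It assumes the usual six whitespace-separated fields, with the fsck pass number last; the function name is hypothetical.

```python
def set_fsck_pass(fstab_line, passno):
    """Return the fstab line with its sixth field (fsck pass number) replaced."""
    fields = fstab_line.split()
    if len(fields) != 6:
        raise ValueError("expected six whitespace-separated fstab fields")
    fields[5] = str(passno)
    return " ".join(fields)

print(set_fsck_pass("LABEL=cloudimg-rootfs / ext4 defaults 0 0", 1))
```

Calling it again with 0 reverses the change, which matches step 6. Note that the columns are re-joined with single spaces, so any original alignment is not preserved.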

Jboss enterprise has a free data virtualization (NOT server virtualization) platform called Teiid.  Its capabilities include serving data from multiple technologies (jdbc, odbc, Thrift, REST, SOAP, etc.), merging/transformation of data, fault tolerance, scalability, and other capabilities one would require of an enterprise service.  It can stand in the technology portfolio as part of an Enterprise Service Bus (ESB) to abstract big data and make it APPEAR to be relational (among other benefits).  To set up a Teiid server to expose Hive data:

Install the Jboss EAP

  1. Download (or latest) from jboss downloads
  2. Unzip to c:\programfiles\jboss\ on windows , /etc/jboss on linux

 Overlay Teiid on top of EAP

  1. Download (or latest)
  2. Unzip on top of jboss you just installed:  c:\programfiles\jboss\jboss-eap-6.4.0

 Add the Teiid web console to jboss

  1. Download (or latest)
  2. Unzip on top of jboss you just installed:  c:\programfiles\jboss\jboss-eap-6.4.0

In \jboss-eap-6.3\standalone\configuration\standalone-teiid.xml add to the drivers section:

<driver name="hive" module="org.apache.hadoop.hive">

Find on your cluster the following files and add them to <jboss install dir>\modules\org\apache\hive\main   This path is VERY important and is mis-documented at present on the jboss site.


Navigate to the EAP bin directory and execute ./ -c standalone-teiid.xml

Additional versions:

I’ve attached an excel file for a full-featured Big Data (hadoop) production topology, a good starting place for an architecture that supports a full Lambda architecture (streaming for seconds-old recency, batch for heavy lifting, and services to logically merge the two on demand).  The cluster is composed of 21 AWS instances with EBS backing.  The HDFS layer can be partitioned so that older data (more than 1 year old, for example) sits on cheaper S3 storage while remaining fully queryable.

The use cases covered in this architecture:

  1. Accessibility
    1. Data miner support through SQL and machine learning libraries into the raw data
    2. Ad-hoc querying through SQL in a dimensional model
    3. REST, thrift, and other API access with load balancing, data merging (from any data technology), and efficient data source routing
    4. OLAP cubes with perspectives (through data marts) for business analysis
  2. Technical
    1. Open source, free licensing model
    2. Fault tolerance and re-entrance on failure
    3. Scalable design with massive parallelism
    4. Cloud design for flexibility