Planning and Communicating Your Cluster Design

When creating a new Hadoop cluster on Amazon Web Services (AWS), putting together a configuration plan or topology can be overwhelming.  Below is a Hadoop reference architecture template I’ve built that can be filled in and that addresses the key aspects of planning, building, configuring, and communicating your Hadoop cluster on AWS.

I’ve done this many times, and as part of my focus on tools and templates I thought I’d share a template you can use as a basic guideline for planning your Cloudera big data cluster.  The template includes configurations for:

  • instance basics
  • instance list
  • storage
  • operating system
  • CDH version
  • the cluster topology
  • metastore detail for Hive, YARN, Hue, Impala, Sqoop, Oozie, and Cloudera Manager
  • high-availability
  • resource management
  • and additional detail for custom service descriptors (CSDs) for Storm and Redis

No Warranty Expressed or Implied

It’s not meant to be exhaustive, as there are many items not covered (AWS security groups, network optimization, dockerization, continuous integration, monitoring, etc.), but it is an example of a real-world cluster in AWS (instance and AZ details changed for security).

Screenshot of the roles and services in the big data design template
Example list of EC2 instances for the cluster plan

Cloudera Hadoop reference architecture configuration template for Amazon Web Services (AWS)

AWS_topology_template

Please feel free to let me know how it works for you and if you have any improvements for it.

NOTE: This content is for archive purposes only.  With generation 4+ EBS volumes, big data I/O performance no longer requires volume pre-warming.

Fresh Elastic Block Store (EBS) volumes have first-write overhead

At my employer I architect Big Data hybrid cloud platforms for a global audience that have to be FAST.  In our cluster provisioning I find we frequently overlook doing an initial write across our volumes to reduce write time during production compute workloads (called pre-warming the EBS volumes).  Per Amazon (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-prewarm.html), failure to pre-warm EBS volumes incurs a 5-50% loss in effective IOPS.  Worst case, that means you could DOUBLE the I/O portion of your HDFS writes until each sector has been touched by the kernel.  Amazon asserts that this performance loss, amortized over the life of a disk, is inconsequential to most applications.  One of our current clusters has a portion with eight 1 TB drives in each of 10 compute nodes as a baseline.  Our estimated pre-warm time is 30 hours per mount point, so if done sequentially that’s 2,400 hours to touch every drive block.

What does this imply?  Without pre-warming we would have added as much as 2,400 additional hours of write latency during initial HDFS writes, and that latency could appear in many different places in the stack (HDFS direct writes, Hive PostgreSQL/MySQL metadata writes and management, log writes, etc.).

Steps to optimize your EBS writes

Read the AWS document above carefully, as the first method in their article will ERASE EVERYTHING ON THE DISK.  The steps below will execute this safely on disks with existing content.

To pre-warm the drives on your cluster:

    1. stop the cluster services
    2. ssh into each server
    3. execute lsblk and note the device names (they likely start at /dev/xvdf and continue up the alphabet: /dev/xvdg, /dev/xvdh, etc.)
    4. unmount each one at a time with sudo umount /ONEMOUNTPOINT
    5. Continue until all mount points are unmounted, meaning there’s nothing shown in the mount point column of the lsblk output
    6. CAUTION:  DO NOT DO THE FOLLOWING ON A MOUNTED DISK, AND MAKE SURE YOU USE THE SAME DEVICE FOR BOTH if= AND of=
    7. execute the following, changing the if= and of= to the same device:  sudo dd if=/YOURDEVICE of=/YOURSAMEDEVICE conv=notrunc bs=1M   Example:  sudo dd if=/dev/xvdf of=/dev/xvdf conv=notrunc bs=1M
    8. Wait.  It’ll be a few minutes for a 32GB drive as shown in the Amazon write-up above, or 1 day+ for a 1TB drive.
    9. After ALL the processes on the server complete, reboot the server

If you’d like to check the progress, or if your ssh session has expired and you want to ensure you’re still warming, execute ps aux | grep YOURDEVICE, for example:  ps aux | grep /dev/xvdf

A far better approach, of course, would be to automate this as part of your cluster deployment process using Chef or an equivalent infrastructure automation tool.
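
For reference, a rough shell sketch of that automation follows.  It assumes the data volumes are /dev/xvdf through /dev/xvdp, that they are already unmounted, and that it is acceptable to run the dd commands in parallel; those assumptions are mine, not part of a tested deployment recipe.

#!/bin/bash
# Sketch: pre-warm every unmounted EBS data volume on a node in parallel.
# Assumes the data volumes are /dev/xvdf through /dev/xvdp (adjust the glob to your lsblk output).
# Run from a root shell or with passwordless sudo so the background dd commands don't stall on a prompt.
for dev in /dev/xvd[f-p]; do
  [ -b "$dev" ] || continue                      # skip device letters that don't exist on this node
  echo "pre-warming $dev"
  nohup sudo dd if="$dev" of="$dev" conv=notrunc bs=1M \
    > "/tmp/prewarm_$(basename "$dev").log" 2>&1 &
done
wait   # block until every dd finishes, then remount or reboot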

Ref: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-initialize.html

Scenario 1:  There was data, the logs say Namenode not formatted, and the dfs.name.dir (check your config to see where it is) is empty

Cause:  The data was emptied out of your namenode directory.

Things to try (in order; a consolidated shell sketch follows the list):

    1. FSCK (see scenario 2 below)
    2. recover the namenode
      1. hadoop namenode -recover
      2. If the output says some directories are missing, create them, chgrp to hadoop, chown to hdfs, chmod 755, then run again
    3. Import the fsimage from a non-corrupt secondary namenode
      1. hadoop namenode -importCheckpoint
      2. If the output says some directories are missing, create them, chgrp to hadoop, chown to hdfs, chmod 755, then run again
    4. Brute force it
      1. Find out in the config where the snn checkpoint is kept (fs.checkpoint.dir)
      2. SCP down ALL the files in the fs.checkpoint.dir to your local machine
      3. SCP up ALL the files you just downloaded to the dfs.name.dir
      4. For all those files chgrp to hadoop, chown to hdfs, chmod 755
      5. Start your HDFS service as usual through the cluster manager and think optimistic thoughts.
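
Pulled together, the recovery attempt might look like the shell sketch below.  The directory path and the hdfs/hadoop user and group names are illustrative; take them from your own dfs.name.dir and fs.checkpoint.dir settings.

# Sketch: attempt namenode recovery (option 2), then fall back to importing the checkpoint (option 3).
sudo -u hdfs hadoop namenode -recover

# If the output complains about missing directories, recreate them with the expected ownership:
sudo mkdir -p /data/dfs/nn        # illustrative dfs.name.dir path
sudo chgrp hadoop /data/dfs/nn
sudo chown hdfs /data/dfs/nn
sudo chmod 755 /data/dfs/nn

# Import the last good checkpoint from the secondary namenode
sudo -u hdfs hadoop namenode -importCheckpoint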

Scenario 2:  There was data, the logs point to corrupt blocks

Cause:  Probably a bad termination signal during a copy, or high-volume data movement over a bad network

Things to try (in order):

    1. FSCK
      You can use

        hadoop fsck /
      

      to determine which files are having problems. Look through the output for missing or corrupt blocks (ignore under-replicated blocks for now). This command is really verbose, especially on a large HDFS filesystem, so I normally get down to the meaningful output with

        hadoop fsck / | egrep -v '^\.+$' | grep -v eplica
      

      which ignores lines with nothing but dots and lines talking about replication.

      Once you find a file that is corrupt

        hadoop fsck /path/to/corrupt/file -locations -blocks -files
      

      Use that output to determine where blocks might live. If the file is larger than your block size it might have multiple blocks.

      You can use the reported block numbers to search the datanode and namenode logs for the machine or machines on which the blocks lived. Try looking for filesystem errors on those machines: missing mount points, a datanode that isn’t running, a file system that was reformatted or reprovisioned. If you can find a problem that way and bring the block back online, that file will be healthy again.

      Lather, rinse, and repeat until all files are healthy or you exhaust all alternatives looking for the blocks.

      Once you determine what happened and you cannot recover any more blocks, just use the

        hadoop fs -rm /path/to/file/with/permanently/missing/blocks
      

      command to get your HDFS filesystem back to healthy so you can start tracking new errors as they occur.
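
Two fsck options can shortcut parts of that loop; verify the exact flags against your Hadoop version, since they vary slightly across releases.

# List only the files with corrupt or missing blocks
hdfs fsck / -list-corruptfileblocks

# Once you've given up on recovering the blocks, fsck can remove the damaged files in one pass
hadoop fsck / -delete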

Scenario 3: Secondary Namenode can’t checkpoint the namenode

The SNN logs show the checkpoint failed, probably with a missing txid=####.  The fix is to force a file system check of the root volume on the next boot; a consolidated shell sketch follows the steps.

  1. Change /etc/fstab and set the mount point to allow fsck on boot
    1. vi /etc/fstab as root
    2. Change the last zero in the first line to one, so change:
      1. LABEL=cloudimg-rootfs / ext4 defaults 0 0

        to

      2. LABEL=cloudimg-rootfs / ext4 defaults 0 1
    3. Save the file and exit
  2. Change the FSCKFIX in /etc/default/rcS to yes
    1. vi /etc/default/rcS as root
    2. Find the line that says #FSCKFIX=no
    3. Change it to FSCKFIX=yes (make sure you remove the commenting # at the beginning)
    4. Save and exit
  3. Check and record the last FSCK run
    1. execute and record the output of
      sudo tune2fs -l /dev/xvda1 | grep "Last checked"
  4. Reboot (use AWS instance reboot or do it from ssh)
  5. Check that FSCK ran on boot
    1. execute and verify that the date changed using
      sudo tune2fs -l /dev/xvda1 | grep "Last checked"
  6. Reverse the changes you made in steps 1 and 2
  7. Reboot
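
As a consolidated sketch of steps 1 through 5 (with the reversal noted at the end), something like the following works on an Ubuntu-style image.  The sed patterns are illustrative, since fstab field separators vary; verify each edit by eye before rebooting.

# Sketch: force fsck of the root volume on the next boot
sudo cp /etc/fstab /etc/fstab.bak
sudo sed -i 's/^\(LABEL=cloudimg-rootfs.*\) 0$/\1 1/' /etc/fstab       # set the fsck pass number to 1
sudo sed -i 's/^#\?FSCKFIX=no/FSCKFIX=yes/' /etc/default/rcS           # let fsck fix errors automatically

sudo tune2fs -l /dev/xvda1 | grep "Last checked"   # record the current timestamp
sudo reboot
# ...after the instance comes back:
sudo tune2fs -l /dev/xvda1 | grep "Last checked"   # should now show a new date
# then reverse the two edits above and reboot once more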

JBoss has a free data virtualization (NOT server virtualization) platform called Teiid.  Its capabilities include serving data from multiple technologies (JDBC, ODBC, Thrift, REST, SOAP, etc.), merging/transformation of data, fault tolerance, scalability, and the other capabilities one would require of an enterprise service.  It can stand in the technology portfolio as part of an Enterprise Service Bus (ESB) to abstract big data and make it APPEAR to be relational (among other benefits).  To set up a Teiid server to expose Hive data:

Install JBoss EAP

  1. Download jboss-eap-6.4.0.zip (or latest) from the JBoss downloads page
  2. Unzip to c:\programfiles\jboss\ on Windows, or to /etc/jboss on Linux

 Overlay Teiid on top of EAP

  1. Download teiid-9.0.1-jboss-dist.zip (or latest)
  2. Unzip it on top of the JBoss install you just created:  c:\programfiles\jboss\jboss-eap-6.4.0

 Add the Teiid web console to jboss

  1. Download teiid-console-dist-1.x.zip (or latest)
  2. Unzip it on top of the JBoss install you just created:  c:\programfiles\jboss\jboss-eap-6.4.0

In \jboss-eap-6.4.0\standalone\configuration\standalone-teiid.xml, add the following to the drivers section:

<driver name="hive" module="org.apache.hadoop.hive">
     <driver-class>org.apache.hive.jdbc.HiveDriver</driver-class>
</driver>

Find the following files on your cluster and add them to <jboss install dir>\modules\org\apache\hive\main.  This path is VERY important and is mis-documented at present on the JBoss site.  (A sketch of the module.xml that accompanies these jars follows the list.)

commons-logging-1.1.3.jar
hadoop-common-2.5.0-cdh5.3.0.jar
hadoop-core-2.5.0-mr1-cdh5.3.0.jar
hive_metastore.jar
hive_service.jar
hive-common-0.13.1-cdh5.3.0.jar
hive-jdbc-0.13.1-cdh5.3.0.jar
hive-metastore-0.13.1-cdh5.3.0.jar
hive-serde-0.13.1-cdh5.3.0.jar
hive-service-0.13.1-cdh5.3.0.jar
httpclient-4.2.5.jar
httpcore-4.2.5.jar
HiveJDBC4.jar
libfb303-0.9.0.jar
libthrift-0.9.0.jar
log4j-1.2.14.jar
ql.jar
slf4j-api-1.5.11.jar
slf4j-log4j12-1.5.11.jar
TCLIServiceClient.jar
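
JBoss will only load these jars if the module directory also contains a module.xml listing them as resources.  The article doesn’t include one, so here is a minimal sketch; the dependency list is an assumption, and note that JBoss resolves the module name in the driver element (org.apache.hadoop.hive) to modules/org/apache/hadoop/hive/main, so keep the directory consistent with whatever module name you declare.  A quick way to generate it on Linux:

# Sketch: generate a minimal module.xml for the Hive driver module (adjust JBOSS_HOME and the module path to your install)
MODULE_DIR="$JBOSS_HOME/modules/org/apache/hadoop/hive/main"
cd "$MODULE_DIR"
{
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<module xmlns="urn:jboss:module:1.1" name="org.apache.hadoop.hive">'
  echo '  <resources>'
  for jar in *.jar; do echo "    <resource-root path=\"$jar\"/>"; done
  echo '  </resources>'
  echo '  <dependencies>'
  echo '    <module name="javax.api"/>'
  echo '    <module name="javax.transaction.api"/>'
  echo '  </dependencies>'
  echo '</module>'
} > module.xml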

Navigate to the EAP bin directory and execute ./standalone.sh -c standalone-teiid.xml

Additional versions:  http://tools.jboss.org/downloads/overview.html

Below is an example topology for a Big Data development cluster that I’ve actually used for some customers.  It’s composed of 6 Amazon Web Services (AWS) servers, each with a particular purpose.  We have been able to run a full lambda architecture using this topology, along with Teiid (for data abstraction), on terabytes of data.  It’s not sufficient for a production cluster but is a good starting point for a development group.  The total cost of this cluster as configured (less storage) is under $6/hour.  (An example AWS CLI launch command for one of the nodes follows the table.)

Here’s a link to this dev_topology in Excel.

 

Service Category Server1 Server2 Server3 Server4 Server5 Server6
Cloudera Mgr Cluster Mgt Alert pub Server Host mon Svc Mon Event Svr Act Mon
HDFS Infra Namenode SNN/DN/JN/HA DN DN/JN DN DN/JN
Zookeeper Infra Server Server Server
YARN Infra Node Mgr Node Mgr JobHist Node Mgr RM/NM
Redis Infra Master Slave Slave
Hive Data Hive server Metastore Hcat
Impala Data App Master Cat Svr Daemon Daemon Daemon
Storm Data Nimbus/UI Supervisor Supervisor Supervisor
Hue UI Server
Pentaho BI UI BI Server
IP ADDRESS
AWS details              
Name m3.2xlarge m3.2xlarge m3.2xlarge r3.4xlarge r3.4xlarge r3.4xlarge
vCPU 8 8 8 16 16 16
Memory (GB) 30.0 30.0 30.0 122.0 122.0 122.0
Instance storage (GB) SSD 2 x 80 SSD 2 x 80 SSD 2 x 80 SSD 1 x 320 SSD 1 x 320 SSD 1 x 320
I/O High High High High High High
EBS option Yes Yes Yes Yes Yes Yes
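
If you prefer to stand nodes up from the command line rather than the console, the AWS CLI call for one of the m3.2xlarge nodes looks roughly like this.  The AMI, key pair, subnet, and security group IDs are placeholders, not values from the cluster above.

# Sketch: launch one m3.2xlarge node with a 1 TB gp2 data volume (all IDs are placeholders)
aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type m3.2xlarge \
  --count 1 \
  --key-name my-cluster-key \
  --subnet-id subnet-xxxxxxxx \
  --security-group-ids sg-xxxxxxxx \
  --block-device-mappings '[{"DeviceName":"/dev/xvdf","Ebs":{"VolumeSize":1024,"VolumeType":"gp2"}}]'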

 

Named Entity Models

Research labs and product teams intent on building upon openNLP and Solr (which can consume an openNLP NameFinder model) frequently find it important to write their own model parser or model builder classes.  openNLP has built-in capabilities for this, but for custom parsers the structure of the openNLP NameFinder model must be known.

The NameFinder model is defined by the GISModel class, which extends AbstractModel; the definition and exposed interfaces can be found in the openNLP API docs on the Apache site.  The structure, outlined below, is composed of an indicator of model type, a correction constant, model outcomes, and model predicates.  Models for NameFinder can be downloaded free from the openNLP project and are trained against generic corpora.

openNLP NameFinder Model Structure

  1. The type identifier, GIS (literal)
  2. The model correction constant (int)
  3. Model correction constant parameter (double)
  4. Outcomes
    1. The number of outcomes (int)
    2. The outcome names (string array, length of which is specified in 4.1. above)
  5. Predicates
    1. Outcome patterns
      1. The number of outcome patterns (int)
      2. The outcome pattern values (each stored in a space delimited string)
    2. The predicate labels
      1. The number of predicates (int)
      2. The predicate names (string array, length of which is specified in 5.2.1. above)
    3. Predicate parameters (double values)

View agency activity clustering on geography in Excel using Excel Data Mining Add-ins

By Don Krapohl

1.       Ensure you have downloaded the Excel Data Mining Add-ins from Microsoft at http://www.microsoft.com/en-us/download/details.aspx?id=35578 .  The article assumes you have a working version of the DM Addins and a default Analysis Services (SSAS) instance defined.  Search for getting started with SQL Server Data Mining Add-ins for Excel if you are not familiar with this process.

2.       Open the Excel sample file for Federal contract acquisitions in Wyoming (2012) from https://augmentedintel.com/content/datasets/government_contracts_data_mining_addins.xlsx

3.       On the wy_data_feed tab, select all the data.

4.       In the Home tab on the ribbon in the Styles section select “Format as Table”.  Pick any format you wish.

5.       A new tab will appear on the ribbon for Table Tools with menus for Analyze and Design as below.

Microsoft Excel Table Tools menu
Table tools to format data as a table in Excel

 

6.       On the Analyze menu, select “Detect Categories”.  This will group (cluster) your information on common attributes, in particular commonalities that are not obvious or immediately observable.

7.       Deselect all checkboxes except the following:

a.       Dollars Obligated

b.      Award Type

c.       Contract Pricing

d.      Funding Agency

e.      Product Or Service Code

f.        Category

8.       Click ‘run’

9.       The output will show you categories of information with strong affinities.  Explore the model by filtering the charts and tables by the categories generated.  Do this by selecting the filter icon (funnel) next to Category on the table, or the Category label at the lower left of the graph.

10.   The most useful information may come from the groups with fewer rows, which can show particularly interesting correlations for a targeted campaign.  For example, filter the table and chart on Category 6.  This group indicates an affinity for the attribute values ProductOrServiceCode = “REFRIGERATION AND AIR CONDITIONING COMPONENTS”, fundingAgency = “Veterans Affairs, Department Of”, and a contract award value of $61,148 to $1,173,695 as shown below:

 

Importance of data categories in Excel Data Mining Add-ins
Factor Analysis in Microsoft Excel

For my organization’s business development activities, if I am in the heating and air business I may elect to focus efforts on medium-sized contracts with Veterans Affairs.


Artificial Intelligence for the Creation of Competitive Intelligence Tools

Introduction

Often in prioritizing business development activities it is helpful to determine who is able to influence a decision and how they are related to those in the market space.  To make a defensible and actionable strategy it is useful to perform Influence Analysis and Network Analysis, which can form the kernel of a competitive intelligence analysis strategy.  The data required for analysis must be obtained by identifying and extracting target attribute values in unstructured and often very large (multi-terabyte or petabyte) data stores.  This necessitates a scalable infrastructure, distributed parallel computing capability, and fit-for-use natural language processing algorithms.  Herein I will demonstrate a target logical architecture and methodology for accomplishing the task.  Influence and Network analysis by machine learning algorithm (naïve bayes or perceptron for example) will be covered in a later supporting article.

Recognizing Significance

Named-Entity Recognition is required for unstructured content extraction in this scenario.  This identification scheme may or may not employ stemming but will always require tokenizing, part-of-speech tagging, and the acquisition of a predefined model of attribute patterns to properly recognize and extract required metadata.  A powerful platform with these built-in capabilities is the Apache openNLP project, which includes typed attribute models for the name finder, an extensible name finder algorithm, an API that exposes a Lucene index consumer, and a scalable, distributed architecture.  The Apache Stanbol project in the incubator (http://stanbol.apache.org/) shows promise at semantic-based extraction and content enhancement but hasn’t been promoted outside the incubator yet.

Apache openNLP attribute recognition models are available in only a few languages with the original and largest being English.  The community publishes models in English for the Name Finder interface for dates, location, money, organization, percentage, person, and time (date).  Each is an appropriate candidate for term extraction for competitive intelligence analysis.

Logical Architecture

Natural Language Processing for Competitive Intelligence
openNLP in four node Hadoop cluster

The controlling requirement for this task is the ability to process massive datasets to extract the target information.  For this, Hadoop provides a flexible, fault-tolerant framework and processing model that readily supports the natural language processing needs.  The logical architecture for a small (<1TB) 4-node clustered Hadoop solution is as follows:

 

Process Flow

As below, the process to execute is standardized on the map/reduce patterns Distributed Task Execution, Union, Selection, and Intersection.  Pre-processing using a Graph Processing pattern in a distinctly separate map phase would likely hasten any Influence Analysis to be performed post-process.

 

Operations Sequence Diagram of openNLP with Map Reduce on Hadoop for Competitive Intelligence
Multi-node Sequence Diagram for openNLP with Map Reduce on Hadoop

The primary namenode initiates work and passes the data and map/reduce execution program to the task trackers, which in turn distribute it among worker nodes.  The worker nodes execute the map on HDFS-stored data and provide health and status to the task tracker, which reports it to the primary namenode.  On node map completion the primary namenode may redistribute map work to the worker node or order the reduce task, each by way of the task tracker.  The reduce task selects data from the HDFS interim resultset, aggregates it, and streams it to a result file.  The result file is then used later for analysis by the machine learning algorithm of choice.
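
The article assumes a custom Java map/reduce job; purely as an illustration of how the same flow could be submitted, a Hadoop Streaming invocation is sketched below, where a wrapper script around the openNLP name finder acts as the mapper and a second script emits the entity tuples shown in the next section.  The jar path and script names are hypothetical.

# Sketch: submit the extraction job with Hadoop Streaming (jar path and script names are hypothetical)
hadoop jar "$HADOOP_HOME"/contrib/streaming/hadoop-streaming-*.jar \
  -input /data/crawl/raw_documents \
  -output /data/crawl/entities_pass1 \
  -mapper namefind_mapper.sh \
  -reducer entity_reducer.sh \
  -file namefind_mapper.sh \
  -file entity_reducer.sh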

File Structures

The input file is of a machine-readable ASCII text type and is unstructured.  Example:

 

From: Amir Soofi

Sent: Thursday, December 06, 2012 2:37 AM

To: Aaron Macarthur; Hugo Cruz

Cc: Donald Krapohl

Subject: RE: Language Comparison

 

Hugo,

 

FYI, Rick Marshall unofficially approved a 3-day trip for one person from the Enterprise team down to Jacksonville, FL to assist in the catalog reinstall.

 

I’ll be placing it in the travel portal soon for the official process, so that the option becomes officially available to us.

 

I think together we’ll be able to push through the environment differences better in person than over the phone.

 

Let us know whether your site can even accommodate a visitor, and when you’d like to exercise this option.

 

Respectfully,

 

Amir Soofi

 

Principal Software Engineer, Enterprise

 

 

The output of the openNLP Name Find algorithm map task on this input:

From: <namefind/person>Amir Soofi</namefind/person>

Sent: <namefind/date>Thursday, December 06, 2012 2:37 AM</namefind/date>

To: <namefind/person>Aaron Macarthur</namefind/person>; <namefind/person>Hugo Cruz</namefind/person>

Cc: <namefind/person>Donald Krapohl</namefind/person>

Subject: RE: Language Comparison

 

Hugo,

 

FYI, <namefind/person>Rick Marshall</namefind/person> unofficially approved a 3-day trip starting <namefind/date>14 November</namefind/date> for one person from the Enterprise team down to <namefind/location>Jacksonville, FL</namefind/location> to assist in the catalog reinstall.

 

I’ll be placing it in the travel portal soon for the official process, so that the option becomes officially available to us.

 

I think together we’ll be able to push through the environment differences better in person than over the phone.

 

Let us know whether your site can even accommodate a visitor, and when you’d like to exercise this option.

 

Respectfully,

 

<namefind/person>Amir Soofi</namefind/person>

 

Principal Software Engineer, Enterprise

 

The output of an example reduce task on this output:

{DocumentUniqueID, EntityKey, EntityType}

{234cba3231, Amir Soofi, Person}

{234cba3231, Thursday, December 06, 2012 2:37 AM, Date}

{234cba3231, Aaron Macarthur, Person}

{234cba3231, Hugo Cruz, Person}

{234cba3231, Donald Krapohl, Person}

{234cba3231, Rick Marshall, Person}

{234cba3231, 14 November, Date}

{234cba3231, Jacksonville/,FL, Location}

{234cba3231, Amir Soofi, Person}

 

A second reduce pass might yield combinations for network analysis (link strength below being calculated on instances of co-occurrence across unique documents; a toy awk sketch of this calculation follows the example):

{EntityKey, LinkedEntity, LinkStrength}

{Amir Soofi, Donald Krapohl, 6}

{Amir Soofi, Aaron Macarthur, 15}

{Amir Soofi, Jacksonville/, FL, 1}
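
A toy way to compute that link strength from the first-pass triples, assuming they land in a tab-separated file of (DocumentUniqueID, EntityKey, EntityType) rows (the file name and layout are my assumptions), is a sort/awk pass like this:

# Sketch: count, per entity pair, the number of distinct documents in which both entities appear
sort -u entities.tsv | awk -F'\t' '
{ doc[$1] = doc[$1] SUBSEP $2 }                  # collect the entities seen in each document
END {
  for (d in doc) {
    n = split(substr(doc[d], 2), e, SUBSEP)
    for (i = 1; i < n; i++)
      for (j = i + 1; j <= n; j++)
        pair[e[i] "\t" e[j]]++                   # one count per co-occurring document
  }
  for (p in pair) print p "\t" pair[p]           # EntityKey, LinkedEntity, LinkStrength
}'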

 

The data may then be consumed into the analysis tool of choice, such as RapidMiner, WEKA, PowerPivot, or SQL Server/SQL Server Analysis Services for further analysis.

Conclusion

openNLP on Hadoop can provide good metadata extraction for key information in unstructured data.  The information may be retrieved from competitor websites, SEC filings, Twitter activity, employee social network activity, or many other sources.  The data pre-processing and preparation effort in metadata extraction for competitive intelligence applications can be low relative to that of other analytical problems (contract semantic analysis, social analysis trending, etc.).  The steps outlined in this paper demonstrate a very high-level overview of a logical architecture and the key execution activities required to gather metadata for Influence Analysis and Network Analysis for competitive advantage.


Can I predict which contracts will likely be awarded in my area?

By Don Krapohl

  1. Open WEKA explorer
  2. On the Preprocess tab, open the government_contracts.arff file.
  3. Perform pre-processing
    1. Escape non-enclosure single- and double-quotes (’, ”) if using a delimited text version.
    2. Check ‘UniqueTransactionID’ and click ‘Remove’.  Stating the obvious, there is no value in analyzing a continuous random transaction ID: discretization and local smoothing can lead to overfitting, and it has no predictive value.
    3. If you have saved the arff back into a csv you will have to filter the ZIP code fields RecipientZipCode and PlaceOfPerformanceZipCode back to nominal with the unsupervised attribute filter StringToNominal and DollarsObligated to numeric.
  4. Using the attribute evaluator to explore algorithm merit on the ‘Select Attributes’ tab, use the ClassifierSubsetEval evaluator with the Naïve Bayes algorithm and a RandomSearch search, predicting the Product or Service Code (PSC).  This yields:

Selected attributes: 2,3,4,6 : 4

ContractPricing

FundingAgency

PlaceofPerformanceZipCode

RecipientZipCode

 

This indicates the best prediction of a Product or Service Code using the Naïve Bayes algorithm is a 40% (0.407 subset merit) predictive ability if you know these contract attributes.

  5. Using those attributes to predict PSC, select the Classify tab, choose bayes classifier -> Naïve Bayes, 10-fold cross validation, predict PSC, and click ‘Start’.  The output will indicate F-measure and other attribute significance by class.  An example of a single class result is:

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class

0               0.014      0                  0            0                         0.972        REFRIGERATION AND AIR CONDITIONING COMPONENTS

  6. View the threshold for the prediction by right-clicking the result buffer entry at the left and hovering over Threshold Curve.  Select “REFRIGERATION AND AIR CONDITIONING COMPONENTS”, for example.  The curve is as follows:

 

Classifier accuracy

 

This shows a 97% predictive accuracy on this class.  The F-Measure visualization further supports this:

 

Lift chart showing classifier coverage
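
For repeatability outside the Explorer GUI, the same 10-fold Naïve Bayes run can be kicked off from the WEKA command line; the class index and file locations below are assumptions for this dataset.

# Sketch: 10-fold cross-validation of Naive Bayes from the command line (paths and class index are assumptions)
java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t government_contracts.arff -c last -x 10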

To see an analogous cluster visualization using Excel and the SQL Server 2008 R2 addins, see my quick article on Activity Clustering on Geography.


Entity Extraction and Competitive Intelligence

I have been approached by multiple companies wishing to perform entity extraction for competitive intelligence. Simply put, executives want to know what their competition is up to, they want to expand their company, or they are just performing market research for a proposal. The targets are typically newspaper stories, SEC filings, blogs, social media, and other unstructured content. Another goal is frequently to create intellectual property by way of a branded product. Frequently these are Microsoft .net-driven organizations. These are characterized by robust enterprise licensing with Microsoft, a mature product ecosystem, and large sunk cost in existing systems, making a .net platform more amenable to their resource base and portfolio.

Making a quick-hit entity extractor possible in this environment are the open-source projects openNLP (open Natural Language Processing) and IKVM, a free Java virtual machine implementation for .net that can compile Java libraries into .net assemblies. openNLP provides entity extraction through pre-trained models for several common entity types: person, organization, date, time, location, percentage, and money. openNLP also provides for training and refinement of user-created models.

This article won’t undertake to answer the questions of requirements gathering, fitness measurement, statistical analysis, model internals, platform architecture, operational support, or release management, but these are factors which should be considered prior to development for a production application.

Preparation

This article assumes the user has .net development skill and knowledge of the fundamentals of natural language processing. Download the latest version of openNLP from The Apache Foundation website and extract it to a directory of your choice. You will also need to download models for tokenization, sentence detection, and the entity model of your choice (person, date, etc.). Likewise, download the latest version of IKVM from SourceForge and extract it to a directory of your choice.

Create the openNLP dll

Open a command prompt and navigate to the ikvmbin-(yourProductVersion)/bin directory and build the openNLP dll with the command (change the versions to match yours):
ikvmc -target:library -assembly:openNLP opennlp-maxent-3.0.2-incubating.jar jwnl-1.3.3.jar opennlp-tools-1.5.2-incubating.jar

Create your .net Project

Create a project of your choice at a known location. Add a project reference to:
IKVM.OpenJDK.Core.dll
IKVM.OpenJDK.Jdbc.dll
IKVM.OpenJDK.Text.dll
IKVM.OpenJDK.Util.dll
IKVM.OpenJDK.XML.API.dll
IKVM.Runtime.dll
openNLP.dll

Create your Class

Copy the code below and paste it into a blank C# class file. Change the paths to the models to match where you downloaded them. Compile your application and call EntityExtractor.ExtractEntities with the content to be searched and the entity extraction type.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace NaturalLanguageProcessingCSharp
{

public class EntityExtractor
{
///
/// Entity Extraction for the entity types available in openNLP.
/// TODO:
/// try/catch/exception handling
/// filestream closure
/// model training if desired
/// Regex or dictionary entity extraction
/// clean up the setting of the Name Finder model path
/// Implement entity extraction in other languages
/// Implement entity extraction for other entity types

/// Call syntax: myList = ExtractEntities(myInText, EntityType.Person);

private string sentenceModelPath = "c:\\models\\en-sent.bin"; //path to the model for sentence detection
private string nameFinderModelPath; //NameFinder model path, set below based on the requested entity type
private string tokenModelPath = "c:\\models\\en-token.bin"; //model path for English tokens
public enum EntityType
{
Date = 0,
Location,
Money,
Organization,
Person,
Time
}

public List<string> ExtractEntities(string inputData, EntityType targetType)
{
/*required steps to detect names are:
* downloaded sentence, token, and name models from http://opennlp.sourceforge.net/models-1.5/
* 1. Parse the input into sentences
* 2. Parse the sentences into tokens
* 3. Find the entity in the tokens

*/

//------------------ Preparation -- set the Name Finder model path based upon entity type ------------------
switch (targetType)
{
case EntityType.Date:
nameFinderModelPath = "c:\\models\\en-ner-date.bin";
break;
case EntityType.Location:
nameFinderModelPath = "c:\\models\\en-ner-location.bin";
break;
case EntityType.Money:
nameFinderModelPath = "c:\\models\\en-ner-money.bin";
break;
case EntityType.Organization:
nameFinderModelPath = "c:\\models\\en-ner-organization.bin";
break;
case EntityType.Person:
nameFinderModelPath = "c:\\models\\en-ner-person.bin";
break;
case EntityType.Time:
nameFinderModelPath = "c:\\models\\en-ner-time.bin";
break;
default:
break;
}

//----------------- Preparation -- load models into objects -----------------
//initialize the sentence detector
opennlp.tools.sentdetect.SentenceDetectorME sentenceParser = prepareSentenceDetector();

//initialize person names model
opennlp.tools.namefind.NameFinderME nameFinder = prepareNameFinder();

//initialize the tokenizer–used to break our sentences into words (tokens)
opennlp.tools.tokenize.TokenizerME tokenizer = prepareTokenizer();

//------------------ Make sentences, then tokens, then get names ------------------

String[] sentences = sentenceParser.sentDetect(inputData); //detect the sentences and load into sentence array of strings
List<string> results = new List<string>();

foreach (string sentence in sentences)
{
//now tokenize the input.
//"Don Krapohl enjoys warm sunny weather" would tokenize as
//"Don", "Krapohl", "enjoys", "warm", "sunny", "weather"
string[] tokens = tokenizer.tokenize(sentence);

//do the find
opennlp.tools.util.Span[] foundNames = nameFinder.find(tokens);

//important: clear adaptive data in the feature generators or the detection rate will decrease over time.
nameFinder.clearAdaptiveData();

results.AddRange( opennlp.tools.util.Span.spansToStrings(foundNames, tokens).AsEnumerable());
}

return results;
}

#region private methods
private opennlp.tools.tokenize.TokenizerME prepareTokenizer()
{
java.io.FileInputStream tokenInputStream = new java.io.FileInputStream(tokenModelPath); //load the token model into a stream
opennlp.tools.tokenize.TokenizerModel tokenModel = new opennlp.tools.tokenize.TokenizerModel(tokenInputStream); //load the token model
return new opennlp.tools.tokenize.TokenizerME(tokenModel); //create the tokenizer
}
private opennlp.tools.sentdetect.SentenceDetectorME prepareSentenceDetector()
{
java.io.FileInputStream sentModelStream = new java.io.FileInputStream(sentenceModelPath); //load the sentence model into a stream
opennlp.tools.sentdetect.SentenceModel sentModel = new opennlp.tools.sentdetect.SentenceModel(sentModelStream);// load the model
return new opennlp.tools.sentdetect.SentenceDetectorME(sentModel); //create sentence detector
}
private opennlp.tools.namefind.NameFinderME prepareNameFinder()
{
java.io.FileInputStream modelInputStream = new java.io.FileInputStream(nameFinderModelPath); //load the name model into a stream
opennlp.tools.namefind.TokenNameFinderModel model = new opennlp.tools.namefind.TokenNameFinderModel(modelInputStream); //load the model
return new opennlp.tools.namefind.NameFinderME(model); //create the namefinder
}
#endregion
}
}

Hadoop on Azure and HDInsight integration
4/2/2013 – HDInsight doesn’t seem to support openNLP or any other natural language processing algorithm. It does integrate well with SQL Server Analysis Services and the rest of the Microsoft business intelligence stack, which do provide excellent views within and across data islands. I hope to see NLP on HDInsight in the near future for algorithms stronger than the LSA/LSI (latent semantic analysis/latent semantic indexing–semantic query) in SQL Server 2012.
