Automatic Entity Extraction using openNLP in C#

Entity Extraction and Competitive Intelligence

I have been approached by multiple companies wishing to perform entity extraction for competitive intelligence. Simply put, executives want to know what their competition is up to, they want to expand their company, or they are just performing market research for a proposal. The targets are typically newspaper stories, SEC filing, blogs, social media, and other unstructured content. Another goal is frequently to create intellectual property by way of branded product. Frequently these are Microsoft .net-driven organizations. These are characterized by robust enterprise licensing with Microsoft, a mature product ecosystem, and large sunk cost in existing systems making a .net platform more amenable to their resource base and portfolio.

Making possible a quick-hit entity extractor in this environment are the opensource projects openNLP (open Natural Language Processing) and IKVM, a free java virtual machine that runs .net assemblies. openNLP provides entity extraction through pre-trained models for extraction of several common entity types: person, organization, date, time, location, percentage, and money. openNLP also provides for training and refinement of user-created models.

This article won’t undertake to answer the questions of requirements gathering, fitness measurement, statistical analysis, model internals, platform architecture, operational support, or release management, but these are factors which should be considered prior to development for a production application.

Preparation

This article assumes the user has .net development skill and knowledge of the fundamentals of natural language processing. Download the latest version of openNLP from The Apache Foundation website and extract it to a directory of your choice. You will also need to download models for tokenization, sentence detection, and the entity model of your choice (person, date, etc.). Likewise, download the latest version of IKVM from SourceForge and extract it to a directory of your choice.

Create the openNLP dll

Open a command prompt and navigate to the ikvmbin-(yourProductVersion)/bin directory and build the openNLP dll with the command (change the versions to match yours):
ikvmc -target:library -assembly:openNLP opennlp-maxent-3.0.2-incubating.jar jwnl-1.3.3.jar opennlp-tools-1.5.2-incubating.jar

Create your .net Project

Create a project of your choice at a known location. Add a project reference to:
IKVM.OpenJDK.Core.dll
IKVM.OpenJDK.Jdbc.dll
IKVM.OpenJDK.Text.dll
IKVM.OpenJDK.Util.dll
IKVM.OpenJDK.XML.API.dll
IKVM.Runtime.dll
openNLP.dll

Create your Class

Copy the code below and paste it into a blank C# class file. Change the path to the models to match where you downloaded them. Compile your application and call the EntityExtractor.ExtractEntities with the content to be searched and the entity extraction type.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace NaturalLanguageProcessingCSharp
{

public class EntityExtractor
{
///
/// Entity Extraction for the entity types available in openNLP.
/// TODO:
/// try/catch/exception handling
/// filestream closure
/// model training if desired
/// Regex or dictionary entity extraction
/// clean up the setting of the Name Finder model path
/// Implement entity extraction in other languages
/// Implement entity extraction for other entity types

/// Call syntax: myList = ExtractEntities(myInText, EntityType.Person);

private string sentenceModelPath = “c:\\models\\en-sent.bin”; //path to the model for sentence detection
private string nameFinderModelPath; //NameFinder model path for English names
private string tokenModelPath = “c:\\models\\en-token.bin”; //model path for English tokens
public enum EntityType
{
Date = 0,
Location,
Money,
Organization,
Person,
Time
}

public List ExtractEntities(string inputData, EntityType targetType)
{
/*required steps to detect names are:
* downloaded sentence, token, and name models from http://opennlp.sourceforge.net/models-1.5/
* 1. Parse the input into sentences
* 2. Parse the sentences into tokens
* 3. Find the entity in the tokens

*/

//——————Preparation — Set Name Finder model path based upon entity type—————–
switch (targetType)
{
case EntityType.Date:
nameFinderModelPath = “c:\\models\\en-ner-date.bin”;
break;
case EntityType.Location:
nameFinderModelPath = “c:\\models\\en-ner-location.bin”;
break;
case EntityType.Money:
nameFinderModelPath = “c:\\models\\en-ner-money.bin”;
break;
case EntityType.Organization:
nameFinderModelPath = “c:\\models\\en-ner-organization.bin”;
break;
case EntityType.Person:
nameFinderModelPath = “c:\\models\\en-ner-person.bin”;
break;
case EntityType.Time:
nameFinderModelPath = “c:\\models\\en-ner-time.bin”;
break;
default:
break;
}

//—————– Preparation — load models into objects—————–
//initialize the sentence detector
opennlp.tools.sentdetect.SentenceDetectorME sentenceParser = prepareSentenceDetector();

//initialize person names model
opennlp.tools.namefind.NameFinderME nameFinder = prepareNameFinder();

//initialize the tokenizer–used to break our sentences into words (tokens)
opennlp.tools.tokenize.TokenizerME tokenizer = prepareTokenizer();

//—————— Make sentences, then tokens, then get names——————————–

String[] sentences = sentenceParser.sentDetect(inputData) ; //detect the sentences and load into sentence array of strings
List results = new List();

foreach (string sentence in sentences)
{
//now tokenize the input.
//”Don Krapohl enjoys warm sunny weather” would tokenize as
//”Don”, “Krapohl”, “enjoys”, “warm”, “sunny”, “weather”
string[] tokens = tokenizer.tokenize(sentence);

//do the find
opennlp.tools.util.Span[] foundNames = nameFinder.find(tokens);

//important: clear adaptive data in the feature generators or the detection rate will decrease over time.
nameFinder.clearAdaptiveData();

results.AddRange( opennlp.tools.util.Span.spansToStrings(foundNames, tokens).AsEnumerable());
}

return results;
}

#region private methods
private opennlp.tools.tokenize.TokenizerME prepareTokenizer()
{
java.io.FileInputStream tokenInputStream = new java.io.FileInputStream(tokenModelPath); //load the token model into a stream
opennlp.tools.tokenize.TokenizerModel tokenModel = new opennlp.tools.tokenize.TokenizerModel(tokenInputStream); //load the token model
return new opennlp.tools.tokenize.TokenizerME(tokenModel); //create the tokenizer
}
private opennlp.tools.sentdetect.SentenceDetectorME prepareSentenceDetector()
{
java.io.FileInputStream sentModelStream = new java.io.FileInputStream(sentenceModelPath); //load the sentence model into a stream
opennlp.tools.sentdetect.SentenceModel sentModel = new opennlp.tools.sentdetect.SentenceModel(sentModelStream);// load the model
return new opennlp.tools.sentdetect.SentenceDetectorME(sentModel); //create sentence detector
}
private opennlp.tools.namefind.NameFinderME prepareNameFinder()
{
java.io.FileInputStream modelInputStream = new java.io.FileInputStream(nameFinderModelPath); //load the name model into a stream
opennlp.tools.namefind.TokenNameFinderModel model = new opennlp.tools.namefind.TokenNameFinderModel(modelInputStream); //load the model
return new opennlp.tools.namefind.NameFinderME(model); //create the namefinder
}
#endregion
}
}

Hadoop on Azure and HDInsight integration
4/2/2013 – HDInsight doesn’t seem to support openNLP or any other natural language processing algorithm. It does integrate well with SQL Server Analysis Services and the rest of the Microsoft business intelligence stack, which do provide excellent views within and across data islands. I hope to see NLP on HDInsight in the near future for algorithms stronger than the LSA/LSI (latent semantic analysis/latent semantic indexing–semantic query) in SQL Server 2012.

My Google+

Leave a Reply

Your email address will not be published. Required fields are marked *