Approachable Data Mining Tutorials for the Non-Data Miner

A list of several sources to learn data science in a hands-on format

https://www.coursera.org/course/ml – The most approachable machine learning course available. And it’s free.

https://www.kaggle.com/wiki/Tutorials – Provides data sources, forums, scenarios, and real-world competitions to teach data mining

http://deeplearning.net/tutorial/ – Tutorial on Deep Learning – an introduction to machine learning algorithms for image analysis

http://tryr.codeschool.com/ – Interactive introduction to R Language

2.5TB, 53.5 Billion Clicks Dataset Available for Clickstream Analysis

To foster the study of the structure and dynamics of Web traffic networks, Indiana University has made available a large dataset (‘Click Dataset’) of about 53.5 billion HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network rather than relying on server logs and browser instrumentation allows one to examine large volumes of traffic data while minimizing biases associated with other data sources. It also provides one with valuable referrer information to reconstruct the subset of the Web graph actually traversed by users. The goal is to develop a better understanding of user behavior online and create more realistic models of Web traffic. The potential applications of this data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

The data was generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University. This filter matched all traffic destined for TCP port 80. A long-running collection process used the pcap library to gather these packets, then applied a small set of regular expressions to their payloads to determine whether they contained HTTP GET requests. If a packet did contain a request, the collection system logged a record with the following fields:

  • a timestamp
  • the requested URL
  • the referring URL
  • a boolean classification of the user agent (browser or bot)
  • a boolean flag for whether the request was generated inside or outside IU.
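For illustration only (this is not the collection system's actual code), the payload matching described above amounts to something like the following C# sketch, which tests a captured payload for an HTTP GET request line and pulls out the Host and Referer headers:

using System;
using System.Text.RegularExpressions;

public static class GetRequestSniffer
{
    // Request line of an HTTP GET, e.g. "GET /index.html HTTP/1.1"
    private static readonly Regex RequestLine =
        new Regex(@"^GET\s+(?<path>\S+)\s+HTTP/1\.[01]", RegexOptions.Compiled);
    // Host and Referer headers, one per line in the payload
    private static readonly Regex HostHeader =
        new Regex(@"^Host:\s*(?<host>\S+)", RegexOptions.Multiline | RegexOptions.Compiled);
    private static readonly Regex ReferrerHeader =
        new Regex(@"^Referer:\s*(?<referrer>\S+)", RegexOptions.Multiline | RegexOptions.Compiled);

    // Returns (host, path, referrer) if the payload looks like an HTTP GET, otherwise null.
    public static Tuple<string, string, string> TryParse(string payload)
    {
        Match req = RequestLine.Match(payload);
        if (!req.Success) return null;

        Match host = HostHeader.Match(payload);
        Match referrer = ReferrerHeader.Match(payload);
        return Tuple.Create(
            host.Success ? host.Groups["host"].Value : "",
            req.Groups["path"].Value,
            referrer.Success ? referrer.Groups["referrer"].Value : "");
    }
}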

Some important notes:

  1. Traffic generated outside IU only includes requests from outside IU for pages inside IU. Traffic generated inside IU only includes requests from people at IU (about 100,000 users) for resources outside IU. These two sets of requests have very different sampling biases.
  2. No distinguishing information about the client system was retained: no MAC or IP addresses nor any unique index were ever recorded.
  3. There was no attempt at stream reassembly, and server responses were not analyzed.

During collection, the system generated data at a rate of about 60 million requests per day, or about 30 GB/day of raw data. The data was collected between Sep 2006 and May 2010. Data is missing for about 275 days. The dataset has two collections:

  1. raw: About 25 billion requests, where only the host name of the referrer is retained. Collected between 26 Sep 2006 and 3 Mar 2008; missing 98 days of data, including the entire month of Jun 2007. Approximately 0.85 TB, compressed.
  2. raw-url: About 28.6 billion requests, where the full referrer URL is retained. Collected between 3 Mar 2008 and 31 May 2010; missing 179 days of data, including the entire months of Dec 2008, Jan 2009, and Feb 2009. Approximately 1.5 TB, compressed.

The dataset is broken into hourly files. The initial line of each file has a set of flags that can be ignored. Each record looks like this:

 XXXXADreferrer  host  path

where XXXX is the timestamp (32-bit Unix epoch in seconds, in little-endian order), A is the user-agent flag (“B” for browser or “?” for other, including bots), D is the direction flag (“O” for external traffic to IU, “I” for internal traffic to outside IU), referrer is the referrer hostname or URL (terminated by newline), host is the target hostname (terminated by newline), and path is the target path (terminated by newline).
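As an illustration of how to consume the files (again, not code from the dataset's maintainers), a decompressed hourly file with this layout could be read with a sketch like the one below; it assumes the caller has already skipped the initial line of flags.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public class ClickRecord
{
    public DateTime Timestamp;   // from the 32-bit Unix epoch field
    public char AgentFlag;       // 'B' = browser, '?' = other/bot
    public char DirectionFlag;   // 'O' = external traffic to IU, 'I' = internal traffic to outside IU
    public string Referrer;
    public string Host;
    public string Path;
}

public static class ClickDatasetReader
{
    // Reads one newline-terminated ASCII field from the stream.
    private static string ReadLineField(Stream s)
    {
        var sb = new StringBuilder();
        int b;
        while ((b = s.ReadByte()) != -1 && b != '\n')
            sb.Append((char)b);
        return sb.ToString();
    }

    // Enumerates records from a decompressed hourly file; assumes the initial flag line was skipped.
    public static IEnumerable<ClickRecord> Read(Stream s)
    {
        var tsBuf = new byte[4];
        var epochStart = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
        while (s.Read(tsBuf, 0, 4) == 4)
        {
            // BitConverter is little-endian on x86/x64, matching the file format
            uint epoch = BitConverter.ToUInt32(tsBuf, 0);
            yield return new ClickRecord
            {
                Timestamp = epochStart.AddSeconds(epoch),
                AgentFlag = (char)s.ReadByte(),
                DirectionFlag = (char)s.ReadByte(),
                Referrer = ReadLineField(s),
                Host = ReadLineField(s),
                Path = ReadLineField(s)
            };
        }
    }
}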

The Click Dataset is large (~2.5 TB compressed), so it must be transferred on a physical hard drive. You will have to provide the drive as well as pre-paid return shipment. Additionally, because the dataset might contain bits of stray personal data, you will have to sign a data security agreement. Indiana University requires that you follow these instructions to request the data.

Citation information and FAQs are available on the team’s page at http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset .

US Government Business Capture Data Mining in Microsoft Excel

View agency activity clustered by geography in Excel using the Excel Data Mining Add-ins

By Don Krapohl

1.       Ensure you have downloaded the Excel Data Mining Add-ins from Microsoft at http://www.microsoft.com/en-us/download/details.aspx?id=35578 .  The article assumes you have a working version of the DM Add-ins and a default Analysis Services (SSAS) instance defined.  Search for “getting started with SQL Server Data Mining Add-ins for Excel” if you are not familiar with this process.

2.       Open the Excel sample file for Federal contract acquisitions in Wyoming (2012) from http://www.augmentedintel.com/content/datasets/government_contracts_data_mining_addins.xlsx

3.       On the wy_data_feed tab, select all the data.

4.       On the Home tab of the ribbon, in the Styles section, select “Format as Table”.  Pick any format you wish.

5.       A new Table Tools tab will appear on the ribbon, with Analyze and Design menus as below.

Microsoft Excel Table Tools menu
Table tools to format data as a table in Excel

 

6.       On the Analyze menu, select “Detect Categories”.  This will group (cluster) your information on common attributes, in particular commonalities that are not obvious or immediately observable.

7.       Deselect all checkboxes except the following:

a.       Dollars Obligated

b.      Award Type

c.       Contract Pricing

d.      Funding Agency

e.      Product Or Service Code

f.        Category

8.       Click ‘Run’.

9.       The output will show categories of records with strong affinities.  Explore the model by filtering the charts and tables by the generated categories.  Do this by selecting the filter icon (funnel) next to Category on the table, or the Category label at the lower left of the graph.

10.   The groups with fewer rows can be especially valuable, as they may reveal correlations worth targeting in a campaign.  For example, filter the table and chart on Category 6.  This group shows an affinity for the attribute values ProductOrServiceCode = “REFRIGERATION AND AIR CONDITIONING COMPONENTS”, fundingAgency = “Veterans Affairs, Department Of”, and a contract award value of $61,148 to $1,173,695, as shown below:

 

Importance of data categories in Excel Data Mining Add-ins
Factor Analysis in Microsoft Excel

For my organization’s business development activities, if I am in the heating and air conditioning business I may elect to focus efforts on medium-sized contracts with Veterans Affairs.


Predicting the Best Parameters for Federal Business Capture using WEKA

Which contract parameters should I choose?

What combination of features might I pursue to raise my probability of contract award?

  1. Open the WEKA Explorer
  2. On the Preprocess tab, open the government_contracts.arff file.
  3. Perform pre-processing
    1. Escape non-enclosure single- and double-quotes (\', \") if using a delimited text version.
    2. Check ‘UniqueTransactionID’ and click ‘Remove’.  A random, continuous transaction ID has no predictive value, and discretization and local smoothing on it can lead to overfitting.
    3. If you have saved the arff back to CSV, you will have to convert the ZIP code fields RecipientZipCode and PlaceOfPerformanceZipCode back to nominal with the unsupervised attribute filter StringToNominal, and convert DollarsObligated back to numeric.
  4. On the Associate tab, select the Apriori algorithm and click ‘Start’.  The results:

 

WEKA association rules for contract feature prediction
Predicting Award Parameters

 

This indicates that if you are located in ZIP code 83110 and the work will be performed within ZIP code 83110, pursuing Firm Fixed Price contracts with the VA may give you an advantage in the acquisition.
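If you prefer to script this step rather than click through the Explorer, the same association run can be reproduced in code. The sketch below is illustrative only: it assumes you have compiled weka.jar into a weka.dll with ikvmc (the same IKVM approach used for openNLP later on this page), and it assumes the ARFF has already been pre-processed as described above, since Apriori handles nominal attributes only.

using System;
using weka.core;
using weka.core.converters;
using weka.associations;

class ContractAssociationRules
{
    static void Main(string[] args)
    {
        // load the contract data (the path is a placeholder)
        ConverterUtils.DataSource source =
            new ConverterUtils.DataSource("government_contracts.arff");
        Instances data = source.getDataSet();

        // NOTE: as in the Explorer steps above, UniqueTransactionID should already have been
        // removed and numeric fields discretized or dropped, since Apriori needs nominal attributes

        // run Apriori with default settings and print the discovered rules
        Apriori apriori = new Apriori();
        apriori.buildAssociations(data);
        Console.WriteLine(apriori);
    }
}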


Automated Metadata Extraction for Competitive Intelligence

Artificial Intelligence for the Creation of Competitive Intelligence Tools

Introduction

Often in prioritizing business development activities it is helpful to determine who is able to influence a decision and how they are related to those in the market space.  To make a defensible and actionable strategy it is useful to perform Influence Analysis and Network Analysis, which can form the kernel of a competitive intelligence analysis strategy.  The data required for analysis must be obtained by identifying and extracting target attribute values in unstructured and often very large (multi-terabyte or petabyte) data stores.  This necessitates a scalable infrastructure, distributed parallel computing capability, and fit-for-use natural language processing algorithms.  Herein I will demonstrate a target logical architecture and methodology for accomplishing the task.  Influence and Network analysis by machine learning algorithm (naïve Bayes or perceptron, for example) will be covered in a later supporting article.

Recognizing Significance

Named-Entity Recognition is required for unstructured content extraction in this scenario.  This identification scheme may or may not employ stemming but will always require tokenizing, part-of-speech tagging, and the acquisition of a predefined model of attribute patterns to properly recognize and extract required metadata.  A powerful platform with these built-in capabilities is the Apache openNLP project, which includes typed attribute models for the name finder, an extensible name finder algorithm, an API that exposes a Lucene index consumer, and a scalable, distributed architecture.  The Apache Stanbol project in the incubator (http://stanbol.apache.org/) shows promise at semantic-based extraction and content enhancement but hasn’t been promoted outside the incubator yet.

Apache openNLP attribute recognition models are available in only a few languages, with the original and largest set being English.  The community publishes English models for the Name Finder interface covering dates, locations, money, organizations, percentages, persons, and times.  Each is an appropriate candidate for term extraction for competitive intelligence analysis.

Logical Architecture

Natural Language Processing for Competitive Intelligence
openNLP in four node Hadoop cluster

The controlling requirement for metadata extraction from massive data sources is the ability to process very large datasets to extract the target information.  For this, Hadoop provides a flexible, fault-tolerant framework and processing model that readily supports the natural language processing needs.  The logical architecture for a small (<1TB) 4-node clustered Hadoop solution is as follows:

 

Process Flow

As shown below, the process to execute is standardized on the map/reduce patterns Distributed Task Execution, Union, Selection, and Intersection.  Pre-processing using a Graph Processing pattern in a distinctly separate map phase would likely speed up any Influence Analysis performed post-process.

 

Operations Sequence Diagram of openNLP with Map Reduce on Hadoop for Competitive Intelligence
Multi-node Sequence Diagram for openNLP with Map Reduce on Hadoop

The primary namenode initiates work and passes the data and map/reduce execution program to the task trackers, which in turn distribute it among the worker nodes.  The worker nodes execute the map on HDFS-stored data and report health and status to the task tracker, which relays it to the primary namenode.  On completion of a node's map work, the primary namenode may redistribute map work to the worker node or order the reduce task, each by way of the task tracker.  The reduce task selects data from the HDFS interim result set, aggregates it, and streams it to a result file.  The result file is then used for analysis by the machine learning algorithm of choice.
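One simplified way to realize the map step is Hadoop Streaming, which lets the mapper be any stdin/stdout executable. The sketch below is illustrative only and is compressed relative to the flow described above: it emits the (DocumentUniqueID, EntityKey, EntityType) tuples directly rather than annotated text, and it stubs the openNLP Name Finder with a trivial placeholder pattern (a real mapper would call the Name Finder as in the C# example later on this page).

using System;
using System.Text.RegularExpressions;

// Sketch of a Hadoop Streaming-style mapper. Each input line on stdin is assumed to be
// "DocumentUniqueID<TAB>document text"; the mapper writes one
// "DocumentUniqueID<TAB>EntityKey<TAB>EntityType" line to stdout per entity found.
class EntityExtractionMapper
{
    // crude stand-in for the en-ner-person model; for illustration only
    static readonly Regex PlaceholderPerson =
        new Regex(@"\b[A-Z][a-z]+ [A-Z][a-z]+\b");

    static void Main()
    {
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            int tab = line.IndexOf('\t');
            if (tab < 0) continue;

            string docId = line.Substring(0, tab);
            string text = line.Substring(tab + 1);

            foreach (Match m in PlaceholderPerson.Matches(text))
                Console.WriteLine("{0}\t{1}\t{2}", docId, m.Value, "Person");
        }
    }
}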

File Structures

The input file is of a machine-readable ASCII text type and is unstructured.  Example:

 

From: Amir Soofi

Sent: Thursday, December 06, 2012 2:37 AM

To: Aaron Macarthur; Hugo Cruz

Cc: Donald Krapohl

Subject: RE: Language Comparison

 

Hugo,

 

FYI, Rick Marshall unofficially approved a 3-day trip starting 14 November for one person from the Enterprise team down to Jacksonville, FL to assist in the catalog reinstall.

 

I’ll be placing it in the travel portal soon for the official process, so that the option becomes officially available to us.

 

I think together we’ll be able to push through the environment differences better in person than over the phone.

 

Let us know whether your site can even accommodate a visitor, and when you’d like to exercise this option.

 

Respectfully,

 

Amir Soofi

 

Principal Software Engineer, Enterprise

 

 

The output of the openNLP Name Find algorithm map task on this input:

From: <namefind/person>Amir Soofi</namefind/person>

Sent: <namefind/date>Thursday, December 06, 2012 2:37 AM</namefind/date>

To: <namefind/person>Aaron Macarthur</namefind/person>; <namefind/person>Hugo Cruz</namefind/person>

Cc: <namefind/person>Donald Krapohl</namefind/person>

Subject: RE: Language Comparison

 

Hugo,

 

FYI, <namefind/person>Rick Marshall</namefind/person> unofficially approved a 3-day trip starting <namefind/date>14 November</namefind/date> for one person from the Enterprise team down to <namefind/location>Jacksonville, FL</namefind/location> to assist in the catalog reinstall.

 

I’ll be placing it in the travel portal soon for the official process, so that the option becomes officially available to us.

 

I think together we’ll be able to push through the environment differences better in person than over the phone.

 

Let us know whether your site can even accommodate a visitor, and when you’d like to exercise this option.

 

Respectfully,

 

<namefind/person>Amir Soofi</namefind/person>

 

Principal Software Engineer, Enterprise

 

The output of an example reduce task on this output:

{DocumentUniqueID, EntityKey, EntityType}

{234cba3231, Amir Soofi, Person}

{234cba3231, Thursday, December 06, 2012 2:37 AM, Date}

{234cba3231, Aaron Macarthur, Person}

{234cba3231, Hugo Cruz, Person}

{234cba3231, Donald Krapohl, Person}

{234cba3231, Rick Marshall, Person}

{234cba3231, 14 November, Date}

{234cba3231, Jacksonville, FL, Location}

{234cba3231, Amir Soofi, Person}

 

A second reduce pass might yield combinations for network analysis (link strength below being calculated from instances of co-occurrence across unique documents):

{EntityKey, LinkedEntity, LinkStrength}

{Amir Soofi, Donald Krapohl, 6}

{Amir Soofi, Aaron Macarthur, 15}

{Amir Soofi, Jacksonville, FL, 1}
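The link-strength computation itself is straightforward once the per-document entity tuples exist. The sketch below is a plain in-memory illustration (the names and types are mine, not the production implementation): for each unordered entity pair it counts the number of distinct documents in which both entities appear.

using System;
using System.Collections.Generic;
using System.Linq;

public static class LinkStrengthCalculator
{
    // Input: (DocumentUniqueID, EntityKey) pairs from the entity-extraction output.
    // Output: (EntityKey, LinkedEntity, LinkStrength), where LinkStrength is the number
    // of distinct documents in which the two entities co-occur.
    public static IEnumerable<Tuple<string, string, int>> Compute(
        IEnumerable<Tuple<string, string>> docEntityPairs)
    {
        var pairDocs = new Dictionary<Tuple<string, string>, HashSet<string>>();

        // group entity mentions by document, then count each unordered entity pair once per document
        foreach (var doc in docEntityPairs.GroupBy(p => p.Item1))
        {
            var entities = doc.Select(p => p.Item2).Distinct().OrderBy(e => e).ToList();
            for (int i = 0; i < entities.Count; i++)
            {
                for (int j = i + 1; j < entities.Count; j++)
                {
                    var key = Tuple.Create(entities[i], entities[j]);
                    HashSet<string> docs;
                    if (!pairDocs.TryGetValue(key, out docs))
                    {
                        docs = new HashSet<string>();
                        pairDocs[key] = docs;
                    }
                    docs.Add(doc.Key);
                }
            }
        }

        return pairDocs.Select(kv => Tuple.Create(kv.Key.Item1, kv.Key.Item2, kv.Value.Count));
    }
}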

 

The data may then be consumed into the analysis tool of choice, such as RapidMiner, WEKA, PowerPivot, or SQL Server/SQL Server Analysis Services for further analysis.

Conclusion

openNLP on Hadoop can provide good extraction of key metadata from unstructured data.  The information may be retrieved from competitor websites, SEC filings, Twitter activity, employee social network activity, or many other sources.  The data pre-processing and preparation effort for metadata extraction in competitive intelligence applications can be low relative to that of other analytical problems (contract semantic analysis, social analysis trending, etc.).  The steps outlined in this article give a high-level overview of a logical architecture and the key execution activities required to gather metadata for Influence Analysis and Network Analysis for competitive advantage.


Predicting Federal Contracts using Machine Learning Classification in WEKA

Can I predict which contracts will likely be awarded in my area?

By Don Krapohl

  1. Open the WEKA Explorer
  2. On the Preprocess tab, open the government_contracts.arff file.
  3. Perform pre-processing
    1. Escape non-enclosure single- and double-quotes (\', \") if using a delimited text version.
    2. Check ‘UniqueTransactionID’ and click ‘Remove’.  A random, continuous transaction ID has no predictive value, and discretization and local smoothing on it can lead to overfitting.
    3. If you have saved the arff back to CSV, you will have to convert the ZIP code fields RecipientZipCode and PlaceOfPerformanceZipCode back to nominal with the unsupervised attribute filter StringToNominal, and convert DollarsObligated back to numeric.
  4. Using the attribute evaluator to explore algorithm merit on the ‘Select Attributes’ tab, use the ClassifierSubsetEval evaluator with the Naïve Bayes algorithm and a RandomSearch search, predicting the Product or Service Code (PSC).  This yields:

Selected attributes: 2,3,4,6 : 4
                     ContractPricing
                     FundingAgency
                     PlaceofPerformanceZipCode
                     RecipientZipCode

 

This indicates that, using the Naïve Bayes algorithm and knowing these contract attributes, you have roughly 40% predictive ability (0.407 subset merit) for the Product or Service Code.

  5. Using those attributes to predict PSC, select the Classify tab, choose the bayes -> NaiveBayes classifier with 10-fold cross-validation, predict PSC, and click ‘Start’.  The output will show the F-measure and other per-class statistics.  An example of a single-class result is:

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0         0.014     0           0        0           0.972      REFRIGERATION AND AIR CONDITIONING COMPONENTS

  6. View the threshold curve for the prediction by right-clicking the result buffer entry at the left and hovering over Threshold Curve.  Select “REFRIGERATION AND AIR CONDITIONING COMPONENTS”, for example.  The curve is as follows:

 

Classifier accuracy

 

This shows strong discrimination for this class (ROC area of 0.972).  The F-Measure visualization further supports this:

 

Lift chart showing classifier coverage
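The same classify step can be reproduced outside the Explorer. As with the association example earlier on this page, this is a minimal sketch that assumes a weka.dll built with ikvmc; the class attribute name is an assumption and should match the PSC column in your ARFF file.

using System;
using weka.core;
using weka.core.converters;
using weka.classifiers;
using weka.classifiers.bayes;

class ContractPscClassifier
{
    static void Main(string[] args)
    {
        // load the pre-processed contract data (the path is a placeholder)
        ConverterUtils.DataSource source =
            new ConverterUtils.DataSource("government_contracts.arff");
        Instances data = source.getDataSet();

        // predict the Product or Service Code (attribute name assumed)
        data.setClassIndex(data.attribute("ProductOrServiceCode").index());

        // 10-fold cross-validation with Naive Bayes, as in the Explorer run above
        NaiveBayes nb = new NaiveBayes();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new java.util.Random(1));

        // per-class TP/FP rate, precision, recall, F-measure, and ROC area, then the summary
        Console.WriteLine(eval.toClassDetailsString());
        Console.WriteLine(eval.toSummaryString());
    }
}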

To see an analogous cluster visualization using Excel and the SQL Server 2008 R2 add-ins, see my quick article on Activity Clustering on Geography.


Automatic Entity Extraction using openNLP in C#

Entity Extraction and Competitive Intelligence

I have been approached by multiple companies wishing to perform entity extraction for competitive intelligence. Simply put, executives want to know what their competition is up to, want to expand their company, or are just performing market research for a proposal. The targets are typically newspaper stories, SEC filings, blogs, social media, and other unstructured content. Another frequent goal is to create intellectual property by way of a branded product. Frequently these are Microsoft .net-driven organizations, characterized by robust enterprise licensing with Microsoft, a mature product ecosystem, and a large sunk cost in existing systems, making a .net platform more amenable to their resource base and portfolio.

Making a quick-hit entity extractor possible in this environment are the open-source projects openNLP (open Natural Language Processing) and IKVM, a free implementation of the Java virtual machine for .net that can compile Java libraries into .net assemblies. openNLP provides entity extraction through pre-trained models for several common entity types: person, organization, date, time, location, percentage, and money. openNLP also provides for training and refinement of user-created models.

This article won’t undertake to answer the questions of requirements gathering, fitness measurement, statistical analysis, model internals, platform architecture, operational support, or release management, but these are factors which should be considered prior to development for a production application.

Preparation

This article assumes the user has .net development skill and knowledge of the fundamentals of natural language processing. Download the latest version of openNLP from The Apache Foundation website and extract it to a directory of your choice. You will also need to download models for tokenization, sentence detection, and the entity model of your choice (person, date, etc.). Likewise, download the latest version of IKVM from SourceForge and extract it to a directory of your choice.

Create the openNLP dll

Open a command prompt, navigate to the ikvmbin-(yourProductVersion)/bin directory, and build the openNLP dll with the following command (change the versions to match yours):
ikvmc -target:library -assembly:openNLP opennlp-maxent-3.0.2-incubating.jar jwnl-1.3.3.jar opennlp-tools-1.5.2-incubating.jar

Create your .net Project

Create a project of your choice at a known location. Add a project reference to:
IKVM.OpenJDK.Core.dll
IKVM.OpenJDK.Jdbc.dll
IKVM.OpenJDK.Text.dll
IKVM.OpenJDK.Util.dll
IKVM.OpenJDK.XML.API.dll
IKVM.Runtime.dll
openNLP.dll

Create your Class

Copy the code below and paste it into a blank C# class file. Change the path to the models to match where you downloaded them. Compile your application and call the EntityExtractor.ExtractEntities with the content to be searched and the entity extraction type.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace NaturalLanguageProcessingCSharp
{

public class EntityExtractor
{
///
/// Entity Extraction for the entity types available in openNLP.
/// TODO:
/// try/catch/exception handling
/// filestream closure
/// model training if desired
/// Regex or dictionary entity extraction
/// clean up the setting of the Name Finder model path
/// Implement entity extraction in other languages
/// Implement entity extraction for other entity types

/// Call syntax: myList = ExtractEntities(myInText, EntityType.Person);

private string sentenceModelPath = "c:\\models\\en-sent.bin"; //path to the model for sentence detection
private string nameFinderModelPath; //NameFinder model path for English names
private string tokenModelPath = "c:\\models\\en-token.bin"; //model path for English tokens
public enum EntityType
{
Date = 0,
Location,
Money,
Organization,
Person,
Time
}

public List<string> ExtractEntities(string inputData, EntityType targetType)
{
/* Required steps to detect names:
* 0. Download sentence, token, and name models from http://opennlp.sourceforge.net/models-1.5/
* 1. Parse the input into sentences
* 2. Parse the sentences into tokens
* 3. Find the entity in the tokens
*/

//----- Preparation: set Name Finder model path based upon entity type -----
switch (targetType)
{
case EntityType.Date:
nameFinderModelPath = "c:\\models\\en-ner-date.bin";
break;
case EntityType.Location:
nameFinderModelPath = "c:\\models\\en-ner-location.bin";
break;
case EntityType.Money:
nameFinderModelPath = "c:\\models\\en-ner-money.bin";
break;
case EntityType.Organization:
nameFinderModelPath = "c:\\models\\en-ner-organization.bin";
break;
case EntityType.Person:
nameFinderModelPath = "c:\\models\\en-ner-person.bin";
break;
case EntityType.Time:
nameFinderModelPath = "c:\\models\\en-ner-time.bin";
break;
default:
break;
}

//----- Preparation: load models into objects -----
//initialize the sentence detector
opennlp.tools.sentdetect.SentenceDetectorME sentenceParser = prepareSentenceDetector();

//initialize person names model
opennlp.tools.namefind.NameFinderME nameFinder = prepareNameFinder();

//initialize the tokenizer–used to break our sentences into words (tokens)
opennlp.tools.tokenize.TokenizerME tokenizer = prepareTokenizer();

//----- Make sentences, then tokens, then get names -----

String[] sentences = sentenceParser.sentDetect(inputData); //detect the sentences and load into sentence array of strings
List<string> results = new List<string>();

foreach (string sentence in sentences)
{
//now tokenize the input.
//"Don Krapohl enjoys warm sunny weather" would tokenize as
//"Don", "Krapohl", "enjoys", "warm", "sunny", "weather"
string[] tokens = tokenizer.tokenize(sentence);

//do the find
opennlp.tools.util.Span[] foundNames = nameFinder.find(tokens);

//important: clear adaptive data in the feature generators or the detection rate will decrease over time.
nameFinder.clearAdaptiveData();

results.AddRange( opennlp.tools.util.Span.spansToStrings(foundNames, tokens).AsEnumerable());
}

return results;
}

#region private methods
private opennlp.tools.tokenize.TokenizerME prepareTokenizer()
{
java.io.FileInputStream tokenInputStream = new java.io.FileInputStream(tokenModelPath); //load the token model into a stream
opennlp.tools.tokenize.TokenizerModel tokenModel = new opennlp.tools.tokenize.TokenizerModel(tokenInputStream); //load the token model
return new opennlp.tools.tokenize.TokenizerME(tokenModel); //create the tokenizer
}
private opennlp.tools.sentdetect.SentenceDetectorME prepareSentenceDetector()
{
java.io.FileInputStream sentModelStream = new java.io.FileInputStream(sentenceModelPath); //load the sentence model into a stream
opennlp.tools.sentdetect.SentenceModel sentModel = new opennlp.tools.sentdetect.SentenceModel(sentModelStream);// load the model
return new opennlp.tools.sentdetect.SentenceDetectorME(sentModel); //create sentence detector
}
private opennlp.tools.namefind.NameFinderME prepareNameFinder()
{
java.io.FileInputStream modelInputStream = new java.io.FileInputStream(nameFinderModelPath); //load the name model into a stream
opennlp.tools.namefind.TokenNameFinderModel model = new opennlp.tools.namefind.TokenNameFinderModel(modelInputStream); //load the model
return new opennlp.tools.namefind.NameFinderME(model); //create the namefinder
}
#endregion
}
}
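A minimal console harness for the class might look like the sketch below; the sample text is arbitrary, and the model paths in the class must already point at the downloaded .bin files.

using System;
using System.Collections.Generic;
using NaturalLanguageProcessingCSharp;

class Program
{
    static void Main()
    {
        // arbitrary sample text for demonstration
        string text = "Don Krapohl and Amir Soofi met in Jacksonville, FL on Thursday.";

        var extractor = new EntityExtractor();
        List<string> people = extractor.ExtractEntities(text, EntityExtractor.EntityType.Person);
        List<string> places = extractor.ExtractEntities(text, EntityExtractor.EntityType.Location);

        Console.WriteLine("People:    " + string.Join(", ", people));
        Console.WriteLine("Locations: " + string.Join(", ", places));
    }
}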

Hadoop on Azure and HDInsight integration
4/2/2013 – HDInsight doesn’t yet seem to support openNLP or any other natural language processing algorithm. It does integrate well with SQL Server Analysis Services and the rest of the Microsoft business intelligence stack, which provide excellent views within and across data islands. I hope to see NLP on HDInsight in the near future, with algorithms stronger than the LSA/LSI (latent semantic analysis/latent semantic indexing, i.e. semantic query) in SQL Server 2012.
