The US National Institute of Standards and Technology (NIST) kicked off their Big Data Working Group on June 19th 2013.  The sessions have now been broken down into subgroups for Definitions, Taxonomies, Reference Architecture, and Technology Roadmap.  The charter for the working group:

NIST is leading the development of a Big Data Technology Roadmap. This roadmap will define and prioritize requirements for interoperability, reusability, and extendibility for big data analytic techniques and technology infrastructure in order to support secure and effective adoption of Big Data. To help develop the ideas in the Big Data Technology Roadmap, NIST is creating the Public Working Group for Big Data.

Scope: The focus of the NBD-WG is to form a community of interest from industry, academia, and government, with the goal of developing consensus definitions, taxonomies, reference architectures, and a technology roadmap which would enable breakthrough discoveries and innovation by advancing measurement science, standards, and technology in ways that enhance economic security and improve quality of life.

Deliverables:

  • Develop Big Data Definitions
  • Develop Big Data Taxonomies
  • Develop Big Data Reference Architectures
  • Develop Big Data Technology Roadmap

Target Date: The goal for completion of INITIAL DRAFTs is Friday, September 27, 2013. Further milestones will be developed once the WG has initiated its regular meetings.

Participants: The NBD-WG is open to everyone. We hope to bring together stakeholder communities across industry, academic, and government sectors representing all of those with interests in Big Data techniques, technologies, and applications. The group needs your input to meet its goals so please join us for the kick-off meeting and contribute your ideas and insights.

Meetings: The NBD-WG will hold weekly meetings on Wednesdays from 1300 – 1500 EDT (unless announced otherwise) by teleconference. Please click here for the virtual meeting information.

Questions: General questions to the NBD-WG can be addressed to BigDataInfo@nist.gov

 

To participate in helping the US Government in their efforts, sign up at http://bigdatawg.nist.gov/home.php

View agency activity clustering on geography in Excel using Excel Data Mining Add-ins

By Don Krapohl

1.       Ensure you have downloaded the Excel Data Mining Add-ins from Microsoft at http://www.microsoft.com/en-us/download/details.aspx?id=35578 .  This article assumes you have a working version of the Data Mining Add-ins and a default Analysis Services (SSAS) instance defined.  If you are not familiar with this process, search for “getting started with SQL Server Data Mining Add-ins for Excel”.

2.       Open the Excel sample file for Federal contract acquisitions in Wyoming (2012) from https://augmentedintel.com/content/datasets/government_contracts_data_mining_addins.xlsx

3.       On the wy_data_feed tab, select all the data.

4.       In the Home tab on the ribbon in the Styles section select “Format as Table”.  Pick any format you wish.

5.       A new tab will appear on the ribbon for Table Tools with menus for Analyze and Design as below.

Figure: Table Tools menu in Microsoft Excel, used to format data as a table

 

6.       On the Analyze menu, select “Detect Categories”.  This will group (cluster) your information on common attributes, particularly commonalities that are not obvious or immediately observable.

7.       Deselect all checkboxes except the following:

a.       Dollars Obligated

b.      Award Type

c.       Contract Pricing

d.      Funding Agency

e.      Product Or Service Code

f.        Category

8.       Click ‘run’

9.       The output will show you categories of information showing strong affinities.  Explore the model by filtering the charts and tables by the category/ies generated.  Do this by selecting the filter icon (funnel) next to Category on the table or the Category label at the lower left of the graph.

10.   The groups with fewer rows may reveal particularly interesting correlations for a targeted campaign.  For example, filter the table and chart on Category 6.  This group indicates a group affinity for the attribute values ProductOrServiceCode = “REFRIGERATION AND AIR CONDITIONING COMPONENTS”, fundingAgency = “Veterans Affairs, Department Of”, and a contract award value of $61,148 to $1,173,695 as shown below:

 

Figure: Importance of data categories in Excel Data Mining Add-ins (factor analysis in Microsoft Excel)

For my organization’s business development activities, if I am in the heating and air business I may elect to focus efforts on medium-sized contracts with Veterans Affairs.
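The exploration in steps 9 and 10 can also be sketched outside Excel. The following is a minimal Python sketch, assuming the table (including the generated Category column) has been exported as a list of dictionaries. The function name `profile_categories`, the column keys, and the sample rows are illustrative assumptions, not part of the sample workbook or the add-in. It does not perform the clustering itself; it only summarises each category that the add-in produced.

```python
from collections import Counter, defaultdict


def profile_categories(rows, category_key="Category", amount_key="DollarsObligated"):
    """Summarise each category: row count, dollar range, and the most
    common value (with its share of rows) for every other attribute."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[category_key]].append(row)

    profiles = {}
    for category, members in grouped.items():
        amounts = [r[amount_key] for r in members]
        top_values = {}
        for attr in members[0]:
            if attr in (category_key, amount_key):
                continue
            value, count = Counter(r[attr] for r in members).most_common(1)[0]
            top_values[attr] = (value, count / len(members))
        profiles[category] = {
            "rows": len(members),
            "dollar_range": (min(amounts), max(amounts)),
            "top_values": top_values,
        }
    return profiles


# Hypothetical rows for illustration only; not taken from the workbook.
rows = [
    {"Category": "Category 6", "FundingAgency": "Veterans Affairs, Department Of",
     "ProductOrServiceCode": "REFRIGERATION AND AIR CONDITIONING COMPONENTS",
     "DollarsObligated": 61148.0},
    {"Category": "Category 6", "FundingAgency": "Veterans Affairs, Department Of",
     "ProductOrServiceCode": "REFRIGERATION AND AIR CONDITIONING COMPONENTS",
     "DollarsObligated": 1173695.0},
    {"Category": "Category 1", "FundingAgency": "EXAMPLE AGENCY",
     "ProductOrServiceCode": "EXAMPLE PRODUCT", "DollarsObligated": 2500.0},
]

profiles = profile_categories(rows)
print(profiles["Category 6"])
```

Small categories whose top values each cover most of their rows are the ones worth a closer look for a targeted campaign.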


Which contract parameters should I choose?

What combination of features might I pursue to raise my probability of contract award?

  1. Open WEKA explorer
  2. On the Preprocess tab, open the government_contracts.arff file.
  3. Perform pre-processing
    1. Escape non-enclosure single- and double-quotes (\’, \”) if using a delimited text version.
    2. Check ‘UniqueTransactionID’ and click ‘Remove’.  Stating the obvious, a continuous, random transaction ID has no predictive value, and discretizing or locally smoothing it can lead to overfitting.
    3. If you have saved the ARFF back into a CSV, you will have to convert the ZIP code fields RecipientZipCode and PlaceOfPerformanceZipCode back to nominal with the unsupervised attribute filter StringToNominal, and DollarsObligated back to numeric.
  4. On the Associate tab, select the Apriori algorithm and click ‘Start’.  The results:

 

Figure: WEKA association rules for contract feature prediction (predicting award parameters)

 

This indicates that if you pursue Firm Fixed Price contracts for the VA, are located in ZIP 83110, and the work will be performed within ZIP 83110, you may have an advantage in the acquisition.
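For readers who want to see what the Apriori step is doing, here is a minimal Python sketch of the general technique: find itemsets that occur in at least a minimum fraction of rows, then derive rules whose confidence meets a threshold. It is a simplification and not WEKA's implementation; for instance it uses one fixed minimum support. The function names `apriori` and `rules`, the `attribute=value` item encoding, and the sample transactions are illustrative assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset whose support
    (fraction of transactions containing it) is >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


def rules(frequent, min_confidence):
    """Return (antecedent, consequent_item, confidence) tuples for rules
    with a single-item consequent and confidence >= min_confidence."""
    found = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            antecedent = itemset - {item}
            confidence = sup / frequent[antecedent]
            if confidence >= min_confidence:
                found.append((antecedent, item, confidence))
    return found


# Hypothetical transactions for illustration only.
local_va = {"ContractPricing=Firm Fixed Price", "FundingAgency=VA",
            "RecipientZipCode=83110", "PlaceOfPerformanceZipCode=83110"}
other = {"ContractPricing=EXAMPLE", "FundingAgency=EXAMPLE",
         "RecipientZipCode=00000", "PlaceOfPerformanceZipCode=00000"}
transactions = [local_va, local_va, local_va, other]

frequent = apriori(transactions, min_support=0.5)
found = rules(frequent, min_confidence=0.9)
for antecedent, consequent, confidence in found[:3]:
    print(sorted(antecedent), "=>", consequent, round(confidence, 2))
```

A rule such as "Firm Fixed Price and VA implies recipient ZIP 83110" with high confidence is the kind of pattern the WEKA output above is reporting.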


Can I predict which contracts will likely be awarded in my area?

By Don Krapohl

  1. Open WEKA explorer
  2. On the Preprocess tab, open the government_contracts.arff file.
  3. Perform pre-processing
    1. Escape non-enclosure single- and double-quotes (\’, \”) if using a delimited text version.
    2. Check ‘UniqueTransactionID’ and click ‘Remove’.  Stating the obvious, a continuous, random transaction ID has no predictive value, and discretizing or locally smoothing it can lead to overfitting.
    3. If you have saved the ARFF back into a CSV, you will have to convert the ZIP code fields RecipientZipCode and PlaceOfPerformanceZipCode back to nominal with the unsupervised attribute filter StringToNominal, and DollarsObligated back to numeric.
  4. On the ‘Select attributes’ tab, explore attribute merit using the ClassifierSubsetEval evaluator with the Naïve Bayes algorithm and a RandomSearch search method, predicting the Product or Service Code (PSC).  This yields:

Selected attributes: 2,3,4,6 : 4
    ContractPricing
    FundingAgency
    PlaceofPerformanceZipCode
    RecipientZipCode

 

This indicates that, if you know these four contract attributes, the Naïve Bayes algorithm predicts the Product or Service Code with roughly 40% predictive ability (0.407 subset merit).
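To make the Naïve Bayes part concrete, here is a minimal Python sketch of a categorical Naïve Bayes classifier with Laplace smoothing, trained on the four selected attributes. It illustrates the general algorithm only; it is not WEKA's NaiveBayes class and does not reproduce the ClassifierSubsetEval search. The class name `CategoricalNaiveBayes` and the training rows are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Naive Bayes over nominal attributes with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.total = len(labels)
        # (class, attribute index) -> counts of each attribute value
        self.value_counts = defaultdict(Counter)
        # attribute index -> set of distinct values seen in training
        self.values = defaultdict(set)
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[(label, i)][value] += 1
                self.values[i].add(value)
        return self

    def log_scores(self, row):
        """Return {class: log prior + sum of log conditional likelihoods}."""
        scores = {}
        for cls, n_cls in self.class_counts.items():
            score = math.log(n_cls / self.total)
            for i, value in enumerate(row):
                count = self.value_counts[(cls, i)][value]
                score += math.log((count + 1) / (n_cls + len(self.values[i])))
            scores[cls] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


# Hypothetical training data for illustration only. Column order:
# ContractPricing, FundingAgency, PlaceOfPerformanceZipCode, RecipientZipCode
rows = [
    ("Firm Fixed Price", "VA", "83110", "83110"),
    ("Firm Fixed Price", "VA", "83110", "83110"),
    ("EXAMPLE PRICING", "EXAMPLE AGENCY", "00000", "00000"),
    ("EXAMPLE PRICING", "EXAMPLE AGENCY", "00000", "00000"),
]
labels = [
    "REFRIGERATION AND AIR CONDITIONING COMPONENTS",
    "REFRIGERATION AND AIR CONDITIONING COMPONENTS",
    "EXAMPLE PSC",
    "EXAMPLE PSC",
]

model = CategoricalNaiveBayes().fit(rows, labels)
prediction = model.predict(("Firm Fixed Price", "VA", "83110", "83110"))
print(prediction)
```

Working in log space avoids numeric underflow when many attribute likelihoods are multiplied together.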

  5. Using those attributes to predict PSC, select the Classify tab, choose the bayes classifier NaiveBayes with 10-fold cross-validation, set PSC as the class to predict, and click ‘Start’.  The output reports TP rate, precision, recall, F-measure, and related accuracy measures by class.  An example of a single class result is:

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0         0.014     0           0        0           0.972      REFRIGERATION AND AIR CONDITIONING COMPONENTS
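The per-class columns in this output follow standard definitions, which the minimal Python sketch below computes from lists of actual and predicted labels. The function name `per_class_metrics` and the sample labels are illustrative assumptions; the sample is constructed only to show how a class can have a TP rate of 0 alongside a small FP rate.

```python
def per_class_metrics(actual, predicted, target):
    """Compute one-vs-rest metrics for a single class label."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == target and p == target)
    fp = sum(1 for a, p in pairs if a != target and p == target)
    fn = sum(1 for a, p in pairs if a == target and p != target)
    tn = sum(1 for a, p in pairs if a != target and p != target)

    recall = tp / (tp + fn) if tp + fn else 0.0      # same as TP rate
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"tp_rate": recall, "fp_rate": fp_rate, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Hypothetical labels: two HVAC contracts, both missed, and one false alarm.
actual = ["HVAC", "HVAC"] + ["OTHER"] * 8
predicted = ["OTHER", "OTHER", "HVAC"] + ["OTHER"] * 7

metrics = per_class_metrics(actual, predicted, "HVAC")
print(metrics)
```

A TP rate, precision, recall, and F-measure of 0 together mean that no instance of the class was labelled correctly at the classifier's default decision.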

  6. View the threshold curve for the prediction by right-clicking the result buffer entry at the left and hovering over ‘Visualize threshold curve’.  Select the “REFRIGERATION AND AIR CONDITIONING COMPONENTS” class, for example.  The curve is as follows:

 

Classifier accuracy

 

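This shows an area under the ROC curve of 0.972 for this class, meaning the classifier ranks instances of this class well across thresholds, even though the TP rate of 0 in the table above shows it did not label any of them correctly at the default decision.  The F-Measure visualization gives a further view: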

 

Lift chart showing classifier coverage
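The ROC Area figure reported earlier can be read as the probability that a randomly chosen instance of the class receives a higher score than a randomly chosen instance of any other class. The minimal Python sketch below computes it that way by comparing every positive/negative pair. The function name `roc_area` and the sample scores are illustrative assumptions, not output from the WEKA run.

```python
def roc_area(scores, is_positive):
    """Area under the ROC curve via pairwise comparison: the fraction of
    (positive, negative) pairs where the positive scores higher, counting
    ties as half."""
    positives = [s for s, y in zip(scores, is_positive) if y]
    negatives = [s for s, y in zip(scores, is_positive) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical class-probability scores for illustration only.
scores = [0.30, 0.20, 0.40, 0.10, 0.05, 0.02]
is_positive = [True, True, False, False, False, False]

area = roc_area(scores, is_positive)
print(area)
```

Because the measure depends only on how instances are ordered, a class can have a high ROC area even when none of its instances are predicted at the default decision.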

To see an analogous cluster visualization using Excel and the SQL Server 2008 R2 addins, see my quick article on Activity Clustering on Geography.
