I’ve started a meetup for local professionals in the decision science field around the Tampa Bay area to come together and learn about what’s happening in our area. If you are a data science professional, come join us and be a part of making the Tampa-St. Petersburg metro area the southeast center of excellence in big data and analytics. Visit http://www.meetup.com/Analytics-Professionals-of-Tampa/ to find events and to join. I hope to see you there.
View agency activity clustering on geography in Excel using Excel Data Mining Add-ins
By Don Krapohl
1. Ensure you have downloaded the Excel Data Mining Add-ins from Microsoft at http://www.microsoft.com/en-us/download/details.aspx?id=35578 . The article assumes you have a working version of the DM Addins and a default Analysis Services (SSAS) instance defined. Search for getting started with SQL Server Data Mining Add-ins for Excel if you are not familiar with this process.
4. In the Home tab on the ribbon in the Styles section select “Format as Table”. Pick any format you wish.
5. A new tab will appear on the ribbon for Table Tools with menus for Analyze and Design as below.
6. On the Analyze menu, select “Detect Categories”. This is will group (cluster) your information on common attributes, particular commonalities that are not obvious or immediately observable.
7. Deselect all checkboxes except the following:
a. Dollars Obligated
b. Award Type
c. Contract Pricing
d. Funding Agency
e. Product Or Service Code
8. Click ‘run’
9. The output will show you categories of information showing strong affinities. Explore the model by filtering the charts and tables by the category/ies generated. Do this by selecting the filter icon (funnel) next to Category on the table or the Category label at the lower left of the graph.
10. Interesting information may be derived from the groups with fewer rows that may show particularly interesting correlations for a targeted campaign. For example, filter the table and chart on Category 6. This group indicates a group affinity for the attribute values ProductOrServiceCode = “REFRIGERATION AND AIR CONDITIONING COMPONENTS”, fundingAgency = “Veterans Affairs, Department Of”, and a contract award value of $61,148 to $1,173,695 as shown below:
For my organization’s business development activities, if I am in the heating and air business I may elect to focus efforts on medium-sized contracts with Veterans Affairs.
Can I predict which contracts will likely be awarded in my area?
By Don Krapohl
Open WEKA explorer
On pre-process tab find the government_contracts.arff file.
Escape non-enclosure single- and double-quotes (’, ”) if using a delimited text version.
Check ‘UniqueTransactionID’ and click ‘Remove’. Stating the obvious, there is no value in analysis of a continuous random transaction ID, discretization and local smoothing can lead to overfitting, and it has no predictive value.
If you have saved the arff back into a csv you will have to filter the ZIP code fields RecipientZipCode and PlaceOfPerformanceZipCode back to nominal with the unsupervised attribute filter StringToNominal and DollarsObligated to numeric.
Using the attribute evaluator to explore algorithm merit on the ‘Select Attributes’ tab, use the ClassifierSubsetEval evaluator with the Naïve Bayes algorithm and a RandomSearch search predicting the Product or Service Code (PSC). This yields:
Selected attributes: 2,3,4,6 : 4
This indicates the best prediction of a Product or Service Code using the Naïve Bayes algorithm is a 40% (0.407 subset merit) predictive ability if you know these contract attributes.
Using those attributes to predict PSC, select the Classify tab, bayes classifier -> Naïve Bayes, 10-fold cross validation, predict PSC and click ‘Start’. The output will indicate F-measure and other attribute significance by class. An example of a single class result is:
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0 0.014 0 0 0 0.972 REFRIGERATION AND AIR CONDITIONING COMPONENTS
View the threshold for the prediction by right-clicking the result buffer entry at the left, hover over Threshold Curve. Select the “REFRIGERATION AND AIR CONDITIONING COMPONENTS” for example. The curve is as follows:
This shows a 97% predictive accuracy on this class. The F-Measure visualization further supports this: