Can I predict which contracts will likely be awarded in my area?
By Don Krapohl
- Open the WEKA Explorer.
- On the Preprocess tab, open the government_contracts.arff file.
- Perform pre-processing
- Escape non-enclosure single- and double-quotes (’, ”) if using a delimited text version.
- Check ‘UniqueTransactionID’ and click ‘Remove’. A transaction ID is an arbitrary identifier with no predictive value, and discretizing or locally smoothing it can lead to overfitting.
- If you have saved the arff back into a csv, you will have to convert the ZIP code fields RecipientZipCode and PlaceOfPerformanceZipCode back to nominal with the unsupervised attribute filter StringToNominal, and convert DollarsObligated back to numeric.
- On the ‘Select Attributes’ tab, explore which attributes carry predictive merit: use the ClassifierSubsetEval evaluator with the Naïve Bayes algorithm and a RandomSearch search, with the Product or Service Code (PSC) as the class to predict. This yields:
Selected attributes: 2,3,4,6 : 4
ContractPricing
FundingAgency
PlaceofPerformanceZipCode
RecipientZipCode
This indicates that the best attribute subset found for predicting a Product or Service Code with the Naïve Bayes algorithm has a subset merit of 0.407, or roughly a 40% predictive ability if you know these four contract attributes.
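For readers who prepare the delimited text version outside WEKA, the pre-processing steps listed earlier can be sketched in Python. This is a minimal illustration, not the author's procedure: the sample row, the `Description` column, and the helper names `escape_quotes` and `preprocess` are hypothetical, and backslash escaping is an assumption about what your loader expects.

```python
import csv
import io


def escape_quotes(value):
    """Backslash-escape quotes inside a field value (assumed escaping style)."""
    return value.replace("\\", "\\\\").replace("'", "\\'").replace('"', '\\"')


def preprocess(rows):
    """Drop the transaction ID and coerce column types on an iterable of dict rows."""
    cleaned = []
    for row in rows:
        row = dict(row)
        # The ID carries no predictive value, so remove it.
        row.pop("UniqueTransactionID", None)
        # Keep ZIP codes as strings so they act as categories (nominal values)
        # and leading zeros are preserved.
        for col in ("RecipientZipCode", "PlaceOfPerformanceZipCode"):
            row[col] = str(row[col]).strip()
        # Dollar amounts are numeric.
        row["DollarsObligated"] = float(row["DollarsObligated"])
        cleaned.append(row)
    return cleaned


# Hypothetical one-row sample standing in for a CSV export of the contracts file.
sample = io.StringIO(
    "UniqueTransactionID,RecipientZipCode,PlaceOfPerformanceZipCode,"
    "DollarsObligated,Description\n"
    'abc123,02139,20301,1500.50,"O\'Brien ""HVAC"" parts"\n'
)
rows = preprocess(csv.DictReader(sample))
escaped_description = escape_quotes(rows[0]["Description"])
```

Reading ZIP codes as strings rather than numbers is the design choice that matters here: a ZIP code is a label, not a quantity, which is why the article converts it back to nominal.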
- To predict PSC from those attributes, select the Classify tab, choose the bayes classifier -> Naïve Bayes, select 10-fold cross validation, set PSC as the class to predict and click ‘Start’. The output reports the F-measure and other accuracy measures by class. An example of a single class result is:
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0        0.014    0          0       0          0.972     REFRIGERATION AND AIR CONDITIONING COMPONENTS
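To make the columns in that row concrete, here is a minimal Python sketch of how the per-class measures follow from confusion-matrix counts. The counts are hypothetical, chosen only so the rates match the row above; the article does not report the actual counts. Returning 0 when a denominator is zero is a choice made for this sketch.

```python
def class_metrics(tp, fp, fn, tn):
    """Compute per-class rates from confusion-matrix counts for one class."""
    tp_rate = tp / (tp + fn) if (tp + fn) else 0.0    # same quantity as recall
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + tp_rate
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * tp_rate / denom if denom else 0.0
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "recall": tp_rate,
        "f_measure": f_measure,
    }


# Hypothetical counts: no contract of the class is labeled correctly (tp=0),
# while 14 of 1000 other contracts are wrongly assigned to it.
metrics = class_metrics(tp=0, fp=14, fn=5, tn=986)
print(metrics)
```

With zero true positives, precision, recall and F-measure are all 0 regardless of how many true negatives there are, which is the pattern in the example row.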
- View the threshold curve for the prediction: right-click the result buffer entry at the left, hover over Threshold Curve and select a class, for example “REFRIGERATION AND AIR CONDITIONING COMPONENTS”. The curve is as follows:
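This shows an area under the ROC curve of about 0.97 for this class, meaning the model ranks contracts of this class well above the others. That is not the same as 97% accuracy: the TP rate of 0 in the row above shows that no contracts of this class were labeled correctly in the cross-validation run. The F-Measure visualization gives a further view of this class: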
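The difference between ROC area and the rates at a fixed decision point can be illustrated with a small Python sketch. The scores and labels are invented for illustration, and the 0.5 cutoff is an arbitrary assumption; none of this is output from the contracts data.

```python
def roc_auc(scores, labels):
    """ROC area as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def tp_rate_at(scores, labels, cutoff):
    """Fraction of positives whose score reaches the cutoff."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    return sum(1 for s in pos if s >= cutoff) / len(pos)


# Hypothetical predicted probabilities for one class (1 = contract is in the class).
scores = [0.30, 0.25, 0.20, 0.22, 0.10, 0.05, 0.02, 0.01]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

auc = roc_auc(scores, labels)                       # 14 of 15 pairs ranked correctly
tp_rate = tp_rate_at(scores, labels, cutoff=0.5)    # no positive reaches 0.5
print(round(auc, 3), tp_rate)
```

In this toy data the positives score higher than almost all negatives, so the ROC area is high, yet every score is low in absolute terms, so no positive is ever predicted at the cutoff. A class with a high ROC area and a TP rate of 0 can arise the same way.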
To see an analogous cluster visualization using Excel and the SQL Server 2008 R2 add-ins, see my quick article on Activity Clustering on Geography.