The elements of big data analytics have roots in statistics, knowledge management, and computer science. Many of the data mining terms below appear in these disciplines but may have a different connotation or specialized meaning when applied to our problems. The problems of massively parallel processing and the specialized algorithms used to perform analysis in a distributed computing environment are enough to require specialized treatment.

Machine Learning Terms

Term Definition
Accuracy A measure of a predictive model that reflects the proportionate number of times that the model is correct when applied to data
Bias The difference between the expected value of an estimate and the actual value being estimated
Cardinality The number of different values a categorical predictor or OLAP dimension can have. High cardinality predictors and dimensions have large numbers of different values (e.g. zip codes), low cardinality fields have few different values (e.g. eye color).
CART Classification and Regression Trees. A type of decision tree algorithm that automates the pruning process through cross validation and other techniques.
CHAID Chi-Square Automatic Interaction Detector. A decision tree that uses contingency tables and the chi-square test to create the tree.
Classification The process of learning to distinguish and discriminate between different input patterns using a supervised training algorithm, that is, determining which group a record belongs to.
Cluster Centroid The most typical case in a cluster. The centroid is a prototype; it does not necessarily describe any given case assigned to the cluster.
Clustering The technique of grouping records together based on their locality and connectivity within the n-dimensional space. This is an unsupervised learning technique.
Collinearity The property of two predictors showing significant correlation without a causal relationship between them
Concentration of Measure The principle that any set of positive probability can be expanded very slightly to contain most of the probability. For example, the average of bounded independent random variables is tightly concentrated around its expectation.
Conditional Probability The probability of an event happening given that some event has already occurred. For example the chance of a person committing fraud is much greater given that the person had previously committed fraud
Confidence The likelihood of the predicted outcome, given that the rule has been satisfied.
Convergence of Random Variables The idea that a sequence of essentially random or unpredictable events can sometimes be expected to settle down into a behaviour that is essentially unchanging when items far enough into the sequence are studied
Correlation A number that describes the degree of relationship between two variables
Coverage A number that represents either the number of times that a rule can be applied or the percentage of times that it can be applied
Cross-validation The process of holding aside some of the training data, which is not used to build the predictive model, and later using that data to estimate the accuracy of the model on unseen data, simulating real-world deployment of the model.
Data Mining Process Define the problem. Select the data. Prepare the data. Mine the data. Deploy the model. Take business action.
Discrete Fourier Transform Concentrates energy in first few coefficients
Entropy A measure often used in data mining algorithms that measures the disorder of a set of data
Error Rate A number that reflects the rate of errors made by a predictive model. It is one minus the accuracy
Expectation-Maximization (EM) An algorithm for estimating parameters when the data are incomplete, with significant missing or inferred values. It iteratively uses estimates for the missing data to re-estimate the parameters and continues until convergence.
Expert System A data processing system comprising a knowledge base (rules), an inference (rules) engine, and a working memory
Exploratory Data Analysis The processes and techniques for general exploration of data for patterns in preparation for more directed analysis of the data
Factor Analysis A statistical technique which seeks to reduce the number of total predictors from a large number to only a few “factors” that have the majority of the impact on the predicted outcome.
Fuzzy Logic A system of logic based on the fuzzy set theory
Fuzzy Set A set of items whose degree of membership in the set may range from 0 to 1
Fuzzy System A set of rules using fuzzy linguistic variables described by fuzzy sets and processed using fuzzy logic operations
Genetic Algorithm An optimization technique that uses processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution
Genetic Operator An operation on the population member strings in a genetic algorithm which is used to produce new strings
Gini Index A measure of the disorder reduction caused by the splitting of data in a decision tree algorithm. Gini and the entropy metric are the most popular ways of selecting predictors in the CART decision tree algorithm
Hebbian Learning One of the simplest and oldest forms of training a neural network. It is loosely based on observations of the human brain. The neural net link weights are strengthened between any nodes that are active at the same time.
Hill Climbing A simple optimization technique that modifies a proposed solution by a small amount and then accepts it if it is better than the previous solution. The technique can be slow and suffers from being caught in local optima
Hypothesis Testing The statistical process of proposing a hypothesis to explain the existing data and then testing to see the likelihood of that hypothesis being the explanation
ID3 Iterative Dichotomiser 3. A decision tree algorithm that chooses splits by information gain (reduction in entropy).
Intelligent Agent A software application which assists a system or a user by automating a task. Intelligent agents must recognize events and use domain knowledge to take appropriate actions based on those events.
Itemset An itemset is any combination of two or more items in a transaction
Jackknife Estimate An estimate of a parameter obtained by omitting one value at a time from the set of observed values. It allows you to examine the impact of outliers.
Kernel A function that transforms the input data to a high-dimensional space where the problem is solved
k-Nearest Neighbor A data mining technique that performs prediction by finding the prediction value of records (near neighbors) similar to the record to be predicted
Kohonen Network A type of neural network in which the nodes learn as local neighborhoods, so the locality of the nodes is important in the training process. They are often used for clustering
Latent Variable A variable inferred from a model rather than observed
Lift A number representing the increase in responses from a targeted marketing application using a predictive model over the response rate achieved when no model is used
Machine Learning A field of science and technology concerned with building machines that learn. In general it differs from Artificial Intelligence in that learning is considered to be just one of a number of ways of creating an artificial intelligence
Maximum Likelihood A method for estimating the parameters of a model
Maximum Likelihood Estimate (MLE) The parameter estimates that maximize the probability that the sample data occurs for the specific model. The joint probability of observing the sample data is obtained by multiplying the individual probabilities.
Mean Absolute Error AVG(ABS(predicted_value – actual_value))
Mean Squared Error (MSE) The expected value of the squared difference between the estimate and the actual value
Memory-Based Reasoning (MBR) A technique for classifying records in a database by comparing them with similar records that are already classified. A form of nearest neighbor classification.
Minimum Description Length (MDL) Principle The idea that the least complex predictive model (with acceptable accuracy) will be the one that best reflects the true underlying model and performs most accurately on new data.
Model A description that adequately explains and predicts relevant data but is generally much smaller than the data itself
Neural Network A computing model based on the architecture of the brain. A neural network consists of multiple simple processing units connected by adaptive weights
Nominal Categorical Predictor A predictor that is categorical (finite cardinality) but where the values of the predictor have no particular order. For example, red, green, blue as values for the predictor “eye color”.
Ordinal Categorical Predictor A categorical predictor (i.e. has finite number of values) where the values have order but do not convey meaningful intervals or distances between them. For example the values high, middle and low for the income predictor
Outlier Analysis A type of data analysis that seeks to determine and report on records in the database that are significantly different from expectations. The technique is used for data cleansing, spotting emerging trends and recognizing unusually good or bad performers
Overfitting The effect in data analysis, data mining, and biological learning of training too closely on limited available data and building models that do not generalize well to new, unseen data. At the limit, overfitting is synonymous with rote memorization, where no generalized model of future situations is built
Point Estimation The estimation of a population parameter by a single value. The estimate may be made by calculating the parameter for a sample, and it may be used to predict a value for missing data.
Predictive Model A model created or used to perform prediction, in contrast to models created solely for pattern detection, exploration, or general organization of the data
Predictor The column or field in a database that could be used to build a predictive model to predict the values in another field or column. Also called variable, independent variable, dimension, or feature.
Principal Component Analysis A data analysis technique that transforms a set of possibly correlated predictors into a smaller set of uncorrelated components that capture most of the variance in the data. It is often used to reduce the number of predictors before building a predictive model.
Prior Probability The probability of an event occurring without dependence on (conditional to) some other event. In contrast to conditional probability
Purity/Homogeneity The degree to which the resulting child nodes are made up of cases with the same target value
Radial Basis Function Networks Neural networks that combine some of the advantages of neural networks with those of nearest neighbor techniques. In radial basis functions the hidden layer is made up of nodes that represent prototypes or clusters of records
Receiver Operating Characteristic (ROC) A curve that plots the true positive rate against the false positive rate as the classification threshold varies. The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with unbalanced target distribution (one target class dominates the other).
Regression A data analysis technique classically used in statistics for building predictive models for continuous prediction fields. The technique automatically determines a mathematical equation that minimizes some measure of the error between the prediction from the regression model and the actual data
Reinforcement Learning A training model where an intelligence engine (e.g. neural network) is presented with a sequence of input data followed by a reinforcement signal
Root Mean Squared Error SQRT(AVG((predicted_value – actual_value) * (predicted_value – actual_value)))
Sampling The process by which only a fraction of all available data is used to build a model or perform exploratory analysis. Sampling can provide relatively good models at much less computational expense than using the entire database
Segmentation The process or result of the process that creates mutually exclusive collections of records that share similar attributes either in unsupervised learning (such as clustering) or in supervised learning for a particular prediction field
Sensitivity Analysis The process which determines the sensitivity of a predictive model to small fluctuations in predictor value. Through this technique end users can gauge the effects of noise and environmental change on the accuracy of the model
Simulated Annealing An optimization algorithm loosely based on the physical process of annealing metals through controlled heating and cooling
Sparsity The condition in which a high proportion of the nested rows are not populated.
Statistical Independence The property of two events whose occurrence gives no information about each other. This can be quantitatively defined as occurring when the product of the probabilities of each event is equal to the probability of both events occurring
Stepwise Regression An automated sequence of regressions that identifies the most predictive variables. The first regression finds the most predictive variable; the second finds the most predictive variable given the first, and so on.
Supervised Algorithm A class of data mining and machine learning applications and techniques where the system builds a model based on the prediction of a well defined prediction field. This is in contrast to unsupervised learning where there is no particular goal aside from pattern detection.
Support The relative frequency or number of times a rule produced by a rule induction system occurs within the database. The higher the support the better the chance of the rule capturing a statistically significant pattern.
Time-Series Prediction The process of using a data mining tool (e.g., neural networks) to learn to predict temporal sequences of patterns, so that, given a set of patterns, it can predict a future value
Unsupervised Algorithm A data analysis technique whereby a model is built without a well defined goal or prediction field. The systems are used for exploration and general data organization. Clustering is an example of an unsupervised learning system
Visualization Graphical display of data and models which helps the user in understanding the structure and meaning of the information contained in them
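
Several of the measures defined in the table above are simple enough to compute directly. Below is a minimal Python sketch, assuming predictions and actual values arrive as equal-length lists. The function names are illustrative and are not taken from any particular library. Note that `gini_impurity` here is the impurity of a single node; a decision tree would compare impurity before and after a split to get the reduction described in the Gini Index entry.

```python
import math
from collections import Counter


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def error_rate(predicted, actual):
    """One minus the accuracy, as in the Error Rate entry."""
    return 1.0 - accuracy(predicted, actual)


def mean_absolute_error(predicted, actual):
    """AVG(ABS(predicted_value - actual_value))."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def root_mean_squared_error(predicted, actual):
    """SQRT(AVG((predicted_value - actual_value) squared))."""
    mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    return math.sqrt(mse)


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini_impurity(labels):
    """Gini impurity: one minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())
```

For example, `accuracy([1, 0, 1, 1], [1, 0, 0, 1])` gives 0.75, so the corresponding error rate is 0.25.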

 


Below are many downloadable free machine learning datasets. They cover click data, air traffic control data, surveys, temporal datasets of various types, crime data, employee pay data, map data, law data, and many other types.

 

I am a huge fan of SPSS, scikit-learn, OpenNLP, and other mainstream libraries, but for quick analysis and visualization don’t forget about Pentaho data mining (https://wiki.pentaho.com/display/DATAMINING/Pentaho+Data+Mining+Community+Documentation), based on the University of Waikato’s Weka (https://www.cs.waikato.ac.nz/ml/index.html). It can also be used with Pentaho Kettle to submit jobs to a Hadoop cluster and perform advanced multi-step analysis.

Public data resources: research-quality, free data mining data sets

All datasets with keywords


Announcing the Article Search API – Open Blog – NYTimes.com

Text mining` article` api` text` corpus` newspaper

Information Extraction: The RISE Repository of Information Sources

Text mining` information` text mining` extraction` reviews` jobs

Repositories

Text mining` links` text mining` books` rdf` ocr` documents

API Documentation — BackType

Text mining` api` blog` comments` text mining` stream` trends` backtype` queryminer

Free book usage data from the University of Huddersfield » “Self-plagiarism is style”

Text mining` books` library` borrowing` recommender` isbn` recommendation` collaborative` filtering

ICWSM 2009 – International AAAI Conference on Weblogs and Social Media

Text mining` blog` crawl` corpus` network` web` link

Change.gov: The Obama-Biden Transition Team | Join the Discussion: Healthcare

Text mining` textmining` opinion` comment` topic` government` queryminer

Opinion Extraction, Opinion Mining, Sentiment Analysis, Summarization of Customer Reviews

Text mining` sentiment` mining` classification` machine learning` reviews` recommender` text mining` links

http://www.yr-bcn.es/semanticWikipedia

Text mining` wikipedia` named entity` tagged` text mining

Building a (fast) Wikipedia offline reader

Text mining` django` wikipedia` compressed` text mining` howto

Reddit’s Secret API

Text mining` reddit` api` json`

phishingcorpus [JoseWiki]

Text mining` phishing` corpus` text` email` text mining` nlp` mail` security

Wikipedia Datasets for the Hadoop Hack | Cloudera

Text mining` wikipedia` hadoop` textmining` links

Main Task QA Data

Text mining` question` answering` trec` nlp` machinelearning

The New York Times Annotated Corpus « YooName – named entity recognition

Text mining` named entity` nytimes` corpus` people` organizations` locations

ADL Gazetteer Development

Text mining` named entity` location` place names` geo` nlp` natural language processing

Beautiful Data – WikiContent

Text mining` book` data` wiki` via:jhammerb

Web FAQ collection | ILPS

Text mining` faq` question_answering` questions` web` crawl` corpus` xml` textmining

Wikipedia:Lists of common misspellings/For machines – Wikipedia, the free encyclopedia

Text mining` spelling` misspelling` wikipedia

build.kiva: Blog – Introducing the Kiva API

Business and Finance` finance` api` social` kiva` microlending` lending

Visualizing the Growth of Target, 1962-2008 | FlowingData

Business and Finance` visualization` retail` finance` gis` map` location` store` via:magnetbox

The Economy According To Mint

Business and Finance` finance` commercial` consumer` mint` spending

Best Buy Remix – Welcome to the Best Buy Remix Developer Network

Business and Finance` retail` data` api` product` bestbuy

Behavioral Targeting, Analytics and Advertising Service for Publishers, Ad Networks

Business and Finance` analytics` audience` segmentation` toolbar` commercial` sem` search` advertising

Executive PayWatch Database

Business and Finance` ceo` compensation` pay` economics` business` labor

TradingSolutions – Data Sources

Business and Finance` trading` finance` s` api` list

Netflix API – Welcome to the Netflix Developer Network

Business and Finance` netflix` api` movie` mashup` netflixprize` ratings

Open beats Closed: Best Buy’s new APIs – O’Reilly Radar

Business and Finance` retail` bestbuy` api

Tickermine

Business and Finance` custom` research` retail` finance` market` service` analyst`

University of Arkansas – Daily Headlines

Business and Finance` retail` dillards` uark

developerWorks Interviews: Massive data mining and the resurgent mainframe

Business and Finance` price` retail` transaction` sams_club` dillards

opentick :: market data

Business and Finance` opentick` nasdaq` finance` stock

U.S. Company Filings and Annual Reports

Business and Finance` finance` links` sec

FTP Information – EDGAR Database

Business and Finance` edgar` finance` sec` filing` ftp` instructions

Data Mining For Investing

Business and Finance` investing` finance` datamining` announcement` sec` filing` links

UN General Assembly Voting Data

Government` un` voting` statistics` government

Research Datasets :: CID Data :: Center for International Development at Harvard University (CID)

Government` economics` international` development

Subsidyscope.com

Government` government` banking` csv` tarp` bailout

Data Catalog

Government` dc` government` feeds` transparency` opendata

Announcing the New York Times Campaign Finance API – Open – Code – New York Times Blog

Government` nyt` api` campaign` donations` fec`

Voter registration data; or, HERE IS YOUR HOPE, YOU FOOLS! « The Edge of the American West

Government` voter` registration` politics`2008

import/parse/fec.py at master from aaronsw’s watchdog — GitHub

Government` fec` python` parser` government` campaign

The Watchdog Project: volunteer

Government` government` transparency` parsing` election` python

Dataset of the day: Where are the Obamacans? | Off the Map – Official Blog of FortiusOne

Government` obama` government` mashup` gis` geo` map` campaign` donations

Normalized Campaign Contribution Data

Government` cmu` politics` campaign` donations` fec` via:jhammerb` government

Crime data bonanza!!!

Government` timeseries` crime` statistics` publicdata

Ohio voter registration data

Government` voter` voting` politics` government` name` address` registration

Voter List Data Files – Election Department, Clark County, Nevada

Government` voting` voter` registration` name` address` data` election` politics

UNdata

Government` UN` publicdata` government` statistics

RealClearPolitics – Election 2008 – Democratic Presidential Nomination

Government` polls` politics

Crime in the United States 2006

Government` crime` fbi

Daily Kos: Obama helps us track $1,000,000,000,000 of federal spending

Government` corruption` government` politics` finance`

Welcome to USAspending.gov

Government` government` money` politics`

Campaign Finance Reports and Data

Government` campaign` politics` elections

ERS/USDA Data – International Macroeconomic Data Set

Government` usda` economics` population` cpi` gdp` income

State Agency Databases – GODORT

Government` government` directory` links` wiki` states

National Bureau of Economic Research: Data

Government` economics` links

Bureau of Labor Statistics Data

Government` economics` lumber` building` materials` homedepot

NBI ASCII Files – Bridge – FHWA

Government` government` bridges` safety

Twitter API Wiki / REST API Documentation: Social Graph Methods

Network Analysis` graph` network` api` social` twitter

Using the Wikipedia link dataset — Henry Haselgrove

Network Analysis` graph` network` link` wikipedia` pagerank

twibs : find the businesses on twitter

Network Analysis` directory` businesses` twitter` companies

Massive Scrape of Twitter’s Friend Graph « blog.infochimps.org – Organizing Huge Information Sources

Network Analysis` textmining` twitter` network` socialnetwork` pagerank` graph` queryminer

Twitter Scrape (rough draft) – get.theinfo | Google Groups

Network Analysis` twitter` socialnetwork` graph

wiki.dbpedia.org : Downloads 32

Network Analysis` wikipedia` named_entity` rdf` ontology

ICWSM 2009 – International AAAI Conference on Weblogs and Social Media

Network Analysis` blog` crawl` corpus` network` web` link

Linked Movie Data Base

Network Analysis` rdf` movies` movie` api

YouTube Dataset

Network Analysis` youtube` research` crawl` socialnetwork` network` graph` web

API Documentation – Twitter Development Talk | Google Groups

Network Analysis` twitter` text` api

CRAWDAD

Network Analysis` wireless` RF` radio` signal` dartmouth` network

Yahoo! Music API – YDN

Network Analysis` api` yahoo` music` artists

Lookery Developer Network – Lookery Developer Resources

Web Analytics` web` analytics` api` traffic` advertising` demographics` lookery

True Marble Imagery – Free Download

Spatial Analysis` gis` geo` map` mapping` images` satellite

Zillow – Labs – Neighborhood Boundaries

Spatial Analysis` neighborhoods` geo` gis` maps

Full Examples — PyMVPA Home

Image Analysis and Video Analysis` fmri` neuroscience` python` neuralnetwork

HumanScan : BioID : Downloads : BioID Face Database

Image Analysis and Video Analysis` face` detection` image

Face Detection

Image Analysis and Video Analysis` facerecognition` opencv` face` links

NORB Object Recognition Dataset, Fu Jie Huang, Yann LeCun, New York University

Image Analysis and Video Analysis` image` 3d

LIFE photo archive hosted by Google

Image Analysis and Video Analysis` images` photo` pictures` search

Activity Recognition: Datasets, Bibliography and others

Image Analysis and Video Analysis` activity` recognition` intent

Frontal Face Databases

Image Analysis and Video Analysis` facerecognition` face` image` recognition

Copyright Free and Public Domain Media

Image Analysis and Video Analysis` images` audio` publicdata` maps` video` free

Databases you can use for benchmarking

Image Analysis and Video Analysis` image` vision` recognition`

2007 IEEE AVSS Detection and Tracking Algorithm Datasets

Image Analysis and Video Analysis` tracking` video` detection` image` recognition` vehicle` pedestrian`

OTCBVS

Image Analysis and Video Analysis` image` recognition` detection` pedestrian` thermal` tracking` facerecognition` illumination

Carnegie Mellon University – CMU Graphics Lab – motion capture library

Image Analysis and Video Analysis` gait` pedestrian` walk` motion

public domain sounds | free sound library

Audio Analysis` sound` publicdomain` audio

Full Examples — PyMVPA Home

Bioinformatics` fmri` neuroscience` python` neuralnetwork

CinC Challenge 2000 data sets

Medical Informatics` timeseries` machinelearning` ecg` health` medical` sleep` apnea

UC Berkeley. Sheldon Margen Public Health Library. Statistical/Data Resources

Healthcare Analytics` health` links` resources` publichealth` berkeley

Google Flu Trends | How does this work?

Healthcare Analytics` google` health` trends` search` prediction` epidemiology` biodefence` queries

Eigenvector Research, Inc. : Data Sets Available to Download

Chemoinformatics` NIR` spectra` chemistry` semiconductor` pharmaceutical` matlab`

Vaccines: IIS/Tech/Deduplication Test Cases

Healthcare Analytics` duplicate

Health Data Tools and Statistics

Healthcare Analytics` health` information` public` publicdata

Cardiac MRI dataset – York University

Healthcare Analytics` mri` cardiac

NACDA: Search Holdings

Demography` aging` statistics` studies

Poverty Data Sets General Information

Demography` poverty` statistics

Pew Internet & American Life Project

Demography` internet` demographics` online` web

The 2000 U.S. Census: 1 Billion RDF Triples

Demography` gis` census` rdf` semantic` sparql

Download Database – baseball1.com

Sports Analysis` baseball` database` publicdata` statistics` sports

It’s a Pitch-by-Pitch Scouting Report, Minus the Scout – New York Times

Sports Analysis` baseball` gameday

BART – For Developers

Network Analysis` urban` transportation` feeds` public` sanfrancisco` bart` api`

Tim Davis: UF Sparse Matrix Collection : sparse matrices from a wide range of applications

Matrices` sparse` matrix

Idealware: Mapping Blues: Where is the Data?

Pre-processing` resources` links`mapping

Amazon Web Services Public Datasets » Data Wrangling Blog

Amazon Web Services` amazon` ebs` ec2` s3` publicdata` hadoop

Amazon Web Services (AWS) Hosted Public Data Sets

Hosted Datasets` amazon` ebs` publicdata

WSCD09: Workshop on Web Search Click Data 2009

Web Analytics` workshop` search` web` microsoft` log`

downloading – flossmole – Google Code – How to get FLOSSmole data for your own use

Google` opensource` project` activity` mysql` dump

Multi-Domain Sentiment Dataset

Supervised` sentiment` review` product` amazon

Chris Pound’s Name Generation Page

 bizarre` scifi` phrase` name` word` generators` random` perl

Big Huge Thesaurus API: Access 145,000 Words and Phrases

Phrases` webservice` api` thesaurus` textmining` nlp` rest`

Search Query Performance report – Google AdWords Help Center

Performance` adwords` ppc` search` metrics` webanalytics` sem` query` queryminer

Wordze Keyword Research Tool

Web Analytics` queryminer` keyword` tool` research` commercial` search` adwords

Searchable Catalogs of Data

Network Analysis` links` catalogs` social

radiohead – Google Code

Audio Analysis` lidar` visualization` radiohead` google` video

80 Million Tiny Images

Image Analysis and Video Analysis` images` words` english` search` visualization` imagemap

Time Series Center | Harvard University

Temporal Analysis` timeseries` anomaly` detection` astronomical` physics

OpenVisuals – Open Source Visualization Framework

Image Analysis and Video Analysis` visualization` community` design` processing

BGN: Domestic Names – State and Topical Gazetteer Download Files

Demography` gis` usgs

Datasets

Random` benchmark` clustering` regression` machinelearning` list` statistics` mathematics

Isomap Datasets

Image Analysis and Video Analysis` nonlinear` dimensionality` reduction` faces` digits` images` manifold

Yahoo! Search Blog: BOSS — The Next Step in our Open Search Ecosystem

oss` api` open` search` yahoo` BOSS` queryminer

Download the Database – IP Address Lookup – Community Geotarget IP Project

Network Analysis` geocoding` geoip` internet` ip` ipaddress` mysql

Airline Data Project

Government` airline` statistics` finance` revenue` location` travel

Show Us a Better Way: What public data is already available?

Government` statistics` census` uk` school` news` publicdata

NGA: Country Files

Government` country` cities` geo

OHPI – Traffic Volume Trends

Government` government` traffic` statistics` trends` transportation

Semantic Search the US Library of Congress

Government` via:inkdroid` libraries` mashup` rdf` semantic` search` semanticweb` books

reddit.com: Ask Reddit: Where to download a DB dump of Reddit?

Text mining` reddit` socialnetwork` news` web

Collaborative filtering dataset – dating agency

Text mining` collaborative` filtering` dating` rating` profiles` czech

About Us – Predictify

Business and Finance` predictionmarket` tool` finance` buzz` advertising` marketing` startup` mmds

VGChartz.com | Video Games, Charts, News, Forums, Reviews, Wii, PS3, Xbox360, DS, PSP

Business and Finance` sales` ranking` videogames` retail

Store Level Information

Business and Finance` retail` finance` sales` store`

Code for querying and downloading Flickr images

Image Analysis and Video Analysis` image` python` code` flickr` matlab` recognition

Image Parsing Datasets

Image Analysis and Video Analysis` image` recognition

TAGora » Data

Network Analysis` tag` tagging` s

TAGora » Data

Network Analysis` netflixprize` imdb` sparql

Quality of Life Grand Challenge Dataset: Kitchen Capture

Image Analysis and Video Analysis` machinelearning` motion` capture` sensor

Summize Twitter Search API

Text mining` api` buzz` opinion` trends` text` twitter` summize` search

2008 IEEE InfoVis Contest Dataset

Image Analysis and Video Analysis` visualization` contest` scalability` motion` tracking` pedestrian` sensor

IMDb Pro : Scary Movie 4: Box office

Business and Finance` movie` revenue` sales` box_office` imdb` commercial` movie_study

Spider-Man 2 (2004) – Daily Box Office Results

Business and Finance` movie` revenue` box_office`

Live Search : xRank™ Celebrity — check out who’s hot and who’s not!

Network Analysis` search` query` volume` trends` celebrity` prediction` buzz` named_entity

IMDbPro.com Free Trial Signup

Business and Finance` movie` revenue` timeseries` imdb` commercial` subscription

Free time-series and micro-data to download

Business and Finance` economics` links

PyGTrends: Python API for Google Trends Data

 google` trends` search` web` analytics` api` code` python` hack

Official Google Blog: A new flavor of Google Trends

 google` trends` search` query` api` csv` keyword` timeseries

Open Research – the Data: Lastfm-ArtistTags2007 – Duke Listens!

 last.fm` music` tagging` artists` tags` collaborative` filtering

i2b2: Informatics for Integrating Biology & the Bedside

 medical` obesity`

Tiger Data Set Lecture

 tiger` gis` lectures

Google To Launch Large Scale Geo-Services

 geo` google` gps` location` geolocation` cell` wifi` api` gis

Last.fm’s Playground

 celebrity` misspelling` spelling` names

ImportGenius.com : U.S. Customs Database and Competitive Intelligence Tools

 commercial` shipping` imports` exports` finance` datamining

Directory Listing of Betfair price files

 betting` prediction` betfair` price` csv` predictionmarket

Reuters Spotlight – Article and Media API

 news` text` articles` api` content` media` xml` images` publicdata

DataSets – Scikits – Trac

 scipy` python` machinelearning` statistics` resource

[Wikitech-l] page counters

 wikipedia` pageviews` trends` textmining` seo` topic

Wikipedia article traffic statistics

 via:chl` wikipedia` web` analytics` seo` topic` textmining` traffic

Yahoo! Internet Location Platform – YDN

 yahoo` geo` geocoding` location` landmarks` gis

How to find images on the internet « Random knowledge

 images` links` lists` archive`

Yahoo offers geographic data to Web sites | Tech news blog – CNET News.com

 gis` webservice` yahoo` api` location` landmark

Instructions for Obtaining Search Engine Transaction Logs

 query` search` log` excite` altavista` alltheweb` transaction

TechTC – Technion Repository of Text Categorization Datasets

 datamining` textmining` categorization` classification` odp` directory` text

The TechTC-100 Test Collection for Text Categorization

 textmining` classification` category` odp` directory

FEC Election Contributions: Download Detailed Files by Election Cycle

 individual` donations` government` election` publicdata` fec

Juiced Google Analytics Python API: Juice Analytics

 search` statistics` keywords` analytics` api` python` web` seo` google

Country Name and ISO 3166 Code MySQL Import File

 mysql` states` countries` isocode

geocoded Hotels « GeoNames Blog

 hotels` geonames`

GeoNames webservice and data download

 locations` cities` countries` gis

Index of /download/worldcities

 cities` gis

ualberta dependency based thesaurus and word count data

 corpus` text` similarity` terms

CommonCrawl – About

 web` crawler` bot`

Data sets and corpus / corpora for biological literature and text mining  

 bioinformatics` text` corpora` domainspecific` genomics` corpus`

Office of Defects Investigation (ODI), Flat File Downloads

 defect` recall` automobile` fightclub` nhtsa` safety

p2psim – kingdata : DNS server latency network distance matrices

 distance` matrix` network` p2p` dns` latency` nmf` queryminer

Sep Kamvar / Personalization /

 pagerank` web` matrix` matlab

beta.opentick.com

 opentick` trading` beta` feeds` finance

WikiXMLDB: Querying Wikipedia with XQuery

 wikipedia` xml` ec2

kiwitobes.com » Blog Archive » Walmart Growth Video

 walmart` visualization` video` freebase` store` retail` locations` opening

Open Cell Id dataset – phone geolocation from GSM cellids

 gis` mobile` geolocation

The Cornell Web Lab – The Cornell Web Lab

 cornell` web` archive` hadoop` crawl

im2gps: estimating geographic information from a single image

 imagerecognition` via:csantos` gis` cmu` gps` imageprocessing` paper` hack` freaking_awesome

Datasets: MUSCLE WP2 Evaluation, Integration and Standards

 image` video` audio` currency` sports` imagerecognition

Open Economics – Store – Index

 economics` list

welcome @ omdb

 free` movie` database` netflixprize

Cogblog » Blog Archive » Cogmap APIs

 api` cogmap` person` name` organization` record_linkage

Wal-Mart : Freebase – The World’s Database

 retail` locations` stores

Cogmap: The Org Chart Wiki

 record_linkage` identity` name` organization` orgchart` marketing

German English Parallel Corpus “de-news”, Daily News 1996-2000

 german` translation` corpus` english` text` via:maxme

Welcome to the CRCNS data sharing activity website — CRCNS

 neuroscience` patch` clamp` recordings` neuron` timeseries` patchclamp` data` neural

Infochimps.org: Free Redistributable Rich Data Sets

 aggregator` links

Frequent Itemset Mining Dataset Repository

 retail` clickstream` traffic` web` links` sales

Dolores Labs Blog » Blog Archive » Our color names data set is online

 colormap` color` mechanicalturk

TeradataUniversityNetwork.com -> Registration

 teradata` retail` transactional` database

Pascal Learning Challenge Large Datasets

 large` competition` challenge` svm` machinelearning` scalability

ECIS 2007 – The 15th European Conference on Information Systems

 retail` dillards` sams_club

Alexa Web Search

 alexa` aws` web` search` api`

State and Federal Case Law

 creativecommons` court` legal` law` via:inkdroid

Access to Web Research Collections VLC2/WT10g/WT2g

 blog` web` text

Lyricsfly Lyrics API, database access to search for music artist and song title

 song` lyrics` database` api`

99 Wikipedia Sources Aiding the Semantic Web » AI3:::Adaptive Information

 links` directory` record_linkage` extraction` wikipedia` named_entity` recognition` textmining` semanticweb

AudioScrobbler Data

 audioscrobbler` recommendation` collaborative` filtering` music

The Linking Open Data dataset cloud

 directory` rdf` semantic` data` soup` graph

Free Economic Data | Economic, Financial, and Demographic Data

 finance` economics` portal` links

::MLSP 2008::: MLSP competition

 machinelearning` trading` competition` backtest` matlab` code` finance` via:DeliciousRob

Computer Vision Test Images

 computer` vision` image` ray` trace` fingerprint` stereo` detection` via:chl

The Dataverse Network Project | The Dataverse Network Project

 statistics` repository` harvard

DVN – Home

 harvard` repository` social` science` research` portal` links

Temperature data (HadCRUT3 and CRUTEM3)

 climate` temperature` netcdf

MNIST handwritten digit database, Yann LeCun and Corinna Cortes

 handwriting` mnist` image` recognition

LFW : Labelled Faces in the Wild

 facerecognition` face` recognition` umass` image

Making random contacts – (37signals)

 generator` names

Test (Sample) Data Generators

 generator` tools` list` via:jd

Compete – Compete Developer Resources

 compete` api` web` statistics` traffic` analytics` mashup

Machine Learning (Theory) » The Peekaboom Dataset

 peekaboom` vision` image` large` human` computation` machinelearning` recognition

Ocean Processes and Modeling: Ocean Data

 links` oceanography` satellite

BlogoCenter data sets

 blog` ucla

Tagged datasets for named entity recognition tasks

 nlp` corpus` tagged` named_entity` recognition` list

del.icio.us stats – deli.ckoma

 del.icio.us`

The Financial Data Finder A – G

 finance` links

Freebase Wikipedia Extraction (WEX)

 wikipedia` xml` structured` corpus

The arXiv.org API

 arxiv` api` open` paper` academic`

England Football Results Betting Odds | Premiership Results & Betting Odds

 gambling` soccer` football` excel` statistics

HughesData – Main – Hughes Lab

 rna` bioinformatics` microarray` expression` gene` machinelearning

Stanford MicroArray Database

 bioinformatics` microarray` expression` gene` machinelearning` stanford

ArrayExpress Home

 bioinformatics` microarray` expression` gene` machinelearning

Gene Expression Omnibus (GEO) Main page

 bioinformatics` microarray` expression` gene` machinelearning

Index of /courts.gov

 corpus` text` legal` law` court` ruling` opensource` publicdata

Welcome to Openvest

 python` finance` edgar` pylons` matplotlib` sec` webservice` via:jolby

Statistical Science Web: Data Sets

 links` statistics

Text Mining, Visualization and Social Media

 crawler` blog` corpus

Aleix Face Database

 facerecognition` machinelearning` face` image

Data Repository Evaluation

 umd` links` statistics` government` sports` via:rickladd

PMC FTP Service

 biology` medicine` articles` text` journal` authors

“uspop2002” data set

 music` similarity` machinelearning

Internet Archive: Details: Amazon ASIN listing and similarity graph

 ASIN` amazon` recommendation` collaborative` filtering` via:keyvowel

European Climate Assessment Daily Weather Data

 weather` europe` ascii` netcdf

StatLib—Datasets Archive

 machinelearning` datamining` cmu` link` collection

National Household Travel Survey (NHTS) Data

 driving` transportation` publicdata

Nielsen BookScan USA

 books` sales` commercial

Home – Numbrary

 finance` data`

About – Numbrary

 searchengine` search` tagging` aggregator` numeric` extraction` tables` collaboration` web2.0

Main Page – OpenTextMining

 textmining` open` nature` standards` search

Metafilter Infodump

 metafilter` comments` network` via:chl

WEBSPAM-UK2007 | Datasets | Web Spam Detection

 web` search` spam` crawler` yahoo

Trust network datasets – TrustLet

 socialnetwork` trustnetwork` trust

TaskForces/CommunityProjects/LinkingOpenData/DataSets – ESW Wiki

 opendata` semantic` rdf` collaboration

Some Datasets Available on the Web » Data Wrangling Blog

 publicdata` links

XML.com: GovTrack.us, Public Data, and the Semantic Web

 semanticweb` rdf` congress` politics` government

CiteULike: Available datasets

 networks` research` graph` tags` paper` record_linkage

Archive-It.org

 archive` internet` web` index`

Challenge: Synopsis – Causality Workbench

 competition` machinelearning` forecasting` contest

Natural Language Processing

 microsoft` text` paraphrase` corpus

LDC – Linguistic Data Consortium – Obtaining Data Resources

 nlp` text` corpus` ngram` google` commercial` license

1990 Census Name Files

 census` names` identity` frequency` record_linkage

Given Name Frequency Project: Analysis of Given Name Popularity

 name` record_linkage` text` identity` code

Email Datasets

 enron` names` identity` text` record_linkage

ZoomInfo – Welcome to the ZoomInfo Developer API

 api` identity` people` webservice` record_linkage

Name Discrimination Data Named Entity Resolution /  Entity Disambiguation

 record_linkage` corpus` nlp` names

Developers Area – eBay Market Data Documentation – eBay Market Data Documentation

 ebay` api` retail` price` code

New SwetoDblp RDF dataset released with 11M triples

 name` authorship` rdf` record_linkage

LSDIS : SwetoDblp

 bibliography` rdf` ontology` duplicate` name` record_linkage

StrikeIron Super Data Pack Web Service 1.0 – StrikeIron Marketplace

 webservice` publicdata` datacleaning

Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets

 duplicate` detection` record_linkage` datacleaning` text

INFO 747 – Social and Economic Data

 datacleaning` record_linkage` video` lectures` course` cornell` economics` finance` publicdata

Overstock.com Affiliate Program

 retail` overstock` sales` api` product` price` forecasting

Amazon Web Services Developer Connection : Can Alexa WS provide detailed …

 finance` alexa` amazon` tech

Market Data — eBay Developers Program

 ebay` retail` pricing` sales` api` product

Machine Learning and Data Mining – Datasets

 face` image

GIS for Schools

 epidemiology` gis` health

Google Trends API coming soon | Tech news blog – CNET News.com

 google` trends` api`

MIT Media Lab: Reality Mining

 social` activity` location` cell` gis

RL Competition 2008 – Home

 machinelearning` reinforcement` agent` competition`

Vehicle Routing Data Sets

 optimization` vehicle` routing

EIA – Petroleum Data, Reports, Analysis, Surveys

 oil` energy` statistics` economics` petroleum

DMOZ100k06 – Michael G. Noll

 search` pagerank` text` tags` content

Grading

 machinelearning` CMU` course` projects` graphicalmodel` code` paper

Financial Forecast Center’s Historical Economic and Market Data

 exchangerate` dollar` economics`

Browse Business Cycle Indicators Data

 economics` indicators` time` series

The Numbers Guy : Aspiring to Be the Wikipedia of Numbers

 finance` numberpedia` mechanicalturk` textmining` statistics

Social characteristics of the Marvel Universe

 socialnetwork` graphs` comicbooks

SourceForge.net: Word Lists Collection

 dictionary` words

See Who’s Editing Wikipedia – Diebold, the CIA, a Campaign

 wikipedia` authorship`

Dataset Generator – Perfect data for an imperfect world.

 tools` generator

Entree Chicago Recommendation Data

 recommender` collaborative` restaurant

community resource guide: i’ve been here before – show me the links

 demographics` maps` gis` statistics` links

Social Science Data on the Net

 economics` social` government` health` labor` links

List of films: A – Wikipedia, the free encyclopedia

 netflix` netflixprize` movie` index` wikipedia`

The arXiv on your harddrive

 paper` corpus` arXiv

Insanely Useful Websites | Sunlight Foundation

 links` transparency` government` politics` congress` reference

Technophilia: Where to find public records online – Lifehacker

 public` records` links

Junk email project

 corpus` email` spam` textmining

Enron Email Dataset

 enron` corpus` email` text` social` network

ftp://ftp.bls.gov/pub/special.requests/cpi/cpiai.txt

 finance` cpi` inflation` data

GOS – Geospatial One Stop

 health` gis` epidemiology` links

CIA Factbook Grep in Python

 cia` population` python` code` grep

Miller Center of Public Affairs – Richard Nixon – Oval Office Recordings

 nixon` speech` tapes` audio` mp3` wav` flac

Deborah Jeane Palfrey Legal Defense Fund

 phone` politics

UC San Diego Data Mining Competition – 2007 – Datasets

 housing` refinance` mortgage`

Retail Industry Financial Ratios & Benchmarks

 retail` finance` sales` sqft`

stores | POI Factory

 retail` location` poi

GpsPasSion Forums – ** INDEX OF POI COLLECTIONS **

 retail` poi` location` gis` gps

GPS POI US : Home > Retail Stores

 retail` location` gis

Collective Dynamics Group

 smallworld` networking` socialnetwork` graph

Jester Data download page

 collaborative` filtering` jokes

TricTrac: Video Dataset

 video`

Premium Business Information Databases – AlacraWiki

 links` finance` commercial

Index of /edgar

 finance` xml` edgar` sec` code` perl

Mail Index

 EDGAR` sec` mail` text

metafy / Anthracite Idioms

 finance` SEC` scrape` parse` commercial

Retail and Food Services – Time Series Data/Seasonal Factors

 retail` sales` census

TDT

 categorization` textmining` detection` tools

Volume of retail sales: Social Trends 33

 retail` sales` uk

generatedata.com

 tools` generator` random

Melissa DATA – Lookups

 consumer` data` database` api

FactSet: Data Maven – Kiplinger.com

 factset` finance`

IBES (Demo)

 finance` ibes` analyst` forecast` wharton

Thomson Financial I/B/E/S Data

 finance`

Historical Quotes – Yahoo! Finance

 yahoo` finance` stock` price`

Network data

 network` links

Bureau of Labor Statistics Home Page

 statistics` labor` government` consumer

NAR: Research: EHS Data

 housing` sales` finance

RFA – The Industry – Industry Statistics

 ethanol`

Chain Store Guide – Retail Locations

 retail` finance` store` locations` gis

Press Releases – Directions Magazine

 retail` gis` store` locations

Energy Information Administration – EIA – Official Energy Statistics from the U.S. Government

 finance` government` energy` historical` forecasts` fuel` oil

Databases you can use for benchmarking

 links

UPC Database: Downloads

 product` upc` database`

Web Crawling / Crawl Datasets at Tobias Escher at the OII

 crawler` benchmark` search` web` links

TechTC – Technion Repository of Text Categorization Datasets

 corpus` text

TMC data archive download site

 traffic` data`

http://www.volvis.org/

 volume rendering

Computational Vision: Archive

 vision` caltech` image recognition

DC Pedestrian Classification Benchmark

 pedestrian` image` classification` detection

opentick :: home

 finance` economics` feed` free` stock` trading` opentick` opensource

Web as Corpus

 textmining` corpus` concordance` wordlist` n-gram

.:[ packet storm ]:. – http://packetstormsecurity.org/

 dictionary` hack` security` wordlist` password

Enron Dataset

 data` mysql` email` energy` text` social network

Splog Blog Dataset

 blog` corpus` spam

Home Page for 20 Newsgroups Data Set

 corpus` text` newsgroup

White Glove Tracking

 crowd sourcing` image` processing` algorithm` collaborative` distributed` web2.0` code` opensource

NOAA Paleoclimatology Program – Coral and Sclerosponge Data

 paleo climatology` climate` oceanography` coral` sponge` biology

NAICS — North American Industry Classification System

 finance` economics` naics` industry` classifications

Saving Democracy With Web 2.0 –

 democracy` web2.0` mashup` government` funding` article

Congresspedia – Congresspedia

 collaborative` wiki` government` congress` politics` elections` web2.0` directory

Population Estimates Data Sets

 census` data` population` statistics

CRAN Task View: Machine Learning & Statistical Learning

 statistical learning` machine learning` code` R` libraries` cran`

Data for Data Mining

 links` datamining` timeseries` text` extraction` socialnetwork

PAIDA – Pure Python scientific analysis package

 python` visualization` library

SUBDUE – Graph Based Knowledge Discovery

 machine learning` network` graph`

AOL search data mirrors

 aol` search`

Python Cheese Shop : shakespeare 0.4

 python` text`

AG’s corpus of news articles

 corpus` nlp` machine learning` textmining

Sampling Techniques for Massive Data – Google Video

 video` machine learning` statistics` matrix` sampling` large` sparse` algorithm` experiment_design

metachronistic » Mirror the Wikipedia

 wikipedia` laptop` install` dump

LETOR: Benchmark Datasets for Learning to Rank

 ranking` search

CN710: Comparative Analysis of Learning Systems (Spring 2006) – Class Project

 machinelearning` algorithm` ogi` bu` greyhound` finance

UrbanSim Home

 python` urban` software` simulation` opensource` GIS` census`

System One – Wikipedia³

 wikipedia` rdf`

System One – Labs

 wikipedia` rdf` tools

Face Recognition Homepage – Databases

 face` algorithm` facerecognition` data` image

CBCL SOFTWARE Face data set

 face` seung` algorithm` recognition` image

Text Analytics Solutions from ClearForest

 extraction` finance` semantic` semanticweb` text

23C3 – Mining Search Queries – Google Video

 aol` search` video` talk` algorithm` information retrieval` datamining` machinelearning

Digital History Hacks: Keywords and Clues

 aol` search` query` analysis

Digital History Hacks: Searching for History

 aol` search` query` analysis

The Tom Kyte Blog: An interesting data set…

 aol` search` oracle` database` code

KDD 2005 – KDD Cup 2005: Aug 21-24, Chicago, IL. USA

 query` categorization` algorithm` google

Statistical NLP / corpus-based computational linguistics resources

 corpus` machine learning` text

Ph.d.-student Rasmus Elsborg Madsen

 text` machine learning` context` matlab

Intelligent Web Search and Mining: Tools & Resources

 machine learning` code` links

PageRank Datasets and Code

 pagerank` code` algorithm

Official Google Research Blog: All Our N-gram are Belong to You

 linguistics` google` ngram` nlp` record_linkage

Hyper-threaded Java – Java World

 clustering` algorithm` java` parallel

Statistical Modeling, Causal Inference, and Social Science

 blog` econometrics` finance` machine learning` math` statistics

Structural Analysis of Discrete Data and Econometric Applications,

 books` econometrics` economics` finance` ebook

Kris Brower » Archives » Google Onpage Search Results Analysis

 google` ranking` aol` search` analytics

CSE 250B Fall 2006

 netflixprize` machine learning` course`

Matrix Market

 matrixmarket` matrix`

Estimation of mean values, covariance matrices and imputation of missing values

 imputation` matlab` missing` EM` machinelearning

Face Detection

 face` image

CSE 250B Project 4, Fall 2006

 subset` netflix prize` dimensionality` reduction

G3DATA

 extract` from` graphs` hack` google` trends

cwm – a general purpose data processor for the semantic web

 python` processor` semantic` web` rdf

WebBase Project

 link` analysis` structure` web` crawler` stanford

sam roweis : data

 machine` learning` matlab` python` hackers` image

Flight Data and Weather Data

 flight data` airplane data` weather data` airline route data` aircraft flight data` in-flight analysis` airline on-time data

Survey of Patients Hospital Experiences

 healthcare analytics` subjective outcomes` healthcare customer service` quality of care` patient care` patient surveys

Largest collection of longitudinal hospital care data in the US

 healthcare analytics` healthcare big data` research datasets` national in-patient statistics` local healthcare statistics` in-patient statistics` hospital cost data` hospital use data

Ambulatory healthcare data

 physician visits data` doctor visit data` outpatient care data` private practice physician data` non-federal healthcare data` physician office data

Index of /data/sequence/mnist

 mnist` xml` format

MNIST handwritten digit database

 mnist`

Book-Crossing Dataset

 data` set` collaborative` filtering` datamining` books` movie

allmovie

 movie` netflix prize` source`netflix

Submissions Guidelines for the Collectorz.com Online Movie Database

 movie` source

cinema.com

 plot` synopsis` movie` netflix prize` prize

LUMIERE

 netflixprize` prize` european` movie` revenue`

Data dumps – Meta

 mediawiki` wikipedia` import` mysql` sql

“phone ***” ” address *” “e-mail” intitle:”curriculum vitae” – Google Search

 resume` google

generatedata.com

 random` generator` database` sql

Lending Club Loan Data

Finance`Loans`business`investing

SMS Spam Collection

spam`email`text analysis

Data Sets | Pew Research Center’s Internet & American Life Project

demography

Flickr personal taxonomies

flickr`taxonomy`images

Yahoo Data for Researchers

yahoo

DBLP Computer Science Bibliography

bibliographies`text mining

ICWSM Spinnr Challenge 2011 dataset

weblog`blog`social media`network analysis

Quantum Chaotic Thoughts: Facebook100 Data Set

facebook`network analysis`social

Public Data Sets on Amazon Web Services (AWS)

Amazon Web Services` amazon` ebs` ec2` s3` publicdata` hadoop

The ClueWeb09 Dataset

human language`text mining

Data | The World Bank

government`finance`economy

ImageNet

images

What is Twitter, a Social Network or a News Media? – WWW’10

twitter`text mining`social

dotbot | DotNetDotCom.org

spider`web analytics

arXiv.org help – arXiv Bulk Data Access – Amazon S3

Amazon Web Services` amazon` ebs` ec2` s3` publicdata` hadoop

YouTube Dataset

youtube`image analysis`video analysis

Face Recognition Homepage – Databases

face recognition`facial recognition`image analysis

UCI Network Data Repository

data repositories

Datasets for “The Elements of Statistical Learning”

learning

MovieLens Data Sets | GroupLens Research

movies`video analysis`business

Translation Task – EMNLP 2011 Sixth Workshop on Statistical Machine Translation

translation`human language

Project Gutenberg

books`text mining

About WordNet – WordNet – About WordNet

wordnet`corpus

Aligned Hansards of the 36th Parliament of Canada

canada`parliament`government`text mining

CRCNS – Collaborative Research in Computational Neuroscience – Data sharing

Bioinformatics` fmri` neuroscience` python` neuralnetwork

USENET corpus

usenet`text mining

UniGene

bioinformatics

ChEMBLdb

chemoinformatics

UCI Machine Learning Repository

algorithms

Gene Expression Omnibus (GEO) Main page

genetics`bioinformatics

Social Science Data

social science

IMDB dataset

business

Stanford Large Network Dataset Collection

network analysis

Google Books n-gram dataset

books`text mining

Million Song Dataset | scaling MIR research

audio analysis

Belly Button Biodiversity 2.0

health informatics`bioinformatics

Datasets – Modeling Online Auctions

auctions

2gb of photos of cats

image analysis`pets`cats

Click Dataset | Center for Complex Networks and Systems Research

web analytics

The Electric Rice Cooker — One year of deleted weibos archive

text mining

Registered meteorites that has impacted on Earth visualized – AnalyticBridge

meteorites`atmosphere

GeoJSON files for real-time Virginia transportation data.

road`traffic`accidents`transportation

NYPD Crash Data Band-Aid

road`traffic`accidents`transportation

State Department of Education datasets

student performance`school demographics`standardized test performance`school quality`education

A list of several sources to learn data science in a hands-on format

https://www.coursera.org/course/ml – The most approachable machine learning course available. And it’s free.

https://www.kaggle.com/wiki/Tutorials – Provides data sources, forums, scenarios, and real-world competitions to teach data mining

http://deeplearning.net/tutorial/ – Tutorial on Deep Learning – introduction to machine learning image analysis algorithms

http://tryr.codeschool.com/ – Interactive introduction to R Language

The elements of big data analytics have roots in statistics, knowledge management, and computer science. Many of the data mining terms below appear in these disciplines but may carry different connotations or specialized meanings when applied to our problems. The problems of massively parallel processing and the specialized algorithms used to perform analysis in a distributed computing environment are enough to require specialized treatment.

Data Mining Terms

Term Definition
Accuracy A measure of a predictive model that reflects the proportion of times that the model is correct when applied to data.
Bias The difference between the expected value of an estimate and the actual value.
Cardinality The number of different values a categorical predictor or OLAP dimension can have. High cardinality predictors and dimensions have large numbers of different values (e.g. zip codes); low cardinality fields have few different values (e.g. eye color).
CART Classification and Regression Trees. A type of decision tree algorithm that automates the pruning process through cross validation and other techniques.
CHAID Chi-Square Automatic Interaction Detector. A decision tree algorithm that uses contingency tables and the chi-square test to create the tree.
Classification The process of learning to distinguish and discriminate between different input patterns using a supervised training algorithm. Classification is the process of determining that a record belongs to a group.
Cluster Centroid The most typical case in a cluster. The centroid is a prototype; it does not necessarily describe any given case assigned to the cluster.
Clustering The technique of grouping records together based on their locality and connectivity within the n-dimensional space. This is an unsupervised learning technique.
Collinearity The property of two predictors showing significant correlation with each other, whether or not there is a causal relationship between them.
Concentration of Measure The principle that any set of positive probability can be expanded very slightly to contain most of the probability. For example, the average of bounded independent random variables is tightly concentrated around its expectation.
Conditional Probability The probability of an event happening given that some other event has already occurred. For example, the chance of a person committing fraud is much greater given that the person has previously committed fraud.
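As a rough illustration of the Conditional Probability entry, the sketch below estimates P(event | given) from counts over a list of records. The record fields `prior_fraud` and `fraud` are hypothetical names chosen for this example only.

```python
def conditional_probability(records, event, given):
    """Estimate P(event | given) as a ratio of counts over the records."""
    # Keep only the records where the conditioning event occurred.
    matching = [r for r in records if given(r)]
    if not matching:
        raise ValueError("the conditioning event never occurs in the records")
    # Among those, count how often the event of interest also occurred.
    return sum(1 for r in matching if event(r)) / len(matching)


# Hypothetical toy data: did the person commit fraud before, and now?
people = [
    {"prior_fraud": True, "fraud": True},
    {"prior_fraud": True, "fraud": False},
    {"prior_fraud": False, "fraud": False},
    {"prior_fraud": False, "fraud": False},
]
p_fraud_given_prior = conditional_probability(
    people, lambda r: r["fraud"], lambda r: r["prior_fraud"]
)
```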
Confidence The likelihood of the predicted outcome, given that the rule has been satisfied.
Convergence of Random Variables The idea that a sequence of essentially random or unpredictable events can sometimes be expected to settle into a behavior that is essentially unchanging when items far enough into the sequence are studied.
Correlation A number that describes the degree of relationship between two variables.
Coverage A number that represents either the number of times that a rule can be applied or the percentage of times that it can be applied
Cross-validation The process of holding aside some training data that is not used to build a predictive model and later using that data to estimate the accuracy of the model on unseen data, simulating the real-world deployment of the model.
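A minimal sketch of k-fold cross-validation, one common way to do the holding aside described in the Cross-validation entry. The helper names and the `train_fn` / `predict_fn` callables are assumptions for illustration, and the sketch assumes `k` is no larger than the number of records.

```python
import random


def k_fold_indices(n_records, k, seed=0):
    """Shuffle the indices 0..n_records-1 and split them into k folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(records, labels, k, train_fn, predict_fn):
    """Return the mean accuracy on the held-out fold across k folds."""
    scores = []
    for held_out in k_fold_indices(len(records), k):
        held = set(held_out)
        # Build the model only from records outside the held-out fold.
        train_x = [records[i] for i in range(len(records)) if i not in held]
        train_y = [labels[i] for i in range(len(records)) if i not in held]
        model = train_fn(train_x, train_y)
        # Score the model on the records it has not seen.
        correct = sum(predict_fn(model, records[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)
```

Any model can be plugged in through the two callables, for example a majority-class baseline whose `train_fn` returns the most frequent label.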
Data Mining Process Define the problem. Select the data. Prepare the data. Mine the data. Deploy the model. Take business action.
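Discrete Fourier Transform A transform that represents a sequence by its frequency components. For many signals it concentrates most of the energy in the first few coefficients.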
Entropy A measure often used in data mining algorithms that measures the disorder of a set of data
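One common form of the Entropy measure is Shannon entropy over class labels. The sketch below assumes that form and reports the result in bits.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum -p * log2(p) over the observed classes.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A set with a single class has zero entropy (no disorder), while an even split between two classes has an entropy of one bit.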
Error Rate A number that reflects the rate of errors made by a predictive model. It is one minus the accuracy
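The relationship between the Accuracy and Error Rate entries can be sketched as follows; the function names are illustrative only.

```python
def accuracy(predicted, actual):
    """Proportion of predictions that match the actual values."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def error_rate(predicted, actual):
    """One minus the accuracy."""
    return 1.0 - accuracy(predicted, actual)
```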
Expectation-Maximization (EM) An algorithm for estimating parameters where there are significant missing or inferred values. It solves estimation with incomplete data by iteratively using estimates for the missing data and continuing until convergence.
Expert System A data processing system comprising a knowledge base (rules), an inference (rules) engine, and a working memory
Exploratory Data Analysis The processes and techniques for general exploration of data for patterns in preparation for more directed analysis of the data
Factor Analysis A statistical technique which seeks to reduce the number of total predictors from a large number to only a few “factors” that have the majority of the impact on the predicted outcome.
Fuzzy Logic A system of logic based on the fuzzy set theory
Fuzzy Set A set of items whose degree of membership in the set may range from 0 to 1
Fuzzy System A set of rules using fuzzy linguistic variables described by fuzzy sets and processed using fuzzy logic operations
Genetic Algorithm An optimization technique that uses processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
Genetic Operator An operation on the population member strings in a genetic algorithm that is used to produce new strings.
Gini Index A measure of the disorder reduction caused by the splitting of data in a decision tree algorithm. Gini and the entropy metric are the most popular ways of selecting predictors in the CART decision tree algorithm.
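A sketch of how the Gini measure might be computed for a candidate split, assuming the usual Gini impurity (one minus the sum of squared class proportions) and a binary split. The function names are illustrative.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def gini_reduction(parent, left, right):
    """Reduction in impurity from splitting parent into left and right."""
    n = len(parent)
    # Weight each child's impurity by its share of the parent's records.
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted
```

A split that separates the classes perfectly removes all of the parent's impurity, so the reduction equals the parent's impurity.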
Hebbian Learning One of the simplest and oldest forms of training a neural network. It is loosely based on observations of the human brain. The neural net link weights are strengthened between any nodes that are active at the same time.
Hill Climbing A simple optimization technique that modifies a proposed solution by a small amount and then accepts it if it is better than the previous solution. The technique can be slow and suffers from being caught in local optima
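The Hill Climbing entry can be sketched for a one-dimensional problem as below. The step size, iteration count, and random perturbation scheme are assumptions made for this example.

```python
import random


def hill_climb(score, start, step=0.1, iterations=1000, seed=0):
    """Maximize score(x) by accepting small random changes that improve it."""
    rng = random.Random(seed)
    best, best_score = start, score(start)
    for _ in range(iterations):
        # Modify the current solution by a small random amount.
        candidate = best + rng.uniform(-step, step)
        candidate_score = score(candidate)
        # Accept the change only if it is better than the previous solution.
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score
```

Because only improvements are accepted, a score function with several peaks can leave the search stuck on whichever local optimum is nearest the starting point.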
Hypothesis Testing The statistical process of proposing a hypothesis to explain the existing data and then testing to see the likelihood of that hypothesis being the explanation
ID3 A decision tree algorithm that chooses splits by information gain (reduction in entropy).
Intelligent Agent A software application which assists a system or a user by automating a task. Intelligent agents must recognize events and use domain knowledge to take appropriate actions based on those events.
Itemset An itemset is any combination of two or more items in a transaction
Jackknife Estimate An estimate of a parameter obtained by omitting one value from the set of observed values. Repeating this for each value allows you to examine the impact of outliers.
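A minimal sketch of the leave-one-out idea behind the Jackknife Estimate, using the mean as the estimator. The helper names are illustrative.

```python
def jackknife_estimates(values, estimator):
    """Return one estimate per observation, each computed with it omitted."""
    return [estimator(values[:i] + values[i + 1:]) for i in range(len(values))]


def mean(xs):
    return sum(xs) / len(xs)


observations = [1.0, 2.0, 3.0, 100.0]
leave_one_out = jackknife_estimates(observations, mean)
```

The estimate that omits the value 100.0 differs sharply from the others, which is how the outlier's influence shows up.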
Kernel A function that transforms the input data to a high-dimensional space where the problem is solved.
k-Nearest Neighbor A data mining technique that performs prediction by finding the prediction value of records (near neighbors) similar to the record to be predicted
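A small sketch of k-Nearest Neighbor classification, assuming Euclidean distance and a majority vote among the k closest training records. It uses `math.dist`, which requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of query by majority vote of its k nearest points."""
    # Pair every training record's distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]
```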
Kohonen Network A type of neural network in which the nodes learn as local neighborhoods, so the locality of the nodes is important in the training process. Kohonen networks are often used for clustering.
Latent Variable A variable that is inferred from a model rather than observed.
Lift A number representing the increase in responses from a targeted marketing application using a predictive model over the response rate achieved when no model is used
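One way to compute Lift, assuming it is expressed as the ratio of the targeted response rate to the baseline response rate. The parameter names are illustrative.

```python
def lift(targeted_responses, targeted_size, baseline_responses, baseline_size):
    """Ratio of the response rate with the model to the rate without it."""
    targeted_rate = targeted_responses / targeted_size
    baseline_rate = baseline_responses / baseline_size
    return targeted_rate / baseline_rate
```

Under this definition a lift above 1 means the model-selected group responds at a higher rate than an untargeted group.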
Machine Learning A field of science and technology concerned with building machines that learn. In general it differs from Artificial Intelligence in that learning is considered to be just one of a number of ways of creating an artificial intelligence
Maximum Likelihood A method for estimating the parameters of a model.
Maximum Likelihood Estimate (MLE) The parameter estimates that maximize the probability that the sample data occurs for the specific model. The joint probability of observing the sample data is obtained by multiplying the individual probabilities.
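As a worked illustration of the Maximum Likelihood Estimate, the sketch below uses a Bernoulli (coin-flip) model, an assumption made for this example. The log of the joint probability is the sum of the individual log probabilities, and it is maximized by the observed proportion of successes.

```python
import math


def bernoulli_log_likelihood(p, outcomes):
    """Log of the joint probability of 0/1 outcomes with success rate p."""
    return sum(math.log(p if x == 1 else 1.0 - p) for x in outcomes)


def bernoulli_mle(outcomes):
    """The success rate that maximizes the likelihood of the outcomes."""
    return sum(outcomes) / len(outcomes)


sample = [1, 1, 0, 1]
p_hat = bernoulli_mle(sample)
```

Evaluating `bernoulli_log_likelihood` at `p_hat` and at other candidate values of `p` shows that the estimate gives the highest value.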
Mean Absolute Error The average of the absolute differences between predicted and actual values: AVG(ABS(predicted_value - actual_value)).
Mean Squared Error (MSE) The expected value of the squared difference between the estimate and the actual value.
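The two error measures above can be sketched over paired lists of predicted and actual values. The sample mean stands in for the expected value here, and the function names are illustrative.

```python
def mean_absolute_error(predicted, actual):
    """AVG(ABS(predicted_value - actual_value))."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def mean_squared_error(predicted, actual):
    """Average of the squared differences between predicted and actual."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
```

Squaring the differences makes the MSE penalize large errors more heavily than the mean absolute error does.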
Memory-Based Reasoning (MBR) A technique for classifying records in a database by comparing them with similar records that are already classified. A form of nearest neighbor classification.
Minimum Description Length (MDL) Principle The idea that the least complex predictive model (with acceptable accuracy) will be the one that best reflects the true underlying model and performs most accurately on new data.
Model A description that adequately explains and predicts relevant data but is generally much smaller than the data itself
Neural Network A computing model based on the architecture of the brain. A neural network consists of multiple simple processing units connected by adaptive weights
Nominal Categorical Predictor A predictor that is categorical (finite cardinality) but where the values of the predictor have no particular order. For example, red, green, blue as values for the predictor “eye color”.
Ordinal Categorical Predictor A categorical predictor (i.e. has finite number of values) where the values have order but do not convey meaningful intervals or distances between them. For example the values high, middle and low for the income predictor
Outlier Analysis A type of data analysis that seeks to determine and report on records in the database that are significantly different from expectations. The technique is used for data cleansing, spotting emerging trends and recognizing unusually good or bad performers
overfitting The effect in data analysis, data mining and biological learning of training too closely on limited available data and building models that do not generalize well to new unseen data. At the limit, overfitting is synonymous with rote memorization where no generalized model of future situations is built
Point Estimation The estimation of a population parameter by a single value, typically calculated from a sample. May also be used to predict a value for missing data.
Predictive model A model created or used to perform prediction, in contrast to models created solely for pattern detection, exploration or general organization of the data
Predictor The column or field in a database that could be used to build a predictive model to predict the values in another field or column. Also called variable, independent variable, dimension, or feature.
Principal Component Analysis A data analysis technique that transforms a set of possibly correlated predictors into a smaller set of uncorrelated components, ordered by the amount of variance in the data that each one captures. Often used to reduce the number of predictors before building a model
Prior Probability The probability of an event occurring without dependence on (that is, not conditional on) some other event. In contrast to conditional probability
Purity/Homogeneity the degree to which the resulting child nodes are made up of cases with the same target value
Radial Basis Function Networks Neural networks that combine some of the advantages of neural networks with those of nearest neighbor techniques. In radial basis functions the hidden layer is made up of nodes that represent prototypes or clusters of records
Receiver Operating Characteristic (ROC) A curve that plots the true positive rate of a binary classification model against its false positive rate across decision thresholds. The area under the ROC curve (AUC) measures the discriminating ability of the model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with unbalanced target distribution (one target class dominates the other).
Regression A data analysis technique classically used in statistics for building predictive models for continuous prediction fields. The technique automatically determines a mathematical equation that minimizes some measure of the error between the prediction from the regression model and the actual data
Reinforcement Learning A training model where an intelligence engine (e.g. neural network) is presented with a sequence of input data followed by a reinforcement signal
Root Mean Squared Error SQRT(AVG((predicted_value – actual_value) * (predicted_value – actual_value)))
Sampling The process by which only a fraction of all available data is used to build a model or perform exploratory analysis. Sampling can provide relatively good models at much less computational expense than using the entire database
Segmentation The process or result of the process that creates mutually exclusive collections of records that share similar attributes either in unsupervised learning (such as clustering) or in supervised learning for a particular prediction field
Sensitivity Analysis The process which determines the sensitivity of a predictive model to small fluctuations in predictor value. Through this technique end users can gauge the effects of noise and environmental change on the accuracy of the model
Simulated Annealing An optimization algorithm loosely based on the physical process of annealing metals through controlled heating and cooling
Sparsity The condition in which a high proportion of the nested rows are not populated.
Statistical Independence The property of two events displaying no causality or relationship of any kind. This can be quantitatively defined as occurring when the product of the probabilities of each event is equal to the probability of both events occurring
Stepwise Regression An automated sequence of regressions used to identify the most predictive variables. The first regression finds the most predictive variable; the second finds the most predictive variable given the first, and so on.
Supervised Algorithm A class of data mining and machine learning applications and techniques where the system builds a model based on the prediction of a well defined prediction field. This is in contrast to unsupervised learning where there is no particular goal aside from pattern detection.
Support The relative frequency or number of times a rule produced by a rule induction system occurs within the database. The higher the support the better the chance of the rule capturing a statistically significant pattern.
Time-Series Prediction The process of using a data mining tool (e.g., neural networks) to learn to predict temporal sequences of patterns, so that, given a set of patterns, it can predict a future value
Unsupervised Algorithm A data analysis technique whereby a model is built without a well defined goal or prediction field. The systems are used for exploration and general data organization. Clustering is an example of an unsupervised learning system
Visualization Graphical display of data and models which helps the user in understanding the structure and meaning of the information contained in them
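
The three error measures defined in the table (Mean Absolute Error, Mean Squared Error, Root Mean Squared Error) translate almost directly into code. The following is a minimal Python sketch of those formulas; the function names and the sample values are illustrative assumptions, not part of the glossary.

```python
import math

def mean_absolute_error(predicted, actual):
    # AVG(ABS(predicted_value - actual_value))
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def mean_squared_error(predicted, actual):
    # Average of the squared differences between prediction and actual value
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def root_mean_squared_error(predicted, actual):
    # SQRT of the mean squared error
    return math.sqrt(mean_squared_error(predicted, actual))

# Made-up sample values, for illustration only
predicted = [2.0, 4.0, 6.0]
actual = [1.0, 4.0, 8.0]

mae = mean_absolute_error(predicted, actual)
mse = mean_squared_error(predicted, actual)
rmse = root_mean_squared_error(predicted, actual)
print(mae, mse, rmse)
```

Because the errors are squared before averaging, MSE and RMSE penalize the single large miss (6.0 against 8.0) more heavily than MAE does.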

 

This overview of data mining terms is part of a forthcoming publication, “Dictionary of Data Mining Terms” by Don Krapohl, due out in November 2013.  This post does not use any content from a similar work by Dr. Vincent Granville at http://www.analyticbridge.com/profiles/blogs/2004291:BlogPost:223153, but acknowledges it; that work also contains a significant number of data mining terms.

To foster the study of the structure and dynamics of Web traffic networks, Indiana University has made available a large dataset (‘Click Dataset’) of about 53.5 billion HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network rather than relying on server logs and browser instrumentation allows one to examine large volumes of traffic data while minimizing biases associated with other data sources. It also provides one with valuable referrer information to reconstruct the subset of the Web graph actually traversed by users. The goal is to develop a better understanding of user behavior online and create more realistic models of Web traffic. The potential applications of this data include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

The data was generated by applying a Berkeley Packet Filter to a mirror of the traffic passing through the border router of Indiana University. This filter matched all traffic destined for TCP port 80. A long-running collection process used the pcap library to gather these packets, then applied a small set of regular expressions to their payloads to determine whether they contained HTTP GET requests. If a packet did contain a request, the collection system logged a record with the following fields:

  • a timestamp
  • the requested URL
  • the referring URL
  • a boolean classification of the user agent (browser or bot)
  • a boolean flag for whether the request was generated inside or outside IU.
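
The payload check described above can be sketched roughly as follows. This is only an illustration in Python: the regular expressions actually used by the collection system are not given here, so the patterns, the `GetRequest` type, and the `parse_get_request` function are all hypothetical. In pcap filter syntax, "traffic destined for TCP port 80" would typically be written `tcp dst port 80`; the sketch assumes packets have already passed such a filter.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical patterns; the original system's expressions are not published here.
REQUEST_LINE = re.compile(rb"^GET (\S+) HTTP/1\.[01]\r\n")
HOST_HEADER = re.compile(rb"\r\nHost: *([^\r\n]+)", re.IGNORECASE)
REFERER_HEADER = re.compile(rb"\r\nReferer: *([^\r\n]+)", re.IGNORECASE)

class GetRequest(NamedTuple):
    url: str                 # requested host + path
    referrer: Optional[str]  # referring URL, if the header was present

def parse_get_request(payload: bytes) -> Optional[GetRequest]:
    """Return the request fields if the payload starts an HTTP GET, else None."""
    request = REQUEST_LINE.match(payload)
    host = HOST_HEADER.search(payload)
    if request is None or host is None:
        return None
    referer = REFERER_HEADER.search(payload)
    url = host.group(1).decode("latin-1").strip() + request.group(1).decode("latin-1")
    referrer = referer.group(1).decode("latin-1").strip() if referer else None
    return GetRequest(url=url, referrer=referrer)

# Made-up payloads for illustration
get_payload = (b"GET /index.html HTTP/1.1\r\n"
               b"Host: www.example.org\r\n"
               b"Referer: http://example.com/\r\n\r\n")
post_payload = b"POST /form HTTP/1.1\r\nHost: www.example.org\r\n\r\n"

parsed = parse_get_request(get_payload)
ignored = parse_get_request(post_payload)
```

Since each packet is inspected on its own, with no stream reassembly, a request whose headers span several packets would only be partially visible to a check like this.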

Some important notes:

  1. Traffic generated outside IU only includes requests from outside IU for pages inside IU. Traffic generated inside IU only includes requests from people at IU (about 100,000 users) for resources outside IU. These two sets of requests have very different sampling biases.
  2. No distinguishing information about the client system was retained: no MAC or IP addresses nor any unique index were ever recorded.
  3. There was no attempt at stream reassembly, and server responses were not analyzed.

During collection, the system generated data at a rate of about 60 million requests per day, or about 30 GB/day of raw data. The data was collected between Sep 2006 and May 2010. Data is missing for about 275 days. The dataset has two collections:

  1. raw: About 25 billion requests, where only the host name of the referrer is retained. Collected between 26 Sep 2006 and 3 Mar 2008; missing 98 days of data, including the entire month of Jun 2007. Approximately 0.85 TB, compressed.
  2. raw-url: About 28.6 billion requests, where the full referrer URL is retained. Collected between 3 Mar 2008 and 31 May 2010; missing 179 days of data, including the entire months of Dec 2008, Jan 2009, and Feb 2009. Approximately 1.5 TB, compressed.
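
As a quick sanity check, the figures quoted above can be combined with a few lines of arithmetic. The Python sketch below uses only numbers stated in this post and assumes 1 GB = 10^9 bytes; all quoted figures are approximate, so the results should be read as rough estimates.

```python
# Figures quoted in the text (all approximate)
requests_per_day = 60_000_000
raw_bytes_per_day = 30 * 10**9  # 30 GB/day, assuming 1 GB = 10^9 bytes

collections = {
    "raw":     {"requests": 25_000_000_000, "missing_days": 98,  "compressed_tb": 0.85},
    "raw-url": {"requests": 28_600_000_000, "missing_days": 179, "compressed_tb": 1.5},
}

# Average raw size of one logged request
bytes_per_request = raw_bytes_per_day / requests_per_day

# Totals across both collections
total_requests = sum(c["requests"] for c in collections.values())
total_missing_days = sum(c["missing_days"] for c in collections.values())
total_compressed_tb = sum(c["compressed_tb"] for c in collections.values())

print(f"{bytes_per_request:.0f} bytes per request")
print(f"{total_requests / 1e9:.1f} billion requests")
print(f"{total_missing_days} missing days")
print(f"{total_compressed_tb:.2f} TB compressed")
```

The per-collection numbers agree with the rounded overall figures: about 53.5 billion requests, about 275 missing days, and roughly 2.5 TB compressed.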

The dataset is broken into hourly files. The initial line of each file has a set of flags that can be ignored. Each record looks like this:

 XXXXADreferrer  host  path

where XXXX is the timestamp (32-bit Unix epoch in seconds, in little endian order), A is the user-agent flag (“B” for browser or “?” for other, including bots), D is the direction flag (“O” for external traffic to IU, “I” for internal traffic to outside IU), referrer is the referrer hostname or URL (terminated by newline), host is the target hostname (terminated by newline), and path is the target path (terminated by newline).
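
A reader for this record layout might look like the Python sketch below. It is based only on the description above and makes several assumptions: the file has already been decompressed, the initial flags line is newline-terminated, and text fields can be decoded as Latin-1. The `ClickRecord` type, the `read_records` function, and the sample bytes are hypothetical.

```python
import io
import struct
from typing import BinaryIO, Iterator, NamedTuple

class ClickRecord(NamedTuple):
    timestamp: int        # Unix epoch seconds
    is_browser: bool      # "B" = browser, "?" = other (including bots)
    from_inside_iu: bool  # "I" = internal traffic to outside IU, "O" = external traffic to IU
    referrer: str
    host: str
    path: str

def read_records(stream: BinaryIO) -> Iterator[ClickRecord]:
    stream.readline()  # skip the initial flags line (assumed newline-terminated)
    while True:
        header = stream.read(6)  # 4-byte timestamp + agent flag + direction flag
        if len(header) < 6:
            return  # end of file (or a truncated trailing record)
        timestamp, agent, direction = struct.unpack("<Icc", header)
        # The three text fields are each terminated by a newline
        referrer, host, path = (
            stream.readline().rstrip(b"\n").decode("latin-1") for _ in range(3)
        )
        yield ClickRecord(timestamp, agent == b"B", direction == b"I",
                          referrer, host, path)

# Made-up sample: one flags line followed by one record.
# The timestamp 0x47CB0A00 deliberately contains a 0x0A (newline) byte.
sample = (b"flags-to-ignore\n"
          + struct.pack("<I", 0x47CB0A00) + b"B" + b"I"
          + b"example.com\n" + b"www.example.org\n" + b"/index.html\n")

records = list(read_records(io.BytesIO(sample)))
print(records[0])
```

The fixed-size header is read by byte count rather than by line because the binary timestamp can itself contain a newline byte, so splitting the file on newlines alone would misalign the records.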

The Click Dataset is large (~2.5 TB compressed), which requires that it be transferred on a physical hard drive. You will have to provide the drive as well as pre-paid return shipment. Additionally, the dataset might contain bits of stray personal data, so you will have to sign a data security agreement. Indiana University requires that you follow these instructions to request the data.

Citation information and FAQs are available on the team’s page at http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset .