Public data resources: research-quality, free data mining data sets

Announcing the Article Search API – Open Blog –

Text mining` article` api` text` corpus` newspaper

Information Extraction: The RISE Repository of Information Sources

Text mining` information` text mining` extraction` reviews` jobs


Text mining` links` text mining` books` rdf` ocr` documents

API Documentation — BackType

Text mining` api` blog` comments` text mining` stream` trends` backtype` queryminer

Free book usage data from the University of Huddersfield » “Self-plagiarism is style”

Text mining` books` library` borrowing` recommender` isbn` recommendation` collaborative` filtering

ICWSM 2009 – International AAAI Conference on Weblogs and Social Media

Text mining` blog` crawl` corpus` network` web` link

The Obama-Biden Transition Team | Join the Discussion: Healthcare

Text mining` textmining` opinion` comment` topic` government` queryminer

Opinion Extraction, Opinion Mining, Sentiment Analysis, Summarization of Customer Reviews

Text mining` sentiment` mining` classification` machine learning` reviews` recommender` text mining` links

Text mining` wikipedia` named entity` tagged` text mining

Building a (fast) Wikipedia offline reader

Text mining` django` wikipedia` compressed` text mining` howto

Reddit’s Secret API

Text mining` reddit` api` json`

phishingcorpus [JoseWiki]

Text mining` phishing` corpus` text` email` text mining` nlp` mail` security

Wikipedia Datasets for the Hadoop Hack | Cloudera

Text mining` wikipedia` hadoop` textmining` links

Main Task QA Data

Text mining` question` answering` trec` nlp` machinelearning

The New York Times Annotated Corpus « YooName – named entity recognition

Text mining` named entity` nytimes` corpus` people` organizations` locations

ADL Gazetteer Development

Text mining` named entity` location` place names` geo` nlp` natural language processing

Beautiful Data – WikiContent

Text mining` book` data` wiki` via:jhammerb

Web FAQ collection | ILPS

Text mining` faq` question_answering` questions` web` crawl` corpus` xml` textmining

Wikipedia:Lists of common misspellings/For machines – Wikipedia, the free encyclopedia

Text mining` spelling` misspelling` wikipedia

build.kiva: Blog – Introducing the Kiva API

Business and Finance` finance` api` social` kiva` microlending` lending

Visualizing the Growth of Target, 1962-2008 | FlowingData

Business and Finance` visualization` retail` finance` gis` map` location` store` via:magnetbox

The Economy According To Mint

Business and Finance` finance` commercial` consumer` mint` spending

Best Buy Remix – Welcome to the Best Buy Remix Developer Network

Business and Finance` retail` data` api` product` bestbuy

Behavioral Targeting, Analytics and Advertising Service for Publishers, Ad Networks

Business and Finance` analytics` audience` segmentation` toolbar` commercial` sem` search` advertising

Executive PayWatch Database

Business and Finance` ceo` compensation` pay` economics` business` labor

TradingSolutions – Data Sources

Business and Finance` trading` finance` s` api` list

Netflix API – Welcome to the Netflix Developer Network

Business and Finance` netflix` api` movie` mashup` netflixprize` ratings

Open beats Closed: Best Buy’s new APIs – O’Reilly Radar

Business and Finance` retail` bestbuy` api


Business and Finance` custom` research` retail` finance` market` service` analyst`

University of Arkansas – Daily Headlines

Business and Finance` retail` dillards` uark

developerWorks Interviews: Massive data mining and the resurgent mainframe

Business and Finance` price` retail` transaction` sams_club` dillards

opentick :: market data

Business and Finance` opentick` nasdaq` finance` stock

U.S. Company Filings and Annual Reports

Business and Finance` finance` links` sec

FTP Information – EDGAR Database

Business and Finance` edgar` finance` sec` filing` ftp` instructions

Data Mining For Investing

Business and Finance` investing` finance` datamining` announcement` sec` filing` links

UN General Assembly Voting Data

Government` un` voting` statistics` government

Research Datasets :: CID Data :: Center for International Development at Harvard University (CID)

Government` economics` international` development

Government` government` banking` csv` tarp` bailout

Data Catalog

Government` dc` government` feeds` transparency` opendata

Announcing the New York Times Campaign Finance API – Open – Code – New York Times Blog

Government` nyt` api` campaign` donations` fec`

Voter registration data; or, HERE IS YOUR HOPE, YOU FOOLS! « The Edge of the American West

Government` voter` registration` politics` 2008

import/parse/ at master from aaronsw’s watchdog — GitHub

Government` fec` python` parser` government` campaign

The Watchdog Project: volunteer

Government` government` transparency` parsing` election` python

Dataset of the day: Where are the Obamacans? | Off the Map – Official Blog of FortiusOne

Government` obama` government` mashup` gis` geo` map` campaign` donations

Normalized Campaign Contribution Data

Government` cmu` politics` campaign` donations` fec` via:jhammerb` government

Crime data bonanza!!!

Government` timeseries` crime` statistics` publicdata

Ohio voter registration data

Government` voter` voting` politics` government` name` address` registration

Voter List Data Files – Election Department, Clark County, Nevada

Government` voting` voter` registration` name` address` data` election` politics


Government` UN` publicdata` government` statistics

RealClearPolitics – Election 2008 – Democratic Presidential Nomination

Government` polls` politics

Crime in the United States 2006

Government` crime` fbi

Daily Kos: Obama helps us track $1,000,000,000,000 of federal spending

Government` corruption` government` politics` finance`

Welcome to

Government` government` money` politics`

Campaign Finance Reports and Data

Government` campaign` politics` elections

ERS/USDA Data – International Macroeconomic Data Set

Government` usda` economics` population` cpi` gdp` income

State Agency Databases – GODORT

Government` government` directory` links` wiki` states

National Bureau of Economic Research: Data

Government` economics` links

Bureau of Labor Statistics Data

Government` economics` lumber` building` materials` homedepot

NBI ASCII Files – Bridge – FHWA

Government` government` bridges` safety

Twitter API Wiki / REST API Documentation: Social Graph Methods

Network Analysis` graph` network` api` social` twitter

Using the Wikipedia link dataset — Henry Haselgrove

Network Analysis` graph` network` link` wikipedia` pagerank

twibs : find the businesses on twitter

Network Analysis` directory` businesses` twitter` companies

Massive Scrape of Twitter’s Friend Graph « – Organizing Huge Information Sources

Network Analysis` textmining` twitter` network` socialnetwork` pagerank` graph` queryminer

Twitter Scrape (rough draft) – get.theinfo | Google Groups

Network Analysis` twitter` socialnetwork` graph

: Downloads 32

Network Analysis` wikipedia` named_entity` rdf` ontology

ICWSM 2009 – International AAAI Conference on Weblogs and Social Media

Network Analysis` blog` crawl` corpus` network` web` link

Linked Movie Data Base

Network Analysis` rdf` movies` movie` api

YouTube Dataset

Network Analysis` youtube` research` crawl` socialnetwork` network` graph` web

API Documentation – Twitter Development Talk | Google Groups

Network Analysis` twitter` text` api


Network Analysis` wireless` RF` radio` signal` dartmouth` network

Yahoo! Music API – YDN

Network Analysis` api` yahoo` music` artists

Lookery Developer Network – Lookery Developer Resources

Web Analytics` web` analytics` api` traffic` advertising` demographics` lookery

True Marble Imagery – Free Download

Spatial Analysis` gis` geo` map` mapping` images` satellite

Zillow – Labs – Neighborhood Boundaries

Spatial Analysis` neighborhoods` geo` gis` maps

Full Examples — PyMVPA Home

Image Analysis and Video Analysis` fmri` neuroscience` python` neuralnetwork

HumanScan : BioID : Downloads : BioID Face Database

Image Analysis and Video Analysis` face` detection` image

Face Detection

Image Analysis and Video Analysis` facerecognition` opencv` face` links

NORB Object Recognition Dataset, Fu Jie Huang, Yann LeCun, New York University

Image Analysis and Video Analysis` image` 3d

LIFE photo archive hosted by Google

Image Analysis and Video Analysis` images` photo` pictures` search

Activity Recognition: Datasets, Bibliography and others

Image Analysis and Video Analysis` activity` recognition` intent

Frontal Face Databases

Image Analysis and Video Analysis` facerecognition` face` image` recognition

Copyright Free and Public Domain Media

Image Analysis and Video Analysis` images` audio` publicdata` maps` video` free

Databases you can use for benchmarking

Image Analysis and Video Analysis` image` vision` recognition`

2007 IEEE AVSS Detection and Tracking Algorithm Datasets

Image Analysis and Video Analysis` tracking` video` detection` image` recognition` vehicle` pedestrian`


Image Analysis and Video Analysis` image` recognition` detection` pedestrian` thermal` tracking` facerecognition` illumination

Carnegie Mellon University – CMU Graphics Lab – motion capture library

Image Analysis and Video Analysis` gait` pedestrian` walk` motion

public domain sounds | free sound library

Audio Analysis` sound` publicdomain` audio

Full Examples — PyMVPA Home

Bioinformatics` fmri` neuroscience` python` neuralnetwork

CinC Challenge 2000 data sets

Medical Informatics` timeseries` machinelearning` ecg` health` medical` sleep` apnea

UC Berkeley. Sheldon Margen Public Health Library. Statistical/Data Resources

Healthcare Analytics` health` links` resources` publichealth` berkeley

Google Flu Trends | How does this work?

Healthcare Analytics` google` health` trends` search` prediction` epidemiology` biodefence` queries

Eigenvector Research, Inc. : Data Sets Available to Download

Chemoinformatics` NIR` spectra` chemistry` semiconductor` pharmaceutical` matlab`

Vaccines: IIS/Tech/Deduplication Test Cases

Healthcare Analytics` duplicate

Health Data Tools and Statistics

Healthcare Analytics` health` information` public` publicdata

Cardiac MRI dataset – York University

Healthcare Analytics` mri` cardiac

NACDA: Search Holdings

Demography` aging` statistics` studies

Poverty Data Sets General Information

Demography` poverty` statistics

Pew Internet & American Life Project

Demography` internet` demographics` online` web

The 2000 U.S. Census: 1 Billion RDF Triples

Demography` gis` census` rdf` semantic` sparql

Download Database –

Sports Analysis` baseball` database` publicdata` statistics` sports

It’s a Pitch-by-Pitch Scouting Report, Minus the Scout – New York Times

Sports Analysis` baseball` gameday

BART – For Developers

Network Analysis` urban` transportation` feeds` public` sanfrancisco` bart` api`

Tim Davis: UF Sparse Matrix Collection : sparse matrices from a wide range of applications

Matrices` sparse` matrix

Idealware: Mapping Blues: Where is the Data?

Pre-processing` resources` links` mapping

Amazon Web Services Public Datasets » Data Wrangling Blog

Amazon Web Services` amazon` ebs` ec2` s3` publicdata` hadoop

Amazon Web Services (AWS) Hosted Public Data Sets

Hosted Datasets` amazon` ebs` publicdata

WSCD09: Workshop on Web Search Click Data 2009

Web Analytics` workshop` search` web` microsoft` log`

downloading – flossmole – Google Code – How to get FLOSSmole data for your own use

Google` opensource` project` activity` mysql` dump

Multi-Domain Sentiment Dataset

Supervised` sentiment` review` product` amazon

Chris Pound’s Name Generation Page

 bizarre` scifi` phrase` name` word` generators` random` perl

Big Huge Thesaurus API: Access 145,000 Words and Phrases

Phrases` webservice` api` thesaurus` textmining` nlp` rest`

Search Query Performance report – Google AdWords Help Center

Performance` adwords` ppc` search` metrics` webanalytics` sem` query` queryminer

Wordze Keyword Research Tool

Web Analytics` queryminer` keyword` tool` research` commercial` search` adwords

Searchable Catalogs of Data

Network Analysis` links` catalogs` social

radiohead – Google Code

Audio Analysis` lidar` visualization` radiohead` google` video

80 Million Tiny Images

Image Analysis and Video Analysis` images` words` english` search` visualization` imagemap

Time Series Center | Harvard University

Temporal Analysis` timeseries` anomaly` detection` astronomical` physics

OpenVisuals – Open Source Visualization Framework

Image Analysis and Video Analysis` visualization` community` design` processing

BGN: Domestic Names – State and Topical Gazetteer Download Files

Demography` gis` usgs


Random` benchmark` clustering` regression` machinelearning` list` statistics` mathematics

Isomap Datasets

Image Analysis and Video Analysis` nonlinear` dimensionality` reduction` faces` digits` images` manifold

Yahoo! Search Blog: BOSS — The Next Step in our Open Search Ecosystem

oss` api` open` search` yahoo` BOSS` queryminer

Download the Database – IP Address Lookup – Community Geotarget IP Project

Network Analysis` geocoding` geoip` internet` ip` ipaddress` mysql

Airline Data Project

Government` airline` statistics` finance` revenue` location` travel

Show Us a Better Way: What public data is already available?

Government` statistics` census` uk` school` news` publicdata

NGA: Country Files

Government` country` cities` geo

OHPI – Traffic Volume Trends

Government` government` traffic` statistics` trends` transportation

Semantic Search the US Library of Congress

Government` via:inkdroid` libraries` mashup` rdf` semantic` search` semanticweb` books

Ask Reddit: Where to download a DB dump of Reddit?

Text mining` reddit` socialnetwork` news` web

Collaborative filtering dataset – dating agency

Text mining` collaborative` filtering` dating` rating` profiles` czech

About Us – Predictify

Business and Finance` predictionmarket` tool` finance` buzz` advertising` marketing` startup` mmds

| Video Games, Charts, News, Forums, Reviews, Wii, PS3, Xbox360, DS, PSP

Business and Finance` sales` ranking` videogames` retail

Store Level Information

Business and Finance` retail` finance` sales` store`

Code for querying and downloading Flickr images

Image Analysis and Video Analysis` image` python` code` flickr` matlab` recognition

Image Parsing Datasets

Image Analysis and Video Analysis` image` recognition

TAGora » Data

Network Analysis` tag` tagging` s

TAGora » Data

Network Analysis` netflixprize` imdb` sparql

Quality of Life Grand Challenge Dataset: Kitchen Capture

Image Analysis and Video Analysis` machinelearning` motion` capture` sensor

Summize Twitter Search API

Text mining` api` buzz` opinion` trends` text` twitter` summize` search

2008 IEEE InfoVis Contest Dataset

Image Analysis and Video Analysis` visualization` contest` scalability` motion` tracking` pedestrian` sensor

IMDb Pro : Scary Movie 4: Box office

Business and Finance` movie` revenue` sales` box_office` imdb` commercial` movie_study

Spider-Man 2 (2004) – Daily Box Office Results

Business and Finance` movie` revenue` box_office`

Live Search : xRank™ Celebrity — check out who’s hot and who’s not!

Network Analysis` search` query` volume` trends` celebrity` prediction` buzz` named_entity

Free Trial Signup

Business and Finance` movie` revenue` timeseries` imdb` commercial` subscription

Free time-series and micro-data to download

Business and Finance` economics` links

PyGTrends: Python API for Google Trends Data

 google` trends` search` web` analytics` api` code` python` hack

Official Google Blog: A new flavor of Google Trends

 google` trends` search` query` api` csv` keyword` timeseries

Open Research – the Data: Lastfm-ArtistTags2007 – Duke Listens!

 music` tagging` artists` tags` collaborative` filtering

i2b2: Informatics for Integrating Biology & the Bedside

 medical` obesity`

Tiger Data Set Lecture

 tiger` gis` lectures

Google To Launch Large Scale Geo-Services

 geo` google` gps` location` geolocation` cell` wifi` api` gis

’s Playground

 celebrity` misspelling` spelling` names

: U.S. Customs Database and Competitive Intelligence Tools

 commercial` shipping` imports` exports` finance` datamining

Directory Listing of Betfair price files

 betting` prediction` betfair` price` csv` predictionmarket

Reuters Spotlight – Article and Media API

 news` text` articles` api` content` media` xml` images` publicdata

DataSets – Scikits – Trac

 scipy` python` machinelearning` statistics` resource

[Wikitech-l] page counters

 wikipedia` pageviews` trends` textmining` seo` topic

Wikipedia article traffic statistics

 via:chl` wikipedia` web` analytics` seo` topic` textmining` traffic

Yahoo! Internet Location Platform – YDN

 yahoo` geo` geocoding` location` landmarks` gis

How to find images on the internet « Random knowledge

 images` links` lists` archive`

Yahoo offers geographic data to Web sites | Tech news blog – CNET

 gis` webservice` yahoo` api` location` landmark

Instructions for Obtaining Search Engine Transaction Logs

 query` search` log` excite` altavista` alltheweb` transaction

TechTC – Technion Repository of Text Categorization Datasets

 datamining` textmining` categorization` classification` odp` directory` text

The TechTC-100 Test Collection for Text Categorization

 textmining` classification` category` odp` directory

FEC Election Contributions: Download Detailed Files by Election Cycle

 individual` donations` government` election` publicdata` fec

Juiced Google Analytics Python API: Juice Analytics

 search` statistics` keywords` analytics` api` python` web` seo` google

Country Name and ISO 3166 Code MySQL Import File

 mysql` states` countries` isocode

geocoded Hotels « GeoNames Blog

 hotels` geonames`

GeoNames webservice and data download

 locations` cities` countries` gis

Index of /download/worldcities

 cities` gis

ualberta dependency based thesaurus and word count data

 corpus` text` similarity` terms

CommonCrawl – About

 web` crawler` bot`

Data sets and corpus / corpora for biological literature and text mining  

 bioinformatics` text` corpora` domainspecific` genomics` corpus`

Office of Defects Investigation (ODI), Flat File Downloads

 defect` recall` automobile` fightclub` nhtsa` safety

p2psim – kingdata : DNS server latency network distance matrices

 distance` matrix` network` p2p` dns` latency` nmf` queryminer

Sep Kamvar / Personalization /

 pagerank` web` matrix` matlab

 opentick` trading` beta` feeds` finance

WikiXMLDB: Querying Wikipedia with XQuery

 wikipedia` xml` ec2

» Blog Archive » Walmart Growth Video

 walmart` visualization` video` freebase` store` retail` locations` opening

Open Cell Id dataset – phone geolocation from GSM cellids

 gis` mobile` geolocation

The Cornell Web Lab – The Cornell Web Lab

 cornell` web` archive` hadoop` crawl

im2gps: estimating geographic information from a single image

 imagerecognition` via:csantos` gis` cmu` gps` imageprocessing` paper` hack` freaking_awesome

Datasets: MUSCLE WP2 Evaluation, Integration and Standards

 image` video` audio` currency` sports` imagerecognition

Open Economics – Store – Index

 economics` list

welcome @ omdb

 free` movie` database` netflixprize

Cogblog » Blog Archive » Cogmap APIs

 api` cogmap` person` name` organization` record_linkage

Wal-Mart : Freebase – The World’s Database

 retail` locations` stores

Cogmap: The Org Chart Wiki

 record_linkage` identity` name` organization` orgchart` marketing

German English Parallel Corpus “de-news”, Daily News 1996-2000

 german` translation` corpus` english` text` via:maxme

Welcome to the CRCNS data sharing activity website — CRCNS

 neuroscience` patch` clamp` recordings` neuron` timeseries` patchclamp` data` neural

Free Redistributable Rich Data Sets

 aggregator` links

Frequent Itemset Mining Dataset Repository

 retail` clickstream` traffic` web` links` sales

Dolores Labs Blog » Blog Archive » Our color names data set is online

 colormap` color` mechanicalturk

-> Registration

 teradata` retail` transactional` database

Pascal Learning Challenge Large Datasets

 large` competition` challenge` svm` machinelearning` scalability

ECIS 2007 – The 15th European Conference on Information Systems

 retail` dillards` sams_club

Alexa Web Search

 alexa` aws` web` search` api`

State and Federal Case Law

 creativecommons` court` legal` law` via:inkdroid

Access to Web Research Collections VLC2/WT10g/WT2g

 blog` web` text

Lyricsfly Lyrics API, database access to search for music artist and song title

 song` lyrics` database` api`

99 Wikipedia Sources Aiding the Semantic Web » AI3:::Adaptive Information

 links` directory` record_linkage` extraction` wikipedia` named_entity` recognition` textmining` semanticweb

AudioScrobbler Data

 audioscrobbler` recommendation` collaborative` filtering` music

The Linking Open Data dataset cloud

 directory` rdf` semantic` data` soup` graph

Free Economic Data | Economic, Financial, and Demographic Data

 finance` economics` portal` links

::MLSP 2008::: MLSP competition

 machinelearning` trading` competition` backtest` matlab` code` finance` via:DeliciousRob

Computer Vision Test Images

 computer` vision` image` ray` trace` fingerprint` stereo` detection` via:chl

The Dataverse Network Project | The Dataverse Network Project

 statistics` repository` harvard

DVN – Home

 harvard` repository` social` science` research` portal` links

Temperature data (HadCRUT3 and CRUTEM3)

 climate` temperature` netcdf

MNIST handwritten digit database, Yann LeCun and Corinna Cortes

 handwriting` mnist` image` recognition

LFW : Labelled Faces in the Wild

 facerecognition` face` recognition` umass` image

Making random contacts – (37signals)

 generator` names

Test (Sample) Data Generators

 generator` tools` list` via:jd

Compete – Compete Developer Resources

 compete` api` web` statistics` traffic` analytics` mashup

Machine Learning (Theory) » The Peekaboom Dataset

 peekaboom` vision` image` large` human` computation` machinelearning` recognition

Ocean Processes and Modeling: Ocean Data

 links` oceanography` satellite

BlogoCenter data sets

 blog` ucla

Tagged datasets for named entity recognition tasks

 nlp` corpus` tagged` named_entity` recognition` list

stats – deli.ckoma

The Financial Data Finder A – G

 finance` links

Freebase Wikipedia Extraction (WEX)

 wikipedia` xml` structured` corpus


 arxiv` api` open` paper` academic`

England Football Results Betting Odds | Premiership Results & Betting Odds

 gambling` soccer` football` excel` statistics

HughesData – Main – Hughes Lab

 rna` bioinformatics` microarray` expression` gene` machinelearning

Stanford MicroArray Database

 bioinformatics` microarray` expression` gene` machinelearning` stanford

ArrayExpress Home

 bioinformatics` microarray` expression` gene` machinelearning

Gene Expression Omnibus (GEO) Main page

 bioinformatics` microarray` expression` gene` machinelearning

Index of /

 corpus` text` legal` law` court` ruling` opensource` publicdata

Welcome to Openvest

 python` finance` edgar` pylons` matplotlib` sec` webservice` via:jolby

Statistical Science Web: Data Sets

 links` statistics

Text Mining, Visualization and Social Media

 crawler` blog` corpus

Aleix Face Database

 facerecognition` machinelearning` face` image

Data Repository Evaluation

 umd` links` statistics` government` sports` via:rickladd

PMC FTP Service

 biology` medicine` articles` text` journal` authors

“uspop2002” data set

 music` similarity` machinelearning

Internet Archive: Details: Amazon ASIN listing and similarity graph

 ASIN` amazon` recommendation` collaborative` filtering` via:keyvowel

European Climate Assessment Daily Weather Data

 weather` europe` ascii` netcdf

StatLib—Datasets Archive

 machinelearning` datamining` cmu` link` collection

National Household Travel Survey (NHTS) Data

 driving` transportation` publicdata

Nielsen BookScan USA

 books` sales` commercial

Home – Numbrary

 finance` data`

About – Numbrary

 searchengine` search` tagging` aggregator` numeric` extraction` tables` collaboration` web2.0

Main Page – OpenTextMining

 textmining` open` nature` standards` search

Metafilter Infodump

 metafilter` comments` network` via:chl

WEBSPAM-UK2007 | Datasets | Web Spam Detection

 web` search` spam` crawler` yahoo

Trust network datasets – TrustLet

 socialnetwork` trustnetwork` trust

TaskForces/CommunityProjects/LinkingOpenData/DataSets – ESW Wiki

 opendata` semantic` rdf` collaboration

Some Datasets Available on the Web » Data Wrangling Blog

 publicdata` links

, Public Data, and the Semantic Web

 semanticweb` rdf` congress` politics` government

CiteULike: Available datasets

 networks` research` graph` tags` paper` record_linkage

 archive` internet` web` index`

Challenge: Synopsis – Causality Workbench

 competition` machinelearning` forecasting` contest

Natural Language Processing

 microsoft` text` paraphrase` corpus

LDC – Linguistic Data Consortium – Obtaining Data Resources

 nlp` text` corpus` ngram` google` commercial` license

1990 Census Name Files

 census` names` identity` frequency` record_linkage

Given Name Frequency Project: Analysis of Given Name Popularity

 name` record_linkage` text` identity` code

Email Datasets

 enron` names` identity` text` record_linkage

ZoomInfo – Welcome to the ZoomInfo Developer API

 api` identity` people` webservice` record_linkage

Name Discrimination Data Named Entity Resolution /  Entity Disambiguation

 record_linkage` corpus` nlp` names

Developers Area – eBay Market Data Documentation – eBay Market Data Documentation

 ebay` api` retail` price` code

New SwetoDblp RDF dataset released with 11M triples

 name` authorship` rdf` record_linkage

LSDIS : SwetoDblp

 bibliography` rdf` ontology` duplicate` name` record_linkage

StrikeIron Super Data Pack Web Service 1.0 – StrikeIron Marketplace

 webservice` publicdata` datacleaning

Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets

 duplicate` detection` record_linkage` datacleaning` text

INFO 747 – Social and Economic Data

 datacleaning` record_linkage` video` lectures` course` cornell` economics` finance` publicdata

Affiliate Program

 retail` overstock` sales` api` product` price` forecasting

Amazon Web Services Developer Connection : Can Alexa WS provide detailed …

 finance` alexa` amazon` tech

Market Data — eBay Developers Program

 ebay` retail` pricing` sales` api` product

Machine Learning and Data Mining – Datasets

 face` image

GIS for Schools

 epidemiology` gis` health

Google Trends API coming soon | Tech news blog – CNET

 google` trends` api`

MIT Media Lab: Reality Mining

 social` activity` location` cell` gis

RL Competition 2008 – Home

 machinelearning` reinforcement` agent` competition`

Vehicle Routing Data Sets

 optimization` vehicle` routing

EIA – Petroleum Data, Reports, Analysis, Surveys

 oil` energy` statistics` economics` petroleum

DMOZ100k06 – Michael G. Noll

 search` pagerank` text` tags` content


 machinelearning` CMU` course` projects` graphicalmodel` code` paper

Financial Forecast Center’s Historical Economic and Market Data

 exchangerate` dollar` economics`

Browse Business Cycle Indicators Data

 economics` indicators` time` series

The Numbers Guy : Aspiring to Be the Wikipedia of Numbers

 finance` numberpedia` mechanicalturk` textmining` statistics

Social characteristics of the Marvel Universe

 socialnetwork` graphs` comicbooks

Word Lists Collection

 dictionary` words

See Who’s Editing Wikipedia – Diebold, the CIA, a Campaign

 wikipedia` authorship`

Dataset Generator – Perfect data for an imperfect world.

 tools` generator

Entree Chicago Recommendation Data

 recommender` collaborative` restaurant

community resource guide: i’ve been here before – show me the links

 demographics` maps` gis` statistics` links

Social Science Data on the Net

 economics` social` government` health` labor` links

List of films: A – Wikipedia, the free encyclopedia

 netflix` netflixprize` movie` index` wikipedia`

The arXiv on your harddrive

 paper` corpus` arXiv

Insanely Useful Websites | Sunlight Foundation

 links` transparency` government` politics` congress` reference

Technophilia: Where to find public records online – Lifehacker

 public` records` links

Junk email project

 corpus` email` spam` textmining

Enron Email Dataset

 enron` corpus` email` text` social` network

 finance` cpi` inflation` data

GOS – Geospatial One Stop

 health` gis` epidemiology` links

CIA Factbook Grep in Python

 cia` population` python` code` grep

Miller Center of Public Affairs – Richard Nixon – Oval Office Recordings

 nixon` speech` tapes` audio` mp3` wav` flac

Deborah Jeane Palfrey Legal Defense Fund

 phone` politics

UC San Diego Data Mining Competition – 2007 – Datasets

 housing` refinance` mortgage`

Retail Industry Financial Ratios & Benchmarks

 retail` finance` sales` sqft`

stores | POI Factory

 retail` location` poi

GpsPasSion Forums – ** INDEX OF POI COLLECTIONS **

 retail` poi` location` gis` gps

GPS POI US : Home > Retail Stores

 retail` location` gis

Collective Dynamics Group

 smallworld` networking` socialnetwork` graph

Jester Data download page

 collaborative` filtering` jokes

TricTrac: Video Dataset


Premium Business Information Databases – AlacraWiki

 links` finance` commercial

Index of /edgar

 finance` xml` edgar` sec` code` perl

Mail Index

 EDGAR` sec` mail` text

metafy / Anthracite Idioms

 finance` SEC` scrape` parse` commercial

Retail and Food Services – Time Series Data/Seasonal Factors

 retail` sales` census


 categorization` textmining` detection` tools

Volume of retail sales: Social Trends 33

 retail` sales` uk

 tools` generator` random

Melissa DATA – Lookups

 consumer` data` database` api

FactSet: Data Maven –

 factset` finance`

IBES (Demo)

 finance` ibes` analyst` forecast` wharton

Thomson Financial I/B/E/S Data


Historical Quotes – Yahoo! Finance

 yahoo` finance` stock` price`

Network data

 network` links

Bureau of Labor Statistics Home Page

 statistics` labor` government` consumer

NAR: Research: EHS Data

 housing` sales` finance

RFA – The Industry – Industry Statistics


Chain Store Guide – Retail Locations

 retail` finance` store` locations` gis

Press Releases – Directions Magazine

 retail` gis` store` locations

Energy Information Administration – EIA – Official Energy Statistics from the U.S. Government

 finance` government` energy` historical` forecasts` fuel` oil

Databases you can use for benchmarking


UPC Database: Downloads

 product` upc` database`

Web Crawling / Crawl Datasets at Tobias Escher at the OII

 crawler` benchmark` search` web` links

TechTC – Technion Repository of Text Categorization Datasets

 corpus` text

TMC data archive download site

 traffic` data`

 volume rendering

Computational Vision: Archive

 vision` caltech` image recognition

DC Pedestrian Classification Benchmark

 pedestrian` image` classification` detection

opentick :: home

 finance` economics` feed` free` stock` trading` opentick` opensource

Web as Corpus

 textmining` corpus` concordance` wordlist` n-gram

.:[ packet storm ]:. –

 dictionary` hack` security` wordlist` password

Enron Dataset

 data` mysql` email` energy` text` social network

Splog Blog Dataset

 blog` corpus` spam

Home Page for 20 Newsgroups Data Set

 corpus` text` newsgroup

White Glove Tracking

 crowd sourcing` image` processing` algorithm` collaborative` distributed` web2.0` code` opensource

NOAA Paleoclimatology Program – Coral and Sclerosponge Data

 paleoclimatology` climate` oceanography` coral` sponge` biology

NAICS — North American Industry Classification System

 finance` economics` naics` industry` classifications

Saving Democracy With Web 2.0 –

 democracy` web2.0` mashup` government` funding` article

Congresspedia – Congresspedia

 collaborative` wiki` government` congress` politics` elections` web2.0` directory

Population Estimates Data Sets

 census` data` population` statistics

CRAN Task View: Machine Learning & Statistical Learning

 statistical learning` machine learning` code` R` libraries` cran`

Data for Data Mining

 linkd` datamining` timeseries` text` extraction` socialnetwork

PAIDA – Pure Python scientific analysis package

 python` visualization` library

SUBDUE – Graph Based Knowledge Discovery

 machine learning` network` graph`

AOL search data mirrors

 aol` search`

Python Cheese Shop : shakespeare 0.4

 python` text`

AG’s corpus of news articles

 corpus` nlp` machine learning` textmining

Sampling Techniques for Massive Data – Google Video

 video` machine learning` statistics` matrix` sampling` large` sparse` algorithm` experiment_design

metachronistic » Mirror the Wikipedia

 wikipedia` laptop` install` dump

LETOR: Benchmark Datasets for Learning to Rank

 ranking` search

CN710: Comparative Analysis of Learning Systems (Spring 2006) – Class Project

 machinelearning` algorithm` ogi` bu` greyhound` finance

UrbanSim Home

 python` urban` software` simulation` opensource` GIS` census`

System One – Wikipedia³

 wikipedia` rdf`

System One – Labs

 wikipedia` rdf` tools

Face Recognition Homepage – Databases

 face` algorithm` face recognition` data` image

CBCL SOFTWARE Face data set

 face` seung` algorithm` recognition` image

Text Analytics Solutions from ClearForest

 extraction` finance` semantic` semanticweb` text

23C3 – Mining Search Queries – Google Video

 aol` search` video` talk` algorithm` information retrieval` datamining` machinelearning

Digital History Hacks: Keywords and Clues

 aol` search` query` analysis

Digital History Hacks: Searching for History

 aol` search` query` analysis

The Tom Kyte Blog: An interesting data set…

 aol` search` oracle` database` code

KDD 2005 – KDD Cup 2005: Aug 21-24, Chicago, IL. USA

 query` categorization` algorithm` google

Statistical NLP / corpus-based computational linguistics resources

 corpus` machine learning` text

Ph.d.-student Rasmus Elsborg Madsen

 text` machine learning` context` matlab

Intelligent Web Search and Mining: Tools & Resources

 machine learning` code` links

PageRank Datasets and Code

 pagerank` code` algorithm

Official Google Research Blog: All Our N-gram are Belong to You

 linguistics` google` ngram` nlp` record_linkage

Hyper-threaded Java – Java World

 clustering` algorithm` java` parallel

Statistical Modeling, Causal Inference, and Social Science

 blog` econometrics` finance` machine learning` math` statistics

Structural Analysis of Discrete Data and Econometric Applications,

 books` econometrics` economics` finance` ebook

Kris Brower » Archives » Google Onpage Search Results Analysis

 google` ranking` aol` search` analytics

CSE 250B Fall 2006

 netflixprize` machine learning` course`

Matrix Market

 matrixmarket` matrix`

Estimation of mean values, covariance matrices and imputation of missing values

 imputation` matlab` missing` EM` machinelearning

Face Detection

 face` image

CSE 250B Project 4, Fall 2006

 subset` netflix prize` dimensionality` reduction


 extract` from` graphs` hack` google` trends

cwm – a general purpose data processor for the semantic web

 python` processor` semantic` web` rdf

WebBase Project

 link` analysis` structure` web` crawler` stanford

sam roweis : data

 machine` learning` matlab` python` hackers` image

Flight Data and Weather Data

 flight data` airplane data` weather data` airline route data` aircraft flight data` in-flight analysis` airline on-time data

Survey of Patients Hospital Experiences

 healthcare analytics` subjective outcomes` healthcare customer service` quality of care` patient care` patient surveys

Largest collection of longitudinal hospital care data in the US

 healthcare analytics` healthcare big data` research datasets` national in-patient statistics` local healthcare statistics` in-patient statistics` hospital cost data` hospital use data

Ambulatory healthcare data

 physician visits data` doctor visit data` outpatient care data` private practice physician data` non-federal healthcare data` physician office data

Index of /data/sequence/mnist

 mnist` xml` format

MNIST handwritten digit database


Book-Crossing Dataset

 data` set` collaborative` filtering` datamining` books` movie


 movie` netflix prize` source` netflix

Submissions Guidelines for the Online Movie Database

 movie` source

 plot` synopsis` movie` netflix prize` prize


 netflixprize` prize` european` movie` revenue`

Data dumps – Meta

 mediawiki` wikipedia` import` mysql` sql

“phone ***” ” address *” “e-mail” intitle:”curriculum vitae” – Google Search

 resume` google

 random` generator` database` sql

Lending Club Loan Data


SMS Spam Collection

spam`email`text analysis

Data Sets | Pew Research Center’s Internet & American Life Project


Flickr personal taxonomies


Yahoo Data for Researchers


DBLP Computer Science Bibliography

bibliographies`text mining

ICWSM Spinn3r Challenge 2011 dataset

weblog`blog`social media`network analysis

Quantum Chaotic Thoughts: Facebook100 Data Set

facebook`network analysis`social

Public Data Sets on Amazon Web Services (AWS)

Amazon Web Services` amazon` ebs` ec2` s3` publicdata` hadoop

The ClueWeb09 Dataset

human language`text mining

Data | The World Bank




What is Twitter, a Social Network or a News Media? – WWW’10

twitter`text mining`social

dotbot |

spider`web analytics

help – arXiv Bulk Data Access – Amazon S3

Amazon Web Services` amazon` ebs` ec2` s3` publicdata` hadoop

YouTube Dataset

youtube`image analysis`video analysis

Face Recognition Homepage – Databases

face recognition`facial recognition`image analysis

UCI Network Data Repository

data repositories

Datasets for “The Elements of Statistical Learning”


MovieLens Data Sets | GroupLens Research

movies`video analysis`business

Translation Task – EMNLP 2011 Sixth Workshop on Statistical Machine Translation

translation`human language

Project Gutenberg

books`text mining

About WordNet – WordNet – About WordNet


Aligned Hansards of the 36th Parliament of Canada

canada`parliament`government`text mining

CRCNS – Collaborative Research in Computational Neuroscience – Data sharing

Bioinformatics` fmri` neuroscience` python` neuralnetwork

USENET corpus

usenet`text mining





UCI Machine Learning Repository


Gene Expression Omnibus (GEO) Main page


Social Science Data

social science

IMDB dataset


Stanford Large Network Dataset Collection

network analysis

Google Books n-gram dataset

books`text mining

Million Song Dataset | scaling MIR research

audio analysis

Belly Button Biodiversity 2.0

health informatics`bioinformatics

Datasets – Modeling Online Auctions


2gb of photos of cats

image analysis`pets`cats

Click Dataset | Center for Complex Networks and Systems Research

web analytics

The Electric Rice Cooker — One year of deleted weibos archive

text mining

Registered meteorites that has impacted on Earth visualized – AnalyticBridge


GeoJSON files for real-time Virginia transportation data.


NYPD Crash Data Band-Aid


State Department of Education datasets

student performance`school demographics`standardized test performance`school quality`education