Announcing the Article Search API – Open Blog – NYTimes.com
|
Text mining` article` api` text` corpus` newspaper
|
Information Extraction: The RISE Repository of Information Sources
|
Text mining` information` text mining` extraction` reviews` jobs
|
Repositories
|
Text mining` links` text mining` books` rdf` ocr` documents
|
API Documentation — BackType
|
Text mining` api` blog` comments` text mining` stream` trends` backtype` queryminer
|
Free book usage data from the University of Huddersfield » “Self-plagiarism is style”
|
Text mining` books` library` borrowing` recommender` isbn` recommendation` collaborative` filtering
|
ICWSM 2009 – International AAAI Conference on Weblogs and Social Media
|
Text mining` blog` crawl` corpus` network` web` link
|
Change.gov: The Obama-Biden Transition Team | Join the Discussion: Healthcare
|
Text mining` textmining` opinion` comment` topic` government` queryminer
|
Opinion Extraction, Opinion Mining, Sentiment Analysis, Summarization of Customer Reviews
|
Text mining` sentiment` mining` classification` machine learning` reviews` recommender` text mining` links
|
http://www.yr-bcn.es/semanticWikipedia
|
Text mining` wikipedia` named entity` tagged` text mining
|
Building a (fast) Wikipedia offline reader
|
Text mining` django` wikipedia` compressed` text mining` howto
|
Reddit’s Secret API
|
Text mining` reddit` api` json
|
phishingcorpus [JoseWiki]
|
Text mining` phishing` corpus` text` email` text mining` nlp` mail` security
|
Wikipedia Datasets for the Hadoop Hack | Cloudera
|
Text mining` wikipedia` hadoop` textmining` links
|
Main Task QA Data
|
Text mining` question` answering` trec` nlp` machinelearning
|
The New York Times Annotated Corpus « YooName – named entity recognition
|
Text mining` named entity` nytimes` corpus` people` organizations` locations
|
ADL Gazetteer Development
|
Text mining` named entity` location` place names` geo` nlp` natural language processing
|
Beautiful Data – WikiContent
|
Text mining` book` data` wiki` via:jhammerb
|
Web FAQ collection | ILPS
|
Text mining` faq` question_answering` questions` web` crawl` corpus` xml` textmining
|
Wikipedia:Lists of common misspellings/For machines – Wikipedia, the free encyclopedia
|
Text mining` spelling` misspelling` wikipedia
|
build.kiva: Blog – Introducing the Kiva API
|
Business and Finance` finance` api` social` kiva` microlending` lending
|
Visualizing the Growth of Target, 1962-2008 | FlowingData
|
Business and Finance` visualization` retail` finance` gis` map` location` store` via:magnetbox
|
The Economy According To Mint
|
Business and Finance` finance` commercial` consumer` mint` spending
|
Best Buy Remix – Welcome to the Best Buy Remix Developer Network
|
Business and Finance` retail` data` api` product` bestbuy
|
Behavioral Targeting, Analytics and Advertising Service for Publishers, Ad Networks
|
Business and Finance` analytics` audience` segmentation` toolbar` commercial` sem` search` advertising
|
Executive PayWatch Database
|
Business and Finance` ceo` compensation` pay` economics` business` labor
|
TradingSolutions – Data Sources
|
Business and Finance` trading` finance` s` api` list
|
Netflix API – Welcome to the Netflix Developer Network
|
Business and Finance` netflix` api` movie` mashup` netflixprize` ratings
|
Open beats Closed: Best Buy’s new APIs – O’Reilly Radar
|
Business and Finance` retail` bestbuy` api
|
Tickermine
|
Business and Finance` custom` research` retail` finance` market` service` analyst
|
University of Arkansas – Daily Headlines
|
Business and Finance` retail` dillards` uark
|
developerWorks Interviews: Massive data mining and the resurgent mainframe
|
Business and Finance` price` retail` transaction` sams_club` dillards
|
opentick :: market data
|
Business and Finance` opentick` nasdaq` finance` stock
|
U.S. Company Filings and Annual Reports
|
Business and Finance` finance` links` sec
|
FTP Information – EDGAR Database
|
Business and Finance` edgar` finance` sec` filing` ftp` instructions
|
Data Mining For Investing
|
Business and Finance` investing` finance` datamining` announcement` sec` filing` links
|
UN General Assembly Voting Data
|
Government` un` voting` statistics` government
|
Research Datasets :: CID Data :: Center for International Development at Harvard University (CID)
|
Government` economics` international` development
|
Subsidyscope.com
|
Government` government` banking` csv` tarp` bailout
|
Data Catalog
|
Government` dc` government` feeds` transparency` opendata
|
Announcing the New York Times Campaign Finance API – Open – Code – New York Times Blog
|
Government` nyt` api` campaign` donations` fec
|
Voter registration data; or, HERE IS YOUR HOPE, YOU FOOLS! « The Edge of the American West
|
Government` voter` registration` politics` 2008
|
import/parse/fec.py at master from aaronsw’s watchdog — GitHub
|
Government` fec` python` parser` government` campaign
|
The Watchdog Project: volunteer
|
Government` government` transparency` parsing` election` python
|
Dataset of the day: Where are the Obamacans? | Off the Map – Official Blog of FortiusOne
|
Government` obama` government` mashup` gis` geo` map` campaign` donations
|
Normalized Campaign Contribution Data
|
Government` cmu` politics` campaign` donations` fec` via:jhammerb` government
|
Crime data bonanza!!!
|
Government` timeseries` crime` statistics` publicdata
|
Ohio voter registration data
|
Government` voter` voting` politics` government` name` address` registration
|
Voter List Data Files – Election Department, Clark County, Nevada
|
Government` voting` voter` registration` name` address` data` election` politics
|
UNdata
|
Government` UN` publicdata` government` statistics
|
RealClearPolitics – Election 2008 – Democratic Presidential Nomination
|
Government` polls` politics
|
Crime in the United States 2006
|
Government` crime` fbi
|
Daily Kos: Obama helps us track $1,000,000,000,000 of federal spending
|
Government` corruption` government` politics` finance
|
Welcome to USAspending.gov
|
Government` government` money` politics
|
Campaign Finance Reports and Data
|
Government` campaign` politics` elections
|
ERS/USDA Data – International Macroeconomic Data Set
|
Government` usda` economics` population` cpi` gdp` income
|
State Agency Databases – GODORT
|
Government` government` directory` links` wiki` states
|
National Bureau of Economic Research: Data
|
Government` economics` links
|
Bureau of Labor Statistics Data
|
Government` economics` lumber` building` materials` homedepot
|
NBI ASCII Files – Bridge – FHWA
|
Government` government` bridges` safety
|
Twitter API Wiki / REST API Documentation: Social Graph Methods
|
Network Analysis` graph` network` api` social` twitter
|
Using the Wikipedia link dataset — Henry Haselgrove
|
Network Analysis` graph` network` link` wikipedia` pagerank
|
twibs : find the businesses on twitter
|
Network Analysis` directory` businesses` twitter` companies
|
Massive Scrape of Twitter’s Friend Graph « blog.infochimps.org – Organizing Huge Information Sources
|
Network Analysis` textmining` twitter` network` socialnetwork` pagerank` graph` queryminer
|
Twitter Scrape (rough draft) – get.theinfo | Google Groups
|
Network Analysis` twitter` socialnetwork` graph
|
wiki.dbpedia.org : Downloads 32
|
Network Analysis` wikipedia` named_entity` rdf` ontology
|
ICWSM 2009 – International AAAI Conference on Weblogs and Social Media
|
Network Analysis` blog` crawl` corpus` network` web` link
|
Linked Movie Data Base
|
Network Analysis` rdf` movies` movie` api
|
YouTube Dataset
|
Network Analysis` youtube` research` crawl` socialnetwork` network` graph` web
|
API Documentation – Twitter Development Talk | Google Groups
|
Network Analysis` twitter` text` api
|
CRAWDAD
|
Network Analysis` wireless` RF` radio` signal` dartmouth` network
|
Yahoo! Music API – YDN
|
Network Analysis` api` yahoo` music` artists
|
Lookery Developer Network – Lookery Developer Resources
|
Web Analytics` web` analytics` api` traffic` advertising` demographics` lookery
|
True Marble Imagery – Free Download
|
Spatial Analysis` gis` geo` map` mapping` images` satellite
|
Zillow – Labs – Neighborhood Boundaries
|
Spatial Analysis` neighborhoods` geo` gis` maps
|
Full Examples — PyMVPA Home
|
Image Analysis and Video Analysis` fmri` neuroscience` python` neuralnetwork
|
HumanScan : BioID : Downloads : BioID Face Database
|
Image Analysis and Video Analysis` face` detection` image
|
Face Detection
|
Image Analysis and Video Analysis` facerecognition` opencv` face` links
|
NORB Object Recognition Dataset, Fu Jie Huang, Yann LeCun, New York University
|
Image Analysis and Video Analysis` image` 3d
|
LIFE photo archive hosted by Google
|
Image Analysis and Video Analysis` images` photo` pictures` search
|
Activity Recognition: Datasets, Bibliography and others
|
Image Analysis and Video Analysis` activity` recognition` intent
|
Frontal Face Databases
|
Image Analysis and Video Analysis` facerecognition` face` image` recognition
|
Copyright Free and Public Domain Media
|
Image Analysis and Video Analysis` images` audio` publicdata` maps` video` free
|
Databases you can use for benchmarking
|
Image Analysis and Video Analysis` image` vision` recognition
|
2007 IEEE AVSS Detection and Tracking Algorithm Datasets
|
Image Analysis and Video Analysis` tracking` video` detection` image` recognition` vehicle` pedestrian
|
OTCBVS
|
Image Analysis and Video Analysis` image` recognition` detection` pedestrian` thermal` tracking` facerecognition` illumination
|
Carnegie Mellon University – CMU Graphics Lab – motion capture library
|
Image Analysis and Video Analysis` gait` pedestrian` walk` motion
|
public domain sounds | free sound library
|
Audio Analysis` sound` publicdomain` audio
|
Full Examples — PyMVPA Home
|
Bioinformatics` fmri` neuroscience` python` neuralnetwork
|
CinC Challenge 2000 data sets
|
Medical Informatics` timeseries` machinelearning` ecg` health` medical` sleep` apnea
|
UC Berkeley. Sheldon Margen Public Health Library. Statistical/Data Resources
|
Healthcare Analytics` health` links` resources` publichealth` berkeley
|
Google Flu Trends | How does this work?
|
Healthcare Analytics` google` health` trends` search` prediction` epidemiology` biodefence` queries
|
Eigenvector Research, Inc. : Data Sets Available to Download
|
Chemoinformatics` NIR` spectra` chemistry` semiconductor` pharmaceutical` matlab
|
Vaccines: IIS/Tech/Deduplication Test Cases
|
Healthcare Analytics` duplicate
|
Health Data Tools and Statistics
|
Healthcare Analytics` health` information` public` publicdata
|
Cardiac MRI dataset – York University
|
Healthcare Analytics` mri` cardiac
|
NACDA: Search Holdings
|
Demography` aging` statistics` studies
|
Poverty Data Sets General Information
|
Demography` poverty` statistics
|
Pew Internet & American Life Project
|
Demography` internet` demographics` online` web
|
The 2000 U.S. Census: 1 Billion RDF Triples
|
Demography` gis` census` rdf` semantic` sparql
|
Download Database – baseball1.com
|
Sports Analysis` baseball` database` publicdata` statistics` sports
|
It’s a Pitch-by-Pitch Scouting Report, Minus the Scout – New York Times
|
Sports Analysis` baseball` gameday
|
BART – For Developers
|
Network Analysis` urban` transportation` feeds` public` sanfrancisco` bart` api
|
Tim Davis: UF Sparse Matrix Collection : sparse matrices from a wide range of applications
|
Matrices` sparse` matrix
|
Idealware: Mapping Blues: Where is the Data?
|
Pre-processing` resources` links` mapping
|
Amazon Web Services Public Datasets » Data Wrangling Blog
|
Amazon Web Services` amazon` ebs` ec2` s3` publicdata` hadoop
|
Amazon Web Services (AWS) Hosted Public Data Sets
|
Hosted Datasets` amazon` ebs` publicdata
|
WSCD09: Workshop on Web Search Click Data 2009
|
Web Analytics` workshop` search` web` microsoft` log
|
downloading – flossmole – Google Code – How to get FLOSSmole data for your own use
|
Google` opensource` project` activity` mysql` dump
|
Multi-Domain Sentiment Dataset
|
Supervised` sentiment` review` product` amazon
|
Chris Pound’s Name Generation Page
|
bizarre` scifi` phrase` name` word` generators` random` perl
|
Big Huge Thesaurus API: Access 145,000 Words and Phrases
|
Phrases` webservice` api` thesaurus` textmining` nlp` rest
|
Search Query Performance report – Google AdWords Help Center
|
Performance` adwords` ppc` search` metrics` webanalytics` sem` query` queryminer
|
Wordze Keyword Research Tool
|
Web Analytics` queryminer` keyword` tool` research` commercial` search` adwords
|
Searchable Catalogs of Data
|
Network Analysis` links` catalogs` social
|
radiohead – Google Code
|
Audio Analysis` lidar` visualization` radiohead` google` video
|
80 Million Tiny Images
|
Image Analysis and Video Analysis` images` words` english` search` visualization` imagemap
|
Time Series Center | Harvard University
|
Temporal Analysis` timeseries` anomaly` detection` astronomical` physics
|
OpenVisuals – Open Source Visualization Framework
|
Image Analysis and Video Analysis` visualization` community` design` processing
|
BGN: Domestic Names – State and Topical Gazetteer Download Files
|
Demography` gis` usgs
|
Datasets
|
Random` benchmark` clustering` regression` machinelearning` list` statistics` mathematics
|
Isomap Datasets
|
Image Analysis and Video Analysis` nonlinear` dimensionality` reduction` faces` digits` images` manifold
|
Yahoo! Search Blog: BOSS — The Next Step in our Open Search Ecosystem
|
oss` api` open` search` yahoo` BOSS` queryminer
|
Download the Database – IP Address Lookup – Community Geotarget IP Project
|
Network Analysis` geocoding` geoip` internet` ip` ipaddress` mysql
|
Airline Data Project
|
Government` airline` statistics` finance` revenue` location` travel
|
Show Us a Better Way: What public data is already available?
|
Government` statistics` census` uk` school` news` publicdata
|
NGA: Country Files
|
Government` country` cities` geo
|
OHPI – Traffic Volume Trends
|
Government` government` traffic` statistics` trends` transportation
|
Semantic Search the US Library of Congress
|
Government` via:inkdroid` libraries` mashup` rdf` semantic` search` semanticweb` books
|
reddit.com: Ask Reddit: Where to download a DB dump of Reddit?
|
Text mining` reddit` socialnetwork` news` web
|
Collaborative filtering dataset – dating agency
|
Text mining` collaborative` filtering` dating` rating` profiles` czech
|
About Us – Predictify
|
Business and Finance` predictionmarket` tool` finance` buzz` advertising` marketing` startup` mmds
|
VGChartz.com | Video Games, Charts, News, Forums, Reviews, Wii, PS3, Xbox360, DS, PSP
|
Business and Finance` sales` ranking` videogames` retail
|
Store Level Information
|
Business and Finance` retail` finance` sales` store
|
Code for querying and downloading Flickr images
|
Image Analysis and Video Analysis` image` python` code` flickr` matlab` recognition
|
Image Parsing Datasets
|
Image Analysis and Video Analysis` image` recognition
|
TAGora » Data
|
Network Analysis` tag` tagging` s
|
TAGora » Data
|
Network Analysis` netflixprize` imdb` sparql
|
Quality of Life Grand Challenge Dataset: Kitchen Capture
|
Image Analysis and Video Analysis` machinelearning` motion` capture` sensor
|
Summize Twitter Search API
|
Text mining` api` buzz` opinion` trends` text` twitter` summize` search
|
2008 IEEE InfoVis Contest Dataset
|
Image Analysis and Video Analysis` visualization` contest` scalability` motion` tracking` pedestrian` sensor
|
IMDb Pro : Scary Movie 4: Box office
|
Business and Finance` movie` revenue` sales` box_office` imdb` commercial` movie_study
|
Spider-Man 2 (2004) – Daily Box Office Results
|
Business and Finance` movie` revenue` box_office
|
Live Search : xRank™ Celebrity — check out who’s hot and who’s not!
|
Network Analysis` search` query` volume` trends` celebrity` prediction` buzz` named_entity
|
IMDbPro.com Free Trial Signup
|
Business and Finance` movie` revenue` timeseries` imdb` commercial` subscription
|
Free time-series and micro-data to download
|
Business and Finance` economics` links
|
PyGTrends: Python API for Google Trends Data
|
google` trends` search` web` analytics` api` code` python` hack
|
Official Google Blog: A new flavor of Google Trends
|
google` trends` search` query` api` csv` keyword` timeseries
|
Open Research – the Data: Lastfm-ArtistTags2007 – Duke Listens!
|
last.fm` music` tagging` artists` tags` collaborative` filtering
|
i2b2: Informatics for Integrating Biology & the Bedside
|
medical` obesity
|
Tiger Data Set Lecture
|
tiger` gis` lectures
|
Google To Launch Large Scale Geo-Services
|
geo` google` gps` location` geolocation` cell` wifi` api` gis
|
Last.fm’s Playground
|
celebrity` misspelling` spelling` names
|
ImportGenius.com : U.S. Customs Database and Competitive Intelligence Tools
|
commercial` shipping` imports` exports` finance` datamining
|
Directory Listing of Betfair price files
|
betting` prediction` betfair` price` csv` predictionmarket
|
Reuters Spotlight – Article and Media API
|
news` text` articles` api` content` media` xml` images` publicdata
|
DataSets – Scikits – Trac
|
scipy` python` machinelearning` statistics` resource
|
[Wikitech-l] page counters
|
wikipedia` pageviews` trends` textmining` seo` topic
|
Wikipedia article traffic statistics
|
via:chl` wikipedia` web` analytics` seo` topic` textmining` traffic
|
Yahoo! Internet Location Platform – YDN
|
yahoo` geo` geocoding` location` landmarks` gis
|
How to find images on the internet « Random knowledge
|
images` links` lists` archive
|
Yahoo offers geographic data to Web sites | Tech news blog – CNET News.com
|
gis` webservice` yahoo` api` location` landmark
|
Instructions for Obtaining Search Engine Transaction Logs
|
query` search` log` excite` altavista` alltheweb` transaction
|
TechTC – Technion Repository of Text Categorization Datasets
|
datamining` textmining` categorization` classification` odp` directory` text
|
The TechTC-100 Test Collection for Text Categorization
|
textmining` classification` category` odp` directory
|
FEC Election Contributions: Download Detailed Files by Election Cycle
|
individual` donations` government` election` publicdata` fec
|
Juiced Google Analytics Python API: Juice Analytics
|
search` statistics` keywords` analytics` api` python` web` seo` google
|
Country Name and ISO 3166 Code MySQL Import File
|
mysql` states` countries` isocode
|
geocoded Hotels « GeoNames Blog
|
hotels` geonames
|
GeoNames webservice and data download
|
locations` cities` countries` gis
|
Index of /download/worldcities
|
cities` gis
|
ualberta dependency based thesaurus and word count data
|
corpus` text` similarity` terms
|
CommonCrawl – About
|
web` crawler` bot
|
Data sets and corpus / corpora for biological literature and text mining
|
bioinformatics` text` corpora` domainspecific` genomics` corpus
|
Office of Defects Investigation (ODI), Flat File Downloads
|
defect` recall` automobile` fightclub` nhtsa` safety
|
p2psim – kingdata : DNS server latency network distance matrices
|
distance` matrix` network` p2p` dns` latency` nmf` queryminer
|
Sep Kamvar / Personalization /
|
pagerank` web` matrix` matlab
|
beta.opentick.com
|
opentick` trading` beta` feeds` finance
|
WikiXMLDB: Querying Wikipedia with XQuery
|
wikipedia` xml` ec2
|
kiwitobes.com » Blog Archive » Walmart Growth Video
|
walmart` visualization` video` freebase` store` retail` locations` opening
|
Open Cell Id dataset – phone geolocation from GSM cellids
|
gis` mobile` geolocation
|
The Cornell Web Lab – The Cornell Web Lab
|
cornell` web` archive` hadoop` crawl
|
im2gps: estimating geographic information from a single image
|
imagerecognition` via:csantos` gis` cmu` gps` imageprocessing` paper` hack` freaking_awesome
|
Datasets: MUSCLE WP2 Evaluation, Integration and Standards
|
image` video` audio` currency` sports` imagerecognition
|
Open Economics – Store – Index
|
economics` list
|
welcome @ omdb
|
free` movie` database` netflixprize
|
Cogblog » Blog Archive » Cogmap APIs
|
api` cogmap` person` name` organization` record_linkage
|
Wal-Mart : Freebase – The World’s Database
|
retail` locations` stores
|
Cogmap: The Org Chart Wiki
|
record_linkage` identity` name` organization` orgchart` marketing
|
German English Parallel Corpus “de-news”, Daily News 1996-2000
|
german` translation` corpus` english` text` via:maxme
|
Welcome to the CRCNS data sharing activity website — CRCNS
|
neuroscience` patch` clamp` recordings` neuron` timeseries` patchclamp` data` neural
|
Infochimps.org: Free Redistributable Rich Data Sets
|
aggregator` links
|
Frequent Itemset Mining Dataset Repository
|
retail` clickstream` traffic` web` links` sales
|
Dolores Labs Blog » Blog Archive » Our color names data set is online
|
colormap` color` mechanicalturk
|
TeradataUniversityNetwork.com -> Registration
|
teradata` retail` transactional` database
|
Pascal Learning Challenge Large Datasets
|
large` competition` challenge` svm` machinelearning` scalability
|
ECIS 2007 – The 15th European Conference on Information Systems
|
retail` dillards` sams_club
|
Alexa Web Search
|
alexa` aws` web` search` api
|
State and Federal Case Law
|
creativecommons` court` legal` law` via:inkdroid
|
Access to Web Research Collections VLC2/WT10g/WT2g
|
blog` web` text
|
Lyricsfly Lyrics API, database access to search for music artist and song title
|
song` lyrics` database` api
|
99 Wikipedia Sources Aiding the Semantic Web » AI3:::Adaptive Information
|
links` directory` record_linkage` extraction` wikipedia` named_entity` recognition` textmining` semanticweb
|
AudioScrobbler Data
|
audioscrobbler` recommendation` collaborative` filtering` music
|
The Linking Open Data dataset cloud
|
directory` rdf` semantic` data` soup` graph
|
Free Economic Data | Economic, Financial, and Demographic Data
|
finance` economics` portal` links
|
::MLSP 2008::: MLSP competition
|
machinelearning` trading` competition` backtest` matlab` code` finance` via:DeliciousRob
|
Computer Vision Test Images
|
computer` vision` image` ray` trace` fingerprint` stereo` detection` via:chl
|
The Dataverse Network Project | The Dataverse Network Project
|
statistics` repository` harvard
|
DVN – Home
|
harvard` repository` social` science` research` portal` links
|
Temperature data (HadCRUT3 and CRUTEM3)
|
climate` temperature` netcdf
|
MNIST handwritten digit database, Yann LeCun and Corinna Cortes
|
handwriting` mnist` image` recognition
|
LFW : Labelled Faces in the Wild
|
facerecognition` face` recognition` umass` image
|
Making random contacts – (37signals)
|
generator` names
|
Test (Sample) Data Generators
|
generator` tools` list` via:jd
|
Compete – Compete Developer Resources
|
compete` api` web` statistics` traffic` analytics` mashup
|
Machine Learning (Theory) » The Peekaboom Dataset
|
peekaboom` vision` image` large` human` computation` machinelearning` recognition
|
Ocean Processes and Modeling: Ocean Data
|
links` oceanography` satellite
|
BlogoCenter data sets
|
blog` ucla
|
Tagged datasets for named entity recognition tasks
|
nlp` corpus` tagged` named_entity` recognition` list
|
del.icio.us stats – deli.ckoma
|
del.icio.us
|
The Financial Data Finder A – G
|
finance` links
|
Freebase Wikipedia Extraction (WEX)
|
wikipedia` xml` structured` corpus
|
The arXiv.org API
|
arxiv` api` open` paper` academic
|
England Football Results Betting Odds | Premiership Results & Betting Odds
|
gambling` soccer` football` excel` statistics
|
HughesData – Main – Hughes Lab
|
rna` bioinformatics` microarray` expression` gene` machinelearning
|
Stanford MicroArray Database
|
bioinformatics` microarray` expression` gene` machinelearning` stanford
|
ArrayExpress Home
|
bioinformatics` microarray` expression` gene` machinelearning
|
Gene Expression Omnibus (GEO) Main page
|
bioinformatics` microarray` expression` gene` machinelearning
|
Index of /courts.gov
|
corpus` text` legal` law` court` ruling` opensource` publicdata
|
Welcome to Openvest
|
python` finance` edgar` pylons` matplotlib` sec` webservice` via:jolby
|
Statistical Science Web: Data Sets
|
links` statistics
|
Text Mining, Visualization and Social Media
|
crawler` blog` corpus
|
Aleix Face Database
|
facerecognition` machinelearning` face` image
|
Data Repository Evaluation
|
umd` links` statistics` government` sports` via:rickladd
|
PMC FTP Service
|
biology` medicine` articles` text` journal` authors
|
“uspop2002” data set
|
music` similarity` machinelearning
|
Internet Archive: Details: Amazon ASIN listing and similarity graph
|
ASIN` amazon` recommendation` collaborative` filtering` via:keyvowel
|
European Climate Assessment Daily Weather Data
|
weather` europe` ascii` netcdf
|
StatLib—Datasets Archive
|
machinelearning` datamining` cmu` link` collection
|
National Household Travel Survey (NHTS) Data
|
driving` transportation` publicdata
|
Nielsen BookScan USA
|
books` sales` commercial
|
Home – Numbrary
|
finance` data
|
About – Numbrary
|
searchengine` search` tagging` aggregator` numeric` extraction` tables` collaboration` web2.0
|
Main Page – OpenTextMining
|
textmining` open` nature` standards` search
|
Metafilter Infodump
|
metafilter` comments` network` via:chl
|
WEBSPAM-UK2007 | Datasets | Web Spam Detection
|
web` search` spam` crawler` yahoo
|
Trust network datasets – TrustLet
|
socialnetwork` trustnetwork` trust
|
TaskForces/CommunityProjects/LinkingOpenData/DataSets – ESW Wiki
|
opendata` semantic` rdf` collaboration
|
Some Datasets Available on the Web » Data Wrangling Blog
|
publicdata` links
|
XML.com: GovTrack.us, Public Data, and the Semantic Web
|
semanticweb` rdf` congress` politics` government
|
CiteULike: Available datasets
|
networks` research` graph` tags` paper` record_linkage
|
Archive-It.org
|
archive` internet` web` index
|
Challenge: Synopsis – Causality Workbench
|
competition` machinelearning` forecasting` contest
|
Natural Language Processing
|
microsoft` text` paraphrase` corpus
|
LDC – Linguistic Data Consortium – Obtaining Data Resources
|
nlp` text` corpus` ngram` google` commercial` license
|
1990 Census Name Files
|
census` names` identity` frequency` record_linkage
|
Given Name Frequency Project: Analysis of Given Name Popularity
|
name` record_linkage` text` identity` code
|
Email Datasets
|
enron` names` identity` text` record_linkage
|
ZoomInfo – Welcome to the ZoomInfo Developer API
|
api` identity` people` webservice` record_linkage
|
Name Discrimination Data Named Entity Resolution / Entity Disambiguation
|
record_linkage` corpus` nlp` names
|
Developers Area – eBay Market Data Documentation – eBay Market Data Documentation
|
ebay` api` retail` price` code
|
New SwetoDblp RDF dataset released with 11M triples
|
name` authorship` rdf` record_linkage
|
LSDIS : SwetoDblp
|
bibliography` rdf` ontology` duplicate` name` record_linkage
|
StrikeIron Super Data Pack Web Service 1.0 – StrikeIron Marketplace
|
webservice` publicdata` datacleaning
|
Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets
|
duplicate` detection` record_linkage` datacleaning` text
|
INFO 747 – Social and Economic Data
|
datacleaning` record_linkage` video` lectures` course` cornell` economics` finance` publicdata
|
Overstock.com Affiliate Program
|
retail` overstock` sales` api` product` price` forecasting
|
Amazon Web Services Developer Connection : Can Alexa WS provide detailed …
|
finance` alexa` amazon` tech
|
Market Data — eBay Developers Program
|
ebay` retail` pricing` sales` api` product
|
Machine Learning and Data Mining – Datasets
|
face` image
|
GIS for Schools
|
epidemiology` gis` health
|
Google Trends API coming soon | Tech news blog – CNET News.com
|
google` trends` api
|
MIT Media Lab: Reality Mining
|
social` activity` location` cell` gis
|
RL Competition 2008 – Home
|
machinelearning` reinforcement` agent` competition
|
Vehicle Routing Data Sets
|
optimization` vehicle` routing
|
EIA – Petroleum Data, Reports, Analysis, Surveys
|
oil` energy` statistics` economics` petroleum
|
DMOZ100k06 – Michael G. Noll
|
search` pagerank` text` tags` content
|
Grading
|
machinelearning` CMU` course` projects` graphicalmodel` code` paper
|
Financial Forecast Center’s Historical Economic and Market Data
|
exchangerate` dollar` economics
|
Browse Business Cycle Indicators Data
|
economics` indicators` time` series
|
The Numbers Guy : Aspiring to Be the Wikipedia of Numbers
|
finance` numberpedia` mechanicalturk` textmining` statistics
|
Social characteristics of the Marvel Universe
|
socialnetwork` graphs` comicbooks
|
SourceForge.net: Word Lists Collection
|
dictionary` words
|
See Who’s Editing Wikipedia – Diebold, the CIA, a Campaign
|
wikipedia` authorship
|
Dataset Generator – Perfect data for an imperfect world.
|
tools` generator
|
Entree Chicago Recommendation Data
|
recommender` collaborative` restaurant
|
community resource guide: i’ve been here before – show me the links
|
demographics` maps` gis` statistics` links
|
Social Science Data on the Net
|
economics` social` government` health` labor` links
|
List of films: A – Wikipedia, the free encyclopedia
|
netflix` netflixprize` movie` index` wikipedia
|
The arXiv on your harddrive
|
paper` corpus` arXiv
|
Insanely Useful Websites | Sunlight Foundation
|
links` transparency` government` politics` congress` reference
|
Technophilia: Where to find public records online – Lifehacker
|
public` records` links
|
Junk email project
|
corpus` email` spam` textmining
|
Enron Email Dataset
|
enron` corpus` email` text` social` network
|
ftp://ftp.bls.gov/pub/special.requests/cpi/cpiai.txt
|
finance` cpi` inflation` data
|
GOS – Geospatial One Stop
|
health` gis` epidemiology` links
|
CIA Factbook Grep in Python
|
cia` population` python` code` grep
|
Miller Center of Public Affairs – Richard Nixon – Oval Office Recordings
|
nixon` speech` tapes` audio` mp3` wav` flac
|
Deborah Jeane Palfrey Legal Defense Fund
|
phone` politics
|
UC San Diego Data Mining Competition – 2007 – Datasets
|
housing` refinance` mortgage
|
Retail Industry Financial Ratios & Benchmarks
|
retail` finance` sales` sqft
|
stores | POI Factory
|
retail` location` poi
|
GpsPasSion Forums – ** INDEX OF POI COLLECTIONS **
|
retail` poi` location` gis` gps
|
GPS POI US : Home > Retail Stores
|
retail` location` gis
|
Collective Dynamics Group
|
smallworld` networking` socialnetwork` graph
|
Jester Data download page
|
collaborative` filtering` jokes
|
TricTrac: Video Dataset
|
video
|
Premium Business Information Databases – AlacraWiki
|
links` finance` commercial
|
Index of /edgar
|
finance` xml` edgar` sec` code` perl
|
Mail Index
|
EDGAR` sec` mail` text
|
metafy / Anthracite Idioms
|
finance` SEC` scrape` parse` commercial
|
Retail and Food Services – Time Series Data/Seasonal Factors
|
retail` sales` census
|
TDT
|
categorization` textmining` detection` tools
|
Volume of retail sales: Social Trends 33
|
retail` sales` uk
|
generatedata.com
|
tools` generator` random
|
Melissa DATA – Lookups
|
consumer` data` database` api
|
FactSet: Data Maven – Kiplinger.com
|
factset` finance
|
IBES (Demo)
|
finance` ibes` analyst` forecast` wharton
|
Thomson Financial I/B/E/S Data
|
finance
|
Historical Quotes – Yahoo! Finance
|
yahoo` finance` stock` price
|
Network data
|
network` links
|
Bureau of Labor Statistics Home Page
|
statistics` labor` government` consumer
|
NAR: Research: EHS Data
|
housing` sales` finance
|
RFA – The Industry – Industry Statistics
|
ethanol
|
Chain Store Guide – Retail Locations
|
retail` finance` store` locations` gis
|
Press Releases – Directions Magazine
|
retail` gis` store` locations
|
Energy Information Administration – EIA – Official Energy Statistics from the U.S. Government
|
finance` government` energy` historical` forecasts` fuel` oil
|
Databases you can use for benchmarking
|
links
|
UPC Database: Downloads
|
product` upc` database`
|
Web Crawling / Crawl Datasets at Tobias Escher at the OII
|
crawler` benchmark` search` web` links
|
TechTC – Technion Repository of Text Categorization Datasets
|
corpus` text
|
TMC data archive download site
|
traffic` data`
|
http://www.volvis.org/
|
volume rendering
|
Computational Vision: Archive
|
vision` caltech` image recognition
|
DC Pedestrian Classification Benchmark
|
pedestrian` image` classification` detection
|
opentick :: home
|
finance` economics` feed` free` stock` trading` opentick` opensource
|
Web as Corpus
|
textmining` corpus` concordance` wordlist` n-gram
|
.:[ packet storm ]:. – http://packetstormsecurity.org/
|
dictionary` hack` security` wordlist` password
|
Enron Dataset
|
data` mysql` email` energy` text` social network
|
Splog Blog Dataset
|
blog` corpus` spam
|
Home Page for 20 Newsgroups Data Set
|
corpus` text` newsgroup
|
White Glove Tracking
|
crowd sourcing` image` processing` algorithm` collaborative` distributed` web2.0` code` opensource
|
NOAA Paleoclimatology Program – Coral and Sclerosponge Data
|
paleo climatology` climate` oceanography` coral` sponge` biology
|
NAICS — North American Industry Classification System
|
finance` economics` naics` industry` classifications
|
Saving Democracy With Web 2.0
|
democracy` web2.0` mashup` government` funding` article
|
Congresspedia – Congresspedia
|
collaborative` wiki` government` congress` politics` elections` web2.0` directory
|
Population Estimates Data Sets
|
census` data` population` statistics
|
CRAN Task View: Machine Learning & Statistical Learning
|
statistical learning` machine learning` code` R` libraries` cran`
|
Data for Data Mining
|
links` datamining` timeseries` text` extraction` socialnetwork
|
PAIDA – Pure Python scientific analysis package
|
python` visualization` library
|
SUBDUE – Graph Based Knowledge Discovery
|
machine learning` network` graph`
|
AOL search data mirrors
|
aol` search`
|
Python Cheese Shop : shakespeare 0.4
|
python` text`
|
AG’s corpus of news articles
|
corpus` nlp` machine learning` textmining
|
Sampling Techniques for Massive Data – Google Video
|
video` machine learning` statistics` matrix` sampling` large` sparse` algorithm` experiment_design
|
metachronistic » Mirror the Wikipedia
|
wikipedia` laptop` install` dump
|
LETOR: Benchmark Datasets for Learning to Rank
|
ranking` search
|
CN710: Comparative Analysis of Learning Systems (Spring 2006) – Class Project
|
machinelearning` algorithm` ogi` bu` greyhound` finance
|
UrbanSim Home
|
python` urban` software` simulation` opensource` GIS` census`
|
System One – Wikipedia³
|
wikipedia` rdf`
|
System One – Labs
|
wikipedia` rdf` tools
|
Face Recognition Homepage – Databases
|
face` algorithm` face recognition` data` image
|
CBCL SOFTWARE Face data set
|
face` seung` algorithm` recognition` image
|
Text Analytics Solutions from ClearForest
|
extraction` finance` semantic` semanticweb` text
|
23C3 – Mining Search Queries – Google Video
|
aol` search` video` talk` algorithm` information retrieval` datamining` machinelearning
|
Digital History Hacks: Keywords and Clues
|
aol` search` query` analysis
|
Digital History Hacks: Searching for History
|
aol` search` query` analysis
|
The Tom Kyte Blog: An interesting data set…
|
aol` search` oracle` database` code
|
KDD 2005 – KDD Cup 2005: Aug 21-24, Chicago, IL. USA
|
query` categorization` algorithm` google
|
Statistical NLP / corpus-based computational linguistics resources
|
corpus` machine learning` text
|
Ph.d.-student Rasmus Elsborg Madsen
|
text` machine learning` context` matlab
|
Intelligent Web Search and Mining: Tools & Resources
|
machine learning` code` links
|
PageRank Datasets and Code
|
pagerank` code` algorithm
|
Official Google Research Blog: All Our N-gram are Belong to You
|
linguistics` google` ngram` nlp` record_linkage
|
Hyper-threaded Java – Java World
|
clustering` algorithm` java` parallel
|
Statistical Modeling, Causal Inference, and Social Science
|
blog` econometrics` finance` machine learning` math` statistics
|
Structural Analysis of Discrete Data and Econometric Applications
|
books` econometrics` economics` finance` ebook
|
Kris Brower » Archives » Google Onpage Search Results Analysis
|
google` ranking` aol` search` analytics
|
CSE 250B Fall 2006
|
netflixprize` machine learning` course`
|
Matrix Market
|
matrixmarket` matrix`
|
Estimation of mean values, covariance matrices and imputation of missing values
|
imputation` matlab` missing` EM` machinelearning
|
Face Detection
|
face` image
|
CSE 250B Project 4, Fall 2006
|
subset` netflix prize` dimensionality` reduction
|
G3DATA
|
extract` from` graphs` hack` google` trends
|
cwm – a general purpose data processor for the semantic web
|
python` processor` semantic` web` rdf
|
WebBase Project
|
link` analysis` structure` web` crawler` stanford
|
sam roweis : data
|
machine` learning` matlab` python` hackers` image
|
Flight Data and Weather Data
|
flight data` airplane data` weather data` airline route data` aircraft flight data` in-flight analysis` airline on-time data
|
Survey of Patients Hospital Experiences
|
healthcare analytics` subjective outcomes` healthcare customer service` quality of care` patient care` patient surveys
|
Largest collection of longitudinal hospital care data in the US
|
healthcare analytics` healthcare big data` research datasets` national in-patient statistics` local healthcare statistics` in-patient statistics` hospital cost data` hospital use data
|
Ambulatory healthcare data
|
physician visits data` doctor visit data` outpatient care data` private practice physician data` non-federal healthcare data` physician office data
|
Index of /data/sequence/mnist
|
mnist` xml` format
|
MNIST handwritten digit database
|
mnist`
|
Book-Crossing Dataset
|
data` set` collaborative` filtering` datamining` books` movie
|
allmovie
|
movie` netflix prize` source`netflix
|
Submissions Guidelines for the Collectorz.com Online Movie Database
|
movie` source
|
cinema.com
|
plot` synopsis` movie` netflix prize` prize
|
LUMIERE
|
netflixprize` prize` european` movie` revenue`
|
Data dumps – Meta
|
mediawiki` wikipedia` import` mysql` sql
|
"phone ***" " address *" "e-mail" intitle:"curriculum vitae" – Google Search
|
resume` google
|
generatedata.com
|
random` generator` database` sql
|
Lending Club Loan Data
|
Finance`Loans`business`investing
|
SMS Spam Collection
|
spam`email`text analysis
|
Data Sets | Pew Research Center’s Internet & American Life Project
|
demography
|
Flickr personal taxonomies
|
flickr`taxonomy`images
|
Yahoo Data for Researchers
|
yahoo
|
DBLP Computer Science Bibliography
|
bibliographies`text mining
|
ICWSM Spinn3r Challenge 2011 dataset
|
weblog`blog`social media`network analysis
|
Quantum Chaotic Thoughts: Facebook100 Data Set
|
facebook`network analysis`social
|
Public Data Sets on Amazon Web Services (AWS)
|
Amazon Web Services` amazon` ebs` ec2` s3` publicdata` hadoop
|
The ClueWeb09 Dataset
|
human language`text mining
|
Data | The World Bank
|
government`finance`economy
|
ImageNet
|
images
|
What is Twitter, a Social Network or a News Media? – WWW’10
|
twitter`text mining`social
|
dotbot | DotNetDotCom.org
|
spider`web analytics
|
arXiv.org help – arXiv Bulk Data Access – Amazon S3
|
Amazon Web Services` amazon` ebs` ec2` s3` publicdata` hadoop
|
YouTube Dataset
|
youtube`image analysis`video analysis
|
Face Recognition Homepage – Databases
|
face recognition`facial recognition`image analysis
|
UCI Network Data Repository
|
data repositories
|
Datasets for “The Elements of Statistical Learning”
|
learning
|
MovieLens Data Sets | GroupLens Research
|
movies`video analysis`business
|
Translation Task – EMNLP 2011 Sixth Workshop on Statistical Machine Translation
|
translation`human language
|
Project Gutenberg
|
books`text mining
|
About WordNet – WordNet – About WordNet
|
wordnet`corpus
|
Aligned Hansards of the 36th Parliament of Canada
|
canada`parliament`government`text mining
|
CRCNS – Collaborative Research in Computational Neuroscience – Data sharing
|
Bioinformatics` fmri` neuroscience` python` neuralnetwork
|
USENET corpus
|
usenet`text mining
|
UniGene
|
bioinformatics
|
ChEMBLdb
|
chemoinformatics
|
UCI Machine Learning Repository
|
algorithms
|
Gene Expression Omnibus (GEO) Main page
|
genetics`bioinformatics
|
Social Science Data
|
social science
|
IMDB dataset
|
business
|
Stanford Large Network Dataset Collection
|
network analysis
|
Google Books n-gram dataset
|
books`text mining
|
Million Song Dataset | scaling MIR research
|
audio analysis
|
Belly Button Biodiversity 2.0
|
health informatics`bioinformatics
|
Datasets – Modeling Online Auctions
|
auctions
|
2gb of photos of cats
|
image analysis`pets`cats
|
Click Dataset | Center for Complex Networks and Systems Research
|
web analytics
|
The Electric Rice Cooker — One year of deleted weibos archive
|
text mining
|
Registered meteorites that has impacted on Earth visualized – AnalyticBridge
|
meteorites`atmosphere
|
GeoJSON files for real-time Virginia transportation data.
|
road`traffic`accidents`transportation
|
NYPD Crash Data Band-Aid
|
road`traffic`accidents`transportation
|
State Department of Education datasets
|
student performance`school demographics`standardized test performance`school quality`education
|