Download: Semantically Annotated Noun Phrases from NELL
This page allows you to download NELL's semantic category labels for millions of noun phrases, annotated using 271 possible semantic categories from NELL's ontology. You should download two files:
1. Download vocabulary of NELL semantic categories. See below for details of the file format.
2. Download one or more of the following files of annotated noun phrases. There are several sets of noun phrases, collected from different text corpora:
From ClueWeb09. Version 1 released May 2013, labeled by NELL at age/iteration 734
- 13.4 million noun phrases collected from the ClueWeb09 text corpus. download (353 Mbytes)
From KBP 2012 Source Document Collection. Version 1 released on May 31, 2013, labeled by NELL at age/iteration 734
- Labels for 11 million noun phrases containing up to three words, extracted automatically with fairly good accuracy from the KBP 2012 source document collection: download (307 Mbytes)
- Labels for 10.4 million noun phrases containing four or more words, extracted automatically from the KBP 2012 source document collection: download (286 Mbytes)
- Labels for 6.5 thousand noun phrases extracted from the training and evaluation annotations made available for the KBP 2012 track. This set of labelings includes the entity IDs from the annotations for easy cross-referencing: download (241 Kbytes)
From KBP 2013 Source Document Collection. Version 1 released on Jun 14, 2013, labeled by NELL at age/iteration 734
- Labels for 13 million noun phrases containing up to three words, extracted automatically with fairly good accuracy from the KBP 2013 source document collection: download (325 Mbytes)
- Labels for 13.5 million noun phrases containing four or more words, extracted automatically from the KBP 2013 source document collection: download (391 Mbytes)
Details & File Format
- The vocabulary of NELL semantic categories file
lists the names of the 271 semantic categories used, plus their ancestors in the category hierarchy,
in tab-delimited format. Each line in this file gives the name of a NELL category, followed by a
tab character, followed by the names of its super-categories in the hierarchy (separated by spaces,
and not listed in any particular order). For example:
bone	bodypart everypromotedthing item
physiologicalcondition	everypromotedthing abstractthing
politician	person everypromotedthing humanagent agent
meat	everypromotedthing item food
magazine	mediacompany everypromotedthing organization humanagent company agent publication
skyscraper	location building everypromotedthing geolocatablething
The root of the category ontology is "everypromotedthing". The NELL category hierarchy is also browsable in our KB browser, which includes human readable descriptions of each category.
- The Annotated NP files list the noun phrases, one per line, along with
NELL's category assignments and their confidence scores. Here is a sample:
walmart cakes	food 0.9950034255883934	bakedgood 0.9936223796319958
walmart chicken fillet	food 0.9120391365495848	meat 0.7938324890967661
walmart computer	householditem 0.9930006960040956	tableitem 0.9743665732692279
walmart food problem	cognitiveactions 0.9546198511420078
walmart foods	retailstore 0.9633229830359848
Wallmart	retailstore 0.9952728850944121
Walmart manager	jobposition 0.9797067038004044
Walmart store	retailstore 0.8941848786449225
Walmart blog	retailstore 0.5
Walmart products	product 0.5904620565265137	personalcareitem 0.5785527316534766
WakeUpWalmart.com	website 0.6484081720243292
Walmart Employee	person 0.6482422202114423
Walmart Supercenter	building 0.9996091798476484	shoppingmall 0.9841013139469271	retailstore 0.8778046916000724	attraction 0.6018111088928949
Walmart logo	retailstore 0.5
Walmart Scholarship	person 0.6362291197367791
Walmart Music	company 0.8980230629335543	person 0.7694512922867878	recordlabel 0.7498610989111582
Walmart shopping	hobby 0.6753644318904531	physicalaction 0.5583626348382723
Walmarting	hobby 0.8720528297110349	sport 0.7153318897051574	physicalaction 0.6814023516120948
Each line begins with a case-sensitive noun phrase, followed by a tab, followed by a list of one or more category-confidence pairs, where each such pair is also separated by a tab. The category-confidence pairs are listed from highest to lowest confidence. Confidences range from 0.5 to 1.0 and are not calibrated probabilities: we find that beliefs with confidence of at least 0.90 are correct more often than not, and those with confidence below 0.80 are often unreliable. The noun phrases were extracted from the text corpus by applying a set of rules to POS-tagged sentences.
For example, the first line in the above sample denotes NELL's belief that the noun phrase "walmart cakes" refers to a 'food' with confidence 0.995, and to a 'bakedgood' with slightly lower confidence. Each line contains only labels with confidence of at least 0.5, and lists a more general category only if NELL has higher confidence in it than in the more specific category. Otherwise, the more general categories are omitted (they can be read from the vocabulary of NELL semantic categories file described above).
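As an illustration of the two formats described above, here is a minimal Python sketch that reads both kinds of lines and recovers the suppressed general categories from the hierarchy. It is not part of the NELL distribution: the function names (parse_vocabulary, parse_annotation, with_ancestors) are hypothetical, the 0.9 default threshold is only borrowed from the reliability note above, and the parser assumes that a category and its confidence are separated by whitespace (a space or a tab), which should be checked against the downloaded files.

```python
from typing import Dict, Iterable, List, Set, Tuple


def parse_vocabulary(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each category name to the set of its super-categories."""
    ancestors: Dict[str, Set[str]] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Category name, a tab, then space-separated super-categories.
        category, _, supers = line.partition("\t")
        ancestors[category.strip()] = set(supers.split())
    return ancestors


def parse_annotation(line: str) -> Tuple[str, List[Tuple[str, float]]]:
    """Split one annotated-NP line into (noun_phrase, [(category, confidence), ...])."""
    fields = line.rstrip("\n").split("\t")
    noun_phrase = fields[0]  # may contain spaces, so only tabs delimit it
    # Assumption: category and confidence are separated by whitespace, so
    # flatten the remaining fields and read the tokens two at a time.
    tokens = " ".join(fields[1:]).split()
    labels = [(tokens[i], float(tokens[i + 1])) for i in range(0, len(tokens) - 1, 2)]
    return noun_phrase, labels


def with_ancestors(
    labels: List[Tuple[str, float]],
    ancestors: Dict[str, Set[str]],
    min_confidence: float = 0.9,  # illustrative threshold, not an official cutoff
) -> Set[str]:
    """Return categories at or above min_confidence plus their super-categories."""
    result: Set[str] = set()
    for category, confidence in labels:
        if confidence >= min_confidence:
            result.add(category)
            result |= ancestors.get(category, set())
    return result


# Usage on lines taken from the samples above.
vocabulary = parse_vocabulary(["meat\teverypromotedthing item food\n"])
phrase, labels = parse_annotation(
    "walmart chicken fillet\tfood 0.9120391365495848\tmeat 0.7938324890967661\n"
)
print(phrase, sorted(with_ancestors(labels, vocabulary, min_confidence=0.75)))
```

Because the annotated files hold millions of lines, it is probably better to iterate over an open file and call parse_annotation line by line than to load a whole file into memory.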
This data is made freely available by the NELL research project, for anybody who would like to use it for any legal purpose whatsoever. You may cite this in publications as
- CMU NELL noun phrase annotations, version NELL.08m.734.categories, http://rtw.ml.cmu.edu/rtw/nps, Machine Learning Department, Carnegie Mellon University, May 2013.