Read the Web: WSDM 2010 Supplementary Online Materials

These files are supplementary online material for the WSDM 2010 paper "Coupled Semi-Supervised Learning for Information Extraction" by Carlson, Betteridge, Wang, Hruschka, and Mitchell. This work is part of the Read the Web research project.

Knowledge Bases

These Knowledge Bases contain all of the information about each entity from the experimental runs in the paper. Each entity's information is stored in an XML file, which records both promoted and candidate values along with their sources and probabilities.

Seed Ontology

These spreadsheets contain the seed ontology information about categories and relations, including the 15 seed instances for each predicate.

All Instances Promoted by each Algorithm

These tab-separated text files contain every item promoted by each algorithm during the runs from the paper.

All Promoted Textual Patterns

Textual Patterns: A tab-separated text file that contains every textual extraction pattern promoted during the MBL, CPL, and UPL runs.

Labeled Samples from All Promotions for each Algorithm

This text file contains the items (up to 30 per predicate) sampled for each algorithm and used to estimate the precision of the instances that algorithm promoted for each predicate. The format is "algorithm TAB predicate TAB instance TAB correct/incorrect", where "correct/incorrect" is the label from the Mechanical Turk majority vote.

Samples from all promotions
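
The four-column format described above lends itself to a simple per-predicate precision estimate. The following Python sketch is illustrative only and is not part of the original materials: it assumes the last column holds the literal strings "correct" or "incorrect", the function name estimate_precision is hypothetical, and the sample rows are made up rather than taken from the real file.

```python
import csv
import io
from collections import defaultdict


def estimate_precision(lines):
    """Return {(algorithm, predicate): (n_correct, n_sampled, precision)}.

    `lines` is any iterable of tab-separated lines in the assumed layout:
    algorithm TAB predicate TAB instance TAB correct/incorrect
    """
    counts = defaultdict(lambda: [0, 0])  # [n_correct, n_sampled]
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip blank or malformed lines
        algorithm, predicate, _instance, label = row
        tally = counts[(algorithm, predicate)]
        tally[1] += 1
        # Assumption: the label column is the literal word "correct" or "incorrect".
        if label.strip().lower() == "correct":
            tally[0] += 1
    return {key: (c, n, c / n) for key, (c, n) in counts.items()}


# Made-up rows for illustration only; they are not taken from the real file.
sample = io.StringIO(
    "CPL\tcity\tPittsburgh\tcorrect\n"
    "CPL\tcity\tbanana\tincorrect\n"
    "UPL\tcity\tParis\tcorrect\n"
)
precision = estimate_precision(sample)
```

To run this on the downloaded file, pass an open file object in place of the in-memory sample. Because at most 30 items are sampled per predicate, each value is an estimate from a small sample rather than an exact precision.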

Labeled Samples at Minimum Recall to Compare Two Algorithms

These text files contain all of the items (up to 30 per predicate) sampled using the "minimum recall" threshold described in the paper to directly compare four pairs of algorithms. The last column specifies the label assigned by the majority vote from Mechanical Turk.

Samples at Minimum Recall between MBL and CPL: Comparing Meta-Bootstrap Learner and Coupled Pattern Learner
Samples at Minimum Recall between MBL and CSEAL: Comparing Meta-Bootstrap Learner and Coupled SEAL
Samples at Minimum Recall between CSEAL and SEAL: Comparing Coupled SEAL and Uncoupled SEAL
Samples at Minimum Recall between CPL and UPL: Comparing Coupled Pattern Learner and Uncoupled Pattern Learner

Lists Used in Segmenting Noun Phrases

These lists are used in segmenting noun phrases.

Mechanical Turk Evaluation Files

These files include the judgments from our Mechanical Turk evaluation, detailed predicate descriptions shown to MT evaluators, and files that may be useful to researchers who want to use our MT command line tool templates as a starting point for their own MT HITs.

Screenshot of Mechanical Turk interface: the interface shown to labelers, who were asked to label 15 instances of a predicate.
Mechanical Turk judgments
Descriptions of predicates given to Mechanical Turk users: Tab-separated file containing descriptions of each predicate, and the range and domain of each relation.
Mechanical Turk 'question' template: Used with the Mechanical Turk command line tools and the 'input' file below to generate HITs with up to 15 judgments per HIT.
Mechanical Turk 'input' file used with 'question' file: Used as input to the 'question' file above to generate Mechanical Turk HITs.
Perl script to tally votes from 'results' file obtained from Mechanical Turk: After you get the tab-separated results back from Mechanical Turk, this script will tally the results.
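
For readers who prefer not to use Perl, majority-vote tallying can be sketched in Python. This is not a port of the script listed above. The column layout of the Mechanical Turk 'results' file is not described here, so the sketch assumes the votes have already been extracted as (item, label) pairs. The function name majority_labels and the choice to return None on a tie are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def majority_labels(votes):
    """Map each item to its most frequent label.

    `votes` is an iterable of (item, label) pairs, one pair per judgment.
    Assumption: an item whose top two labels are tied gets None.
    """
    by_item = defaultdict(Counter)
    for item, label in votes:
        by_item[item][label] += 1

    result = {}
    for item, counter in by_item.items():
        ranked = counter.most_common(2)  # highest counts first
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            result[item] = None  # tie between the two leading labels
        else:
            result[item] = ranked[0][0]
    return result


# Made-up judgments for illustration only.
example_votes = [
    ("Pittsburgh", "correct"),
    ("Pittsburgh", "correct"),
    ("Pittsburgh", "incorrect"),
    ("banana", "incorrect"),
    ("banana", "correct"),
]
labels = majority_labels(example_votes)
```

Using an odd number of judgments per item avoids ties when there are only two possible labels.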

All Candidate Instances Extracted by each Algorithm

These tab-separated text files contain every item extracted (but not necessarily promoted) by each algorithm during the runs from the paper.