Read the Web: WSDM 2010 Supplementary Online Materials



These files are supplementary online material for the WSDM 2010 paper "Coupled Semi-Supervised Learning for Information Extraction" by Carlson, Betteridge, Wang, Hruschka, and Mitchell. This work is part of the Read the Web research project.

Knowledge Bases

These knowledge bases contain all of the information about each entity from the experimental runs in the paper. The information for each entity is stored in an XML file, which records both promoted and candidate values along with their sources and probabilities.

Seed Ontology

These spreadsheets contain the seed ontology information about categories and relations, including the 15 seed instances for each predicate.

All Instances Promoted by each Algorithm

These tab-separated text files contain every item promoted by each algorithm during the runs from the paper.

All Promoted Textual Patterns


Labeled Samples from All Promotions for each Algorithm

This text file contains the items (up to 30 per predicate) that were sampled for each algorithm and used to estimate the precision of the instances that algorithm promoted for each predicate. The format is "algorithm TAB predicate TAB instance TAB correct/incorrect", where "correct/incorrect" is the label from the Mechanical Turk majority vote.
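
The four-column format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the released materials: it assumes the last column holds the literal strings "correct" or "incorrect", that the file has no header row, and it uses made-up algorithm names and rows (ALGO_A, ALGO_B) rather than data from the actual file. The function name estimate_precision is likewise hypothetical.

```python
import csv
import io
from collections import defaultdict


def estimate_precision(lines):
    """Return {(algorithm, predicate): fraction of samples labeled 'correct'}.

    `lines` is any iterable of text lines in the format
    algorithm TAB predicate TAB instance TAB correct/incorrect.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip blank or malformed lines
        algorithm, predicate, _instance, label = row
        tally = counts[(algorithm, predicate)]
        # Assumption: the label column is the literal word "correct" or "incorrect".
        tally[0] += label.strip().lower() == "correct"
        tally[1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


# Illustrative rows only; these are not taken from the released file.
sample = io.StringIO(
    "ALGO_A\tcity\tPittsburgh\tcorrect\n"
    "ALGO_A\tcity\tBoston\tcorrect\n"
    "ALGO_A\tcity\tbanana\tincorrect\n"
    "ALGO_B\tcity\tParis\tcorrect\n"
)
precision = estimate_precision(sample)
for (algorithm, predicate), value in sorted(precision.items()):
    print(f"{algorithm}\t{predicate}\t{value:.2f}")
```

To run this on the real file, pass an open file handle in place of the in-memory sample, for example open(path, newline="", encoding="utf-8"), where path is wherever the labeled-samples file was saved. The encoding is also an assumption.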

Labeled Samples at Minimum Recall to Compare Two Algorithms

These text files contain all of the items (up to 30 per predicate) sampled using the "minimum recall" threshold described in the paper to directly compare four pairs of algorithms. The last column specifies the label assigned by the majority vote from Mechanical Turk.

Lists Used in Segmenting Noun Phrases

These lists are used in segmenting noun phrases.

Mechanical Turk Evaluation Files

These files include the judgments from our Mechanical Turk (MT) evaluation, the detailed predicate descriptions shown to MT evaluators, and files that may be useful to researchers who want to use our MT command line tool templates as a starting point for their own MT HITs.

All Candidate Instances Extracted by each Algorithm

These tab-separated text files contain every item extracted (but not necessarily promoted) by each algorithm during the runs from the paper.