Efficient and Expressive Knowledge Base Completion Using Subgraph Feature Extraction

Matt Gardner and Tom Mitchell

Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), Lisbon, Portugal.


A preprint of the paper can be found here.


The code used in this paper lives in a repository on github.


You need three pieces of data to reproduce the experiments in this paper:

- The graph files, which describe the NELL + SVO graph that we used.
- The split files, which contain the train and test data for each of the 10 relations we learned models for.
- The NELL metadata, which contains domains, ranges, and inverses for the relations, among other things.

The documentation in the github repository linked above describes how all of these pieces fit together when running the code, as well as the file format of each piece of data.
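The quick-start commands below extract the three archives into a particular layout under examples/. As a sketch (directory names taken from the instructions on this page), the expected structure can be reproduced and inspected like this:

```shell
# Sketch of the layout the three archives should end up in,
# relative to where you checked out the code:
mkdir -p examples/graphs/nell/kb_svo                       # graph files
mkdir -p examples/splits/final_nell_split_with_negatives   # train/test splits
mkdir -p examples/relation_metadata/nell                   # NELL metadata
find examples -type d | sort
```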

Quick Start

To quickly reproduce the best result from the paper, pick a directory on a Unix-based machine (I only tried this on Linux, but it should work on macOS too), then run the following commands:

git clone https://github.com/matt-gardner/pra.git
cd pra/examples
wget http://rtw.ml.cmu.edu/emnlp2015_sfe/graph.tgz
mkdir -p graphs/nell/
tar -xzf graph.tgz
mv kb_svo/ graphs/nell/
wget http://rtw.ml.cmu.edu/emnlp2015_sfe/split.tgz
mkdir splits/
tar -xzf split.tgz
mv final_nell_split_with_negatives/ splits/
wget http://rtw.ml.cmu.edu/emnlp2015_sfe/nell_metadata.tgz
mkdir relation_metadata/
tar -xzf nell_metadata.tgz
mv nell/ relation_metadata/
cd ..
sbt "run ./examples/ sfe_bfs_pra_anyrel"

The last command (which actually runs the code) took 4-5 minutes when I ran it on a fairly large EC2 instance (it used ~30 cores and something like ~30G of RAM); your mileage may vary. Note that you will need to run that command twice: the first time, select the ExperimentRunner option, and the second time, select ExperimentScorer to see the results.

Note that this requires that you already have git, wget, and sbt installed on your machine; if you don't have them, instructions for installing them should be easy to find. Please don't ask me for help getting this to run on Windows: I don't use Windows, and I won't be able to help you. I also suspect the code simply won't work there, because it has plenty of hard-coded path separators (/ instead of \). Also note that I am still actively developing this code for various things I'm working on, and I occasionally make changes that break things. If you run those commands exactly as written and something doesn't work, or the results don't match what's in the paper, let me know and I will fix it.

Once you've reproduced the result, you'll likely want to run the code on your own dataset, or run your algorithm on my dataset. If that's what you want to do, see below.

Old (and slightly more explanatory) instructions

If you want to actually run the code to reproduce the experiments in this paper, it should be pretty easy. The experiment spec files are checked in to the github repository under examples/experiment_specs/nell/final_emnlp2015/. You'll need to untar the graph files into graphs/nell/kb_svo/, the split files into splits/final_nell_split_with_negatives/, and the NELL metadata into relation_metadata/nell/. The easiest thing to do is probably to put all of those directories (graphs/, splits/, and relation_metadata/) underneath examples/ where you checked out the code, then execute the following command from the base code directory: sbt "run ./examples/ sfe_bfs_pra_anyrel". That should reproduce the best result from the paper. Changing the last argument will run different experiments (e.g., using final_emnlp2015 will try to run basically all of the experiments in the paper, though there might be a few for which I didn't provide enough input files).

See the documentation for more information on running the code, and let me know if you run into problems (contact information is below). A quick note: you may well need to increase the heap space used by the JVM, and the mechanism for doing that with sbt is different from running java directly. With sbt, you need to modify the build.sbt file at the root of the repository, adding a line like javaOptions ++= Seq("-Xmx40g"), with a blank line above and below it.
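As a concrete sketch, the addition to build.sbt described above would look like this (the 40g value comes from the note above; adjust it to your machine):

```scala
// In build.sbt, at the root of the repository.
// Older sbt syntax requires a blank line above and below each setting.

javaOptions ++= Seq("-Xmx40g")
```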

The SVO data itself can be found here.

Using the data and/or code

If all you want is to run your new algorithm and compare it against this work, you'll need the data files (to run your algorithm on) and my results files from running on the same data. The data files are linked above, but I haven't put up the results files yet, because I need to clean them up so the download isn't several gigabytes. If you want the results to do a thorough comparison, let me know, and I will put them up (or you can run the code yourself, as described above; SFE is deterministic, so you should get identical results, though PRA is not). Note that the code tends to need a large machine, with at least tens of gigabytes of RAM.

If you want to use this method on your data, see the github repository linked above (and feel free to file bugs or feature requests, send pull requests, etc.).

If you want to extend the algorithm in some way using my code as a starting point, see the github repository linked above (and feel free to file bugs or feature requests, send pull requests, etc.).

If you want something else, contact information is below.


If you have any questions about the paper, about using the code, or about obtaining the data, the main point of contact for this paper is Matt Gardner.