The codes provided here will compute CoSimRank. You can also use it to compute the cosine of personalized PageRank vectors. Our experiments have shown that CoSimRank is more accurate.
To learn more about CoSimRank, read the following paper: http://anthology.aclweb.org/P/P14/P14-1131.pdf Or watch the talk (20min): http://techtalks.tv/talks/cosimrank-a-flexible-efficient-graph-theoretic-similarity-measur/60483/git clone https://github.com/casaro/CoSimRank.git
make
sandbox/sandbox.sh
The graph files are in matrix market file format. Each graph can contain several files, each one presenting one edge type. The code comes with an English and a German graph, with approximately 30k words each. The graphs have no edges type (or just one).
Each graph requires a dictionary file to connect nodes and words. A dictionary files contains N lines (where as N is the number of words in the graph, i.e. the matrix market files has dimension NxN). Each lines starts with the the index, followed by the word and the pos. All separated by space. The codes also contains the matching dictionary files for the graphs.
For bilingual experiments a seed file is required. This file contains known translations. It is also in matrix market file file format and usually contains edges of weigth 1. But other weights might be interesting as well. For monolingual experiments (synonym extraction) this file is not used.
You can provide a testset file if you want (see sandbox/example.sh for an example). The code will execute all tests and compute acc@1, acc@10 and MRR. A testset file starts with a line containing the number of tests T, the source graph and the target graph (either A or B). The following T lines will contain the pos (a, n or v), the source word and the expected answer. The anser can be several word separated by ",". All other entries are separated by tabulator. The code has got a TS1000 testset for lexicon extraction and a TS68 testset for synonym extraction.
If you use CoSimRank, please cite the following paper:
@InProceedings{rothe-schutze:2014:P14-1, author = {Rothe, Sascha and Sch\"{u}tze, Hinrich}, title = {CoSimRank: A Flexible \& Efficient Graph-Theoretic Similarity Measure}, booktitle = {Proceedings of the ACL}, month = {June}, year = {2014}, address = {Baltimore, Maryland}, publisher = {Association for Computational Linguistics}, pages = {1392--1402}, url = {http://www.aclweb.org/anthology/P/P14/P14-1131} }
Contact: Sascha Rothe (cis page)