2.5K-Graphs Software
We make available two software packages to demonstrate the algorithms and estimators described in our paper 2.5K-Graphs: from Sampling to Generation, that appeared in INFOCOM 2013.- Estimation_GraphSample: Provides a standalone script that estimates the degree-dependent clustering coefficient distribution and network average clustering coefficient. It receives as input a random walk graph sample (Released in Sept. 2013).
- 2.5K-Estimation_Generation: Provides a package that implements all the algorithms and estimators included in our paper in classes Estimation and Generation. It receives as an input a fully known graph and then simulates a random walk graph sample of given size. The class Estimation provides functions that estimate the degree-dependent clustering coefficient (CCK) and joint degree distribution (JDD). The class Generation provides functions that generate a 2.5K graph given specific CCK and JDD distributions. (Released in July 2012)
If you use our software, please cite our paper using the below bibtex entry:
@inproceedings{gjoka13_2.5K_Graphs, title= {{2.5K-Graphs: from Sampling to Generation}}, author= {Minas Gjoka and Maciej Kurant and Athina Markopoulou}, booktitle = {Proceedings of IEEE INFOCOM '13}, address = {Turin, Italy}, month = {April}, year = {2013} }
Estimation_GraphSample
Download here. Instructions of usage follows below.Input
The Python script Estimation_GraphSample.py expects as input two files.- The first file contains the order of visits in the random walk. Line i contains the i-th node id sampled during the simple random walk. Here is an example:
uid_1| uid_3| uid_2| uid_1| ..
- The second file contains, for each sampled node id in the random walk, the sampled node id and the node ids of its neighbors (adjacency list format). Each sampled node id and its neighbors appear in a separate line. Here is an example, where character '|' is used as a separator.
uid_1|uid_2|uid_3|uid_4|uid_5 uid_2|uid_1|uid_3 uid_3|uid_1|uid_2|uid_6 ..
We provide real examples of random walk samples and the corresponding graph samples at http://www.minasgjoka.com/2.5K/samples_randomwalk/. The file "samples-sources.txt" contains more information for each sample.
The script Estimation_GraphSample.py also optionally accepts setting the separator character,which in the above case is '|'. By default, whitespace is the separator character.
Output
The output of this script consists of estimation results for- the Traversed Edges estimator of CCK
- the Induced Edges Estimator of CCK
- the Hybrid Estimator of CCK
- the Harsen-Hurwitz estimator of the node degree distribution.
Example
Below is the output of an execution with a random walk sample of size 50K from Last.fm:> ./Estimation_GraphSample.py graphsample_lastfm randomwalk_lastfm "|" *** Loading graph sample graphsample_lastfm *** Graph sample contains 9135540 triangles Graph sample has 41130 nodes and 3909183 edges *** Loading random walk traversal randomwalk_lastfm *** Random walk contains 50000 total node samples and 41130 unique node samples Degree distribution estimated from random walk. Average degree is 10.424 *** Starting estimation of degree-dependent clustering coefficient distribution *** [Induced Edges] Network average clustering coefficient: 0.167639 [Traversed Edges] Network average clustering coefficient: 0.172050 [Hybrid] Network average clustering coefficient: 0.170205 *** Storing estimation results in text files *** Induced Edges estimation dumped in file 'graphsample_lastfm_cck-ind.txt' Traversed Edges estimation dumped in file 'graphsample_lastfm_cck-edge.txt' Hybrid estimation dumped in file 'graphsample_lastfm_cck-hybrid.txt' Node degree distribution estimation dumped in file 'graphsample_lastfm_nodeDegreeDist.txt' Degree-dependent clustering coefficient graph available at 'graphsample_lastfm_cck graph.png' Node degree distribution graph available at 'graphsample_lastfm_nodeDegreeDist-graph.png'
2.5K-Estimation_Generation
Download here. It consists of two classes (i) the "Generation" and (ii) the "Estimation" class. The following functions are exposed in each class.Estimation
-
load_graph(fname)
Loads graph. Accepts edgelist (edges.gz) and Matlab adjacency matrix (.mat) formats. The graphs used in the paper and referenced in the file examples.py are conveniently located at http://www.minasgjoka.com/2.5K/graphs/. -
calcfull_CCK()
Calculates full degree-dependent clustering coefficient distribution. Assumes "load_graph" has been previously called. -
calcfull_JDD()
Calculates full joint degree distribution. Assumes "load_graph" has been previously called. -
sample('rw', p_sample)
Simulates a random walk sample from the loaded graph. Assumes "load_graph" has been called. -
estimate_CCK()
Estimates degree-dependent clustering coefficient distribution. Assumes "load_graph", "sample" have been called. -
estimate_JDD()
Estimates joint degree distribution distribution. Assumes "load_graph", "sample" have been called. -
estimation_summary()
Provides summary of the result of either the estimation or the calculation of CCK and JDD.
-
get_KTRI('full')
Returns full KTRI calculation. Assumes "load_graph", "calcfull_CCK" have been called. -
get_KTRI('estimate')
Returns estimated KTRI. Assumes "load_graph", "sample", "estimate_CCK" have been called. -
get_JDD('full')
Returns full joint degree distribution calculation. Assumes "load_graph", "calcfull_JDD" have been called. -
get_JDD('realizable')
Returns realizable joint degree distribution. Assumes "load_graph", "sample", "estimate_JDD" have been called.
KTRI is the non-normalized version of the degree-dependent clustering coefficient. KTRI contains the number of triangles connected to nodes of degree k.
Generation
-
set_JDD( jdd_input)
Sets the target joint degree distribution. It accepts input from function get_JDD of the Estimation class. -
set_KTRI( jdd_input )
Sets the target (non-normalized) degree-dependent clustering coefficient distribution. It accepts input from function get_KTRI of the Estimation class.
-
construct_triangles_2K()
Construct simple exact 2K with high clustering, that is our algorithm for 2.5K. It assumes "set_JDD" has been previously called. -
construct_simple_2K()
Construct simple exact 2K using the Sandia Labs construction algorithm. It assumes "set_JDD" has been previously called. -
construct_dkseries_2K()
Construct simple 2K with potential multigraph edges using the CAIDA construction algorithm. It assumes "set_JDD" has been called
-
mcmc_improved_2_5_K(error_threshold=0.03)
Do improved MCMC to reach 2.5K. Assumes "set_JDD", "set_KTRI" have been called and one of the three 2K constructions has been completed. -
mcmc_random_2_5_K(error_threshold=0.03)
Do random MCMC to reach 2.5K. Assumes "set_JDD", "set_KTRI" have been called and one of the three 2K constructions has been completed. -
save_graphs(filename)
Save constructed graph as a networkx Graph type to a python Pickle file.
Examples
The file examples.py provides examples that demonstrate how to combine the functions from the Estimation and Generation classes so as to go "from Sampling to Generation".
As an example of usage of these classes, below is included Python source code that
- loads the Facebook New Orleans graph
- simulates a random walk sample of size equal to 30% of the graph size, including repetitions
- estimates JDD and CCK
- generates a synthetic graph with the same target JDD and CCK using our paper's algorithm
(2K with Triangles + Improved MCMC). - saves the generated synthetic graph as a networkx Graph in a python pickle.
from Estimation import Estimation from Generation import Generation p_sample = 0.30 fname = "Facebook-New-Orleans.edges.gz" myest = Estimation() myest.load_graph(fname) myest.sample('rw', p_sample) myest.estimate_JDD() myest.estimate_CCK() myest.estimation_summary() mygen = Generation() mygen.set_JDD( myest.get_JDD('realizable') ) mygen.set_KTRI( myest.get_KTRI('estimate') ) mygen.construct_triangles_2K() mygen.mcmc_improved_2_5_K(error_threshold=0.05) mygen.save_graphs('%s_2KT+ImpMCMC_%.2fsample' % (fname, p_sample))Here follows an instance of an execution
Loading graph -------- Edges:816884 Nodes:63392 AvgDegree:25.772 Simulation of sample -------- 30.00% rw sample Estimation -------- JDD Estimation CCK estimation Postprocessing to make JDD realizable Summary of estimation results. Realizable JDD - Edges:828588.00 Nodes:62909.00 AvgDegree:26.342 Estimated CCK - #Triangles:21718829 Avg Clust Coeff: 0.229 Generation 2K --------- Estimated JDD set as target JDD 2K triangles construction started 1st pass allpairs smart residual stubs: 266360/266360 1st pass kneighb residual stubs: 1582/1582 2nd pass iteration 0 residual stubs: 16/16 2nd pass iteration 1 residual stubs: 0/0 Constructed Graph - Edges:828588.00 Nodes:62909.00 AvgDegree:26.342 Generation 2.5K --------- Improved double edge swaps --2K Graph : #Triangles:40031742 C_avg:0.483 --Target Graph : #Triangles:21718829 C_avg:0.229 ** Simulation completes when error metric NMAE reaches threshold 0.050 ** [Thu Sep 12 01:17:52 2013] [4.5][216.59] #Swaps:973, C_avg:0.4794, NMAE:0.7740, #Triangles:39763992 [3.9][249.88] #Swaps:1940, C_avg:0.4760, NMAE:0.7644, #Triangles:39500406 .. .. [3.8][24.18] #Swaps:295603, C_avg:0.2248, NMAE:0.0500, #Triangles:21229320 Total time:1830 Saving 2K at "Facebook-New-Orleans.edges.gz_2KT+ImpMCMC_0.30sample_2K.edges.gz" Saving 2.5K at "Facebook-New-Orleans.edges.gz_2KT+ImpMCMC_0.30sample_2.5K.edges.gz" ** 2.5K Graph : #Triangles:21229320 C_avg:0.225 **
© 2013 Sept.