Minas Gjoka

2.5K-Graphs Software

We make available two software packages to demonstrate the algorithms and estimators described in our paper 2.5K-Graphs: from Sampling to Generation, that appeared in INFOCOM 2013.

If you use our software, please cite our paper using the below bibtex entry:

  title= {{2.5K-Graphs: from Sampling to Generation}},
  author= {Minas Gjoka and Maciej Kurant and Athina Markopoulou},
  booktitle = {Proceedings of IEEE INFOCOM '13},
  address = {Turin, Italy},
  month = {April},
  year = {2013}


Download here. Instructions of usage follows below.


The Python script Estimation_GraphSample.py expects as input two files.

We provide real examples of random walk samples and the corresponding graph samples at http://www.minasgjoka.com/2.5K/samples_randomwalk/. The file "samples-sources.txt" contains more information for each sample.

The script Estimation_GraphSample.py also optionally accepts setting the separator character,which in the above case is '|'. By default, whitespace is the separator character.


The output of this script consists of estimation results for
  1. the Traversed Edges estimator of CCK
  2. the Induced Edges Estimator of CCK
  3. the Hybrid Estimator of CCK
  4. the Harsen-Hurwitz estimator of the node degree distribution.
Each estimation result is stored in a separate file. Additionally, if the matplotlib and numpy packages are installed, the script plots and saves two graphs. The first graph plots all the estimations of the degree-dependent clustering coefficient distributions and the second graph plots the estimation of the node degree distribution.


Below is the output of an execution with a random walk sample of size 50K from Last.fm:
> ./Estimation_GraphSample.py graphsample_lastfm randomwalk_lastfm "|"
*** Loading graph sample graphsample_lastfm ***
 Graph sample contains 9135540 triangles
 Graph sample has 41130 nodes and 3909183 edges

*** Loading random walk traversal randomwalk_lastfm ***
 Random walk contains 50000 total node samples and 41130 unique node samples
 Degree distribution estimated from random walk. Average degree is 10.424

*** Starting estimation of degree-dependent clustering coefficient distribution ***
 [Induced Edges]   Network average clustering coefficient: 0.167639
 [Traversed Edges] Network average clustering coefficient: 0.172050
 [Hybrid]          Network average clustering coefficient: 0.170205

*** Storing estimation results in text files ***
 Induced Edges estimation dumped in file 'graphsample_lastfm_cck-ind.txt'
 Traversed Edges estimation dumped in file 'graphsample_lastfm_cck-edge.txt'
 Hybrid estimation dumped in file 'graphsample_lastfm_cck-hybrid.txt'
 Node degree distribution estimation dumped in file 'graphsample_lastfm_nodeDegreeDist.txt'

 Degree-dependent clustering coefficient graph available at 'graphsample_lastfm_cck graph.png'
 Node degree distribution graph available at 'graphsample_lastfm_nodeDegreeDist-graph.png'

estimation of the node degree distribution estimation of the degree-dependent clustering coefficient distribution


Download here. It consists of two classes (i) the "Generation" and (ii) the "Estimation" class. The following functions are exposed in each class.


KTRI is the non-normalized version of the degree-dependent clustering coefficient. KTRI contains the number of triangles connected to nodes of degree k.



The file examples.py provides examples that demonstrate how to combine the functions from the Estimation and Generation classes so as to go "from Sampling to Generation".

As an example of usage of these classes, below is included Python source code that

  1. loads the Facebook New Orleans graph
  2. simulates a random walk sample of size equal to 30% of the graph size, including repetitions
  3. estimates JDD and CCK
  4. generates a synthetic graph with the same target JDD and CCK using our paper's algorithm
    (2K with Triangles + Improved MCMC).
  5. saves the generated synthetic graph as a networkx Graph in a python pickle.

from Estimation import Estimation
from Generation import Generation	

p_sample = 0.30
fname = "Facebook-New-Orleans.edges.gz"

myest = Estimation()
myest.sample('rw', p_sample)
mygen = Generation()
mygen.set_JDD(  myest.get_JDD('realizable') )
mygen.set_KTRI( myest.get_KTRI('estimate')  ) 
mygen.save_graphs('%s_2KT+ImpMCMC_%.2fsample' % (fname, p_sample))
Here follows an instance of an execution
Loading graph
Edges:816884  Nodes:63392  AvgDegree:25.772

Simulation of sample
30.00% rw sample 

JDD Estimation  
CCK estimation  
Postprocessing to make JDD realizable

Summary of estimation results.
Realizable JDD - Edges:828588.00  Nodes:62909.00  AvgDegree:26.342
Estimated  CCK - #Triangles:21718829        Avg Clust Coeff: 0.229

Generation 2K
Estimated JDD set as target JDD

2K triangles construction started

1st pass  allpairs smart
residual stubs: 266360/266360

1st pass  kneighb
residual stubs: 1582/1582

2nd pass iteration 0
residual stubs: 16/16

2nd pass iteration 1
residual stubs: 0/0

Constructed Graph - Edges:828588.00  Nodes:62909.00  AvgDegree:26.342

Generation 2.5K
Improved double edge swaps
--2K      Graph	: #Triangles:40031742	C_avg:0.483
--Target  Graph	: #Triangles:21718829	C_avg:0.229

 ** Simulation completes when error metric NMAE reaches threshold 0.050 ** 
[Thu Sep 12 01:17:52 2013]
[4.5][216.59] #Swaps:973,  C_avg:0.4794, NMAE:0.7740, #Triangles:39763992
[3.9][249.88] #Swaps:1940, C_avg:0.4760, NMAE:0.7644, #Triangles:39500406
[3.8][24.18] #Swaps:295603, C_avg:0.2248, NMAE:0.0500, #Triangles:21229320
Total time:1830
Saving 2K   at "Facebook-New-Orleans.edges.gz_2KT+ImpMCMC_0.30sample_2K.edges.gz"
Saving 2.5K at "Facebook-New-Orleans.edges.gz_2KT+ImpMCMC_0.30sample_2.5K.edges.gz"
 ** 2.5K Graph	: #Triangles:21229320	C_avg:0.225 **

© 2013 Sept.