Quick Start Guide

Running CM++

Suppose you have a network and a clustering of the network that you want to refine. For this tutorial, I will use the following network and clustering.

You know that these clusters aren't well-connected, and you want to use CM++ to enforce better connectivity.

Visualizing the Data

First, you can visualize the graph and clustering. This can be done via the following Python script:

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

# Load the clustering data
clustering_file = 'clustering.tsv'
clustering_data = pd.read_csv(clustering_file, sep='\t', header=None, names=['node_id', 'cluster_id'])

# Load the edge list data
edge_list_file = 'network.tsv'
G = nx.read_edgelist(edge_list_file, delimiter='\t', nodetype=int)

# Assign cluster IDs to nodes in the graph
node_cluster_mapping = dict(clustering_data.values)
nx.set_node_attributes(G, node_cluster_mapping, 'cluster')

# Generate a color map for clusters
unique_clusters = set(node_cluster_mapping.values())
color_map = plt.cm.get_cmap('viridis', len(unique_clusters))
colors = {cluster: color_map(i) for i, cluster in enumerate(unique_clusters)}

# Assign color to each node based on cluster
node_colors = [colors[node_cluster_mapping[node]] for node in G.nodes()]

# Plot the graph
plt.figure(figsize=(12, 8))
nx.draw(G, node_color=node_colors, with_labels=False, node_size=50, edge_color='gray')
plt.savefig('patitions.png')
plt.show()

Running this script results in the following:

network

Installing and Running CM++

CM++ can form these into well-connected clusters. Firstly, CM++ can be quickly installed with the following command:

pip install git+https://github.com/illinois-or-research-analytics/cm_pipeline

If you want all clusters to have a connectivity above log10(n) (all clusters have a mincut value above log10 of the number of nodes in the cluster), you can run the command below:

python3 -m hm01.cm -i network.tsv -e clustering.tsv -o out.tsv -c leiden -g 0.5 -t 1log10

Visualize Post-CM Clustering

CM++ pruned some nodes from this clustering to form well-connected clusters. You can visualize the clustering through the following script:

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

# Load the clustering data
clustering_file = 'out.tsv'
clustering_data = pd.read_csv(clustering_file, sep='\t', header=None, names=['node_id', 'cluster_id'])

# Load the edge list data
edge_list_file = 'network.tsv'
G = nx.read_edgelist(edge_list_file, delimiter='\t', nodetype=int)

# Assign cluster IDs to nodes in the graph
node_cluster_mapping = dict(clustering_data.values)
nx.set_node_attributes(G, node_cluster_mapping, 'cluster')

# Generate a color map for clusters
unique_clusters = set(node_cluster_mapping.values())
color_map = plt.cm.get_cmap('viridis', len(unique_clusters))
colors = {cluster: color_map(i) for i, cluster in enumerate(unique_clusters)}
colors['pruned'] = (0.7, 0, 0)

# Assign color to each node based on cluster
node_colors = [(0.7, 0, 0) if node not in node_cluster_mapping else colors[node_cluster_mapping[node]] for node in G.nodes()]

# Plot the graph
plt.figure(figsize=(12, 8))
nx.draw(G, node_color=node_colors, with_labels=False, node_size=50, edge_color='gray')

# Create a legend
plt.scatter([], [], c=colors['0\u03B4'], label='Cluster 0\u03B4')
plt.scatter([], [], c=colors['1\u03B4'], label='Cluster 1\u03B4')
plt.scatter([], [], c=colors['pruned'], label='Pruned nodes')

plt.legend()
plt.savefig('patitions2.png')
plt.show()

Resulting in the following visualization:

postcm

The red nodes are nodes that have been pruned off by CM++ to make clusters well-connected.

Running CM Pipeline

Suppose now, you do not have a clustering, and you want to compute statistics on these clusters. Then, you must run the CM Pipeline.

Installing the CM Pipeline

A pip install alone is not sufficient for the entire pipeline. The pipeline can be installed via Git cloning:

Clone the cm_pipeline repository
Activate the venv which has the necessary packages
Run pip install -r requirements.txt && pip install .
Make sure everything installed properly by running cd tests && pytest

Downloading the Data

For this tutorial, I will use the following network

network.tsv

Constructing the Pipeline

In order to make a pipeline with the following stages,

Clustering
Compute Statistics
CM++
Compute Statistics

in the root of the repository, create a pipeline.json file like the following:

{
    "title": "example-leiden",
    "name": "example",
    "input_file": "/path/to/network.tsv",
    "output_dir": "samples/",
    "algorithm": "leiden",
    "params": [{
        "res": 0.01,
        "i": 2
    }],
    "stages": [
        {
            "name": "clustering"
        },
        {
            "name": "stats"
        },
        {
            "name": "connectivity_modifier",
            "threshold": "1log10"
        },
        {
            "name": "stats"
        }
    ]
}

Replace "input_file" with the path to the network.tsv file. This path can be relative to the pipeline.json file or absolute. The "title" and "name" fields exist for output nomenclature and can be arbitrarily renamed.

Running the Pipeline

Once a pipeline.json file is ready, from the root of the repository, simply run:

python3 -m main pipeline.json

Results

When running the pipeline, you should get the outputs in a directory called samples/(name)-leiden-(timestamp)/ where name is replaced with the name of the run and the timestamp is the time at which the CM Pipeline was run.

Before CM++

You should result in statistics like the following:

cluster	n	m	modularity	cpm_score	connectivity	connectivity_normalized_log10(n)	connectivity_normalized_log2(n)	connectivity_normalized_sqrt(n)/5	conductance
0	150	555	0.2492196158940795	443.25	2.0	0.9190791018883572	0.2766703780563079	0.8164965809277261	0.0009000900090009
1	150	598	0.24921961589407954	486.25	2.0	0.9190791018883572	0.2766703780563079	0.8164965809277261	0.0009000900090009
Overall	300	1154	0.4984392317881591	929.5

Notice how the connectivity normalized by log10(n) is less than 1, indicating that the clusters have mincuts less than the log of the number of nodes. Therefore, they aren't well connected.

After CM++

cluster	n	m	modularity	cpm_score	connectivity	connectivity_normalized_log10(n)	connectivity_normalized_log2(n)	connectivity_normalized_sqrt(n)/5	conductance
1δ	149	596	0.2483846030234675	485.74	3.0	1.380461511102679	0.4155603227015324	1.2288478807785608	0.0026954177897574125
0δ	149	553	0.24832002468994893	442.74	3.0	1.380461511102679	0.4155603227015324	1.2288478807785608	0.002705139765554554
Overall	300	1154	0.4967046277134164	928.48

As you can see, nodes have been pruned, and the mincut values have increased. The normalized mincut value to log10(n) is now above 1, meaning it fits the connectivity threshold.

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search