JSON Input Documentation

This document describes the parameters, rules, and limitations of each pipeline stage. To view a valid pipeline with all parameters included, refer to pipeline_template.json.

Overall Parameters

The following parameters apply to the entire pipeline rather than to a single stage:

    "title": "cit-new-pp-output",
    "name": "cit_patents",
    "input_file": "/data3/chackoge/networks/cit_patents_cleaned.tsv",
    "output_dir": "samples/",
    "algorithm": "leiden",
    "params": [
        {
            "res": 0.5,
            "i": 2
        },
        {  
            "res": 0.1,
            "i": 2
        },
        {
            "res": 0.01,
            "i": 2
        }
    ],
    "stages": ["..."]

All of the parameter values shown above are required.

NOTE: Paths must be either absolute or relative to the JSON file.
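The required keys and the path rule can be checked before a run. The following is a minimal sketch, not the pipeline's own loader: the function name load_pipeline is hypothetical, and it assumes that relative paths are resolved against the directory containing the JSON file.

```python
import json
from pathlib import Path

# Top-level keys taken from the example above.
REQUIRED_KEYS = (
    "title", "name", "input_file", "output_dir",
    "algorithm", "params", "stages",
)


def load_pipeline(json_path):
    """Load a pipeline file, check required keys, and resolve its paths."""
    json_path = Path(json_path)
    with json_path.open() as handle:
        config = json.load(handle)

    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"missing required keys: {', '.join(missing)}")

    # Assumption: relative paths are resolved against the directory that
    # holds the JSON file; joining with an absolute path keeps it unchanged.
    base_dir = json_path.parent
    for key in ("input_file", "output_dir"):
        config[key] = str(base_dir / config[key])
    return config
```

Calling load_pipeline("my_pipeline.json") (a hypothetical file name) returns the parsed dictionary with "input_file" and "output_dir" rewritten, or raises ValueError naming the missing keys.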

Algorithmic Parameters

The params field contains dictionaries mapping parameter names to their values. The parameter names vary based on the algorithm being used.

Leiden-CPM

As shown above, Leiden-CPM takes resolution and iteration parameters, designated in the JSON as the "res" and "i" fields.

{
    "res": 0.5,
    "i": 2
}

Leiden-Mod

Leiden-Mod doesn't need to use a resolution parameter since it optimizes modularity and not CPM. Therefore, only an iterations parameter needs to be passed.

{
    "i": 2
}

IKC

IKC only needs a k-core value passed as a parameter, designated as "k".

{
    "k": 10
}

Infomap

Infomap doesn't take any parameters, so its dictionary is always empty. To fit the style constraints of the pipeline, any algorithm that doesn't take parameters should use an empty dictionary, as follows:

{}

Since the same parameter set cannot be run more than once, the overall "params" field in the JSON for such an algorithm will look like this:

"params": [{}]

Your Own Clustering Method

Refer to the customization documentation for more details on how to create your own clustering method. You can assign your own parameter names for it in the pipeline file.

Supposing you create a clustering method with two parameters "a" and "b" with integer values, you can designate them in the pipeline file as follows:

{
    "a": 1,
    "b": 2
}
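Whatever the algorithm, the same parameter set cannot be run more than once. The sketch below shows one way to spot repeated entries in a "params" list before launching a run. The helper name find_duplicate_param_sets is hypothetical and is not part of the pipeline.

```python
import json


def find_duplicate_param_sets(params):
    """Return the parameter sets that occur more than once in `params`."""
    seen = set()
    duplicates = []
    for entry in params:
        # Serialize with sorted keys so that key order does not matter.
        fingerprint = json.dumps(entry, sort_keys=True)
        if fingerprint in seen and entry not in duplicates:
            duplicates.append(entry)
        seen.add(fingerprint)
    return duplicates
```

An empty result means every entry is distinct. For a parameterless algorithm, a list such as [{}, {}] is reported as containing a duplicate, which matches the rule that its "params" field holds a single empty dictionary.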

Stages

Cleanup

This stage removes any self loops (i.e. edges $(u, u)$) and parallel edges (i.e. duplicate edges $(u, v)$ with more than one occurrence in the edge list). It does not take any extra parameters. To add it, add the following object to the stages array:

{
    "name": "cleanup"
}

Limitations: This stage cannot come after a stage that outputs a clustering (e.g. filtering, connectivity_modifier).
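To illustrate what the cleanup stage does to an edge list, here is a minimal sketch. It is not the pipeline's implementation. The function name clean_edges and the undirected flag are hypothetical, and treating $(u, v)$ and $(v, u)$ as the same edge is an assumption that can be switched off.

```python
def clean_edges(edges, undirected=True):
    """Drop self loops and repeated edges, keeping first occurrences in order."""
    seen = set()
    cleaned = []
    for u, v in edges:
        if u == v:
            continue  # self loop (u, u)
        # Assumption: in an undirected network (u, v) and (v, u) are the
        # same edge; with undirected=False only exact repeats are dropped.
        key = frozenset((u, v)) if undirected else (u, v)
        if key in seen:
            continue  # parallel edge
        seen.add(key)
        cleaned.append((u, v))
    return cleaned
```

The input is any iterable of node pairs, for example rows read from a two-column TSV edge list.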

Clustering

This stage uses the clustering algorithm specified in the overall parameters to cluster a cleaned network. If resolutions and/or iterations are arrays, multiple clusterings are output. To add this stage, add the following object to the stages array, modifying the parameters as needed:

{
    "name": "clustering",
    "parallel_limit": 2
}

Optional Parameter:

Limitations: This stage cannot come after a stage that outputs a clustering.

Filtering

This stage takes a clustering and filters it according to a script or a series of scripts. To add this stage, add the following object to the stages array, modifying the scripts as needed:

{
    "name": "filtering",
    "scripts": [
        "./scripts/subset_graph_nonetworkit_treestar.R",
        "./scripts/make_cm_ready.R"
    ]
}

Required Parameters:

Limitations: This stage must come after a stage that outputs a clustering.

Connectivity Modifier

This stage applies CM++ to a clustering to enforce connectivity requirements within clusters. To add it, add the following object to the stages array, changing the parameters as needed. Optional parameters can be deleted from this template:

{
    "name": "connectivity_modifier",
    "memprof": true,
    "threshold": "1log10",
    "nprocs": 32,
    "quiet": true
}

Required Parameter:

Optional Parameters:

Limitations: This stage must come after a stage that outputs a clustering.
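The ordering limitations stated for the cleanup, clustering, filtering, and connectivity_modifier stages can be checked mechanically. The sketch below is one possible check under the assumption that clustering, filtering, and connectivity_modifier are the stages that output a clustering. The function name check_stage_order is hypothetical, stages with no stated ordering limitation are not checked, and the sketch does not account for an existing clustering supplied in the parameters.

```python
# Assumption: these are the stages that output a clustering.
OUTPUTS_CLUSTERING = {"clustering", "filtering", "connectivity_modifier"}
# Stages that cannot come after a stage that outputs a clustering.
MUST_PRECEDE_CLUSTERING = {"cleanup", "clustering"}
# Stages that must come after a stage that outputs a clustering.
NEEDS_CLUSTERING = {"filtering", "connectivity_modifier"}


def check_stage_order(stages):
    """Return a list of messages describing ordering violations."""
    problems = []
    clustering_available = False
    for position, stage in enumerate(stages, start=1):
        name = stage["name"]
        if name in MUST_PRECEDE_CLUSTERING and clustering_available:
            problems.append(
                f"stage {position} ({name}) cannot come after a stage "
                "that outputs a clustering"
            )
        if name in NEEDS_CLUSTERING and not clustering_available:
            problems.append(
                f"stage {position} ({name}) must come after a stage "
                "that outputs a clustering"
            )
        if name in OUTPUTS_CLUSTERING:
            clustering_available = True
    return problems
```

An empty list means the stages array respects the limitations listed above. Each message names the position and stage that violates one of them.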

Stats

This stage reports statistics of a clustering that was output by a preceding stage. For more information on the statistics reporting and its outputs, see the corresponding documentation. The code for the stage is the following:

{
    "name": "stats",
    "parallel_limit": 2,
    "universal_before": false,
    "summarize": false
}

Optional Parameters:

Limitations:

Using an Existing Clustering

{
    "title": "cit-new-pp-output-leiden-skipstage",
    "name": "cit_patents",
    "input_file": "/data3/chackoge/networks/cit_patents_cleaned.tsv",
    "output_dir": "samples/",
    "algorithm": "leiden",
    "params": [
        {
            "res": 0.5,
            "i": 2,
            "existing_clustering": "samples/cit-new-pp-output-leiden_mod-20230614-23:55:59/res-0.5-i2/S2_cit_patents_leiden.0.5_i2_clustering.tsv"
        },
        {
            "res": 0.1,
            "i": 2,
            "existing_clustering": "samples/cit-new-pp-output-leiden_mod-20230614-23:55:59/res-0.1-i2/S2_cit_patents_leiden.0.1_i2_clustering.tsv"
        }
    ],
    "stages": ["..."]
}

To use an existing clustering, add an "existing_clustering" key to each parameter entry in the JSON header, as in the example above. This applies to any clustering method.
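When the parameter entries are generated by a script, the "existing_clustering" key can be attached programmatically. This is a small sketch under the assumption that exactly one clustering file is supplied per entry, in the same order. The helper name attach_existing_clusterings and the paths in the usage note are hypothetical.

```python
def attach_existing_clusterings(params, clustering_paths):
    """Return copies of the parameter entries with "existing_clustering" set."""
    if len(params) != len(clustering_paths):
        raise ValueError("need exactly one clustering path per parameter entry")
    # Build new dictionaries so the original entries are left untouched.
    return [
        {**entry, "existing_clustering": path}
        for entry, path in zip(params, clustering_paths)
    ]
```

For example, passing [{"res": 0.5, "i": 2}] together with ["runs/res-0.5/clustering.tsv"] yields a single entry that keeps "res" and "i" and gains the "existing_clustering" path.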

Examples

View the following folder to check out examples: examples/