What is VirClust?

VirClust is a bioinformatics tool which can be used for:

• virus clustering

• protein annotation

• core protein calculation

At its core is the grouping of viral proteins into clusters at three different levels:

• at the first level, proteins are grouped based on their reciprocal BLASTP similarities into protein clusters, or PCs.

• at the second level, PCs are grouped based on their Hidden Markov Model (HMM) similarities into protein superclusters, or PSCs.

• at the third, still experimental level, PSCs are grouped based on their HMM similarities into protein super-superclusters, or PSSCs.
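The three levels listed above can be pictured as repeated grouping of linked items. The following is only a toy sketch: it uses single-linkage grouping via union-find, which is an assumption made for illustration and not necessarily the clustering algorithm VirClust applies. The protein names and the similarity links are invented.

```python
from collections import defaultdict


def group_by_links(items, links):
    """Group items so that any two items joined by a link share a cluster."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent pointers to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for item in items:
        clusters[find(item)].add(item)
    return sorted(sorted(members) for members in clusters.values())


# Level 1: proteins -> PCs (links stand in for reciprocal BLASTP hits).
proteins = ["p1", "p2", "p3", "p4", "p5"]
blastp_links = [("p1", "p2"), ("p4", "p5")]
pcs = group_by_links(proteins, blastp_links)

# Level 2: PCs -> PSCs (links stand in for HMM-to-HMM similarities).
pc_ids = list(range(len(pcs)))
hmm_links = [(0, 1)]
pscs = group_by_links(pc_ids, hmm_links)

print(pcs)
print(pscs)
```

The third level (PSCs to PSSCs) would repeat the second step, with PSC indices as the items.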

More about how it works can be read here: DOI: 10.1101/2021.06.14.448304.

How to use VirClust?

You can run VirClust as a web service by going to the 'VirClust WEB' tab. Alternatively, you can download and install VirClust on your own servers, either as a Singularity container or directly from the source code, in its own Conda environment. The Singularity container can be downloaded from the 'Downloads' tab. The source code and the corresponding Conda environment can be downloaded from the VirClust GitHub repository.

How to cite the use of VirClust?

If you are using VirClust, please cite the following publication:

• Moraru, C. (2023) VirClust - A Tool for Hierarchical Clustering, Core Protein Detection and Annotation of (Prokaryotic) Viruses. Viruses 15(4), 1007. DOI: 10.3390/v15041007.

Additionally, if you are performing viral protein annotations using VirClust, please also cite the respective databases used for the annotations. See the VirClust manuscript for the complete citations.

Developer - Cristina Moraru, PhD

I developed VirClust during my position as Senior Scientist in the Department of The Biology of Geological Processes, at the Institute for Chemistry and Biology of the Marine Environment, Germany. The VirClust website and web service are hosted at this institution.

My current position is that of Senior Scientist in the Environmental Metagenomics Department, at the Research Center One Health Ruhr of the University Alliance Ruhr, Essen, Germany.

Please post any VirClust-related problems or feature requests in the Issues section of the corresponding GitHub repository.

CREATE NEW PROJECT

Minimum number of genomes:

• 3 for genome clustering (steps 4-7)

• 1 for only protein clusters and annotations

Accepted input formats: .fasta, .fna or .fa

Sequence names should contain at least one letter.

** no multifasta files here

* Mandatory for new projects
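The input requirements above can be checked locally before uploading. This is a minimal sketch, not VirClust's own validation: the function names are hypothetical, extensions are compared exactly as listed, and the one-sequence-per-file check assumes the upload field marked with ** is the one being used.

```python
import re
from pathlib import Path

ACCEPTED_EXTENSIONS = {".fasta", ".fna", ".fa"}


def check_genome_file(filename, text):
    """Return a list of problems found in one genome file (empty if none)."""
    problems = []
    if Path(filename).suffix not in ACCEPTED_EXTENSIONS:
        problems.append("unsupported extension")
    # FASTA headers start with '>'; the rest of the line is the sequence name.
    headers = [line[1:].strip() for line in text.splitlines() if line.startswith(">")]
    if len(headers) != 1:
        problems.append(f"expected exactly one sequence, found {len(headers)}")
    for name in headers:
        if not re.search(r"[A-Za-z]", name):
            problems.append(f"sequence name {name!r} contains no letter")
    return problems


def enough_genomes(n_genomes, want_genome_clustering):
    """3 genomes are needed for genome clustering, 1 for PCs and annotations only."""
    minimum = 3 if want_genome_clustering else 1
    return n_genomes >= minimum


print(check_genome_file("phage1.fasta", ">phageA\nACGT\n"))
print(check_genome_file("phage1.txt", ">12345\nACGT\n"))
print(enough_genomes(2, want_genome_clustering=True))
```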

LOAD EXISTING PROJECT

Info board

Each project you create is given a project ID and can be accessed later, as long as you have run at least one calculation (that is, pressed the “Run” button in the next tab). VirClust calculations can take a long time, and the browser can disconnect from the server in the meantime. Save the project ID so that you can access the results later.

PROTEIN CLUSTERING

Step 1A. Genomes to Proteins

Download results

Step 2A. Proteins to Protein Clusters (PCs)

Remove matches if

Download results

GENOME CLUSTERING

Step 3A. Order genomes hierarchically

Download results

Plot intergenomic similarities

Download results

Step 4A. Calculate stats and split in genome clusters (VGCs)

Download results

*The clustering distance ranges from 0.1 (minimum) to 1 (maximum). The higher the value, the fewer clusters result. At a value of 1, all genomes will belong to the same cluster.

*Known issue: if the chosen clustering distance results in each genome forming its own VGC, the output PDF will be empty. To solve this problem, increase the clustering distance progressively and recalculate steps 5 and 6.
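To illustrate the two notes above, here is a toy sketch of how a distance threshold changes the number of clusters, and of raising the threshold step by step until at least one cluster holds more than one genome. It uses plain single-linkage merging on an invented distance table; the linkage method, the genome names and the distances are all assumptions, and VirClust's own tree cutting may behave differently.

```python
def cut_into_clusters(names, dist, threshold):
    """Merge clusters while their closest members are within `threshold`."""
    clusters = [{n} for n in names]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                closest = min(
                    dist[frozenset((a, b))] for a in clusters[i] for b in clusters[j]
                )
                if closest <= threshold:
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return sorted(sorted(c) for c in clusters)


def first_useful_threshold(names, dist, start=0.1, step=0.1):
    """Raise the threshold until not every genome is alone in its cluster."""
    t = start
    while t <= 1.0:
        clusters = cut_into_clusters(names, dist, t)
        if len(clusters) < len(names):
            return round(t, 2), clusters
        t = round(t + step, 10)
    return 1.0, cut_into_clusters(names, dist, 1.0)


# Invented pairwise distances between four genomes, all within [0, 1].
names = ["A", "B", "C", "D"]
pairs = {("A", "B"): 0.2, ("A", "C"): 0.6, ("B", "C"): 0.7,
         ("A", "D"): 0.95, ("B", "D"): 0.9, ("C", "D"): 1.0}
dist = {frozenset(k): v for k, v in pairs.items()}

for t in (0.1, 0.3, 0.6, 1.0):
    print(t, cut_into_clusters(names, dist, t))
print(first_useful_threshold(names, dist))
```

With these toy distances, a threshold of 0.1 leaves every genome in its own cluster, and a threshold of 1 places all four genomes in one cluster.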

Output genome clustering PDF

Other options

Download results

CORE PROTEINS

Step 5A. Calculate core proteins for each VGC, based on PCs

Download results
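As a rough illustration of the core protein step, the sketch below assumes that "core" means a protein cluster present in every genome of a VGC. That definition, the function name and the toy data are assumptions for illustration; the VirClust manuscript gives the actual criteria.

```python
def core_protein_clusters(vgc_genomes, genome_to_pcs):
    """Return the PCs found in every genome of one VGC (assumed definition of core)."""
    pc_sets = [genome_to_pcs[g] for g in vgc_genomes]
    return set.intersection(*pc_sets) if pc_sets else set()


# Invented example: three genomes in one VGC and the PCs each one encodes.
genome_to_pcs = {
    "g1": {"PC1", "PC2", "PC3"},
    "g2": {"PC1", "PC3"},
    "g3": {"PC1", "PC3", "PC4"},
}
core = core_protein_clusters(["g1", "g2", "g3"], genome_to_pcs)
print(sorted(core))
```

The same intersection would apply to PSCs or PSSCs by swapping the cluster identifiers.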

PROTEIN ANNOTATIONS

Step 6A. Annotate proteins

Query the InterPro database using InterProScan
Query the pVOGs database using hhsearch
Query the VOGDB database using hhsearch
Query the PHROG database using hhsearch
Query the Efam database using hmmscan
Query the Efam-XC database using hmmscan
Query the NCBI database using BLASTP

Merge annotation tables
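Merging can be pictured as joining the per-database results on the protein identifier, so that each protein gets one row with one column per database. The sketch below is an assumption about the general idea only: the table layout, the column names and the hit labels are hypothetical and do not reflect VirClust's actual output files.

```python
def merge_annotation_tables(tables):
    """Outer-join per-database hit tables on protein ID; missing hits become ''."""
    protein_ids = sorted({pid for hits in tables.values() for pid in hits})
    return [
        {"protein_id": pid, **{db: hits.get(pid, "") for db, hits in tables.items()}}
        for pid in protein_ids
    ]


# Invented hits: database name -> {protein ID -> best hit label}.
tables = {
    "PHROG": {"prot1": "hit_A", "prot2": "hit_B"},
    "pVOGs": {"prot2": "hit_C"},
}
merged = merge_annotation_tables(tables)
for row in merged:
    print(row)
```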

PROTEIN CLUSTERING

Step 1B. PCs to Protein Superclusters (PSCs)

Keep matches if ...
conditional 1 is true

AND

OR
conditional 2 is true:

AND

AND

Download results
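The filter above combines two conditionals: a match is kept if conditional 1 holds or conditional 2 holds, and each conditional joins its criteria with AND. The sketch below shows that logic only. The criteria themselves (probability, coverage, alignment length), their grouping into two and three criteria, and the threshold values are hypothetical placeholders, since the actual fields and defaults are set in the web form.

```python
from dataclasses import dataclass


@dataclass
class Match:
    probability: float      # hypothetical criterion
    coverage: float         # hypothetical criterion
    alignment_length: int   # hypothetical criterion


def keep_match(m, cond1=(90.0, 50.0), cond2=(99.0, 20.0, 100)):
    """Keep a match if (c1a AND c1b) OR (c2a AND c2b AND c2c)."""
    conditional_1 = m.probability >= cond1[0] and m.coverage >= cond1[1]
    conditional_2 = (
        m.probability >= cond2[0]
        and m.coverage >= cond2[1]
        and m.alignment_length >= cond2[2]
    )
    return conditional_1 or conditional_2


print(keep_match(Match(95.0, 60.0, 40)))    # passes conditional 1
print(keep_match(Match(99.5, 25.0, 150)))   # passes conditional 2
print(keep_match(Match(99.5, 25.0, 50)))    # passes neither
```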

GENOME CLUSTERING

Step 2B. Order genomes hierarchically

Download results

Plot intergenomic similarities

Download results

Step 3B. Calculate stats and split in genome clusters (VGCs)

Download results

*The clustering distance ranges from 0.1 (minimum) to 1 (maximum). The higher the value, the fewer clusters result. At a value of 1, all genomes will belong to the same cluster.

*Known issue: if the chosen clustering distance results in each genome forming its own VGC, the output PDF will be empty. To solve this problem, increase the clustering distance progressively and recalculate steps 5 and 6.

Output genome clustering PDF

Other options

Download results

CORE PROTEINS

Step 4B. Calculate core proteins for each VGC, based on PSCs

Download results

PROTEIN ANNOTATIONS

Step 5B. Annotate proteins

Query the InterPro database using InterProScan
Query the pVOGs database using hhsearch
Query the VOGDB database using hhsearch
Query the PHROG database using hhsearch
Query the Efam database using hmmscan
Query the Efam-XC database using hmmscan
Query the NCBI database using BLASTP

Merge annotation tables

PROTEIN CLUSTERING

Step 1C. PSCs to Protein Super-superclusters (PSSCs)

Keep matches if ...
conditional 1 is true

AND

OR
conditional 2 is true:

AND

AND

Download results

GENOME CLUSTERING

Step 2C. Order genomes hierarchically

Download results

Plot intergenomic similarities

Download results

Step 3C. Calculate stats and split in genome clusters (VGCs)

Download results

*The clustering distance ranges from 0.1 (minimum) to 1 (maximum). The higher the value, the fewer clusters result. At a value of 1, all genomes will belong to the same cluster.

*Known issue: if the chosen clustering distance results in each genome forming its own VGC, the output PDF will be empty. To solve this problem, increase the clustering distance progressively and recalculate steps 5 and 6.

Output clustering PDF

Other options

Download results

CORE PROTEINS

Step 4C. Calculate core proteins for each VGC, based on PSSCs

Download results

PROTEIN ANNOTATIONS

Step 5C. Annotate proteins

Query the InterPro database using InterProScan
Query the pVOGs database using hhsearch
Query the VOGDB database using hhsearch
Query the PHROG database using hhsearch
Query the Efam database using hmmscan
Query the Efam-XC database using hmmscan
Query the NCBI database using BLASTP

Merge annotation tables


VirClust v2 web-server

VirClust v2 stand-alone

Annotation databases for VirClust stand-alone

For the annotation of viral genomes, VirClust relies on several previously published databases. The InterProScan and BLASTNR databases should be installed by the user as described in the manual for the VirClust v2 stand-alone version. The other databases (Efam, Efam_XC, PHROG, pVOGs and VOGDB) must be in a VirClust-specific format and can be downloaded below. For each database used, please cite the original publication describing it.

Download database ...

Publication to cite ...

• Efam: Zayed, A.A., Lücking, D., Mohssen, M., Cronin, D., Bolduc, B., Gregory, A.C., Hargreaves, K.R., Piehowski, P.D., White, R.A., Huang, E.L., Adkins, J.N., Roux, S., Moraru, C., and Sullivan, M.B. (2021) efam: an expanded, metaproteome-supported HMM profile database of viral protein families. Bioinformatics (Oxford, England), doi: 10.1093/bioinformatics/btab451.

• Efam_XC: Zayed, A.A., Lücking, D., Mohssen, M., Cronin, D., Bolduc, B., Gregory, A.C., Hargreaves, K.R., Piehowski, P.D., White, R.A., Huang, E.L., Adkins, J.N., Roux, S., Moraru, C., and Sullivan, M.B. (2021) efam: an expanded, metaproteome-supported HMM profile database of viral protein families. Bioinformatics (Oxford, England), doi: 10.1093/bioinformatics/btab451.

• PHROG: Terzian, P., Olo Ndela, E., Galiez, C., Lossouarn, J., Pérez Bucio, R.E., Mom, R., Toussaint, A., Petit, M.-A., and Enault, F. (2021) PHROG: families of prokaryotic virus proteins clustered using remote homology. NAR genomics and bioinformatics, doi: 10.1093/nargab/lqab067.

• VOGDB: Kiening, M., Ochsenreiter, R., Hellinger, H.-J., Rattei, T., Hofacker, I., and Frishman, D. (2019) Conserved Secondary Structures in Viral mRNAs. Viruses, doi: 10.3390/v11050401.

• pVOGs: Grazziotin, A.L., Koonin, E.V., and Kristensen, D.M. (2017) Prokaryotic virus orthologous groups (pVOGs). A resource for comparative genomics and protein family annotation. Nucleic acids research, doi: 10.1093/nar/gkw975.
