VAMB

Metagenomic Binning

VAMB (Variational Autoencoders for Metagenomic Binning) uses deep variational autoencoders to learn a latent representation of contigs from their tetranucleotide frequencies and co-abundance profiles across multiple samples, enabling accurate metagenomic binning. [1]

How to Obtain Output Model File

The workflow below shows how the team produced the example output file presented on the tools page.

Input

Co-assembled contigs (FASTA) + per-sample BAM alignment files

Output

TSV with per-cluster metrics: cluster name, radius, peak/valley ratio, kind (normal/loner/fallback), total bp, contig count, medoid contig

conda install -c bioconda vamb

Docker image: quay.io/biocontainers/vamb:4.1.3--pyhdfd78af_0

Sample 1: gut microbiome (3 samples from PRJNA795985, 1M read pairs each), co-assembled with MEGAHIT and binned with VAMB v5.0.4, producing 7,212 clusters.

  1. Download 3 samples from PRJNA795985

    for SRR in SRR17531757 SRR17531762 SRR17531772; do fastq-dump --split-files --gzip -X 1000000 --outdir reads/ $SRR; done

    Downloads 1M read pairs per sample (~70 MB compressed each) from the "Diet and Antimicrobial Resistance in Healthy US Adults" study (Shrestha et al., mBio 2022).

  2. Co-assemble all samples with MEGAHIT

    megahit -1 reads/SRR17531757_1.fastq.gz,reads/SRR17531762_1.fastq.gz,reads/SRR17531772_1.fastq.gz -2 reads/SRR17531757_2.fastq.gz,reads/SRR17531762_2.fastq.gz,reads/SRR17531772_2.fastq.gz -o assembly --min-contig-len 1500 -t 8 --presets meta-sensitive

    Produces 14,934 contigs (≥1,500 bp).

  3. Map each sample to contigs

    mkdir -p bams; for SRR in SRR17531757 SRR17531762 SRR17531772; do minimap2 -ax sr -t 8 assembly/final.contigs.fa reads/${SRR}_1.fastq.gz reads/${SRR}_2.fastq.gz | samtools sort -@ 4 -o bams/${SRR}.sorted.bam && samtools index bams/${SRR}.sorted.bam; done

  4. Run VAMB v5.0.4

    vamb bin default --outdir vamb_out --fasta assembly/final.contigs.fa --bamdir bams/ -m 1500 --minfasta 200000 -p 8

    Produces vae_clusters_metadata.tsv with 7,212 clusters (715 normal, 6,472 loner, 25 fallback).

Upload vae_clusters_metadata.tsv to IntMeta
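
The contig count reported in step 2 can be checked independently of MEGAHIT's log. The sketch below counts FASTA records at or above a length cutoff with awk. The function name `count_contigs` and the toy FASTA are hypothetical and only illustrate the call; they are not part of the original workflow.

```shell
# Count FASTA records whose total sequence length is at least MIN bp.
# Usage: count_contigs FASTA MIN   (hypothetical helper, not a VAMB command)
count_contigs() {
  awk -v min="$2" '
    /^>/ { if (seen && len >= min) n++; seen = 1; len = 0; next }
         { len += length($0) }
    END  { if (seen && len >= min) n++; print n + 0 }
  ' "$1"
}

# Toy FASTA for illustration only: c1 = 12 bp, c2 = 3 bp, c3 = 10 bp.
demo_fa=$(mktemp)
printf '>c1\nACGTACGT\nACGT\n>c2\nACG\n>c3\nACGTACGTAC\n' > "$demo_fa"
count_contigs "$demo_fa" 10   # prints 2 (c1 and c3 pass the cutoff)
```

On the real data, `count_contigs assembly/final.contigs.fa 1500` should print 14934 if the assembly reproduces the run described above; a different MEGAHIT version may give a slightly different number.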

Charts Reference

Detailed descriptions of all 11 visualizations that IntMeta generates from VAMB output.

cluster-kind-distribution

Pie chart of cluster kinds from the 'kind' column: normal (separated by density peaks in latent space), loner (isolated single-contig clusters), and fallback (assigned the default clustering radius). A healthy assembly produces mostly normal clusters; many fallback clusters suggest poor separation.
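
The numbers behind this pie chart can be reproduced from the command line. The following is a minimal sketch, assuming the file is tab-separated with a header row that contains a column named `kind`; the function name `kind_shares` and the toy table are hypothetical.

```shell
# Tally the 'kind' column and print kind, count, and percentage, sorted by kind.
# Usage: kind_shares TSV   (hypothetical helper; locates the column by header name)
kind_shares() {
  awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "kind") k = i; next }
    k       { count[$k]++; total++ }
    END     { for (x in count) printf "%s\t%d\t%.1f\n", x, count[x], 100 * count[x] / total }
  ' "$1" | sort
}

# Toy table for illustration only (not real VAMB output).
demo_kinds=$(mktemp)
printf 'clustername\tkind\nc1\tnormal\nc2\tloner\nc3\tloner\nc4\tfallback\n' > "$demo_kinds"
kind_shares "$demo_kinds"
# fallback  1  25.0
# loner     2  50.0
# normal    1  25.0
```

For the run described above (715 normal, 6,472 loner, 25 fallback), the shares work out to roughly 9.9%, 89.7%, and 0.3%.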

genome-size-distribution

Bar chart of total bp per cluster, sorted by size and colored by kind. VAMB recommends filtering out clusters below 250 Kbp and discarding fallback clusters before downstream analysis.
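
The size and kind filter described here can be sketched with awk. This assumes a tab-separated file whose header contains columns named `bp` and `kind`; the function name `filter_clusters`, its default cutoff argument, and the toy table are hypothetical.

```shell
# Keep clusters with at least MIN bp whose kind is not "fallback".
# Prints the header plus the rows that pass. MIN defaults to 250000.
# Usage: filter_clusters TSV [MIN]   (hypothetical helper)
filter_clusters() {
  awk -F'\t' -v min="${2:-250000}" '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; print; next }
    $(col["bp"]) >= min && $(col["kind"]) != "fallback"
  ' "$1"
}

# Toy table for illustration only (not real VAMB output).
demo_filter=$(mktemp)
printf 'clustername\tkind\tbp\tncontigs\n' > "$demo_filter"
printf 'a\tnormal\t300000\t12\nb\tnormal\t100000\t5\n' >> "$demo_filter"
printf 'c\tfallback\t400000\t30\nd\tloner\t250000\t1\n' >> "$demo_filter"
filter_clusters "$demo_filter"   # keeps a and d; b is too small, c is fallback
```

Looking columns up by header name rather than by position keeps the sketch usable if the column order differs between VAMB versions.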

radius-vs-pvr

Scatter plot of clustering radius vs peak-to-valley ratio (PVR, from the 'peak valley ratio' column), colored by kind. Normal clusters typically have higher PVR (clearer density separation) and tighter radius. Low PVR with large radius suggests poorly resolved clusters.

genome-size-vs-contigs

Scatter plot of total bp (genome size) vs ncontigs per cluster. Well-resolved bins cluster at moderate contig counts with substantial genome sizes. Points in the upper-left (many contigs, small genome) may indicate chimeric clusters.

contigs-per-cluster

Bar chart of ncontigs per cluster, colored by kind. Very high contig counts may indicate over-fragmented or chimeric clusters that merged unrelated contigs.

avg-contig-length

Bar chart of average contig length (total bp / ncontigs) per cluster. Higher values indicate better-assembled genomes with longer individual contigs. Low averages suggest highly fragmented assemblies.
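
The per-cluster average is a single division, shown here as a rough sketch. It assumes a tab-separated file with header columns named `clustername`, `bp`, and `ncontigs`; those names, the function name `avg_contig_length`, and the toy table are assumptions for illustration.

```shell
# Print cluster name and average contig length (bp / ncontigs) per cluster.
# Rows with ncontigs == 0 are skipped to avoid division by zero.
# Usage: avg_contig_length TSV   (hypothetical helper)
avg_contig_length() {
  awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $(col["ncontigs"]) > 0 {
      printf "%s\t%.1f\n", $(col["clustername"]), $(col["bp"]) / $(col["ncontigs"])
    }
  ' "$1"
}

# Toy table for illustration only (not real VAMB output).
demo_avg=$(mktemp)
printf 'clustername\tkind\tbp\tncontigs\na\tnormal\t300000\t12\nd\tloner\t250000\t1\n' > "$demo_avg"
avg_contig_length "$demo_avg"
# a  25000.0
# d  250000.0
```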

metrics-by-kind

Grouped box plot comparing distributions of genome size (bp), contig count, and radius across the three cluster kinds (normal, loner, fallback). Reveals systematic differences: for example, loner clusters tend to be small and contain a single contig.
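
A box plot shows full distributions, but a per-kind mean already exposes the same systematic differences. The sketch below swaps in means as a simpler summary (it does not compute the medians or quartiles a box plot draws). It assumes header columns named `kind`, `bp`, `ncontigs`, and `radius`; the function name `kind_means` and the toy table are hypothetical.

```shell
# Print kind, cluster count, mean bp, mean ncontigs, and mean radius per kind.
# Usage: kind_means TSV   (hypothetical helper; means only, not quartiles)
kind_means() {
  awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    { k = $(col["kind"]); n[k]++
      bp[k] += $(col["bp"]); nc[k] += $(col["ncontigs"]); r[k] += $(col["radius"]) }
    END { for (k in n)
            printf "%s\t%d\t%.1f\t%.1f\t%.3f\n", k, n[k], bp[k] / n[k], nc[k] / n[k], r[k] / n[k] }
  ' "$1" | sort
}

# Toy table for illustration only (not real VAMB output).
demo_means=$(mktemp)
printf 'clustername\tradius\tkind\tbp\tncontigs\n' > "$demo_means"
printf 'a\t0.1\tnormal\t300000\t10\nb\t0.3\tnormal\t100000\t20\nc\t0.2\tloner\t5000\t1\n' >> "$demo_means"
kind_means "$demo_means"
# loner   1  5000.0    1.0   0.200
# normal  2  200000.0  15.0  0.200
```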

cluster-metrics-heatmap

Heatmap of min-max normalized metrics (bp, ncontigs, radius, peak valley ratio) across the top clusters. Each metric is scaled 0–1 within its column; darker cells = higher relative values. Useful for spotting outlier clusters.
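
Min-max scaling maps each value x in a column to (x - min) / (max - min). A minimal two-pass awk sketch for one column follows, assuming a tab-separated file whose first column is the cluster name and whose header contains the requested metric name. The function name `minmax_normalize` and the toy table are hypothetical, and this is not IntMeta's actual implementation.

```shell
# Min-max scale one named column to the range 0..1.
# Reads the file twice: pass 1 finds min and max, pass 2 prints scaled values.
# A constant column (max == min) is reported as 0.000.
# Usage: minmax_normalize TSV COLUMN   (hypothetical helper)
minmax_normalize() {
  awk -F'\t' -v name="$2" '
    FNR == 1 {
      for (i = 1; i <= NF; i++) if ($i == name) c = i
      if (NR != FNR) printf "%s\t%s_scaled\n", $1, name
      next
    }
    NR == FNR {
      v = $c + 0
      if (!seen || v < lo) lo = v
      if (!seen || v > hi) hi = v
      seen = 1
      next
    }
    {
      s = (hi > lo) ? ($c - lo) / (hi - lo) : 0
      printf "%s\t%.3f\n", $1, s
    }
  ' "$1" "$1"
}

# Toy table for illustration only (not real VAMB output).
demo_norm=$(mktemp)
printf 'clustername\tbp\na\t100\nb\t300\nc\t200\n' > "$demo_norm"
minmax_normalize "$demo_norm" bp
# clustername  bp_scaled
# a  0.000
# b  1.000
# c  0.500
```

Because each column is scaled independently, a dark cell means "high relative to other clusters for this metric", not high in absolute terms.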

comp-quality-tiers

Grouped bar chart comparing the distribution of heuristic quality tiers (High / Medium / Low based on genome size and N50 thresholds) across samples. Reveals which sample produced more high-quality VAMB clusters.

comp-genome-size

Box plot or grouped bar chart comparing the genome size (total bp) distribution of VAMB clusters across samples. Differences may reflect varying community complexity or co-assembly depth.

comp-kind-distribution

Stacked or grouped bar chart comparing the proportion of normal, loner, and fallback clusters across samples. A higher fraction of normal clusters indicates better latent-space separation and more reliable binning.

References

[1] Nissen, J.N., Johansen, J., Allesøe, R.L. et al. Improved metagenome binning and assembly using deep variational autoencoders. Nat Biotechnol 39, 555–560 (2021).