ArchaeaHQ
A curated collection of high-quality archaeal genomes linked with environmental metadata — built for ecology-driven comparative genomics.
21,644
HQ Genomes
35,993
NCBI Source Genomes
14,363
Georeferenced Genomes
≥90%
Min. Completeness
From NCBI to ArchaeaHQ
ArchaeaHQ was built by applying a rigorous quality-filtering pipeline to all archaeal genomes publicly available in NCBI. The starting pool comprised 35,993 archaeal genomes, each of which was evaluated for completeness and contamination using CheckM2.
Only genomes meeting both quality thresholds (≥90% completeness and ≤5% contamination) were retained, yielding a final collection of 21,644 high-quality assemblies, about 60% of the starting pool, that spans a wide range of archaeal diversity.
Fig. 1 — Sankey diagram illustrating the sequential filtering steps from raw NCBI archaeal genomes (35,993) to the final ArchaeaHQ collection (21,644 high-quality genomes). Colour bands represent quality tiers applied at each filtering stage.
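The threshold step can be illustrated with a minimal Python sketch. It is not the ArchaeaHQ pipeline itself: it assumes a tab-separated quality report with `Name`, `Completeness`, and `Contamination` columns (the layout CheckM2 is assumed to write), and the function name `filter_hq` and the toy values are hypothetical.

```python
import csv
import io

MIN_COMPLETENESS = 90.0   # percent, threshold stated in the text
MAX_CONTAMINATION = 5.0   # percent, threshold stated in the text


def filter_hq(report, min_comp=MIN_COMPLETENESS, max_cont=MAX_CONTAMINATION):
    """Return the names of genomes that pass both thresholds.

    `report` is any file-like object holding a tab-separated table with
    'Name', 'Completeness' and 'Contamination' columns (assumed layout).
    """
    reader = csv.DictReader(report, delimiter="\t")
    return [
        row["Name"]
        for row in reader
        if float(row["Completeness"]) >= min_comp
        and float(row["Contamination"]) <= max_cont
    ]


# Toy report with invented values, for illustration only.
example = io.StringIO(
    "Name\tCompleteness\tContamination\n"
    "genome_A\t98.7\t0.4\n"
    "genome_B\t89.9\t1.2\n"
    "genome_C\t95.0\t5.0\n"
    "genome_D\t99.1\t7.3\n"
)
hq = filter_hq(example)
print(hq)  # ['genome_A', 'genome_C']
```

Both bounds are inclusive, so a genome at exactly 90% completeness or 5% contamination is kept, matching the ≥ and ≤ signs used above.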
Genome Quality Metrics
Every genome in ArchaeaHQ was assessed using CheckM2, which predicts completeness and contamination with machine-learning models rather than the lineage-specific marker gene sets used by the original CheckM. The scatter plot confirms the high-quality nature of the final collection.
Genomes cluster in the high-completeness, low-contamination region, as expected from the stringent thresholds applied during curation. This ensures that downstream comparative analyses are built on a reliable genomic foundation.
Fig. 2 — Scatter plot of genome completeness (x-axis) versus contamination (y-axis) for all ArchaeaHQ entries. Each dot represents one genome. Colour gradient reflects the contamination level, with darker tones indicating lower contamination.
Global Distribution
ArchaeaHQ genomes originate from sampling sites spanning all continents and major ocean basins. Environmental metadata, including geographic coordinates, habitat type, and sampling depth, was manually curated and linked to genomes where available; 14,363 of the 21,644 genomes are georeferenced.
This geographic breadth enables large-scale biogeographic analyses, allowing researchers to investigate how archaeal diversity and community composition vary across latitudinal gradients, climate zones, and ecosystem types.
Fig. 3 — Geographic distribution of the 14,363 georeferenced genomes in ArchaeaHQ. Each dot marks a sampling site. Dot colour reflects archaeal phylum affiliation; dot size is proportional to the number of genomes recovered from that location.
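As a rough illustration of a latitudinal analysis, the sketch below bins georeferenced genomes into fixed-width latitude bands and counts them. The record layout (a `lat` field in decimal degrees), the function names, and the coordinates are assumptions made for this example; they do not describe the ArchaeaHQ data schema.

```python
from collections import Counter


def latitude_band(lat, width=30):
    """Return a label such as '30 to 60' for the band containing `lat`.

    `width` is the band size in degrees and should divide 180 evenly.
    """
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude out of range: {lat}")
    # Floor to the lower band edge; clamp so that exactly 90 lands in the top band.
    lower = min(int((lat + 90) // width) * width - 90, 90 - width)
    return f"{lower} to {lower + width}"


def count_by_band(records, width=30):
    """Count genomes per latitude band."""
    return Counter(latitude_band(rec["lat"], width) for rec in records)


# Invented records, for illustration only.
records = [
    {"genome": "g1", "lat": 64.1},
    {"genome": "g2", "lat": 12.5},
    {"genome": "g3", "lat": -33.9},
    {"genome": "g4", "lat": 9.8},
    {"genome": "g5", "lat": 90.0},
]
counts = count_by_band(records)
for band, n in sorted(counts.items()):
    print(band, n)
```

Counts per band are only a starting point; comparing diversity across bands would also need to account for uneven sampling effort between regions.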
Habitat Coverage
ArchaeaHQ spans a broad spectrum of environments — from marine water columns and sediments to terrestrial soils, hot springs, hydrothermal vents, and host-associated niches. Asgard archaea and DPANN superphylum members are both well represented.
This environmental diversity makes ArchaeaHQ particularly suited for ecology-driven comparative genomics, enabling researchers to link genomic traits with specific habitat characteristics and uncover the ecological drivers of archaeal evolution.
Fig. 4 — Environmental distribution of ArchaeaHQ genomes shown as a parliament-style chart. Each segment represents a major habitat category. Colour coding distinguishes broad environmental groups including marine, terrestrial, engineered, and host-associated biomes.
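Linking a genomic trait to habitat can start with a simple grouped summary. The sketch below computes the median of one trait per habitat category; the field names (`habitat`, `genome_size_mbp`), the function name, and all values are hypothetical and chosen only to show the shape of the analysis.

```python
from collections import defaultdict
from statistics import median


def median_trait_by_habitat(records, trait):
    """Group records by habitat and return the median of `trait` per group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["habitat"]].append(rec[trait])
    return {habitat: median(values) for habitat, values in groups.items()}


# Invented records, for illustration only.
records = [
    {"genome": "g1", "habitat": "marine", "genome_size_mbp": 1.2},
    {"genome": "g2", "habitat": "marine", "genome_size_mbp": 1.6},
    {"genome": "g3", "habitat": "hot spring", "genome_size_mbp": 2.4},
    {"genome": "g4", "habitat": "marine", "genome_size_mbp": 1.4},
    {"genome": "g5", "habitat": "hot spring", "genome_size_mbp": 2.0},
]
summary = median_trait_by_habitat(records, "genome_size_mbp")
for habitat, value in sorted(summary.items()):
    print(f"{habitat}: {value:.1f} Mbp")
```

A median is used here because it is less sensitive to a few unusually large or small genomes than a mean; a real analysis would also need to control for phylogeny, since related genomes are not independent observations.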
Use ArchaeaHQ in Your Research
ArchaeaHQ is freely available. Reach out to discuss collaborations, data access, or training in archaeal comparative genomics.
Get in Touch