Description
A database of orthologous groups (OGs) and functional annotation.
The pipeline starts by mapping all target sequences in the database against the Pfam database, creating groups of protein sequences that consistently align with at least one known domain. These groups of sequences that share a Pfam domain are referred to as 'protein families' from this point onwards. A multidomain protein therefore appears in as many families as it has Pfam domains, and all sequences within a protein family share at least one domain. This strategy prevents non-alignable proteins from being grouped together in the same cluster, a problem we previously observed in protein families containing promiscuous domains. It also provides a domain-centric view of the orthology relationships within each set. All protein sequences without a detectable Pfam domain are clustered de novo with MMseqs, producing a large pool of putative protein families.
Next, to identify OGs within each protein family cluster, a multiple sequence alignment and a phylogenetic tree are inferred for each cluster. Sequences are aligned with MAFFT (<1000 sequences) or FAMSA (larger protein sets), uninformative alignment columns are removed with an ad hoc script, phylogenetic trees are inferred with FastTree2, and the trees are rooted using the minimum variance method from the FastRoot package.
Subsequently, the Orthologs Group Delineation (OGD) algorithm, an in-house script, is applied to scan each gene tree. The OGD algorithm identifies OGs and pairwise orthologs directly from the phylogenetic tree. It infers OGs by detecting duplication events using a modified version of the species overlap algorithm. This process defines subfamilies that represent OGs across various taxonomic levels, while maintaining the tree's inherent hierarchical structure and accommodating potential phylogenetic inconsistencies and artifacts.
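The duplication-detection step can be illustrated with a minimal sketch of the plain (unmodified) species overlap rule: an internal node is treated as a duplication when its two child subtrees share at least one species, and as a speciation otherwise. This is not the OGD script itself, whose modifications are not described here. The nested-tuple tree encoding, the `SPECIES|gene` leaf naming, and the function names `is_duplication` and `split_at_duplications` are assumptions made for illustration, and splitting the tree at duplication nodes is a simplification of how OGs are delineated.

```python
from typing import List, Set, Tuple, Union

# Assumed encoding: a tree is either a leaf name "SPECIES|gene_id"
# or a tuple holding exactly two subtrees (a rooted binary tree).
Tree = Union[str, Tuple["Tree", "Tree"]]


def leaves(tree: Tree) -> List[str]:
    """Return all leaf names below this node."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def species(tree: Tree) -> Set[str]:
    """Return the set of species codes found below this node."""
    return {leaf.split("|", 1)[0] for leaf in leaves(tree)}


def is_duplication(tree: Tree) -> bool:
    """Plain species overlap rule: the children share at least one species."""
    if isinstance(tree, str):
        return False
    left, right = tree
    return bool(species(left) & species(right))


def split_at_duplications(tree: Tree) -> List[List[str]]:
    """Descend through duplication nodes and report each subtree whose
    root is not a duplication as one candidate group (a simplification)."""
    if is_duplication(tree):
        left, right = tree
        return split_at_duplications(left) + split_at_duplications(right)
    return [sorted(leaves(tree))]


# Hypothetical gene tree: an ancestral duplication produced paralogs A and B,
# each of which was then inherited by fly, human and mouse.
example_tree: Tree = (
    (("HUMAN|A1", "MOUSE|A1"), "FLY|A"),
    (("HUMAN|B1", "MOUSE|B1"), "FLY|B"),
)

groups = split_at_duplications(example_tree)
for group in groups:
    print(group)
```

In this toy tree the root is flagged as a duplication because both child subtrees contain fly, human and mouse, so the A and B lineages are reported as two separate candidate groups. A real pipeline would read Newick trees with a phylogenetics library and would also need to handle multifurcations and the taxonomic-level logic mentioned above, which this sketch omits.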