Steinegger M. and Söding J., Clustering huge protein sequence sets in linear time. bioRxiv, doi: 10.1101/104034 (2018).
Metaclust was created by clustering and assembling 1.59 billion protein sequence fragments predicted by Prodigal in ~2200 metagenomic and metatranscriptomic datasets.
We offer two FASTA files: (1) a non-redundant (nr) set, in which we eliminated subfragments that could be aligned to a longer sequence with 99% of their residues at a sequence identity of 95%, and (2) a version clustered to 50% sequence identity at 90% coverage.
We clustered the data using Linclust, which is integrated into our free, GPLv3-licensed MMseqs2 software suite. The source code and binaries for Linclust can be downloaded from GitHub.
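As a rough illustration of how a comparable clustering run might be launched, the sketch below assembles an `mmseqs easy-linclust` command line with a 50% minimum sequence identity and 90% coverage. The file names (`proteins.fasta`, `clusterRes`, `tmp`) and the helper `build_linclust_command` are hypothetical, and the exact settings used to build Metaclust (for example the coverage mode) are not stated here, so this should be read as an assumption rather than the original pipeline.

```python
import shlex
from typing import List


def build_linclust_command(
    fasta_path: str,
    output_prefix: str,
    tmp_dir: str,
    min_seq_id: float = 0.5,
    coverage: float = 0.9,
) -> List[str]:
    """Return an argument list for an MMseqs2 Linclust run.

    The thresholds mirror the 50% identity / 90% coverage described above;
    any other parameters are left at the MMseqs2 defaults (an assumption).
    """
    return [
        "mmseqs", "easy-linclust",
        fasta_path, output_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity
        "-c", str(coverage),              # minimum alignment coverage
    ]


if __name__ == "__main__":
    # Hypothetical input and output names, shown only for illustration.
    command = build_linclust_command("proteins.fasta", "clusterRes", "tmp")
    print(shlex.join(command))
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where the `mmseqs` binary is installed and on the `PATH`.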
Each file contains the representative sequence of every cluster in FASTA format. Each FASTA entry has a unique numeric identifier.
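For readers who want to iterate over the representative sequences, a minimal FASTA reader might look like the sketch below. It assumes the identifier is the first whitespace-delimited token after `>`; the function name `read_fasta` and the example records are made up for illustration and are not taken from the Metaclust files.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from a FASTA text stream."""
    identifier: Optional[str] = None
    chunks: List[str] = []
    for raw_line in handle:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            parts = line[1:].split()
            # Assumption: the identifier is the first token of the header.
            identifier = parts[0] if parts else ""
            chunks = []
        elif identifier is not None:
            # Sequence lines may be wrapped; collect them until the next header.
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


if __name__ == "__main__":
    # Made-up example records with numeric identifiers.
    example = io.StringIO(">1\nMKV\nLLA\n>2\nMTT\n")
    for seq_id, sequence in read_fasta(example):
        print(seq_id, len(sequence))
```

Because the reader is a generator, it holds only one record in memory at a time, which matters for files of this size. For a compressed download, the same function can be given a handle from `gzip.open(path, "rt")`.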
All files are available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.