Steinegger M. and Söding J., Clustering huge protein sequence sets in linear time. bioRxiv, doi: 10.1101/104034, 2018.
Metaclust was created by clustering and assembling 1.59 billion protein sequence fragments predicted by Prodigal in ~2200 metagenomic and metatranscriptomic datasets.
We offer two FASTA files: (1) a non-redundant (nr) version, from which we eliminated subfragments that could be aligned to a longer sequence with 99% of their residues at a sequence identity of 95%, and (2) a version clustered to 50% sequence identity at 90% coverage.
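The redundancy criterion for the nr file can be written as a small predicate. This is only a sketch of the rule as stated above: the function name, its parameters, and the choice of inclusive thresholds are assumptions, and it does not reproduce how MMseqs2 actually computes coverage or identity.

```python
def is_redundant_fragment(aligned_residues, fragment_length, target_length,
                          seq_identity, min_coverage=0.99, min_identity=0.95):
    """Return True if a fragment would count as a redundant subfragment.

    Hypothetical helper: the thresholds come from the text above, but
    treating them as inclusive (>=) is an assumption.
    """
    if fragment_length <= 0:
        raise ValueError("fragment_length must be positive")
    # The fragment must align to a strictly longer sequence.
    if target_length <= fragment_length:
        return False
    # Fraction of the fragment's residues covered by the alignment.
    coverage = aligned_residues / fragment_length
    return coverage >= min_coverage and seq_identity >= min_identity
```

Under these assumptions, a 200-residue fragment that aligns over its full length to a 350-residue sequence at 97% identity would be dropped, while the same fragment aligned over only 150 residues would be kept.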
We clustered the data using Linclust, which is integrated into our free, GPLv3-licensed MMseqs2 software suite. The source code and binaries for Linclust can be downloaded from GitHub.
Each file contains the representative sequences of every cluster in FASTA format. In addition, the file metaclust_50_cluster.fasta contains all sequences of each cluster in a FASTA-like format.
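Because the files hold a very large number of sequences, it may help to stream them record by record rather than load them into memory. The following is a minimal sketch of a generic FASTA reader; the function name `read_fasta` is illustrative, and it is intended for the plain FASTA files of representative sequences. The FASTA-like cluster file may need different handling, which is not covered here.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text stream.

    The header is returned without the leading '>'; sequence lines that
    are wrapped over several lines are joined into one string.
    """
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)
```

A typical use would be `with open(path) as fh: for header, seq in read_fasta(fh): ...`, where `path` points to one of the downloaded files after decompression.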
The FASTA header contains an ID and the Prodigal output. The IDs come from different sources: UniProt, JGI, NCBI or OM-RGC. The JGI data can be accessed by entering the first part of the FASTA identifier (here RifCSPlowO2_12) in the organism field of the following URL: https://genome.jgi.doe.gov/portal/pages/dynamicOrganismDownload.jsf?organism=RifCSPlowO2_12
>RifCSPlowO2_12_1023861.scaffolds.fasta_scaffold367679_1 # 24 # 428 # -1 # ID=367679_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.435
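A header like the one above can be split into its ID and the Prodigal fields with a few lines of Python. This is a sketch that assumes the standard Prodigal layout (start, end, strand, then semicolon-separated key=value pairs, all separated by " # "); the names `parse_header` and `jgi_url` are illustrative. Since the text does not define exactly how the organism prefix is delimited within the identifier, `jgi_url` takes the prefix as an argument instead of guessing it.

```python
from urllib.parse import urlencode

JGI_PORTAL = "https://genome.jgi.doe.gov/portal/pages/dynamicOrganismDownload.jsf"


def parse_header(line):
    """Split a header line into the sequence ID and the Prodigal fields.

    Assumes the layout '>id # start # end # strand # key=value;...'.
    """
    fields = line.strip().lstrip(">").split(" # ", 4)
    if len(fields) != 5:
        raise ValueError("header does not follow the assumed Prodigal layout")
    seq_id, start, end, strand, attrs = fields
    # Attribute values are kept as strings, e.g. partial='00'.
    attributes = dict(item.split("=", 1) for item in attrs.split(";") if item)
    return {
        "id": seq_id,
        "start": int(start),
        "end": int(end),
        "strand": int(strand),  # 1 = forward strand, -1 = reverse strand
        "attributes": attributes,
    }


def jgi_url(organism):
    """Build the JGI portal URL for a caller-supplied organism prefix."""
    return f"{JGI_PORTAL}?{urlencode({'organism': organism})}"
```

For the example header, `parse_header` yields the ID `RifCSPlowO2_12_1023861.scaffolds.fasta_scaffold367679_1` with coordinates 24 to 428 on the reverse strand, and `jgi_url("RifCSPlowO2_12")` reproduces the URL given above.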
All files are available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.