CoCoNat - Bologna Biocomputing Group

CoCoNat: coiled-coil prediction

CoCoNat consists of a three-step procedure that combines a deep learning architecture, a conditional random field, and a single-layer neural network in a cascading way. The first two steps detect coiled-coil segment boundaries and annotate heptad repeat registers at the residue level within each predicted segment. The third step is dedicated to predicting the oligomerization state of each segment.

Datasets

We trained CoCoNat on a dataset comprising 2191 proteins containing coiled-coil domains (CCDs) and 9040 proteins not endowed with CCDs (negative examples). We tested on a blind test set comprising 429 CCDs and 278 non-CCD proteins. Both datasets are derived from the literature and are available in the Datasets section of this site.

Input encoding

CoCoNat makes use of residue representations from large-scale protein Language Models (pLMs). We adopted two state-of-the-art pLMs: ProtT5 and ESM2, generating residue embeddings of 1024 and 1280 features, respectively. ProtT5 and ESM2 embeddings are concatenated together, leading to vectors of dimension 2304 for representing each residue in the sequence.
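The concatenation described above can be illustrated with a minimal sketch. It uses placeholder vectors instead of real pLM output, and the function name `concat_embeddings` is hypothetical; only the sizes 1024, 1280 and 2304 come from the text.

```python
PROTT5_DIM = 1024  # ProtT5 per-residue embedding size (from the text)
ESM2_DIM = 1280    # ESM2 per-residue embedding size (from the text)


def concat_embeddings(prott5, esm2):
    """Concatenate two per-residue embedding matrices, residue by residue.

    Each input is a list holding one vector (list of floats) per residue.
    """
    if len(prott5) != len(esm2):
        raise ValueError("both models must embed the same number of residues")
    combined = []
    for t5_vec, esm_vec in zip(prott5, esm2):
        if len(t5_vec) != PROTT5_DIM or len(esm_vec) != ESM2_DIM:
            raise ValueError("unexpected embedding size")
        # ProtT5 features come first, ESM2 features second.
        combined.append(list(t5_vec) + list(esm_vec))
    return combined


# Placeholder vectors for a 5-residue sequence (not real embeddings).
seq_len = 5
prott5 = [[0.0] * PROTT5_DIM for _ in range(seq_len)]
esm2 = [[1.0] * ESM2_DIM for _ in range(seq_len)]
features = concat_embeddings(prott5, esm2)
print(len(features), len(features[0]))
```

The order of the two blocks inside the concatenated vector (ProtT5 first) is an assumption of this sketch; the text only states the final dimension.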

Detection of coiled-coil segments and registers

The first step is based on a convolutional layer (CNN) followed by a Long Short-Term Memory (LSTM) layer. The CNN takes as input the 2304-dimensional embedding vector of each residue and outputs a 40-feature vector, applying a kernel of size 15. An LSTM layer with 128 cells follows, then a dense layer of size 64 and a fully connected output layer of size 8 (one unit for each of the seven possible coiled-coil registers plus one for non-coil residues) with a sigmoid activation function. The final outputs of this step are therefore per-residue probabilities for these 8 classes.
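To give a feel for the size of the step 1 network, the following sketch counts trainable parameters per layer from the sizes quoted above. It is a rough estimate under stated assumptions, not a description of the released model: the LSTM is assumed to be unidirectional with one bias vector per gate, and every layer is assumed to have a bias term. The helper names are hypothetical.

```python
def conv1d_params(in_features, out_features, kernel_size):
    # One weight per (input feature, kernel position, output feature),
    # plus one bias per output feature.
    return in_features * kernel_size * out_features + out_features


def lstm_params(in_features, cells):
    # Four gates, each with input weights, recurrent weights and a bias.
    # Assumption: unidirectional LSTM, single bias vector per gate.
    return 4 * (cells * (in_features + cells) + cells)


def dense_params(in_features, out_features):
    # Fully connected layer with bias.
    return in_features * out_features + out_features


layers = [
    ("conv1d, kernel 15, 2304 -> 40", conv1d_params(2304, 40, 15)),
    ("lstm, 40 -> 128", lstm_params(40, 128)),
    ("dense, 128 -> 64", dense_params(128, 64)),
    ("output, 64 -> 8", dense_params(64, 8)),
]
for name, count in layers:
    print(f"{name}: {count} parameters")
total_params = sum(count for _, count in layers)
print("total:", total_params)
```

Under these assumptions the convolutional layer dominates the count, because it sees all 2304 input features at each of the 15 kernel positions.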
The second step takes as input the probabilities computed in step 1 and is based on Grammatical-Restrained Hidden Conditional Random Fields (GRHCRFs). The output of step 2 is computed by a Posterior-Viterbi decoder and assigns each residue to one of 8 possible classes: a-g for heptad registers, or i for residues outside coiled-coil regions.
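Given a residue-level labelling of this kind, coiled-coil segment boundaries follow directly from the runs of non-i labels. The sketch below shows one way to extract them. It assumes the labelling is available as a plain string with one character per residue, and the function name `labels_to_segments` is hypothetical; it does not reproduce the GRHCRF or the Posterior-Viterbi decoding itself.

```python
def labels_to_segments(labels):
    """Return (start, end) pairs for every run of non-'i' labels.

    Positions are 0-based and the end is exclusive, as in Python slicing.
    """
    segments = []
    start = None
    for pos, label in enumerate(labels):
        if label != "i" and start is None:
            start = pos                      # a coiled-coil segment opens
        elif label == "i" and start is not None:
            segments.append((start, pos))    # the segment closes
            start = None
    if start is not None:
        # The sequence ends inside a coiled-coil segment.
        segments.append((start, len(labels)))
    return segments


# Invented labelling for illustration only.
labelling = "iiiabcdefgabcdiiiiefgabcdefgii"
for start, end in labels_to_segments(labelling):
    print(start, end, labelling[start:end])
```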

Prediction of coiled-coil oligomerization state

The prediction of the oligomerization state is addressed using a simple feed-forward network with a single hidden layer comprising 128 neurons and four output units corresponding to the four possible oligomerization states: parallel dimer, antiparallel dimer, trimer, and tetramer.
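The following sketch shows what a forward pass through a network of this shape could look like. Only the hidden size (128) and the four output states come from the text. The input size, the ReLU hidden activation, the softmax readout and the random weights are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random

STATES = ["parallel dimer", "antiparallel dimer", "trimer", "tetramer"]
HIDDEN = 128  # hidden layer size (from the text)


def dense(x, weights, biases):
    # weights holds one row of input weights per output unit.
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_state(x, w1, b1, w2, b2):
    hidden = [max(0.0, h) for h in dense(x, w1, b1)]  # ReLU: an assumption
    probs = softmax(dense(hidden, w2, b2))            # softmax: an assumption
    best = max(range(len(STATES)), key=lambda k: probs[k])
    return STATES[best], probs


rng = random.Random(0)
n_in = 16  # placeholder input size; the text does not specify the input features
w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(len(STATES))]
b2 = [0.0] * len(STATES)
x = [rng.uniform(-1.0, 1.0) for _ in range(n_in)]

state, probs = predict_state(x, w1, b1, w2, b2)
print(state, [round(p, 3) for p in probs])
```

With untrained random weights the predicted state is meaningless; the sketch only illustrates the data flow from a segment-level feature vector to one of the four states.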

CoCoNat performance

CoCoNat performance on coiled-coil segment detection, register prediction and oligomeric state prediction is reported in Tables 1, 2 and 3, respectively. Performance scores are computed on a blind test set comprising 718 proteins (see the Datasets section of this site). All results are taken from the published CoCoNat reference paper.

Table 1. Performance of CoCoNat on the detection of coiled-coil segments

PRE_R	REC_R	F1_R	PRAUC	PRE_S	REC_S	F1_S	SOV_O	SOV_P
0.55	0.53	0.54	0.46	0.57	0.43	0.49	54.35	66.93

PRE_R, REC_R and F1_R: Precision, Recall and F1 score, respectively, computed at the residue level; PRE_S, REC_S and F1_S: Precision, Recall and F1 score, respectively, computed at the coiled-coil segment level; PRAUC: precision-recall area under the curve; SOV_O, SOV_P: segment overlap (SOV) measures, one taking as reference observed residues (SOV_O) and one taking as reference predicted residues (SOV_P)
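As a reminder of how the residue-level scores relate to each other, the sketch below computes precision, recall and F1 from two label strings. It treats any label other than i as a coiled-coil residue, which is an assumption about how detection is scored, and the function names are hypothetical. It also checks that the F1 values in Table 1 are consistent with the reported (rounded) precision and recall.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def residue_scores(observed, predicted):
    """Residue-level precision, recall and F1 for coiled-coil detection.

    Assumption: a residue counts as coiled-coil when its label is not 'i'.
    """
    if len(observed) != len(predicted):
        raise ValueError("label strings must have the same length")
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o != "i" and p != "i")
    fp = sum(1 for o, p in pairs if o == "i" and p != "i")
    fn = sum(1 for o, p in pairs if o != "i" and p == "i")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Invented example: the predicted segment is shifted by one residue.
precision, recall, f1 = residue_scores("iiabcdefgii", "iiiabcdefgi")
print(round(precision, 3), round(recall, 3), round(f1, 3))

# Consistency check against the rounded values in Table 1.
print(round(f1_score(0.55, 0.53), 2), round(f1_score(0.57, 0.43), 2))
```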

Table 2. Performance of CoCoNat on the coiled-coil register prediction

MCC(a)	MCC(b)	MCC(c)	MCC(d)	MCC(e)	MCC(f)	MCC(g)
0.84	0.84	0.84	0.84	0.83	0.83	0.83

MCC(x): Matthews Correlation Coefficient computed for the coiled-coil register type x in the set [a,b,c,d,e,f,g]
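A per-register MCC of this kind can be computed in a one-vs-rest fashion: residues labelled x are positives and all other residues are negatives. The sketch below assumes observed and predicted labellings are given as equal-length strings; which residues enter the evaluation (for example, only those inside coiled-coil segments) is not specified here, and the function names are hypothetical.

```python
import math


def mcc(tp, tn, fp, fn):
    # Matthews Correlation Coefficient from a 2x2 confusion matrix.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denom


def register_mcc(observed, predicted, register):
    """One-vs-rest MCC for a single register label (e.g. 'a')."""
    if len(observed) != len(predicted):
        raise ValueError("label strings must have the same length")
    tp = tn = fp = fn = 0
    for o, p in zip(observed, predicted):
        if o == register and p == register:
            tp += 1
        elif o != register and p == register:
            fp += 1
        elif o == register and p != register:
            fn += 1
        else:
            tn += 1
    return mcc(tp, tn, fp, fn)


# Invented examples: a perfect prediction and one shifted by a residue.
observed = "abcdefgabcdefg"
perfect = register_mcc(observed, "abcdefgabcdefg", "a")
shifted = register_mcc(observed, "bcdefgabcdefga", "a")
print(perfect, round(shifted, 3))
```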

Table 3. Performance of CoCoNat on the prediction of coiled-coil oligomeric states

MCC(parallel dimer)	MCC(antiparallel dimer)	MCC(trimer)	MCC(tetramer)
0.66	0.70	0.50	0.46

MCCs are computed independently for each possible oligomeric state: parallel/antiparallel dimers, trimers and tetramers.