Target Categories in CASP8

Some targets are easy to predict because they have very close templates among known structures; others are quite challenging. Predictions must be evaluated with target difficulty in mind, since the performance of different algorithms depends on it. Grouping targets into categories of approximately equal prediction difficulty brings out how each method deals with different target types.

In the early days of CASP, targets were classified into three general categories (comparative modeling, fold recognition and ab initio prediction) to reflect the method used to obtain models. Over time it became clear that the best approach is to combine various methods, because what matters is the quality of the final prediction. It is therefore logical to group targets into categories by prediction quality.

It appears that the general approach described here leads to well-defined boundaries between categories that emerge naturally from the data. The approach is rooted in the suggestion by the Baker group to use prediction scores of the top 10 models (see ROBETTA evaluation pages) and is similar to what we used for target classification in the CASP5 assessment ¹⁾. We relied on a traditional model quality metric that has stood the test of time: LGA GDT-TS scores. Targets for which domain-based evaluation is essential, as established by our analysis, were split into domains; the other targets remained whole chains and were treated as single-domain targets for evaluation purposes. This procedure resulted in 147 "domains" gathered from 125 targets. For each of these domains, the top 10 GDT-TS scores for the first server models were averaged and used as a measure of its difficulty.
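
The difficulty measure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `target_difficulty`, its `top_n` parameter and the score list are hypothetical, and the numbers are made up rather than taken from CASP8.

```python
def target_difficulty(first_model_scores, top_n=10):
    """Mean of the top_n highest GDT-TS scores among first server models.

    first_model_scores: one GDT-TS score (in percent) per server,
    taken from that server's first model for a given domain.
    """
    top = sorted(first_model_scores, reverse=True)[:top_n]
    if not top:
        raise ValueError("at least one score is required")
    return sum(top) / len(top)


# Made-up first-model scores from twelve hypothetical servers.
scores = [72.5, 70.1, 68.0, 66.4, 65.9, 65.0, 64.2, 63.8, 63.1, 62.0, 40.5, 12.3]

# Only the ten best scores contribute; the two weakest servers are ignored.
difficulty = target_difficulty(scores)
```

Averaging over the ten best servers, rather than taking the single best score, presumably makes the measure less sensitive to one unusually good model.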

We looked for naturally emerging clusters in these average GDT-TS scores using a density-based algorithm. The Gaussian kernel density estimator is the function $\rho(x) = \frac{1}{n\,\sigma\sqrt{2\pi}} \sum_{i=1}^{n} e^{-(x-\mu_i)^2/(2\sigma^2)}$, where $n$ is the number of domains, $\mu_i$ is the average GDT-TS score for domain $i$, and $\sigma$ is the standard deviation, called the bandwidth. Conceptually, each domain score generates a Gaussian centered at that score with standard deviation $\sigma$. Averaging these Gaussians gives a density function $\rho(x)$ that reveals score groups. Maxima of this function correspond to the group centers, and minima mark the boundaries between groups. When the bandwidth is very narrow (very small variance), each domain forms its own group. When the bandwidth is broad (very large variance), all domains fall into one group. Some intermediate bandwidth setting should reveal meaningful groups in the data.
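
The estimator defined above translates directly into code. The sketch below is a plain-Python illustration, not the assessors' implementation; the name `gaussian_kde` and the example scores are assumptions made for this example.

```python
import math


def gaussian_kde(x, scores, bandwidth):
    """Density rho(x): the average of Gaussians centered at each score.

    scores    : average GDT-TS scores (mu_i), one per domain
    bandwidth : standard deviation sigma shared by all Gaussians
    """
    n = len(scores)
    norm = math.sqrt(2.0 * math.pi) * bandwidth * n
    return sum(
        math.exp(-((x - mu) ** 2) / (2.0 * bandwidth ** 2)) for mu in scores
    ) / norm


# Hypothetical average GDT-TS scores (percent), not CASP8 data.
example_scores = [25.0, 41.0, 43.5, 60.0, 72.0, 74.5, 76.0, 88.0]

# Density at 73% GDT-TS for a bandwidth of 2 GDT-TS % units.
density_at_73 = gaussian_kde(73.0, example_scores, bandwidth=2.0)
```

Because each Gaussian integrates to one and the sum is divided by the number of scores, the resulting density also integrates to one regardless of the bandwidth chosen.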

We plotted the estimated densities for bandwidths varying from 0.3 to 8.2 GDT-TS % units. Bandwidths at the lower end are clearly too small and result in too many clusters (magenta curves in the plot below). Higher bandwidths around 8% (cyan curves) reveal two major clusters: a large cluster centered at about 73% GDT-TS and a smaller one around 41%. These major clusters can be used for evaluation, as they demonstrate that the dataset splits naturally into two groups, "hard" and "easy", with 52% GDT-TS defining the boundary between them: a surprisingly unsurprising cutoff. A bandwidth of 4% (black curve) yields three groups (hard, medium and easy) with cutoffs at 52% and 81%; that is, the "medium" group splits off from the former "easy" group of the two-cluster breakdown. Finally, a 2% bandwidth (yellow-framed brown curve) reveals 5 groups, which is about the right number for evaluation of predictions. The GDT-TS bounds between these groups are 30%, 52%, 67% and 81%. We term these categories FM (free modeling: predictors are free to do anything they can, yet they still fail to predict these targets correctly), FR (fold recognition, as a tribute to the historic category), and CM_H, CM_M, CM_E (comparative modeling: hard, medium and easy). We use these categories to evaluate predictions.
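
The effect of the bandwidth on the number of groups can be reproduced on toy data. The sketch below evaluates the density on a grid and reports its local minima as candidate boundaries; the helper `density_minima`, the grid step and the toy scores are assumptions for illustration and do not reproduce the CASP8 cutoffs.

```python
import math


def gaussian_kde(x, scores, bandwidth):
    """Average of Gaussians centered at each score, with shared sigma."""
    norm = math.sqrt(2.0 * math.pi) * bandwidth * len(scores)
    return sum(
        math.exp(-((x - mu) ** 2) / (2.0 * bandwidth ** 2)) for mu in scores
    ) / norm


def density_minima(scores, bandwidth, lo=0.0, hi=100.0, step=0.1):
    """Grid positions where the density has a strict local minimum."""
    steps = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(steps + 1)]
    dens = [gaussian_kde(x, scores, bandwidth) for x in grid]
    return [
        grid[i]
        for i in range(1, len(grid) - 1)
        if dens[i] < dens[i - 1] and dens[i] < dens[i + 1]
    ]


# Toy scores forming two clumps; these are not CASP8 values.
toy_scores = [38.0, 39.0, 40.0, 41.0, 42.0,
              70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0]

# Scan a few bandwidths: each minimum is a boundary between two groups.
for bw in (0.3, 2.0, 4.0, 8.0):
    bounds = density_minima(toy_scores, bw)
    print(f"bandwidth {bw}: {len(bounds)} boundaries, {len(bounds) + 1} groups")
```

On these toy scores a very narrow bandwidth separates nearly every score into its own group, while broad bandwidths leave a single boundary between the two clumps, which mirrors the behavior described for the magenta and cyan curves.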

kernel density estimate for domain GDT_TS
Domains: Gaussian kernel density estimation of domain GDT-TS scores (first-model GDT-TS averaged over the top 10 servers), plotted at various bandwidths (= standard deviations). These average GDT-TS scores for domains are shown as a spectrum along the horizontal axis: each bar represents a domain. The bars are colored according to the category suggested by this analysis: black - FM; red - FR; green - CM_H; cyan - CM_M; blue - CM_E. The family of curves with varying bandwidth is shown. Bandwidth varies from 0.3 to 8.2 GDT-TS % units with a step of 0.1, which corresponds to the color ramp from magenta through blue to cyan. The thicker curves (red, yellow-framed brown and black) correspond to bandwidths 1, 2 and 4, respectively.

Bounds found by the kernel density analysis partition target domains into the following categories.

**Target Category for Domains**
| Category | Domains |
|---|---|
| FM: Free modeling | T0397_1, T0405_2, T0407_2, T0465, T0496_1 |
| FR: Fold recognition | T0395, T0399, T0405_1, T0413, T0416_2, T0419_1, T0419_2, T0421, T0429_2, T0430, T0443_1, T0443_2, T0457_2, T0460, T0466, T0467, T0468, T0476, T0478_1, T0478_2, T0482, T0487_2, T0487_4, T0489, T0495, T0504_2, T0510_1, T0510_3, T0513, T0514 |
| CM_H: Comparative modeling: hard | T0391, T0393, T0394, T0401, T0407_1, T0409_1, T0414, T0417, T0420, T0427, T0429_1, T0434, T0436, T0446, T0449, T0454, T0457_1, T0464, T0471, T0485, T0487_1, T0487_3, T0487_5, T0498, T0501_1, T0501_2, T0506, T0507, T0510_2, T0512 |
| CM_M: Comparative modeling: medium | T0389, T0392, T0397_2, T0402, T0406, T0408, T0411, T0412, T0415, T0422, T0424, T0425, T0431, T0433, T0435, T0437, T0440, T0441, T0445, T0448, T0451, T0456, T0459, T0462_1, T0462_2, T0463, T0469, T0473, T0475, T0477, T0480, T0481, T0483, T0490, T0492, T0493, T0494, T0496_2, T0497, T0502, T0503, T0504_1, T0505, T0509, T0511 |
| CM_E: Comparative modeling: easy | T0387, T0388, T0390, T0396, T0398, T0400, T0404, T0410, T0416_1, T0418, T0423, T0426, T0428, T0432, T0438, T0442, T0444, T0447, T0450, T0452, T0453, T0455, T0458, T0461, T0470, T0472_1, T0472_2, T0474, T0479, T0484, T0486_1, T0486_2, T0488, T0491, T0499, T0504_3, T0508 |

Repeating this analysis for whole-chain predictions revealed identical trends, and the same cutoffs of 30%, 52%, 67% and 81% can be used to determine the target category from the first-model GDT-TS scores averaged over the top 10 servers:

kernel density estimate for whole chain GDT_TS
Whole chains: Gaussian kernel density estimation of whole-chain GDT-TS scores (first-model GDT-TS averaged over the top 10 servers), plotted at various bandwidths (= standard deviations). These average GDT-TS scores for whole chains are shown as a spectrum along the horizontal axis: each bar represents a target. The bars are colored according to the category suggested by this analysis: black - FM; red - FR; green - CM_H; cyan - CM_M; blue - CM_E. The family of curves with varying bandwidth is shown. Bandwidth varies from 0.3 to 8.2 GDT-TS % units with a step of 0.1, which corresponds to the color ramp from magenta through blue to cyan. The thicker curves (red, yellow-framed brown and black) correspond to bandwidths 1, 2.5 and 4, respectively.

**Target Category for Whole Chains**
| Category | Targets |
|---|---|
| FM: Free modeling | T0405, T0419, T0443, T0465, T0478, T0496, T0504 |
| FR: Fold recognition | T0395, T0397, T0399, T0407, T0409, T0413, T0421, T0429, T0430, T0457, T0460, T0462, T0466, T0467, T0468, T0476, T0482, T0487, T0489, T0495, T0501, T0510, T0513, T0514 |
| CM_H: Comparative modeling: hard | T0391, T0393, T0394, T0401, T0414, T0416, T0417, T0420, T0427, T0434, T0436, T0446, T0449, T0454, T0464, T0471, T0472, T0485, T0498, T0506, T0507, T0512 |
| CM_M: Comparative modeling: medium | T0389, T0392, T0402, T0406, T0408, T0411, T0412, T0415, T0422, T0424, T0425, T0431, T0433, T0435, T0437, T0440, T0441, T0445, T0448, T0451, T0456, T0459, T0463, T0469, T0473, T0475, T0477, T0480, T0481, T0483, T0490, T0492, T0493, T0494, T0497, T0502, T0503, T0505, T0509, T0511 |
| CM_E: Comparative modeling: easy | T0387, T0388, T0390, T0396, T0398, T0400, T0404, T0410, T0418, T0423, T0426, T0428, T0432, T0438, T0442, T0444, T0447, T0450, T0452, T0453, T0455, T0458, T0461, T0470, T0474, T0479, T0484, T0486, T0488, T0491, T0499, T0508 |
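
Once the cutoffs are fixed, assigning a category to a domain or whole chain is a simple lookup on its averaged score. The following is a minimal sketch under stated assumptions: the function name `target_category` is hypothetical, and the text does not say how a score lying exactly on a cutoff is handled, so placing it in the easier category is a choice made here for illustration only.

```python
from bisect import bisect_right

# Cutoffs (GDT-TS %) and category labels as given in the text.
CUTOFFS = [30.0, 52.0, 67.0, 81.0]
CATEGORIES = ["FM", "FR", "CM_H", "CM_M", "CM_E"]


def target_category(avg_gdt_ts):
    """Map a top-10 averaged first-model GDT-TS score to a category.

    Assumption: a score exactly equal to a cutoff is assigned to the
    easier (higher-scoring) category.
    """
    return CATEGORIES[bisect_right(CUTOFFS, avg_gdt_ts)]


# Hypothetical averaged scores, not taken from the CASP8 tables.
for score in (18.4, 45.0, 60.2, 75.9, 92.3):
    print(f"{score:5.1f}% -> {target_category(score)}")
```

`bisect_right` counts how many cutoffs are less than or equal to the score, which is exactly the index of the category in the ordered list.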

¹⁾ L.N. Kinch, J.O. Wrabl, S.S. Krishna, I. Majumdar, R.I. Sadreyev, Y. Qi, J. Pei, H. Cheng, and N.V. Grishin (2003) "CASP5 Assessment of Fold Recognition Target Predictions". Proteins 53(S6): 395-409.
