CASP8

Targets
T0387 – T0514

Target Categories in CASP8

Some targets are easy to predict because they have very close templates among known structures, while others are quite challenging. It is essential to evaluate predictions with target difficulty taken into account, since the performance of different algorithms depends on it. Grouping targets into categories of approximately the same prediction difficulty brings out how each method deals with different target types.

In the early days of CASP, targets were classified into three general categories, comparative modeling, fold recognition and ab initio prediction, to reflect the method used to obtain models. It became clear with time that the best approach is to use a combination of various methods, as what matters is the quality of the final prediction. Therefore, it is logical to group targets into categories by prediction quality.

It appears that the general approach described here leads to well-defined boundaries between categories that emerge naturally from the data. The approach is rooted in the suggestion by the Baker group to use prediction scores of the top 10 models (see the ROBETTA evaluation pages) and is similar to what we used for target classification in the CASP5 assessment 1). We resorted to a traditional model quality metric that has stood the test of time: LGA GDT-TS scores. Targets for which domain-based evaluation is essential, as established by our analysis, were split into domains; the remaining targets were kept as whole chains and treated as single-domain targets for evaluation purposes. This procedure resulted in 147 "domains" gathered from 125 targets. For each of these domains, the top 10 GDT-TS scores among the first server models were averaged and used as a measure of that target's difficulty.
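As a rough illustration only, the Python sketch below computes this difficulty measure for one domain; the data layout, scores and names are hypothetical and not taken from the assessment pipeline.

def domain_difficulty(gdt_ts_scores, top_n=10):
    """Average the top_n highest GDT-TS scores (first server models) for one domain."""
    best = sorted(gdt_ts_scores, reverse=True)[:top_n]
    return sum(best) / len(best)

# Hypothetical GDT-TS scores of first server models for an illustrative domain
scores = [78.2, 75.9, 74.1, 73.0, 72.5, 71.8, 70.4, 69.9, 68.3, 67.7, 55.0, 42.6]
print(domain_difficulty(scores))  # difficulty = mean of the 10 best scores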

We looked for naturally emerging clusters in these average GDT-TS scores using a density-based algorithm. The Gaussian kernel density estimator is the function ρ(x) = (1 / (n σ √(2π))) Σi=1..n exp(−(x − μi)² / (2σ²)), where n is the number of domains, μi is the average GDT-TS score for domain i, and σ is a standard deviation called the bandwidth. Conceptually, each domain score generates a Gaussian centered at that score with standard deviation σ. Averaging these Gaussians gives a density function ρ(x) that reveals score groups. Maxima of this function correspond to the group centers, and minima mark the boundaries between groups. When the bandwidth is very narrow (variance very small), each domain forms its own group. When the bandwidth is broad (variance very large), all domains fall into a single group. Some intermediate, optimal bandwidth setting should reveal meaningful groups in the data.
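A minimal sketch of this estimator, assuming the per-domain average GDT-TS scores are already collected in a Python list (numpy is used for convenience; this is not the actual assessment code):

import numpy as np

def kde(x, scores, bandwidth):
    """Gaussian kernel density estimate rho(x) over the per-domain scores."""
    mu = np.asarray(scores, dtype=float)       # average GDT-TS score of each domain
    z = (x - mu) / bandwidth                   # distance to each Gaussian center, in sigmas
    return np.exp(-0.5 * z**2).sum() / (mu.size * bandwidth * np.sqrt(2.0 * np.pi))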

We plotted the estimated densities for varying bandwidths, from 0.3 to 8.2 GDT-TS % units. Bandwidths at the lower end are too small and result in too many clusters (magenta curves on the plot below). Higher bandwidths around 8% (cyan curves) reveal two major clusters: a large cluster centered at about 73% GDT-TS and a smaller one around 41%. These major clusters can be used for evaluation, as they demonstrate that the dataset splits naturally into two groups, "hard" and "easy", with 52% GDT-TS defining the boundary between them: a surprisingly unsurprising cutoff. A bandwidth of 4% (black curve) yields three groups, hard, medium and easy, with cutoffs at 52% and 81%, i.e. the "medium" group splits off from the "easy" group of the two-cluster breakdown. Finally, a 2% bandwidth (yellow-framed brown curve) reveals 5 groups, and this is about the right number for evaluation of predictions. The GDT-TS bounds between these groups are 30%, 52%, 67% and 81%. We term these categories FM (free modeling: predictors are free to do anything they can, yet they still fail to predict these targets correctly), FR (fold recognition, as a tribute to the historic category), and CM_H, CM_M, CM_E (comparative modeling: hard, medium and easy). We use these categories to evaluate predictions.
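The following sketch, under the same assumptions as above, evaluates the density on a GDT-TS grid, takes local minima of the curve as group boundaries, and maps an averaged score to one of the five categories using the cutoffs quoted above; the function and variable names are illustrative, not part of the assessment code.

import numpy as np

def kde_curve(grid, scores, bandwidth):
    """Gaussian kernel density estimate evaluated on a grid of GDT-TS values."""
    z = (grid[:, None] - np.asarray(scores, dtype=float)[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(scores) * bandwidth * np.sqrt(2.0 * np.pi))

def group_boundaries(scores, bandwidth):
    """Local minima of the density curve mark the boundaries between score groups."""
    grid = np.arange(0.0, 100.1, 0.1)
    d = kde_curve(grid, scores, bandwidth)
    return [float(grid[i]) for i in range(1, len(grid) - 1)
            if d[i] < d[i - 1] and d[i] < d[i + 1]]

def target_category(avg_gdt_ts, cutoffs=(30.0, 52.0, 67.0, 81.0),
                    labels=("FM", "FR", "CM_H", "CM_M", "CM_E")):
    """Assign a difficulty category from an averaged top-10 GDT-TS score (%)."""
    for cutoff, label in zip(cutoffs, labels):
        if avg_gdt_ts < cutoff:
            return label
    return labels[-1]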

The bounds found by the kernel density analysis partition the target domains into the following categories.

Target Category for Domains

FM (free modeling):
T0397_1, T0405_2, T0407_2, T0465, T0496_1

FR (fold recognition):
T0395, T0399, T0405_1, T0413, T0416_2, T0419_1, T0419_2, T0421, T0429_2, T0430, T0443_1, T0443_2, T0457_2, T0460, T0466, T0467, T0468, T0476, T0478_1, T0478_2, T0482, T0487_2, T0487_4, T0489, T0495, T0504_2, T0510_1, T0510_3, T0513, T0514

CM_H (comparative modeling, hard):
T0391, T0393, T0394, T0401, T0407_1, T0409_1, T0414, T0417, T0420, T0427, T0429_1, T0434, T0436, T0446, T0449, T0454, T0457_1, T0464, T0471, T0485, T0487_1, T0487_3, T0487_5, T0498, T0501_1, T0501_2, T0506, T0507, T0510_2, T0512

CM_M (comparative modeling, medium):
T0389, T0392, T0397_2, T0402, T0406, T0408, T0411, T0412, T0415, T0422, T0424, T0425, T0431, T0433, T0435, T0437, T0440, T0441, T0445, T0448, T0451, T0456, T0459, T0462_1, T0462_2, T0463, T0469, T0473, T0475, T0477, T0480, T0481, T0483, T0490, T0492, T0493, T0494, T0496_2, T0497, T0502, T0503, T0504_1, T0505, T0509, T0511

CM_E (comparative modeling, easy):
T0387, T0388, T0390, T0396, T0398, T0400, T0404, T0410, T0416_1, T0418, T0423, T0426, T0428, T0432, T0438, T0442, T0444, T0447, T0450, T0452, T0453, T0455, T0458, T0461, T0470, T0472_1, T0472_2, T0474, T0479, T0484, T0486_1, T0486_2, T0488, T0491, T0499, T0504_3, T0508

This analysis, repeated for the whole-chain predictions, revealed identical trends, and the same cutoffs of 30%, 52%, 67% and 81% can be used to determine the target category from the averaged top 10 first-model GDT_TS scores:

Target Category for Whole Chains

FM (free modeling):
T0405, T0419, T0443, T0465, T0478, T0496, T0504

FR (fold recognition):
T0395, T0397, T0399, T0407, T0409, T0413, T0421, T0429, T0430, T0457, T0460, T0462, T0466, T0467, T0468, T0476, T0482, T0487, T0489, T0495, T0501, T0510, T0513, T0514

CM_H (comparative modeling, hard):
T0391, T0393, T0394, T0401, T0414, T0416, T0417, T0420, T0427, T0434, T0436, T0446, T0449, T0454, T0464, T0471, T0472, T0485, T0498, T0506, T0507, T0512

CM_M (comparative modeling, medium):
T0389, T0392, T0402, T0406, T0408, T0411, T0412, T0415, T0422, T0424, T0425, T0431, T0433, T0435, T0437, T0440, T0441, T0445, T0448, T0451, T0456, T0459, T0463, T0469, T0473, T0475, T0477, T0480, T0481, T0483, T0490, T0492, T0493, T0494, T0497, T0502, T0503, T0505, T0509, T0511

CM_E (comparative modeling, easy):
T0387, T0388, T0390, T0396, T0398, T0400, T0404, T0410, T0418, T0423, T0426, T0428, T0432, T0438, T0442, T0444, T0447, T0450, T0452, T0453, T0455, T0458, T0461, T0470, T0474, T0479, T0484, T0486, T0488, T0491, T0499, T0508

1) L.N. Kinch, J.O. Wrabl, S.S. Krishna, I. Majumdar, R.I. Sadreyev, Y. Qi, J. Pei, H. Cheng, and N.V. Grishin (2003) "CASP5 Assessment of Fold Recognition Target Predictions". Proteins 53(S6): 395-409.