Target Categories in CASP9

Some targets are easy to predict because they have very close templates among known structures; others are quite challenging. Predictions must be evaluated with target difficulty in mind, since the performance of different algorithms depends on it. Grouping targets into categories of approximately equal prediction difficulty brings out how each method deals with different target types.

In CASP9, targets were classified into two general categories: TBM (template-based modelling) and FM (free modelling), names that reflect the method used to obtain models. Templates clearly play a role here: TBM assumes the presence of templates by definition. Does FM assume the absence of a template by definition? We would say it does not. Traditionally, predictors have thought of FM as 'hard': "free modelling" is a category where predictors are 'free' to do whatever they can, since they cannot get it right anyway :). How, then, should the CASP9 categories be defined? It has become clear over time that the best approach to prediction is to combine various methods, and that what matters is the quality of the final prediction. It is therefore logical to group targets into categories by prediction quality.

During the CASP8 season, we developed a general approach in which well-defined boundaries between categories emerge naturally from the data ¹⁾. For CASP9, we applied a very similar approach to define the target categories based on the quality of predictions. We relied on a traditional model quality metric that has stood the test of time: the LGA GDT_TS score. Targets for which domain-based evaluation is essential, as established by our analysis, were split into domains; the other targets remained whole chains and were treated as single-domain targets for evaluation purposes. This procedure resulted in 147 "domains" from 116 targets. For each of these domains, we took the GDT_TS scores of the best server models and calculated the median GDT_TS of the above-random models. In addition, the rank of the random model was recorded. These two values were used to measure each target's difficulty.
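The two difficulty measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the assessors' actual procedure: it assumes that "above random" means a GDT_TS strictly higher than that of the random model, and that rank 1 means the random model outscores every server model. The function name and the input layout are hypothetical.

```python
from statistics import median


def difficulty_measures(server_scores, random_score):
    """Return (median GDT_TS of above-random models, rank of the random model).

    server_scores: GDT_TS of the best model from each server for one domain.
    random_score:  GDT_TS of the random model for the same domain.
    Both the strict ">" comparison and the ranking convention are assumptions.
    """
    above = [s for s in server_scores if s > random_score]
    # No median is defined when no server model beats the random model.
    med = median(above) if above else None
    # The random model is ranked just behind every model that beats it.
    rank = len(above) + 1
    return med, rank


# Made-up scores for a single domain, for illustration only.
med, rank = difficulty_measures([80.0, 60.0, 40.0, 10.0], random_score=20.0)
print(med, rank)
```

Under these conventions a hard domain shows up with both a low median and a low rank of the random model, since few server models manage to beat it.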

We looked for naturally emerging clusters in these median GDT_TS scores and ranks of random models using Gaussian kernel density estimation. The Gaussian kernel density estimator is the function $\rho(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x - \mu_i)^2 / (2\sigma^2)}$, where $n$ is the number of domains, $\mu_i$ is the median GDT_TS score or the rank of the random model for domain $i$, and $\sigma$ is a standard deviation, called the bandwidth. Conceptually, each domain score generates a Gaussian centered at that score with standard deviation $\sigma$. Averaging these Gaussians gives a density function $\rho(x)$ that reveals groups of scores. Maxima of this function correspond to the group centers, and minima mark the boundaries between groups. When the bandwidth is very narrow (= very small variance), each domain forms its own group. When the bandwidth is broad (= very large variance), all domains fall into one group. Some intermediate bandwidth setting should reveal meaningful groups in the data.
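The estimator described above is short enough to write out directly. The following NumPy sketch evaluates the density on a grid and reports its interior local minima, which play the role of candidate boundaries between groups. The function names, the grid, the bandwidths and the scores are all made up for illustration and are not CASP9 data.

```python
import numpy as np


def gaussian_kde_density(x, mu, sigma):
    """Average of Gaussians centered at each value in mu, evaluated at x."""
    x = np.asarray(x, dtype=float)[:, None]    # shape (grid points, 1)
    mu = np.asarray(mu, dtype=float)[None, :]  # shape (1, n domains)
    kernels = np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
    kernels /= np.sqrt(2.0 * np.pi) * sigma
    return kernels.mean(axis=1)                # sum over domains divided by n


def local_minima(x, rho):
    """Grid positions where rho is strictly lower than both neighbours."""
    x = np.asarray(x, dtype=float)
    rho = np.asarray(rho, dtype=float)
    interior = (rho[1:-1] < rho[:-2]) & (rho[1:-1] < rho[2:])
    return x[1:-1][interior]


# Hypothetical per-domain scores forming two loose groups.
scores = np.array([20.0, 22.0, 25.0, 70.0, 72.0, 75.0, 78.0])
grid = np.linspace(0.0, 100.0, 1001)

for sigma in (1.0, 8.0, 40.0):
    rho = gaussian_kde_density(grid, scores, sigma)
    print(f"bandwidth {sigma}: boundaries at {local_minima(grid, rho)}")
```

Scanning several bandwidths in this way mirrors the plots at various bandwidths: a narrow bandwidth tends to split the data into many small groups, while a broad one merges everything into a single peak with no interior boundary.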

kernel density estimate for domain GDT_TS
Domains: Gaussian kernel density estimation of the per-domain median GDT_TS scores (best server models scoring above random), plotted at various bandwidths (= standard deviations). The median GDT_TS scores of the domains are shown as a spectrum along the horizontal axis: each bar represents a domain. The dark green bar indicates a reasonable boundary, which partitions the target domains into two categories.

kernel density estimate for rank of random model
Domains: Gaussian kernel density estimation of the rank of the random model, plotted at various bandwidths (= standard deviations). The ranks of the random models for the domains are shown as a spectrum along the horizontal axis: each bar represents a domain. The dark green bar indicates a reasonable boundary, which partitions the target domains into two categories.

We combined the median GDT_TS of the above-random models with the rank of the random model and plotted the two measures against each other. The resulting 2D plot of the CASP9 category definition is shown below:

2D plot of the CASP9 category definition
Domains that have no template available are colored red.

From the 2D plot of the CASP9 category definition, the domains located in the lower left corner clearly belong to the FM category. In addition, we treat the domains that have no template available as FM targets, which gives 30 FM targets in total.
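The assignment rule described above can be written as a small function. This is a sketch only: the text does not state the numeric boundaries, so the two cutoffs are left as parameters, and the values and domain names used in the example are placeholders rather than the CASP9 boundaries. The `Domain` record and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    name: str
    median_gdt_ts: float  # median GDT_TS of the above-random models
    random_rank: int      # rank of the random model
    has_template: bool    # whether a template is available


def assign_category(domain, gdt_cutoff, rank_cutoff):
    """Return "FM" for lower-left-corner or template-free domains, else "TBM"."""
    in_lower_left = (domain.median_gdt_ts < gdt_cutoff
                     and domain.random_rank < rank_cutoff)
    if in_lower_left or not domain.has_template:
        return "FM"
    return "TBM"


# Placeholder domains and cutoffs, for illustration only.
examples = [
    Domain("example-A", median_gdt_ts=25.0, random_rank=30, has_template=True),
    Domain("example-B", median_gdt_ts=70.0, random_rank=120, has_template=True),
    Domain("example-C", median_gdt_ts=55.0, random_rank=90, has_template=False),
]
for d in examples:
    print(d.name, assign_category(d, gdt_cutoff=40.0, rank_cutoff=60))
```

Keeping the template check separate from the corner test makes it easy to see which of the two conditions placed a given domain in FM.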

2D plot of the CASP9 category definition

¹⁾ S. Shi, J. Pei, R.I. Sadreyev, L.N. Kinch, I. Majumdar, J. Tong, H. Cheng, B.H. Kim, N.V. Grishin (2009) "Analysis of CASP8 targets, predictions and assessment methods." Database (Oxford) 2009: bap003; PMID: 20157476
