Calculates the property Best-zhh.
The document contains these additional sections of information:
Cross-validation Results
Enrichment Results
Percentile Results
Category Statistics Results
Non-validated Models Results
Training Data Information
Model Construction Information
Model Construction Parameters
Leave-one-out Cross-Validation Results
This model was built using 232 samples and validated by leave-one-out cross-validation. Each sample was left out in turn, a model was built from the remaining samples, and that model was used to predict the left-out sample. Once every sample had a prediction, a ROC plot was generated and the area under the curve (XV ROC AUC) was calculated.
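The leave-one-out procedure and the ROC AUC can be sketched in a few lines of Python. This is a minimal illustration, not the component's implementation: the toy learner (`fit_midpoint`, `predict_offset`) and the sample data are hypothetical, and the AUC is computed with the standard pairwise-ranking definition.

```python
def leave_one_out_scores(X, y, fit, predict):
    """Score each sample with a model trained on all the other samples."""
    scores = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = fit(train_X, train_y)
        scores.append(predict(model, X[i]))
    return scores


def roc_auc(scores, labels):
    """Probability that a random member outscores a random nonmember (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one member and one nonmember")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy learner: the model is the midpoint between the class means.
def fit_midpoint(train_X, train_y):
    members = [x for x, l in zip(train_X, train_y) if l == 1]
    others = [x for x, l in zip(train_X, train_y) if l == 0]
    return (sum(members) / len(members) + sum(others) / len(others)) / 2


def predict_offset(model, x):
    return x - model


# Hypothetical one-descriptor data set: 1 = category member, 0 = nonmember.
X = [0.1, 0.4, 0.35, 0.8, 0.9, 0.7]
y = [0, 0, 0, 1, 1, 1]

scores = leave_one_out_scores(X, y, fit_midpoint, predict_offset)
auc = roc_auc(scores, y)
print(auc)
```

Because every score comes from a model that never saw the sample being scored, the resulting AUC estimates performance on unseen data rather than fit to the training set.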
Best Split was calculated by picking the split that minimized the sum of the percent misclassified for category members and for category nonmembers, using the cross-validated score for each sample. Using that split, a contingency table was constructed, containing the number of true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN).
Output | XV ROC AUC | Best Split | TP/FN FP/TN | # in Category |
---|---|---|---|---|
Best-zhh | 0.808 | 0.189 | 110/51 14/57 | 161 |
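The Best Split criterion and the contingency table can be illustrated as follows. This is a sketch under two assumptions that the report does not state: samples scoring at or above the split are predicted as members, and only observed scores are tried as candidate splits. The last lines recompute summary rates from the reported TP/FN and FP/TN counts.

```python
def contingency(scores, labels, split):
    """Count TP, FN, FP, TN when samples scoring >= split are predicted as members."""
    tp = sum(1 for s, l in zip(scores, labels) if l == 1 and s >= split)
    fn = sum(1 for s, l in zip(scores, labels) if l == 1 and s < split)
    fp = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= split)
    tn = sum(1 for s, l in zip(scores, labels) if l == 0 and s < split)
    return tp, fn, fp, tn


def misclassification_sum(tp, fn, fp, tn):
    """Fraction of members missed plus fraction of nonmembers wrongly accepted."""
    return fn / (tp + fn) + fp / (fp + tn)


def best_split(scores, labels):
    """Try each observed score as a split; keep the one with the lowest sum.

    Both classes must be present, otherwise the fractions are undefined.
    """
    return min(
        sorted(set(scores)),
        key=lambda s: misclassification_sum(*contingency(scores, labels, s)),
    )


# Counts reported for Best-zhh: TP/FN = 110/51, FP/TN = 14/57.
tp, fn, fp, tn = 110, 51, 14, 57
total = tp + fn + fp + tn                        # 232 samples
sensitivity = tp / (tp + fn)                     # members correctly captured
specificity = tn / (fp + tn)                     # nonmembers correctly rejected
objective = misclassification_sum(tp, fn, fp, tn)
print(total, round(sensitivity, 3), round(specificity, 3), round(objective, 3))
```

The counts are internally consistent: they sum to the 232 training samples, and TP + FN equals the 161 samples reported as in the category.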
Enrichment Results
This model was built using 232 samples and validated by leave-one-out cross-validation. Each sample was left out in turn, a model was built from the remaining samples, and that model was used to predict the left-out sample. Once every sample had a prediction, an enrichment plot was generated, showing the percentage of true category members captured at each percentage cutoff. (For example, the column labeled "1%" gives the percentage of true category members (e.g., actives) found in the top 1% of the list when sorted by model score.)
This table shows the output name, the percentage of samples that are in that particular category, the number of category members, and the percentage of true members found. Percentages that are less than 100% are in bold.
Output | 1% | 5% | 10% | 25% | 50% | 75% | 90% | 95% | 99% |
---|---|---|---|---|---|---|---|---|---|
Best-zhh | 1.2% | 6.8% | 14.3% | 34.2% | 64.6% | 87.6% | 97.5% | 100% | 100% |
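The enrichment figures can be reproduced from a ranked list with a short function. The sketch below assumes the top-N count is truncated to a whole number of samples, which the report does not state. Under that assumption, the first three columns sit at their ceiling: with 161 members among 232 samples, the top 1% holds 2 samples and 2/161 is about 1.2%.

```python
def fraction_captured(scores, labels, top_fraction):
    """Fraction of all members found in the top `top_fraction` of the ranked list."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = int(len(ranked) * top_fraction)  # assumption: truncate to whole samples
    found = sum(label for _, label in ranked[:n_top])
    return found / sum(labels)


def max_capture(n_samples, n_members, top_fraction):
    """Best possible capture: every sample in the top slice is a member."""
    n_top = int(n_samples * top_fraction)
    return min(n_top, n_members) / n_members


# Ceiling for the first three columns, using the counts from this report.
ceilings = [round(100 * max_capture(232, 161, f), 1) for f in (0.01, 0.05, 0.10)]
print(ceilings)
```

Because 69% of the samples are category members, small cutoffs cannot capture a large share of them even with a perfect ranking, so the low percentages in the left-hand columns do not indicate poor enrichment.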
Percentile Results
This table shows, for each model, the cutoff needed to capture a particular percentage of the good samples. Below each cutoff are the estimated percentages of false positives and true negatives among the non-good samples. The table is designed to help you pick the cutoff value that captures as many good samples as possible while keeping the number of false positives to a minimum.
The rates shown in this table are estimates derived from the cross-validated data; the actual numbers you would find on your own data may vary.
Cutoffs which lead to 10% or greater false positives are displayed in bold for ease of identification.
Model Name | 99% | 95% | 90% | 70% | 50% | 30% | 10% | 5% | 1% |
---|---|---|---|---|---|---|---|---|---|
Best-zhh | -6.928 38%/62% | -4.348 34%/66% | -2.947 32%/68% | -1.399 30%/70% | 1.734 25%/75% | 4.868 21%/79% | 6.416 19%/81% | 7.817 18%/82% | 10.397 15%/85% |
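A cutoff table of this kind can be approximated directly from cross-validated scores. The sketch below is purely empirical and is an assumption about the general idea, not the method used to produce the estimates above, which may come from a fitted distribution. It assumes samples scoring at or above the cutoff are accepted as good.

```python
import math


def cutoff_for_capture(good_scores, capture):
    """Highest cutoff at which at least `capture` of the good scores are >= cutoff."""
    ranked = sorted(good_scores, reverse=True)
    # round() guards against floating-point noise such as 0.7 * 10 = 7.000000000000001
    k = max(1, math.ceil(round(capture * len(ranked), 9)))
    return ranked[k - 1]


def false_positive_rate(other_scores, cutoff):
    """Fraction of non-good samples that score at or above the cutoff."""
    return sum(1 for s in other_scores if s >= cutoff) / len(other_scores)


# Hypothetical cross-validated scores.
good = [1, 2, 3, 4]
other = [0, 1, 2, 3.5]

cutoff = cutoff_for_capture(good, 0.5)
fp_rate = false_positive_rate(other, cutoff)
print(cutoff, fp_rate, 1 - fp_rate)  # cutoff, false positive rate, true negative rate
```

As in the table, the false positive and true negative rates for a given cutoff always sum to 100%, since both are measured on the non-good samples only.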
Category Statistics Results
This table shows, for each category, statistics derived from the cross-validated predictions of the model built for that category, as applied to members and non-members of that category. For each group, the table gives the number of members/nonmembers (N), the mean prediction for each subset (Mean), and the estimated standard deviation of the predictions for each subset (StdDev).
(Categories with one or no members do not have a mean and standard deviation, as there are too few predictions upon which to base them during cross-validation. Also, categories may occasionally contain many duplicate or highly similar compounds that predict close or identical values, giving unusually low standard deviations. These low values may be adjusted when the standard deviations are used for prediction, for example when computing percentile results.)
Output | N | Mean (±StdDev) | N | Mean (±StdDev) |
---|---|---|---|---|
Best-zhh |
Non-validated Models Results
All categories contained enough samples for cross-validation.
Training Data Information
The data used to train the model consisted of 232 samples. The following are the statistics for the independent (X) properties.
Property | Min | Max | Mean | Std. Dev. |
---|---|---|---|---|
ECFP_6 | N/A | N/A | N/A | N/A |
Molecular_Weight | 44.053 | 650.97 | 303.31 | 111.71 |
Num_H_Acceptors | 0 | 15 | 4.194 | 2.7056 |
Num_H_Donors | 0 | 11 | 1.7759 | 1.7024 |
Molecular_PolarSASA | 0 | 516.44 | 129.37 | 88.16 |
The test to identify "good" samples is:
property("class-1") is defined AND property("class-1") = 1
You can extend this model by adding your own training data to it to create a new model. Use the New Model from Old component to do this. The new training samples must contain the properties as specified above (except that they need not contain properties that can be calculated on-demand). The "good" samples must be marked so that they can be identified by the above test. Because the original training data were not saved with this model, you will not be able to compute cross-validation statistics for the new model.
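When preparing new training samples outside the pipeline, the "good" test can be read as a simple predicate. The Python below is a hypothetical equivalent, assuming each sample is a dictionary of property values and that "defined" means the key is present with a non-null value.

```python
def is_good(sample: dict) -> bool:
    """Mirror of: property("class-1") is defined AND property("class-1") = 1."""
    value = sample.get("class-1")  # None when the property is missing
    return value is not None and value == 1


# Hypothetical samples.
print(is_good({"class-1": 1}), is_good({"class-1": 0}), is_good({}))
```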
Model Construction Information
Post-processing was performed to remove low-information bins. Low-information bins are those with normalized estimates in the range [-0.05, 0.05].
For each property, the following table gives the original number of bins (Original), the number removed due to too few samples (TooFew), the number removed due to a poor normalized estimate (Noninformative), and the final number of bins saved in the model (Final).
Property | Original | TooFew | Noninformative | Final |
---|---|---|---|---|
ECFP_6 | 4665 | 0 | 115 | 4550 |
Molecular_Weight | 11 | 0 | 2 | 9 |
Num_H_Acceptors | 7 | 0 | 1 | 6 |
Num_H_Donors | 6 | 0 | 2 | 4 |
Molecular_PolarSASA | 11 | 0 | 4 | 7 |
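The pruning rule and the bookkeeping in the table can be checked with a short script. This is an illustrative sketch: the `prune` helper and the dictionary of per-bin estimates are hypothetical, while the threshold range and the bin counts are taken from this report.

```python
LOW, HIGH = -0.05, 0.05  # closed range of normalized estimates treated as low-information


def is_low_information(normalized_estimate):
    return LOW <= normalized_estimate <= HIGH


def prune(bins):
    """Keep only bins whose normalized estimate falls outside [LOW, HIGH]."""
    return {name: est for name, est in bins.items() if not is_low_information(est)}


# (Original, TooFew, Noninformative, Final) as reported above.
bin_table = {
    "ECFP_6": (4665, 0, 115, 4550),
    "Molecular_Weight": (11, 0, 2, 9),
    "Num_H_Acceptors": (7, 0, 1, 6),
    "Num_H_Donors": (6, 0, 2, 4),
    "Molecular_PolarSASA": (11, 0, 4, 7),
}

# Every row should satisfy Final = Original - TooFew - Noninformative.
consistent = all(
    original - too_few - noninformative == final
    for original, too_few, noninformative, final in bin_table.values()
)
print(consistent)
```

All five rows satisfy the identity, so the Final column is fully explained by the two removal steps.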
Model Construction Parameters
The following parameter values were specified by the learner component. Some items are internal parameters not exposed by the component. In the course of building the model, certain values may have been adjusted from the values shown below.
Parameter | Value |
---|---|
LearnedPropertyName | Best-zhh |
TestForGood | property("class-1") is defined AND property("class-1") = 1 |
UseProperties | UserSet |
PredefinedSet | ALogP, Molecular_Weight, Num_H_Donors, Num_H_Acceptors, Num_RotatableBonds, Molecular_FractionalPolarSurfaceArea, ECFP_6 |
UserSet | ECFP_6,Molecular_Weight,Num_H_Acceptors,Num_H_Donors,Molecular_PolarSASA |
IgnoreProperties | |
Additional Options | |
NumberOfBins | 10 |
Learn Options | Validate Models, Remove Uninformative Bins, Equipopulate Bins |
Numeric Distance Function | Euclidean |
Numeric Scaling | Mean-Center and Scale, Scale by Number of Dimensions |
Fingerprint Distance Function | Tanimoto |
Model Domain Fingerprint | FCFP_2 |
DestinationFolder | Administrator/LearnedProperties |
Post-Processing Script | resize(#op, 4); #op[1] := 'NormalizedProbability'; #op[2] := 'Enrichment'; #op[3] := 'EstPGood'; #op[4] := 'Prediction'; SetParam('Output Options',#op); |
DuplicationEstimate | 1.0 |
GoodDuplicationEstimate | 1.0 |
Additional Properties | |
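Two of the options above, Equipopulate Bins and the Tanimoto fingerprint distance, are easy to illustrate. The functions below are generic textbook versions written for this note, not the learner component's code. They assume fingerprints are represented as sets of feature identifiers and that bin edges are taken from the sorted training values.

```python
def equipopulated_edges(values, n_bins):
    """Interior bin edges chosen so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def tanimoto_distance(a: set, b: set) -> float:
    """1 minus the Tanimoto similarity (shared features over total distinct features)."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return 1.0 - len(a & b) / len(union)


# Hypothetical data: ten descriptor values split into five equally populated bins.
edges = equipopulated_edges([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 5)
distance = tanimoto_distance({1, 2, 3}, {2, 3, 4})
print(edges, distance)
```

Equal-population bins put more edges where the data are dense, which avoids the nearly empty bins that fixed-width binning can produce for skewed descriptors such as molecular weight.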