Tutorial on Classification
Igor Baskin and Alexandre Varnek

Introduction
This tutorial demonstrates the possibilities offered by the Weka software for building classification models for SAR (Structure-Activity Relationships) analysis. Two types of classification tasks will be considered – two-class and multi-class classification. In all cases protein-ligand binding data are analyzed: ligands exhibiting strong binding affinity towards a certain protein are considered “active” with respect to it, while ligands whose binding affinity towards the protein is unknown are conventionally considered “nonactive”. The goal of the classification models is to predict whether a new ligand will exhibit strong binding affinity toward certain protein biotargets. Ligands predicted to be active can be expected to possess the corresponding type of biological activity and could therefore be used as “hits” in drug design. All ligands in this tutorial are described by an extended set of MACCS fingerprints, each comprising 1024 bits, where a bit is “on” if the corresponding structural feature is present in the ligand and “off” otherwise.

Part 1. Two-Class Classification Models.
1. Data and descriptors.
The dataset for this tutorial contains 49 ligands of the Angiotensin-Converting Enzyme (ACE) and 1797 decoy compounds chosen from the DUD database. The set of "extended" MACCS fingerprints is used as descriptors.

2. Files
The following file is supplied for the tutorial:
• ace.arff – descriptor and activity values

3. Exercise 1: Building the Trivial model ZeroR
In this exercise, we build the trivial model ZeroR, in which all compounds are classified as “nonactive”. The goal is to demonstrate that accuracy is not a correct choice for measuring the performance of classification on unbalanced datasets, in which the number of “nonactive” compounds is much larger than the number of “active” ones.

Step by step instructions
Important note for Windows users: During the installation, the ARFF files should be associated with Weka.

• In the starting interface of Weka, click on the button Explorer.
• In the Preprocess tab, click on the button Open File. In the file selection interface, select the file ace.arff.

The dataset is characterized in the Current relation frame: the name, the number of instances (compounds), and the number of attributes (descriptors + activity/property). We see in this frame that the number of compounds is 1846, whereas the number of descriptors is 1024, i.e. the number of attributes (1025) minus the activity field. The Attributes frame allows the user to modify the set of attributes using the select and remove options. Information about the selected attribute is given in the Selected attribute frame, in which a histogram depicts the attribute distribution. One can see that the value of the currently selected descriptor fp_1 (the first bit in the corresponding fingerprint) is “on” in 201 compounds and “off” in 1645 compounds of the dataset.

• Select the last attribute “activity” in the Attributes frame.


One can read from the Selected attribute frame that there are 1797 nonactive and 49 active compounds in the dataset. In the histogram, nonactive compounds are shown in blue, whereas active compounds are shown in red.
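The same inspection can also be scripted through Weka's Java API instead of the Explorer GUI. The following is a minimal sketch, assuming weka.jar (any 3.x release) is on the classpath and ace.arff is in the working directory; the class name InspectAce is only an illustrative choice:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectAce {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file supplied with the tutorial
        Instances data = DataSource.read("ace.arff");
        // The last attribute ("activity") is the class label
        data.setClassIndex(data.numAttributes() - 1);

        System.out.println("Compounds:   " + data.numInstances());        // expected 1846
        System.out.println("Attributes:  " + data.numAttributes());       // expected 1025
        System.out.println("Descriptors: " + (data.numAttributes() - 1)); // 1024 fingerprint bits

        // Class distribution: how many "active" and "nonactive" compounds
        int[] counts = data.attributeStats(data.classIndex()).nominalCounts;
        for (int i = 0; i < counts.length; i++) {
            System.out.println(data.classAttribute().value(i) + ": " + counts[i]);
        }
    }
}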

• Click on the tab Classify.
The ZeroR method is already selected by default. For assessing the predictive performance of all models to be built, 10-fold cross-validation has also been specified by default.

• Click on the Start button to build a model.

The predictive performance of the model is characterized in the right-hand Classifier output frame. The Confusion Matrix for the model is presented at the bottom of the Classifier output window. It can be seen from it that all compounds have been classified as “nonactive”. It is clear that such a trivial model is useless and cannot be used for discovering “active” compounds. Note, however, that the accuracy of this trivial model (Correctly Classified Instances) is very high: 97.3456 %. This clearly indicates that accuracy cannot be used for assessing the usefulness of classification models built on unbalanced datasets. A better choice for this purpose is the “Kappa statistic”, which is zero in this case. The “Kappa statistic” is an analog of the correlation coefficient: its value is zero in the absence of any relation and approaches one for a very strong statistical relation between the class label and the attributes of the instances, i.e. between the class of biological activity of chemical compounds and the values of their descriptors. Another useful statistical characteristic is the “ROC Area”, for which a value near 0.5 means the lack of any statistical dependence.
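The same baseline experiment can be reproduced programmatically. A minimal sketch with Weka's Java API is given below; it assumes weka.jar on the classpath, and Random(1) mirrors the Explorer's default cross-validation seed, so the figures should match the GUI output:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ZeroRBaseline {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("ace.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation of the trivial majority-class model
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new ZeroR(), data, 10, new Random(1));

        System.out.printf("Accuracy: %.4f %%%n", eval.pctCorrect()); // high (~97 %) but meaningless here
        System.out.printf("Kappa:    %.4f%n", eval.kappa());         // ~0: no real predictive power
        for (int c = 0; c < data.numClasses(); c++) {                // ROC Area ~0.5: no ranking ability
            System.out.printf("ROC Area (%s): %.4f%n",
                    data.classAttribute().value(c), eval.areaUnderROC(c));
        }
        System.out.println(eval.toMatrixString("Confusion matrix:"));
    }
}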

4. Exercise 2: Building the Naïve Bayesian Model
In this exercise, we build a Naïve Bayesian model for predicting the ability of chemical compounds to bind to the Angiotensin-Converting Enzyme (ACE). The goal is to demonstrate the ability of Weka to build statistically significant classification models for predicting the biological activity of chemical compounds, as well as to show different ways of assessing the statistical significance and usefulness of classification models.

• In the classifier frame, click Choose, then select the NaiveBayes method from the bayes submenu.
• Click on the Start button to build a model.

Although the accuracy of the model became lower (93.8787 % instead of 97.3456 %), its real statistical significance became much stronger. This follows from the value of the “Kappa statistic”, 0.42, which indicates the existence of a moderate statistical dependence. It can be analyzed using the “Confusion Matrix” at the bottom of the Classifier output window. There are 45 true positive, 1688 true negative, 109 false positive, and 4 false negative compounds. It is because of the considerable number of false positives that the precision for “active” compounds, 0.292, is rather low. Nonetheless, the model exhibits an excellent value of the “ROC Area” for “active” compounds, 0.98. This indicates that this Naïve Bayesian model could very advantageously be used for discovering biologically active compounds through virtual screening. This can clearly be shown by analyzing the ROC and Cost/Benefit plots. The Naïve Bayes method provides probabilistic outputs, which means that a Naïve Bayes model can estimate the probability (varying from 0 to 1) that a given compound should be predicted as “active”. By moving a threshold from 0 to 1 and predicting a compound as “active” whenever the corresponding probability exceeds the current threshold, one can build the ROC (Receiver Operating Characteristic) curve.
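The corresponding evaluation can also be run from code. A minimal sketch (same assumptions as above) prints the confusion matrix and the per-class precision, recall and ROC Area discussed in this section:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesAce {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("ace.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation of the Naive Bayes classifier
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        System.out.printf("Accuracy: %.4f %%  Kappa: %.4f%n", eval.pctCorrect(), eval.kappa());
        System.out.println(eval.toClassDetailsString()); // TP rate, precision, recall, ROC Area per class
        System.out.println(eval.toMatrixString());
    }
}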

• Visualize the ROC curve by clicking the right mouse button on the model type bayes.NaiveBayes in the Result list frame and selecting the menu item Visualize threshold curve / active.

The ROC curve is shown in the Plot frame of the window. Its X axis corresponds to the false positive rate, whereas its Y axis corresponds to the true positive rate. The color depicts the value of the threshold: the “colder” (closer to blue) the color, the lower the threshold value. All compounds whose probability of being “active” exceeds the current threshold are predicted as “active”. If such a prediction made for a given compound is correct, then the compound is a true positive; otherwise it is a false positive. If for some values of the threshold the true positive rate greatly exceeds the false positive rate (which is indicated by the angle A being close to 90 degrees), then the classification model with such a threshold can be used to selectively extract “active” compounds from their mixture with a large number of “nonactive” ones in the course of virtual screening.
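The data behind this plot are exposed programmatically by the ThresholdCurve class, so the ROC curve can also be exported and re-plotted outside the GUI. A sketch is given below; it assumes the positive class is labelled "active" in ace.arff and that the attribute names produced by ThresholdCurve ("True Positive Rate", "False Positive Rate", "Threshold") follow the standard Weka naming:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.evaluation.ThresholdCurve;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RocExport {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("ace.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        // Index of the positive class; assumed to be named "active"
        int activeIndex = data.classAttribute().indexOfValue("active");

        // Each row of 'curve' corresponds to one value of the probability threshold
        Instances curve = new ThresholdCurve().getCurve(eval.predictions(), activeIndex);
        System.out.printf("ROC Area: %.4f%n", ThresholdCurve.getROCArea(curve));

        int thr = curve.attribute("Threshold").index();
        int fpr = curve.attribute("False Positive Rate").index();
        int tpr = curve.attribute("True Positive Rate").index();
        for (int i = 0; i < curve.numInstances(); i += 200) { // print only a few points
            System.out.printf("threshold=%.3f  FPR=%.3f  TPR=%.3f%n",
                    curve.instance(i).value(thr), curve.instance(i).value(fpr),
                    curve.instance(i).value(tpr));
        }
    }
}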

In order to find the optimal value of the threshold (or the optimal fraction of compounds to be selected in the course of virtual screening), one can perform the cost/benefit analysis.

• Close the window with the ROC curve.
• Open the window for the cost/benefit analysis by clicking the right mouse button on the model type bayes.NaiveBayes in the Result list frame and selecting the menu item Cost/Benefit analysis / active.
• Click on the Minimize Cost/Benefit button at the right bottom corner of the window.


Consider attentively the window for the Cost/Benefit Analysis. It consists of several panels. The left part of the window contains the Plot: ThresholdCurve frame with the Threshold Curve (also called the Lift curve). The Threshold curve looks very similar to the ROC curve. In both of them the Y axis corresponds to the true positive rate. However, in contrast to the ROC curve, the X axis of the Threshold curve corresponds to the fraction of selected instances (the “Sample Size”). In other words, the Threshold curve depicts how the fraction of “active” compounds retrieved in the course of virtual screening depends on the fraction of compounds selected from the whole dataset used for screening. Recall that only those compounds are selected in the course of virtual screening for which the estimated probability of being “active” exceeds the chosen threshold. The value of the threshold can be modified interactively by moving the slider in the Threshold frame of the Cost/Benefit Analysis window. The confusion matrix for the current value of the threshold is shown in the Confusion Matrix frame at the left bottom corner of the window.

Note that the confusion matrix for the current value of the threshold differs sharply from the previously obtained one. In particular, the classification accuracy, 97.8873 %, is considerably higher than the previous value of 93.8787 %; the number of false positives has greatly decreased from 109 to 31, whereas the number of false negatives has increased from 4 to 8. Why is this happening? In order to answer this question and explain the corresponding phenomenon, let us take a look at the right side of the window. Its right bottom corner contains the Cost Matrix frame.

The left part of the frame contains the Cost matrix itself. Its four entries indicate the cost one should pay for decisions taken on the basis of the classification model. The cost values are expressed in abstract units; in real case studies they can be expressed on a monetary scale, for example in euros. The left bottom cell of the Cost matrix defines the cost of false positives; its default value is 1 unit. In the case of virtual screening this corresponds to the mean price one should pay in order to synthesize (or purchase) and test a compound wrongly predicted by the model as “active”. The right top cell of the Cost matrix defines the cost of false negatives; its default value is also 1 unit. In the case of virtual screening this corresponds to the mean loss incurred by “throwing away” a very useful compound because of a wrong prediction made by the classification model. By default, no price is paid for correct decisions taken using the classification model. It is clear that all these settings can be changed in order to match the real situation arising in the process of drug design.

The overall cost corresponding to the current value of the threshold is indicated at the right side of the frame. Its current value is 39 (the cost of 31 false positives and 8 false negatives). In order to find the threshold corresponding to the minimum cost, it is sufficient to press the button Minimize Cost/Benefit. This explains the afore-mentioned difference in confusion matrices: the initial confusion matrix corresponds to the threshold 0.5, whereas the second confusion matrix results from the value of the threshold found by minimizing the cost function. The current value of the cost is compared by the program with the cost of selecting the same number of instances at random; this value, 117.18, is indicated at the right side of the frame. The difference between the cost of random selection and the current value of the cost is called the Gain; its current value, 78.18, is also indicated at the right side of the frame. In the context of virtual screening, the Gain can be interpreted as the profit obtained by using the classification model instead of a random selection of the same number of chemical compounds. Unfortunately, the current version of the Weka software does not provide a means of automatic maximization of the Gain function. However, this can easily be done interactively by moving the slider in the Threshold frame of the Cost/Benefit Analysis window. The current model corresponds to the minimum value of the Cost function. Read the values for the current threshold from the right side of the Threshold frame.

So, the current model (with the threshold obtained by minimizing the cost) specifies that it is optimal to select 3.9003 % of the compounds in the course of virtual screening, which ensures retrieval of 83.6735 % of the active compounds.
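The Cost and Gain figures reported by this panel follow directly from the confusion matrix, so they are easy to recompute by hand. The small sketch below (a hypothetical helper, not part of Weka) redoes the arithmetic with the default unit costs and the numbers quoted above:

public class CostBenefitCheck {
    // Cost of a selection = FP * costFP + FN * costFN (correct decisions are free by default)
    static double cost(int falsePositives, int falseNegatives, double costFP, double costFN) {
        return falsePositives * costFP + falseNegatives * costFN;
    }

    public static void main(String[] args) {
        double modelCost  = cost(31, 8, 1.0, 1.0);  // 39, as shown in the Cost Matrix frame
        double randomCost = 117.18;                 // cost of a random selection of the same size (read from the GUI)
        double gain       = randomCost - modelCost; // 78.18
        System.out.printf("model cost = %.2f, random cost = %.2f, gain = %.2f%n",
                modelCost, randomCost, gain);
    }
}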

• Close the window with the Cost/Benefit Analysis.

Exercise: Find the model corresponding to the maximum Gain.

5. Exercise 3: Building the Nearest Neighbors Models (k-NN)
In this exercise, we build k-NN models for predicting the ability of chemical compounds to bind to the Angiotensin-Converting Enzyme (ACE). The goal is to learn how to use instance-based (lazy) methods.

• In the classifier frame, click Choose, then select the IBk method from the lazy submenu.

The lazy submenu contains a group of methods in which the training phase is almost omitted – it actually amounts to memorizing all instances of the training set. Instead, all the main calculations are delayed to the test phase. That is why such methods are sometimes called lazy, instance-based or memory-based. The price for this “laziness” is, however, rather high: computations at the test phase are very intensive, and such methods therefore work very slowly during prediction, especially for big training sets. The abbreviation IBk means that this is an Instance-Based method using k neighbours. The default value of k is 1, so we first build a 1-NN model.

• Click on the Start button to build a 1-NN model.

One can see that the 1-Nearest Neighbour model is statistically much stronger than the previous Naïve Bayes one. In particular, the number of Incorrectly Classified Instances has decreased from 113 to 13, whereas the Kappa statistic has increased from 0.42 to 0.8702. Nonetheless, the ROC Area became slightly smaller in comparison with the Naïve Bayes model. Now perform the Cost/Benefit Analysis for the 1-NN model.

• Click the right mouse button on the model type lazy.IBk in the Result list frame and select the menu item Cost/Benefit analysis / active.
• Click on the Minimize Cost/Benefit button at the right bottom corner of the window.

One can see that the Cost became considerably lower (13 vs 39), and the Gain became higher (87.13 vs 78.18). It can also be checked that the initial 1-NN model corresponds to the lowest Cost and the highest Gain. It can also be seen that when using the 1-NN classifier in virtual screening it is sufficient to select 2.9252 % of the compounds in order to retrieve 91.8367 % of the “active” ones. Can this result be further improved? Yes, by using the weighted modification of the k-NN method.

• Close the window with the Cost/Benefit Analysis.
• Click with the left mouse button on the word IBk in the Classifier frame. The window for setting the options of the k-NN method pops up.
• Change the option distanceWeighting to Weight by 1-distance.

• Click on the OK button.
• Click on the Start button.

One can see that the ROC Area has increased from 0.95 to 0.977, although the accuracy of prediction and the Kappa statistic have not changed.
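In the Java API the same weighted 1-NN experiment can be set up through IBk's command-line options; a minimal sketch, assuming the standard IBk flags (-K for the number of neighbours, -F for "Weight by 1-distance"):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class WeightedKnn {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("ace.arff");
        data.setClassIndex(data.numAttributes() - 1);

        IBk knn = new IBk();
        // -K 1 : one nearest neighbour; -F : weight neighbours by 1-distance
        knn.setOptions(Utils.splitOptions("-K 1 -F"));

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(knn, data, 10, new Random(1));
        System.out.printf("Accuracy: %.4f %%  Kappa: %.4f%n", eval.pctCorrect(), eval.kappa());
        for (int c = 0; c < data.numClasses(); c++) {
            System.out.printf("ROC Area (%s): %.4f%n",
                    data.classAttribute().value(c), eval.areaUnderROC(c));
        }
    }
}

For the 3-NN exercise below, it is enough to change "-K 1" to "-K 3" (dropping or keeping the -F flag for the unweighted and weighted variants, respectively).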

• Repeat the Cost/Benefit analysis.

So, after the introduction of weighting, the Cost became lower (11 vs 13), the Gain became slightly higher (87.24 vs 87.13), and it is now sufficient to screen 2.8169 % (instead of 2.9252 %) of the compounds in order to retrieve the same number of “active” ones. Some moderate improvement has thus been achieved.

So the Nearest Neighbours approach appeared to be considerably more efficient than Naïve Bayes for predicting the ability of chemical compounds to bind to the Angiotensin-Converting Enzyme (ACE) using the set of "extended" MACCS fingerprints as descriptors. The question arises: is the Nearest Neighbours approach always more efficient than Naïve Bayes?
The answer is no. In this case the exceptionally high performance of the 1-NN method can be explained by the fact that the MACCS fingerprints have been specially optimized to provide a high rate of retrieval of “active” compounds in similarity search, which is effectively 1-NN. With other sets of descriptors, the results might be different.

Exercise: Build two 3-NN models (with and without weighting) for the same dataset and analyze their performance relative to the corresponding 1-NN models. Hint: the KNN option in the k-NN parameters window should be changed.

6. Exercise 4: Building the Support Vector Machine Models
In this exercise, we build Support Vector Machine (SVM) models for predicting the ability of chemical compounds to bind to the Angiotensin-Converting Enzyme (ACE). The goal is to learn the possibilities offered by the Weka software for that.

• In the classifier frame, click Choose, then select the SMO method from the functions submenu.

The Weka software implements John Platt’s Sequential Minimal Optimization (SMO) algorithm for training a support vector classifier, which explains the abbreviation SMO used in Weka for this method.

• Click on the Start button.


We have obtained a very good model with a small number of misclassification errors (13) and a rather high value of the Kappa statistic, 0.8536. The only thing that became worse in comparison with the previous models is the ROC Area. This can, however, easily be explained by the fact that the original SVM method is not probabilistic: only a single optimal value of the threshold (which in the case of the standard SVM approach is the distance between the separating hyperplane and the coordinate origin) is provided. Without such a freely moving threshold, it is not possible to perform virtual screening based on ranking chemical compounds and adjusting the selection threshold, and this results in the relatively poor value of the ROC Area. Nonetheless, this can be improved by using a special modification of the original SVM approach that assigns a probability value to each prediction. Since the algorithm for assigning probability values to SVM predictions is based on the use of logistic functions, such models are called Logistic Models in Weka.
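A minimal sketch of the same experiment with the Java API (default SMO settings: linear PolyKernel, C = 1, no probability outputs):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LinearSvm {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("ace.arff");
        data.setClassIndex(data.numAttributes() - 1);

        SMO svm = new SMO(); // defaults: linear polynomial kernel, C = 1
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(svm, data, 10, new Random(1));

        System.out.printf("Incorrectly classified: %.0f%n", eval.incorrect());
        System.out.printf("Kappa: %.4f%n", eval.kappa());
        // The ROC Area is uninformative here: the plain SVM yields a hard decision, not a ranking
        for (int c = 0; c < data.numClasses(); c++) {
            System.out.printf("ROC Area (%s): %.4f%n",
                    data.classAttribute().value(c), eval.areaUnderROC(c));
        }
    }
}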

• Click with the left mouse button on the word SMO in the Classifier frame. The window for setting the options of the SVM method pops up.


• Change the option buildLogisticModels to True.
• Click on the OK button.
• Click on the Start button.

Although the accuracy of prediction has not changed, the ROC Area became very high – 0.993 for “active” compounds. For such a probabilistic variant of the SVM method, good ROC curves can be built, and the Cost/Benefit analysis can easily be performed.

• Click the right mouse button on the model type functions.SMO in the Result list frame and select the menu item Cost/Benefit analysis / active.
• Click on the Minimize Cost/Benefit button at the right bottom corner of the window.


One can see that the value of the Cost function is low (11), whereas the Gain is rather high (83.45). In order to retrieve 87.7551 % of the active compounds in the course of virtual screening, it is sufficient to select only 2.6002 % of the compounds. The Threshold Curve at the left side of the window also demonstrates the very good performance of the probabilistic SVM approach in virtual screening.

• Close the window with the Cost/Benefit Analysis.
The obtained results can be improved further. All these models have been built using the linear kernel chosen by default. Such a kernel takes into account only the individual impacts of descriptors (in this case fingerprint bits), but does not consider their interactions (in this case, interactions of the features corresponding to different fingerprint bits). All binary interactions of features can be captured using a quadratic kernel. Let us build a new probabilistic SVM model with the quadratic kernel.

• Click with the left mouse button on the word SMO in the Classifier frame. The window for setting the options of the SVM method pops up.

Now change the kernel from the linear to the quadratic one. Both are particular cases of the polynomial kernel with different exponents (one for the linear and two for the quadratic kernel). Therefore, in order to obtain the quadratic kernel, it is sufficient to set the exponent parameter of the polynomial kernel to 2; it is not necessary to change the type of kernel.

• Click with the left mouse button on the word PolyKernel near the kernel label.

A new window with the parameters of the polynomial kernel pops up.

• Change the value of the exponent option from 1.0 to 2.0.


• Click on the OK button to close the window with the polynomial kernel options.
• Click on the OK button to close the window with the SVM options.
• Click on the Start button.

So, all statistical parameters have substantially improved in comparison with the case of the linear kernel. In particular, the number of misclassification errors has dropped from 13 to 9, and the value of the Kappa statistic has risen from 0.8536 to 0.9007.
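The probabilistic SVM with the quadratic kernel can be configured in code through SMO's option string; a sketch, assuming the standard SMO flags (-M fits logistic models to the SVM outputs, called calibration models in newer Weka releases; -C is the complexity constant; -E 2.0 on the PolyKernel gives the quadratic kernel):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class QuadraticSvm {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("ace.arff");
        data.setClassIndex(data.numAttributes() - 1);

        SMO svm = new SMO();
        // Probabilistic outputs (-M), default complexity (-C 1.0), quadratic polynomial kernel (-E 2.0)
        svm.setOptions(Utils.splitOptions(
                "-M -C 1.0 -K \"weka.classifiers.functions.supportVector.PolyKernel -E 2.0\""));

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(svm, data, 10, new Random(1));
        System.out.printf("Incorrect: %.0f  Kappa: %.4f%n", eval.incorrect(), eval.kappa());
        for (int c = 0; c < data.numClasses(); c++) {
            System.out.printf("ROC Area (%s): %.4f%n",
                    data.classAttribute().value(c), eval.areaUnderROC(c));
        }
    }
}

The exercise below on the C parameter amounts to rerunning this sketch with -C set to 0.1, 0.5, 2 and 10.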

• Perform the Cost/Benefit analysis.


The Cost has fallen even further, from 9 to 7, in comparison with the linear kernel. With the quadratic kernel, it is sufficient to select only 2.2752 % of the compounds in order to retrieve 85.7143 % of the “active” ones. So, the transition from the linear kernel to the quadratic one has led to a substantial improvement of the SVM classification models. This means that it is important to consider not just the individual features coded by fingerprints, but also the nonlinear interactions between them. Unfortunately, the very popular Tanimoto similarity measure does not take this into account.

Exercise: Rebuild the probabilistic SVM model with the quadratic kernel for different values of the parameter C (the trade-off between errors and model complexity). Try the values 0.1, 0.5, 2, 10. Can any improvement be achieved in comparison with the default value of 1?

7. Exercise 5: Building a Classification Tree Model
In this exercise, we build a classification tree model (using the C4.5 method, named J48 in Weka) for predicting the ability of chemical compounds to bind to the Angiotensin-Converting Enzyme (ACE). The goal is to learn the possibilities offered by the Weka software to build and visualize classification trees.

• In the classifier frame, click Choose, then select the J48 method from the trees submenu.
• Click on the Start button.

The statistical parameters of the J48 model do not appear high, especially in comparison with the previously considered methods. Nonetheless, the main strength of individual classification trees stems not from the high statistical significance of the models, but from their interpretability. In order to visualize the classification tree in text mode, scroll the text field in the Classifier output frame up.

In order to obtain a more conventional representation of the same tree, do the following:

• Click the right mouse button on the model type trees.J48 in the Result list frame and select the menu item Visualize tree.
• Resize the new window with the graphical representation of the tree.
• Click with the right mouse button on the empty space in this window, and in the popup menu select the item Fit to screen.


The Tree View graphical diagram can be used to visualize decision trees. It contains two types of nodes: ovals and rectangles. Each oval contains a query of the sort: does the chemical structure contain the feature encoded by the specified fingerprint bit? If the answer is “yes”, then the node connected to the previous one by the “= on” branch is queried next; otherwise, the “= off” branch is activated. The top node of the tree is queried first. The “leaves” of the tree, depicted as rectangles, contain the final decision on whether the current compound is active or not.
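Both the text view and the graph of the tree are available from code as well. A minimal sketch: J48 implements Weka's Drawable interface, so graph() returns GraphViz "dot" source that can be rendered with any external dot viewer:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AceTree {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("ace.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.printf("Accuracy: %.4f %%  Kappa: %.4f%n", eval.pctCorrect(), eval.kappa());

        // Build the tree on the full dataset to inspect its structure
        tree.buildClassifier(data);
        System.out.println(tree);          // text view: fp_NNN = on/off splits and class leaves
        System.out.println(tree.graph());  // GraphViz "dot" source of the same tree
    }
}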

Exercise: Build the ROC curve and perform the Cost/Benefit analysis of the J48 model.

8. Exercise 6: Building a Random Forest Model
In this exercise, we build a Random Forest model for predicting the ability of chemical compounds to bind to the Angiotensin-Converting Enzyme (ACE). The goal is to learn the possibilities offered by the Weka software to build Random Forest models. Although models built using individual decision trees are not very strong from a statistical point of view, they can be largely improved by applying ensemble modeling. In the latter case, an ensemble of several models is built instead of a single one, and the prediction of the ensemble model is made as a consensus of the predictions made by all its individual members. The most widely used method based on ensemble modeling is Random Forest, which has recently become very popular in chemoinformatics.

• In the classifier frame, click Choose, then select the RandomForest method from the trees submenu.
• Click with the left mouse button on the word RandomForest in the Classifier frame. The window for setting the options of the Random Forest method pops up.
• Change the value of the numTrees option from 10 to 100.


We have changed the default number of trees in the ensemble from 10 to 100.

• Click on the OK button to close the window with the Random Forest options.
• Click on the Start button.

The resulting model is rather strong. Although its classification accuracy and the value of the Kappa statistic are worse than for the SVM model, the ROC Area appears to be very high. This means that it can advantageously be applied in virtual screening. Indeed, perform the Cost/Benefit Analysis of the model.

• Click the right mouse button on the model type trees.RandomForest in the Result list frame and select the menu item Cost/Benefit analysis / active.
• Click on the Minimize Cost/Benefit button at the right bottom corner of the window.


Very good Cost/Benefit parameters are observed: the Cost is rather low (10), the Gain is rather high (85.4), and it is sufficient to select only 2.6544 % of the compounds in order to retrieve 89.7959 % of the “active” ones.
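A minimal sketch of the same Random Forest experiment with the Java API (the -I option sets the ensemble size; 10 is the default in the Weka 3.6 release used here, newer releases default to 100):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class AceForest {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("ace.arff");
        data.setClassIndex(data.numAttributes() - 1);

        RandomForest forest = new RandomForest();
        forest.setOptions(Utils.splitOptions("-I 100")); // 100 trees in the ensemble

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(forest, data, 10, new Random(1));
        System.out.printf("Accuracy: %.4f %%  Kappa: %.4f%n", eval.pctCorrect(), eval.kappa());
        for (int c = 0; c < data.numClasses(); c++) {
            System.out.printf("ROC Area (%s): %.4f%n",
                    data.classAttribute().value(c), eval.areaUnderROC(c));
        }
    }
}

The exercise below on the ensemble size amounts to rerunning this sketch with -I set to 10, 20, 30, 40, 50, 100 and 200.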

Exercise: Study the dependence of the Kappa statistic and the ROC Area on the number of trees in the ensemble. Try 10, 20, 30, 40, 50, 100, 200 trees.


Part 2. Multi-Class Classification Models.
1. Data and descriptors.
The dataset for this tutorial contains 3961 ligands to 40 different protein biotargets and 3127 decoy compounds chosen from the DUD database. The extended set of MACCS fingerprints is used as descriptors.

2. Files
The following file is supplied for the tutorial:
• dud.arff – descriptor and activity values

3. Exercise 7: Building the Naïve Bayes Model
In this exercise we will show how the Naïve Bayes method implemented in the Weka software can be applied to building a multi-class model capable of predicting the affinity to 40 pharmaceutically important protein biotargets. In this case the output attribute is called “classes” and it can take 41 different values: the names of the biotargets and “none” for the lack of affinity.

Step by step instructions
Important note for Windows users: During installation, the ARFF files should have been associated with Weka. In this case, it is highly recommended to locate and double click on the file dud.arff, and to skip the following three points.

• In the starting interface of Weka, click on the button Explorer.
• In the Preprocess tab, click on the button Open File. In the file selection interface, select the file dud.arff.


The dataset is characterized in the Current relation frame: the name (dud), the number of instances (compounds), and the number of attributes (descriptors + activity/property). We see in this frame that the number of compounds is 7088, whereas the number of descriptors is 1024, i.e. the number of attributes (1025) minus the “classes” field. The Attributes frame allows the user to modify the set of attributes using the select and remove options. Information about the selected attribute is given in the Selected attribute frame, in which a histogram depicts the attribute distribution. One can see that the value of the currently selected descriptor fp_1 (the first bit in the corresponding fingerprint) is “on” in 1675 compounds and “off” in 5413 compounds of the dataset.

• Select the last attribute “classes” in the Attributes frame.

One can read from the Selected attribute frame the list of the different classes (40 types of biotargets and ‘none’ for the decoys) and the number of compounds belonging to each class (i.e. the number of ligands strongly binding to the corresponding protein). Compounds binding to different biotargets are depicted with different colors in the histogram; the last, black color corresponds to the decoys.

• Click on the tab Classify.
• In the classifier frame, click Choose, then select the NaiveBayes method from the bayes submenu.
• Click on the Start button to build a model.
All information concerning the predictive performance of the resulting model can be extracted from the text field in the right-hand Classifier output frame. Consider first the global statistics.

We can see that for 81.3488 % of the ligands the corresponding biotargets have been correctly predicted. The value of the Kappa statistic, 0.7688, means that the statistical significance of the model is rather high. Therefore, it can be applied to “target fishing”, i.e. to the prediction of putative biological targets for a given compound. Consider the individual statistics for each of the targets.


For each of the targets, several statistical characteristics are presented: True Positive (TP) rate, False Positive (FP) rate, Precision, Recall, F-Measure, and ROC Area. One can see that different targets are characterized by rather different recognition performance. For example, the models for dhfr and gart ligands are very strong, whereas those for pr, hivrt and hivpr are not so good. For each of the targets, individual ROC curves can be built and Cost/Benefit analysis can be performed.
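The per-target table can also be produced programmatically by looping over the class values; a minimal sketch for the multi-class Naïve Bayes model (same assumptions as before, with dud.arff in the working directory):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MultiClassStats {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dud.arff");
        data.setClassIndex(data.numAttributes() - 1); // the "classes" attribute: 40 targets + "none"

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
        System.out.printf("Accuracy: %.4f %%  Kappa: %.4f%n", eval.pctCorrect(), eval.kappa());

        // One line per target, reproducing the GUI's detailed accuracy table
        for (int c = 0; c < data.numClasses(); c++) {
            System.out.printf("%-10s precision=%.3f recall=%.3f F=%.3f ROC=%.3f%n",
                    data.classAttribute().value(c),
                    eval.precision(c), eval.recall(c), eval.fMeasure(c), eval.areaUnderROC(c));
        }
    }
}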

Exercise: Perform the Cost/Benefit analysis for the ace target and compare its results with the two-class Naïve Bayes classification model obtained in Exercise 2.

4. Exercise 8: Building a Joint Classification Tree for 40 Targets
In this exercise we will show how decision trees can be applied for building a multi-class model capable of predicting affinity to 40 pharmaceutically important protein biotargets and how the resulting decision tree can be visualized.

• In the classifier frame, click Choose, then select the J48 method from the trees submenu.
• Click on the Start button.
The global statistics of the multi-class classification model are as follows:

So, the classification tree model has better statistical characteristics than the Naïve Bayes one (compare with the previous exercise). Now visualize the joint classification tree.

• Click the right mouse button on the model type trees.J48 in the Result list frame and select the menu item Visualize tree.
• Resize the new window with the graphical representation of the tree.
• Click with the right mouse button on the empty space in this window, and in the popup menu select the item Auto Scale.

The graphical representation of the tree is very large and cannot fit into the window, so use the scroll bars to move around it.


5. Exercise 9: Building the Multi-Class Random Forest Model
In this exercise we will show how the Random Forest method can be applied for building a multi-class model capable of predicting affinity to 40 pharmaceutically important protein biotargets.

• In the classifier frame, click Choose, then select the RandomForest method from the trees submenu.
• Click with the left mouse button on the word RandomForest in the Classifier frame. The window for setting the options of the Random Forest method pops up.
• As in Exercise 6, change the value of the numTrees option from 10 to 100, click OK, then click on the Start button.

The global statistics of the multi-class classification model are as follows:

We can see that the Random Forest method provides the strongest multi-class classification models.

Exercise: Perform the Cost/Benefit analysis for the ace target and compare its results with the two-class Random Forest classification model obtained in Exercise 6.


Appendix

1. Notes for Windows
On Windows, Weka should be located in the usual program launcher, in a folder Weka-version (e.g., weka-3-6-2). It is recommended to associate Weka with ARFF files. Thus, by double clicking an ARFF file, Weka/Explorer will be launched and the default directory for loading and writing data will be set to the same directory as the loaded file. Otherwise, the default directory will be the Weka directory. If you want to change the default directory for datasets in Weka, proceed as follows:
• Extract the weka/gui/explorer/Explorer.props file from the java archive weka.jar. This can be done using an archive program such as WinRAR or 7-zip.
• Copy this file into your home directory. To identify your home directory, type the command echo %USERPROFILE% in a DOS command terminal.
• Edit the file Explorer.props with WordPad.
• Change the line InitialDirectory=%c to InitialDirectory=C:/Your/Own/Path
If you need to change the memory available to Weka in the JVM, you need to edit the file RunWeka.ini or RunWeka.bat in the installation directory of Weka (administrator privileges may be required). Change the line maxheap=128m to maxheap=1024m. You cannot assign more than about 1.4 GB to the JVM because of Windows limitations.

2. Notes for Linux
To launch Weka, open a terminal and type: java -jar /installation/directory/weka.jar.

If you need to assign additional memory to the JVM, use the option -Xmx<MemorySize>m, replacing <MemorySize> with the required size in megabytes. For instance, to launch Weka with 1024 MB, type: java -Xmx1024m -jar /installation/directory/weka.jar.
