ProbRough is a system for inducing decision rules from data. The rules enable us to predict values of a decision attribute for new objects on the basis of condition attribute values. Input data are given in the form of one decision table. Objects are characterized by any mixture of qualitative and quantitative condition attributes and one discrete decision attribute. The system accepts noisy and inconsistent data with missing attribute values. Background knowledge is used in the form of prior probabilities of decisions and different costs of misclassification. The domains of the decision rules that compose the resultant rough classifiers are disjoint and fill up the space of all possible values of condition attributes, so the set of rule domains forms a partition of this space. The ProbRough system searches through various partitions using a criterion based on minimizing the misclassification costs. ProbRough has demonstrated its usefulness on many real-world classification and knowledge discovery problems in business, technology and medicine.
The ProbRough system was developed by A. Lenarcik and Z. Piasta at Kielce University of Technology.
The system was inspired by the problem of discretizing continuous attributes in the context of rough set theory.
The input to the ProbRough system is a single decision table given as an ASCII file. Each condition attribute is characterized by its type: continuous, discrete ordered, or qualitative unordered. Continuous attributes are discretized by choosing elements from sets of intermediate values. These sets are given in advance or can be obtained from the input decision table. Background knowledge is represented by prior probabilities that reflect the distribution of the decisions in the universe and by a cost matrix that holds the unit costs of misclassification.
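As an illustration of these inputs, the following minimal sketch shows one way a set of intermediate values could be obtained from a decision table column, together with a possible in-memory form of the background knowledge. It assumes that the intermediate values are midpoints between consecutive distinct observed values, which the description above does not state. The function name, the attribute values, the decision labels and the numbers are all hypothetical and do not reflect the ProbRough file format.

```python
def candidate_cut_points(values):
    """Midpoints between consecutive distinct values; None marks a missing value.

    Assumption: "intermediate values" are taken to be midpoints.
    """
    distinct = sorted({v for v in values if v is not None})
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


# Hypothetical column of a continuous condition attribute.
age = [25, 40, None, 25, 31]
cuts = candidate_cut_points(age)  # one candidate between 25 and 31, one between 31 and 40

# Hypothetical background knowledge for a two-valued decision attribute.
priors = {"good": 0.9, "bad": 0.1}
# cost[true_decision][assigned_decision]: unit cost of misclassification.
cost = {
    "good": {"good": 0.0, "bad": 1.0},
    "bad": {"good": 5.0, "bad": 0.0},
}
# Prior probabilities should sum to one.
assert abs(sum(priors.values()) - 1.0) < 1e-9
```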
For each number of iterations of the phase that partitions the space of condition attribute values, up to a prespecified maximum, the ProbRough system generates a family of equivalent decision rule sets (rough classifiers). The strength of each induced rule is expressed by the number of objects confirming the rule and by the average costs associated with the decisions.
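The two strength measures can be sketched for a single rule domain as follows. This is one plausible reading rather than the formula used by ProbRough: it assumes that the average cost of a decision is the prior-weighted expected cost over the objects falling into the domain, with class frequencies estimated from the decision table. The function name, the counts, the priors and the costs are hypothetical.

```python
def decision_costs(cell_counts, class_totals, priors, cost):
    """Average cost of each possible decision inside one rule domain.

    cell_counts[k]  : objects of true decision k that fall into the domain
    class_totals[k] : objects of true decision k in the whole table
    cost[k][d]      : unit cost of assigning decision d when the truth is k
    """
    # Weight of each true decision: prior times the estimated probability
    # that an object of that decision falls into the domain.
    weights = {
        k: priors[k] * cell_counts.get(k, 0) / class_totals[k] for k in priors
    }
    total = sum(weights.values())
    if total == 0:
        return {}  # empty domain: no cost estimate available
    return {
        d: sum(weights[k] * cost[k][d] for k in priors) / total for d in priors
    }


priors = {"good": 0.5, "bad": 0.5}
cost = {"good": {"good": 0.0, "bad": 1.0}, "bad": {"good": 5.0, "bad": 0.0}}
cell_counts = {"good": 30, "bad": 10}
class_totals = {"good": 100, "bad": 100}

support = sum(cell_counts.values())  # objects falling into the domain
costs = decision_costs(cell_counts, class_totals, priors, cost)
best = min(costs, key=costs.get)  # decision with the lowest average cost
```

With these made-up numbers the cheaper decision is "bad" (average cost about 0.75 against about 1.25 for "good"), even though most objects in the domain are "good", because the hypothetical cost matrix penalizes a missed "bad" five times more heavily.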
ProbRough tries to find an optimal partition of the space of condition attribute values that minimizes the average misclassification cost, and then induces decision rules that describe the whole partition in a compact way. As a search strategy, ProbRough uses a beam search guided by the cost criterion. Each decision rule assigns a decision or a set of decisions to the subset of the space of condition attribute values that is the domain of the rule. Each rule domain has the form of a special Cartesian product, which allows the rule to be presented in a simple if-then logical form. The system accepts missing attribute values: objects with missing values are either removed from the computations, or a missing value is treated as an additional value of the attribute. The treatment of missing values has to be specified for each attribute. The minimum percentage of all learning objects that has to be used in computing the criterion value is given in advance. ProbRough accepts qualitative attributes with a large number of values. For each such value set, all possible divisions into two disjoint subsets are considered in the phase of partitioning the space of condition attribute values. The number of divisions cannot exceed a given upper bound.
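The divisions of a qualitative value set into two disjoint subsets can be enumerated as in the sketch below. A set of n values has 2^(n-1) - 1 such divisions when both subsets must be non-empty, which is why an upper bound matters for attributes with many values. How ProbRough behaves when the bound is reached is not described here, so the sketch simply stops generating divisions at that point, which is an assumption. The function name and the example values are hypothetical.

```python
from itertools import combinations, islice


def binary_divisions(values, max_divisions):
    """List divisions of a value set into two non-empty disjoint subsets.

    Assumption: enumeration is simply truncated after max_divisions.
    """
    values = sorted(set(values))
    if len(values) < 2:
        return []  # nothing to divide
    first, rest = values[0], values[1:]

    def generate():
        # Keeping `first` on the left side avoids listing each division
        # twice (once per ordering of the two subsets).
        for size in range(len(rest)):  # at least one value stays on the right
            for extra in combinations(rest, size):
                left = {first, *extra}
                yield left, set(values) - left

    return list(islice(generate(), max_divisions))


# Hypothetical qualitative attribute with three values: 2^(3-1) - 1 = 3 divisions.
divisions = binary_divisions(["red", "green", "blue"], max_divisions=10)
for left, right in divisions:
    print(sorted(left), "|", sorted(right))
```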
Rule sets generated by the ProbRough system are comparable or superior to many acknowledged classifiers, both in terms of predictive accuracy and in their capability to explain the learned knowledge. ProbRough does not require that the input data be kept in main memory. As a result, rough classifiers can be generated from large databases with practically unlimited numbers of objects and attributes. Any new object is covered by exactly one decision rule of the resultant classifier. The rough classifiers are not sensitive to outliers in the data. A disadvantage of ProbRough is that so far it is implemented on a PC DOS platform only (Unix and Windows NT C versions are in preparation).
The ProbRough system is a result of work that was supported by the State Committee for Scientific Research in Poland (KBN) under grants #8 S503 033 06, #8 S503 021 06 and #8 T11C 010 12.