KDD-R is a comprehensive set of tools for rough sets-based data mining. This is a prototype system which incorporates many of the newest research results obtained in the area of rough sets. It is developed around variable precision rough sets model. The main discovery task of the system is the computation of probabilistic rules satisfying predefined rule length (number of rule conditions), strength (number of supporting cases) and rule predictive probability constraints from a flat table containing data in attribute-value format. The rules can be computed from real, integer or categorical data, or a mix of them. The results of the analysis are presented in three forms: a tabular form, ASCII text in the format IF (conditions) THEN (decision) with probability P and strength N, and in the histogram form. In the current version KDD-R is UNIX-based with X-Windows interface. The main application area for the system is market research. Other potential applications, are banking, insurance, stock market prediction, medical data analysis, sensor data analysis for control, or questionaire analysis.
This note presents system KDD-R (Knowledge Discovery in Data using Rough sets) developed at the University of Regina. KDD-R is a successor of two earlier commercial systems for rough sets-based data analysis designed by the author and implemented by his students, systems DataQuest and DataLogic. The basic underlying methodology behind KDD-R is the theory of rough sets, and in particular its extension called variable precision model of rough sets. The inner workings of the system have been described in more detail in. KDD-R is a prototype system which incorporates many of the research results obtained in the area of rough sets. In the current version KDD-R is UNIX-based with X-Windows interface.
The system accepts input in the form of a single ASCII table with columns separated by at least one space. The original table must be associated with a format description prepared by the user in which each data column is given attribute name, value type and each attribute name must be designated as condition or decision ("independent" or "dependent" variable). System utility program exists to help in preparation of the format description. The system permits presence of one symbol representing "unknown" data value. The rules are subsequently computed with respect to selected values of decision attribute. When preparing format description the user is supposed to provide discretization definition for numeric attributes. This can be done based on his/her understanding of the problem domain, or system-supplied utility program can be used in selecting discretization points. In the current version the KDD-R system is not integrated with any particular DBMS.
The main discovery task of the system is the computation of probabilistic rules satisfying predefined rule length (number of attributes), strength (number of supporting cases) and rule predictive probability constraints from a flat table containing data in attribute-value format. The rules can be computed from real, integer or categorical data, or a mix of them. The results of the analysis are presented in three forms: a tabular form, ASCII text in the format IF (conditions) THEN (decision) with probability P and strength N, or in histogram form. The system produces two rule sets: 1. all rules satisfying predefined constraints and 2. only maximal rules, that is rules which are not subsumed by other rules.
The system consists of the following major components:
The system is using a form of "branch-and-bound" technique for exhaustive search for rules satisfying predefined criteria. Since valuable and "credible" from application perspective rules are relatively short, strong and have relatively high predictive probability, the bounded search process is typically quite effective at significantly reducing the search space. The user background knowledge used in limiting the search space is expressed implicitly through the following user's decisions:
The system is primarily intended for trained domain users. The operation of the system is simple and minimal background is required to master the concepts behind the system's principles and operation. No knowledge of rough sets theory is required to effectively use the system.
The system have been used primarly for market-oriented research serving data tables of up to 700000 customer records. These applications involved analysis of customer demographic and other characteristics in relationship to his/her ability to accept new product offerings. Some experiments were also performed in medical domain trying to find predictive rules linking patient's health condition information with the likelihood of the presence or absence of a disease.
The results of the analysis normally are presented in the text form, but they can be converted into visual display as well showing rules in the histogram form. In such a display, each rule corresponds to one bar of the histogram with the height of the I bar being proportional to rule's predictive probability and width being proportional to rule's strength. The use of such a display helps in visual identification of strong and highly predictive rules.
The main advantage of KDD-R is its ability to extract strong rules from data, both numeric and categorical data. No formal comparison in that respect with other systems has been done, but author's experience with some other systems, in particular decision tree-based, indicates that the rules produced by KDD-R are on the average much stronger than "tree rules" corresponding to paths from the root to leave'; of the tree. In other words, there is no "overfitting" problem with rules produce'1 by KDD-R simply because weak rules are not extracted from the data, as opposed to other systems. The disadvantage of this approach is clearly the possible incomplete coverage of available data with rules, that is, some data records may not match any rules. This is the consequence of the underlying philosophy that it is better not to "mine" likely incorrect rules, focusing instead on identifying, .strong rules whose "credibility" is supported by a relatively large number of data points matching their preconditions.
The research reported in this paper was supported in part by a research grant from the Natural Sciences and Engineering Research Council of Canada.