## SOFTWARE

### Abstract

PRIMEROSE (Probabilistic Rule Induction Methods based on Rough Sets) generates probabilistic rules from databases. The system supports induction of rules with conditional probabilities, called accuracy and coverage, or with test statistics such as the χ² statistic; estimation of those conditional probabilities or test statistics by resampling methods (cross-validation and the bootstrap method); and calculation of statistics of the induced rules. It also accepts concept hierarchies and applies attribute-oriented generalization to rule induction. The main target domain is medicine, where the system has succeeded in automated knowledge acquisition for medical expert systems and in medical knowledge discovery.

### Introduction

PRIMEROSE (Probabilistic Rule Induction Methods based on Rough Sets) was developed by Shusaku Tsumoto and first introduced in 1993. It was originally designed to discover probabilistic rules from medical databases based on Ziarko's variable precision rough set (VPRS) model and to estimate the reliability of the induced rules by resampling methods. Next, in order to apply the system to genome sequence analysis, PRIMEROSE was extended with calculation of test statistics and with transformation of attributes based on attribute-oriented generalization. Finally, it was extended with calculation of statistics of rules, similarities between rules, and automatic comparison with domain experts' knowledge.

### Input

PRIMEROSE induces rules from one table, which includes condition and decision attributes. Although the system can induce rules without background knowledge, it requires the following parameters in order to control the number of rules: (1) thresholds for accuracy (p(R|D)) and coverage (p(D|R)), (2) selection of resampling methods, and (3) selection of rule statistics. Additionally, PRIMEROSE accepts (4) a concept hierarchy of attributes and (5) domain experts' knowledge for further analysis.
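
To make the role of the accuracy and coverage thresholds concrete, here is a minimal Python sketch. It is not PRIMEROSE code (the system itself is written in PROLOG). It assumes the usual rough-set reading, in which accuracy is the share of examples matching a rule's condition that belong to the decision class, and coverage is the share of the class's examples that match the condition. The toy table, the attribute names, and the function names are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


def accuracy_and_coverage(
    records: List[Record],
    condition: Callable[[Record], bool],
    decision_attr: str,
    decision_value: str,
) -> Tuple[float, float]:
    """Return (accuracy, coverage) of the rule `condition -> decision_value`."""
    matched = [r for r in records if condition(r)]
    in_class = [r for r in records if r[decision_attr] == decision_value]
    both = [r for r in matched if r[decision_attr] == decision_value]
    # Guard against empty denominators (assumption: report 0.0 in that case).
    accuracy = len(both) / len(matched) if matched else 0.0
    coverage = len(both) / len(in_class) if in_class else 0.0
    return accuracy, coverage


def passes_thresholds(
    accuracy: float, coverage: float, min_accuracy: float, min_coverage: float
) -> bool:
    """Keep a candidate rule only if it meets both user-supplied thresholds."""
    return accuracy >= min_accuracy and coverage >= min_coverage


# Hypothetical toy table: two condition attributes and one decision attribute.
TABLE: List[Record] = [
    {"nausea": "yes", "location": "whole", "diagnosis": "migraine"},
    {"nausea": "yes", "location": "lateral", "diagnosis": "migraine"},
    {"nausea": "yes", "location": "whole", "diagnosis": "tension"},
    {"nausea": "no", "location": "whole", "diagnosis": "tension"},
    {"nausea": "no", "location": "lateral", "diagnosis": "migraine"},
]

# Candidate rule: [nausea = yes] -> diagnosis = migraine
acc, cov = accuracy_and_coverage(
    TABLE, lambda r: r["nausea"] == "yes", "diagnosis", "migraine"
)
print(f"accuracy={acc:.2f} coverage={cov:.2f}")
print("kept:", passes_thresholds(acc, cov, min_accuracy=0.6, min_coverage=0.5))
```

Raising either threshold discards more candidate rules, which is how these parameters limit the number of rules that are reported.
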
PRIMEROSE can also control the search for rules using background knowledge, which is described in first-order predicate logic.

### Output

PRIMEROSE outputs induced rules, whose conditional parts are represented in disjunctive normal form (DNF), together with probabilities or statistics calculated from the training samples, resampling estimates of those probabilities and statistics, and statistics of the rules (the number of induced rules for each decision class, the average rule length, the similarity between induced rules, and so on). Furthermore, PRIMEROSE can output the following knowledge: (1) if a concept hierarchy of attributes is given, PRIMEROSE transforms the attributes and induces rules using this hierarchy; (2) if domain experts' knowledge is given, PRIMEROSE compares the induced rules with it and outputs the differences between the induced and the given knowledge.

### System architecture

PRIMEROSE consists of four modules: a data transformation module, a resampling module, a rule induction module, and an output interface module. It runs as follows. First, the system induces rules from the raw database without invoking the resampling module. Second, it estimates probabilities and induces rules using the resampling module. Third, it applies data transformation to the raw database and repeats the first and second steps. Finally, PRIMEROSE outputs rules both in natural-language form and as PROLOG predicates using the output interface module. Users only have to input tables and background knowledge, including parameters. Induced knowledge is stored as PROLOG predicates, so users can compare rules induced with different parameters.

### Search for knowledge

PRIMEROSE generates rules whose conditional parts are represented in DNF, using a kind of heuristic search. As evaluation functions, it uses conditional probabilities and test statistics. Cross-validation and/or the bootstrap method are then applied to estimate these probabilities and test statistics.
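
The resampling step can be illustrated with a small bootstrap sketch in Python. This is a simplified illustration, not the procedure PRIMEROSE implements: it holds one rule fixed and averages its accuracy over resampled tables, whereas the system as described also induces rules within the resampling module. The function names, the toy table, and the default of 200 resamples are assumptions made for the example.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]


def rule_accuracy(
    rows: List[Record],
    condition: Callable[[Record], bool],
    is_positive: Callable[[Record], bool],
) -> Optional[float]:
    """Share of rows matching the condition that belong to the target class."""
    matched = [r for r in rows if condition(r)]
    if not matched:
        return None  # the rule fires on no row of this sample
    return sum(1 for r in matched if is_positive(r)) / len(matched)


def bootstrap_accuracy(
    rows: List[Record],
    condition: Callable[[Record], bool],
    is_positive: Callable[[Record], bool],
    n_resamples: int = 200,
    seed: int = 0,
) -> Optional[float]:
    """Average the rule's accuracy over tables resampled with replacement."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    estimates = []
    for _ in range(n_resamples):
        sample = rng.choices(rows, k=len(rows))  # same size as the original table
        acc = rule_accuracy(sample, condition, is_positive)
        if acc is not None:
            estimates.append(acc)
    return mean(estimates) if estimates else None


# Hypothetical toy table and rule: [nausea = yes] -> diagnosis = migraine
ROWS: List[Record] = [
    {"nausea": "yes", "diagnosis": "migraine"},
    {"nausea": "yes", "diagnosis": "migraine"},
    {"nausea": "yes", "diagnosis": "tension"},
    {"nausea": "no", "diagnosis": "tension"},
    {"nausea": "no", "diagnosis": "migraine"},
]

estimate = bootstrap_accuracy(
    ROWS,
    condition=lambda r: r["nausea"] == "yes",
    is_positive=lambda r: r["diagnosis"] == "migraine",
)
print("bootstrap estimate of accuracy:", estimate)
```

Comparing such a resampled estimate with the accuracy computed directly on the training table gives a rough indication of how optimistic the training-sample figure is.
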
### User

The system is intended not only for analysts but also for domain users.

### KDD process

The main target domains are medicine and genome sequence analysis. Induced knowledge is presented as rules understandable to domain experts, with numeric information such as probabilities and test statistics. Statistics of the rules are also presented as a summary table. The system has the following advantages: (1) PRIMEROSE offers multiple choices of evaluation function and can compare the results induced with different evaluation functions. (2) The system can compare induced rules with domain experts' knowledge and detect the differences between them. (3) PRIMEROSE accepts domain knowledge represented in first-order predicate logic. On the other hand, the disadvantage of the system is that it is somewhat slower than other systems because it is written in PROLOG. In the biomedical domain, PRIMEROSE has obtained the following results on automated knowledge acquisition and has made several new discoveries in biomedical databases: (1) PRIMEROSE induces probabilistic rules that match domain experts' knowledge. (2) The resampling methods introduced correctly estimate (predictive) conditional probabilities. (3) PRIMEROSE has discovered new knowledge from genome sequence databases, part of which has been validated by biochemical experiments.