Processes
Goals and Purpose: A comparative analysis will be conducted between rough sets and at least one other data mining approach, preferably more than one, for use in case-based reasoning. Rough sets allow for data mining involving inconsistent and noisy data sets. Experiments will be carried out employing Bayesian networks and rough sets. Many of the rules produced by the rough set approach are expected to overfit, and the overfit grows with the size of the training set.
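To illustrate how rough sets accommodate inconsistent data, the following minimal sketch computes the lower and upper approximations of a decision class. The toy table, its attribute names, and the `approximations` helper are hypothetical and are not taken from the project data or from ROSETTA. Objects that share all condition values but disagree on the decision fall outside the lower approximation (certain members) yet remain inside the upper approximation (possible members).

```python
from collections import defaultdict


def approximations(objects, condition_attrs, decision_attr, target):
    """Return (lower, upper) approximations of the objects whose decision equals target."""
    # Group object indices into indiscernibility classes: objects with
    # identical values on every condition attribute cannot be told apart.
    classes = defaultdict(set)
    for idx, obj in enumerate(objects):
        key = tuple(obj[attr] for attr in condition_attrs)
        classes[key].add(idx)

    target_set = {i for i, obj in enumerate(objects) if obj[decision_attr] == target}

    lower, upper = set(), set()
    for members in classes.values():
        if members <= target_set:   # whole class has the target decision
            lower |= members
        if members & target_set:    # at least one member has the target decision
            upper |= members
    return lower, upper


# Hypothetical toy decision table; objects 0 and 1 are inconsistent.
TABLE = [
    {"temp": "high", "vibration": "yes", "failure": "bearing"},
    {"temp": "high", "vibration": "yes", "failure": "none"},
    {"temp": "low", "vibration": "yes", "failure": "bearing"},
    {"temp": "low", "vibration": "no", "failure": "none"},
]

lower, upper = approximations(TABLE, ["temp", "vibration"], "failure", "bearing")
print(sorted(lower))  # [2]
print(sorted(upper))  # [0, 1, 2]
```

The gap between the two sets (objects 0 and 1) is the boundary region, which is where the inconsistency in the data is made explicit rather than hidden.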
Processes: Two data mining methods suitable for use in case-based reasoning were comparatively evaluated: Bayesian Networks and Rough Sets. Experiments were carried out on a Bayesian classifier using AutoClass and on a Rough Set tool for producing logic rules using Rosetta. Several benchmarks were employed, all taken from the original documentation for Rosetta and AutoClass. The Rough Set method produced many rules, and the researchers determined that overfit increases with the size of the training set. Bayesian classification required considerable attention to human interpretation of the results. Rules generated by Rosetta were found useful as background knowledge in Case-Based Reasoning and for predicting class membership.

The criteria for evaluating the Data Mining techniques were: noise level, consistency, prior knowledge, output, retractability, robustness, overfit/underfit, and whether or not the method was dynamic.

Different experiments were performed with different data sets. The data sets were obtained from the original documentation for Rosetta and AutoClass, which included the KDD standard IRIS database. Attributes were chosen that had some possible functional dependencies. Attributes were also chosen that would partition the values in a given attribute, in order to predict the failure type given values from the other attributes.

Jennifer and Christine conducted a background search on case-based reasoning. Then, while Jennifer researched rough sets, Christine researched Bayesian networks. The research drew on various resources, including books, Internet sites, and articles. Findings include a core knowledge of case-based reasoning, rough sets, and Bayesian networks, and of various ways they are used in real-world applications. After looking into the background of each topic, we downloaded software from the Internet that employed these techniques. Jennifer downloaded ROSETTA to work with rough sets and Christine downloaded AutoClass C to work with Bayesian networks. After learning how to run each program and what it could do, we teamed up to discover how the two would work together on the same set of data.
Noise Level: Are data missing, biased, or not applicable? Is the system able to reason with noisy data, or must the data be cleaned?
Consistency: Does the system discover inconsistency? Is it able to reason with inconsistency? Is the system trying to “hide” uncertainty, or is it actively using and revealing it? Will the system discover latent attributes?
Prior Knowledge: Do the researchers need any prior knowledge of the system that is to be analyzed? Is it in the form of a data dictionary or domain knowledge? Is it extracted automatically? Does the method use metadata to eliminate relationships between data that are impossible or unwanted in the model?
Output: What type of output can the system generate? Is it interpretable for humans? Can the output be used as prior knowledge later? Does it produce graphs, written reports, or logic rules? Does it state exactly why this output is produced?
Retractability: Is it possible to find out why the algorithm produced the results it did?
Robustness: How does the algorithm handle errors, invalid input, and shortage of memory?
Underfit/Overfit: Does the algorithm perform well on test data?
Dynamic: Is the method dynamic, in the sense that it allows results to be refined by adding more data, or does it only allow static data sets?

Conclusions and Results:

Noise and Retractability:
Rough Set Method: This method is capable of handling most types of noise. If the input data are missing a dispensable attribute, the classification process is not affected. If the input data are missing important attributes, the Rough Set method cannot classify the object and the object is thrown away.
Bayesian Method: Able to handle noise, but only as a normal attribute value. Like the Rough Set Method, this method appears unable to handle values that are “not applicable”.
The user of AutoClass must take care when filling in values for missing and “not applicable” attribute values; otherwise, unreliable results will be produced.
Rough Set Method / Bayesian Method: Since both methods are based upon logical and statistical foundations, it is possible to find out how the tools discovered the rules or classes. This shows that both methods are retractable.

Output:
Rough Set Method: The output from Rosetta is a set of rules that can be used to:
· Find functional dependencies between attributes in the dataset.
· Discover rules that govern processes, which can then be used to generate background knowledge in Case-Based Reasoning.
· Predict future events.
· Find attributes of importance.
Rosetta has a way of testing how well the generated rules perform on a test set.
Bayesian Method: The output from AutoClass is a set of classes that can be used to:
· Discover similarities between attribute values.
· Group data for rule generation later.
· Develop knowledge of the domain.
AutoClass does not have a decision to test against and uses only the model it has learned to classify the test set. Therefore, AutoClass does not allow for testing how well the classes fit the test set.

Usage and Robustness:
Rough Set Method: Rosetta supports functions that preprocess the data, including removal of incompleteness and discretization. The user interacts with Rosetta through a graphical user interface.
Bayesian Method: AutoClass expects a ready dataset and offers no way of working with the data except for removing whole attributes.

Consistency and Prior Knowledge: Neither Rosetta nor AutoClass requires or uses any background knowledge of the data. However, AutoClass requires the user to know the range of each attribute. The user is able to control the process in Rosetta by deciding on a decision variable. AutoClass gives no control over how the classification is done. Neither Rosetta nor AutoClass learns by interaction with the user.
Rosetta is unable to use knowledge gained through analysis of data as background knowledge later. However, AutoClass includes a parameter that allows prior knowledge to be used in determining new results. Rosetta also makes it possible to test the generated rules via a confusion matrix. AutoClass does not provide for testing inconsistency.

Underfit/Overfit and Dynamic:
Rough Set Method: Rosetta allows the user to calculate reducts, both normal and object oriented, and to generate rules. The reducts can be calculated by three different algorithms: RSES exhaustive calculation, the Johnson algorithm, and a genetic algorithm. All of the algorithms support dynamic calculation. Both reducts and rules can be pruned away based on support value or with respect to a decision value or an attribute. It is also possible to calculate the rough classification for the decision system.
Bayesian Method: AutoClass is run as a batch program and therefore lacks a script language that would enable automated usage.

Generic Results of the Experiments:
Application of Rosetta revealed that:
· When the data set has many attribute values, more objects are required to produce good results.
· The overfit increases with the size of the training set. Weak rules should be pruned away.
· Rosetta produces many rules!
Application of AutoClass revealed that:
· The number of attributes has little influence on the number of classes generated.
· AutoClass results require considerable human interpretation.
· More data leads to more precise classification.
· Automatic classification of objects can reveal poor human classification in the data.
AutoClass groups data into classes for better understanding, while Rosetta’s output can be used directly as background knowledge in Case-Based Reasoning and in making predictions about failures.

Publication: Texas Academy of Science [Annual Meeting or Journal] (tentative)
Research posted at http://cs.uttyler.edu/research/crew/
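As a rough illustration of the reduct calculation named in the results above, the sketch below follows the general idea of the Johnson algorithm: a greedy pass that repeatedly picks the attribute distinguishing the most remaining pairs of objects with different decisions. It is a simplified sketch, not ROSETTA's implementation; the `johnson_reduct` function, the toy table, and the attribute names are hypothetical, and the greedy result is an approximation that is not guaranteed to be a minimal reduct.

```python
from collections import Counter
from itertools import combinations


def johnson_reduct(objects, condition_attrs, decision_attr):
    """Greedy (Johnson-style) approximation of a decision-relative reduct."""
    # Discernibility entries: for each pair of objects with different
    # decisions, the condition attributes on which the pair differs.
    entries = []
    for a, b in combinations(objects, 2):
        if a[decision_attr] != b[decision_attr]:
            diff = frozenset(attr for attr in condition_attrs if a[attr] != b[attr])
            if diff:  # an empty set is an inconsistent pair no attribute can separate
                entries.append(diff)

    reduct = []
    while entries:
        counts = Counter(attr for entry in entries for attr in entry)
        # Pick the attribute occurring in the most remaining entries;
        # ties go to the attribute listed first in condition_attrs.
        best = max(condition_attrs, key=lambda attr: counts[attr])
        reduct.append(best)
        # Drop every pair that the chosen attribute already discerns.
        entries = [entry for entry in entries if best not in entry]
    return reduct


# Hypothetical toy decision table: "vibration" alone determines "failure".
TABLE = [
    {"temp": "high", "vibration": "yes", "noise": "loud", "failure": "bearing"},
    {"temp": "high", "vibration": "no", "noise": "loud", "failure": "none"},
    {"temp": "low", "vibration": "yes", "noise": "quiet", "failure": "bearing"},
    {"temp": "low", "vibration": "no", "noise": "quiet", "failure": "none"},
]

print(johnson_reduct(TABLE, ["temp", "vibration", "noise"], "failure"))  # ['vibration']
```

A smaller attribute set yields shorter rules, which is one reason reduct calculation and pruning matter when a tool produces many rules.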
Resources
The following are links for AutoClass C:
AutoClass C General Information and Download Software: http://sal.iatp.by/Z/3/AUTOCLASS.html
The following are links for ROSETTA: