Automatic Classification System (Metadata Aurora)



With the rapid development of science and technology, information of every kind keeps appearing, and the volume of scientific literature, news corpora and online content has exploded. To find what they need in this mass of information, people have to classify it. Traditional manual text classification, however, is slow, costly and inefficient, and often requires domain expertise, so it can hardly meet today's practical needs. Many existing automatic systems have problems of their own: some require a large number of sample documents, which lowers their efficiency; others are given too few samples, which leads to incomplete or unclear classification and poor learning ability. Research on effective automatic text classification has therefore become necessary, and it has a very wide range of applications in text retrieval, information retrieval, information filtering, data organization, information management and Internet search.



In document classification, classified examples (sample documents) have an important influence: the more examples there are, the more accurate the classification. For large document collections, however, precise classification demands a very large number of examples, which greatly reduces efficiency. In view of this, we built the Metadata automatic classification system on the EM (Expectation-Maximization) algorithm, which also takes the influence of unclassified documents on classification into account. By incorporating the contribution of unclassified documents, the system obtains more accurate results from fewer examples. Because the contribution of unclassified documents is uncertain, a coefficient λ is added so that their influence can be adjusted. Tests on existing examples show that the system achieves good classification results and meets the accuracy demands of information classification. On top of example-based classification, we also made an improvement so that unclassified documents can be classified effectively when the user provides only keywords for each class.
The Metadata Aurora system has the following characteristics:
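
The keyword-only mode can be illustrated with a small sketch. It assumes, hypothetically, that keywords are used to pick out a few confidently matching documents, which then serve as the initial examples for training. The function name `seed_labels` and the tie-handling rule are illustrative assumptions, not the system's actual procedure.

```python
from collections import Counter

def seed_labels(documents, class_keywords):
    """Split documents into keyword-seeded examples and still-unclassified ones.

    documents      : list of token lists
    class_keywords : dict mapping class name -> list of keywords (user supplied)
    Returns (seeded, remaining), where seeded is a list of (tokens, class).
    """
    seeded, remaining = [], []
    for tokens in documents:
        counts = Counter(tokens)
        # Number of keyword occurrences per class in this document.
        hits = {c: sum(counts[k] for k in kws) for c, kws in class_keywords.items()}
        best = max(hits, key=hits.get)
        top = hits[best]
        tied = sum(1 for h in hits.values() if h == top) > 1
        if top == 0 or tied:
            # No keyword found, or no single winning class: leave unclassified.
            remaining.append(tokens)
        else:
            seeded.append((tokens, best))
    return seeded, remaining
```

Documents left in `remaining` would then be handled by the classifier trained on the seeded examples.
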

1, Small demand for samples (or keywords), making it easy to classify massive document collections
Because the Metadata automatic classification system takes the influence of unclassified documents on classification into account, it greatly reduces the number of classified documents required. According to statistics, to achieve good results when classifying 10,000 unclassified documents, the conventional method requires 2,000 sample documents (i.e. documents that are already classified), whereas the Metadata automatic classification system reaches the same results with only 600 sample documents.

2, Intelligent classification, accurate results
The Metadata automatic classification system uses an intelligent classifier to classify documents, and training the classifier continually updates the existing classification. Through continued training and learning the classifier gains experience, so classification accuracy keeps improving; once the classifier reaches a steady state, the system delivers its best classification results.

3, High classification reliability
The Metadata automatic classification system is based on statistical classification. It processes documents with an efficient word segmentation method and reduces English words to their roots (see root processing in Metadata Partner), so that the word information stays consistent with the information in the original document, which ensures the reliability of the classification process.
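
To make the preprocessing step concrete, here is a minimal sketch of tokenizing English text and reducing words to a root. The suffix list and the names `naive_stem` and `tokenize` are hypothetical; this deliberately naive suffix stripper is only a stand-in and is not the segmentation or root-processing method the system actually uses.

```python
import re

# Hypothetical, far-from-complete suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase the text, split it into alphabetic words, and stem each one."""
    return [naive_stem(w) for w in re.findall(r"[a-z]+", text.lower())]
```

With this sketch, "Training", "trained" and "trains" all map to the same token "train", so their counts are pooled when word statistics are collected.
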

4, Accounting for the role of unclassified documents so that classification reaches an optimal state
In the classification process, unclassified documents far outnumber sample documents, so under the EM algorithm classification accuracy depends largely on the unclassified documents. Their effect is two-sided: they can raise or lower classification accuracy. Taking this into account, the Metadata automatic classification system improves the EM algorithm by introducing a coefficient λ that adjusts the degree of influence of the unclassified documents. After testing, we take λ = 0.5.
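
The role of λ can be sketched with a small semi-supervised naive Bayes classifier trained by EM. This is an assumption about the general technique, not the system's actual implementation: the multinomial naive Bayes model, the Laplace smoothing, and all function names below are illustrative choices. Each sample document counts with weight 1, while each unclassified document contributes its estimated class probabilities scaled by λ.

```python
import math
from collections import Counter

def train_nb(docs, weights, classes, vocab):
    """M-step: fit naive Bayes from documents with fractional class weights."""
    prior_counts = {c: 1.0 for c in classes}  # Laplace smoothing on priors
    word_counts = {c: Counter() for c in classes}
    for tokens, w in zip(docs, weights):
        for c in classes:
            wc = w.get(c, 0.0)
            if wc == 0.0:
                continue
            prior_counts[c] += wc
            for t in tokens:
                word_counts[c][t] += wc
    total = sum(prior_counts.values())
    log_prior = {c: math.log(prior_counts[c] / total) for c in classes}
    log_like = {}
    for c in classes:
        denom = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {t: math.log((word_counts[c][t] + 1.0) / denom) for t in vocab}
    return log_prior, log_like

def posterior(tokens, model, classes):
    """Class probabilities for one document; unknown words are ignored."""
    log_prior, log_like = model
    scores = {c: log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
              for c in classes}
    m = max(scores.values())
    exp = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def em_lambda(labeled, unlabeled, classes, lam=0.5, iterations=10):
    """EM in which unclassified documents are down-weighted by lam."""
    vocab = {t for tokens, _ in labeled for t in tokens}
    vocab |= {t for tokens in unlabeled for t in tokens}
    lab_docs = [tokens for tokens, _ in labeled]
    lab_weights = [{label: 1.0} for _, label in labeled]
    # Start from the sample documents only.
    model = train_nb(lab_docs, lab_weights, classes, vocab)
    for _ in range(iterations):
        # E-step: soft-label each unclassified document, scaled by lam.
        unl_weights = [{c: lam * p for c, p in posterior(tokens, model, classes).items()}
                       for tokens in unlabeled]
        # M-step: refit on sample documents plus weighted unclassified documents.
        model = train_nb(lab_docs + list(unlabeled), lab_weights + unl_weights,
                         classes, vocab)
    return model
```

In this sketch, setting `lam=0` ignores the unclassified documents entirely and `lam=1` treats them like sample documents; the default of 0.5 mirrors the value quoted above.
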

5, Fast classification
Because the method was chosen with accuracy and efficiency as prerequisites, the classification process runs very fast. This makes classifying large numbers of documents feasible, so the system can be applied in more fields and handle larger volumes of information.

6, A wide range of uses
As mentioned earlier, many fields today need an automatic abstracting system, either embedded as a subsystem in a core application or used as an auxiliary tool for analyzing and processing massive amounts of information. Automatic summarization is therefore essential to doing good work in many fields.

An overview diagram of the Metadata automatic classification system is as follows: