Email Data Cleaning
This paper addresses the issue of ‘email data cleaning’ for text mining. Many text mining applications need to take emails as input. Email data is usually noisy, so it must be cleaned before mining. Several products offer email cleaning features; however, the types of noise they can eliminate are limited. Despite the importance of the problem, email cleaning has received little attention in the research community. A thorough and systematic investigation of the issue is thus needed. In this paper, email cleaning is formalized as a problem of non-text filtering and text normalization. In this way, email cleaning becomes independent of any specific text mining process. A cascaded approach is proposed, which cleans an email in four passes: non-text filtering, paragraph normalization, sentence normalization, and word normalization. As far as we know, non-text filtering and paragraph normalization have not been investigated previously. Methods for performing these tasks on the basis of Support Vector Machines (SVM) are also proposed in this paper, and the features used in the models are defined. Experimental results indicate that the proposed SVM-based methods significantly outperform the baseline methods for email cleaning. The proposed method has been applied to term extraction, a typical text mining task. Experimental results show that the accuracy of term extraction can be significantly improved by using the data cleaning method.
Email is one of the most common means of communication via text. It is estimated that an average computer user receives 40 to 50 emails per day. Many text mining applications need to take emails as input, for example, email analysis, email routing, email filtering, email summarization, information extraction from email, and newsgroup analysis. Unfortunately, email data can be very noisy. Specifically, it may contain headers, signatures, quotations, and program code; it may contain extra line breaks, extra spaces, and special character tokens; it may have spaces and periods mistakenly removed; and it may contain words that are badly cased, non-cased, or misspelled. To achieve high-quality email mining, it is necessary to conduct data cleaning as the first step. This is exactly the problem addressed in this paper. Many text mining products have email data cleaning features. However, the number of noise types they can process is limited. In the research community, to the best of our knowledge, no previous study has sufficiently investigated the problem. Data cleaning work has been done mainly on structured tabular data, not on unstructured text data. In natural language processing, sentence boundary detection, case restoration, spelling error correction, and word normalization have been studied, but usually as separate issues. The methodologies proposed in previous work can be used in email data cleaning. However, they are not sufficient for removing all types of noise.
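To make the four-pass cascade and the noise types listed above more concrete, the following is a minimal Python sketch. It is an illustration only: it uses hand-written regular-expression rules rather than the SVM models proposed in this paper, and every function name, rule, and the sample message are assumptions introduced for the example. The rules are deliberately crude (for instance, the missing-space rule would also split a file name such as "notes.txt").

```python
import re

# Illustrative rule-based cascade. These heuristics are assumptions made
# for the example; they are NOT the SVM-based models described in the paper.

HEADER_RE = re.compile(r"^(From|To|Cc|Subject|Date):")


def filter_non_text(lines):
    """Pass 1: drop header lines, quoted lines, and the signature block."""
    kept = []
    for line in lines:
        if HEADER_RE.match(line):
            continue  # header line
        if line.lstrip().startswith(">"):
            continue  # quotation from an earlier message
        if line.strip() == "--":
            break  # conventional signature delimiter: drop everything after it
        kept.append(line)
    return kept


def normalize_paragraphs(lines):
    """Pass 2: remove extra line breaks by joining lines between blank lines."""
    paragraphs, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def normalize_sentences(paragraph):
    """Pass 3: collapse extra spaces and restore a space after sentence-ending punctuation."""
    text = re.sub(r"\s+", " ", paragraph).strip()
    return re.sub(r"(?<=[a-z])([.!?])(?=[A-Za-z])", r"\1 ", text)


def normalize_words(paragraph):
    """Pass 4: a very small case-restoration step that capitalizes sentence starts."""
    return re.sub(
        r"(^|[.!?] )([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        paragraph,
    )


def clean_email(raw):
    """Run the four passes in order and return a list of cleaned paragraphs."""
    lines = filter_non_text(raw.splitlines())
    paragraphs = normalize_paragraphs(lines)
    return [normalize_words(normalize_sentences(p)) for p in paragraphs]


if __name__ == "__main__":
    sample = (
        "From: a@b.com\n"
        "Subject: hi\n"
        "\n"
        "> old text\n"
        "hello there.this is\n"
        "a   test.\n"
        "\n"
        "second paragraph here.\n"
        "-- \n"
        "Bob\n"
    )
    for paragraph in clean_email(sample):
        print(paragraph)
```

Each pass consumes the output of the previous one, which is the sense in which the approach is cascaded. Fixed rules like these cover only a few noise types and misfire easily, which illustrates why a learned classifier is preferable for each pass.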