
This section is based on the approach in \cite{new8}. I. Alsmadi and I. Alhami proposed a content-based spam filtering approach that uses NGram classification to cluster an inbox containing both Arabic and English emails according to the email text content. The experiment was performed on a Gmail account containing 19620 emails. According to the authors, their approach shows that NGram can perform better than other approaches when emails contain content in different languages.

Several steps were applied to the Gmail account to create the required dataset. First, the authors used open-source software to export the emails as text files with the .EML extension. Second, the email information was parsed and pre-processed. Finally, the dataset was created with one record per email, where each record has 5 attributes (email name, sender, date, subject, and email body content). The authors calculated the frequency of words across all the emails. To analyze the Gmail account, they selected the words with a frequency above 100 from the dataset and applied stemming. High-frequency words that carry little information (e.g., PM, AM, I, will, are) were ignored.

A vector space model of the email data is generated as input to the clustering algorithms. After applying stemming and preprocessing, the authors followed the steps in \cite{new9} to generate the email matrix, in which rows represent the selected most frequent words and columns represent the individual emails. Figure 2 shows the algorithm used by \cite{new8} to cluster the emails. A random email is picked from the dataset. Its similarity to each of the other emails is then calculated, and the 100 most similar emails are clustered together. The same process is repeated for other emails. The algorithm shows that high-frequency terms are sometimes useless for distinguishing emails from one another.
Such high-frequency words can be specific to the account user's work or personal life. Term frequency and WordNet show limitations with very large datasets or when unique terms (such as Arabic words) appear in the emails being classified.
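
To make the pipeline described above more concrete, the following is a minimal Python sketch of the two main steps: building the word-by-email matrix from words above a frequency threshold, and repeatedly picking a random email and grouping it with its most similar emails. It is an illustration under stated assumptions and not the authors' implementation. In particular, the cosine similarity measure, the removal of clustered emails from the pool before the next pick, the function names, and the small stop list are all assumptions introduced here; the source does not specify them. Stemming is omitted for brevity.

```python
import math
import random
import re
from collections import Counter

# Hypothetical stop list: the source only gives these words as examples.
STOP_WORDS = {"pm", "am", "i", "will", "are"}


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    In Python 3, \\w matches Unicode word characters, so Arabic and
    English words are both kept.
    """
    return re.findall(r"\w+", text.lower())


def build_term_email_matrix(bodies, min_freq=100, stop_words=STOP_WORDS):
    """Return (vocab, matrix): rows are selected words, columns are emails."""
    tokens_per_email = [tokenize(body) for body in bodies]
    totals = Counter(tok for toks in tokens_per_email for tok in toks)
    # Keep words whose total frequency is above the threshold,
    # excluding the uninformative stop words.
    vocab = sorted(
        word for word, count in totals.items()
        if count > min_freq and word not in stop_words
    )
    counts = [Counter(toks) for toks in tokens_per_email]
    matrix = [[c[word] for c in counts] for word in vocab]
    return vocab, matrix


def cosine(u, v):
    """Cosine similarity of two count vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_emails(matrix, cluster_size=100, seed=0):
    """Pick a random email, group it with its most similar emails, repeat.

    Assumption: emails already placed in a cluster are not picked again.
    """
    n_emails = len(matrix[0]) if matrix else 0
    # Column j of the matrix is the vector for email j.
    columns = [[row[j] for row in matrix] for j in range(n_emails)]
    rng = random.Random(seed)
    unassigned = set(range(n_emails))
    clusters = []
    while unassigned:
        pick = rng.choice(sorted(unassigned))
        unassigned.remove(pick)
        ranked = sorted(
            unassigned,
            key=lambda j: cosine(columns[pick], columns[j]),
            reverse=True,
        )
        members = [pick] + ranked[:cluster_size]
        unassigned.difference_update(members)
        clusters.append(members)
    return clusters
```

With the defaults, `build_term_email_matrix(bodies)` keeps words occurring more than 100 times, and `cluster_emails(matrix)` groups each randomly picked email with its 100 most similar emails, matching the thresholds reported in the text. Because the vocabulary is chosen purely by frequency, words that are frequent in every email contribute little to separating clusters, which is consistent with the limitation noted above.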