Efficient n-gram construction for text categorization using feature selection techniques

Maximiliano García, Sebastián Maldonado*, Carla Vairetti

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-reviewed

19 Citations (Scopus)

Abstract

In this paper, we present a novel approach for n-gram generation in text classification. The a-priori algorithm is adapted to prune word sequences by combining three feature selection techniques. Unlike the traditional two-step approach for text classification in which feature selection is performed after the n-gram construction process, our proposal performs an embedded feature elimination during the application of the a-priori algorithm. The proposed strategy reduces the number of branches to be explored, speeding up the process and making the construction of all the word sequences tractable. Our proposal has the additional advantage of constructing a low-dimensional dataset with only the features that are relevant for classification, that can be used directly without the need for a feature selection step. Experiments on text classification datasets for sentiment analysis demonstrate that our approach yields the best predictive performance when compared with other feature selection approaches, while also facilitating a better understanding of the words and phrases that explain a given task; in our case online reviews and ratings in various domains.
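The embedded pruning idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the three feature selection techniques, so the example below substitutes two simple stand-ins (a minimum document frequency and a hypothetical class-rate gap score). The function name `grow_ngrams`, its parameters, the binary 0/1 labels, and the rule of extending only surviving prefixes are all assumptions made for illustration.

```python
from collections import defaultdict


def grow_ngrams(docs, labels, max_n=3, min_df=2, min_score=0.5):
    """Level-wise n-gram growth with pruning at every level (illustrative only).

    Assumes binary labels (0 or 1) and whitespace tokenization.
    """
    tokenized = [d.lower().split() for d in docs]
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos

    def score_level(n, survivors):
        # Per-class document counts for each candidate n-gram: [negatives, positives].
        df = defaultdict(lambda: [0, 0])
        for tokens, y in zip(tokenized, labels):
            seen = set()
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                # Assumed Apriori-style rule: only extend (n-1)-grams that survived.
                if n > 1 and gram[:-1] not in survivors:
                    continue
                seen.add(gram)
            for gram in seen:
                df[gram][y] += 1

        kept = {}
        for gram, (neg, pos) in df.items():
            if neg + pos < min_df:
                continue  # stand-in filter 1: too rare to be useful
            # Stand-in filter 2 (hypothetical score): gap between class-wise rates.
            score = abs(pos / max(n_pos, 1) - neg / max(n_neg, 1))
            if score >= min_score:
                kept[gram] = score
        return kept

    selected, survivors = {}, set()
    for n in range(1, max_n + 1):
        kept = score_level(n, survivors)
        if not kept:
            break  # no branch left to explore
        selected.update(kept)
        survivors = set(kept)
    return selected


docs = ["not good at all", "not good service", "very good food", "very good place"]
labels = [0, 0, 1, 1]
selected = grow_ngrams(docs, labels)
for gram in sorted(selected):
    print(" ".join(gram), selected[gram])
```

Because pruned sequences are never extended, the number of candidates at each level stays small, and the returned dictionary already contains only the retained features. In this toy corpus the unigram "good" is discarded because it appears equally in both classes, while "not good" and "very good" are reached through their surviving prefixes.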

Original language: English
Pages (from-to): 509-525
Number of pages: 17
Journal: Intelligent Data Analysis
Volume: 25
Issue number: 3
Status: Published - 2021

Bibliographical note

Funding Information:
The authors gratefully acknowledge financial support from CONICYT PIA/BASAL AFB180003 and FONDECYT-Chile, grants 1160738, 1200221 and 12200007.

Publisher Copyright:
© 2021 - IOS Press. All rights reserved.
