Automatic Concept Extraction from Persian News Text Based On Deep Learning
Subject areas: Natural Language Processing
ZahraSadat Hosseini 1, SayedGholamHassan Tabatabaei 2
1 - Malek-Ashtar University of Technology, Tehran, Iran
2 - Department of Electrical and Computer Engineering, Malek-Ashtar University of Technology, Tehran, Iran
Keywords: Concept Extraction, Deep Learning, Keyphrase, BERT-BASE, ParsBERT, mT5
Abstract:
One of the most critical issues in natural-language understanding is extracting concepts from text. A concept expresses the essential information in a text. Concept extraction refers to the process of extracting and generating keyphrases that may or may not appear in the text. Automatic concept extraction from Persian news text is a challenging problem due to the complexity of the Persian language. In this paper, we first review traditional and deep learning-based models for keyphrase extraction and generation. We then present an automated Persian news concept extraction algorithm that exploits encoder-decoder models. Specifically, our proposed models use the output vectors of the BERT-Base and ParsBERT language models as word embeddings. The evaluation results show that changing the word embedding layer improves recall, precision, and F1 by about 3.15%. Because encoder-decoder models process their inputs sequentially, training time increases; in addition, they retain little information from long sentences. Therefore, for the first time, we use mT5-Base, which is built on the Transformer architecture and receives and processes data in parallel. The mT5-Base model achieves recall, precision, and F1 of 55.66%, 55.47%, and 55.48%, respectively, on concept extraction. Its F1 score is 19.8% higher than that of the previous models. This model is therefore effective for extracting concepts from Persian news texts.
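The recall, precision, and F1 measures reported above can be illustrated with a minimal sketch. The abstract does not state how predicted and reference keyphrases are matched or how scores are averaged, so the exact-match comparison after whitespace and case normalization, the per-document macro-averaging, and the names `normalize`, `prf1`, and `macro_prf1` are all assumptions made for illustration, not the authors' evaluation code.

```python
from typing import Iterable, List, Sequence, Tuple


def normalize(phrase: str) -> str:
    """Collapse whitespace and lowercase a keyphrase (assumed matching rule)."""
    return " ".join(phrase.split()).lower()


def prf1(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one document using exact set matching."""
    pred = {normalize(p) for p in predicted if p.strip()}
    ref = {normalize(g) for g in gold if g.strip()}
    matched = len(pred & ref)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_prf1(
    pairs: Sequence[Tuple[List[str], List[str]]]
) -> Tuple[float, float, float]:
    """Average per-document scores over a corpus (assumed macro-averaging)."""
    if not pairs:
        return 0.0, 0.0, 0.0
    scores = [prf1(pred, gold) for pred, gold in pairs]
    n = len(scores)
    return (
        sum(s[0] for s in scores) / n,
        sum(s[1] for s in scores) / n,
        sum(s[2] for s in scores) / n,
    )


if __name__ == "__main__":
    # Hypothetical predictions and references for two documents.
    corpus = [
        (["deep learning", "news"], ["Deep  Learning", "persian", "keyphrase"]),
        (["mt5"], ["mT5"]),
    ]
    p, r, f = macro_prf1(corpus)
    print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

With macro-averaging, the corpus-level F1 is the mean of the per-document F1 scores, so it need not equal the harmonic mean of the corpus-level precision and recall.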