The overall intention and focal points of a text can be expressed in various ways, and keywords are arguably the most suitable. Keywords are the most prominent phrases in a document, the ones that convey its main message. By extracting relevant words and phrases, keyword extraction helps uncover meaningful patterns in a text and provides an overview of its content. It can also highlight the most significant concepts and focus the attention of a machine learning algorithm on them.
Keyword extraction is an important subtask of natural language processing. Because it reduces the complexity of a text and makes it easier to process, it can serve as the basis for many other tasks, such as text classification, clustering, and summarization. Extracted keywords help a machine capture the meaning and context of a text, which in turn lets it analyze the text more effectively, recognize patterns, and make more accurate decisions. Keyword extraction can also shorten processing time by discarding unnecessary words and focusing on the most important ones. Several datasets have been proposed for evaluating keyword extraction methods in Persian, but most of them contain only the authors’ keywords and do not cover all potential ones. Evaluating on such datasets therefore leads to incorrect judgments about the accuracy of proposed supervised and unsupervised methods.
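The effect of an incomplete gold set on measured accuracy can be illustrated with a small sketch. The keyword lists below are hypothetical toy data, not drawn from any real dataset, and exact-match precision, recall, and F1 are assumed here as the evaluation measures; the function name is likewise only illustrative.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 between two keyword collections."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical toy data (assumption, for illustration only).
predicted = ["keyword extraction", "persian", "textrank", "benchmark"]
author_only = ["keyword extraction", "persian"]
expert_completed = author_only + ["textrank", "benchmark", "graph-based methods"]

# The same predictions are scored against the two gold sets.
scores_author = precision_recall_f1(predicted, author_only)
scores_complete = precision_recall_f1(predicted, expert_completed)
print("author keywords only:", scores_author)
print("expert-completed set:", scores_complete)
```

In this toy case, two predictions that are valid keywords count as errors when only the authors' keywords are available, so precision is understated for the same system output.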
In this paper, we introduce Noor-Vajeh, a Persian keyword extraction dataset of about 1400 scientific papers. We asked experts to extract potential keywords in addition to the authors’ keywords, so as to complete the keyword set for each article. The resulting dataset is a valuable resource for ongoing research into Persian keyword extraction. To assess its suitability as a benchmark, we tested several unsupervised keyword extraction methods. We chose these methods because, compared to supervised methods, they take less time to execute and require little or no manual tuning. Moreover, they can extract keywords with a high degree of accuracy and generalize well across articles. Among the many categories of unsupervised methods, graph-based methods have been applied regularly in different projects, so we describe and use some of the best-known ones, such as TextRank, SingleRank, and PositionRank. These algorithms identify important words and phrases in a text as well as the relationships between them, which provides insight into the overall structure and meaning of the text. This makes them useful in a variety of tasks, such as machine translation, text summarization, and sentiment analysis. Graph-based methods are also highly versatile and can be adapted to different datasets and tasks, which makes them well suited to benchmarking. The results obtained with these methods are consistent with the comparisons between them reported in other papers.
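The general idea behind graph-based ranking can be sketched as follows. This is a minimal, simplified TextRank-style example and not the implementation evaluated in the paper: it assumes the text is already tokenized and filtered, links words that co-occur within a small window, and scores them with an unweighted PageRank iteration. The function name, parameters, and the toy token list are all hypothetical.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=5):
    """Rank words by PageRank on an undirected word co-occurrence graph."""
    # Connect each word to the words that follow it within `window` tokens.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return []

    # Power iteration: each word shares its score equally among its neighbors.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for word in neighbors:
            rank = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            new_scores[word] = (1 - damping) + damping * rank
        scores = new_scores

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


# Hypothetical toy input (already tokenized and filtered).
example_tokens = ["graph", "ranking", "graph", "keyword", "graph", "extraction"]
top = textrank_keywords(example_tokens, top_k=3)
print(top)
```

The sketch omits the part-of-speech filtering and the merging of adjacent words into phrases that TextRank normally applies. SingleRank differs mainly in weighting edges by co-occurrence counts, and PositionRank in biasing the random walk toward words that appear early in the document.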