Written by Ana Canteli on 3 April 2019
Automatic summarization is the process by which software produces a shorter text that condenses the content of a document. Solutions capable of multi-document summarization take variables such as length, style and syntax into account.
Keyword extraction belongs to the extractive branch of text summarization, a field with two main approaches: extraction and abstraction. Extractive methods select a set of words (keyword extraction) or sentences (sentence extraction) from the original text to create the summary (single-document summarization). Abstractive methods construct an internal semantic representation and then use natural language generation techniques to create a summary as close as possible to what a human would write. In this article we focus on the extractive approach, which is widely used today; search engines are just one example.
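To make the extractive idea concrete, here is a minimal sketch in Python. It assumes the simplest possible scoring, plain word frequency, which is only one of many options and is not the method of any particular product. The stop-word list and the function names are illustrative assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger ones).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
              "for", "on", "it", "that", "this", "with", "as", "by"}

def tokenize(text):
    """Lowercase the text and return its words, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def extract_keywords(text, top_n=3):
    """Keyword extraction: the top_n most frequent non-stop words."""
    return [word for word, _ in Counter(tokenize(text)).most_common(top_n)]

def extract_sentences(text, top_n=1):
    """Sentence extraction: keep the top_n sentences whose words are most
    frequent in the whole text, returned in their original order."""
    freq = Counter(tokenize(text))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in tokenize(sentences[i])),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:top_n])]

if __name__ == "__main__":
    sample = ("Keywords describe documents. "
              "Keywords help search engines index documents. "
              "Cats sleep.")
    print(extract_keywords(sample, top_n=2))
    print(extract_sentences(sample, top_n=1))
```

Summing raw frequencies favors long sentences, so a real extractor would normally normalize by sentence length or use a more robust weighting.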
Keywords or key phrases are widely used to manage large digital libraries. They describe the content of files and provide useful semantic metadata for many purposes. In academic content, authors manually include a selection of keywords that represent the article, which helps with information retrieval. Identifying the relevant words, and where they appear in the text, is essential for indexing the content, guiding users as they search, and improving their experience in both search and information retrieval. This task is called keyword indexing. However, most texts lack this information, so automatic keyword extraction has become essential in a world where the volume of information and documentation grows exponentially.
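Keyword indexing can be pictured as a lookup table from each keyword to the documents tagged with it. The following Python sketch assumes single-word keywords and uses hypothetical file names; it illustrates the idea and does not describe how any specific library or search engine stores its index.

```python
from collections import defaultdict

def build_keyword_index(documents):
    """Map each keyword to the sorted list of document ids tagged with it.

    `documents` is a dict of {document id: list of keywords}.
    """
    index = defaultdict(set)
    for doc_id, keywords in documents.items():
        for keyword in keywords:
            index[keyword.lower()].add(doc_id)
    return {keyword: sorted(ids) for keyword, ids in index.items()}

def search(index, query):
    """Return the ids of documents tagged with every word in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

if __name__ == "__main__":
    # Hypothetical documents and keywords, for illustration only.
    library = {
        "thesis.pdf": ["summarization", "keywords"],
        "manual.pdf": ["keywords", "indexing"],
    }
    index = build_keyword_index(library)
    print(search(index, "keywords"))
    print(search(index, "keywords indexing"))
```

Because the lookup goes through the keywords rather than the full text, a search only touches documents that were tagged with the query terms.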
Internet users rely on search engines such as Google or Bing every day, probably without realizing that each search consults information that has already been analyzed and identified.
Search engines use powerful machine learning algorithms that apply data mining to large volumes of data (big data). These algorithms identify, filter and evaluate which keywords are relevant for each type of search, which gives an idea of what the content is about and in turn makes it easier to reach.
In short, the process by which search engines, used by millions of people every day, establish the subject of a web page in the form of keywords and phrases is a critical part of indexing, and it is what later helps us locate the information through those search engines.
Correct indexing makes it possible to identify and locate information immediately, fulfilling the two main objectives of the process:
For organizations, organizing, classifying and retrieving information internally is a significant investment in human resources, time and money. Keyword and sentence extraction are therefore part of the solution for better information management in companies.
The OpenKM document management system (DMS) provides an environment in which data and information management is transparently incorporated into business processes. When we add a document to the DMS, the system automatically submits the file to a text extraction process. The software, which includes the KEA (Keyphrase Extraction Algorithm) automatic summarization service through its REST API, can identify and extract significant keywords from the text. In addition, this service lets us choose and apply the keyword extraction model that best suits our needs.
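As background on what a keyword extraction model can weigh, the published KEA algorithm describes candidate phrases with two features: TF×IDF and the position of the first occurrence in the document. The Python sketch below computes simplified versions of both. It is not OpenKM's REST API or KEA's actual code: the function names, the add-one smoothing in the IDF term and the sample corpus are assumptions made for illustration.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """TF x IDF: how frequent the term is in this document, discounted by
    how many documents in the corpus contain it."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    containing = sum(1 for tokens in corpus if term in tokens)
    # Add-one smoothing is a choice made for this sketch, not KEA's exact formula.
    idf = math.log((1 + len(corpus)) / (1 + containing))
    return tf * idf

def first_occurrence(term, doc_tokens):
    """Fraction of the document that precedes the term's first appearance
    (0.0 means the term is the very first word)."""
    return doc_tokens.index(term) / len(doc_tokens)

if __name__ == "__main__":
    # Hypothetical, already tokenized corpus.
    corpus = [
        ["keyword", "extraction", "finds", "keyword"],
        ["document", "management", "system"],
        ["document", "search"],
    ]
    doc = corpus[0]
    for term in ("keyword", "extraction"):
        print(term, round(tf_idf(term, doc, corpus), 3), first_occurrence(term, doc))
```

In KEA these feature values feed a classifier trained on documents whose keyphrases were assigned by hand, which is why the choice of model matters: a model trained on one kind of content may rank candidates poorly on another.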
The automatic extraction of keywords can be used in various stages of document management: