Zoltan Gal, Erzsébet Tóth
Optimizing Text Clustering Efficiency through Flexible Latent Dirichlet Allocation Method: Exploring the Impact of Data Features and Threshold Modification
A parallel corpus comprising Croatian EU legislative documents automatically translated into English spans 28 years and is enriched with metadata, including creation year and hierarchical classifier tags denoting descriptors, document types, and fields. However, nearly two-thirds of the approximately 1.5 thousand texts lack complete metadata, necessitating labor intensive manual efforts that pose challenges for human administration. This incompleteness issue can be observed in the case of official legal sites functioning as regular service provisioning databases. In response, this paper introduces an artificial cognitive and multilabel classification approach to expedite the tagging process with only a fraction of the manual effort. Leveraging the Latent Dirichlet Allocation (LDA) algorithm, our method assigns field values or tags to incompletely labeled documents. We implement a Flexible LDA variant, incorporating the influence of topics close to the most probable topic, regulated by a relative probability threshold (RPT). We evaluate the LDA prediction's dependence on document prefiltering and RPT values. Furthermore, we investigate the dependence of quantitative linguistic properties on the type and speciality of pre-processing tasks. Our algorithm, built on error-correcting optimizing codes, succesfully predicts a mixture of topic probabilities for these legal texts. This prediction is achieved by calculating the Hamming distance of binary feature vectors created using the legal fields of the EUROVOC multilingual thesaurus.
Reference:
DOI: 10.36244/ICJ.2024.5.7
Please cite this paper the following way:
Zoltan Gal, Erzsébet Tóth, "Optimizing Text Clustering Efficiency through Flexible Latent Dirichlet Allocation Method: Exploring the Impact of Data Features and Threshold Modification", Infocommunications Journal, Joint Special Issue on Cognitive Infocommunications and Cognitive Aspects of Virtual Reality, 2024, pp. 58-66, https://doi.org/10.36244/ICJ.2024.5.7