Vremenska vizualizacija velikih zbirki tekstova zasnovana na analizi korespondencije

Šilić, Artur

No public access

doctoral thesis

Zagreb: University of Zagreb, Faculty of Electrical Engineering and Computing, 2014. urn:nbn:hr:168:667490

University of Zagreb
Faculty of Electrical Engineering and Computing
Department of Electronics, Microelectronics, Computer and Intelligent Systems

Cite this document

APA 6th Edition

Šilić, A. (2014). Vremenska vizualizacija velikih zbirki tekstova zasnovana na analizi korespondencije (Doctoral thesis). Zagreb: University of Zagreb, Faculty of Electrical Engineering and Computing. Retrieved from https://urn.nsk.hr/urn:nbn:hr:168:667490

MLA 8th Edition

Šilić, Artur. "Vremenska vizualizacija velikih zbirki tekstova zasnovana na analizi korespondencije." Doctoral thesis, University of Zagreb, Faculty of Electrical Engineering and Computing, 2014. https://urn.nsk.hr/urn:nbn:hr:168:667490

Harvard

Šilić, A. (2014). 'Vremenska vizualizacija velikih zbirki tekstova zasnovana na analizi korespondencije', Doctoral thesis, University of Zagreb, Faculty of Electrical Engineering and Computing, accessed 19 April 2024, https://urn.nsk.hr/urn:nbn:hr:168:667490

Vancouver

Šilić A. Vremenska vizualizacija velikih zbirki tekstova zasnovana na analizi korespondencije [Doctoral thesis]. Zagreb: University of Zagreb, Faculty of Electrical Engineering and Computing; 2014 [cited 2024 April 19] Available at: https://urn.nsk.hr/urn:nbn:hr:168:667490

IEEE

A. Šilić, "Vremenska vizualizacija velikih zbirki tekstova zasnovana na analizi korespondencije", Doctoral thesis, University of Zagreb, Faculty of Electrical Engineering and Computing, Zagreb, 2014. Available at: https://urn.nsk.hr/urn:nbn:hr:168:667490

Cite this item: https://urn.nsk.hr/urn:nbn:hr:168:667490

Metadata

Title	Vremenska vizualizacija velikih zbirki tekstova zasnovana na analizi korespondencije
Title (english)	Temporal visualization of large text collections based on correspondence analysis
Author	Artur Šilić
Mentor	Bojana Dalbelo-Bašić (mentor)
Committee member	Bojana Dalbelo-Bašić (committee member)
Granter	University of Zagreb Faculty of Electrical Engineering and Computing (Department of Electronics, Microelectronics, Computer and Intelligent Systems) Zagreb
Defense date and country	2014, Croatia
Scientific / art field, discipline and subdiscipline	TECHNICAL SCIENCES Computing Data Processing
Universal decimal classification (UDC)	004 - Computer science and technology. Computing. Data processing
Abstract	Text visualization is one of the machine-processing approaches that help people analyze large collections. The text visualization research carried out in this dissertation is motivated by the facts that text collections often have a temporal dimension and that they span long periods of time. In this dissertation, CatViz, a new visualization method based on correspondence analysis and aimed at displaying temporal changes in the content of a text collection, was designed and investigated. The CatViz method is a fusion of the semantic space and time axis approaches, since it exhibits properties of both. To apply the CatViz method to text collections, text representation features were constructed based on named entity recognition, topic modeling, and clustering. A highly efficient visualization system, CatViz, was developed to investigate the capabilities of the CatViz method and to conduct an empirical evaluation. A user-oriented visualization evaluation methodology was designed and used to successfully evaluate the CatViz visualization method. The usefulness of the method was demonstrated in the analysis of large collections of news texts. To illustrate the capabilities of CatViz visualization, three case studies are presented in this dissertation. The dissertation contains a detailed illustrated survey of work on text visualization, with an emphasis on approaches and drawing methods. This research advances the field of text visualization in a way that will enable individuals to discover knowledge in large collections efficiently and objectively. It is believed that the CatViz method will enrich historical research of text archives, media research of contemporary sources, and knowledge discovery in all other text collections.
Abstract (english)	The Information Age emerged due to exceptionally efficient digital storage and exchange of texts. The quantity of texts published daily in all areas of human society surpasses an individual's capacity to consume them in a traditional way. Visualization, a computer-based approach, helps reduce this gap by enabling people to discover knowledge in very large text collections. The research on text visualization presented in this dissertation is motivated by the fact that text information often includes a temporal dimension, and by the fact that today's digitally available collections contain documents spanning long periods of time. This makes it possible to perform analyses that aim to discover temporal changes and constants in text content. During this research, CatViz, a novel visualization method, was designed and investigated. The CatViz method is based on correspondence analysis and can be used to visualize temporal changes in the content of a text collection. In order to use the CatViz method on text collections, text representation features were constructed using natural language processing methods. Since the CatViz method displays properties of both the semantic space and the term trend approach to text visualization, it is considered a fusion of these two approaches. To illustrate the CatViz method, three case studies are presented in this dissertation. A very efficient visualization system was developed to investigate the capabilities of the CatViz method and to conduct an empirical evaluation. A user-oriented evaluation methodology was designed and used to evaluate the CatViz method. The usefulness of the CatViz method was shown on tasks of large news text collection analysis. An examination of related work makes it clear that creating successful text visualization methods or systems requires knowledge from many research fields.
First, texts are represented with features by means of information extraction techniques developed within the fields of natural language processing and computational linguistics. To draw a display, methods of multivariate statistics and computer engineering are used. Finally, good visualizations exhibit efficient interfaces, which are enhanced within the fields of human-computer interaction, cognitive and perceptive psychology, design, and aesthetics. The final result of a visualization is an increase in users' knowledge, so real users participate in empirical evaluations during which subjectivity has to be controlled. For that reason, evaluation approaches are drawn from the social sciences. It is desirable to evaluate visualizations on real data, with real users solving real analytical tasks. The corpus of related work shows a rising trend in research on temporal text visualization methods. In the most recent works, one aspiration stands out: researchers aim to simultaneously display many text aspects such as topics, names, events, and time. The CatViz method, an extension of correspondence analysis, enables an analysis of any multivariate data with a temporal dimension. This method is scalable and enables efficient visualization of very large data sets. Besides the calculation and interpretation of CatViz plots, procedures for applying the method to text analysis tasks are explained. In this dissertation, text features based on named entity recognition, topic modeling, and clustering are proposed and explored. Feature construction is motivated by the concept of complete reporting, which seeks to answer the basic questions: Who?, Where?, When?, What?, Why?, and How? Also, to improve the robustness of the CatViz method, a smoothing option during the calculation of CatViz was investigated.
In order to show the possibilities of the CatViz method, this dissertation presents case studies which include examples of plot interpretation, collection exploration, data restriction, source comparison, seasonality visualization, and text feature choice. It is shown that, when using the CatViz method, clustering methods can be used to construct representations which show important events, as well as to emphasize the features with constant temporal distributions. Furthermore, the case studies reveal an interesting point: strong seasonality in content can clearly be seen on CatViz plots. An example of comparing two large non-parallel corpora written in different languages that describe the same events confirms the robustness of the CatViz method. During this research, the CatViz System was developed to enable the evaluation of the CatViz method and the proposed text representation features. The CatViz System implements two visualization methods, intuitive text selection, easy parameter setting, display of features' temporal distributions, and reading access to texts in an advanced display. Two natural language processing tasks were solved in order to use the defined text features with the CatViz visualization. First, a rule-based named entity linking module for English was developed. It operates on rules for matching different name forms, amended by the frequencies of those names in the analyzed corpus. Second, a methodology and a program for manual labeling of topics were developed. The description of the architecture and functionality of the CatViz System illustrates the complexity of a development path starting from a theoretical visualization method and finishing in a production-ready visualization system. A few important conclusions are drawn. Firstly, visualization systems must have a good interface with intuitive interaction.
Secondly, due to the practically unlimited size of available data, computational complexity and the selection of data structures are very important considerations during initial method choice and system design. Thirdly, while developing a visualization system, communication with the end users is critical since they give valuable advice and objective judgement on the advantages and drawbacks of a method or a system. Fourthly, the appropriateness of a client-server architecture for visualization systems is confirmed. Two user studies classified as laboratory experiments with quantitative and qualitative methodologies were performed using the CatViz System. These studies confirm the usefulness of the CatViz method paired with the proposed text features. The first study shows that users can use the CatViz method to discover and interpret important events in very large news text collections. The second study involved working with real users, on real tasks, using real data. This comparative evaluation shows a strong tendency of the CatViz method to outperform a baseline visualization (the temporal frequency plot) on complex analytical tasks with free-answer questions. For this study, an evaluation methodology which includes manual expert assessment of answers on the criteria of fact coverage, quality of inference, and quality of expression was developed and employed. The users of the CatViz System express satisfaction by giving high marks to the attributes of usefulness, intuitiveness, and enabling of knowledge discovery. The CatViz method is seen as adequate for displaying relevant names and for solving tasks where the temporal dimension is important. Among the many useful traits of the CatViz System, the users single out the use of topic modeling for text representation. This research advances the field of text visualization by enabling individuals to efficiently and objectively discover knowledge in large collections.
The importance of the CatViz method lies in enabling both high-level overviews and detailed inspection, giving users the capability to explore millions of texts at a time and bringing us closer to the objectivization of history and contemporary affairs. It is believed that the CatViz method will enrich historical research of text archives, media research of contemporary sources, as well as knowledge discovery from all other text collections.
Language	Croatian
URN:NBN	urn:nbn:hr:168:667490
Study programme	Title: Electrical Engineering and Computing Study programme type: university Study level: postgraduate Academic / professional title: Doktor znanosti elektrotehnike i računarstva (Doktor znanosti elektrotehnike i računarstva)
Type of resource	Text
Extent	153 pp.; 30 cm
File origin	Born digital
Access conditions	Closed access
Repository	FER Repository
Created on	2019-04-19 12:22:13
