Text information extraction and retrieval based on event graphs (Crpljenje i pretraživanje tekstnih informacija temeljem grafova događaja)

Glavaš, Goran

Dissertation

Zagreb: University of Zagreb, Faculty of Electrical Engineering and Computing, 2014. urn:nbn:hr:168:769208

University of Zagreb
Faculty of Electrical Engineering and Computing
Department of Electronics, Microelectronics, Computer and Intelligent Systems

To cite this work, use this URL: https://urn.nsk.hr/urn:nbn:hr:168:769208

Work details

Title	Crpljenje i pretraživanje tekstnih informacija temeljem grafova događaja
Title (English)	Text information extraction and retrieval based on event graphs
Author	Goran Glavaš
Mentor	Jan Šnajder (mentor)
Committee member	Jan Šnajder (committee member)
Degree-granting institution	University of Zagreb, Faculty of Electrical Engineering and Computing (Department of Electronics, Microelectronics, Computer and Intelligent Systems), Zagreb
Date and country of defence	2014, Croatia
Scientific / artistic area, field and branch	TECHNICAL SCIENCES, Computing, Process Computing
Universal Decimal Classification (UDC)	004 - Computer science and technology. Computing. Data processing
Abstract	Textual sources that describe real-world events (news articles) are increasingly numerous, and users' event-related information needs are increasingly pronounced. Methods for automated extraction and retrieval of information about events are therefore increasingly needed. This dissertation presents the event graph model, a structure that contains all the relevant informational aspects of real-world events. The vertices of an event graph represent individual event mentions in text, and the edges represent temporal relations between them. A fully automated procedure for constructing event graphs from text has been developed; it combines information extraction models based on supervised machine learning with rule-based models. All models involved in event graph construction were subjected to an exhaustive intrinsic experimental evaluation, and two new measures for evaluating the overall quality of automatically constructed event graphs are presented. A model is presented that compares documents by comparing their event graphs using graph kernels. The effectiveness of representing documents as event graphs and comparing them with graph kernels was established through extrinsic evaluation on several information retrieval tasks. The usefulness of extracting and structuring event information from text was further confirmed by evaluation on multi-document summarization and news article simplification tasks. The extraction and retrieval approaches described in this dissertation focus on English but, provided that certain language resources and tools exist, they can be adapted to other languages.
Abstract (English)	As our society becomes increasingly digital, there is a growing number of textual information sources (e.g., breaking news, investigative stories, police reports, tweets, historical texts, electronic health records) that are filled with descriptions of events. The ability to automatically extract and analyse events from text is now more important than ever, with applications that range from security and intelligence to journalism, media analysis, and historical research. Efficiently satisfying event-oriented user information needs requires precise extraction of event-related information, which is a very demanding task considering the complexity, vagueness, and ambiguity of natural language. In text, real-world events are represented by so-called linguistic events, or event mentions. Due to the ambiguity and vagueness of natural language, the mapping of real-world events and their relations (temporal, causal, etc.) to their linguistic counterparts introduces a loss of information. Event mentions are structured: they consist of event anchors, which are the words bearing the core meaning of events, and event arguments, which are the phrases that denote the protagonists and circumstances (e.g., time and location) of events. Documents describing real-world events thus give rise to a structure in which there are relations between different event mentions as well as relations between anchors and arguments within individual event mentions. In this dissertation I have proposed the event graph, a structured representation of event-oriented documents containing all informationally relevant aspects of real-world events. Vertices of event graphs denote individual event mentions extracted from text, whereas edges may denote various semantic relations that hold between event mentions.
Although, model-wise, event graphs allow for any semantic relation between events, temporal relations have been considered in particular because of the inherent temporal aspect of events. Based on the event graph model, a fully automated procedure for constructing event graphs has been developed. Automated construction of event graphs includes four different information extraction models: (1) a supervised model for extracting event anchors, (2) a rule-based model for extracting event arguments, (3) a supervised model for extracting temporal relations between events, and (4) a supervised model for resolving coreference of event mentions. The models for extracting event anchors and temporal relations between event mentions are linear regression models based on a rich set of lexical, syntactic, and semantic features. The argument extraction model is based on a set of syntactic extraction patterns and semantic disambiguation rules. The event coreference resolution model is a support vector machine model with a set of numeric features indicating the similarities between anchors and between arguments with matching roles in two event mentions. Each of the four models was thoroughly intrinsically evaluated using standard evaluation metrics: precision, recall, and F-score. Two novel metrics for evaluating the overall quality of the automated construction process have been proposed and empirically validated, and the overall quality of automatically constructed event graphs has been measured using these metrics. In order to develop and evaluate the information extraction models included in the construction of event graphs, a large corpus, named EvExtra, manually annotated with factual event mentions, has been compiled. EvExtra is currently the largest corpus manually annotated with event-oriented information. It is approximately three times larger than the TimeBank corpus, which has typically been used in event extraction tasks.
Comparison of documents describing real-world events is performed by comparing their corresponding event graphs. An innovative method for efficient comparison of event graphs, based on semantic extensions of graph kernels, has been designed and implemented. Two different graph kernels, the product graph kernel and the weighted decomposition kernel, have been semantically extended to account for event-specific semantics. Efficient information retrieval models based on the construction and comparison of event graphs have been proposed and evaluated on several information retrieval tasks. Experimental results show that the retrieval models based on event graphs and graph kernels outperform traditional retrieval models that represent documents in an unstructured fashion (i.e., as bags of words), such as vector space models, language models, and probabilistic models. The usefulness of structured event-centered document representation has been additionally verified on two different natural language processing tasks: multi-document summarization and text simplification. A novel algorithm for multi-document summarization, which exploits the event-oriented information and temporal structure contained in event graphs, has been developed. This event-based multi-document summarization algorithm outperforms competitive methods on standard summarization datasets. The algorithm for automated simplification of news stories eliminates all content not relating to event mentions and transforms individual event mentions into separate sentences in the simplified text. Human evaluation shows that texts produced with this simplification method are highly grammatical and contain only the most relevant information from the original text. The research covered in the dissertation focused on texts written in English. Although the event graph formalism itself is language independent, some parts of the models used for automated construction of event graphs are language dependent.
Adapting the graph construction pipeline to another language is possible, although not easy. One of the main directions of future work is adapting the automated graph construction pipeline to Croatian. This dissertation lays the foundation for structured event-based document analysis and uncovers many interesting directions for future research. Event graphs can be extended conceptually by considering relations between event mentions other than temporal relations (e.g., causality, subordination). Event graphs could also be applied in other natural language processing tasks (e.g., question answering) and other text domains (e.g., biographies). Finally, I envisage a formal framework, based on event graphs, which would enable the modeling of events in a continuous event space that spans from linguistic events at the lowest level to topics at the highest level. Such an event graph-based framework would enable a uniform and elegant treatment of both events and topics for the purpose of event-based document analysis.
Language	Croatian
URN:NBN	urn:nbn:hr:168:769208
Study programme	Name: Electrical Engineering and Computing; Type of study: university; Level of study: postgraduate doctoral; Academic / professional title: Doctor of Science in Electrical Engineering and Computing (dr.sc.)
Resource type	Text
Extent	240 pp.; 30 cm
File creation method	Born digital
Access rights	Closed access
Repository	FER Repository (Repozitorij FER-a)
Date and time of deposit	2019-04-19 12:29:54
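
Illustrative sketch (not part of the repository record). The abstract describes event graphs whose vertices are event mentions and whose edges are temporal relations, and says documents are compared by applying graph kernels to these graphs. The Python below is a minimal sketch of that idea under loudly stated assumptions: the class names, the `BEFORE` relation label, the exact-match anchor test, and the `max_len` and `decay` parameters are all hypothetical, and the kernel is a plain labelled walk count over the direct product graph, not the semantically extended kernels developed in the dissertation.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass(frozen=True)
class EventMention:
    """Vertex: an event anchor plus role-labelled arguments (hypothetical schema)."""
    anchor: str
    arguments: tuple = ()  # e.g. (("agent", "police"), ("location", "Zagreb"))


@dataclass
class EventGraph:
    """Vertices are event mentions; directed edges carry a temporal relation label."""
    mentions: list = field(default_factory=list)
    edges: dict = field(default_factory=dict)  # (src, dst) -> relation label

    def add_mention(self, mention):
        self.mentions.append(mention)
        return len(self.mentions) - 1

    def add_relation(self, src, dst, label):
        self.edges[(src, dst)] = label


def chain(anchors, relation="BEFORE"):
    """Helper for the demo: link consecutive anchors with one relation label."""
    graph = EventGraph()
    ids = [graph.add_mention(EventMention(a)) for a in anchors]
    for src, dst in zip(ids, ids[1:]):
        graph.add_relation(src, dst, relation)
    return graph


def mentions_match(a, b):
    # Assumption: two mentions are compatible if their anchors match exactly,
    # ignoring case. A semantic kernel would use a softer similarity here.
    return a.anchor.lower() == b.anchor.lower()


def product_graph_kernel(g1, g2, max_len=3, decay=0.5):
    """Decay-weighted count of common labelled walks of length 1..max_len."""
    # Product-graph vertices are pairs of compatible mentions.
    pairs = [(i, j)
             for i, j in product(range(len(g1.mentions)), range(len(g2.mentions)))
             if mentions_match(g1.mentions[i], g2.mentions[j])]
    index = {pair: k for k, pair in enumerate(pairs)}
    n = len(pairs)

    # Product-graph edges: both graphs have an edge with the same label.
    adj = [[0] * n for _ in range(n)]
    for (i1, i2), label1 in g1.edges.items():
        for (j1, j2), label2 in g2.edges.items():
            if label1 == label2 and (i1, j1) in index and (i2, j2) in index:
                adj[index[(i1, j1)]][index[(i2, j2)]] = 1

    walks = [1.0] * n  # walks of length 0 starting at each product vertex
    score, weight = 0.0, 1.0
    for _ in range(max_len):
        walks = [sum(adj[r][c] * walks[c] for c in range(n)) for r in range(n)]
        weight *= decay
        score += weight * sum(walks)
    return score


doc_a = chain(["arrested", "charged"])
doc_b = chain(["arrested", "charged", "released"])
print(product_graph_kernel(doc_a, doc_b))
```

Under these assumptions, two documents score higher the more identically labelled temporal paths they share between matching event mentions, and documents with no matching anchors score zero.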