Abstract | Prozodija je važan aspekt govora jer poboljšava informativnost izgovorenog. Jedan segment prozodije uključuje naglašavanje riječi kojim se ističe važnost jedne riječi u kontekstu sadržaja onoga što je izgovoreno, čime se može utjecati i na semantiku tog sadržaja. U tekstu, međutim, taj je aspekt izgubljen, pa je time izgubljena i ta dodatna informativnost napisanog sadržaja. Cilj ovog rada bio je istražiti mogućnosti automatskog vraćanja informacija o naglašenim riječima u tekst koji je spremljen kao podnatpis ili transkript. To se željelo postići bez upotrebe potpuno automatskog sustava za prepoznavanje govora. Naglašenost riječi analizira se kroz tri dimenzije: pojačani intenzitet, povišeni ton, produljeni (usporeni) izgovor. Vraćanje ovih informacija u tekst obogaćuje njegovu informativnost, dok istovremeno, s tehničke strane, takav tekst zahtijeva puno manje memorijskog kapaciteta od zvuka, pa u tom obliku može biti pogodan tamo gdje postoji potreba za spremanjem velikih količina podataka, kao što je arhiviranje ili računalno pretraživanje. Isto tako, ovako obogaćeni tekst može biti koristan za osobe s oštećenim sluhom ili gluhonijeme osobe jer bi se njima na ovaj način olakšalo razumijevanje sadržaja time što bi im se približio izvorni oblik onoga što je i kako je bilo izgovoreno. |
Abstract (english) | Prosody is an important aspect of speech because it complements the meaning of spoken communication. One element of prosody is word emphasis, which marks the importance of a word in the context of what was said and can thereby affect the semantics of the spoken content. In written text, however, that aspect is lost, and this additional information is lost with it. The goal of this work was to develop a method of restoring the prosodic component of speech to text that is available as subtitles or a transcript. An additional goal was to achieve this without developing a full-scale speech recognition system. Word emphasis is examined through three dimensions: increased intensity, increased pitch, and extended duration of particular words. Restoring these aspects to the text enhances its informational content, while, from a technical perspective, such text requires much less storage space than sound, so this format can be useful in applications that store large amounts of data, such as archiving or information retrieval. Such enhanced text can also be useful for people with hearing disabilities, because it would give them a better understanding of how something was uttered. This dissertation is organized into several parts. The first part is the introduction. The second part describes basic speech acoustics: physical properties of sound, frequency, tone, intensity, F0, formants, and the acoustic properties of some phonemes, with graphical representations of their spectra and other acoustic properties. This part also describes some acoustic properties of emphasized words that set them apart from other, non-emphasized words. Part 3 describes some sound analysis techniques: spectrum, spectrogram, oscillogram, spectral analysis, and LTAS, together with some methods of preparing sound for analysis, such as speech annotation. 
This part also describes some capabilities of the Praat program used in this research, together with some Python libraries. Part 4 covers the basics of machine learning and neural networks, which are used in this research for phoneme classification. It gives a basic introduction to machine learning and neural networks and describes their advantages over some other computational models in relation to sound analysis. It then describes one way of using such a neural network in this work. Part 5 contains a detailed description of a method of speech analysis whose goal is to detect emphasized words. The method consists of several activities divided into the following steps: Speech annotation, where the sound segment of each phoneme is isolated by hand. This is necessary for neural network training. It is a tedious process, because a recording of just a few minutes contains hundreds of phonemes that need to be carefully annotated. Determining the beginning and the end of a phoneme is also not always simple, because a phoneme can be uttered only partially, and phonemes can follow one another in a way that makes their boundaries difficult to determine. Creation of a Praat script that iterates over the segmented speech and performs spectral analysis of each phoneme. The result consists of the LTAS values for each phoneme together with the letter categorizing the phoneme. These values are later used to train the neural network on the speech of several randomly chosen speakers.
After these data preparation steps, the next step is training the neural network. This process consists of several steps: 1. Elimination of variations in intensity. For neural network training only the spectral shape is needed, so variations in intensity only add variation that the neural network would have to learn. To speed up training, variations in intensity need to be eliminated as much as possible. One way to do this, as used in this research, is to increase or decrease all LTAS values by the amount necessary so that the largest value does not exceed a given intensity, while keeping all other values in the same ratio to each other as before. 2. Since the LTAS values do not lie in the 0..1 interval, they need to be scaled, because the neural network works only with values in this interval. 3. The values are then organized into a data structure that contains the LTAS values and the category of the phoneme they represent. After that, neural network training is performed. The goal of this training is to later use the neural network to classify phonemes from new recordings not used for training. 4. The result of the previous step is a neural network trained for phoneme classification. The next phase is the detection of emphasized words. Before that, however, the transcripts were extracted from the media file to obtain the times at which they appear in the recording. This is important for later determining which words the classified phonemes belong to. For example, if the phonemes „d..ava“ have been recognized in a speech segment and the transcript text for that segment contains the letters „država“ (Croatian for "state"), then it is likely that these phonemes belong to this word. The analysis of pitch and intensity then determines whether the word was emphasized. After neural network training, the detection of emphasized words consists of the following steps: 1. 
Phoneme classification for a speaker not used for neural network training. Phoneme classification is done in two steps: first, the recording is partitioned into segments of 10 ms and the LTAS is calculated for each segment. Then, in the second step, the positions of glottal pulses (as calculated by Praat) are marked in the recording, and for each of those positions a segment extending 5 ms before and after it is selected and its LTAS is calculated. This second step helps to avoid skipping over some important sounds. 2. The result of the previous step is a sequence of phonemes produced by the classification of those 10 ms sound segments. This phoneme sequence contains each letter (phoneme) and the time at which it appears in the recording. Some of those phonemes will be classified correctly, but some will not. For example, instead of classifying a phoneme as m, the neural network might incorrectly classify it as v or some other phoneme. To determine which words were emphasized, it is necessary to determine word boundaries. Clearly, the more phonemes are classified correctly, the easier it is to find the word to which they belong. However, since many phonemes will be classified incorrectly, the text of the transcript needs to be matched against the phonemes produced by the neural network. This is done with an alignment algorithm that tries to align the sequence of phonemes with the letters of the transcript text. 3. The result of the alignment provides approximate information about where each word from the transcript begins and ends in the recording. The sound segment of each word is then analysed in terms of F0, intensity and duration, which determines whether the word was emphasized. 
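The alignment step can be illustrated with a minimal Python sketch. The abstract does not name the alignment algorithm, so a Needleman-Wunsch style global alignment is used here purely as an assumption. The function name `align`, the scoring parameters, and the input format (a list of phoneme letters with times, with consecutive repeats already merged) are hypothetical and not taken from the dissertation.

```python
def align(phonemes, text, match=2, mismatch=-1, gap=-1):
    """Globally align classified phonemes with transcript letters.

    phonemes: list of (letter, time_in_seconds) from the classifier
              (assumed lowercase, consecutive duplicates merged).
    text: transcript string; words are separated by whitespace.
    Returns a list of (word, start_time, end_time); the times are None
    when no phoneme was paired with any letter of that word.
    """
    # Flatten the transcript into letters, remembering each letter's word.
    words = text.split()
    letters = [(ch, wi) for wi, w in enumerate(words) for ch in w.lower()]
    n, m = len(phonemes), len(letters)

    # Fill the dynamic-programming score table.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if phonemes[i - 1][0] == letters[j - 1][0] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back, collecting the times of phonemes paired with each word.
    times = [[] for _ in words]
    i, j = n, m
    while i > 0 and j > 0:
        s = match if phonemes[i - 1][0] == letters[j - 1][0] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            times[letters[j - 1][1]].append(phonemes[i - 1][1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1  # extra phoneme with no matching letter
        else:
            j -= 1  # transcript letter with no phoneme
    return [(w, min(t) if t else None, max(t) if t else None)
            for w, t in zip(words, times)]


# Example: "ž" was misclassified as "v", yet the word is still located.
predicted = [("d", 0.10), ("r", 0.15), ("v", 0.20),
             ("a", 0.25), ("v", 0.30), ("a", 0.35)]
print(align(predicted, "država"))
```

Because misclassified phonemes are paired with letters as mismatches rather than dropped, the word boundaries remain approximate, which matches the approximate nature of the result described above.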
Most of the previously described steps assume the creation of Praat and Python scripts that automate these processes, including modules for testing and analysing the results. Figures 1 and 2 show this process. Part 6 contains the results obtained from recordings of new speakers (those whose speech was not used for neural network training). These recordings include several speakers, thereby showing how the method performs in environments different from those used for testing (regarding speech tempo, pitch, voice, speech patterns, etc.). It also shows the precision of the speech-to-text alignment. Part 7 contains the conclusion. Here, the advantages and disadvantages of this method compared with some others are discussed. Some alternatives are also described, together with some possible improvements. Additionally, this part underlines the importance of having a larger corpus of annotated speech in Croatian as a precondition for much useful future research in this area. Since the automatic recognition of phonemes in Croatian is important for many research activities in this area (such as emotion detection, speaker identification, and prosody analysis), such a corpus would be an essential tool for this research. |
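The intensity normalisation and 0..1 scaling described in the training steps of Part 5 can be sketched in Python as follows. This is one possible reading rather than the dissertation's actual code. It assumes the LTAS values are in dB, so that adding the same offset to every band preserves the spectral shape. The function name `normalise_ltas` and the limits `peak_db` and `floor_db` are hypothetical.

```python
def normalise_ltas(values, peak_db=60.0, floor_db=0.0):
    """Remove overall intensity from an LTAS and scale it into 0..1.

    values: LTAS band levels, assumed to be in dB.
    peak_db, floor_db: hypothetical limits chosen for illustration.
    """
    # Shift every band by the same offset so the largest equals peak_db.
    shift = peak_db - max(values)
    shifted = [v + shift for v in values]
    # Clip to [floor_db, peak_db] and map linearly onto [0, 1].
    span = peak_db - floor_db
    return [min(1.0, max(0.0, (v - floor_db) / span)) for v in shifted]


# A training sample pairs the scaled values with the phoneme category.
sample = (normalise_ltas([30.0, 45.0, 15.0]), "a")
print(sample)
```

Under these assumptions, a louder or quieter utterance with the same spectral shape maps to the same input vector, which is the purpose of the normalisation step.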