The study of emotions in language has received wide interest in the last decades, especially in the specific area of opinion mining, or sentiment analysis (Birjali et al., 2021; Pang & Lee, 2008). Emotions are a complex field where language, biology, psychology, society and even history (Boddice, 2018) interact (Barrett et al., 2018). In spite of the amount of recent research, there is still a lack of consensus on many levels. For example, there is not even a set of emotion categories that all or most scholars recognize. Many researchers, in fact, take a simplified view and only consider polarity (i.e., positive or negative). If we add language as a variable, and the fact that each language has different lexical items and grammatical constructions to express emotion, it is clear that a lot remains to be done in this matter.
In this paper we consider reference to emotion in Portuguese. We are concerned neither with how texts may arouse emotion in readers nor with how authors express their emotions. Our concern is to study the way words denoting emotion are used in text. We do not follow a particular theory of what an emotion is in psychological or physiological terms, and limit ourselves to the linguistic properties of emotion words. For surveys of other approaches and discussion of several theories, see the works by Maia and Santos (2018) and Santos and Maia (2018).
In this paper we want to evaluate a publicly available resource that provides emotion annotation in large Portuguese corpora (Santos, Simões & Mota, 2022). To build it, we used primarily our linguistic intuition as to whether a word denoted an emotion, and refined this process by consulting dictionaries, encyclopaedias, and corpora, producing 26 emotion groups in Portuguese, named by the most frequent members of the group.1 See Figure 1 for a quantitative overview, and translation, of those emotion groups, from the original paper.
Note that we do not claim that there are only 26 types of emotions in Portuguese. In fact, we have a group called OUTRA (‘other’), which explicitly groups emotion words for which we have not yet created a dedicated group. In addition, we have a group called AUSENCIA (‘absence’) which marks words that concern absence of emotion, like impávido (‘unmoved’) – the justification to do this is the analogy with the semantics of colour and clothing, where we have postulated a similar absence category that deals with words like incolor (‘colourless’) or nu (‘naked’). We have also, just like in these other semantic categories, created a GEN (“generic”) group, which deals with emotion in general – again, examples from the colour and clothing domains would be colorido (‘coloured’) and vestir (‘to dress’). We use this group in the emotion domain for words like sentimento (‘feeling’) or emocionado (‘moved’).
So, the emotion groups are just a quantitatively-based first approach to the categories that are recognized by Portuguese. If we noticed that a considerable set of distinct words could be assigned to a group (which would thus have a high lexical diversity), we created the group. But one may also state that what we have (so far) achieved is the identification of 23 groups for which we assigned a label, and that every individual word in the OUTRA (‘other’) category is a potential emotion group in Portuguese. We will be mindful to treat the cases of AUSENCIA (‘absence’), OUTRA (‘other’) and GEN (“generic”) groups in a special way.
We then devised a considerable number of rules, using the approach sketched by Santos and Mota (2010) for human-machine cooperation, in order to deal with the inescapable property of language that most words have more than one meaning. We exemplify in the aforementioned paper that, in the case of emotion, there are countless cases where a given word describes an emotion in some contexts and something not emotional in others, and we propose a lexical measure called the degree of emotionality of a word. We also note that it is common for words to convey more than one emotional meaning, which entails that some words are assigned to more than one group. All this has been discussed and exemplified in several papers by Mota and Santos (2015), Santos and Mota (2015), Ramos, Santos and Freitas (2020) and Santos, Simões and Mota (2022), but these facts result in additional problems for our studies on the basis of the annotated corpora. In particular, it is not obvious how to deal with words which are assigned to more than one emotion category. We will come back to this issue later.
The purpose of our studies on top of this material (the resources just described) is to evaluate it properly and independently, in order to assess its value for the community that deals with Portuguese at large. It would be extremely reassuring if one could find automatic methods that could confirm our intuitions2 and/or help us improve the groups. Although Santos, Simões and Mota (2022) have presented some data that confirm our intuition, the following questions remain: how can one evaluate it properly, and not in such general terms? How can one be sure that those corpora provide more than our subjective opinion of what an emotion is? And particularly, we want to investigate whether it is possible to independently motivate the postulated emotion groups – is there any statistical data that lend them support?
To be more concrete, let us take three examples:
Can big data-based automatic methods support the existence of the two groups FELIZ (‘happiness’) and SATISFEITO (‘satisfaction’) that were postulated, or on the other hand do the methods suggest their merging?
Can statistical processing support that concepts like mistrust and despair should belong to the same group as they are now in DESESPERO (‘despair’), or does language use show that they should be divided into two groups?
Is there any statistical evidence that the group AMOR (‘love’) should also encompass friendship, as it does now?
So, the explorations described in the remainder of the paper are different attempts to use automatic techniques to evaluate the aforementioned annotation, and to investigate whether, by taking a quantitative bird’s eye view, one can generate further knowledge on emotions in Portuguese.
It should also be said upfront that, no matter how many revision rules were designed to improve the (automatic) annotation, it is humanly impossible to review and correct the million annotated cases. So, the studies reported here will help identify specific problems and uncover places to improve the annotation. This will be clear when we analyse our findings.
Emotion annotation is an on-going process, in the sense that, at the same time we are doing these studies, other annotators are painstakingly analysing particular cases and improving the rules, as reported e.g., by Ramos (2021). So, what we present here concerns a particular time slice of the annotated corpora (October 2021), and our most relevant contribution should be the methods we use and propose.
We will be using two different approaches: co-occurrence information, and word embeddings, a vectorial representation of words based on their co-occurrence in large corpora.
2. Further information about the data we use
The main corpus used is Todos, merging all AC/DC corpora (Santos, 2014) together and removing repeated material. The corpus is described in more detail in Santos, Simões and Mota (2022). It purports to include mainly written Portuguese in several genres (newspaper, academic writing, interviews), amounting to ca. 1.5 billion tokens (as of October 2021), most of them (3/4) in Brazilian Portuguese. A small part (6.5%) is transcribed oral data (see Santos, 2016, on the different kinds of oral corpora). A subset of Todos which we thought interesting to explore too was Literateca, containing all literary text after removing repeated texts, amounting to ca. 40 million words. Literateca features mostly old texts (not contemporary Portuguese) and, contrary to Todos, mostly Portuguese from Portugal (70%) (Todos includes Literateca).
We believe it is important to document the options taken to deal with the material, because they may be essential for replication and for interpretation. We have thus used the following information from the corpora: tokenization, sentence separation, lemmatization (done by the PALAVRAS parser, see Bick, 2000, 2007, 2014) and semantic annotation for the emotion domain, as displayed in Figure 2.
For the co-occurrence approach, no further preprocessing was necessary. In order to look at the particular groups Amor and Desespero, some manual pruning was done to the lemmas that were used in our graphs: we removed (a) cases where PALAVRAS had creatively added a derived lemma, such as atagonia or paragonimíase (incorrectly parsed as derivations from agonia); (b) cases of misspelled words that might be emotion words but appear too rarely to consider listing them in the emotion group, like afectuiar or deprezador or desesperadamante; (c) cases with non-standard capitalization or hyphenation, such as aMIGO or en-ternecer. We also removed hapax legomena, and added together all cases of clitics, so that lemmas amar+ele, amar+se, amar+ele+o and amar were lumped together under the lemma amar. This shrank the lemmas belonging to the Amor group from 1162 to 87, and those of the Desespero group from 246 to 64 cases.
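The lemma clean-up just described can be illustrated with a minimal sketch. It assumes that lemma frequencies are available as a dictionary, that clitic variants use the `lemma+clitic` notation shown above, and that hapax legomena are removed after the clitic variants have been added together (the text does not state the order). The names `merge_clitics`, `prune_lemmas` and `rejected` are hypothetical, and the frequencies are invented.

```python
from collections import Counter


def merge_clitics(lemma):
    """Map 'amar+ele', 'amar+se' or 'amar+ele+o' to the bare lemma 'amar'."""
    return lemma.split("+", 1)[0]


def prune_lemmas(lemma_counts, rejected=frozenset()):
    """Drop manually rejected lemmas, merge clitic variants, remove hapaxes."""
    merged = Counter()
    for lemma, freq in lemma_counts.items():
        if lemma in rejected:  # e.g. parser artefacts or misspellings
            continue
        merged[merge_clitics(lemma)] += freq
    # Assumption: hapax legomena are dropped after merging, not before.
    return {lemma: freq for lemma, freq in merged.items() if freq > 1}


# Toy frequencies, invented for illustration only.
toy_counts = {"amar": 10, "amar+ele": 3, "amar+se": 2, "amar+ele+o": 1,
              "atagonia": 4, "afectuiar": 1}
pruned = prune_lemmas(toy_counts, rejected={"atagonia"})
print(pruned)  # {'amar': 16}
```

The manual pruning of cases (a) to (c) is represented here only by the `rejected` set; deciding what goes into it remains a human task.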
For the creation of word embeddings, we additionally removed all capitalization, and used only words with a frequency higher than 5. Four kinds of word embeddings were created:
(1) standard, using the bare corpus, without any kind of word annotation
(2) changing the words marked as emotion to emo:word
(3) changing the words marked as emotion to emo:word::group
(4) replacing the words marked as emotion with their group: emo:group
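To make this more concrete, the word empertigado (‘stiff’) from the Orgulho (‘pride’) group would, following the four schemes above, have been coded like this in each approach:
(1) empertigado
(2) emo:empertigado
(3) emo:empertigado::orgulho
(4) emo:orgulho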
This means that our word embeddings created on Todos had different sizes, listed in the same order as the four kinds above:
(1) 1,171,525 word vectors
(2) 1,177,040 word vectors
(3) 1,177,420 word vectors
(4) 1,163,962 word vectors
One should also note that clitics were not removed for the word embeddings, which means that for example the two words admirei and admirei-me count separately.
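The preprocessing for the four kinds of embeddings (lowercasing, the frequency threshold, and the recoding of emotion words) can be sketched as follows. This is only an illustration under stated assumptions: tokens are taken to be (word, group) pairs with `None` for words not annotated as emotion, the frequency threshold is applied after recoding, and the function names and variant labels (`encode`, `vocabulary`, `"emo_word"` and so on) are hypothetical.

```python
from collections import Counter


def encode(word, group, variant):
    """Return the token used for `word` under one of the four codings."""
    word = word.lower()  # all capitalization is removed
    if group is None or variant == "standard":
        return word
    group = group.lower()
    if variant == "emo_word":
        return f"emo:{word}"
    if variant == "emo_word_group":
        return f"emo:{word}::{group}"
    if variant == "emo_group":
        return f"emo:{group}"
    raise ValueError(f"unknown variant: {variant}")


def vocabulary(sentences, variant, min_freq=6):
    """Tokens with a frequency higher than 5 (assumed to be counted after recoding)."""
    counts = Counter(encode(w, g, variant) for sent in sentences for w, g in sent)
    return {token for token, freq in counts.items() if freq >= min_freq}


# Toy sentence; only the last token carries an emotion annotation.
sentence = [("Ele", None), ("estava", None), ("empertigado", "Orgulho")]
for variant in ("standard", "emo_word", "emo_word_group", "emo_group"):
    print(variant, [encode(w, g, variant) for w, g in sentence])
```

Because clitic forms are kept, a token such as admirei-me would pass through `encode` unchanged apart from lowercasing and any emotion prefix.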
For the embeddings creation, we considered using Word2Vec (Mikolov et al., 2013), FastText (Bojanowski et al., 2017) or GloVe (Pennington et al., 2014). Some authors (such as Romanov & Khusainova, 2021) claim that FastText gives better results for morphology queries, while the two others are more semantically aware. As to the choice between Word2Vec and GloVe, the first works within local context windows, while the second uses a global word-to-word co-occurrence measure. We chose Word2Vec.
To create the embeddings, we left most of Word2Vec’s default options unchanged, but we used a dimension of 300 instead of 200 and 20 training iterations instead of 5. Three hundred was chosen to be comparable to most other public word embeddings for Portuguese, and our expectation was that increasing the number of training iterations would increase quality.3
3. Creating a co-occurrence graph
One standard way to reduce large corpora to quantitative objects that are easy to manipulate, and to reduce their dimensionality, is simply to count the number of times different concepts, or words, co-occur, taking this count as a measure of relatedness or even similarity.
This was our first approach, which we applied to words annotated with emotion, in two different ways:
simply using the emotion group: any word annotated as belonging to the group Feliz (‘happiness’) would count as Feliz, so felicidade and ventura (both translatable as ‘happiness’) would both count as Feliz (‘happiness’)
or using the word itself, for particular emotion groups (so, in the Feliz group, felicidade and ventura would count as felicidade and ventura, respectively).
Since the corpora are parsed by PALAVRAS, we operationalized co-occurrence as “appearing in the same sentence”, marked by the structural attribute <s>. We believe this to be more linguistically motivated than deciding on a fixed window of N words.
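A minimal sketch of sentence-level co-occurrence counting, at the group level and at the word level, is given here. It assumes that each sentence has already been reduced to the list of its emotion-annotated tokens as (lemma, group) pairs, and that every unordered pair of distinct tokens in a sentence counts once; the function name `cooccurrences` and the toy sentences are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(sentences, level="group"):
    """Count unordered pairs of emotion annotations sharing a sentence.

    level="group" counts pairs of group labels, level="word" pairs of lemmas.
    """
    index = 1 if level == "group" else 0
    counts = Counter()
    for sentence in sentences:
        keys = [token[index] for token in sentence]
        for a, b in combinations(keys, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Toy data: two sentences, emotion-annotated tokens only.
toy = [
    [("felicidade", "Feliz"), ("amor", "Amor")],
    [("ventura", "Feliz"), ("amor", "Amor"), ("amigo", "Amor")],
]
by_group = cooccurrences(toy, level="group")
by_word = cooccurrences(toy, level="word")
print(by_group[("Amor", "Feliz")], by_group[("Amor", "Amor")])  # 3 1
print(by_word[("amor", "ventura")])  # 1
```

For the word-level graphs of a single group, one would first filter each sentence down to the tokens of that group before calling the function.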
Figures 3 and 4 show the relationship between the emotion groups in Todos and Literateca, drawn like a graph, whose vertices are the group labels, with size (diameter) proportional to their frequency, and whose edges correspond to the attested co-occurrences, drawn with thickness corresponding to the number of co-occurrences.
It should be clarified that we use a random layout for drawing the graphs, which implies that the place where the particular categories appear is not meaningful – and therefore should not be compared across graphs. Also, it is important to explain that the sizes are relative to the universes, which differ widely in quantity: Literateca has about 40 million words, while Todos (which includes Literateca) amounts to a total of 1,315 million words.
With these caveats in mind, what can these graphical depictions tell us, as an initial overview of the annotated corpora? We can conclude that literary text in Portuguese (at least the one present in Literateca), has a stronger emphasis on AMOR (‘love’) and (In)Feliz (‘(un)happiness’) than other kinds of text. And that, in general, the most invoked emotion is Desejo (‘desire’) followed by Esperanca (‘hope’) and Amor (‘love’). Initially surprised by the unexpected prominence of Desejo, we soon understood its cause: words like querer (‘want’) and desejar (‘desire’) were considered as emotion words, even when they might arguably be considered only denoting volition or intent. If this were not the case, the Desejo group would considerably shrink. In fact, by investigating the matter closer, we also realized that our co-occurrence counting procedure counts words annotated, e.g., with desejo_amor for both the Desejo and Amor groups, therefore inflating even more the (possibly dubious) contribution of the verb desejar to the Portuguese emotion realm. In any case, this illustrates that the original decision to attribute as many emotion labels as deemed appropriate can have consequences on the further processing of the material. Since we are not sure what the best alternative to deal with these cases is – and suspect that they may, in fact, indicate a desirable merge of the groups in question –, we did not compute an alternative co-occurrence matrix.
In Figures 5 to 8, we now present the two groups Amor (‘love’) and Desespero (‘despair’) for both the Literateca and the Todos material, together with the pairs with the most co-occurrences, in Tables 1 and 2.
We can thus observe several specific differences between the group Amor (‘love’) in general (in all genres present), and in literary text: it is easy to see that while romantic love is the most described in the literary texts of Literateca, amigos (‘friends’), gostar (‘to like’) and preferir (‘to prefer’) are the most common members of Amor in other genres. Interestingly, while abraçar and beijar (‘hug’ and ‘kiss’) rank high in the literature list, preferir (‘to prefer’) and desejar (‘to wish for’) are more frequent when one looks at all genres together.
In Table 2, we present again the most common co-occurrences, now for the Desespero group. Note that the quantities are much lower than in Table 1, because reference to this group is far less frequent in Portuguese, at least according to the corpora we are using (cf. Figure 1).
Although the material has far fewer instances to analyse, we can see that reference to mistrust – represented in Table 2 by the words desconfiança and desconfiar – does not occur often in literary texts, contrary to general texts. And it seems that (from the most common co-occurrences only) the words related to mistrust and those related to despair keep separate. This is in sharp contrast with love and friendship, where there were plenty of co-occurrences, as can be appreciated in Table 1.
After this preliminary bird’s eye view, we have applied several clustering techniques to the co-occurrence material, which we proceed to describe below.
The idea of clustering, a non-supervised exploratory technique, is to identify meaningful groups (“clusters”) in large amounts of data. There are two major ways to proceed: divisive clustering, which starts by dividing the material, and agglomerative clustering, which proceeds bottom-up by joining the closest elements. In any case, these processes depend on a distance measure between the objects to be clustered, and there are many different distance measures to choose from.
For the co-occurrence data, we considered that co-occurrence between two words (or emotion groups) measured how close they were, and defined distance as the inverse of the co-occurrence number (so, if X co-occurred 43 times with Y, their distance would be 1/43). When there were no co-occurrences at all in the material, we assumed infinite distance. We then applied multidimensional scaling with two and three dimensions to the emotion group co-occurrence data. The result for two dimensions is in Figure 9.
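The distance definition above can be written down in a few lines. The sketch assumes co-occurrence counts stored in a dictionary keyed by alphabetically sorted label pairs, and sets the distance of a group to itself to zero, which the text does not specify. The name `distance_matrix` and the toy counts (apart from the 1/43 example) are hypothetical.

```python
import math


def distance_matrix(labels, cooc):
    """d(X, Y) = 1 / cooc(X, Y); infinite when X and Y never co-occur."""
    n = len(labels)
    dist = [[0.0] * n for _ in range(n)]  # assumption: zero on the diagonal
    for i in range(n):
        for j in range(i + 1, n):
            pair = tuple(sorted((labels[i], labels[j])))
            count = cooc.get(pair, 0)
            d = 1.0 / count if count > 0 else math.inf
            dist[i][j] = dist[j][i] = d
    return dist


groups = ["Amor", "Desejo", "Inveja"]
toy_cooc = {("Amor", "Desejo"): 43}  # Inveja never co-occurs in this toy data
dist = distance_matrix(groups, toy_cooc)
for row in dist:
    print(row)
```

How the infinite entries were handled before scaling is not stated here; most multidimensional scaling implementations expect finite values, so some large finite substitute would probably be needed in practice.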
The result shows that dimension 2 clearly separates emotions from their absence. However, it is hard to understand what it is that dimension 1 captures, singling out two not very typical emotion groups: Ingrato (‘ingratitude’) and Inveja (‘envy’), in fact the two groups with lowest lexical diversity in Portuguese according to the corpora (cf. again Figure 1).
If we redo multidimensional scaling requiring three dimensions, see Figure 10, dimension 3 now singles out Saudade (‘longing’), which is the next least lexically diverse emotion group, apart from Insatisfeito (‘dissatisfaction’).
Trying hierarchical clustering (with R’s hclust command), we get a similar result, see Figure 11.
Table 3a shows the most frequent emotion group co-occurrences.
Most of these numbers can be interpreted straightforwardly: one joins quasi-synonyms (Feliz (‘happy’) and Satisfeito (‘satisfied’)), another shows the cohesiveness of the same group (Amor (‘love’)), while others join feelings that often come together, like Odio (‘hate’) and Vergonha (‘shame’), Vergonha (‘shame’) and Furia (‘anger’), or Desejo (‘desire’) and Esperanca (‘hope’), Amor (‘love’) and Feliz (‘happiness’), and Admirar (‘admiration’) and Humildade (‘humility’). Finally, one may also interpret Orgulho (‘pride’) as a cause for Furia (‘anger’), although obviously not always. It is nevertheless interesting that no antonyms come to the fore: all pairs are either both positive or both negative.
One should also recall that there is a significant number of word occurrences that are marked as belonging to Amor (‘love’) and Desejo (‘desire’) or Amor (‘love’) and Esperança (‘hope’), and these would inflate (artificially, in fact) the number of the co-occurrences of the two categories. Namely, all words marked as belonging to a double or triple category count as co-occurrences among these categories. This is something that we have to deal with, and an alternative Table 3b was therefore created without those cases.
We see that the quantities are considerably smaller than those of Table 3a, showing that many of these co-occurrences involved (or were a product of) vague categories. In this new table, there are three emotion groups that co-occur with themselves: Amor, Desejo and Medo. However, the categories that are included in vague classifications continue to be frequently co-occurring, which in a way vindicates the existence of words that convey both.
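The difference between the two ways of counting behind Tables 3a and 3b can be made explicit with a small sketch. It assumes that vague annotations are written as group names joined by an underscore (as in amor_desejo), that a vague token contributes all its groups to the sentence, and that the alternative table simply ignores vague tokens; the exact procedure used for Table 3b may differ. The function name `group_cooccurrences` and the data are invented.

```python
from collections import Counter
from itertools import combinations


def group_cooccurrences(sentences, include_vague=True):
    """Count group pairs per sentence, with or without multi-group tokens."""
    counts = Counter()
    for labels in sentences:
        groups = []
        for label in labels:
            parts = label.split("_")
            if len(parts) > 1 and not include_vague:
                continue  # drop tokens assigned to two or three groups
            groups.extend(parts)
        for a, b in combinations(groups, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Each inner list holds the emotion labels found in one toy sentence.
toy_sentences = [["amor_desejo", "feliz"], ["amor", "feliz"]]
with_vague = group_cooccurrences(toy_sentences, include_vague=True)
without_vague = group_cooccurrences(toy_sentences, include_vague=False)
print(dict(with_vague))
print(dict(without_vague))
```

In the first count, the single token amor_desejo already yields one amor and desejo co-occurrence on its own, which is the inflation discussed above; in the second, only the pair from the second sentence survives.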
4. Investigating word embeddings
For almost a decade now, the technique of using large amounts of data to produce (static) word embeddings has been actively used in many different NLP tasks in order to provide a better representation of a word’s meaning, and has also been applied in other linguistic and literary contexts (Antoniak & Mimno, 2018). Although there are fortunately several word embeddings for Portuguese (see Batista, 2019, for an overview of Hartmann et al., 2017; Rodrigues & Branco, 2018, Grave et al., 2018 and Yamada et al., 2016; and Santos, 2021, for a recent comparison among them), we decided to create our own embeddings based on precisely the data we wanted to analyse, also because, as explained in Section 2 above, we tried four kinds of word embeddings.
However, one thing that stood out was that there is a scarcity of research that uses clustering over word embeddings. Tang et al. (2014) claimed that poor results of clustering over word embeddings are due to the fact that traditional word embeddings are based on substitutability, not similarity, and so “they cannot distinguish words with similar context but opposite sentiment polarity (e.g., good and bad)” (Tang et al., 2014, p. 1563). This means that, in a word embedding representation, antonyms are closer than unrelated words, since they are often substitutable. Another property of antonymy has been pointed out by Justeson and Katz (1992), who suggested that corpus co-occurrence is a textual marker for the antonymy lexical-semantic relation. In other words, antonyms tend to co-occur in text.
Before clustering, we tried to exploit the information gathered by the word embeddings in several ways, as we describe in what follows.
4.1 Emotions near emotions?
We first set out to investigate whether words annotated as emotions also have emotions as their nearest neighbours (most similar words) in word embeddings. In order to do this, we computed the most similar words for the 3 embedding models where emotions were explicitly marked (using Gensim’s (Rehurek & Sojka, 2010) method similar_by_word from its KeyedVectors module), and extracted the following statistics:
(1) how many emotion words were included in the first 50 closest words
(2) what was the position in the top 50 closest words of the first emotion (-1 if none was an emotion)
(3) the sum of the inverse ranks of the 50 closest words which were considered an emotion.
Let us illustrate the third statistic with the help of Figure 12, which lists the 50 nearest neighbours of emo:amor. The rankings of emotion words (in bold) are then 2, 3, 4, 5, 12, 13, 14, 16, 18, 19, 20, 21, 25, 29, 30, 31, 33, 34, 36, 41, 43, 45, and 46, and the statistic amounts to 1/2 + 1/3 + 1/4 + 1/5 + 1/12 + 1/13 + 1/14 + 1/16 + 1/18 + 1/19 + 1/20 + 1/21 + 1/25 + 1/29 + 1/30 + 1/31 + 1/33 + 1/34 + 1/36 + 1/41 + 1/43 + 1/45 + 1/46, equalling 2.10.
Generically, the statistic is $S = \sum_{w} \frac{1}{\mathrm{pos}(w)}$, where w is each one of the similar (emotion) words, and pos(w) is the rank of that word.
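The three statistics can be computed from a ranked neighbour list in a few lines. The sketch below assumes that the 50 neighbours have already been extracted as a plain list of words (closest first) and that emotion words are recognisable by the emo: prefix; the function name `neighbour_statistics` and the placeholder tokens are hypothetical. The example rebuilds only the rank pattern reported for emo:amor, not the actual words of Figure 12.

```python
def neighbour_statistics(neighbours, is_emotion=lambda w: w.startswith("emo:")):
    """Return (number of emotions, rank of first emotion or -1, sum of inverse ranks)."""
    ranks = [rank for rank, word in enumerate(neighbours, start=1) if is_emotion(word)]
    n_emotions = len(ranks)
    first_rank = ranks[0] if ranks else -1
    inverse_rank_sum = sum(1.0 / rank for rank in ranks)
    return n_emotions, first_rank, inverse_rank_sum


# Ranks of the emotion words among the 50 neighbours of emo:amor, as listed above.
emotion_ranks = {2, 3, 4, 5, 12, 13, 14, 16, 18, 19, 20, 21, 25, 29,
                 30, 31, 33, 34, 36, 41, 43, 45, 46}
# Placeholder tokens: only the emo: prefix matters for the statistics.
toy_neighbours = [f"emo:w{r}" if r in emotion_ranks else f"w{r}" for r in range(1, 51)]

n, first, s = neighbour_statistics(toy_neighbours)
print(n, first, round(s, 2))  # 23 2 2.1
```

Passing a different `is_emotion` predicate would allow the same function to be used with the other annotated embedding models.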
This example at once shows several interesting features: the closest word to emo:amor is… amor itself! But this second amor was not considered an emotion in the context it appeared in – we can guess it probably was included in a proper noun like a movie or book title, and proper nouns were not annotated with emotions. Another observation is that encantamento (‘enchantment’), encanto (‘enchantment’), remorso (‘remorse’) and even arrebatamento (‘rapture’) do seem to us quite good emotion candidates, although they weren’t considered as such. Therefore, we can give this kind of feedback to enhance the lexicon and/or the rules, so that the corpus annotators can include these terms in the next round. Finally, the word coração (‘heart’) is also quite interesting because of the well-known relationship between body organs and emotions (see e.g., Enfield & Wierzbicka, 2002). Any Portuguese native speaker is used to the metaphor that love is located in the heart (a metaphor which is also incidentally shared by English and many other languages). Should one have also tagged (some) body parts as emotions? We have tagged body parts as possibly meaning emotions in another project (Freitas et al., 2015), but making use of a different syntax, and we have not so far converted that information into emotion annotation which could have been used in our experiments here.
These observations show that the results we obtain may not be final, and that several things could have been enhanced or done differently.
In order to make sense of all emotion words and not single out just one instance, we computed the three statistics given above for all the words of our 3 word embedding models. In Figure 13, we present a histogram of the results for the first statistic (how many words in the top 50 are also emotions), based on the second model, where words denoting emotion in context were prefixed with emo:
In Figure 13, we can see that there are wide differences between words. There are many emotion words (1,120 for the second word embeddings) that do not have an emotion among the closest neighbours, while others (216) have more than 40 out of 50. We assessed whether word frequency correlated with the number of emotions among the closest neighbours, but the value of 0.11 for Pearson’s product-moment correlation coefficient indicates at most a very weak linear relationship.
We have also created a corresponding histogram for the third kind of word embeddings, because we have slightly different words (remember that the words here have a prefix added, see Section 2 above). The lexicon gets larger (the same word can be in many different groups) and embeddings may differ. In fact, the random component underlying Word2Vec’s algorithm has often been criticized for its lack of stability (see Mihalcea, 2021), which means that different runs of the algorithm with the same parameters can actually produce wildly different results.
In Figure 14, we use the third kind of word embeddings (where we keep both word and group) and, restricting our attention to those emotion words with at least one emotion in their closest 50 neighbours, we check whether those neighbours belong to the same emotion group as the word itself. We create a fourth statistic that measures how many of the closest emotions belong to the same group.
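A possible reading of this fourth statistic is sketched below, assuming tokens of the form emo:word::group and expressing the result as the share of emotion neighbours whose group label is identical to that of the target word. Treating vague labels (such as amor_desejo) as different from their component groups is an assumption, as are the names `group_of` and `same_group_share`; the word and group assignments in the example are invented.

```python
def group_of(token):
    """'emo:empertigado::orgulho' -> 'orgulho'; None for unannotated tokens."""
    if token.startswith("emo:") and "::" in token:
        return token.rsplit("::", 1)[1]
    return None


def same_group_share(target, neighbours):
    """Share of emotion neighbours in the same group as `target` (None if no emotions)."""
    target_group = group_of(target)
    groups = [g for g in map(group_of, neighbours) if g is not None]
    if not groups:
        return None
    same = sum(1 for g in groups if g == target_group)
    return same / len(groups)


# Toy neighbour list; the group assignments are made up for the example.
toy_neighbours = ["amor", "emo:paixão::amor", "emo:ódio::odio",
                  "coração", "emo:carinho::amor"]
share = same_group_share("emo:amor::amor", toy_neighbours)
print(share)
```

Here two of the three emotion neighbours share the target's group. A more lenient variant could count a neighbour as matching when it shares at least one component group with the target.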
The results leave room for improvement. We believe that a decisive factor could be the property of antonyms that has been already discussed: words tend to co-occur with their antonyms, which are therefore closer to them than a word taken at random. So as a next step for future work, one should find a way to mark antonym emotion groups and check whether the percentages in Figure 14 would significantly increase.
4.2 Which emotion groups are the closest?
In the fourth word embeddings, we replaced all words of a particular group by their group name (keeping vague group names, like Amor_Desejo (‘love_desire’), as separate groups). We can use this representation to obtain similarities between groups.
The group Desejo (‘desire’) has as the closest and only neighbours the groups Amor_Desejo_Esperanca (‘love_desire_hope’), Amor (‘love’) and Desejo_Esperanca (‘desire_hope’), while Amor (‘love’) has many neighbours (16), in the following order: Feliz_Satisfeito, Feliz, Inveja, Desejo, Admirar, Odio, Satisfeito, Gen, Desespero, Esperanca, Surpresa, Desejo_Esperanca, Medo, Orgulho, Humildade_Admirar and Pena. The closest neighbours (21) of Desespero (‘despair’) are as follows: Medo, Infeliz, Surpresa, Medo_Surpresa, Furia, Insatisfeito_Outra, Alivio, Odio, Inveja, Infeliz_Desespero, Gen, Infeliz_Insatisfeito, Feliz, Ausencia, Furia_Odio, Esperança, Vergonha, Feliz_Satisfeito, Outra, Saudade and Infeliz_Vergonha.
These data are more difficult to interpret than those of the co-occurrence experiments, but we may hypothesize that the Desespero (‘despair’) group is the one with the most neighbours precisely because it has fewer occurrences and therefore it is more difficult to become autonomous. Conversely, the fact that Desejo (‘desire’) has only three group neighbours may show that either it is quite different/far away from the rest of emotions, and/or that it belongs to another (possible) cluster, namely that of volition.
One could use these similarities both to rank these groups in terms of their actual emotionality, and to measure their proximity with other emotions.
In Table 4 we show the raw results (for these groups).
|emotion group||close groups||Gensim similarity|
In order to provide a better way to understand these data, we have also tried to create partial pictures for each emotion, in Figure 15.
4.3 What about the pure word embeddings?
So far, we have only used our three word embedding models that make use of the annotation. But could anything be learned, or discovered, from the traditional word embeddings that require no corpus annotation? We could even compare the results with other word embeddings for Portuguese.
So, we decided to try them out by choosing the words amor and desespero (the ones that, after all, gave the names to the groups we have been looking at) and manually identify their close neighbours that are emotions, in Figure 16.
We see that 21 out of 50 are not emotional words for amor, while only 11 out of 50 do not describe emotion for desespero. This means that regular word embeddings fare very well compared with those created with explicit annotation. However, these test cases may not have been the best ones, since they correspond to cases where the word (amor, desespero) is always an emotion, as opposed to pena (‘sorrow’, ‘feather’, ‘punishment’, ‘pen’, etc.) or reconhecer (‘be grateful’, ‘recognize’, etc.).
4.4 Clustering based on embeddings
Finally, we also attempted the direct use of the k-means algorithm, implemented in the scikit-learn Python package (Pedregosa et al., 2011), to group the embeddings. The k-means algorithm partitions vectors into k clusters, where k is a predefined value. It works by assigning each vector to the cluster with the nearest mean. The main problem of this method is the requirement of predefining the number of desired clusters, see Lloyd (1982). We applied it directly to the word embedding vector of each one of the groups. With the idea of finding groups that could be merged, or that are semantically closer, we asked for 20 clusters, shown in Table 5.
|cluster||emotion groups|
|1||inveja, amor, ódio|
|2||amor_desejo_esperança, desejo, desejo_esperança|
|3||amor_desejo_orgulho, amor_orgulho, grato|
|4||infeliz, medo_surpresa, desespero, gen, furia_odio, medo, alivio|
|5||furia, vergonha, infeliz_pena, orgulho|
|6||infeliz_insastifeito_outra, outra, infeliz_insastifeito|
|7||humildade, odio_vergonha, pena, insatisfeito|
|9||infeliz_desespero, feliz_orgulho, medo_infeliz, feliz_amor, infeliz_vergonha, coragem, orgulho_vergonha, coragem_furia, ingrato, orgulho_admirar, desespero_furia|
|11||coragem_ausencia, furia_outra, ausencia, gen_ausencia, alivio_ausencia|
|15||satisfeito, feliz_satisfeito, feliz|
|17||infeliz_furia_insatisfeito, surpresa, insatisfeito_outra|
|18||desejo_inveja, amor_admirar, admirar|
Although some clusters are difficult to interpret, some interesting details emerge. Amor (‘love’) and Odio (‘hate’) are in the same cluster (1), although why Inveja (‘envy’) is with them is less clear. The same happens for Orgulho (‘pride’) and Vergonha (‘shame’) in cluster 5, also accompanied by others. Cluster 11 correctly joins all classes which have absence of emotion, although it also comprises the group Furia_Outra (‘anger_other’). One would expect that with more required clusters Furia_Outra (‘anger_other’) would move to another cluster. Cluster 15 links Satisfeito (‘satisfaction’) and Feliz (‘happiness’) (recall that this was one of the cases where we wondered whether merging made sense), and cluster 2 is about Desejo (‘desire’). Cluster 6 revealed a spelling error, Insastifeito, which is not a group; most probably it occurs as the annotation of one word only, which was joined with Outra (‘other’). Finally, cluster 13 uncovers yet another problem of the original annotation, namely the fact that there are cases marked Humildade_Admirar and cases marked Admirar_Humildade in the material, while there should be only one way of encoding this group. This, too, shows that clustering was useful.
After this clustering attempt, we tried using fuzzy clustering, following Atakishiyev and Reformat (2020), who have also applied it to word embeddings. Fuzzy clustering is an extension of clustering in which each data point can belong to more than one cluster. Membership is then a matter of degree rather than a yes/no decision. This was motivated by the following property of the corpus annotation: we allowed more than one emotion group to be assigned to a particular word in context. These cases seem to require a different kind of clustering, where membership is not crisp, but partial. However, we can only report negative results in that respect. Our preliminary attempts did not show any improvement over k-means.
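For readers unfamiliar with k-means, the iteration it relies on can be sketched in plain Python. This is not the scikit-learn implementation used for Table 5 (which, among other things, initialises the centroids differently by default); it is a bare version of Lloyd's procedure with random initial centroids, and the two-dimensional vectors below are invented stand-ins for the 300-dimensional group embeddings.

```python
import math
import random


def kmeans(vectors, k, n_iter=100, seed=0):
    """Plain Lloyd iteration: assign to nearest mean, recompute means, repeat."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = None
    for _ in range(n_iter):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy 2-d vectors (invented) standing in for group embeddings.
names = ["feliz", "satisfeito", "ausencia", "gen_ausencia"]
toy_vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
labels = kmeans(toy_vectors, k=2)
for name, label in zip(names, labels):
    print(name, label)
```

The cluster numbers themselves are arbitrary; only which items share a number is meaningful, which is also why the cluster identifiers in Table 5 carry no interpretation of their own.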
5. Related work
We report briefly on three different strands of related work: looking at emotions in text and not only polarity; clustering emotions; and interpreting results with word embeddings.
Parallel to sentiment analysis, there is a (much smaller) line of research that pursues what has been called “emotion annotation” or “emotion detection”, represented by Maia (1994), Aman and Szpakowicz (2007), and Ptaszynski et al. (2014). For a review, see Seyeditabari et al. (2018). There has also been some work on identifying emotions specifically in the literary domain, examples of which are Mohammad (2012) and Kim et al. (2017). Our work belongs to this tradition. Compared to the above works, we are using considerably larger amounts of text.
As to clustering emotions, that is, using empirical methods to identify how emotion is structured in a language, we are only aware of a few works: Feng et al. (2011) use a sentiment lexicon in order to cluster blogs, and extract the most common words in each of their (eight) clusters; Hu et al. (2009) classify Chinese lyrics by their emotions, using fuzzy clustering. Again, our work addresses far more textual data and more text genres.
The work closest to ours, Tang et al. (2014), creates what they call sentiment-specific word embeddings (SSWE). But, as the name indicates, they use “sentiment” (positive or negative) and not emotion. They also use the top-100 closest words in their word embeddings to evaluate the polarity consistency of different sentiment lexicons. Other researchers trying to adapt word embeddings to emotion or sentiment use techniques like retro-fitting or counter-fitting emotion lexicons onto the “ordinary” word embeddings, producing other kinds of models; see, for example, Speer and Chin (2019). Although these works are valuable in suggesting new avenues for using word embeddings in the study of emotions, we believe our different attempts are also worth pursuing, and enrich this relatively young area.
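The idea of checking polarity against the closest words in an embedding space can be illustrated with a small sketch. It is written only loosely in the spirit of the evaluation mentioned above and is not the procedure of Tang et al. (2014): the function names, the use of cosine similarity, the two-dimensional vectors and the polarity labels are all invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(word, vectors, k):
    """Return the k words whose vectors are most similar to that of `word`."""
    scores = [(cosine(vectors[word], vec), other)
              for other, vec in vectors.items() if other != word]
    return [other for _, other in sorted(scores, reverse=True)[:k]]


def polarity_consistency(vectors, polarity, k):
    """Fraction of (word, neighbour) pairs that share the same polarity,
    counting only words and neighbours present in the polarity lexicon."""
    agreeing = total = 0
    for word in vectors:
        if word not in polarity:
            continue
        for neighbour in nearest_neighbours(word, vectors, k):
            if neighbour in polarity:
                total += 1
                agreeing += polarity[neighbour] == polarity[word]
    return agreeing / total if total else float("nan")


# Invented toy vectors and polarity labels, for illustration only.
toy_embeddings = {
    "alegria": [0.9, 0.1], "feliz": [0.8, 0.2],
    "triste": [0.1, 0.9], "infeliz": [0.2, 0.8],
}
toy_polarity = {"alegria": "+", "feliz": "+", "triste": "-", "infeliz": "-"}

print(polarity_consistency(toy_embeddings, toy_polarity, k=1))
print(polarity_consistency(toy_embeddings, toy_polarity, k=3))
```

With real embeddings the dictionary of vectors would be replaced by a trained model, and the value of k (100 in the work cited) would have to be chosen with the size of the lexicon in mind.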
6. Concluding remarks
These exploratory experiments only scratch the surface of what can be done with such a rich resource at our fingertips, whose purpose is to provide a wide picture of the reference to emotions in Portuguese. We explored this resource with several big data techniques, and we hope that this is just the beginning of a new research area in the years to come. We have made the resources and the methods employed in the explorations described here publicly available,4 and we hope to see others follow suit.
There are a number of experiments we still wish to perform, ranging from simply testing other word embedding approaches such as FastText or GloVe to experimenting with different preprocessing strategies, adding other annotation information (part-of-speech, functional dependency, morphological or other semantic properties), and investigating text genre, time period, and so on.
Also, there are other approaches in the word embeddings world that seem worth trying, from box embeddings (Abboud et al., 2020) to comparison with other public word embeddings for Portuguese, as well as the creation of word embeddings per genre, which has often been proposed in the literature (Tshitoyan et al., 2019).
Likewise, we only looked at the emotion groups Amor (‘love’) and Desespero (‘despair’). More than twenty further groups, as well as their possible mergings (for example, by joining together antonyms), could and should be investigated in order to learn more about emotion in Portuguese, and to evaluate thoroughly the particular annotation available.
Concerning the three initial motivating questions, our preliminary conclusions based on the available material are as follows:
- There is evidence to merge Feliz (‘happiness’) and Satisfeito (‘satisfaction’), from clustering word embeddings.
- The Desespero (‘despair’) group is apparently extremely wide, since it displays connections with many different emotions, which may mean that it joins disparate things and should be divided.
- There seems to be no reason to separate friendship and love. They seem to belong to the same group, called Amor (‘love’).
As one reviewer noted, there are no predictions from previous studies of what emotion groups might look like in Portuguese, so we cannot contrast our groups with others in the literature.
Finally, concerning the limitations of our approach for comparative work, we are clearly only discussing Portuguese here. Similar studies have to be done for other languages before one can contrast Portuguese with them. It is worth emphasizing that we, like Wierzbicka (1999) and many others (Jackson et al., 2019), do not believe emotion concepts work similarly in different languages.5 Otherwise, there would be no point in doing emotion studies in Portuguese, given that there are more resources in general for English linguistics.
1. The naming decision was unfortunate, because different classes received different grammatical categories (adjective or noun), and “most frequent” is based on the size of the corpus at the naming time (2015), and may therefore have changed already. In the present paper we will consistently use the nominal version in the English translation. We will keep the Portuguese names unchanged as they were in the original corpus.
2. Are the groups well-chosen? Are the different emotion words that belong to a group correctly classified?
3. Preliminary experiments seem to show we were mistaken: there seems to be no significant difference.
4. From https://www.linguateca.pt/documentacao/artigoClusteringEmotions.html.
5. Although Wierzbicka assumes a universal framework of semantic primitives, she takes special care to explain that (most) emotions are culture-specific, albeit built from common primitives; see also her work on pain (Goddard & Wierzbicka, 1994).
We are grateful to Fundação Científica para a Computação Nacional, FCCN, for maintaining Linguateca’s servers, and to NRIS – Norwegian Research Infrastructure Services for access to the Saga cluster in Norway. We thank the anonymous reviewers of our paper for excellent feedback and editorial help, and we thank every researcher at Linguateca who contributed to the resources used here. Special thanks go to Cristina Mota, with whom we discussed almost every line of this paper, and who therefore contributed significantly to its final form.
The authors have no competing interests to declare.
Abboud, R., Ceylan, I. I., Lukasiewicz, T., & Salvator, T. (2020). BoxE: A Box Embedding Model for Knowledge Base Completion. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan & H. Lin (Eds.), Advances in Neural Information Processing Systems 33 (NeurIPS 2020) (pp. 9649–9661). Vancouver, Canada: Curran Associates Inc. DOI: http://doi.org/10.48550/arXiv.2007.06267
Aman, S., & Szpakowicz, S. (2007). Identifying Expressions of Emotion in Text. In V. Matousek & P. Mautner (Eds.), TSD 2007: Text, Speech and Dialogue (pp. 196–205), Springer. DOI: http://doi.org/10.1007/978-3-540-74628-7_27
Antoniak, M., & Mimno, D. (2018). Evaluating the Stability of Embedding-based Word Similarities. Transactions of the Association for Computational Linguistics, 6, 107–119. DOI: http://doi.org/10.1162/tacl_a_00008
Atakishiyev, S., & Reformat, M. Z. (2020). Analysis of Word Embeddings Using Fuzzy Clustering. In S. Shahbazova, J. Kacprzyk, V. Balas, & V. Kreinovich (Eds.), Recent Developments and the New Direction in Soft-Computing Foundations and Applications. Studies in Fuzziness and Soft Computing (pp. 539–551). Springer, Cham. DOI: http://doi.org/10.1007/978-3-030-47124-8_44
Barrett, L. F., Lewis, M., & Haviland-Jones, J. M. (Eds.) (2018). Handbook of Emotions: Fourth Edition. Guilford Press.
Batista, D. S. (2019). Portuguese Word Embeddings. http://www.davidsbatista.net/blog/2019/11/03/Portuguese-Embeddings/
Bick, E. (2000). The Parsing System “Palavras”: Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus University Press.
Bick, E. (2007). Automatic Semantic Role Annotation for Portuguese. TIL, V Workshop em Tecnologia da Informação e da Linguagem Humana, 1715–1719.
Bick, E. (2014). PALAVRAS, a Constraint Grammar-based Parsing System for Portuguese. In T. B. Sardinha, & T. L. S. B. Ferreira (Eds.), Working with Portuguese Corpora (pp. 279–302). London/New York: Bloomsbury Academic.
Birjali, M., Kasri, M., & Beni-Hssane, A. (2021). A comprehensive survey on sentiment analysis: Approaches, challenges and trends. Knowledge-Based Systems, 226, 107134. DOI: http://doi.org/10.1016/j.knosys.2021.107134
Boddice, R. (2018). The history of emotions. Manchester: Manchester University Press.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5, 135–146. DOI: http://doi.org/10.1162/tacl_a_00051
Csardi, G., & Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695(5), 1–9. http://igraph.org
Enfield, N. J., & Wierzbicka, A. (2002). Introduction: The body in description of emotion. Pragmatics & Cognition, 10(1/2), 1–25. DOI: http://doi.org/10.1075/pc.10.1-2.02enf
Feng, S., Wang, D., Yu, G., Gao, W., & Wong, K.-F. (2011). Extracting common emotions from blogs based on fine-grained sentiment clustering. Knowledge and Information Systems, 27, 281–302. DOI: http://doi.org/10.1007/s10115-010-0325-9
Freitas, C., Santos, D., Mota, C., Carriço, B., & Jansen, H. (2015). O léxico do corpo e anotação de sentidos em grandes corpora: o projeto Esqueleto [The lexicon of the body and sense annotation on large corpora]. Revista de Estudos da Linguagem, 23(3), 641–680. DOI: http://doi.org/10.17851/2237-2083.23.3.641-680
Goddard, C., & Wierzbicka, A. (1994). Pain: is it a human universal? In C. Goddard & A. Wierzbicka (Eds.), Semantic and Lexical Universals (pp. 127–155). John Benjamins Publishing. DOI: http://doi.org/10.1075/slcs.25
Grave, E., Bojanowski, P., Gupta, P., Joulin, A., & Mikolov, T. (2018). Learning Word Vectors for 157 Languages. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 3483–3487. https://aclanthology.org/L18-1550
Hartmann, N. S., Fonseca, E. R., Shulby, C. D., Treviso, M. V., Rodrigues, J. S., & Aluísio, S. M. (2017). Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks. Proceedings of 11th Brazilian Symposium in Information and Human Language Technology, 122–131. https://aclanthology.org/W17-6615/
Hu, Y., Chen, X., & Yang, D. (2009). Lyric-based song emotion detection with affective lexicon and fuzzy clustering method. 10th International Society for Music Information Retrieval Conference (ISMIR 2009), 123–128.
Jackson, J. C., Watts, J., Henry, T. R., List, J.-M., Forkel, R., Mucha, P. J., Greenhill, S. J., Gray, R. D., & Lindquist, K. A. (2019). Emotion semantics show both cultural variation and universal structure. Science, 366(6472), 1517–1522. DOI: http://doi.org/10.1126/science.aaw8160
Justeson, J. S., & Katz, S. M. (1992). Redefining Antonymy: The Textual Structure of a Semantic Relation. Literary and Linguistic Computing, 7(3), 176–184. DOI: http://doi.org/10.1093/llc/7.3.176
Kim, E., Padó, S., & Klinger, R. (2017). Investigating the Relationship between Literary Genres and Emotional Plot Development. Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 17–26. https://aclanthology.org/W17-2203/. DOI: http://doi.org/10.18653/v1/W17-2203
Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137. DOI: http://doi.org/10.1109/TIT.1982.1056489
Maia, B. (1994). A Contribution to the Study of the language of Emotion in English and Portuguese. Porto: Faculdade de Letras da Universidade do Porto. Revised version: 1996 http://web.letras.up.pt/bhsmaia/belinda/pubs/thesis.htm
Maia, B., & Santos, D. (2018). Language, emotion, and the emotions: The multidisciplinary and linguistic background. Language and Linguistics Compass, 12(5). DOI: http://doi.org/10.1111/lnc3.12280
Mihalcea, R. (2021, 19 May). The Ups and Downs of Word Embeddings. https://www.youtube.com/watch?v=33XtLnPDOC0
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 26 (NIPS 2013). Curran Associates, Inc. https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf
Mohammad, S. M. (2012). From once upon a time to happily ever after: Tracking emotions in mail and books. Decision Support Systems, 53(4), 730–741. DOI: http://doi.org/10.1016/j.dss.2012.05.030
Mota, C., & Santos, D. (2015). Emotions in natural language: a broad-coverage perspective. http://www.linguateca.pt/acesso/EmotionsBC.pdf
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2), 1–135. DOI: http://doi.org/10.1561/1500000011
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830. https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2014, 1532–1543. DOI: http://doi.org/10.3115/v1/D14-1162
Ptaszynski, M., Rzepka, R., Araki, K., & Momouchi, Y. (2014). Automatically annotating a five-billion-word corpus of Japanese blogs for sentiment and affect analysis. Computer Speech and Language, 28, 38–55. DOI: http://doi.org/10.1016/j.csl.2013.04.010
R Development Core Team. (2008). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/
Ramos, B. C. (2021). Descrição de uma metodologia desenvolvida para revisão de um léxico de palavras de emoção [Description of a methodology developed to revise an emotion lexicon]. Jornadas de Descrição do Português, STIL 2021 (pp. 389–397). DOI: http://doi.org/10.5753/stil.2021.17819
Ramos, B. C., & Freitas, C. (2019). “Sentimento de quê?” uma lista de sentimentos para a Análise de Sentimentos [Feeling of what? A list of feelings for emotion analysis]. STIL – Symposium in Information and Human Language Technology, Salvador, BA, 38–47.
Ramos, B., Santos, D., & Freitas, C. (2020). Looking at body expressions to enrich emotion clusters. In M. J. B. Finatto, S. Luz, S. Pollak, & R. Vieira (Eds.), Proceedings of the Digital Humanities and Natural Language Processing Workshop at the 14th International Conference on the Computational Processing of Portuguese Language (pp. 57–62). http://hdl.handle.net/10400.26/35280
Rehurek, R., & Sojka, P. (2010, May). Software Framework for Topic Modelling with Large Corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50. http://www.lrec-conf.org/proceedings/lrec2010/workshops/W10.pdf
Rodrigues, J., & Branco, A. (2018). Finely Tuned, 2 Billion Token Based Word Embeddings for Portuguese. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2403–2409. https://aclanthology.org/L18-1382.pdf
Romanov, V., & Khusainova, A. (2021). Evaluation of Morphological Embeddings for English and Russian Languages. NLPIR 2019: Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval, 144–148. DOI: http://doi.org/10.1145/3342827.3342846
Santos, D. (2014). Corpora at Linguateca: Vision and roads taken. In T. B. Sardinha, & T. L. S. B. Ferreira (Eds.), Working with Portuguese Corpora (pp. 219–236). Bloomsbury.
Santos, D. (2016). Comparando corpos orais (transcritos) e escritos na Gramateca [Comparing (transcribed) oral corpora with written corpora in Gramateca]. In C. Bardel & A. De Meo, Parler les langues romanes/Parlare le lingue romanze/Hablar las lenguas romances/Falando línguas românicas. Atti del Convegno Internazionale GSCP 2014 (pp. 127–142). Napoli: Università di Napoli L’Orientale, Il Torcoliere.
Santos, D. (2021). Natural and artificial intelligence; natural and artificial language. In R. Queirós, M. Pinto, A. Simões, F. Portela, & M. J. Pereira (Eds.), 10th Symposium on Languages, Applications and Technologies (SLATE 2021) (pp. 1:1–1:11), OASIcs – OpenAccess Series in Informatics (vol. 94). DOI: http://doi.org/10.4230/OASIcs.SLATE.2021.1
Santos, D., & Maia, B. (2018). Language, emotion, and the emotions: A computational introduction. Language and Linguistics Compass, 12(6). DOI: http://doi.org/10.1111/lnc3.12279
Santos, D., & Mota, C. (2010). Experiments in human-computer cooperation for the semantic annotation of Portuguese corpora. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), 1437–1444. http://www.lrec-conf.org/proceedings/lrec2010/pdf/457_Paper.pdf
Santos, D., & Mota, C. (2015). A admiração à luz dos corpos [Admiration illuminated by corpora]. In A. Simões, A. Barreiro, D. Santos, R. Sousa-Silva, & S. E.O. Tagnin (Eds.), Linguística, Informática e Tradução: Mundos que se Cruzam. Homenagem a Belinda Maia, Oslo Studies in Language, 7(1), 57–77. https://journals.uio.no/public/journals/1/images/osla-7-1.pdf. DOI: http://doi.org/10.5617/osla.1466
Santos, D., Simões, A., & Mota, C. (2022). Broad coverage emotion annotation. Language Resources and Evaluation, 56, 857–879. DOI: http://doi.org/10.1007/s10579-021-09565-1
Seyeditabari, A., Tabari, N., & Zadrozny, W. (2018). Emotion Detection in Text: A Review. https://arxiv.org/pdf/1806.00674.pdf
Speer, R., & Chin, J. (2019). An Ensemble Method to Produce High-Quality Word Embeddings. https://arxiv.org/pdf/1604.01692.pdf
Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., & Qin, B. (2014). Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 1555–1565. https://aclanthology.org/P14-1146.pdf. DOI: http://doi.org/10.3115/v1/P14-1146
Tshitoyan, V., Dagdelen, J., Weston, L., Dunn, A., Rong, Z., Kononova, O., Persson, K. A., Ceder, G., & Jain, A. (2019). Unsupervised word embeddings capture latent knowledge from materials science literature. Nature, 571, 95–98. DOI: http://doi.org/10.1038/s41586-019-1335-8
Wierzbicka, A. (1999). Emotions across Languages and Cultures: Diversity and Universals. Cambridge University Press. DOI: http://doi.org/10.1017/CBO9780511521256
Yamada, I., Shindo, H., Takeda, H., & Takefuji, Y. (2016). Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation. Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL), 250–259. DOI: http://doi.org/10.18653/v1/K16-1025