<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd">
<!--<?xml-stylesheet type="text/xsl" href="article.xsl"?>-->
<article article-type="research-article" dtd-version="1.2" xml:lang="en" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id journal-id-type="issn">2397-5563</journal-id>
<journal-title-group>
<journal-title>Journal of Portuguese Linguistics</journal-title>
</journal-title-group>
<issn pub-type="epub">2397-5563</issn>
<publisher>
<publisher-name>Open Library of Humanities</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.16995/jpl.15242</article-id>
<article-categories>
<subj-group>
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Semiautomatic selection of interjectional onomatopoeia from English, Portuguese, Spanish, and Ukrainian corpora based upon syllables&#8217; repetition pattern</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Fokin</surname>
<given-names>Serhii</given-names>
</name>
<email>sergiyborysovych@ukr.net</email>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
</contrib-group>
<aff id="aff-1"><label>1</label>Taras Shevchenko National University of Kyiv, UA</aff>
<pub-date publication-format="electronic" date-type="pub" iso-8601-date="2025-09-10">
<day>10</day>
<month>09</month>
<year>2025</year>
</pub-date>
<pub-date pub-type="collection">
<year>2025</year>
</pub-date>
<volume>24</volume>
<fpage>1</fpage>
<lpage>27</lpage>
<permissions>
<copyright-statement>Copyright: &#x00A9; 2025 The Author(s)</copyright-statement>
<copyright-year>2025</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See <uri xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</uri>.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://jpl.letras.ulisboa.pt/articles/10.16995/jpl.15242/"/>
<abstract>
<p>Onomatopoeic words constitute a serious challenge for translators, lexicographers, language learners, and teachers. Hence, empirical data collection on onomatopoeia is highly sought after. The most suitable data sources for extracting onomatopoeia are large language corpora. Since onomatopoeic words and, particularly, interjectional onomatopoeias show wide variance and many of them are created spontaneously, the methodology chosen for automating the extraction in this study initially involved observing the existing patterns of transcribed interjectional onomatopoeias, among which the one based upon repetition proved the most recurrent. The observed features included identical or similar syllable sequences and three or more repeated letters, combined with punctuation markers such as hyphens, ellipses, quotation marks, or exclamation marks, and with part-of-speech tags. These properties were then used to formulate corpus queries. The search was based on the pattern of repetition of similar syllables. The results underwent an ANOVA test, which revealed open and closed hyphenated syllables to be the most reliable pattern for extracting interjectional onomatopoeias from corpora of English, Portuguese, Spanish, and Ukrainian. The markers used yield high efficiency, which was evaluated in terms of precision.</p>
</abstract>
</article-meta>
</front>
<body>
<sec>
<title>1. Introduction</title>
<p>Phonetic motivation as a lexeme creation mechanism constitutes a considerable theoretical and practical challenge. Onomatopoeic words, which may appear spontaneously in a given language and are currently widely used, are characterized by a high degree of occasionality. Examples in written literature are rare, and their usage is not clearly regulated, except for the most frequent forms traditionally mentioned in dictionaries and grammars: <italic>bow-wow, bang, tic-tac</italic>, and similar. One of the practical challenges arising from this phenomenon is their correct usage in foreign language learners&#8217; speech: however correct the grammar and vocabulary used in spontaneous speech, inappropriate onomatopoeic words are likely to reveal that the speech is unnatural. In translation practice, the usage and meaning of onomatopoeia remain poorly explored, and the phenomenon poses a practical challenge. As Casas-Tost points out:</p>
<disp-quote>
<p>As I see it, one of the factors which are an encumbrance to the translator&#8217;s task is that these text units have been given little importance at a theoretical level and, as a consequence, in practice. This is reflected by the lack of onomatopoeia entries in all manner of reference books, including dictionaries, which I believe is one of the reasons why they are rarely used (<xref ref-type="bibr" rid="B6">2012, p. 39</xref>).</p>
</disp-quote>
<p>It is evident that many onomatopoeias are used <italic>ad hoc</italic> and are highly dependent upon the situational context. There are hundreds of conventional onomatopoeic words used in fiction, in transcribed oral texts, and in internet communication that could additionally be registered in monolingual and bilingual dictionaries and that would be of high practical value for language learners, teachers, and translators. Observations on this matter are scarce in the specialized literature, since examples are hard to find.</p>
<p>Nevertheless, some lexicographic sources devoted specifically to onomatopoeia do exist, as monolingual or even bilingual dictionaries, for a limited set of languages, for example, <italic>Farhange Namavaha dar Zbane</italic> (&#8220;A dictionary of Onomatopoeia in Persian&#8221;) by Vahidian Kamyar (<xref ref-type="bibr" rid="B32">1996</xref>). Curiously, bilingual or multilingual dictionaries on this subject are more numerous than monolingual ones, perhaps because the dictionary compilers became aware of the object&#8217;s importance through specific translation or language-learning challenges. Such works include <italic>Diccionari d&#8217;onomatopeies i altres interjeccions: amb equival&#232;ncies en angl&#232;s, espanyol i franc&#233;s</italic> by Riera-Eures (<xref ref-type="bibr" rid="B21">2010</xref>) for Catalan, English, Spanish, and French, and <italic>Japanese-Ukrainian Themed Dictionary of Onomatopoeic Vocabulary</italic> by Egava &amp; Kobelyanska (<xref ref-type="bibr" rid="B9">&#1045;&#1169;&#1072;&#1074;&#1072;, 2016</xref>), which offers the user a wide range of search possibilities, from alphabetical lookup to access through subject classification (this onomasiological approach being still quite rare among lexicographers). After a detailed overview, Medvediv and Dmytruk (<xref ref-type="bibr" rid="B15">2019, p. 79&#8211;80</xref>) provide an extensive list of the Japanese lexicographers&#8217; achievements. Despite these examples, for most languages and language pairs, the lexicographic gap regarding onomatopoeia remains unfilled.</p>
<p>In spite of the emerging literature and lexicographic sources regarding onomatopoeia, this lexically, emotionally, culturally, communicatively, and stylistically remarkable feature still constitutes an impressive lacuna in the domain of language didactics, lexicography, and translation due to data shortage.</p>
<p>The logical question arising in such cases is whether the selection can be automated in large-volume data, such as language corpora. Therefore, the purpose of this study is to find a method of automating the extraction of onomatopoeic words by observable formal markers, and to evaluate its effectiveness in terms of precision. To be able to generalize commonalities in such markers, we resort to annotated corpora of four different languages: English, Portuguese, Spanish, and Ukrainian.</p>
<p>The article is organized as follows. After this introduction, in Section 2, Theoretical background, we explore the approaches to the subject in literature regarding onomatopoeia as a translational and lexicographic challenge, as well as methods of automatic retrieval of onomatopoeia. In Section 3, Methodology, we elaborate upon the methods used to automate the onomatopoeia extraction from corpora of English, Portuguese, Spanish, and Ukrainian. In Section 4, Results and Discussion, we propose tools to evaluate the precision of the performed corpus queries and judge the statistical significance of the expected precision rates obtained in the course of the study. Finally, in Section 5, we summarize the most common formal properties that may be successfully used in corpus queries to extract interjectional onomatopoeias.</p>
</sec>
<sec>
<title>2. Theoretical background</title>
<sec>
<title>2.1. Interjection vs. onomatopoeia: distinguishing criteria</title>
<p><italic>&#8220;Onomatopoeia</italic> is the naming of a thing or action by a vocal imitation of the sound associated with it (such as <italic>buzz</italic> or <italic>hiss</italic>)&#8221; (<xref ref-type="bibr" rid="B5">Britannica, 2024</xref>). The phenomenon in question is not restricted to a particular part of speech (POS); however, due to lexical and functional similarities, researchers tend to associate onomatopoeia primarily with interjections, as seen in some titles, such as <italic>Diccionari d&#8217;onomatopeies i altres interjeccions: amb equival&#232;ncies en angl&#232;s, espanyol i franc&#233;s</italic> by Riera-Eures (<xref ref-type="bibr" rid="B21">2010</xref>). In fact, sophisticated criteria are needed to distinguish the two terms, which is why entire works, even PhD theses, are devoted to this issue (<xref ref-type="bibr" rid="B16">Meinard, 2022</xref>).</p>
<p><italic>Interjection</italic>, in turn, is defined as &#8220;an exclamatory word or phrase used to express an emotional reaction or to emphasize a thought&#8221; (<xref ref-type="bibr" rid="B5">Britannica, 2024</xref>). Comparing the definitions of these closely interrelated terms, we can conclude that the difference between the two phenomena lies in their semantics: while onomatopoeia expresses sounds, interjections convey emotions. Rodr&#237;guez Guzm&#225;n infers a set of additional points of inflection (formal characteristics, syntactic function in sentences) and disjunction (motivation patterns, morphonological processes, semantics) between onomatopoeia and interjection, finally concluding that both are to be considered separate word classes (<xref ref-type="bibr" rid="B23">2011, p. 173</xref>). If we interpret <italic>word class</italic> as <italic>part of speech</italic> in the context of data mining, particularly in corpus linguistics, most corpus managers and taggers rely on international conventions, among which the most widespread is the <italic>Universal Dependencies</italic> (UD) framework, currently used to annotate thousands of corpora. While the UD POS tagset does include interjections, onomatopoeia is not part of it (<xref ref-type="bibr" rid="B29">Universal Dependencies, 2024</xref>). Similarly, many other traditional POS lists, which typically contain 9 or 10 items, comprise interjections (since the Latin grammars) but traditionally exclude onomatopoeia. This seems logically consistent: while onomatopoeia does stand apart semantically from other POSs in expressing sounds, the semantic category of a word is not the decisive criterion for assigning it a POS label; otherwise, verbal nouns such as <italic>participation</italic> or <italic>engineering</italic> would have to be classified as verbs.</p>
<p>Both grammatical and semantic characteristics come into play when assigning a POS property to a word. From a grammatical standpoint, onomatopoeias, like interjections, are invariable words. Furthermore, depending on the linguistic school, the sound-imitating meaning is listed among the semantic properties of interjections, which is the traditional approach in Ukrainian grammar (see, for instance, <xref ref-type="bibr" rid="B14">Karamysheva, 2017, p. 218</xref>). This feature is also empirically validated by the sampling from the corpus GRAK (see <xref ref-type="table" rid="T9">Table 9</xref> and <xref ref-type="table" rid="T10">Table 10</xref>), where including the interjection tag in the query yields an impressive number of onomatopoeias. In this case, the concept of interjection turns out to be broader than that of onomatopoeia. The central question, then, is what POS status should apply to <italic>onomatopoeia</italic>. Beyond the ongoing debates about their part-of-speech status, there is an immediate need to retrieve sound-imitating words from corpora or other sources for various practical purposes. Authors, translators, and editors may need to express sound not only using pure sound imitation but also through derived nouns, verbs, adjectives, and adverbs with the semantics of sounds. Are these words to be classified as onomatopoeias? According to the <italic>Merriam-Webster Dictionary</italic>, onomatopoeia can also refer to the words formed by onomatopoeia (<xref ref-type="bibr" rid="B17">2024</xref>). Thus, not only <italic>buzz, hiss</italic> and similar words, but also <italic>buzzing, hissing, buzzy</italic>, and <italic>hissy</italic>, as well as other onomatopoeically derived lexemes (nouns, verbs, adjectives, adverbs), may form part of this list, as seen (particularly, but not exclusively) in the work of Bidaud (<xref ref-type="bibr" rid="B4">2022</xref>), who focuses on <italic>verbal onomatopoeias</italic>.
Moreover, in English, assigning a part of speech to a word such as <italic>buzz</italic> may be particularly challenging. Evidently, POS attribution may depend on the semantic and grammatical approach adopted by a particular linguistic school, but, from the standpoint of data mining at the current stage, we conclude the following:</p>
<list list-type="order">
<list-item><p>while the interjection is universally considered a part of speech, onomatopoeia is not;</p></list-item>
<list-item><p>onomatopoeia is now qualified rather as a semantic word class that may be assigned different part of speech tags;</p></list-item>
<list-item><p>interjection is a class that may comprise onomatopoeia depending on the linguistic approach;</p></list-item>
<list-item><p>both classes (onomatopoeia, interjection) in their broadest sense intersect; at their intersection, a subclass of interjectional onomatopoeias emerges.</p></list-item>
</list>
<p>Given these premises, we hereafter define <italic>interjectional onomatopoeias</italic> as onomatopoeias morphologically characterized as interjections, which act as grammatically invariable words and semantically express sounds.</p>
</sec>
<sec>
<title>2.2. Onomatopoeia as a translation challenge</title>
<p>It is important to note that the issue of translating onomatopoeic words has two sides: whereas in translation many context-driven techniques allow for multiple ad hoc contextual solutions, dictionary compilation requires more comprehensive solutions covering a wide range of potential translation situations.</p>
<p>From the translatological point of view, Yaqubi et al. outline a dozen techniques for using sources to translate onomatopoeia in the following order of precedence:</p>
<disp-quote>
<p>Make every effort to apply &#8220;established translation&#8221;, in other words, choose the exact recognized equivalent in the dictionary. In case, no equivalent is chosen for the item, choose &#8220;discursive creation&#8221; technique in order to create the same effect, although out of the context of the literary work, it may not have the same effect. Use &#8220;borrowing&#8221; technique which can help the translators to transfer the expressive function of the onomatopoeias to some extent. This transference is due to the universality of sound effects. Use &#8220;descriptive translation&#8221; in TT in order to imply that the item imitates a sound and the sound implies an action or emotion. Utilize &#8220;generalization&#8221; which helps to transfer the general meaning of onomatopoeias, i.e., the general information about the action and the emotion by using specific lexicon. However, by using this technique the form used in TT may not sound like onomatopoeia. Apply &#8220;reduction&#8221;, although by using it, some information may be partially or completely missed by the translator. Mix the translation techniques in order to create an equivalent which can imply the expressive function both in form and meaning (<xref ref-type="bibr" rid="B33">2018, p. 220</xref>).</p>
</disp-quote>
<p>The mentioned procedures are of enormous practical help for translators, who must bridge numerous onomatopoeic lacunas. Whereas in one language there may be a traditional way of imitating the sound of a particular object or phenomenon, in another language there might be no conventionally accepted onomatopoeia for a given communicative situation, while in yet another a less specific onomatopoeia might be acceptable (i.e., by means of hyperonymic substitution). The onomatopoeia for scissors or other cutting tools in Ukrainian is <italic>&#1095;&#1080;&#1082;-&#1095;&#1080;&#1082;</italic>. No specific sound-imitating word is provided in the list of onomatopoeic words for Spanish, whereas English and Portuguese seem to have analogous onomatopoeias, <italic>snip</italic> and <italic>rip</italic>, as shown in examples 1 and 2.</p>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(1)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p>These ninjas with scissors often have the vision to see the best version of you long before you can see it yourself. And there is nothing quite like the anticipation of patiently sitting in a chair, hearing the <bold><italic>snip-snip</italic></bold> of the scissors and watching a new you emerge in the mirror (<ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://www.rsvplive.ie/life/hairdressers-unsung-heroes-lives-writes-14097868">https://www.rsvplive.ie/life/hairdressers-unsung-heroes-lives-writes-14097868</ext-link>) (<xref ref-type="bibr" rid="B25">RSVPLive 2024</xref>).</p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(2)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p>Ent&#227;o ela agarrou os lindos cabelos de Rapunzel, deu-lhe algumas palmadas com a m&#227;o esquerda e com a direita apanhou a tesoura e <bold><italic>rip, rip, rip</italic></bold>, os cabelos estavam cortados (<xref ref-type="bibr" rid="B7">Chamizo Babo n.d., <italic>CRPC</italic></xref>).</p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
<p>At the same time, a similar sound in Spanish can traditionally be imitated by <italic>zas-zas</italic>, which denotes many types of noise, i.e., is a hyperonym (see example 3):</p>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(3)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p><bold><italic>&#161;Zis, zas y zas!</italic></bold> Una y otra vez zarande&#243; tijereteando el gladio vorpal! Bien muerto dej&#243; al monstruo, y con su testa &#161;volvi&#243;se triunfante galompando! (<xref ref-type="bibr" rid="B1">A bordo del Otto Neurath n.d.</xref>).</p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
<p>Beyond any valuable technique (reproduction, substitution, addition) that may serve as a brilliant situational workaround, it is crucial first to explore the existing bilingual and monolingual dictionaries of onomatopoeia in order to discover possible lexicographic gaps and attempt to bridge them, very much in accordance with the recommendation of Yaqubi et al. to look for an established equivalent as the priority method (<xref ref-type="bibr" rid="B33">2018, p. 220</xref>). Nevertheless, borrowing onomatopoeias from the source into the target text appears to be a widespread technique. Rodr&#237;guez Guzm&#225;n presents a list of onomatopoeic words borrowed into Spanish from other languages (<xref ref-type="bibr" rid="B23">2011, p. 157</xref>). At the same time, some workarounds used in translation for want of a more suitable equivalent in the target language may in fact stem from a gap in, or lack of knowledge of, existing specific means. This implies that specialized informational resources (such as dictionaries or corpora) would be of great help for translators.</p>
</sec>
<sec>
<title>2.3. Onomatopoeia in bilingual lexicography</title>
<p>Compiling bilingual or multilingual dictionary entries obviously requires credible sources and a robust methodology. From the standpoint of 21<sup>st</sup>-century lexicography, the undisputed primary source for extracting extensive linguistic data is large-volume language corpora, and onomatopoeia is no exception. In fact, many researchers use their own custom corpora to perform manual searches. For instance, Yaqubi et al. use their research corpus to calculate the frequency of onomatopoeia in Charles Dickens&#8217;s novel <italic>A Tale of Two Cities</italic> and to perform manual searches retrieving examples of onomatopoeia in its two translations into Persian (<xref ref-type="bibr" rid="B33">2018, p. 211&#8211;212</xref>).</p>
<p>Some papers do pursue the objective of automating this operation: Orrequia-Barea and Mar&#237;n-Honor explore techniques particularly focused on onomatopoeic word extraction from large-volume corpora, such as the British National Corpus (<xref ref-type="bibr" rid="B18">2020, p. 47</xref>). Nevertheless, works of this kind are few, and our objective is to propose a method to optimize the retrieval of onomatopoeic words from large corpora and to evaluate its effectiveness in terms of precision across four languages: English, Spanish, Portuguese, and Ukrainian.</p>
</sec>
</sec>
<sec>
<title>3. Methodology</title>
<sec>
<title>3.1. Observation</title>
<p>To find rational ways of retrieving examples of onomatopoeia, we first need to identify their most relevant features. As a starting point for observation, we primarily used ready-made lists of onomatopoeias in English (<xref ref-type="bibr" rid="B34">Yourdictionary, 2021</xref>), Portuguese (<xref ref-type="bibr" rid="B22">Riondlearn, 2022</xref>), Spanish (<xref ref-type="bibr" rid="B11">Fundeu, 2011</xref>), and Ukrainian (<xref ref-type="bibr" rid="B35">&#1041;&#1086;&#1078;&#1082;&#1086;, 2023</xref>). These lists were examined with the purpose of extracting valid markers to be used as reliable formal criteria during automatic or semiautomatic extraction of onomatopoeic words from corpora. This observation, aimed at an intuitive selection of relevant features, yielded the following recurrent (but neither exclusive nor mandatory) characteristics of onomatopoeic words:</p>
<list list-type="order">
<list-item><p>repeated mostly closed syllables with the same vowels (<italic>pam-pam</italic>);</p></list-item>
<list-item><p>repeated open syllables (<italic>chu-chu</italic>) and repeated syllables with different vowels (<italic>zigzag, flip-flops</italic>);</p></list-item>
<list-item><p>observed repetitions are mostly hyphenated, but merged forms are not rare (<italic>tacatar, toc-toc-toc, ronroneo, tantan, pompom</italic>);</p></list-item>
<list-item><p>repetition of three or more graphemes (<italic>zzzzz, piiuuw, zwiiiz, pionggg</italic>);</p></list-item>
<list-item><p>ending of the word with <italic>-h</italic> (<italic>pouah, schh, pchhh</italic>).</p></list-item>
</list>
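<p>The graphemic features listed above can be sketched as regular expressions. The following Python fragment is a minimal illustration only, not the corpus queries used in this study: the pattern names, the syllable length limit of two to four characters, and the restriction of the vowel-alternation pattern to the Latin vowels <italic>a, e, i, o, u</italic> are assumptions made for the example.</p>
<code language="python">
```python
import re

# Hypothetical grapheme-level patterns (illustrative assumptions,
# not the queries actually run against the corpora).
PATTERNS = {
    # same syllable repeated with hyphens: pam-pam, chu-chu, toc-toc-toc
    "hyphenated_same": re.compile(r"(\w{2,4})(?:-\1)+", re.IGNORECASE),
    # same consonant frame with a different vowel, hyphen optional:
    # flip-flop, zigzag (Latin vowels only in this sketch)
    "vowel_shift": re.compile(
        r"(\w+?)([aeiou])(\w*)-?\1(?!\2)[aeiou]\3", re.IGNORECASE
    ),
    # same syllable repeated without a hyphen: tantan, pompom
    "merged_same": re.compile(r"(\w{2,4})\1", re.IGNORECASE),
    # three or more identical graphemes in a row: zzzzz, pionggg
    "grapheme_run": re.compile(r"\w*(\w)\1{2,}\w*", re.IGNORECASE),
}


def classify(token):
    """Return the names of all patterns that match the whole token."""
    return [name for name, pat in PATTERNS.items() if pat.fullmatch(token)]


if __name__ == "__main__":
    for word in ["pam-pam", "zigzag", "tantan", "pionggg", "table"]:
        print(word, classify(word))
```
</code>
<p>Under these assumptions, <italic>pam-pam</italic> and <italic>&#1095;&#1080;&#1082;-&#1095;&#1080;&#1082;</italic> fall under the hyphenated pattern, <italic>zigzag</italic> under vowel alternation, <italic>tantan</italic> under merged repetition, and <italic>pionggg</italic> under the grapheme run. Patterns of this kind also match some non-onomatopoeic words, which is why precision has to be evaluated separately.</p>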
<p>The second and fourth observations partially coincide with the patterns proposed by Orrequia-Barea and Mar&#237;n-Honor (<xref ref-type="bibr" rid="B18">2020, p. 52</xref>). Repetition in the linguistic representation of sounds or, more specifically, reduplication is known to be a universal linguistic feature:</p>
<disp-quote>
<p>The repetition of sounds occurs in all languages of the world, doubling segments of audible material: natural sounds and animal cries, but also words and clauses (&#8230;) it is interesting to note how often reduplication serves as a common denominator even in cases when languages disagree in the choice of phonemes (<xref ref-type="bibr" rid="B2">Anderson Earl 1998, p. 112</xref>).</p>
</disp-quote>
<p>The fifth observation, once implemented in corpus queries, did not produce noteworthy results. Each of the other four observed items merits particular attention and study, and we implemented the detected features in corpus queries. In the current study, our purpose is to focus upon the syllable reduplication pattern and the possibilities of using it to automate onomatopoeia extraction from corpora. Therefore, the methodology is based on the phonic properties of onomatopoeias, such as repeated syllables, and, where possible, on part-of-speech parameters. Since there are no phonically annotated large corpora of the languages in question (Ukrainian, English, Spanish, and Portuguese), we were constrained to base our queries on the grapheme level rather than on sounds or phonemes.</p>
<p>To obtain the results, we use a particular query language depending on the search engine each corpus is provided with: the <italic>Corpus Query Language, CQL</italic> (<xref ref-type="bibr" rid="B26">Sketch Engine, n.d.</xref>) or the <italic>Corpus Query Processor, CQP</italic> (<xref ref-type="bibr" rid="B10">Evert S. &amp; The CWB Development Team, 2022</xref>), because both allow for searching patterns that match regular expressions as well as linguistic annotation tags.</p>
</sec>
<sec>
<title>3.2. Previous empirical research</title>
<p>The idea of automating onomatopoeia retrieval started being explored in 2020. Although our interest in the subject arose independently of the existing studies using similar techniques, we retrospectively took into account their achievements, which are very much in accordance with the current paper.</p>
<p>Orrequia-Barea and Mar&#237;n-Honor (<xref ref-type="bibr" rid="B18">2020</xref>) systematize different graphic properties of onomatopoeic words in written text to extract them by using the regular expression syntax, with some interesting observations regarding the correlation between onomatopoeic formal pattern and the ontological nature of the represented sound using Round &amp; Kwon&#8217;s concept of <italic>phonaesthemes</italic>, i.e. &#8220;recurrent pairings of sound and meaning&#8221; (<xref ref-type="bibr" rid="B24">Round &amp; Kwon, 2015, p. 2</xref>).</p>
<p>Orrequia-Barea and Mar&#237;n-Honor searched the text of comics to retrieve onomatopoeia in Spanish and French, whereas for texts in English they applied regular expression syntax:</p>
<disp-quote>
<p>All the above-mentioned systematisations were captured by means of regular expressions, which are patterns that are frequently used in text editors to look for phonaesthemes. This sequence has to fulfil the criteria set out by the regular expression. As the main purpose was to find most of the onomatopoeias in the BNC, the following regular expressions, based on the previous patterns of formation, were used: 1. To find consonants that were repeated at least three times: [bc-df-hj-np-tvz]{3}. This regular expression yielded onomatopoeias such as <italic>zzz</italic>. 2. To find the pattern of up to two consonants plus vowels, repeated at least twice, followed optionally by an indefinite number of consonants: [bc-df-hj-np-tvz]{0,2}vowel{2,} [bc-df-hj-np-tv-z]{0,}. We typed each of the five different vowel graphemes in the vowel slot. Some of the results were: <italic>craark, beep, riing, boom</italic> or <italic>uuummm</italic> (<xref ref-type="bibr" rid="B18">2020, p. 52</xref>).</p>
</disp-quote>
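<p>For illustration, the two regular expressions quoted above can be tried out in Python as follows. This is a sketch under stated assumptions, not a reproduction of the cited authors&#8217; procedure: the blank inside the second expression is treated as a typographic artefact and omitted, whole-token matching is assumed for the vowel pattern, and the function and variable names are hypothetical.</p>
<code language="python">
```python
import re

# Character classes copied from the quoted passage.
CONSONANTS = "[bc-df-hj-np-tvz]"
CONSONANTS_TAIL = "[bc-df-hj-np-tv-z]"

# Expression 1: three consonant graphemes in a row.
consonant_run = re.compile(CONSONANTS + "{3}")


def vowel_pattern(vowel):
    """Expression 2 with one vowel grapheme typed into the vowel slot."""
    return re.compile(
        CONSONANTS + "{0,2}" + vowel + "{2,}" + CONSONANTS_TAIL + "{0,}"
    )


def matching_vowels(token):
    """Vowels for which expression 2 matches the whole token (assumption)."""
    return [v for v in "aeiou" if vowel_pattern(v).fullmatch(token)]


if __name__ == "__main__":
    print(bool(consonant_run.fullmatch("zzz")))
    for word in ["craark", "beep", "riing", "boom", "uuummm", "bang"]:
        print(word, matching_vowels(word))
```
</code>
<p>As literally written, the first expression matches any three consecutive graphemes from its consonant class, not only identical ones, so it also finds the sequence <italic>str</italic> in <italic>street</italic>. The letters <italic>w</italic>, <italic>x</italic>, and <italic>y</italic> are absent from that class, whereas the final class of the second expression includes them.</p>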
<p>Further observations on potentially universal features of onomatopoeias involving certain sound combinations are described by Assaneo, Nichols &amp; Trevisan:</p>
<disp-quote>
<p>We explore the vocal configurations that best reproduce non-speech sounds, like striking blows on a door or the sharp sounds generated by pressing on light switches or computer mouse buttons. From the anatomical point of view, the configurations obtained are readily associated with co-articulated consonants, and we show perceptual evidence that these consonants are positively associated with the original sounds. Moreover, the pairs vowel-consonant that compose these co-articulations correspond to the most stable syllables found in the knock and click onomatopoeias across languages, suggesting a mechanism by which vocal imitation naturally embeds single sounds into more complex speech structures (<xref ref-type="bibr" rid="B3">2011, p. 11</xref>).</p>
</disp-quote>
<p>Some researchers have used dictionaries as a starting point, retrieving words explicitly marked as onomatopoeic in lexicographic sources:</p>
<disp-quote>
<p>I made a list of onomatopoeic words using the following three steps. First, I searched entries (i.e. head words) in the OED, including terms such as onomatopoeia/onomatopoeic/onomatopoetic etc. in their etymologies. Specifically, I typed onomatop* into the &#8220;FIND WORD&#8221; box in the advanced search of the OED and restricted the search area to etymologies. As a result, 385 entries met this condition. (&#8230;) However, the list of these 304 entries is not adequate in itself. The OED often treats different grammatical classes of one word (= lemma) as separate entries. In addition, these separate entries are sometimes not given the same explanation of their etymologies. Many entries would be overlooked if I examined only those entries that included onomatopoeia/onomatopoeic/onomatopoetic etc. in their etymologies (<xref ref-type="bibr" rid="B27">Takashi Sugahara, 2011, p. 34</xref>)</p>
</disp-quote>
<p>A similar approach was implemented by Yaqubi et al. (<xref ref-type="bibr" rid="B33">2018, p. 212</xref>), who compiled a list of Persian onomatopoeias at the initial stage of their research. Although this methodology seems promising, we did not apply it in the current study, for it implies significant manual work and does not allow the extraction to be automated. Moreover, at the current stage, when numerous onomatopoeic lemmas or their graphic variants are not yet included in dictionaries, we chose to extract sound-imitating interjections, including occasional ones, from corpora. This is why we opted for patterns based on syllable or grapheme repetitions.</p>
<p>We are aware that patterns based upon repetitions ignore single-syllable words. However, the basic idea is that repetitions may serve as an access point for further retrieving other kinds of onomatopoeia that are not based upon repetitions, some of which are present in the nearest context. In other words, if an onomatopoeia exists in the form of repeated syllables, then it is likely to appear also in its monosyllabic or isolated variant, as seen in the following examples (4, 5, and 6) in Portuguese and Ukrainian:</p>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(4)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p>Tenho a favor deste meu ju&#237;zo o facto de que, tendo o Governo calculado tam modestamente o rendimento deste imposto em 10 : 000 contos, ele veio a render 150: 000 &#8211; diz a C&#226;mara Corporativa -, mas h&#225; por a&#237; uns <bold><italic>zuns-zuns</italic></bold> que dizem que chegou a 200: 000 (<ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://gamma.clul.ul.pt/CQPweb/crpc/textmeta.php?text=A25999&amp;uT=y">http://gamma.clul.ul.pt/CQPweb/crpc/textmeta.php?text=A25999&amp;uT=y</ext-link>).</p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(5)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p>O &#194;ngelo veio-me para c&#225; com, uns zum <bold><italic>zuns</italic></bold> (NEM&#201;SIO, Vitorino, 1944, <italic>CRPC</italic>).</p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(6)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p>&#1059; &#1085;&#1072;&#1089; &#1073;&#1091;&#1083;&#1086; &#1087;&#1086; 130 &#1087;&#1086;&#1088;&#1072;&#1085;&#1077;&#1085;&#1080;&#1093; &#1091;&#1076;&#1077;&#1085;&#1100;. &#1031;&#1093; &#1085;&#1077; &#1090;&#1110;&#1083;&#1100;&#1082;&#1080; &#1103; &#1074;&#1080;&#1074;&#1086;&#1079;&#1080;&#1074;, &#1079;&#1074;&#1110;&#1089;&#1085;&#1086;. &#1040;&#1083;&#1077; &#1091;&#1103;&#1074;&#1110;&#1090;&#1100;, &#1097;&#1086; &#1090;&#1091;&#1090; &#1088;&#1086;&#1073;&#1080;&#1083;&#1086;&#1089;&#1103;. &#8220;<bold><italic>&#1041;&#1072;&#1093;</italic></bold>! <bold><italic>&#1041;&#1072;&#1093;</italic></bold>! <bold><italic>&#1041;&#1072;&#1072;&#1072;&#1072;&#1072;&#1093;</italic></bold>!&#8221; [We had about 130 wounded per day. Of course, I wasn&#8217;t the only one to evacuate them. But just imagine what was happening here. &#8216;Bang! Bang! Baaaang!&#8217;] (<xref ref-type="bibr" rid="B13"><italic>GRAK</italic> 2023: <italic>&#1056;&#1077;&#1087;&#1086;&#1088;&#1090;&#1077;&#1088;</italic>, 2022</xref>).</p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
<p>Having defined the enquiry method, we needed to choose the corpora best suited for extracting empirical data from among those whose interfaces allow searching for text patterns that match regular expressions.</p>
</sec>
<sec>
<title>3.3. Choosing corpora</title>
<p>To test our hypothesis by means of the corpus query patterns proposed below, we need to decide which corpora will serve as the empirical basis. We resort to corpora of the four languages that we can read and in which we can, consequently, carry out contextual analysis: English, Portuguese, Spanish, and Ukrainian. The number of languages is also constrained by the availability, in the corpus interfaces, of query languages compatible with the proposed patterns. Although English and Ukrainian are genetically distant from Spanish and Portuguese, conclusions drawn from four languages belonging to different groups allow a better assessment of the validity of the query patterns. An additional reason to choose English, Romance languages, and Ukrainian was the fact that, for the latter, the bilingual lexicographic contributions in the field of onomatopoeic dictionaries are particularly fruitful, as seen in 2.2., whereas the English-Ukrainian, Portuguese-Ukrainian, and Spanish-Ukrainian language combinations are not provided with lexicographic sources containing either interjections or onomatopoeias. Moreover, among the four languages mentioned, only the Ukrainian corpus has correctly tagged interjections, which helps bring to light additional properties of the queries performed.</p>
<p>Reference corpora may seem the most intuitive choice, since they better represent the general properties of a language. At the same time, given the needs of this research, we are also constrained by the technical characteristics of the corpora. Orrequia-Barea and Mar&#237;n-Honor stressed the importance of corpus downloadability:</p>
<disp-quote>
<p>Our first idea was to extract onomatopoeias from corpora of each language since we wanted to have empirical evidence that those onomatopoeic forms were actually used in the language. For this reason, we intended to download the corpora to look for onomatopoeias using regular expressions to get as many onomatopoeic forms as possible without restricting them to the most common ones. However, we could only follow this procedure with the BNC, since it was the only corpus that could be downloaded. For Spanish and French, the CREA and FRANTEXT corpora were not downloadable, so that we had to follow a different process, namely manually extracting onomatopoeias from comics (<xref ref-type="bibr" rid="B18">2020, p. 51</xref>).</p>
</disp-quote>
<p>To overcome this difficulty, we resorted to corpora that support the <italic>CQL</italic> (<xref ref-type="bibr" rid="B26">Sketch Engine, n.d.</xref>) and <italic>CQP</italic> (<xref ref-type="bibr" rid="B10">Evert S. &amp; The CWB Development Team, 2022</xref>) query languages, whose usage is illustrated in the next section. Both allow the use of regular expressions; therefore, it was not necessary to download any corpus.</p>
<p>While the Portuguese reference corpus <italic>CRPC, Corpus de Refer&#234;ncia do Portugu&#234;s Contempor&#226;neo</italic> (<xref ref-type="bibr" rid="B8">CLUL, Centro de Lingu&#237;stica da Universidade de Lisboa, 2008&#8211;2016</xref>) is provided with a <italic>CQP</italic> query engine, the reference corpora of English are focused on particular countries. The Spanish reference corpus <italic>CREA</italic> (<xref ref-type="bibr" rid="B20">Real Academia Espa&#241;ola, n.d.</xref>) does not support <italic>CQL</italic> or <italic>CQP</italic> and, consequently, does not allow searching by means of regular expressions, and there is no reference corpus for Ukrainian. However, given that the <italic>GRAK</italic> team is making efforts to meet the criteria for a reference corpus, and considering the additional advantage that interjections in this corpus are correctly tagged, we judge <italic>GRAK</italic> to be the best source for achieving the goal set.</p>
<p>On the other hand, given that onomatopoeias, especially occasionally created words, are likely to appear not only in fiction but also in internet communication, we finally decided to use the available <italic>CQL</italic>- or <italic>CQP</italic>-based reference or internet corpora of English, Portuguese, Spanish, and Ukrainian, namely:</p>
<list list-type="bullet">
<list-item><p>English Internet Corpus from the Leeds Collection of English Corpora (<xref ref-type="bibr" rid="B30">2022a</xref>), 190 million tokens.</p></list-item>
<list-item><p>Spanish Internet Corpus from the Leeds Collection of Internet Corpora (<xref ref-type="bibr" rid="B31">2022b</xref>), 145 million tokens.</p></list-item>
<list-item><p><italic>CRPC</italic>, Corpus de Refer&#234;ncia do Portugu&#234;s Contempor&#226;neo (<xref ref-type="bibr" rid="B8">2008&#8211;2016</xref>), 411 million tokens.</p></list-item>
<list-item><p><italic>GRAK</italic>, General Regionally Annotated Corpus of Ukrainian (<xref ref-type="bibr" rid="B13">2017&#8211;2022</xref>), 1,476 million tokens.</p></list-item>
</list>
</sec>
<sec>
<title>3.4. Building queries</title>
<p><italic>Corpus Query Language</italic> (CQL Guide), <italic>Corpus Query Processor</italic> (CQP, 2022) or similar alternative querying methods allow searching for patterns defined over sequences of letters by means of regular expressions and corpus annotations.</p>
<p>Since many occasional onomatopoeias are neither lemmatized nor annotated with particular tags in the corpus, they are to be retrieved through the attribute <italic>word</italic>, which selects tokens containing specific character sequences regardless of their lexical and grammatical properties, as shown in Query 1:</p>
<disp-quote>
<p>(Query 1) <bold>[word=&#8221;([bcdfghklmnprstvwxz]+)[aeiou]{1,2}([bcdfghklmnprstvwxz]+)-\1[aeiou]{1,2}\2&#8221;]</bold></p>
</disp-quote>
<p>In both <italic>CQL</italic> and <italic>CQP</italic>, the string inside the quotation marks, which makes up the major part of Query 1, is processed as a regular expression. This regular expression is designed to match words composed of two consonant-vowel-consonant syllables in which the initial and final consonant groups of both syllables are the same, while the vowels may be either identical or different, e.g., <italic>tic-tac, brum-brum</italic>, etc. <xref ref-type="table" rid="T1">Table 1</xref> explains each part of the regular expression used in Query 1:</p>
<table-wrap id="T1">
<label>Table 1</label>
<caption>
<p>Description of functionality of parts of Query 1.</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top">([bcdfghklmnprstvwxz]+)</td>
<td align="left" valign="top">matches the sequence of one or more consonants (represented by the character class inside the parentheses) and saves them as the first capturing group; here it stands for graphemes representing the consonant sounds of the language of corpus</td>
</tr>
<tr>
<td align="left" valign="top">[aeiou]{1,2}</td>
<td align="left" valign="top">matches one or two vowels; here it stands for other graphemes representing the vowel sounds of the language of corpus</td>
</tr>
<tr>
<td align="left" valign="top">([bcdfghklmnprstvwxz]+)</td>
<td align="left" valign="top">matches another sequence of one or more consonants and saves them as the second capturing group; here it stands for graphemes representing the consonant sounds of the language of corpus</td>
</tr>
<tr>
<td align="left" valign="top">-</td>
<td align="left" valign="top">matches a hyphen</td>
</tr>
<tr>
<td align="left" valign="top">\1</td>
<td align="left" valign="top">backreferences the first capturing group (i.e., it matches the consonants of the first capturing group, ensuring that they are repeated at the current position)</td>
</tr>
<tr>
<td align="left" valign="top">[aeiou]{1,2}</td>
<td align="left" valign="top">matches one or two vowels; here it stands for graphemes representing the vowel sounds of the language of corpus</td>
</tr>
<tr>
<td align="left" valign="top">\2</td>
<td align="left" valign="top">backreferences the second capturing group, ensuring that the same consonants captured in the second group are repeated here</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Taken together, the entire expression matches a pair of hyphenated consonant-vowel-consonant syllables that share the same consonants and may differ in their vowels. More detailed explanations of the <italic>CQL</italic> syntax are available in the <italic>Corpus Query Language Guide</italic> (<xref ref-type="bibr" rid="B26">Sketch Engine, n.d.</xref>).</p>
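<p>For illustration only, the regular expression of Query 1 can also be tried outside a corpus interface. The sketch below uses Python, which is not part of this study, on the assumption that its re module treats character classes, quantifiers and backreferences in the same way as the corpus engines; the names QUERY_1 and matches_query_1 are hypothetical, and fullmatch is used on the further assumption that the word attribute is compared against the whole token.</p>
<preformat>
```python
import re

# Character classes copied from Query 1 (English).
CONSONANTS = "bcdfghklmnprstvwxz"
VOWELS = "aeiou"

# Same structure as Query 1: consonant group, one or two vowels,
# consonant group, hyphen, then both consonant groups repeated.
QUERY_1 = re.compile(
    rf"([{CONSONANTS}]+)[{VOWELS}]{{1,2}}([{CONSONANTS}]+)-\1[{VOWELS}]{{1,2}}\2"
)


def matches_query_1(token):
    """Return True if the whole token fits the pattern of Query 1."""
    return QUERY_1.fullmatch(token) is not None


# Hypothetical sample tokens; the first two are the examples given in the text.
for token in ["tic-tac", "brum-brum", "ding-dong", "tic-tak", "hello"]:
    print(token, matches_query_1(token))
```
</preformat>
<p>Because both backreferences must repeat the captured consonants exactly, a token such as tic-tak is rejected even though its first syllable fits the pattern.</p>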
<p>Query 1 is well suited for corpora of English. However, when a similar query is applied to a language with a different character set (diacritics, Cyrillic, Greek, etc.), the characters inside the regular expression have to be adapted to its alphabet, as we do for Portuguese, Spanish, and Ukrainian. To apply the same query to the <italic>Corpus de Refer&#234;ncia do Portugu&#234;s Contempor&#226;neo</italic> (<xref ref-type="bibr" rid="B8">CLUL, Centro de Lingu&#237;stica da Universidade de Lisboa, 2008&#8211;2016</xref>) and the Portuguese Language Corpus from the Leeds Collection of Internet Corpora (2022), we need to extend the character classes with letters bearing diacritics (Query 2). Query 3 is adapted accordingly for Spanish, and Query 4 for Ukrainian:</p>
<disp-quote>
<p>(Query 2) <bold>[word=&#8221;([bcdfghklmnprstvwxz&#231;]+)[aieou&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234;&#245;&#246;&#244;]{1,2}([bcdfghklmnprstvwxz&#231;]+)-\1[aieou&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234;&#245;&#246;&#244;]{1,2}\2&#8221;]</bold></p>
<p>(Query 3) <bold>[word=&#8221;([bcdfghklmnprstvwxz&#231;&#241;]+)[aieou&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234;&#245;&#246;&#244;]{1,2}([bcdfghklmnprstvwxz&#231;&#241;]+)-\1[aieou&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234;&#245;&#246;&#244;]{1,2}\2&#8221;]</bold></p>
<p>(Query 4) <bold>[word=&#8221;([&#1073;&#1074;&#1075;&#1169;&#1076;&#1078;&#1079;&#1082;&#1083;&#1084;&#1085;&#1087;&#1088;&#1089;&#1090;&#1092;&#1093;&#1094;&#1095;&#1096;&#1097;]+)[&#1072;&#1110;&#1077;&#1086;&#1091;&#1103;&#1108;&#1102;]{1,2}([&#1073;&#1074;&#1075;&#1169;&#1076;&#1078;&#1079;&#1082;&#1083;&#1084;&#1085;&#1087;&#1088;&#1089;&#1090;&#1092;&#1093;&#1094;&#1095;&#1096;&#1097;]+)-\1[&#1072;&#1110;&#1077;&#1086;&#1091;&#1103;&#1108;&#1102;]{1,2}\2&#8221;]</bold></p>
</disp-quote>
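<p>The adaptation to each alphabet can be sketched as a single template filled with language-specific character classes. The Python snippet below is only an illustration: the dictionary ALPHABETS and the functions closed_syllable_regex and cql_query are hypothetical names, the character classes are copied from Queries 1 to 4, and the generated query strings are written with straight quotation marks on the assumption that this is what the corpus interfaces expect.</p>
<preformat>
```python
import re

# Consonant and vowel classes copied from Queries 1-4 (hypothetical container).
ALPHABETS = {
    "en": ("bcdfghklmnprstvwxz", "aeiou"),
    "pt": ("bcdfghklmnprstvwxzç", "aieouáíéóúäâãéêõöô"),
    "es": ("bcdfghklmnprstvwxzçñ", "aieouáíéóúäâãéêõöô"),
    "uk": ("бвгґджзклмнпрстфхцчшщ", "аіеоуяєю"),
}


def closed_syllable_regex(lang):
    """Build the hyphenated closed-syllable pattern for one language."""
    consonants, vowels = ALPHABETS[lang]
    c = f"([{consonants}]+)"     # captured consonant group
    v = f"[{vowels}]{{1,2}}"     # one or two vowel graphemes
    return f"{c}{v}{c}-\\1{v}\\2"


def cql_query(lang):
    """Wrap the pattern in a query on the word attribute."""
    return f'[word="{closed_syllable_regex(lang)}"]'


print(cql_query("uk"))
# Offline check of one token per language (tokens taken from the examples above).
for lang, token in [("pt", "zuns-zuns"), ("es", "tan-tán"), ("uk", "тік-так")]:
    print(lang, token, re.fullmatch(closed_syllable_regex(lang), token) is not None)
```
</preformat>
<p>Keeping the template separate from the alphabets makes it easier to see that Queries 2 to 4 differ from Query 1 only in their character classes.</p>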
</sec>
<sec>
<title>3.5. Validation of examples</title>
<p>We consider valid those examples that are interjectional onomatopoeias, i.e., sound-imitating words belonging to the class of interjections. Since a corpus query cannot separate sound-imitating words from phonically motivated lexemes that are no longer interjections but could have been qualified as such at the moment of their creation, we also counted such &#8220;etymological&#8221; onomatopoeias as valid examples (e.g., <italic>zigzag, criss-cross, flip-flops</italic>, etc.). Although this decision may seem arbitrary, we aim to evaluate the queries&#8217; capacity to match the required graphic patterns rather than their usability for tracing the evolution of word meaning. At the same time, we exclude from this survey other types of onomatopoeia expressed by nouns and verbs. For instance, the results of Query 7 (<xref ref-type="table" rid="T2">Table 2</xref>) include the words <italic>murmur</italic> and <italic>barber</italic>, which might be valid for other objectives. Whereas conventional onomatopoeias can be checked in dictionaries, to judge the onomatopoeic function of occasional words we use contextual analysis at the level of the concordance line or paragraph.</p>
<table-wrap id="T2">
<label>Table 2</label>
<caption>
<p>Results for the Leeds Collection of English Corpora (Internet Corpus).</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top"><bold>Type of syllables</bold></td>
<td align="left" valign="top"><bold>Query and extracted examples</bold></td>
<td align="left" valign="top"><bold>Valid examples out of the first 100</bold></td>
<td align="left" valign="top"><bold>Overall results in the corpus</bold></td>
</tr>
<tr>
<td align="left" valign="top">Repeated hyphenated closed syllables</td>
<td align="left" valign="top">(Query 5) [word&#8220;([bcdfghklmnprstvwxz]+)[aeiou]{1,2} ([bcdfghklmnprstvwxz]+)-\1[aeiou]{1,2}\2&#8221;]<break/><bold>Valid examples</bold>: <italic>beep-beep, bling-bling, boing-boing, bon-bon, brrring-brrring, bump-bump, chit-chat, chop-chop, chow-chow, chug-chug, chun-chuan, clip-clop, cous-cous, criss-cross, der-der, dig-dug, ding-dong</italic>.</td>
<td align="left" valign="top">61</td>
<td align="left" valign="top">956</td>
</tr>
<tr>
<td align="left" valign="top">Repeated hyphenated open syllables</td>
<td align="left" valign="top">(Query 6) [word=&#8221;([bcdfghklmnprstvwxz]+)[aeiou]{1,2}h?-\1[aeiou]{1,2}h?&#8221;]<break/><bold>Valid examples</bold>: <italic>bee-bee, beh-beh, bi-bi, bla-bla, blah-blah, boo-boo, cha-cha, chi-chi, choo-choo, coo-coo, da-da, do-dah, do-do, doo-dah, doo-doo, duh-duh, foo-foo, froo-froo, frou-frou, ga-ga, go-go</italic>.</td>
<td align="left" valign="top">78</td>
<td align="left" valign="top">475</td>
</tr>
<tr>
<td align="left" valign="top">Repeated non-hyphenated closed syllables</td>
<td align="left" valign="top">(Query 7) [word=&#8221;([bcdfghklmnprstvwxz]+)[aeiou]{1,2} ([bcdfghklmnprstvwxz]+)\1[aeiou]{1,2}\2&#8221;]<break/><bold>Valid examples</bold>: <italic>boingboing, chinchin, hahhah, xiangxing</italic>.</td>
<td align="left" valign="top">4</td>
<td align="left" valign="top">993</td>
</tr>
<tr>
<td align="left" valign="top">Repeated non-hyphenated open syllables</td>
<td align="left" valign="top">(Query 8) [word=&#8221;([bcdfghklmnprstvwxz]+)[aeiou] {1,2}h?\1[aeiou]{1,2}h?&#8221;]<break/><bold>Valid examples</bold>: 0.</td>
<td align="left" valign="top">0</td>
<td align="left" valign="top">1000</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec>
<title>4. Results and discussion</title>
<p>The queries created for extracting data can be qualified as models, since they represent a generalized schema suitable for the search of the extensive set of phenomena in question. The efficiency of a model can be measured differently by applying such parameters as <italic>accuracy, precision, sensitivity</italic> and other commonly used metrics.</p>
<p>When a model is applied to previously unknown data without any established benchmark, it is impossible to calculate the sensitivity (also called <italic>recall</italic>), which is the proportion of retrieved true cases out of all the true cases in the population. Nor can we calculate the accuracy, which represents the proportion of true positive and true negative cases in relation to the total number of cases in the dataset, as we cannot know the number of true negatives. In contrast, the precision shows what proportion of the retrieved items are true cases, and it can be calculated on the retrieved data alone; since our aim is to select as many onomatopoeias as possible from a corpus, this is the parameter we apply. For this reason, to roughly evaluate the validity of a query in terms of precision, we count the number of valid examples among the first 100 examples in the concordance lines.</p>
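<p>The rough precision estimate described above amounts to a simple proportion. The following Python sketch is only illustrative: the function name precision_at_k is hypothetical, and the list of judgements merely reproduces the counts reported for Query 5 in Table 2 (61 valid examples among the first 100 lines), with an invented order.</p>
<preformat>
```python
def precision_at_k(judgements, k=100):
    """Share of valid items among the first k manually judged concordance lines."""
    sample = list(judgements)[:k]
    if not sample:
        raise ValueError("no judged concordance lines")
    return sum(1 for is_valid in sample if is_valid) / len(sample)


# Hypothetical judgements matching the Query 5 row of Table 2:
# 61 lines judged valid and 39 judged invalid (the order is invented).
query_5 = [True] * 61 + [False] * 39
print(precision_at_k(query_5))
```
</preformat>
<p>Only the retrieved lines enter the calculation, which is why this estimate, unlike recall or accuracy, does not require knowing how many onomatopoeias the corpus contains.</p>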
<sec>
<title>4.1. Retrieved data description</title>
<p>Representing repeated syllables at the grapheme level raises a series of questions, such as syllable division, diphthongization and hiatus, unpronounced graphemes, and open and closed syllables. Nevertheless, some of these dilemmas can be overcome by assuming that onomatopoeic words are not created solely according to strict phonic patterns: sometimes the repetition may include one or several syllables (<italic>meow, meow-meow</italic>), and some onomatopoeias have variant spellings (<italic>achoo, atchoo, achew</italic>); therefore, there is probably little sense in complying rigorously with syllable divisions. Additional observations show that the vast majority of onomatopoeias start with a consonant grapheme and end with another consonant grapheme or, in some cases, with a vowel. In other words, the main feature to take into account in the corpus queries is the repetition of a sequence that starts with consonant graphemes, continues with vowels, and optionally ends with another consonant grapheme or a group of such graphemes.</p>
<p>Given that the hyphen can also be optional, the possible queries are to rely upon a four-member paradigm:</p>
<list list-type="bullet">
<list-item><p>repeated non-hyphenated closed syllables;</p></list-item>
<list-item><p>repeated hyphenated open syllables;</p></list-item>
<list-item><p>repeated hyphenated closed syllables;</p></list-item>
<list-item><p>repeated non-hyphenated open syllables.</p></list-item>
</list>
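<p>The four-member paradigm can be sketched as four variants of one template. The Python snippet below is an illustration under stated assumptions: the function names paradigm and matching_types are hypothetical, the character classes are those of the English queries in Table 2, and the optional h after the vowels of open syllables follows Queries 6 and 8.</p>
<preformat>
```python
import re


def paradigm(consonants, vowels):
    """Return the four repetition patterns as regular-expression strings."""
    c = f"([{consonants}]+)"     # captured consonant group
    v = f"[{vowels}]{{1,2}}"     # one or two vowel graphemes
    return {
        "hyphenated closed": f"{c}{v}{c}-\\1{v}\\2",
        "hyphenated open": f"{c}{v}h?-\\1{v}h?",
        "non-hyphenated closed": f"{c}{v}{c}\\1{v}\\2",
        "non-hyphenated open": f"{c}{v}h?\\1{v}h?",
    }


def matching_types(token, patterns):
    """List the paradigm members whose pattern matches the whole token."""
    return [name for name, rx in patterns.items() if re.fullmatch(rx, token)]


ENGLISH = paradigm("bcdfghklmnprstvwxz", "aeiou")
# Sample tokens taken from Table 2 and from the discussion of Query 7.
for token in ["ding-dong", "choo-choo", "boingboing", "murmur"]:
    print(token, matching_types(token, ENGLISH))
```
</preformat>
<p>A token may fit more than one member of the paradigm (for instance, blah-blah fits both hyphenated patterns because h also belongs to the consonant class), so the four queries are not mutually exclusive.</p>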
<p>Hereafter, in <xref ref-type="table" rid="T2">Tables 2</xref>, <xref ref-type="table" rid="T3">3</xref>, <xref ref-type="table" rid="T4">4</xref> and <xref ref-type="table" rid="T5">5</xref>, we present the results yielded by the respective queries together with the valid examples found among the first 100 lines of the concordances; the number of occurrences of repeated forms is given in parentheses.</p>
<table-wrap id="T3">
<label>Table 3</label>
<caption>
<p>Results for the <italic>CRPC</italic>.</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top"><bold>Type of syllables</bold></td>
<td align="left" valign="top"><bold>Query and extracted examples</bold></td>
<td align="left" valign="top"><bold>Valid examples out of the first 100</bold></td>
<td align="left" valign="top"><bold>Overall results in the corpus</bold></td>
</tr>
<tr>
<td align="left" valign="top">Repeated hyphenated closed syllables</td>
<td align="left" valign="top">(Query 9) [word=&#8221;([bcdfghklmnprstvwxz&#231;]+)[ aieou&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234;&#245;&#246;&#244;]{1,2}([bcdfghklmnprstvwxz&#231;]+)-\1[aieou&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234;&#245;&#246;&#244;]{1,2}\2&#8221;]<break/><bold>Valid examples:</bold> <italic>flic-flac(14), ping-pong (19), can-can (2), bip-bip, tan-tan, tim-tim(4), flics-flacs, zig-zag, zuns-zuns, den-den, hip-hop (28), tic-tac, tam-tam, tchim-</italic>tchim (3).</td>
<td align="left" valign="top">78</td>
<td align="left" valign="top">1007</td>
</tr>
<tr>
<td align="left" valign="top">Repeated hyphenated open syllables</td>
<td align="left" valign="top">(Query 10) [word=&#8221;([bcdfghklmnprstvwxz&#231;]+)[aieou&#225;&#237; &#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234;&#245;&#246;&#244;]{1,2}h?-\1[aieou&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234;&#245;&#246;&#244;]{1,2}h?&#8221;]<break/><bold>Valid examples:</bold> <italic>cri-cri, ts&#233;-ts&#233; (40), bla-bla (5), fru-fru, frou-frou, tau-tau, wha-wha (3), glu-glu (2), tai-tai, chi-chi, xi-xi</italic>.</td>
<td align="left" valign="top">60</td>
<td align="left" valign="top">296</td>
</tr>
<tr>
<td align="left" valign="top">Repeated non-hyphenated closed syllables</td>
<td align="left" valign="top">(Query 11) [word=&#8221;([bcdfghklmnprstvwxz&#231;]+)[aieou&#225;&#237;&#233;&#243;&#250; &#228;&#226;&#227;&#233;&#234;&#245;&#246;&#244;]{1,2}([bcdfghklmnprstvwxz&#231;]+)\1[aieou&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234;&#245;&#246;&#244;]{1,2}\2&#8221;]<break/><bold>Valid examples</bold>: 0.</td>
<td align="left" valign="top">0</td>
<td align="left" valign="top">73,223</td>
</tr>
<tr>
<td align="left" valign="top">Repeated non-hyphenated open syllables</td>
<td align="left" valign="top">(Query 12) [word=&#8221;([bcdfghklmnprstvwxz]+)[aieou&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234; &#245;&#246;&#244;]{1,2}h?\1[aieou&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234;&#245;&#246;&#244;]{1,2}h?&#8221;]<break/><bold>Valid examples</bold>: 0.</td>
<td align="left" valign="top">0</td>
<td align="left" valign="top">322,820</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T4">
<label>Table 4</label>
<caption>
<p>Results for Spanish/Leeds Collection of Internet Corpora.</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top"><bold>Type of syllables</bold></td>
<td align="left" valign="top"><bold>Query and extracted examples</bold></td>
<td align="left" valign="top"><bold>Valid examples out of the first 100</bold></td>
<td align="left" valign="top"><bold>Overall results in the corpus</bold></td>
</tr>
<tr>
<td align="left" valign="top">Repeated hyphenated closed syllables</td>
<td align="left" valign="top">(Query 13) [word=&#8221;([bcdfghklmnprstvwxz&#231;]+)[aieou&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227; &#233;&#234;&#245;&#246;&#252;&#244;]{1,2}([bcdfghklmnprstvwxz&#231;]+)-\1[aieou&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234;&#245;&#246;&#252;&#244;]{1,2}\2&#8221;]<break/><bold>Valid examples</bold>: <italic>zig-zag, tut-tut, tun-tun, tic-tac, tap-tap, tan-t&#225;n, tam-tam, run-run, ruc-ruc, ris-ras, pon-pon, pis-pas, pin-pon, ping-pong, pim-pom, pil-pil, pill-pill, mish-mash, kin-kan, hip-hop, cric-cric, cric-crac, cous-cous, click-clack, chow-chow, chon-chon, chis-chas, chin-chin, chal-chal, can-can, bum-bum, boom-boom, bip-bip</italic>.</td>
<td align="left" valign="top">91</td>
<td align="left" valign="top">304</td>
</tr>
<tr>
<td align="left" valign="top">Repeated hyphenated open syllables</td>
<td align="left" valign="top">(Query 14) [word=&#8221;([bcdfghklmnprstvwxz&#231;]+)[aieou&#225;&#237;&#233;&#243;&#250;&#228; &#226;&#227;&#233;&#234;&#245;&#246;&#252;&#244;]{1,2}h?-\1[aieou&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234;&#245;&#246;&#252;&#244;]{1,2}h?&#8221;]<break/><bold>Valid examples</bold>: <italic>cu-cu, trau-trau, poo-poo, boo-boo, bu-bu, no-ni, re&#233;-r&#237;o, pro-prio, da-da, f&#237;o-f&#237;o, xie-xie, tsi-tsi, go-g&#243;, feo-feo, blah-blah, bla-bla, fr&#250;-fr&#250;, pai-pai, re-re, deu-da, du-du&#225;, cri-cri, cu-c&#250;, pi-pi, wah-wah, cua-cua, tue-tue, bee-bee, tse-tse, fru-fru, ga-ga, ka-ke, ts&#233;-ts&#233;, pa-pa, p&#237;o-p&#237;o, fru-fr&#250;, mi-mi</italic>.</td>
<td align="left" valign="top">45</td>
<td align="left" valign="top">62</td>
</tr>
<tr>
<td align="left" valign="top">Repeated non-hyphenated closed syllables</td>
<td align="left" valign="top">(Query 15) [word=&#8221;([bcdfghklmnprstvwxz&#231;]+)[aieou&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234;&#245;&#246;&#252;&#244;]{1,2}([bcdfghklmnprstvwxz&#231;]+)\1[aieou&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234;&#245;&#246;&#252;&#244;]{1,2}\2&#8221;]<break/><bold>Valid examples</bold>: 0.</td>
<td align="left" valign="top">0</td>
<td align="left" valign="top">992</td>
</tr>
<tr>
<td align="left" valign="top">Repeated non-hyphenated open syllables</td>
<td align="left" valign="top">(Query 16) [word=&#8221;([bcdfghklmnprstvwxz]+)[aieou&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234;&#245;&#246;&#252;&#244;]{1,2}h?\1[aieou&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234;&#245;&#246;&#252;&#244;]{1,2}h?&#8221;]<break/><bold>Valid examples</bold>: 0.</td>
<td align="left" valign="top">0</td>
<td align="left" valign="top">999</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T5">
<label>Table 5</label>
<caption>
<p>Results for the Corpus <italic>GRAK</italic>-16.</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top"><bold>Type of syllables</bold></td>
<td align="left" valign="top"><bold>Query and extracted examples</bold></td>
<td align="left" valign="top"><bold>Valid examples out of the first 100</bold></td>
<td align="left" valign="top"><bold>Overall results in the corpus</bold></td>
</tr>
<tr>
<td align="left" valign="top">Repeated hyphenated closed syllables</td>
<td align="left" valign="top">(Query 17)<break/>[word=&#8221;([&#1073;&#1074;&#1075;&#1169;&#1076;&#1078;&#1079;&#1081;&#1082;&#1083;&#1084;&#1085;&#1087;&#1088;&#1089;&#1090;&#1092;&#1093;&#1094;&#1095;&#1096;&#1097;]+)<break/>[&#1072;&#1110;&#1077;&#1086;&#1091;&#1103;&#1108;&#1102;]{1,2}<break/>([&#1073;&#1074;&#1075;&#1169;&#1076;&#1078;&#1079;&#1082;&#1083;&#1084;&#1085;&#1087;&#1088;&#1089;&#1090;&#1092;&#1093;&#1094;&#1095;&#1096;&#1097;]+)-\1[&#1072;&#1110;&#1077;&#1086;&#1091;&#1103;&#1108;&#1102;]<break/>{1,2}\2?[&#1072;&#1110;&#1077;&#1086;&#1091;&#1103;&#1108;&#1102;]?.*&#8221;]<break/><bold>Valid examples</bold>: <styled-content style="font-family: Charis SIL"><italic>&#1073;&#1077;&#1085;-&#1073;&#1077;&#1085;-&#1073;&#1077;&#1085;, &#1073;&#1086;&#1084;-&#1073;&#1086;&#1084;, &#1073;&#1088;&#1091;&#1084;-&#1073;&#1088;&#1091;&#1084;&#1082;&#1072;&#1102;&#1095;&#1080;, &#1075;&#1077;&#1085;-&#1075;&#1077;&#1085; (6), &#1075;&#1086;&#1087;-&#1075;&#1086;&#1087;-&#1083;&#1103;, &#1075;&#1091;&#1088;-&#1075;&#1091;&#1088;, &#1075;&#1091;&#1087;-&#1075;&#1091;&#1087;, &#1075;&#1072;&#1074;-&#1075;&#1072;&#1074;, &#1076;&#1079;&#1103;&#1074;-&#1076;&#1079;&#1103;&#1074; (2), &#1082;&#1072;&#1169;-&#1082;&#1072;&#1169;, &#1084;&#1091;&#1088;-&#1084;&#1091;&#1088; (3), &#1085;&#1102;&#1093;-&#1085;&#1102;&#1093;, &#1088;&#1072;&#1079;-&#1088;&#1072;&#1079;, &#1088;&#1086;&#1093;-&#1088;&#1086;&#1093;, &#1088;&#1086;&#1093;-&#1088;&#1086;&#1093;-&#1088;&#1086;&#1093;, &#1090;&#1110;&#1082;-&#1090;&#1072;&#1082;, &#1090;&#1091;&#1078;-&#1090;&#1091;&#1078;&#1073;, &#1090;&#1091;&#1082;-&#1090;&#1072;&#1082;-&#1090;&#1091;&#1082;, &#1090;&#1091;&#1082;-&#1090;&#1072;&#1082;-&#1090;&#1091;&#1082;-&#1090;&#1072;&#1082;, &#1090;&#1091;&#1078;-&#1090;&#1091;&#1078;, &#1090;&#1091;&#1087;-&#1090;&#1091;&#1087;, &#1095;&#1086;&#1074;&#1075;-&#1095;&#1077;&#1088;&#1093; (3)</italic></styled-content>.</td>
<td align="left" valign="top">32</td>
<td align="left" valign="top">35,202</td>
</tr>
<tr>
<td align="left" valign="top">Repeated hyphenated open syllables</td>
<td align="left" valign="top">(Query 18)<break/>[word=&#8221;([&#1073;&#1074;&#1075;&#1169;&#1076;&#1078;&#1079;&#1082;&#1083;&#1084;&#1085;&#1087;&#1088;&#1089;&#1090;&#1092;&#1093;&#1094;&#1095;&#1096;&#1097;]+)<break/>[&#1072;&#1110;&#1077;&#1086;&#1091;&#1103;&#1108;&#1102;]{1,2}-([&#1073;&#1074;&#1075;&#1169;&#1076;&#1078;&#1079;&#1082;&#1083;&#1084;&#1085;&#1087;&#1088;&#1089;&#1090;&#1092;&#1093;&#1094;&#1095;&#1096;&#1097;]+)<break/>[&#1072;&#1110;&#1077;&#1086;&#1091;&#1103;&#1108;&#1102;]{1,2}.*&#8221;]<break/><bold>Valid examples</bold>: <styled-content style="font-family: Charis SIL"><italic>&#1075;&#1086;-&#1075;&#1086;-&#1075;&#1086;, &#1075;&#1091;-&#1075;&#1091; (2), &#1075;&#1091;-&#1075;&#1091;-&#1075;&#1091; (2), &#1082;&#1091;-&#1082;&#1091;, &#1093;&#1072;-&#1093;&#1072;, &#1090;&#1072;-&#1090;&#1072;, &#1090;&#1088;&#1072;-&#1090;&#1072;-&#1090;&#1072;-&#1090;&#1072;, &#1090;&#1091;-&#1090;&#1091;&#1084; (4), &#1090;&#1100;&#1092;&#1091;-&#1090;&#1100;&#1092;&#1091;-&#1090;&#1100;&#1092;&#1091; (2), &#1093;&#1110;-&#1093;&#1110;&#1082;&#1072;&#1085;&#1085;&#1103;, &#1093;&#1091;-&#1093;&#1091;-&#1093;&#1091;, &#1093;&#1072;-&#1093;&#1072; (2), &#1093;&#1072;-&#1093;&#1072;-&#1093;&#1072; (2), &#1096;&#1091;-&#1096;&#1091;-&#1096;&#1091; (2)</italic></styled-content>.</td>
<td align="left" valign="top">21</td>
<td align="left" valign="top">50,371</td>
</tr>
<tr>
<td align="left" valign="top">Repeated non-hyphenated closed syllables</td>
<td align="left" valign="top">(Query 19)<break/>[word=&#8221;([&#1073;&#1074;&#1075;&#1169;&#1076;&#1078;&#1079;&#1082;&#1083;&#1084;&#1085;&#1087;&#1088;&#1089;&#1090;&#1092;&#1093;&#1094;&#1095;&#1096;&#1097;]+)<break/>[&#1072;&#1110;&#1077;&#1086;&#1091;&#1103;&#1108;&#1102;]{1,2}([&#1073;&#1074;&#1075;&#1169;&#1076;&#1078;&#1079;&#1082;&#1083;&#1084;&#1085;&#1087;&#1088;&#1089;&#1090;&#1092;&#1093;&#1094;&#1095;&#1096;&#1097;]+)\1[&#1072;&#1110;&#1077;&#1086;&#1091;&#1103;&#1108;&#1102;]{1,2}\2?[&#1072;&#1110;&#1077;&#1086;&#1091;&#1103;&#1108;&#1102;]?.*&#8221;]<break/><bold>Valid examples</bold>: 0.</td>
<td align="left" valign="top">0</td>
<td align="left" valign="top">5,504,583</td>
</tr>
<tr>
<td align="left" valign="top">Repeated non-hyphenated open syllables</td>
<td align="left" valign="top">(Query 20)<break/>[word=&#8221;([&#1073;&#1074;&#1075;&#1169;&#1076;&#1078;&#1079;&#1082;&#1083;&#1084;&#1085;&#1087;&#1088;&#1089;&#1090;&#1092;&#1093;&#1094;&#1095;&#1096;&#1097;]+)<break/>[&#1072;&#1110;&#1077;&#1086;&#1091;&#1103;&#1108;&#1102;]{1,2}-([&#1073;&#1074;&#1075;&#1169;&#1076;&#1078;&#1079;&#1082;&#1083;&#1084;&#1085;&#1087;&#1088;&#1089;&#1090;&#1092;&#1093;&#1094;&#1095;&#1096;&#1097;]+)<break/>[&#1072;&#1110;&#1077;&#1086;&#1091;&#1103;&#1108;&#1102;]{1,2}.*&#8221;]<break/><bold>Valid examples</bold>: <styled-content style="font-family: Charis SIL"><italic>&#1076;&#1091;-&#1076;&#1091;, &#1093;&#1072;-&#1093;&#1072;-&#1093;&#1072;</italic></styled-content>.</td>
<td align="left" valign="top">2</td>
<td align="left" valign="top">432,002</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>A notable result is that queries 5, 6, 9, 10, 13, 14, 17 and 18 yielded a much higher rate of valid examples than the other queries. Many of the retrieved onomatopoeias do not figure in the reference explanatory dictionaries. For example, out of 25 different onomatopoeias retrieved from the <italic>CRPC</italic> by queries 9 and 10, 17 do not appear as entries in the dictionary <italic>Priberam</italic> (2023) as sound-imitating lexemes, in either bisyllabic or monosyllabic form: <italic>cri-cri, frou-frou, wha-wha, glu-glu, tai-tai, chi-chi, flic-flac, can-can, tan-tan, tim-tim, flics-flacs, zig-zag, zuns-zuns, den-den, hip-hop, tic-tac, tam-tam</italic>. This means that a significant part of the examples found are occasional sound-imitating words missing from the entries of modern dictionaries.</p>
<p>The impressive number of retrieved examples of repeated non-hyphenated open syllables observed in Portuguese and Ukrainian is likely due to a universal linguistic tendency. In the Leeds Corpora, the interface limited the maximum number of examples to 1,000, which is why we could not access the complete results.</p>
<p><xref ref-type="table" rid="T2">Tables 2</xref>, <xref ref-type="table" rid="T3">3</xref>, <xref ref-type="table" rid="T4">4</xref>, and <xref ref-type="table" rid="T5">5</xref> also show that the more concordance lines a query generates overall, the less specifically it targets onomatopoeias.</p>
</sec>
<sec>
<title>4.2. Exploring markers&#8217; statistical significance</title>
<p><xref ref-type="table" rid="T6">Table 6</xref> integrates the number of specific results produced by the queries 5&#8211;20 in the four corpora excluding the types of queries that yielded 0 results.</p>
<table-wrap id="T6">
<label>Table 6</label>
<caption>
<p>Integrated results: precision of patterns implemented in each query.</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top"><bold>Type of syllable</bold></td>
<td align="left" valign="top"><bold>Valid examples over 100</bold></td>
</tr>
<tr>
<td align="left" valign="top">Repeated hyphenated closed syllables (English)</td>
<td align="left" valign="top">61</td>
</tr>
<tr>
<td align="left" valign="top">Repeated hyphenated open syllables (English)</td>
<td align="left" valign="top">78</td>
</tr>
<tr>
<td align="left" valign="top">Repeated non-hyphenated closed syllables (English)</td>
<td align="left" valign="top">4</td>
</tr>
<tr>
<td align="left" valign="top">Repeated hyphenated closed syllables (Portuguese)</td>
<td align="left" valign="top">83</td>
</tr>
<tr>
<td align="left" valign="top">Repeated hyphenated open syllables (Portuguese)</td>
<td align="left" valign="top">60</td>
</tr>
<tr>
<td align="left" valign="top">Repeated hyphenated closed syllables (Spanish)</td>
<td align="left" valign="top">91</td>
</tr>
<tr>
<td align="left" valign="top">Repeated hyphenated open syllables (Spanish)</td>
<td align="left" valign="top">45</td>
</tr>
<tr>
<td align="left" valign="top">Repeated hyphenated closed syllables (Ukrainian)</td>
<td align="left" valign="top">32</td>
</tr>
<tr>
<td align="left" valign="top">Repeated hyphenated open syllables (Ukrainian)</td>
<td align="left" valign="top">22</td>
</tr>
<tr>
<td align="left" valign="top">Repeated non-hyphenated open syllables (Ukrainian)</td>
<td align="left" valign="top">2</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>On average, the repeated hyphenated closed-syllable patterns yielded the highest proportion of valid results across the four languages, although in English the open-syllable pattern scored higher (78 against 61). To confirm that this is a tendency rather than an occasional combination of factors, we transposed the data as in <xref ref-type="table" rid="T7">Table 7</xref> and subjected it to a single-factor ANOVA test, as we expected the type of syllable pattern to be the only significant factor in this experiment.</p>
<table-wrap id="T7">
<label>Table 7</label>
<caption>
<p>Number of useful results (over 100) as per each type of query.</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top"></td>
<td align="left" valign="top"><bold>Repeated hyphenated closed syllables</bold></td>
<td align="left" valign="top"><bold>Repeated hyphenated open syllables</bold></td>
<td align="left" valign="top"><bold>Repeated non-hyphenated closed syllables</bold></td>
<td align="left" valign="top"><bold>Repeated non-hyphenated open syllables</bold></td>
</tr>
<tr>
<td align="left" valign="top">English</td>
<td align="left" valign="top">61</td>
<td align="left" valign="top">78</td>
<td align="left" valign="top">4</td>
<td align="left" valign="top">0</td>
</tr>
<tr>
<td align="left" valign="top">Portuguese</td>
<td align="left" valign="top">83</td>
<td align="left" valign="top">60</td>
<td align="left" valign="top">0</td>
<td align="left" valign="top">0</td>
</tr>
<tr>
<td align="left" valign="top">Spanish</td>
<td align="left" valign="top">91</td>
<td align="left" valign="top">45</td>
<td align="left" valign="top">0</td>
<td align="left" valign="top">0</td>
</tr>
<tr>
<td align="left" valign="top">Ukrainian</td>
<td align="left" valign="top">32</td>
<td align="left" valign="top">22</td>
<td align="left" valign="top">0</td>
<td align="left" valign="top">2</td>
</tr>
<tr>
<td align="left" valign="top">Average</td>
<td align="left" valign="top">66.75</td>
<td align="left" valign="top">51.25</td>
<td align="left" valign="top">1</td>
<td align="left" valign="top">0.5</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="T8">Table 8</xref> illustrates the results of the ANOVA test performed in Microsoft Excel.</p>
<table-wrap id="T8">
<label>Table 8</label>
<caption>
<p>Single-factor ANOVA test computed with the data analysis tool in Microsoft Excel.</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top" colspan="5">Anova: Single Factor</td>
</tr>
<tr>
<td align="left" valign="top"><bold>Groups</bold></td>
<td align="left" valign="top"><bold>Count</bold></td>
<td align="left" valign="top"><bold>Sum</bold></td>
<td align="left" valign="top"><bold>Average</bold></td>
<td align="left" valign="top"><bold>Variance</bold></td>
</tr>
<tr>
<td align="left" valign="top">Column 1</td>
<td align="left" valign="top">4</td>
<td align="left" valign="top">267</td>
<td align="left" valign="top">66.75</td>
<td align="left" valign="top">697.5833</td>
</tr>
<tr>
<td align="left" valign="top">Column 2</td>
<td align="left" valign="top">4</td>
<td align="left" valign="top">205</td>
<td align="left" valign="top">51.25</td>
<td align="left" valign="top">562.25</td>
</tr>
<tr>
<td align="left" valign="top">Column 3</td>
<td align="left" valign="top">4</td>
<td align="left" valign="top">4</td>
<td align="left" valign="top">1</td>
<td align="left" valign="top">4</td>
</tr>
<tr>
<td align="left" valign="top">Column 4</td>
<td align="left" valign="top">4</td>
<td align="left" valign="top">2</td>
<td align="left" valign="top">0.5</td>
<td align="left" valign="top">1</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td align="left" valign="top" colspan="7">ANOVA</td>
</tr>
<tr>
<td align="left" valign="top"><bold>Source of variation</bold></td>
<td align="left" valign="top"><bold>SS</bold></td>
<td align="left" valign="top"><bold>df</bold></td>
<td align="left" valign="top"><bold>MS</bold></td>
<td align="left" valign="top"><bold>F</bold></td>
<td align="left" valign="top"><bold>P-value</bold></td>
<td align="left" valign="top"><bold>F crit</bold></td>
</tr>
<tr>
<td align="left" valign="top">Between groups</td>
<td align="left" valign="top">14053.25</td>
<td align="left" valign="top">3</td>
<td align="left" valign="top">4684.417</td>
<td align="left" valign="top">14.81434</td>
<td align="left" valign="top">0.000245</td>
<td align="left" valign="top">3.490295</td>
</tr>
<tr>
<td align="left" valign="top">Within groups</td>
<td align="left" valign="top">3794.5</td>
<td align="left" valign="top">12</td>
<td align="left" valign="top">316.2083</td>
<td align="left" valign="top"></td>
<td align="left" valign="top"></td>
<td align="left" valign="top"></td>
</tr>
<tr>
<td align="left" valign="top">Total</td>
<td align="left" valign="top">17847.75</td>
<td align="left" valign="top">15</td>
<td align="left" valign="top"></td>
<td align="left" valign="top"></td>
<td align="left" valign="top"></td>
<td align="left" valign="top"></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The <italic>p-value</italic> (i.e., the probability of obtaining differences at least this large if the query pattern had no effect) equals 0.000245 (<xref ref-type="table" rid="T8">Table 8</xref>). This value is far lower than the conventional threshold of 0.05 (5%), which indicates that the differences among the four query patterns are unlikely to be due to chance. The hyphenated patterns, and the closed-syllable one in particular, turn out to be the most productive query patterns in the four observed languages, making the hyphen a robust onomatopoeic marker.</p>
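<p>The F statistic in <xref ref-type="table" rid="T8">Table 8</xref> can be recomputed directly from the counts in <xref ref-type="table" rid="T7">Table 7</xref>. The following is a minimal sketch in plain Python, not the Excel procedure used in the study; the function name <monospace>one_way_anova</monospace> and the dictionary layout are illustrative assumptions. It computes only the sums of squares, the degrees of freedom, and F; obtaining the p-value would additionally require the F distribution from a statistics library.</p>
<code language="python"><![CDATA[
```python
# Valid examples per 100 concordance lines, copied from Table 7
# (order within each list: English, Portuguese, Spanish, Ukrainian).
groups = {
    "hyphenated closed": [61, 83, 91, 32],
    "hyphenated open": [78, 60, 45, 22],
    "non-hyphenated closed": [4, 0, 0, 0],
    "non-hyphenated open": [0, 0, 0, 2],
}


def one_way_anova(samples):
    """Return (SS_between, SS_within, df_between, df_within, F)."""
    all_values = [x for s in samples for x in s]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(s) / len(s) for s in samples]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(s) * (m - grand_mean) ** 2
                     for s, m in zip(samples, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for s, m in zip(samples, means) for x in s)
    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat


ss_b, ss_w, df_b, df_w, f_stat = one_way_anova(list(groups.values()))
print(ss_b, ss_w, df_b, df_w, round(f_stat, 5))
```
]]></code>
<p>With the Table 7 counts, the sketch returns the same sums of squares (14053.25 and 3794.5), degrees of freedom (3 and 12), and F value (about 14.81) as reported in Table 8.</p>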
<p>Most hyphenated onomatopoeias exist in both bisyllabic and monosyllabic forms, e.g., <italic>bang-bang / bang, cling-cling / cling, ching-ching / ching, beep-beep / beep, plink-plink / plink</italic>. This property could be used at a further stage to optimize the extraction programmatically according to the following algorithm: if a pattern with repeated syllables occurs recurrently in a corpus, perform a monosyllabic search for the syllable used in the pattern.</p>
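<p>The two-step procedure described above could be sketched as follows. This is an illustration under stated assumptions, not an implementation used in the study: the token list is a hypothetical toy sample, the function name <monospace>monosyllabic_candidates</monospace> is invented for the example, and the threshold <monospace>min_repeats</monospace> is one possible reading of &#8220;recurrently&#8221;.</p>
<code language="python"><![CDATA[
```python
import re
from collections import Counter

# Hypothetical toy sample; a real run would read tokens from a corpus export.
tokens = ["bang-bang", "bang", "the", "door", "beep-beep", "beep", "beep",
          "bang-bang", "plink-plink", "ring"]

# One syllable-like unit repeated after a hyphen, e.g. "bang-bang"
# or "tac-tac-tac". The lazy group captures the shortest repeated base.
REDUPLICATED = re.compile(r"^(\w+?)(?:-\1)+$")


def monosyllabic_candidates(tokens, min_repeats=2):
    """Map each recurrent reduplicated base to the count of its bare form."""
    counts = Counter(t.lower() for t in tokens)
    bases = Counter()
    # Step 1: collect the bases of hyphenated reduplications.
    for token, n in counts.items():
        match = REDUPLICATED.match(token)
        if match:
            bases[match.group(1)] += n
    # Step 2: look up the monosyllabic form of every recurrent base.
    return {base: counts[base]
            for base, n in bases.items() if n >= min_repeats}


print(monosyllabic_candidates(tokens))
print(monosyllabic_candidates(tokens, min_repeats=1))
```
]]></code>
<p>In a real setting the second step would be a new corpus query for the bare syllable, and the hits would still need manual validation, since a form such as <italic>ring</italic> or <italic>bang</italic> may also occur as an ordinary noun or verb.</p>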
</sec>
<sec>
<title>4.3. Searching for interjectional onomatopoeias through POS-filter</title>
<p>Since interjections in the corpus <italic>GRAK</italic> have been correctly tagged, the queries can be repeated with an additional POS filter that limits the sample exclusively to interjections, which raised the precision roughly three- to fourfold. Queries 17 and 18 (<xref ref-type="table" rid="T5">Table 5</xref>) are now extended with this POS filter. The results are shown in <xref ref-type="table" rid="T9">Tables 9</xref> and <xref ref-type="table" rid="T10">10</xref>, respectively.</p>
<table-wrap id="T9">
<label>Table 9</label>
<caption>
<p>Repeated hyphenated closed syllables (Ukrainian) with pos-filter.</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top"><bold>Query and extracted examples</bold></td>
<td align="left" valign="top"><bold>Language/corpus</bold></td>
<td align="left" valign="top"><bold>Useful examples over 100</bold></td>
<td align="left" valign="top"><bold>Overall results</bold></td>
</tr>
<tr>
<td align="left" valign="top">(Query 21)<break/>[word=&#8221;([&#1073;&#1074;&#1075;&#1169;&#1076;&#1078;&#1079;&#1082;&#1083;&#1084;&#1085;&#1087;&#1088;&#1089;&#1090;&#1092;&#1093;&#1094;&#1095;&#1096;&#1097;]+)[&#1072;&#1110;&#1077;&#1086;&#1091;&#1103;&#1108;&#1102;]{1,2}([&#1073;&#1074;&#1075;&#1169;&#1076;&#1078;&#1079;&#1082;&#1083;&#1084;&#1085;&#1087;&#1088;&#1089;&#1090;&#1092;&#1093;&#1094;&#1095;&#1096;&#1097;]+)-\1[&#1072;&#1110;&#1077;&#1086;&#1091;&#1103;&#1108;&#1102;]{1,2}\2?[&#1072;&#1110;&#1077;&#1086;&#1091;&#1103;&#1108;&#1102;]?.*&#8221;&amp;tag=&#8221;.*intj.*&#8221;]<break/><bold>Valid examples</bold>: <styled-content style="font-family: Charis SIL"><italic>&#1084;&#1091;&#1088;-&#1084;&#1091;&#1088;, &#1088;&#1086;&#1093;-&#1088;&#1086;&#1093;, &#1090;&#1110;&#1082;-&#1090;&#1072;&#1082;, &#1075;&#1091;&#1088;-&#1075;&#1091;&#1088;, &#1075;&#1091;&#1087;-&#1075;&#1091;&#1087;, &#1090;&#1091;&#1087;-&#1090;&#1091;&#1087;, &#1090;&#1072;&#1082;-&#1090;&#1072;&#1082;, &#1073;&#1086;&#1084;-&#1073;&#1086;&#1084;, &#1088;&#1086;&#1093;-&#1088;&#1086;&#1093;-&#1088;&#1086;&#1093;, &#1076;&#1079;&#1103;&#1074;-&#1076;&#1079;&#1103;&#1074;, &#1075;&#1072;&#1074;-&#1075;&#1072;&#1074;, &#1082;&#1083;&#1072;&#1094;-&#1082;&#1083;&#1072;&#1094;, &#1094;&#1086;&#1082;-&#1094;&#1086;&#1082;-&#1094;&#1086;&#1082;-&#1094;&#1086;&#1082;, &#1082;&#1072;&#1087;-&#1082;&#1072;&#1087;, &#1090;&#1091;&#1082;-&#1090;&#1091;&#1082;, &#1082;&#1083;&#1072;&#1094;-&#1082;&#1083;&#1072;&#1094;-&#1082;&#1083;&#1072;&#1094;, &#1094;&#1091;&#1088;-&#1094;&#1091;&#1088;&#1072;, &#1084;&#1072;&#1088;&#1096;-&#1084;&#1072;&#1088;&#1096;, &#1073;&#1091;&#1084;-&#1073;&#1091;&#1084;, &#1090;&#1091;&#1082;-&#1090;&#1091;&#1082;-&#1090;&#1091;&#1082;, &#1094;&#1086;&#1082;-&#1094;&#1086;&#1082;, &#1085;&#1103;&#1074;-&#1085;&#1103;&#1074;, &#1090;&#1091;&#1087;-&#1090;&#1091;&#1087;-&#1090;&#1091;&#1087;, &#1096;&#1090;&#1086;&#1074;&#1093;-&#1096;&#1090;&#1086;&#1074;&#1093;, &#1094;&#1110;&#1087;-&#1094;&#1110;&#1087;-&#1094;&#1110;&#1087;, &#1095;&#1072;&#1093;-&#1095;&#1072;&#1093;-&#1095;&#1072;&#1093;-&#1095;&#1072;&#1093;, &#1075;&#1072;&#1084;-&#1075;&#1072;&#1084;, &#1090;&#1110;&#1082;-&#1090;&#1110;&#1082;, &#1073;&#1086;&#1084;-&#1073;&#1086;&#1084;-&#1073;&#1086;&#1084;, &#1090;&#1072;&#1082;-&#1090;&#1072;&#1082;-&#1090;&#1072;&#1082;, &#1094;&#1084;&#1086;&#1082;-&#1094;&#1084;&#1086;&#1082;, &#1089;&#1074;&#1103;&#1090;-&#1089;&#1074;&#1103;&#1090;, &#1089;&#1074;&#1103;&#1090;-&#1089;&#1074;&#1103;&#1090;-&#1089;&#1074;&#1103;&#1090;, &#1075;&#1072;&#1074;-&#1075;&#1072;&#1074;-&#1075;&#1072;&#1074;</italic></styled-content></td>
<td align="left" valign="top"><bold><italic>CQL, GRAK</italic> 16</bold></td>
<td align="left" valign="top"><bold>92</bold></td>
<td align="left" valign="top"><bold>3,513</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T10">
<label>Table 10</label>
<caption>
<p>Repeated hyphenated open syllables (Ukrainian) with pos-filter.</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top"><bold>Query and extracted examples</bold></td>
<td align="left" valign="top"><bold>Language/Corpus</bold></td>
<td align="left" valign="top"><bold>Useful examples over 100</bold></td>
<td align="left" valign="top"><bold>Overall results</bold></td>
</tr>
<tr>
<td align="left" valign="top">(Query 22)<break/>[word=&#8221;([&#1073;&#1074;&#1075;&#1169;&#1076;&#1078;&#1079;&#1082;&#1083;&#1084;&#1085;&#1087;&#1088;&#1089;&#1090;&#1092;&#1093;&#1094;&#1095;&#1096;&#1097;]+)<break/>[&#1072;&#1110;&#1077;&#1086;&#1091;&#1103;&#1108;&#1102;]{1,2}-([&#1073;&#1074;&#1075;&#1169;&#1076;&#1078;&#1079;&#1082;&#1083;&#1084;&#1085;&#1087;&#1088;&#1089;&#1090;&#1092;&#1093;&#1094;&#1095;&#1096;&#1097;]+)<break/>[&#1072;&#1110;&#1077;&#1086;&#1091;&#1103;&#1108;&#1102;]{1,2}.*&#8221;&amp;tag=&#8221;.*intj.*&#8221;]<break/><bold>Valid examples</bold>: <styled-content style="font-family: Charis SIL"><italic>&#1093;&#1072;-&#1093;&#1072;-&#1093;&#1072;, &#1093;&#1091;-&#1093;&#1091;-&#1093;&#1091;, &#1075;&#1086;-&#1075;&#1086;-&#1075;&#1086;, &#1093;&#1072;-&#1093;&#1072;, &#1082;&#1091;-&#1082;&#1091;, &#1085;&#1091;-&#1085;&#1091;-&#1085;&#1091;, &#1093;&#1077;-&#1093;&#1077;-&#1093;&#1077;, &#1075;&#1086;-&#1075;&#1086;, &#1093;&#1072;-&#1093;&#1072;-&#1093;&#1072;-&#1093;&#1072;, &#1075;&#1072;-&#1075;&#1072;-&#1075;&#1072;, &#1093;&#1072;-&#1093;&#1086;-&#1093;&#1086;-&#1093;&#1086;, &#1093;&#1086;-&#1093;&#1086;, &#1085;&#1091;-&#1085;&#1091;, &#1087;&#1110;-&#1087;&#1110;-&#1087;&#1110;, &#1093;&#1077;-&#1093;&#1077;, &#1082;&#1074;&#1072;-&#1082;&#1074;&#1072;, &#1084;&#1077;-&#1084;&#1077;-&#1084;&#1077;, &#1094;&#1091;-&#1094;&#1091;, &#1084;&#1077;-&#1084;&#1077;, &#1090;&#1072;-&#1090;&#1072;-&#1090;&#1072;, &#1093;&#1091;-&#1093;&#1091;, &#1073;&#1072;-&#1073;&#1072;&#1093;, &#1096;&#1072;-&#1096;&#1072;, &#1093;&#1086;-&#1093;&#1086;-&#1093;&#1086;-&#1093;&#1086;, &#1090;&#1102;-&#1090;&#1102;, &#1075;&#1086;-&#1075;&#1086;-&#1075;&#1086;-&#1075;&#1086;, &#1082;&#1091;-&#1082;&#1091;-&#1088;&#1110;-&#1082;&#1091;, &#1093;&#1110;-&#1093;&#1110;-&#1093;&#1110;, &#1090;&#1091;-&#1090;&#1091;, &#1090;&#1102;-&#1090;&#1102;-&#1090;&#1102;, &#1082;&#1091;-&#1082;&#1091;-&#1088;&#1110;-&#1082;&#1091;-&#1091;</italic></styled-content></td>
<td align="left" valign="top"><bold><italic>CQL, GRAK</italic> 16</bold></td>
<td align="left" valign="top"><bold>91</bold></td>
<td align="left" valign="top"><bold>5,284</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Among the corpora involved in this survey, only <italic>GRAK</italic> has correctly tagged interjections; hence, we tested the POS-filtered selection only for Ukrainian. The results obtained after adding the condition ([tag=&#8220;.*intj.*&#8221;]) are shown in <xref ref-type="table" rid="T9">Table 9</xref> and <xref ref-type="table" rid="T10">Table 10</xref>.</p>
<p>In fact, in many corpora interjections are tagged as nouns or adjectives. The recognition of interjections in transcribed corpora remains a serious challenge for modern corpus linguistics (<xref ref-type="bibr" rid="B28">Tellier et al., 2010</xref>):</p>
<disp-quote>
<p>Among the seven consistent tagging errors presented above, some posed theoretical challenges due to their essentially pragmatic function and difficulty of fitting into a &#8216;traditional&#8217; word class defined following morphosyntactic criteria. This is the case of pragmatic markers such as <italic>well</italic>, interjections such as <italic>oh, ah</italic>, and response forms such as <italic>yes, no, okay, yeah, sure</italic>. Other tagging errors only need a specific rule to help CLAWS4 disambiguate and assign the correct tag (<xref ref-type="bibr" rid="B12">Galiano &amp; Semeraro, 2023, p. 26</xref>).</p>
</disp-quote>
<p>The results of Query 22 (<xref ref-type="table" rid="T10">Table 10</xref>) confirm the need for text corpora with accurately tagged interjections. What stands out in <xref ref-type="table" rid="T9">Table 9</xref> and <xref ref-type="table" rid="T10">Table 10</xref> is the markedly superior performance of the queries in the corpus <italic>GRAK</italic>, which is apparently due to its correct interjection annotation, whereas in the English, Spanish, and Portuguese corpora used here interjectional onomatopoeias are tagged as nouns or adjectives. This discrepancy in POS tagging makes it impossible to use the POS attribute as a reliable extraction feature in those corpora.</p>
</sec>
<sec>
<title>4.4. Extraction from plain texts using regular expressions filter</title>
<p>The logical question arising from the observations of <xref ref-type="table" rid="T9">Table 9</xref> and <xref ref-type="table" rid="T10">Table 10</xref> is whether the queries take into account any linguistic feature. Corpora are textual databases enriched with linguistically relevant data, but the queries run on all the corpora except <italic>GRAK</italic> were based exclusively upon formal features such as grapheme combinations, with no relation to other linguistic properties. In fact, most queries include the attribute <italic>word</italic>. From the standpoint of corpus linguistics, a <italic>word</italic> is a specific sequence of symbols separated by delimiters such as spaces. This means that the queries search for particular sequences of letters using regular expression syntax; the same sequences can also be searched for in many text editors with regular-expression engines, such as <italic>Notepad++, Sublime Text</italic>, or similar programs, by applying the following query:</p>
<disp-quote>
<p>(Query 23) \b.*?([bcdfghjklmnprstvxz&#231;&#241;]+)[aieou&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234;&#245;&#246;&#244;]{1,2}([\w]+)-\1[aieou]{1,2}\2.*?\b</p>
</disp-quote>
<p>The regular expression in Query 23 implements the same closed-syllable pattern as Query 21 (<xref ref-type="table" rid="T9">Table 9</xref>), adapted to the Latin alphabet. As expected, in the search within the novel <italic>La sombra del &#225;guila</italic> (&#8216;The Eagle&#8217;s Shadow&#8217;) (<xref ref-type="bibr" rid="B19">P&#233;rez Reverte, 1993</xref>) all the matches were valid: <italic>cling-clang</italic> (seven times), <italic>bang-bang</italic> (three times), <italic>zas-zas</italic> (twice), and <italic>ras-ras</italic> (once). This result illustrates ideal precision, but says nothing about recall, since some valid cases of onomatopoeia might have been missed by the query. The novel narrates war events and is therefore rich in onomatopoeias, which was the main reason to use it as additional empirical data.</p>
<p>In contrast, the regular expression for the hyphenated open-syllable pattern (cf. Query 22, <xref ref-type="table" rid="T10">Table 10</xref>), once applied to the text by Arturo P&#233;rez Reverte, did not yield purely onomatopoeic examples. The matches were nevertheless related to another sound-imitating phenomenon, since they all imitated a character&#8217;s stuttering speech: <italic>vu-vuelto, lo-locos, lo-locos, su-suicidio, la-la, va-van, de-descuartizar, te-temo, po-posible, ma-malentendido, la-lapsus, he-hemos, po-polvo, ci-ciento, ma-ma&#241;ana, su-suman, co-compa&#241;&#237;a, lui-la, pa-parece, se-setecientos, he-heridos, ci-cierto, du-duele, so-sombra, sa-sacrificio, ge-gesta, pe-perdido, nu-nuevecitas, de-dem&#225;s, po-pod&#233;is, mi-mierda</italic> (<xref ref-type="bibr" rid="B19">P&#233;rez Reverte, 1993</xref>).</p>
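<p>The contrast between the closed-syllable and open-syllable hyphenated patterns on plain text can be reproduced with any regular-expression engine. The sketch below uses Python&#8217;s <monospace>re</monospace> module. It is an adaptation made under assumptions and not the exact expressions used in the study: both patterns restrict the coda to the consonant class, the open-syllable pattern adds a backreference to the onset, and the sample sentence is invented for the illustration (it is not a quotation from the novel).</p>
<code language="python"><![CDATA[
```python
import re

CONS = "bcdfghjklmnprstvxzçñ"
VOWEL = "aeiouáéíóú"

# Closed syllable repeated after a hyphen: onset + vowel(s) + coda,
# then the same onset, any vowel(s), and the same coda ("cling-clang").
CLOSED = re.compile(
    rf"\b([{CONS}]+)[{VOWEL}]{{1,2}}([{CONS}]+)-\1[{VOWEL}]{{1,2}}\2\b",
    re.IGNORECASE,
)

# Open syllable followed by a hyphen and the same onset ("lo-locos").
OPEN = re.compile(
    rf"\b([{CONS}]+)[{VOWEL}]{{1,2}}-\1[{VOWEL}]{{1,2}}\w*",
    re.IGNORECASE,
)

# Invented sample sentence (assumption, not taken from the novel).
text = "Se oyó cling-clang y luego bang-bang. Están lo-locos, no es po-posible."

closed = [m.group(0) for m in CLOSED.finditer(text)]
stutter = [m.group(0) for m in OPEN.finditer(text)]

print(closed)   # closed-syllable matches
print(stutter)  # open-syllable matches
```
]]></code>
<p>On this sample the closed-syllable expression returns only the two onomatopoeias, while the open-syllable expression returns only the stuttered words, which mirrors the behaviour reported for the novel.</p>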
<table-wrap id="T11">
<label>Table 11</label>
<caption>
<p>Repeated three or more letters followed by exclamation mark or ellipsis (<italic>CRPC</italic>, Portuguese).</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top"><bold>Query and extracted examples</bold></td>
<td align="left" valign="top"><bold>Useful examples over 100</bold></td>
<td align="left" valign="top"><bold>Overall results in the corpus</bold></td>
</tr>
<tr>
<td align="left" valign="top"><list list-type="bullet">
<list-item><p>(Query 24) [word=&#8221;\w*([a-z&#225;&#237;&#233;&#243;&#250;&#228;&#226;&#227;&#233;&#234;&#245;&#246;&#252;&#244;])\1\1\1\w*&#8221;][word=&#8221;!&#124;\.\.\.&#124;\&#8221;&#8221;]</p></list-item>
</list><break/><bold>Valid examples</bold>: <italic>Ohhhhh!, Ahhhhh!, Tssss, Ehhhh!, Zzzzzzzzzzzzzzzzz, Hiiiiiiiiii!, Hiiiiiiiiiiiii!, yeaaaargh!, Pfffff!, Uiiiiiiiiimm, Uuuiiiiiii!, zzzz, Prrriuuuuu, hurrrraaaaaaaaaa!, vrrrr!, haaaaaa, goooooooo, oooooooo, ooooooo, oooooo, ooooo, hurrrrraaaaaaaaa!, OOOhhhh!, M&#233;&#233;&#233;&#233;!, dggggg, zzzz!, zzzzzz!, Uuuuu, Hmmmm!, vruuuumm!, Zzzzzzzzzp!, Zzzzzzzzp!, Aaaaah!, aaaaaaahs!, Psssst!, Aaaaaaah!, Aiiiii!, Zzzzzzzz, Bzzzz!, Aaaaaa, Aaaaatchim!, Oooootchim!, Trrrrrr, Trrrrri, trrrrru, booiiiii, Ummmm, Doooooiiiis, Iiiii, m&#233;eeeeeeh!, rrrrrt!, Ihhhhhh!</italic></td>
<td align="left" valign="top"><bold>62</bold></td>
<td align="left" valign="top"><bold>231</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>4.5. Three or more equal letters as a marker</title>
<p>Given the constraints of available space, we are unable to explore the marker based upon letter repetition in greater detail, but it is worth outlining some preliminary observations to be tested in further research. Consonant repetition (at least three times) as an onomatopoeic marker was earlier observed by Orrequia-Barea and Mar&#237;n-Honor (<xref ref-type="bibr" rid="B18">2020, p. 53</xref>). Our observations suggest that adding further markers to queries on transcribed text corpora, such as exclamation marks, ellipses, or quotation marks, may significantly improve the results. Both vowels and consonants are relevant in this pattern. The results are shown in <xref ref-type="table" rid="T11">Table 11</xref>.</p>
<p>The sound sources are easily deducible from the nearest context, as shown in examples (7), (8), and (9):</p>
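<p>The combination of a repeated-letter marker with a following punctuation token can be sketched over a tokenized text as follows. This is a simplified illustration with assumed names (<monospace>ELONGATED</monospace>, <monospace>PUNCT</monospace>, <monospace>elongated_before_punct</monospace>) and a hypothetical token list; like the three backreferences in Query 24, it requires the same letter four times in a row.</p>
<code language="python"><![CDATA[
```python
import re

# A word containing one letter repeated four times in a row
# (one captured letter plus three backreferences).
ELONGATED = re.compile(r"\w*([^\W\d_])\1{3}\w*", re.IGNORECASE)

# Punctuation tokens accepted right after the candidate word.
PUNCT = {"!", "...", "…", '"'}


def elongated_before_punct(tokens):
    """Return elongated words that are directly followed by punctuation."""
    return [word for word, nxt in zip(tokens, tokens[1:])
            if ELONGATED.fullmatch(word) and nxt in PUNCT]


# Hypothetical token sequence for illustration.
tokens = ["Ohhhhh", "!", "casa", "!", "Zzzzzzzz", "...",
          "vrrrr", "e", "Psssst", "!"]

print(elongated_before_punct(tokens))
```
]]></code>
<p>In this sample <italic>vrrrr</italic> is skipped because no punctuation token follows it, which shows how the additional marker narrows the selection at the cost of missing some valid forms.</p>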
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(7)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p><italic>Tenho de esperar que a m&#225;quina rebobine. <bold>&#8220;Zzzzzzzzzzzzzzzzz &#8230;&#8221;</bold> &#8211; Idiota! (CRPC)</italic>.</p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(8)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p><italic>O pior momento da campanha de Howard Dean veio depois da sua derrota no Iowa, quando proferiu um discurso em voz emocionada, que culminou num berro quase animalesco, <bold>yeaaaargh!</bold> (CRPC)</italic>.</p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
<list list-type="gloss">
<list-item>
<list list-type="wordfirst">
<list-item><p>(9)</p></list-item>
</list>
</list-item>
<list-item>
<list list-type="sentence-gloss">
<list-item>
<list list-type="final-sentence">
<list-item><p><italic>Ainda assim a ac&#231;&#227;o n&#227;o perder&#225; por completo a sua liga&#231;&#227;o ao universo dos quadradinhos, visto que as imagens reais se misturam com sequ&#234;ncias de anima&#231;&#227;o. <bold>Vrrrrrrummmm</bold>. Tac-tac-tac-tac-tac&#8230; <bold>Uiiiiiiiiimm&#8230;</bold> (CRPC)</italic></p></list-item>
</list>
</list-item>
</list>
</list-item>
</list>
</sec>
<sec>
<title>4.6. Other possible markers</title>
<p>Many onomatopoeic words are known to be neologisms, nonce words, or occasional words due to their creative nature. In the process of automatic lemmatization, such occasional onomatopoeias are not recognized as lexemes belonging to the given vocabulary and are therefore likely to be stored in modern corpora as word forms with empty or &#8220;unknown&#8221; lemmas. <italic>CQL</italic> and <italic>CQP</italic> make it possible to search for lemmas of this kind. In other words, searching for such lemmas with the queries [lemma=&#8221;&#8221;], [lemma=&#8221;\&#124;\&#124;&#8221;] or [lemma=&#8221;unknown&#8221;], depending on the conventions of a given corpus, can increase the chance of finding exotic or occasional onomatopoeias.</p>
<p>Another promising formal feature is frequency, which can also be used as a marker: transcribed onomatopoeias are likely to show a lower frequency in the corpus than commonly used words, and <italic>Sketch Engine</italic> corpora queried through <italic>CQL</italic> allow frequency to be applied as a separate filter in the query. These and possibly other markers are a promising direction for further research.</p>
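<p>As a rough illustration of how the two markers mentioned above (unknown lemma and low frequency) might be combined outside a corpus interface, the following sketch filters a list of word and lemma pairs. Everything in it is an assumption made for the example: the function name <monospace>rare_unlemmatized</monospace>, the threshold <monospace>max_freq</monospace>, the toy data, and the set of placeholder lemma values, which in practice depends on the conventions of the corpus.</p>
<code language="python"><![CDATA[
```python
from collections import Counter

# Placeholder values for unrecognized lemmas; corpus-specific assumption.
UNKNOWN_LEMMAS = {"", "||", "unknown"}


def rare_unlemmatized(tagged_tokens, max_freq=2):
    """Return word forms with an unknown lemma and at most max_freq hits."""
    freq = Counter(word.lower() for word, _ in tagged_tokens)
    seen, result = set(), []
    for word, lemma in tagged_tokens:
        key = word.lower()
        if lemma in UNKNOWN_LEMMAS and freq[key] <= max_freq and key not in seen:
            seen.add(key)
            result.append(key)
    return result


# Hypothetical (word, lemma) pairs for illustration.
sample = [("a", "a"), ("casa", "casa"), ("vruuuumm", "unknown"),
          ("a", "a"), ("glu-glu", ""), ("a", "a"),
          ("xpto", "||"), ("xpto", "||"), ("xpto", "||")]

print(rare_unlemmatized(sample))
```
]]></code>
<p>Such a filter only produces candidates: unknown lemmas also cover typos, foreign words, and proper names, so the output would still have to be checked manually or combined with the syllable patterns discussed earlier.</p>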
</sec>
</sec>
<sec>
<title>5. Conclusion</title>
<p>Interjectional onomatopoeias are characterized by occasionality and wide variance; they are relatively rare in literature and still fall outside the scope of lexicographers of many languages or language combinations. While the word lists of conventional onomatopoeias provided in dictionaries are still quite limited, corpus queries allow for retrieving occasional sound-imitating lexemes on the basis of observed patterns, such as repetitions of graphemes and sequences of similar syllables in combination with punctuation markers (hyphens, exclamation marks, ellipses, quotation marks). The most fruitful patterns proved to be the letter combinations representing hyphenated similar syllables (either open or closed), whose average precision was 66.75% for closed syllables and 51.25% for open syllables. These patterns proved efficient for the four languages involved, which showed similar tendencies. An ANOVA test indicated that the differences among the patterns were not due to chance. The approach is therefore likely to be applicable to other languages.</p>
<p>When the interjections in a corpus are correctly tagged as such, adding a POS filter to the corpus query to rule out non-interjectional results raises the precision roughly three- to fourfold. In the absence of such tagging, the regular expression syntax and the corpus query languages demonstrated similar efficiency for the closed hyphenated syllables pattern. In contrast, for the open hyphenated syllables pattern, the regular expression applied to the searched text yielded exclusively instances of a character&#8217;s stuttering speech. Among the corpora involved, only in the Ukrainian corpus <italic>GRAK</italic> were the interjections consistently annotated with part-of-speech tags, and the precision of the query for interjectional onomatopoeias reached 92% for the hyphenated closed-syllables pattern and 91% for the hyphenated open-syllables pattern.</p>
<p>Although corpus queries, on the one hand, do not provide an exhaustive sample and, on the other hand, contain some redundant results, they nevertheless significantly speed up the search for illustrative examples that can be used for research and didactic purposes.</p>
<p>This study opens perspectives that can be extended to monosyllabic variants of the extracted multisyllabic words. The conclusions obtained allow for further work in several directions: implementing the pattern of three or more repeated letters along with punctuation markers and evaluating its precision; building and exploring queries for extracting nominal and verbal onomatopoeias as well as other parts of speech with onomatopoeic characteristics; and exploring the influence on query precision of such additional factors as token frequency or unknown lemmas. Additionally, it is important to further develop methodological tools for searching for onomatopoeias in translation practice, that is, for finding words that imitate the sounds of particular phenomena, objects, and beings, and to work out criteria for establishing equivalence relations among onomatopoeias in different languages.</p>
</sec>
</body>
<back>
<sec>
<title>Competing Interests</title>
<p>The author has no competing interests to declare.</p>
</sec>
<ref-list>
<ref id="B1"><mixed-citation publication-type="webpage"><collab>A bordo del Otto Neurath</collab>. (n.d.). <source>Pero, &#191;Hay algo que sea la dial&#233;ctica?</source> [But, is there anything that might be dialectic?]. <uri>https://abordodelottoneurath.blogspot.com/2009/08/pero-hay-algo-que-sea-la-dialectica.html</uri></mixed-citation></ref>
<ref id="B2"><mixed-citation publication-type="book"><string-name><surname>Anderson</surname>, <given-names>E. R.</given-names></string-name> (<year>1998</year>). <source>A grammar of iconism</source>. <publisher-loc>Madison, New Jersey</publisher-loc>: <publisher-name>Fairleigh Dickinson University Press</publisher-name>; London: Associated University Presses.</mixed-citation></ref>
<ref id="B3"><mixed-citation publication-type="journal"><string-name><surname>Assaneo</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Nichols</surname>, <given-names>J.</given-names></string-name>, &amp; <string-name><surname>Trevisan</surname>, <given-names>M.</given-names></string-name> (<year>2011</year>). <article-title>The Anatomy of Onomatopoeia</article-title>. <source>PLoS ONE</source>, <volume>6</volume>. DOI: <pub-id pub-id-type="doi">10.1371/journal.pone.0028317</pub-id></mixed-citation></ref>
<ref id="B4"><mixed-citation publication-type="journal"><string-name><surname>Bidaud</surname>, <given-names>S.</given-names></string-name> (<year>2022</year>). <article-title>Les onomatop&#233;es verbales du tch&#232;que [Czech verbal onomatopoeia]</article-title>. <source>Studies about Languages</source>, <volume>41</volume>, <fpage>21</fpage>&#8211;<lpage>31</lpage>. DOI: <pub-id pub-id-type="doi">10.5755/j01.sal.41.1.31330</pub-id></mixed-citation></ref>
<ref id="B5"><mixed-citation publication-type="webpage"><collab>Britannica</collab>. (<year>2024</year>). <article-title>Onomatopoeia</article-title>. <uri>https://www.britannica.com/topic/onomatopoeia</uri></mixed-citation></ref>
<ref id="B6"><mixed-citation publication-type="journal"><string-name><surname>Casas-Tost</surname>, <given-names>H.</given-names></string-name> (<year>2012</year>). <article-title>Translating onomatopoeia from Chinese into Spanish: A corpus-based analysis</article-title>. <source>Perspectives: Studies in Translatology</source>, <volume>22</volume>, <fpage>39</fpage>&#8211;<lpage>55</lpage>. DOI: <pub-id pub-id-type="doi">10.1080/0907676X.2012.712144</pub-id></mixed-citation></ref>
<ref id="B7"><mixed-citation publication-type="webpage"><string-name><surname>Chamizo Babo</surname>, <given-names>C.</given-names></string-name> (n.d.). <source>Rapunzel</source>. <uri>https://www.eraumavezoutravez.com/rapunzel</uri></mixed-citation></ref>
<ref id="B8"><mixed-citation publication-type="webpage"><collab>CLUL, Centro de Lingu&#237;stica da Universidade de Lisboa</collab>. (<year>2008&#8211;2016</year>). <source>CRPC, Corpus de Refer&#234;ncia do Portugu&#234;s Contempor&#226;neo</source> [Reference Corpus of Contemporary Portuguese]. <uri>http://gamma.clul.ul.pt/CQPweb/crpc/</uri></mixed-citation></ref>
<ref id="B9"><mixed-citation publication-type="book"><string-name><surname>&#1045;&#1169;&#1072;&#1074;&#1072;</surname>, <given-names>&#1061;.</given-names></string-name>, &amp; <string-name><surname>&#1050;&#1086;&#1073;&#1077;&#1083;&#1103;&#1085;&#1089;&#1100;&#1082;&#1072;</surname>, <given-names>&#1054;.</given-names></string-name> (<year>2016</year>). <source>&#1071;&#1087;&#1086;&#1085;&#1089;&#1100;&#1082;&#1086;-&#1091;&#1082;&#1088;&#1072;&#1111;&#1085;&#1089;&#1100;&#1082;&#1080;&#1081; &#1090;&#1077;&#1084;&#1072;&#1090;&#1080;&#1095;&#1085;&#1080;&#1081; &#1089;&#1083;&#1086;&#1074;&#1085;&#1080;&#1082; &#1086;&#1085;&#1086;&#1084;&#1072;&#1090;&#1086;&#1087;&#1077;&#1111;&#1095;&#1085;&#1086;&#1111; &#1083;&#1077;&#1082;&#1089;&#1080;&#1082;&#1080;</source> [Japanese-Ukrainian thematic dictionary of onomatopoeic vocabulary]. <publisher-loc>Kyiv</publisher-loc>: <publisher-name>Dmytro Burago Publishing House</publisher-name>.</mixed-citation></ref>
<ref id="B10"><mixed-citation publication-type="webpage"><string-name><surname>Evert</surname>, <given-names>S.</given-names></string-name>, &amp; <collab>The CWB Development Team</collab>. (<year>2022</year>). <source>CQP Interface and Query Language Manual</source>. <uri>https://cwb.sourceforge.io/files/CQP_Tutorial/</uri></mixed-citation></ref>
<ref id="B11"><mixed-citation publication-type="webpage"><string-name><surname>Fundeu</surname>, <given-names>Fundaci&#243;n del Espa&#241;ol Urgente</given-names></string-name>. (<year>2011</year>). <source>&#161;Tatatach&#225;n: 95 onomatopeyas!</source> [Tatatach&#225;n: 95 onomatopoeias]. <uri>https://www.fundeu.es/escribireninternet/tatatachan-95-onomatopeyas/</uri></mixed-citation></ref>
<ref id="B12"><mixed-citation publication-type="journal"><string-name><surname>Galiano</surname>, <given-names>L.</given-names></string-name>, &amp; <string-name><surname>Semeraro</surname>, <given-names>A.</given-names></string-name> (<year>2023</year>). <article-title>Part-of-Speech and Pragmatic Tagging of a Corpus of Film Dialogue: A Pilot Study</article-title>. <source>Corpus Pragmatics</source>, <volume>7</volume>, <fpage>17</fpage>&#8211;<lpage>39</lpage>. DOI: <pub-id pub-id-type="doi">10.1007/s41701-022-00132-9</pub-id></mixed-citation></ref>
<ref id="B13"><mixed-citation publication-type="webpage"><collab>GRAK</collab>. (<year>2017&#8211;2022</year>). <source>General Regionally Annotated Corpus of Ukrainian</source>. <uri>https://uacorpus.org/Kyiv/ua</uri></mixed-citation></ref>
<ref id="B14"><mixed-citation publication-type="book"><string-name><surname>Karamysheva</surname>, <given-names>I. D.</given-names></string-name> (<year>2017</year>). <source>Contrastive Grammar of English and Ukrainian Languages</source>. <publisher-loc>Vinnytsia</publisher-loc>: <publisher-name>Nova Knyha Publishers</publisher-name>.</mixed-citation></ref>
<ref id="B15"><mixed-citation publication-type="book"><string-name><surname>Medvediv</surname>, <given-names>A.</given-names></string-name>, &amp; <string-name><surname>Dmytruk</surname>, <given-names>A.</given-names></string-name> (<year>2019</year>). <chapter-title>Peculiarities of conveying the structural and semantic specificity of Japanese onomatopoeia in translation of texts of advertising character</chapter-title>. <source>Research Trends in Modern Linguistics and Literature</source>. <publisher-loc>Lutsk</publisher-loc>: <publisher-name>Lesya Ukrainka Eastern European National University</publisher-name>, <volume>2</volume>/2019, <fpage>77</fpage>&#8211;<lpage>94</lpage>. DOI: <pub-id pub-id-type="doi">10.29038/2617-6696.2019.2.77.93</pub-id></mixed-citation></ref>
<ref id="B16"><mixed-citation publication-type="thesis"><string-name><surname>Meinard</surname>, <given-names>M. E. M.</given-names></string-name> (<year>2022</year>). <source>The Challenge of Defining Interjections and Onomatopoeias: A Contribution, Centered on Contemporary English</source>. [PhD Thesis, <publisher-name>Universit&#233; Lumi&#232;re Lyon 2</publisher-name>]. <uri>https://www.academia.edu/67439120/The_Challenge_of_Defining_Interjections_and_Onomatopoeias_a_Contribution_Centered_on_Contemporary_English</uri></mixed-citation></ref>
<ref id="B17"><mixed-citation publication-type="webpage"><collab>Merriam-Webster Dictionary</collab>. (<year>2024</year>). <article-title>Onomatopoeia</article-title>. <uri>https://www.merriam-webster.com/dictionary/onomatopoeia?src=search-dict-box</uri></mixed-citation></ref>
<ref id="B18"><mixed-citation publication-type="journal"><string-name><surname>Orrequia-Barea</surname>, <given-names>A.</given-names></string-name>, &amp; <string-name><surname>Mar&#237;n-Honor</surname>, <given-names>C.</given-names></string-name> (<year>2020</year>). <article-title>Building a parallel corpus of literary texts featuring onomatopoeias: ONPACOR</article-title>. <source>Research in Corpus Linguistics</source>, <volume>8</volume>, <fpage>46</fpage>&#8211;<lpage>62</lpage>. DOI: <pub-id pub-id-type="doi">10.32714/ricl.08.02.03</pub-id></mixed-citation></ref>
<ref id="B19"><mixed-citation publication-type="book"><string-name><surname>P&#233;rez Reverte</surname>, <given-names>A.</given-names></string-name> (<year>1993</year>). <source>La sombra del &#225;guila</source> [The Shadow of the Eagle]. <publisher-loc>Madrid</publisher-loc>: <publisher-name>Alfaguara</publisher-name>.</mixed-citation></ref>
<ref id="B20"><mixed-citation publication-type="webpage"><collab>Real Academia Espa&#241;ola</collab>. (n.d.). <article-title>Banco de datos (CREA)</article-title>. <source>Corpus de referencia del espa&#241;ol actual</source> [Reference Corpus of Modern Spanish]. <uri>https://corpus.rae.es/creanet.html</uri></mixed-citation></ref>
<ref id="B21"><mixed-citation publication-type="book"><string-name><surname>Riera-Eures</surname>, <given-names>M.</given-names></string-name>, &amp; <string-name><surname>Sanjaume</surname>, <given-names>M. M.</given-names></string-name> (<year>2010</year>). <source>Diccionari d&#8217;onomatopeies i altres interjeccions: amb equival&#232;ncies en angl&#232;s, espanyol i franc&#232;s</source> [Dictionary of onomatopoeias and other interjections, with equivalents in English, Spanish and French]. <publisher-loc>Vic</publisher-loc>: <publisher-name>Eumo</publisher-name>.</mixed-citation></ref>
<ref id="B22"><mixed-citation publication-type="webpage"><collab>Rioandlearn</collab>. (<year>2022</year>). <source>Onomatopeia in Portuguese</source> [Onomatopoeia in Portuguese]. <uri>https://rioandlearn.com/onomatopoeia-in-portuguese/</uri></mixed-citation></ref>
<ref id="B23"><mixed-citation publication-type="webpage"><string-name><surname>Rodr&#237;guez Guzm&#225;n</surname>, <given-names>J.</given-names></string-name> (<year>2011</year>). <article-title>Morfolog&#237;a de la onomatopeya. &#191;Subclase de palabra subordinada a la interjecci&#243;n? [Morphology of Onomatopoeia: A Subclass of Word Subordinate to the Interjection?]</article-title>. <source>Moenia</source>, <volume>17</volume>, <fpage>125</fpage>&#8211;<lpage>178</lpage>. <uri>http://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&amp;idt=25534073</uri></mixed-citation></ref>
<ref id="B24"><mixed-citation publication-type="journal"><string-name><surname>Round</surname>, <given-names>E.</given-names></string-name>, &amp; <string-name><surname>Kwon</surname>, <given-names>N.</given-names></string-name> (<year>2015</year>). <article-title>Phonaesthemes in morphological theory</article-title>. <source>Morphology</source>, <volume>25</volume>(<issue>1</issue>), <fpage>1</fpage>&#8211;<lpage>27</lpage>.</mixed-citation></ref>
<ref id="B25"><mixed-citation publication-type="webpage"><collab>RSVPLive</collab>. (<year>2024</year>). <source>Why hairdressers are the unsung heroes in our lives, writes Marguerite Kiely</source>. <uri>https://www.rsvplive.ie/life/hairdressers-unsung-heroes-lives-writes-14097868</uri></mixed-citation></ref>
<ref id="B26"><mixed-citation publication-type="webpage"><collab>Sketch Engine</collab>. (n.d.). <source>CQL Guide</source>. <uri>https://www.sketchengine.eu/documentation/cql-basics/</uri></mixed-citation></ref>
<ref id="B27"><mixed-citation publication-type="thesis"><string-name><surname>Sugahara</surname>, <given-names>T.</given-names></string-name> (<year>2011</year>). <source>Onomatopoeia in Spoken and Written English: Corpus- and Usage-based Analysis</source>. [Doctoral dissertation, <publisher-name>Hokkaido University</publisher-name>]. <uri>https://eprints.lib.hokudai.ac.jp/dspace/bitstream/2115/45138/1/Dissertation%20by%20Takashi%20SUGAHARA.pdf</uri></mixed-citation></ref>
<ref id="B28"><mixed-citation publication-type="webpage"><string-name><surname>Tellier</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Eshkol</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Taalab</surname>, <given-names>S.</given-names></string-name>, &amp; <string-name><surname>Prost</surname>, <given-names>J.-P.</given-names></string-name> (<year>2010</year>). <article-title>POS-tagging for Oral Texts with CRF and Category Decomposition</article-title>. <source>Research in Computing Science</source>, <volume>46</volume>, <fpage>79</fpage>&#8211;<lpage>90</lpage>. <uri>https://hal.science/hal-00467951/documentddh</uri></mixed-citation></ref>
<ref id="B29"><mixed-citation publication-type="webpage"><collab>Universal POS Tags</collab>. <source>Universal Dependencies</source>. (<year>2014&#8211;2024</year>). <uri>https://universaldependencies.org/u/pos/</uri></mixed-citation></ref>
<ref id="B30"><mixed-citation publication-type="webpage"><collab>University of Leeds</collab>. (<year>2022a</year>). <source>Leeds Collection of English Corpora</source>. <uri>http://corpus.leeds.ac.uk/protected/query.html</uri></mixed-citation></ref>
<ref id="B31"><mixed-citation publication-type="webpage"><collab>University of Leeds</collab>. (<year>2022b</year>). <source>Leeds Collection of Internet Corpora</source>. <uri>http://corpus.leeds.ac.uk/internet.html</uri></mixed-citation></ref>
<ref id="B32"><mixed-citation publication-type="book"><string-name><surname>Vahidian Kamyar</surname>, <given-names>T.</given-names></string-name> (<year>1996</year>). <source>Farhange Namavaha dar Zbane Farsi</source> [Persian Onomatopoeia Dictionary]. <publisher-name>Ferdowsi University of Mashhad Publication</publisher-name>.</mixed-citation></ref>
<ref id="B33"><mixed-citation publication-type="journal"><string-name><surname>Yaqubi</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Tahir</surname>, <given-names>R.</given-names></string-name>, &amp; <string-name><surname>Amini</surname>, <given-names>M.</given-names></string-name> (<year>2018</year>). <article-title>Translation of Onomatopoeia: Somewhere between Equivalence and Function</article-title>. <source>Studies in Linguistics and Literature</source>, <volume>2</volume>, <fpage>205</fpage>&#8211;<lpage>222</lpage>. DOI: <pub-id pub-id-type="doi">10.22158/sll.v2n3p205</pub-id></mixed-citation></ref>
<ref id="B34"><mixed-citation publication-type="webpage"><collab>Yourdictionary</collab>. (<year>2021</year>). <source>Sound Words: Examples of Onomatopoeia</source>. <uri>https://www.yourdictionary.com/articles/sound-onomatopoeia-examples</uri></mixed-citation></ref>
<ref id="B35"><mixed-citation publication-type="journal"><string-name><surname>&#1041;&#1086;&#1078;&#1082;&#1086;</surname>, <given-names>&#1030;.&#1057;.</given-names></string-name>, &amp; <string-name><surname>&#1050;&#1072;&#1083;&#1100;&#1085;&#1110;&#1095;&#1077;&#1085;&#1082;&#1086;</surname>, <given-names>&#1040;.</given-names></string-name> (<year>2023</year>). <article-title>&#1054;&#1085;&#1086;&#1084;&#1072;&#1090;&#1086;&#1087;&#1077;&#1103; &#1103;&#1082; &#1079;&#1072;&#1089;&#1110;&#1073; &#1077;&#1082;&#1089;&#1087;&#1088;&#1077;&#1089;&#1080;&#1074;&#1085;&#1086;&#1089;&#1090;&#1110; &#1074; &#1075;&#1088;&#1072;&#1092;&#1110;&#1095;&#1085;&#1086;&#1084;&#1091; &#1088;&#1086;&#1084;&#1072;&#1085;&#1110;: &#1076;&#1077;&#1103;&#1082;&#1110; &#1085;&#1102;&#1072;&#1085;&#1089;&#1080; &#1087;&#1077;&#1088;&#1077;&#1082;&#1083;&#1072;&#1076;&#1091; [Onomatopoeias as an expressive means in graphic novel: some nuances of translation]</article-title>. <source>&#1047;&#1072;&#1087;&#1080;&#1089;&#1082;&#1080; &#1079; &#1088;&#1086;&#1084;&#1072;&#1085;&#1086;-&#1075;&#1077;&#1088;&#1084;&#1072;&#1085;&#1089;&#1100;&#1082;&#1086;&#1111; &#1092;&#1110;&#1083;&#1086;&#1083;&#1086;&#1075;&#1110;&#1111;</source> [Notes on Romance and Germanic Philology], <volume>2</volume> (<issue>51</issue>), <fpage>30</fpage>&#8211;<lpage>41</lpage>. DOI: <pub-id pub-id-type="doi">10.18524/2307-4604.2023.2(51).296818</pub-id></mixed-citation></ref>
</ref-list>
</back>
</article>