Public platforms like Twitter, Facebook, and Weibo have provided a way for individuals of many cultures to create content and interact with people from other cultures and speakers of other languages. Often, when individuals speak multiple languages, they engage in code switching, where they communicate using a mixture of those languages. Code switching is readily seen on big social media platforms, which creates an exciting opportunity for understanding more about code-switching communication, such as why authors engage in code-switching, what content is expressed in one language versus another, or even where code-switching is most likely to take place geographically!
Our long-term goals for analyzing code switching in social media focus on understanding how linguistic groups communicate. Specifically, we are interested in
- how ideas are shared between speakers of different languages,
- how the terminology and language of one community influences another, and
- which language groups are in frequent contact with one another.
Observing how individuals communicate in multiple languages provides us with some insight into which communities are communicating with each other and what kinds of ideas and content are shared.
One of the biggest challenges in studying code-switching is simply detecting which languages are being used. Many natural language processing tools assume that text is written in only one language, which makes it difficult to accurately identify cases where authors use two or more languages. Social media makes detecting languages even more challenging due to the creativity authors use when writing; abbreviations, slang, emoticons, or even things like word lengthening (“cooooooool”) make it difficult for computer programs to correctly identify the language of a post.
The state of language identification in social media makes large-scale analyses of code-switching within a sentence currently infeasible. However, we can still get a first look at the phenomenon by examining code-switching in a very restricted setting: the hashtag. In our view, the hashtag is a way for authors to comment on their post’s content and a way for authors to link their content with others’ writing about the same theme, forming a virtual community of sorts around a hashtag. This dual role of the hashtag is well-supported in the literature.[3,4,5] Furthermore, the use of hashtags to link content with a larger community (e.g., social-movement hashtags like “#BlackLivesMatter”) mirrors some authors’ interpretation of code-switching as an intention on the part of the speaker to show membership in multiple communities. Therefore, as an initial study of code-switching at scale, we examine Twitter posts in which the author writes a post in one language and then uses a hashtag from a second language.
As a part of our code-switching study, we want to answer the following questions:
- Given posts in a specific language, which other languages do authors use in their hashtags?
- How can we reliably determine the language of a hashtag?
- When an author uses a hashtag from another language, how often is that an instance of code-switching (i.e., intentional communication in multiple languages), and when it is not code-switching, why is another language being used?
This blog post summarizes the first part of our study, which was published in the EMNLP 2014 Workshop on Code Switching. See our paper for more details, and don’t hesitate to contact me if you have questions!
What Language Is A Post or Hashtag Written In?
In order to study code-switching, we first need a reliable way to determine the language of a Twitter post and of the hashtags used in it. To determine the language of a post, we use the off-the-shelf language identification tool langid.py. While this tool is not always completely accurate on social media text, langid.py supports the wide variety of languages we see on Twitter, which is immensely important when studying language groups. To limit the noise from misidentifying the language of a post, we also filter the posts to only those containing at least three recognized words in some language, which removes posts such as those containing only hashtags, misspelled words, or emotive text (e.g., “ughhhhhh”). We also remove emoticons, URLs, and user mentions (i.e., @username) before performing language identification, all of which can confuse langid.py.
Determining the language of a hashtag is much more complicated than for a post. A hashtag is often a single word, which may be present in multiple languages (e.g., “soy” is a plant in English and the phrase “I am” in Spanish). More troublesome is that a hashtag can be multiple words, like #whoknew, and in some cases, it is not obvious how to split the hashtag into its component words to determine the language. Even more challenging, hashtags can be acronyms or portmanteaus. A particular favorite we saw was #YoloSwaggins (described in more detail here), which combines the acronym YOLO (You Only Live Once) and “swag” (an abbreviation of swagger) to mirror the name of Lord of the Rings character Frodo Baggins.
The hashtag itself provides relatively little information about what language it’s written in. Therefore, we devised an alternate strategy for determining its language that uses outside information. Given a hashtag, we gather a thousand posts using that hashtag and identify the language of those posts (independent of the hashtag). In cases where a majority of the posts use the same language, we classify the hashtag as being written in that language. The underlying hypothesis is that a hashtag is created in one language whose authors are the primary users of that hashtag, so identifying the most-used language in a hashtag’s posts also identifies its source language. In our manual analysis, this strategy correctly classified the language of the hashtag in 97% of cases.
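The majority-vote strategy described above can be sketched roughly as follows. This is a minimal illustration and not the code used in the study: the regular expressions, the function names `clean_post` and `hashtag_language`, and the `min_share` parameter are all assumptions made for the example, and the emoticon removal and three-recognized-words filter are left out for brevity. The language identifier is passed in as a function so that any tool can be plugged in.

```python
import re
from collections import Counter
from typing import Callable, Iterable, Optional

# Hypothetical patterns for elements stripped before language identification.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def clean_post(text: str) -> str:
    """Remove URLs, user mentions, and hashtags, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def hashtag_language(
    posts: Iterable[str],
    identify: Callable[[str], str],
    min_share: float = 0.5,
) -> Optional[str]:
    """Return the language used by a majority of the posts, or None.

    `identify` maps cleaned post text to a language code. Because hashtags
    are stripped first, each post is classified independently of the tag.
    """
    votes = Counter(identify(clean_post(post)) for post in posts)
    if not votes:
        return None
    lang, count = votes.most_common(1)[0]
    # Require a strict majority; ties and fragmented votes yield None.
    return lang if count / sum(votes.values()) > min_share else None
```

With langid.py installed, something like `lambda text: langid.classify(text)[0]` could serve as `identify`, since `langid.classify` returns a (language, score) pair.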
Now that we have automatic ways of identifying the language of posts and hashtags, we seek to analyze popular hashtags to find evidence of cross-language hashtags and, ideally, code-switching. We start by gathering posts for all hashtags used from March 2014 to July 2014 from our Twitter feed, which contains roughly 10% of all tweets. We keep only those hashtags used at least 1000 times, which leaves us with 4624 hashtags to examine.
Both langid.py and our method for identifying a hashtag’s language are imperfect, which can create false positives where a language misclassification would suggest a hashtag is being used in a post with a different language when in reality it was not. While there will always be noise in automated methods, we aim to limit the noise by only asserting that a hashtag of language X is used by language Y if we can find at least 20 posts written in Y that use that hashtag. In our manual analysis of posts and hashtags, we found the resulting assertions to be roughly 67% accurate, which is high enough to make initial insights into large hashtag-borrowing trends.
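The two thresholds above (at least 1000 uses per hashtag, and at least 20 posts in language Y before asserting that Y uses the hashtag) can be expressed as a short filter. The sketch below is illustrative only: the input format, a stream of (hashtag, post language) pairs plus a hashtag-to-language mapping, and the function name `borrowing_assertions` are assumptions and not taken from the study.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple

MIN_HASHTAG_USES = 1000      # hashtags used fewer times are ignored
MIN_POSTS_IN_LANGUAGE = 20   # posts in language Y needed to assert borrowing


def borrowing_assertions(
    observations: Iterable[Tuple[str, str]],
    hashtag_langs: Dict[str, str],
    min_uses: int = MIN_HASHTAG_USES,
    min_posts: int = MIN_POSTS_IN_LANGUAGE,
) -> Set[Tuple[str, str, str]]:
    """Return (hashtag, hashtag language, post language) assertions.

    `observations` holds one (hashtag, post language) pair per use of a
    hashtag; `hashtag_langs` maps each hashtag to its identified language.
    """
    uses = Counter()
    per_language = defaultdict(Counter)
    for tag, post_lang in observations:
        uses[tag] += 1
        per_language[tag][post_lang] += 1

    assertions = set()
    for tag, total in uses.items():
        tag_lang = hashtag_langs.get(tag)
        if total < min_uses or tag_lang is None:
            continue  # too rare, or hashtag language unknown
        for post_lang, n_posts in per_language[tag].items():
            if post_lang != tag_lang and n_posts >= min_posts:
                assertions.add((tag, tag_lang, post_lang))
    return assertions
```

Raising `min_posts` trades coverage for precision: fewer language pairs are reported, but each is less likely to stem from a handful of misclassified posts.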
Which Languages Use Another’s Hashtags?
To our surprise, hashtags were widely adopted by languages other than their language of origin. In the chord graph that follows, we show the number of hashtags originating in a language X that are subsequently used in language Y. The size of the chord connecting two languages reflects the number of tags: the width of the chord at language X shows how many hashtags from that language were used by Y, relative to hashtags written in Y being used in posts written in Y (i.e., a thicker chord at one language indicates the other language is borrowing more hashtags from the first language). Mouse over the chart to get details on hashtag sharing between any two languages.
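The counts behind such a chart amount to a tally over language pairs. As a rough sketch, assuming the cross-language uses are available as (hashtag, origin language, post language) triples (a hypothetical input format chosen for this example), the number of hashtags originating in X and used in Y could be computed like this:

```python
from collections import Counter
from typing import Iterable, Tuple


def borrowing_counts(
    triples: Iterable[Tuple[str, str, str]]
) -> Counter:
    """Count distinct hashtags per (origin language, using language) pair."""
    distinct = set(triples)  # each hashtag counts once per language pair
    return Counter((origin, user) for _tag, origin, user in distinct)
```

The resulting counter is asymmetric, so `counts[("en", "lv")]` and `counts[("lv", "en")]` can differ, which is what lets a chart show one language borrowing more than the other.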
English is the most popular language on Twitter, so it is not surprising that many languages incorporate hashtags originating from English. However, we start to see trends that mirror cultural and geographic regularities. Authors writing in Romance languages borrow from other Romance languages much more frequently than from other languages, possibly because their speakers are located in nearby countries, which encourages bilingualism. Furthermore, although English is often thought of as the international language for sharing, authors of many languages incorporate hashtags from an unexpectedly wide variety of languages. Although not all of these hashtag borrowings are necessarily instances of code switching (which will be discussed later), this cross-language adoption rate does point to larger trends in how communication spreads between cultures and which language communities are receptive to communication from each other.
The Indonesian language presents an unusual case where authors from other languages seem to incorporate Indonesian-language hashtags without much reciprocity. However, peeking into the data shows that this phenomenon is often due to something other than communication! Indonesian-language authors will use a nonsensical hashtag to post spam-related or bot-control messages (in Indonesian), and later, other accounts will post random snippets of text in another language using that hashtag as well. A similar behavior is seen in Russian-language posts. We speculate that this behavior is designed to avoid Twitter’s spam detection filters, but more analysis is clearly needed.
What kind of hashtags are being code-switched?
While the chord graph is very encouraging for observing hashtag-sharing, we want to look for evidence of code-switching, which requires looking at how authors are using the hashtags. That is, we want to assess whether these hashtags are being used in intentional communication on the part of the authors, or if some other phenomena are responsible for their use.
Before starting to look at code-switching in hashtags, we first recognize that hashtags serve many roles within Twitter. Most familiar is a hashtag’s role as commentary on a post’s content (e.g., “#yum”, “#ThisSucks”) or as a way to link the post with a specific thread of posts using the same hashtag (“#BlackLivesMatter”, “#FactsAboutMe”). However, other kinds of hashtags exist as well, such as:
- those generated by applications and games when players post to Twitter from within the app (e.g., “#LastFM”, “#AndroidGames”)
- the names of people, places, and things that are widely recognized and do not vary between languages (“#WorldCup2014”, “#TeenChoiceAwards”)
- advertising and promotional uses (“#forsale”, “#porn”)
- spam-related hashtags that are used to coordinate botnets (“#681Team”)
Ideally, we want to focus only on hashtag uses where the author is communicating and where the author has a choice in how they express their message. This rules out hashtag usages like application-generated hashtags, which the author isn’t responsible for writing, or name-based hashtags where everyone agrees on what something should be called, regardless of what language they speak. (For the full taxonomy of hashtag types, see Table 1 in our paper.)
Therefore, we focused specifically on hashtag uses that served as a commentary on the post’s content (i.e., an annotation) or as a way to link the post with a broader community. We manually analyzed 600 hashtags, classifying them by type to find those serving as an annotation or for community-linking. From these, we then selected 13 hashtags each from the annotation and community-linking types and manually analyzed their posts for evidence of code-switching.
Looking at the annotation and community hashtags, we found widespread evidence of code-switching by the authors. Many cases followed expectations for which languages authors would borrow from, such as languages in countries with bilingual speakers or those languages with close geographic proximity. However, several hashtags were used in a variety of diverse languages. For example, #truth was used with languages such as Arabic, Bosnian, Bulgarian, Hindi, and Punjabi. The most widely code-switched hashtag was #magic. In English, the hashtag is commonly used with content on magic tricks; however, in other languages, the hashtag often connotes surprise. For example, consider this Latvian tweet:
Es izmeklēju visu plauktu ,nekur nav.Mamma piejiet ne sekunde nepagāja,kad viņa—
In the tables below, each language is identified by its ISO-639-1 code.
| Hashtag | Hashtag Language | Languages using the hashtag |
|---|---|---|
| #Facts | en | id th fr es ru |
| #simple | en | id es fr ms tr tl sw zh ja ko |
| #bitch | en | ar cs de es fr id it ja ms nl pt ru sv tl tr zh |
| #delicious | en | ca de es fr id it ja ko ms nl ru th tr zh |
| #Design | en | ar de es fr ja kr pt th tl zh |
| #SWAG | en | de es fr id it pl pt ru |
| #fresh | en | es fr id it ms nl sv |
| #truth | en | ar bs bg es fr hi id ja it ms pa pt ru tl zh |

| Hashtag | Hashtag Language | Languages using the hashtag |
|---|---|---|
| #Quran | ar | fa ms id sw az it de en |
| #tech | en | de es nl ar el fr ro id it ja ms no pl pt ru sq sv zh |
| #class | en | ar tr es bg de fr pt he hr id it ja lt lv ms nl ru sw tl uk zh |
| #animals | en | ar ca de es fr pt it ms ja mk pl ro ru tl tr ur vi |
| #cine | es | ca de en fr ja pt ro ru |
| #sunday | en | es ar tr fr ca de el gl hu id it ja ms ko pt nl nn no pl ro ru sl sv th tl zh |
| #Energy | en | ru es de fr it pt tr |
| #change | en | ar nl es cs de el eu fr pt id it ja ko jv lv ms nb no pl ro uk ru sv ta th tl tr ur zh |
| #magic | en | nl fr ar ru ca cs de el it es hu id ja jv ko lv ms nn pl pt ro sq sv sw sl tl tr zh |
Two of the community-type hashtags, #Hadith and #Quran, also stand out for being transliterations from Arabic script to Latin script. We investigated whether these hashtags are used more often than their Arabic-script counterparts and found that the Latin-script form was used in almost all cases (i.e., Arabic-script tweets used the transliterated Latin-script hashtags instead of writing the word in the native script). We observed the same trend for authors in other languages written in non-Latin scripts, such as Japanese, who frequently included transliterated hashtags instead of their native-script counterparts. This phenomenon may point to authors intentionally trying to make it easier for other Latin-script authors to link their content to the hashtag. Alternatively, it could be that authors are unsure whether Twitter supports hashtags in non-Latin scripts (which it does), so they use Latin-script hashtags to be sure the content is linked somewhere.
Other examples of code-switching can be seen in the tweets below:
English tweet using a Spanish hashtag:
Arabic tweet using an English hashtag:
English tweet using an Arabic hashtag:
Remember who’s Sunnah you’re supposed to follow. #ﷺ— Mally (@MallyPosts) February 20, 2015
Authors may even incorporate hashtags from more than one language as this Thai tweet below shows.
Thai tweet using English and Korean hashtags(!):
Oriya tweet using Oriya and English hashtags:
English tweet using Korean and English hashtags:
For those wanting to investigate further, the Twitter search interface makes it possible to find tweets containing code-switched hashtags. Using the advanced search interface, choose a hashtag in a language that you prefer and then search for tweets in a different language. As you will find, Twitter’s language identification is by no means perfect, but with a bit of effort, you can usually find code-switching tweets in a variety of languages. Based on our results, you should have decent luck searching for the English hashtags “#magic” or “#truth”.
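As a convenience, the same kind of search can be written as a query string. The sketch below assumes Twitter’s `lang:` search operator and the `https://twitter.com/search` URL format; both are assumptions about the service and may change, and the function names are made up for this example.

```python
from urllib.parse import urlencode


def search_query(hashtag: str, post_language: str) -> str:
    """Build a query for posts in `post_language` that contain `hashtag`."""
    tag = hashtag if hashtag.startswith("#") else "#" + hashtag
    # `lang:` is assumed to restrict results to one ISO-639-1 language code.
    return f"{tag} lang:{post_language}"


def search_url(hashtag: str, post_language: str) -> str:
    """Return a search URL for the query (the base URL is an assumption)."""
    query = urlencode({"q": search_query(hashtag, post_language)})
    return "https://twitter.com/search?" + query
```

For instance, `search_query("magic", "lv")` produces the query `#magic lang:lv`, which looks for Latvian posts using the English hashtag.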
Conclusions and Open Questions
Multilingual authors have long been known to engage in code switching, where they communicate using elements of two or more languages. As our initial analysis has shown, this code-switching behavior occurs in social media as well and, in particular, can take the form of hashtag code switching, where authors write in one language and then use a hashtag from a different language. The results of our automated analysis showed that authors incorporate hashtags from many other languages, revealing a global phenomenon not limited to just borrowing from English. While we found that not all cases of multilingual hashtag-borrowing were cases of code switching, due to reasons such as application-generated posts and spam posts, our manual analysis revealed that code switching was still very present and occurred between many language pairs.
Our study provides a first step towards understanding code switching in social media and raises many open questions and opportunities for future investigations. On the technical side, two important tasks stand out. First, a significant amount of work is needed to improve language identification in social media in order to more precisely measure the volume of posts engaging in cross-language hashtag use. Second, our analysis showed that hashtags serve a wide variety of uses, from annotation to spam. Automated methods are needed to classify the primary role of a hashtag. Such a classifier would be invaluable for focusing code-switching analyses only on those hashtags serving to convey meaningful content.
In the future, we intend to look at the communicative and social aspects of code switching. In particular, we seek to answer
- How does geography play into code-switching?
- What role does the author’s language preference play? Do they switch languages in their posts or post predominantly in one language? Do they engage in intra-sentential code switching as well as hashtag code switching?
- What role does the audience play? Is the social group of an author multilingual or divided into monolingual groups?
Endnotes and References
1. For the purposes of this study, we use a very loose definition of code switching, which incorporates both code-switching and borrowing (where a word is adopted by another language). We simply require that the act of communicating be intentional on the part of the author and use words from two or more languages, where the author has a choice in which word to use (as opposed to cases where there is a universal term for something, like a person’s name, that could be considered independent of language).
2. Among natural language processing tools, polyglot is a notable exception for supporting language detection in mixed-language documents.
3. Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning (CoNLL), pages 107–116.
4. Julie Letierce, Alexandre Passant, John Breslin, and Stefan Decker. 2010. Understanding how twitter is used to spread scientific messages. In WebSci10: Extending the Frontiers of Society On-Line.
5. Manish Gupta, Rui Li, Zhijun Yin, and Jiawei Han. 2010. Survey on social tagging techniques. ACM SIGKDD Explorations Newsletter, 12(1):58–72.
6. Bonnie Urciuoli. 1995. Language and borders. Annual Review of Anthropology, 24:525–546.
7. The comedy show Saturday Night Live famously plays upon the ambiguity of partitioning a character sequence into words in its skit where Sean Connery provides his interpretation of An Album Cover during a Jeopardy game.
8. Yabing Liu, Chloe Kliman-Silver, and Alan Mislove. 2014. The tweets they are a-changin’: Evolution of twitter users and behavior. In Proceedings of the 8th International Conference on Weblogs and Social Media (ICWSM).
9. An example of a message we interpreted as being related to spam or bot-control was “#12YKFamilia < kenapa sih? 867425”. Here, the number on the end varies between posts and indicates an operation for a botnet to perform. Posts containing such a message were frequently deleted automatically by Twitter, and most of the accounts posting such messages were also deleted.
10. These types of spam hashtags are often very bursty but short-lived on Twitter. The role of hashtags in spam and bot-control is just starting to be investigated.