Thesis topics

  • Topics of fact-checking on Twitter Community Notes


    Supervisor(s): Uku Kangur (uku dot kangur [ät] ut [dot] ee) and Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

    Description:
    Twitter (more recently branded as “X”) Community Notes were first launched in 2021 (under the old name “Birdwatch”). Since Elon Musk’s acquisition of Twitter in 2022, Community Notes have become Twitter's main new way of fighting misinformation on the platform. A particularly important part of Community Notes is its peer-review aspect, where contributors can mark notes as helpful or unhelpful (in addition to giving the reasoning behind their rating). This thesis proposal entails a comprehensive analysis of the topics of fact-checking in Community Notes and how they relate to the classification of misinformation.

    The questions that will be answered during the writing of the thesis:
    • To what extent do community notes concentrate on specific categories of content, such as politics, health, and entertainment?
    • What trends can be identified in the subject matter of community notes?
    • Are certain types of misinformation more common among certain topics?
    • Are there identifiable external events that correlate with spikes or shifts in the popularity of certain topics within community notes?

    Datasets: It will be provided.
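
    As a rough illustration of the last question above (spikes in topic popularity), the sketch below flags days on which the number of notes for a topic rises far above its recent average. The function name, the 7-day window, the z-score threshold and the sample counts are all illustrative assumptions; they are not taken from the dataset or from a prescribed method.

```python
from statistics import mean, pstdev

def find_spikes(daily_counts, window=7, z_threshold=3.0, min_sigma=1.0):
    """Return indices of days whose count is at least z_threshold standard
    deviations above the mean of the preceding `window` days.

    min_sigma is an assumed floor on the standard deviation so that a very
    flat history does not turn every small fluctuation into a spike.
    """
    spikes = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mu = mean(history)
        sigma = max(pstdev(history), min_sigma)
        if (daily_counts[i] - mu) / sigma >= z_threshold:
            spikes.append(i)
    return spikes

# Hypothetical daily note counts for one topic; day 7 is an obvious burst.
counts = [10, 12, 11, 9, 10, 11, 10, 40, 11, 10]
spike_days = find_spikes(counts)
```

    Flagged days could then be compared manually against a timeline of external events; more robust alternatives (for example, seasonal decomposition) may suit real data better.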

  • Aligning English and Estonian misinformation using social media posts and news


    Supervisor(s): Uku Kangur (uku dot kangur [ät] ut [dot] ee) and Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

    Description:
    The advent of machine learning techniques, particularly multilingual Large Language Models (LLMs), provides an invaluable resource for studying the spread and nature of misinformation across linguistic boundaries. However, smaller languages (like Estonian) often do not have their own specific misinformation datasets. This thesis aims to align English misinformation posts with their Estonian counterparts using multilingual LLMs and active sampling methods. Understanding the cross-language similarities and differences in misinformation can offer valuable insights for combating it more effectively on a global scale.

    The questions that will be answered during the writing of the thesis:
    • How effective are multilingual LLMs in aligning English misinformation posts with Estonian misinformation posts?
    • What are the characteristics of misinformation posts in both languages (e.g., topics, sentiment, user engagement)?
    • How does active sampling improve the quality and relevance of the aligned dataset?
    • Are there any linguistic or cultural factors that influence the kind or spread of misinformation in English and Estonian?

    Datasets: Will be prepared together with the supervisor.

  • Understanding human behaviour through call data records.


    Supervisor(s): Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

    Description:
    We have mobile call data records (CDR) and GPS data. In this thesis topic, we would like to use them to study the following:
    CDR dataset:
    • Detection of households, friends, families, colleagues (calling each other, visiting the same locations, etc.).
    • Other homogeneous social groups (students, retired persons, unemployed, employed).
    • “Vacation behavior”: human behavior is very routine – we have 1 very important.
    • Detection of nomads: people in constant motion (professional drivers, bus drivers).

    GPS data:
    • Detection of transportation modes (pedestrian, cycling, car, public transportation, etc.) from GPS points. The outcome could be a sort of travel diary.
    • GPS-based time use survey (detection of meaningful places [home, work, school, children-related places, hobbies, etc.] and how much time people spend there).

    Datasets: It will be provided.
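
    To give a feel for the transportation-mode task above, here is a minimal sketch that computes speeds between consecutive GPS fixes with the haversine formula and applies a naive threshold on the median speed. The function names, the speed thresholds and the sample track are illustrative assumptions only; a real solution would likely use richer features (acceleration, stops, proximity to transit lines).

```python
import math
from statistics import median

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def segment_speeds_kmh(points):
    """points: list of (lat, lon, unix_seconds) sorted by time.
    Returns the speed in km/h for each pair of consecutive fixes."""
    speeds = []
    for (la1, lo1, t1), (la2, lo2, t2) in zip(points, points[1:]):
        dt = t2 - t1
        if dt <= 0:  # skip duplicate or out-of-order timestamps
            continue
        speeds.append(haversine_m(la1, lo1, la2, lo2) / dt * 3.6)
    return speeds

def guess_mode(speeds_kmh):
    """Very rough mode guess from the median speed (thresholds are assumptions)."""
    if not speeds_kmh:
        return "unknown"
    m = median(speeds_kmh)
    if m < 7:
        return "pedestrian"
    if m < 25:
        return "cycling"
    return "motorised"

# Hypothetical track: about 111 m north every 60 s, i.e. roughly 6.7 km/h.
track = [(58.3800 + 0.001 * i, 26.7200, 60 * i) for i in range(5)]
mode = guess_mode(segment_speeds_kmh(track))
```

    Sequences of such per-segment labels, grouped by day, would be one possible starting point for the travel diary mentioned above.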

  • Analysis of mental health issues in minority communities.


    Supervisor(s): Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

    Description:
    Mental health issues have been on the rise in recent times, and one possible cause is the fast-moving lifestyle of recent decades. At the same time, the taboo around mental health is gradually decreasing: individuals are now opening up about their mental health issues, and various online social media (OSM) platforms, such as Reddit, Twitter, and others, provide a forum for individuals to discuss and seek assistance for them. Previous work on mental health issues in OSM platforms has not adequately addressed or examined minority populations, such as migrants and LGBTQ+ communities. These communities are comparatively more vulnerable to mental health issues than the general population. In this thesis, we will investigate natural language data collected from OSM platforms to assess mental health concerns in minority communities using various natural language processing (NLP) techniques.
    Related works:
    • Khatua, A., & Nejdl, W. (2021, October). Struggle to Settle down! Examining the Voices of Migrants and Refugees on Twitter Platform. In Companion Publication of the 2021 Conference on Computer Supported Cooperative Work and Social Computing (pp. 95-98).
    • Abbott, A. (2016). The mental-health crisis among migrants. Nature, 538(7624), 158-160.

  • A novel approach to analyzing sexist trends in Hollywood using subtitles.


    Supervisor(s): Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

    Description:
    There is no denying the tremendous potential of data science. But imagine how awesome it'd be if you could use data science methodologies to comprehend movies and how they have changed over time. The 1984 film "Once Upon a Time in America" rocks a solid 8.3 rating on IMDb. But are you aware that it has a misogynistic plot and that women were abused on screen? A data scientist's toolkit should include the art of storytelling, and that is exactly what part of this project is about. You get to analyze movies from the past and present using state-of-the-art NLP techniques (which are currently being used by multiple tech giants) to understand trends in social aspects such as gender equality. By the end of the project, you will have gained skills in storytelling, state-of-the-art NLP techniques, and research methodology.
    Dataset: Some data will be provided and we will guide you for more data collection.

  • Toxicity in Google Play Store reviews: What, Where, and Why?


    Supervisor(s): Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

    Description:
    In the consumer industry, particularly in e-commerce, market basket analysis, recommender systems, and customer churn prediction are some of the most sought-after techniques because grasping consumer psychology is literally the key to minting money. There is a 90% probability that you will read reviews before buying a product on e-commerce sites (such as eBay, Amazon, etc.) [1]. Did you know that from the operations of the Play Store alone, Google generated 11 billion USD in revenue in the first quarter of 2022? In the digital market, analysis of customer reviews is the holy grail of making money. As part of this project, you will get a unique opportunity to work with actual Google Play Store review data. We will study the toxicity in customer reviews using some of the most advanced NLP techniques and cluster the reviews to understand the app genre organization (more understanding = more money). By the time you defend your thesis, you will be well versed in research question formulation, state-of-the-art NLP techniques, data scraping, data visualization (we will follow the principles of Stephen Few [2, 3]), and writing (we will motivate you to write up your findings). If you want a solid thesis on your CV that will make you stand out from the crowd, maybe this is the right project for you.
    Dataset: Some data will be provided and we will guide you for more data collection.
    Reference:
    1. The critical role of reviews in Internet trust (Trust Pilot, 2020)
    2. https://www.stephen-few.com/
    3. http://www.perceptualedge.com/examples.php

  • Behaviour analysis of city users: bikers, pedestrians and public transport.


    Supervisor(s): Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee) and Flavio Bertini (University of Bologna, Italy)

    Description:
    Bella Mossa is a program of the City of Bologna which promotes a healthy lifestyle and sustainable mobility. In 2017, the program collected information on the transport habits of people. The data set contains mobility data (latitude, longitude, timestamp) of different transportation means for the period from April 1, 2017, to September 30, 2017. In particular, the activity types include Bus, Car Share, Cycle, Train and Walk. During the 6 months of the experiment, there were over 15,000 unique users of the program, who covered 3.7 million km. Using this data, we would like to perform several descriptive analyses to study each activity type and compare them. A partial list of goals: 1) analysis of the mobility behaviour of different users: when, where, distance, duration; 2) analysis of the mobility patterns of different users, also taking into account the different areas of the city; 3) comparison among activity types (e.g., "within that area the bike is faster than the car") and extraction of the road network for each of them. We also plan to compare the results with the mobility patterns of big cities like New York (dataset will be provided).
    Dataset: Will be provided.

  • Detecting propaganda/Fake news in Telegram messages during war times (Ukrainian – Russian)


    Supervisor(s): Ivan Slobozhan and Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

    Description:
    Nowadays, more and more people are exposed to various types of fake news. Although this social phenomenon is extensively studied in the research community, propaganda, one of its subtypes, still requires careful analysis, especially during war times. In this thesis, we aim to analyse the topic of detecting propaganda messages in online communication media and their influence on people and online communities.
    A large dataset of messages and attached media (images) collected from Telegram channels and groups from different countries will be provided to analyse this topic. The dataset is mostly in Russian and Ukrainian, so knowledge of one or both of these languages is a major advantage. The student will need to examine related literature and apply data analysis, data visualisation and machine learning methods in this master's thesis.
    Dataset: will be provided.

  • Prediction of News Coverage


    Supervisor(s): Roshni and Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

    Description:
    The immense rise in online news consumption and the growing number of news sources have increased the competition among news media outlets to select news articles that can draw the most attention. The continuous influx of newsworthy events further complicates the decision of which news articles to select. To overcome this information overload, news media sources look to automated systems that can filter out news articles that are worthy of being reported the next day considering the present-day news events. This requires identifying and understanding the several factors that play a role in the selection of a news article. For example, the users’ current engagement with a news event can predict the audience's demand for the same news the next day. Additionally, news media agencies differ in both reporting and coverage of the same news event. Therefore, in this work, we intend to develop an automated system that relies on the social media engagement of the news to understand the audience pulse and that explores the different biases inherent to a news media agency, such as the stance and writing of a news event, along with the location of the event and the temporal applicability of the news event.
    Dataset:
    1. We have an annotated dataset of 300 news articles from the New York Post.
    2. For other news media agencies, we need to create the datasets. We will provide you information regarding the sites from where and how to collect new datasets.

  • SigCON: Contrastive Learning for Signed Network Link Prediction


    Supervisor(s): Roshni and Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

    Description:
    The relationships and interactions among users in online social media platforms are characterized as positive, negative, or neutral based on shared opinions and views. Recently, several research works have proposed to represent these relationships through signed networks where the nodes represent the users and the edges are positive or negative depending on the polarity of the interaction/relationship. Signed link prediction is a popular task in signed networks that predicts whether a pair of users is connected positively, negatively, or neutrally. Understanding these signed links has several applications in personalized product/news recommendations, government surveys and polls, troll detection, etc. However, sign prediction has different challenges than conventional link prediction in traditional unsigned networks. It needs to consider the integration of information from negative edges, the high imbalance in the number of positive and negative edges, and inherent characteristics of signed networks, such as structural balance theory and status theory.
    Therefore, in this work, we propose a contrastive learning-based node embedding model that induces noise specific to signed networks so that the node embeddings include specific signed network attributes. Additionally, we induce multiple forms of noise at the level of nodes and edges, comprising structural attributes and structural balance theory, thus making SigCON robust.
    Dataset: will be provided.
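
    For readers unfamiliar with structural balance theory, mentioned above, the sketch below counts balanced and unbalanced triangles in a small signed network: a triangle is balanced when the product of its three edge signs is positive. This only illustrates the theory; it is not the SigCON model, and the function name and the toy graph are assumptions made for the example.

```python
from itertools import combinations

def triangle_balance(signed_edges):
    """signed_edges: dict mapping frozenset({u, v}) -> +1 or -1.
    Returns (balanced, unbalanced) triangle counts.

    Brute force over all node triples, so only suitable for toy graphs.
    """
    nodes = sorted({n for edge in signed_edges for n in edge})
    balanced = unbalanced = 0
    for a, b, c in combinations(nodes, 3):
        sides = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if not all(s in signed_edges for s in sides):
            continue  # the three nodes do not form a closed triangle
        product = 1
        for s in sides:
            product *= signed_edges[s]
        if product > 0:
            balanced += 1
        else:
            unbalanced += 1
    return balanced, unbalanced

# Hypothetical four-user network: +1 = friendly edge, -1 = hostile edge.
edges = {
    frozenset(("A", "B")): +1,
    frozenset(("B", "C")): +1,
    frozenset(("A", "C")): +1,
    frozenset(("B", "D")): -1,
    frozenset(("A", "D")): -1,
    frozenset(("C", "D")): +1,
}
counts = triangle_balance(edges)
```

    In this toy graph, triangles A-B-C and A-B-D are balanced, while A-C-D and B-C-D are not.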

  • Detection and Orbital Computation of Resident Space Objects


    Supervisor(s): Arun Balaji Kumar and Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

    Description:
    There are more than 20,000 man-made objects larger than 10 cm floating around in near-Earth space, which pose collision threats to functional satellites. Predicting the probability of collision with these space objects is crucial from the security perspective as well as for the protection of public and private space assets of various countries. The outcome of this project will directly support the space sector by providing an operationally flexible, scalable, transparent and indigenous collision probability solution.
    Dataset: will be provided.

  • Exercise Activity Detection using Smart Watches


    Supervisor(s): Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

    Description:
    Develop a smartwatch-based application which can determine the time spent by a user in different exercise activities. Our definition of exercise activities includes both traditional and non-traditional forms of exercise. Examples of traditional forms of exercise include activities such as running, playing sports, etc. On the other hand, our definition of non-traditional forms of exercise includes daily household activities such as cooking, cleaning (mopping and sweeping), walking around the house, etc.
    Technical Objectives:
    1. Build a smartwatch application which can discern typical exercise activities from non-exercise activities (e.g., watching TV, working on a computer) using smartwatch sensor data.
    2. Energy efficiency -- the developed application should be battery efficient.
    3. Good accuracy -- the developed application should be able to discern the activities of interest with good accuracy.
    4. Adaptability -- the developed application should be able to adapt to new activities which can be specific to a user (e.g., yoga poses, gardening, etc.).

    Potential Value to Society: Building such an application is of value in health monitoring use cases. Such an application can give insights into the total daily time spent in exercise activities and thus, can help a user plan and execute a healthy lifestyle in the long run. Such an application could also be of great use in modern day where several people have long working hours, thus leaving very little time for any formal form of exercise. In such situations, doing home chores (e.g., cooking and cleaning) may perhaps be the only form of exercise that many people may get in a day. Thus, any application which can keep track of exercise activities (via home chores) could be of value in keeping track of total time spent in doing physical activities in a home setting.
    Skillset:
    • Essential: Programming in Java, Python, knowledge of ML techniques
    • Desirable: Experience in developing Android applications and Deep learning techniques

  • Integration of Large Language Model with Knowledge Graphs


    Supervisor(s): Mohit Mayank and Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

    Description:
    Large language models (LMs) are trained on a humongous amount of unstructured data and perform exceptionally well on NLP tasks. On the other hand, knowledge graphs (KGs) are used to store and process structured data and work well for relational tasks. Recent research shows that combining the two techniques can lead to further improvements, where the LM acts as the language generator and the KG acts as the database. Our work will focus on enhancing these approaches to further improve the state of the art.
    Useful Links:
    1. KELM: Integrating Knowledge Graphs with Language Model Pre-training Corpora
    2. Reasoning with Language Models and Knowledge Graphs for Question Answering
    3. Language Models are Open Knowledge Graphs
    4. Enhanced Story Comprehension for Large Language Models through Dynamic Document-Based Knowledge Graphs
    5. SKILL: Structured Knowledge Infusion for Large Language Models

  • Multi-modal Conversational Live Emotion Analysis


    Supervisor(s): Mohit Mayank and Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

    Description:
    Emotions are the reactions that human beings experience in response to events or situations. Live emotion detection has great application in the sales/marketing and customer relationship domains. Current emotion detection methodologies are mainly focused on the stateless textual domain. Our work will be on enhancing emotion detection by (1) being stateful, i.e. considering the context of what was said before in a conversation, and (2) considering multi-modality: text, audio and/or video.
    Useful Links:
    1. Emotion Recognition in Audio and Video Using Deep Neural Networks
    2. Emotion detection by voice
    3. Speech Emotion Recognition with Convolutional Neural Network
    4. Multi-modal Residual Perceptron Network for Audio-Video Emotion Recognition

  • Public Vs. Leaders: Understanding the Ukrainian Immigration in Estonia through social media platforms.


    Description:
    The war in Ukraine has resulted in the immigration of Ukrainians to several EU countries, including Estonia. In this thesis, we would like to study the discourse of Leaders (Politicians) and the Public through online social media platforms like Twitter. By collecting public data and applying data science techniques, we would like to understand how Leaders and the Public reacted to immigration in Estonia.
    Dataset: We will help with collecting the dataset.
    Language preference: Good Estonian language knowledge would be helpful in the thesis.

  • An Empirical Study on evolving communities on a dynamic temporal transportation network: a use case of pre/post-COVID traffic dynamics


    Supervisor(s): Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee) and Angelo Furno

    Description:
    Traffic prediction has been studied in the past by using various techniques, such as traffic flow models, statistical methods and machine learning techniques [1]. In this work, we are interested in modelling road networks as complex networks, by taking inspiration from social network approaches. Our objective is to model traffic dynamics (measures of speed/flows related to vehicles transiting over the different road segments of a given city’s transportation network) as weights of a temporal evolving weighted graph. In the model, nodes will represent road intersections, links will represent road segments and edge weights will represent timestamped information related to traffic dynamics as regularly observed over each road segment.
    Dynamic communities, computed on this kind of graph-based model, are expected to provide indications of congested states of the transport system as well as to represent an innovative and powerful tool for detecting spatio-temporal patterns. In this work, we plan to leverage datasets on GPS traces of vehicles available for the city of Lyon, France, and identify dynamic communities of road segments (or geographical aggregates of them) that evolve similarly over time, i.e., exhibit similar traffic dynamics. In particular, we are interested in evaluating normal traffic routines vs. abrupt ones (for example, COVID-related situations could be investigated to analyze, e.g., pre/post-lockdown road traffic patterns).
    Dataset: will be provided.
    Related works:
    • Nagehan İlhan, Şule Gündüz Öğüdücü, Feature identification for predicting community evolution in dynamic social networks, Engineering Applications of Artificial Intelligence, Volume 55, 2016, Pages 202-218
    • Xu K.S., Kliger M., Hero A.O. (2011) Tracking Communities in Dynamic Social Networks. In: Salerno J., Yang S.J., Nau D., Chai SK. (eds) Social Computing, Behavioral-Cultural Modeling and Prediction. SBP 2011. Lecture Notes in Computer Science, vol 6589. Springer, Berlin, Heidelberg
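
    The graph model described for this topic (intersections as nodes, road segments as edges, timestamped traffic measurements as edge weights) can be sketched in a few lines. The class and method names, the use of mean speed as the weight, and the sample values are assumptions chosen for illustration; they do not describe the Lyon dataset or a prescribed design.

```python
from collections import defaultdict

class TemporalRoadGraph:
    """Nodes are road intersections, edges are road segments, and each edge
    stores a series of timestamped speed observations."""

    def __init__(self):
        # (from_node, to_node) -> list of (unix_seconds, speed_kmh)
        self._observations = defaultdict(list)

    def add_observation(self, u, v, timestamp, speed_kmh):
        self._observations[(u, v)].append((timestamp, speed_kmh))

    def snapshot(self, t_start, t_end):
        """Weighted graph for the window [t_start, t_end): mean speed per
        segment, omitting segments with no observation in the window."""
        weights = {}
        for edge, series in self._observations.items():
            window = [speed for t, speed in series if t_start <= t < t_end]
            if window:
                weights[edge] = sum(window) / len(window)
        return weights

# Hypothetical observations on two segments over two hours.
g = TemporalRoadGraph()
g.add_observation("I1", "I2", 600, 30.0)
g.add_observation("I1", "I2", 1800, 50.0)
g.add_observation("I2", "I3", 4000, 20.0)
first_hour = g.snapshot(0, 3600)
second_hour = g.snapshot(3600, 7200)
```

    A sequence of such snapshots is one possible input for a dynamic community detection method, which would then group segments whose weights evolve similarly.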

  • Reinforcement Learning (RL) based Fake news detection using Open domain Knowledge Graph (KG)


    Supervisor(s): Mohit Mayank and Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

    Description:
    Reinforcement learning has been used to create agents which play games, control robots or even trade stocks. One commonality between all these agents is their capability to take efficient actions given an environment and a current state. Such generality has led to the use of RL in KG-based tasks, where the agents are trained to traverse the KG in search of answers, rules or explanations. Our intention in this project is similar, i.e. to leverage the generic capability of RL-based agents to advance the fake news detection task with a KG.
    Dataset: will be provided.
    Related works: (not exactly related, but in the same direction)
    • Weak Supervision for Fake News Detection via Reinforcement Learning https://arxiv.org/abs/1912.12520
    • Social Reinforcement Learning to Combat Fake News Spread http://proceedings.mlr.press/v115/goindani20a/goindani20a.pdf

  • Understanding AI's concerns in society through social media discourse.


    Supervisor(s): Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee) and Tymofii Brik

    Description:
    AI has brought much-needed advantages to society, such as i) machine translation (Google Translate) and ii) autonomous cars, to name a few. However, at the same time, many are also concerned about the intrusion of AI into their lives. One of their main concerns, for example, is AI taking over human jobs. In this thesis, we will try to understand the discourse regarding concerns about AI (ethical, societal, etc.) in human society. In particular, we will apply Natural Language Processing (NLP) techniques to study this discourse. This topic is available at both bachelor's and master's level.
    Dataset: will be provided.
    Related works:
    • https://link.springer.com/article/10.1007/s00146-020-00965-5
    • https://arxiv.org/ftp/arxiv/papers/1907/1907.07892.pdf

  • Identifying relevant comments for predicting rumoured Tweets


    Supervisor(s): Shakshi Sharma (shakshi dot sharma [ät] ut [dot] ee) and Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

    Description:
    People extensively engage these days on online social networks such as Twitter by posting messages, comments, likes and dislikes. It has been noticed that not all of the users' responses (comments) are relevant to the original (source) tweet, and that these engagements assist in the spreading of rumours. In this thesis, we will focus on comments for classifying tweets as rumour or non-rumour by applying our methodology to a publicly available dataset. By first performing exploratory data analysis, we will understand the data and extract various useful features. Next, using a transformer-based approach, we will try to identify relevant comments. We will also evaluate our model using similarity matching approaches (such as edit distance and cosine similarity).
    Dataset: will be provided.
    Related works:
    • Zubiaga, Arkaitz, et al. "Towards detecting rumours in social media." Workshops at the Twenty-Ninth AAAI conference on artificial intelligence. 2015.
    • Sharma, Shakshi, and Rajesh Sharma. "A Graph Neural Network based approach for detecting Suspicious Users on Online Social Media." arXiv preprint arXiv:2010.07647 (2020).
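
    The two similarity measures named in this topic's description, edit distance and cosine similarity, can be sketched as follows for comparing a comment with its source tweet. The whitespace tokenisation, the bag-of-words representation and the example strings are simplifying assumptions for illustration, not the evaluation protocol of the thesis.

```python
import math
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming,
    keeping only the previous and current rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cosine_similarity(text_a, text_b):
    """Cosine similarity of bag-of-words term counts
    (lower-cased, split on whitespace)."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical source tweet and two replies.
tweet = "bridge closed after flooding in the city centre"
on_topic = "is the bridge in the city centre still closed"
off_topic = "check out my new profile picture"
sim_on = cosine_similarity(tweet, on_topic)
sim_off = cosine_similarity(tweet, off_topic)
```

    Here the on-topic reply shares several words with the tweet and scores well above the off-topic one, which shares none.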

  • Understanding Decentralized protests in Nigeria


    Supervisor(s): Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee) and Tymofii Brik

    Description:
    End SARS, a decentralized movement that started in the fall of 2020, was a series of mass protests against police brutality in Nigeria. In this thesis, we will use Twitter data and employ NLP and other data science techniques to understand the peculiarities of these decentralized protests from a region that is not well represented in research. Previous works, which used Facebook or Twitter data, have mainly studied protests that were centralized in nature.
    Dataset: will be provided.

  • Understanding users' preference for languages on online social media


    Supervisor(s): Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee) and Tymofii Brik

    Description:
    Common knowledge suggests that users prefer a particular language to communicate online. It could be a native language or an international "lingua franca". Sometimes, switching between a native and an international language is easy, e.g. when the languages belong to the same family (Romance, Germanic, Slavic). However, sometimes users switch to very distant languages. The goal of this thesis is to investigate three patterns:
    • When people use other alphabets for transliteration of their native words.
    • When people use other alphabets to communicate universal symbols (citations, memes, references).
    • When people use other alphabets to genuinely speak a foreign language.
    Dataset: will be provided. The anonymised dataset is from a Facebook page dedicated to the Euromaidan revolution. As the posts and comments are in Ukrainian and Russian, knowledge of these languages is preferred.

  • Understanding social media users and their relation with hate speech


    Supervisor(s): Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee) and Tymofii Brik

    Description:
    Recently it has become relatively easy to detect hate speech. There are libraries of obscene words and negative connotations that are used to train models. However, little is known about how people change their online behavior after they are exposed to hate speech. We suggest exploring the following questions:
    • Do people react to hate speech or ignore it?
    • Do people distinguish shades and flavors of hate speech?
    • Do people adopt and apply hate speech themselves, and is this sticky?
    Dataset: will be provided. The anonymised dataset is from a Facebook page dedicated to the Euromaidan revolution. As the posts and comments are in Ukrainian and Russian, knowledge of these languages is preferred.

  • Media Perception of events and personalities across borders


    Supervisor(s): Raul Sirel (TEXTA) and Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee)

    Description:
    Media is often credited with shaping the perception of various events, personalities and brands, to name a few. In this thesis, we will analyse a large corpus from multiple European news agencies, including Sputnik and Euronews, related to various entities (for example, the Belarusian protests, Apple as a brand, public figures such as politicians, etc.). The thesis will investigate how different media houses perceive entities (topics, personalities, brands, etc.). A cross-cultural analysis using NLP (sentiment analysis, topic modeling, etc.) will be employed.
    Dataset: will be provided.

  • A Transformer-based approach for Detecting Hateful Speech on Twitter


    Supervisor(s): Rajesh Sharma (rajesh dot sharma [ät] ut [dot] ee), Shakshi Sharma (shakshi dot sharma [ät] ut [dot] ee), and Neha Sharma (neha dot sharma [ät] ut [dot] ee)

    Description:
    Humans have grown more reliant on social media platforms (SMPs) such as Twitter, Facebook, Gab, and others to acquire information and voice their opinions. Although these platforms were developed for this purpose, a few bad actors take advantage of them for their own nefarious objectives [1]. As a result, these platforms face issues such as the propagation of disinformation, fake news and hate speech. This thesis will focus on hate speech detection. To maintain the dignity of these platforms and to stop such anti-social behaviours, various hate speech detection techniques [2] have been developed using machine learning and deep learning [3,4]. These approaches are insufficient because training the models requires a huge amount of annotated data. The goal of this thesis is to address this limitation by applying a transfer learning approach [5] to transformer models such as XLNet and GPT-3.
    Dataset: will be provided.
    Related works:
    • Aluru, S. S., Mathew, B., Sinha, P., & Mukherjee, A. (2020). Deep Learning Models for Multilingual Hate Speech Detection. arXiv, https://arxiv.org/abs/2004.06465
    • Davidson, T., Bhattacharya, D., & Weber, I. (2019). Racial Bias in Hate Speech and Abusive Language Detection Datasets. arXiv, https://arxiv.org/abs/1905.12516
    • Badjatiya, P., Gupta, S., Gupta, M., & Varma, V. (2017). Deep Learning for Hate Speech Detection in Tweets. arXiv, https://arxiv.org/abs/1706.00188
    • Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Automated Hate Speech Detection and the Problem of Offensive Language. arXiv, https://arxiv.org/abs/1703.04009