Abstract:
Social media gives its users a large-scale, easy-to-use, and foolproof platform for communicating and socializing that traditional media (such as newspapers and television) cannot provide. The platform builds on the technological foundations of Web 2.0, which enable collaboration and data sharing among Internet users, and it operates as a group of software applications that allow user-generated content to be shared. Social media users face two important problems when using this platform. First, when users receive data (user-generated content) through social media software, they may not know its quality. They therefore cannot be sure how reliable and correct the data is, how much weight to give it, or whether they should help disseminate it. As a result, information pollution can arise. Second, social media software may change its privacy policies over time, so users may not be able to configure their privacy settings to match the level of privacy they require. These policies determine the copyrights of users' shared data. Data that a user intends to share only within a circle of friends may be disseminated further through re-sharing within social media. Users are not aware of who can actually see their data or process it. As a result, problems such as copyright violations can arise. To solve these two problems, users need information on the lifecycle of social media data. Provenance is defined as metadata that describes the origin, validity, quality, and ownership of data. Currently, there is a lack of methodologies for detecting information pollution and copyright violations involving users' shared data.
The goal of this project is to develop methodologies that collect, store, query, and analyze the provenance of social media data, with a focus on algorithms and methods for detecting information pollution and copyright violations of shared data. As a first step toward this goal, we developed algorithms and evaluated their correctness. We studied multiple systems for querying and storing provenance to measure their scalability and performance on large volumes of data. We then created an abstract provenance data model, built by extending the PROV-O ontology, that can describe social interactions on different social network platforms. Using this model, we created a large-scale synthetic social provenance dataset, which we used to test and evaluate the proposed algorithms. We also tested the predictive capability of our misinformation detection algorithm against a real-life dataset. The results indicate that the proposed algorithms produce promising outcomes.