John Asigbekye and Meghana Rao
For our final project, we wanted to understand how information is collected, spread, and stored on Twitter. The goal was to create a short and engaging podcast series on how Twitter collects and uses its users’ data, and to learn which tools the platform uses to maintain its databases. We were motivated to pursue this topic because we both enjoy using Twitter but had not given much thought to what data the platform collects and how that data is used. Please listen to our podcasts here: https://drive.google.com/open?id=1OVlC8Z8QrcSqgg0XshK6upca5P_-Rc19
Episode 1: What information does Twitter store about its users?
In our first podcast episode, we answered the question “what information does Twitter store about its users?” Through our research, we were surprised to learn that Twitter stores data about non-registered users as well. For example, if a person without an account clicks on a tweet, Twitter stores their IP address, device type, and timezone. If you have an account, Twitter stores this information about you and additionally parses your direct messages. It stores whom a user has communicated with, the links and photos in the messages, and other related information, and uses this to show the user more relevant content. Twitter also saves all of a user’s tweets, whether they are public or private. Even if a user deletes their tweets, search engines and other third parties may still retain copies of them.
Twitter personalizes a user’s content by generating an “inferred identity” from the information it collects. A user can go to their settings, then content preferences, and then personalization and data. Here, they can turn off all personalization and data services or turn off individual personalization services. For example, a user can stop Twitter from using their location data and from making personalizations based on inferred identity. A user can also stop Twitter from collecting information on where they interact with Twitter content across the web. This control was granted to the user starting on January 1, 2020, when the California Consumer Privacy Act (CCPA) took effect. The law mandates that large businesses give consumers transparency and control over their personal information and let them opt out of having their data sold to third parties.
Episode 2: How does Twitter curate your experience with the information they collect about you?
In our second episode, we answered the question, “how does Twitter curate your experience with the information they collect about you?” We began by understanding the main Twitter feed and how tweets are ordered. When Twitter launched in 2006, it used a reverse chronological feed; in 2014, it introduced recommendations and top tweets. Today, Twitter users are greeted by an algorithmically ordered feed by default. The exact workings of the algorithm are not disclosed, but Twitter has shared some of the inputs to its main feed ranking algorithm. For example, input signals include the recency of a tweet, its engagement (number of retweets, who shared it), the type of media it includes (GIFs, images, etc.), how long the user has been away from the site, and the location of the tweet’s author relative to the user. While the algorithm remains a black box, Twitter does provide tips on how to increase the visibility of one’s tweets. For example, Twitter encourages users to tweet more often to increase their visibility in their followers’ feeds, because frequent posting signals the algorithm to display their content more often.
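To make the idea of input signals concrete, here is a minimal sketch of how a feed could combine a few of them into a single score. This is not Twitter’s algorithm, which is not public. The Tweet fields, the weights, and the way time away shifts the balance between recency and engagement are all assumptions we made up for illustration.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Tweet:
    text: str
    posted_at: datetime
    retweets: int
    has_media: bool


def score(tweet: Tweet, now: datetime, hours_away: float) -> float:
    """Combine a few publicly described signals into one number.

    Every weight here is hypothetical and chosen only for illustration.
    """
    age_hours = (now - tweet.posted_at).total_seconds() / 3600
    recency = math.exp(-age_hours / 24)      # newer tweets score higher
    engagement = math.log1p(tweet.retweets)  # diminishing returns on retweets
    media_bonus = 0.5 if tweet.has_media else 0.0
    # Assumption: the longer the reader was away, the more engagement
    # matters relative to recency (a "while you were away" effect).
    away_weight = min(hours_away / 24, 1.0)
    return (1 - away_weight) * recency + away_weight * engagement + media_bonus


def rank_feed(tweets, now, hours_away):
    """Return tweets ordered from highest to lowest score."""
    return sorted(tweets, key=lambda t: score(t, now, hours_away), reverse=True)


if __name__ == "__main__":
    now = datetime(2020, 3, 1, 12, 0)
    feed = [
        Tweet("old but popular", now - timedelta(hours=48), 1000, False),
        Tweet("new and quiet", now - timedelta(hours=1), 0, False),
    ]
    for hours_away in (0, 48):
        order = [t.text for t in rank_feed(feed, now, hours_away)]
        print(hours_away, order)
```

In this toy model, a reader who just checked the site sees the newest tweet first, while a reader returning after two days sees the popular tweet first. A reverse chronological feed would be the special case where only posted_at matters.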
While the main feed orders tweets algorithmically by default, there are ways a user can reclaim control over their feed. For example, in 2019 Twitter reintroduced a feature that lets the user revert to a reverse chronological feed. Additionally, a user can go to Settings and Muted Words and add specific strings that limit how the main feed algorithm works. For example, when the string “suggest_pyle_tweet” is added to muted words, the algorithm will stop serving tweets on a user’s feed because mutuals engaged with them. If the string “suggest_who_to_follow” is added, the algorithm will stop providing follow suggestions to the user. This list of strings can be found here and is not widely known, but it can help a user reclaim control from the algorithmic feed order.
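We do not know how Twitter implements this internally. One plausible mental model is that each injected feed item carries a label describing why it was suggested, and the muted words list is matched against that label as well as the tweet text. The sketch below follows that assumption; the FeedItem class and its suggestion_type field are hypothetical names of our own.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FeedItem:
    text: str
    # Hypothetical label saying why the item was injected into the feed.
    # None means it came directly from an account the user follows.
    suggestion_type: Optional[str] = None


def apply_muted_words(feed: List[FeedItem], muted_words: List[str]) -> List[FeedItem]:
    """Drop items whose text or (assumed) suggestion label matches a muted word."""
    muted = {word.lower() for word in muted_words}
    kept = []
    for item in feed:
        label = (item.suggestion_type or "").lower()
        if label in muted:
            continue  # injected item of a muted suggestion type
        if any(word in item.text.lower() for word in muted):
            continue  # ordinary muted-word match on the tweet text
        kept.append(item)
    return kept


if __name__ == "__main__":
    feed = [
        FeedItem("Tweet from someone you follow"),
        FeedItem("Tweet your mutuals liked", suggestion_type="suggest_pyle_tweet"),
        FeedItem("Accounts you may like", suggestion_type="suggest_who_to_follow"),
    ]
    for item in apply_muted_words(feed, ["suggest_pyle_tweet"]):
        print(item.text)
```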
Data Storage on Twitter:
We were also interested in understanding the infrastructure for data storage at Twitter. Every day, Twitter caches, stores, and analyzes hundreds of millions of tweets. In 2010, Twitter was using third-party colocation data centers. As traffic increased, Twitter moved to a distributed infrastructure. This infrastructure has to be highly scalable because traffic grows faster than Twitter can redesign its data centers. It also has to be prepared for world events that cause intense traffic surges. Twitter currently uses over 3,000 unique networks in data centers across the world.
The storage infrastructure consists of Hadoop clusters, Manhattan clusters, graph clusters, and Blobstore clusters. Hadoop clusters are collections of nodes that store and analyze structured and unstructured data in a distributed manner, and Twitter runs some of the largest Hadoop clusters in the world. At Twitter, the Hadoop filesystem (HDFS) stores over 300 petabytes of data across thousands of servers. Manhattan is a key-value store that Twitter designed for fast response times. It is used for serving tweets, DMs, and advertisements.
Twitter also stores a graph of the relationships between users: followers, whom a user receives notifications from, whom a user follows, etc. To store this graph, Twitter designed the database FlockDB. The graph is a set of edges between user ID nodes. The edges are time-stamped so that a user’s followers list can be displayed in chronological order.
Blobstore clusters are used to store the photos and videos in tweets. When a photo is tweeted, it first goes to a set of Blobstore servers where the photo is written and stored. Photos are hashed to virtual buckets that map to physical storage buckets, which allows for efficient retrieval. Together, these clusters are central to Twitter’s data storage infrastructure.
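Two of the ideas above, time-stamped graph edges and hashing to virtual buckets, can be shown in a few lines of code. The following is a toy in-memory sketch and not how FlockDB or Blobstore are actually implemented. The class and function names, the bucket count, the choice of SHA-256, and the round-robin bucket assignment are all assumptions of ours.

```python
import hashlib
from collections import defaultdict
from datetime import datetime


class FollowGraph:
    """Toy graph of follow edges, each stamped with its creation time."""

    def __init__(self):
        # followee id -> list of (timestamp, follower id) edges
        self._followers = defaultdict(list)

    def follow(self, follower_id: int, followee_id: int, at: datetime) -> None:
        self._followers[followee_id].append((at, follower_id))

    def followers(self, user_id: int) -> list:
        """Follower ids ordered from the earliest follow to the latest."""
        edges = self._followers.get(user_id, [])
        return [follower_id for _, follower_id in sorted(edges)]


NUM_VIRTUAL_BUCKETS = 1024  # hypothetical value, chosen for illustration


def virtual_bucket(blob_key: str) -> int:
    """Hash a blob key (for example a photo id) to a virtual bucket number."""
    digest = hashlib.sha256(blob_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_VIRTUAL_BUCKETS


def build_bucket_map(servers: list) -> dict:
    """Assign every virtual bucket to a physical server (round-robin here)."""
    return {b: servers[b % len(servers)] for b in range(NUM_VIRTUAL_BUCKETS)}


def locate(blob_key: str, bucket_map: dict) -> str:
    """Find the server holding a blob without scanning any server."""
    return bucket_map[virtual_bucket(blob_key)]


if __name__ == "__main__":
    graph = FollowGraph()
    graph.follow(follower_id=7, followee_id=1, at=datetime(2020, 2, 1))
    graph.follow(follower_id=3, followee_id=1, at=datetime(2019, 6, 1))
    print(graph.followers(1))

    bucket_map = build_bucket_map(["blob-a", "blob-b", "blob-c"])
    print(locate("photo-123.jpg", bucket_map))
```

The point of the extra level of indirection is that moving data between machines only requires updating the bucket map. The hash of a photo never changes, so lookups stay a single hash plus a single table lookup.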
We hope you have enjoyed this deep dive into how information is collected and stored on Twitter. Please listen to our podcast here! https://drive.google.com/open?id=1OVlC8Z8QrcSqgg0XshK6upca5P_-Rc19
Thank you for the wonderful quarter.