Creating A Smell Print Through Artificial Scents

Blog, Journal for High Schoolers, Journal for High Schoolers 2021


Jude Kamal, April Mares, Serena Pulopot, Jasmine Vargas, Victoria Whipple, Caeley Woo, Devon Baur


Smell is the oldest of humanity’s five senses, and it’s used for almost everything: identification, safety, emotions, relationships, and memories. Despite this, it’s one of society’s most undervalued senses [20]. As a result of the COVID-19 pandemic, people are increasingly beginning to recognize the importance of our olfactory system. To help continue highlighting the necessity of smell, we created an installation in the Architecture and Design Museum in Los Angeles. By exposing museum-goers to a new manufactured smell free of cultural biases, the installation attempts to craft a unique smell experience connected to each visitor’s memory of the event, a phenomenon known as a smell print. In addition, the installation encourages visitors to think about smells and space in a way they hadn’t before.


Smelling is an incredibly quick process: in just two synapses, smell travels to the highest region of the brain [25]. The regions of the brain involved with olfaction are connected to its memory and emotion centers. As a result, smell and memory often become intertwined, with each person having their own unique smell print — the specific memory that they have attached to a smell [21]. Often, these memories go back many years, sometimes all the way to early childhood; in this respect, smell-related memories diverge from typical memory patterns, as most people experience a period of pre-adolescent amnesia in which most memories from before adolescence are inaccessible.

However, most people are unaware of this link between smell and memory because they tend to undervalue smell as a whole. In a 2019 survey of adults, smell was the sense that people would be most willing to lose [20]. These results weren’t limited to adults: in a 2011 survey, 53% of young people (ages 16-22) said they would rather lose their sense of smell than their technological devices [20]. Part of the reason for this lack of appreciation is that although we process smell almost instantly, smell often doesn’t pass through the conscious part of our brain. Smells we’ve experienced hundreds of times before — like our bedroom or car — our brain has deemed safe, and we no longer consciously process them, a phenomenon known as “nose-blindness.” Other times, we become so bombarded by different smells that we can no longer distinguish any individual smell. Since we often aren’t consciously aware of what we’re smelling, people mistakenly take this to mean that smell is unimportant.

But for people whose sense of smell is altered, the value of smell can’t be overstated. Parosmia is a condition in which smells are distorted, like coffee smelling like rotting garbage. Parosmia can lead to a loss of identity, as people no longer feel like they are living their own life [9]. It can also impair daily life, causing people to avoid eating and anything else that triggers unpleasant smells. Similarly, anosmia, the inability to smell, can have extremely negative impacts. Sadly, anosmia and depression are linked because both conditions involve some of the same brain regions, and in studies done on rats, impaired olfactory bulbs actually resulted in chemical changes in serotonin and dopamine [36]; as a result, anosmia both mimics the symptoms of depression and dovetails into it. Despite these incredibly real issues, many parosmics and anosmics report feeling misunderstood by friends and family who don’t grasp the severity of their condition; others even report doctors being unaware of what anosmia and parosmia were [28].

Now, with COVID-19, these once niche terms of “anosmia” and “parosmia” are becoming more commonplace as select COVID-19 cases result in both anosmia and parosmia. Collaborating with Los Angeles’ Architecture and Design Museum, we hope our Smell. Print. exhibit keeps the conversation about smell going and exposes more people to the tremendous importance of smell.

Methods & Materials

To efficiently create the exhibit, our group divided into two teams: research and art.

Research Team: Smell Stimulus Survey

The research team’s initial objective was to analyze research articles, focusing on those that were survey-based. Our mentor instructed us to find inspiration for research questions we could use in our own study: to look for the research we found most interesting and to note any information gaps so that we could add to that body of knowledge. Inspired by the question we were given — “How can we explore the implications of different scents?” — we built our own research questions.

Through Zoom, we stayed in constant communication with the other members of the group. We used Google Documents to record our initial survey responses from our first trial set of questions. Our original questions were as follows: does this smell take you back to a memory; is it one memory or multiple; how strong is this memory; what is the date of the memory; and from what point of view did you see the memory. Each question had to be answered based on an item’s odor, and the items initially included sunscreen, printer paper, an old book, chapstick, laundry detergent, and a blown-out match.

Using Google Documents, we recorded our responses in a table and answered each question. From our responses, we noticed that some questions were unclear and some items were difficult to find; therefore, over Zoom we gave our mentor feedback about our experience answering the initial questions. From there, the research team modified the questions, rewriting them so they could be answered specifically rather than in the unstructured way they were originally posed. The questions became: does this smell take you back to a memory; is it one memory or multiple; how vivid is the memory (on a scale of slightly to very vivid); what is the date of the memory; how does this memory make you feel (on a scale of 1 [very bad] to 10 [very good]); how long did it take you to remember; and what is the significance of the memory (on a scale of slightly to very).

Once the questions had been modified, we made another trial run, exploring two different modes: survey and interview. Using Google Forms, we added the new questions along with the new items to smell, including sunscreen, printer paper, an old book, chapstick, laundry detergent, and dirt. For the interview approach, two members of the group were asked the same questions as on the Google Forms survey but recorded their responses verbally. We did this to understand which method would give us better results. The survey proved more convenient given our time constraints and the number of participants we hoped to reach. Based on the two methods, our group members gave us feedback on the questions and the items they were asked to smell. We decided to eliminate printer paper and chapstick, replace them with fresh herbs and spices, and add two more items: citrus and crayon. With the new items, we also modified the questions to be clearer and more understandable. The new questions became: please describe the memory in a short paragraph; how long did it take you to remember; is it one specific memory or many memories that are difficult to place; what is the date of the memory; and what brand/item did you use. These questions were then made into a Google Forms survey, which we sent out to as many people as possible.

After 48 hours of collecting data, we obtained over 40 responses. We carefully analyzed the responses to find the common words participants used to describe the memories they associated with each item’s odor. From these common words, we created word maps displaying the words participants most frequently used to describe the memory associated with each item they smelled. We also recorded the date of each memory and the time it took each participant to remember, and added these to the word maps. Of the eight word maps created, those for sunscreen, fresh herbs, old books, and laundry detergent were displayed physically in the exhibit, with the rest available in the digital exhibit.

The scent machine, a key attraction of the exhibit meant to aid museum-goers’ experience of scent, runs on Python code. For the device to function, it must be given raw serial commands. Natalie Cygan, a Stanford student responsible for coding our machine, and Richard Hopper, a Research Software Engineer from SCHI Labs, provided insight into the function of the device: the machine switches off every sixty seconds and must be controlled using an iMac.
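
As a rough sketch of how such a device might be driven (the command strings, channel numbering, and device path below are assumptions for illustration, not the machine’s actual protocol), raw serial commands can be sent from Python with the pyserial library:

```python
import time

def make_command(channel: int, on: bool) -> bytes:
    """Format a raw serial command for one scent channel.

    The "CH<n>:ON/OFF" syntax is a hypothetical placeholder; the real
    machine's command set was not published with the exhibit.
    """
    state = "ON" if on else "OFF"
    return f"CH{channel}:{state}\n".encode("ascii")

def run_scent_loop(port, channel: int, cycles: int, period_s: float = 60.0):
    """Re-send the ON command once per cycle, since the device
    switches itself off every sixty seconds."""
    for _ in range(cycles):
        port.write(make_command(channel, True))
        time.sleep(period_s)

# Example usage from the controlling iMac (requires pyserial; the
# device path is an assumption):
#
#   import serial
#   with serial.Serial("/dev/tty.usbmodem1", 9600, timeout=1) as port:
#       run_scent_loop(port, channel=1, cycles=5)
```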

Design Team

QR Code and Space Proposal

We decided to create an interactive component for the exhibit. To do so, we displayed QR codes around the exhibit space, following figure 1’s structure. Figure 2 shows how we visualized the exhibit space.

FIGURE 1 ( A visitor would use their mobile phone to scan a QR code)
FIGURE 2 (Exhibit Space Visualization)

Smemory Theater Video Approach

As part of the installation, we decided to create a narrative-style video to illustrate the experience of becoming anosmic, particularly the gaps in memory that form as a result of anosmia. One of the main components of the video was a miniature kitchen (shown in figure 3) that was built dollhouse style: the walls of the kitchen were made out of cardboard that we painted white; the floor and shelves were made out of popsicle sticks we cut in half; cereal boxes, toothpicks, and beads were used to make the kitchen cabinets; the tile backsplash and marble countertops were printed out; the stove top was made out of cardstock and silver foil; the fridge, dishwasher, and sink were made out of silver foil and toothpicks; the chairs and kitchen table were made of chopsticks, toothpicks, cardboard, cardstock, and nail polish; the gallery wall was made out of our photos and pictures of our paintings that we printed out; small items around the kitchen were made of clay, cardstock, popsicle sticks, chapstick caps, and scrabble tiles. Similarly, the playground slides used during the memory sequence were built in a miniature style out of chopsticks, toothpicks, and cardstock.

FIGURE 3 (The kitchen used in the Smemory video)

The other main component of the video comprised memories that were shot on an iPhone 8, featuring members of our group and our families. After Effects was used to stylize the videos and to digitally change the color of select objects. To create the final video, iMovie was used to splice the clips together. Sniffing sound effects were added, and a recording of the hr-Sinfonieorchester, conducted by Lionel Bringuier, playing the Allegro con grazia from Tchaikovsky’s 6th Symphony was used for the background music.

Animations Proposal And Approach

To create our frame-by-frame animated videos, we learned and explored the Adobe Creative Cloud programs, including Adobe After Effects, Adobe Premiere Pro, Adobe Illustrator, Adobe Animate, and Adobe Media Encoder. We also used iMovie to splice some of our clips together. For file compression we used an application called HandBrake. We chose the Adobe Creative Cloud programs over an open-source alternative, Blender, because the Adobe ecosystem offers program-adaptive files that make the creation process more efficient. Figure 4 includes an example of our storyboards for the animations.

FIGURE 4 (“The Beach”; a final storyboard for our “Sunscreen” scent.)


Research Team

GRAPH 1 (How good do you think your sense of smell is? This graph shows the first question the Google survey asks. The x-axis is the scale from 0-5, 0 representing very weak and 5 representing very strong. The y-axis is the number of participants. The bars show the frequency of people who selected that value.)

Final QR Code Designs and Exhibit Space

Using Adobe Illustrator, we designed QR code frames (Figure 9). Within each frame sat a QR code linked to a webpage on the main exhibit’s website (Figure 10). The QR codes were created with an online QR code generator. We settled on these designs to make it obvious that visitors were meant to scan the codes.

FIGURE 9 (Icons outlined in pink represent our eight scents. Icons outlined in blue represent the scents physically displayed at the exhibit. Icons outlined in green are not scent related: a. Represents Laundry Detergent b. Represents Old Books c. Represents Sunscreen d. Represents Fresh Herbs e. Represents Spices f. Represents Dirt g. Represents Lemons h. Represents Crayons i. Link to Main Website j. Link to Smemory Theater Video)
FIGURE 10 (Example of how a QR code would be placed within a QR code frame)

Our final physical exhibit space is in Los Angeles, California in cooperation with Los Angeles’ Architecture and Design Museum. The exhibit runs from August 6, 2021 to August 20, 2021 on select days. Figure 11 displays the final organization of our work.

Final Smemory Theater Video

In the end, the Smemory Theater video explored the associated memories someone might have with the smell of certain objects in a space that’s very familiar to them: their kitchen. The video moved through the kitchen, focusing on different objects — coffee, lemons, dirt in a flower pot, a cookbook, and paprika — as the unseen narrator smelled them. These smells then each triggered their own memory, and the video cut away to “Smell-Memory Land,” which was signified by a slide that transported the viewer to the “Smemory Theater.” Here, different memories played out: stirring coffee, making lemonade, walking in a forest, reading an old book, and sprinkling paprika on a deviled egg.

However, the video soon took a sad turn. The narrator lost their sense of smell, and with it, all the smell-related memories. To signify this, the objects turned gray, and when the video cut to Smell-Memory Land, the landscape was gray as well because the memory was no longer there; now, there was something preventing the memory from playing out — a missing slide, a broken ladder, a toppled slide, or no Smemory Theater at all. With all the smell-related memories gone, the entire kitchen faded to gray; it was now nothing more than a kitchen.

Final Smell Print Animations

We created eight frame-by-frame animated videos, one for each smell: Sunscreen, Fresh Herbs, Laundry Detergent, Old Books, Crayons, Spices, Lemons, and Dirt. Figure 16 displays shots of the animations and explanations of the audio.



SUNSCREEN: “A Day at the Beach”

AUDIO: Ambient Beach Sounds

Mother: “Time to put sunscreen on!”

Child: “No, no, it’s so cold!”

FRESH HERBS: “Helping in the Kitchen”

AUDIO: Pasta Boiling

Friend 1: “What’s the next ingredient?”

Friend 2: “I think basil would be a good addition, go get some from the plant please.”

Friend 1: “Okay!”



LAUNDRY DETERGENT

Voice 1: “Hey, you made it!”

Voice 2: “Thanks for coming!”

Voice 3: “Don’t cry, it’ll be okay.”

Voice 4: “Right on time!”

Voice 5: “Look how old you’ve gotten.”

OLD BOOKS: “Learning to Play”

AUDIO: Simple Jingle Bells tune that transitions to a more advanced version of Jingle Bells

CRAYONS: “Elementary”

AUDIO: Ambient Classroom Music

Child: “Teacher! Look at my drawing!”

Teacher: “Wow, is that a dinosaur? You did a fantastic job!”

SPICES: “Spicing it up”

AUDIO: CHEF humming

Ambient pan noises and sizzling

LEMONS: “Lemons make Lemonade”

AUDIO: Ambient outside noises

Car driving by and honking

DIRT: “Rebellion in the Mud”

Child 1: “Aw, there’s no mud!”

Child 2: “Here, we can make some!”

Mother: “What are you guys doing?”

FIGURE 16 (Smell Print Animations (listed in no particular order).)



As we created our exhibit, we recognized a clear connection between smell and memory. Most of our Smell Stimuli Survey participants showed this connection: across our scents, the share of participants who recalled a memory “Immediately” ranged from 52.6% to 81.1%. Early feedback on our exhibit further suggests that we can successfully use technology to create art that intrigues an audience and encourages them to appreciate their sense of smell. We recognize, however, that the experiences and smells we worked with may not be universally recognized and may carry American cultural bias.

Future Directions

Future work on the ties between olfaction and memory would help extend and clarify this research. Relatively little data exists on smell prints and memory, so additional research on memories and smell would be very helpful in expanding the field. Additionally, more research into how anosmia and parosmia work would support the study of the tie between memory and smell and of how the absence or alteration of smell affects memory recall. More research and studies will hopefully occur in the future due to the mass experience of parosmia and anosmia resulting from COVID-19. This pandemic, while an incredible tragedy, has brought awareness to the importance of smell as a sense and will hopefully inspire more research in the field of olfaction.


  1. AbScent Anosmia Support. Parosmia – Latest Research. Accessed July 6, 2021.
  2. Adams, Dara R, David W Kern, Kristen E Wroblewski, Martha K McClintock, William Dale, and Jayant M Pinto. “Olfactory Dysfunction Predicts Subsequent Dementia in Older U.S. Adults.” Journal of the American Geriatrics Society. U.S. National Library of Medicine, September 25, 2017.
  3. Adams, Dara R, Kristen E Wroblewski, David W Kern, Michael J Kozloski, William Dale, Martha K McClintock, and Jayant M Pinto. “Factors Associated with Inaccurate Self-Reporting of Olfactory Dysfunction in Older US Adults.” OUP Academic. Oxford University Press, December 22, 2016.
  4. Amos, Caroline, and Raymond McAnally. “S2:E1 Brooke | Parosmia I: When Smells Are No Longer Safe | Fatigued Podcast.” Fatigued: Conversations About COVID-19. Accessed June 30, 2021.
  5. Ashley Loiseau, TheBirdGetsFit. “Everything Tastes Like Rotten Eggs! | My Parosmia Experience.” April 22, 2021.
  6. Attems, Johannes, Lauren Walker, and Kurt A Jellinger. “Olfaction and Aging: A Mini-Review.” Gerontology. Karger Publishers, May 9, 2015.
  7. Bellini, Gustavo Bueno. “What Is TMPRSS2?”, June 15, 2020.
  8. BioLegend. COVID-19 and Olfactory Dysfunction. Accessed July 6, 2021.
  9. Bonfils P, Avan P, Faulcon P, Malinvaud D. Distorted Odorant Perception: Analysis of a Series of 56 Patients With Parosmia. Arch Otolaryngol Head Neck Surg. 2005;131(2):107–112. doi:10.1001/archotol.131.2.107
  10. “Chemoreception – Smell | Britannica.” Accessed July 6, 2021.
  11. Chung, Tom Wai-Hin, Hui Zhang, Fergus Kai-Chuen Wong, Siddharth Sridhar, Kwok-Hung Chan, Vincent Chi-Chung Cheng, Kwok-Yung Yuen, Ivan Fan-Ngai Hung, and Henry Ka-Fung Mak. “Neurosensory Rehabilitation and Olfactory Network Recovery in Covid-19-Related Olfactory Dysfunction.” Brain Sciences 11, no. 6 (May 23, 2021): 686.
  12. Ciurleo, Rosella, Simona De Salvo, Lilla Bonanno, Silvia Marino, Placido Bramanti, and Fabrizia Caminiti. “Parosmia and Neurological Disorders: A Neglected Association.” Frontiers in Neurology 11 (November 9, 2020): 543275.
  13. “COVID-19 Anosmia Loss of Smell and Taste Symptom Monitoring.” Accessed July 8, 2021.
  14. Croy, Ilona, Steven Nordin, and Thomas Hummel. “Olfactory Disorders and Quality of Life-An Updated Review.” OUP Academic. Oxford University Press, January 15, 2014.
  15. Dade, Lauren A., Robert J. Zatorre, and Marilyn Jones-Gotman. “Olfactory Learning: Convergent Findings from Lesion and Brain Imaging Studies in Humans.” Brain: A Journal of Neurology 125, no. Pt 1 (January 2002): 86–101.
  16. Dhikav, Vikas, and KuljeetSingh Anand. 2012. “Hippocampus In Health And Disease: An Overview”. Annals Of Indian Academy Of Neurology 15 (4): 239. doi:10.4103/0972-2327.104323.
  17. Herz, Rachel S. “The Role of Odor-Evoked Memory in Psychological and Physiological Health.” Brain Sciences, vol. 6, no. 3, July 2016, p. 22. PubMed Central, doi:10.3390/brainsci6030022.
  18. Health Essentials from Cleveland Clinic. “What to Know About Losing Your Sense of Smell and How It Relates to COVID-19,” January 22, 2021.
  19. HearingLife. “How Does My Hearing Work and What Happens When Something Goes Wrong.,” 2021.
  20. How Dangerous Is Nasal Polyps ? “How Dangerous Is Nasal Polyps ?,” September 24, 2020.
  21. Howell, Jessica, Richard M Costanzo, and Evan R Reiter. “Head Trauma and Olfactory Function.” World Journal of Otorhinolaryngology – Head and Neck Surgery. Elsevier, March 14, 2018.
  22. Jarvis, Brooke. “What Can Covid-19 Teach Us about the Mysteries of Smell?” The New York Times. The New York Times, January 28, 2021.
  23. Kohler, Christian G., Paul J. Moberg, Raquel E. Gur, Michael J. O’Connor, Michael R. Sperling, and Richard L. Doty. “Olfactory Dysfunction in Schizophrenia and Temporal Lobe Epilepsy.” Cognitive and Behavioral Neurology 14, no. 2 (June 2001): 83–88.
  24. Kropf, Aleisha. “The Nose Remembers: Olfactory Memory and Where It Takes Us.” WonderLab Museum of Science, Health & Technology (blog), April 13, 2019.
  25. Leopold, Donald. “Distortion of Olfactory Perception: Diagnosis and Treatment.” Chemical Senses 27, no. 7 (September 1, 2002): 611–15.
  26. Liu, David T., Maha Sabha, Michael Damm, Carl Philpott, Anna Oleszkiewicz, Antje Hähner, and Thomas Hummel. 2020. “Parosmia Is Associated With Relevant Olfactory Recovery After Olfactory Training”. The Laryngoscope 131 (3): 618-623. doi:10.1002/lary.29277.
  27. Marin (Curley), Allison. “Making Sense of Scents: Smell and the Brain.” Accessed July 6, 2021.
  28. Masaoka, Y., H. Sugiyama, A. Katayama, M. Kashiwagi, and I. Homma. 2012. “Slow Breathing And Emotions Associated With Odor-Induced Autobiographical Memories”. Chemical Senses 37 (4): 379-388. doi:10.1093/chemse/bjr120.
  29. Memory and Aging Center. “Speech & Language,” 2021.
  30. Meng, Xiangming, Yanzhong Deng, Zhiyong Dai, and Zhisheng Meng. “COVID-19 and Anosmia: A Review Based on up-to-Date Knowledge.” American Journal of Otolaryngology 41, no. 5 (2020): 102581.
  31. Moore, Kevin. “You really know you’re anosmic when… In the spirit of the 101 Other uses for your nose… topic on the congenital board, let’s try to get to 101. I’ll start.” Facebook. December 28, 2012.
  32. Müller, Antje, Basile N. Landis, Uwe Platzbecker, Vjera Holthoff, Johannes Frasnelli, and Thomas Hummel. “Severe Chemotherapy-Induced Parosmia.” American Journal of Rhinology 20, no. 4 (August 2006): 485–86.
  33. Nazario, Brunilda. “What Is Parosmia?” WebMD, March 24, 2021.
  34. “COVID-19 and Smell Loss (Anosmia),” April 19, 2021.
  35. “How Does a SARS-CoV-2 Virion Bind to ACE2?,” July 17, 2020.
  36. Parker, Jane K., Christine E. Kelly, and Simon B. Gane. “Molecular Mechanism of Parosmia.” MedRxiv, February 8, 2021, 2021.02.05.21251085.
  37. Philpott, C. M., and D. Boak. 2014. “The Impact Of Olfactory Disorders In The United Kingdom”. Chemical Senses 39 (8): 711-718. doi:10.1093/chemse/bju043.
  38. Psychology and Smell – Fifth Sense. Accessed 16 July 2021.
  39. “Olfactory Cortex”. 2021. Physiopedia.
  40. Rashid, Rasheed Ali, Ameer A. Alaqeedy, and Raid M. Al-Ani. “Parosmia Due to COVID-19 Disease: A 268 Case Series.” Indian Journal of Otolaryngology and Head & Neck Surgery, May 23, 2021.
  41. Rochet, Marion, Wissam El-Hage, Sami Richa, François Kazour, and Boriana Atanasova. “Depression, Olfaction, and Quality of Life: A Mutual Relationship.” Brain sciences. MDPI, May 4, 2018.
  42. Rombaux, Philippe, André Mouraux, Bernard Bertrand, Georges Nicolas, Thierry Duprez, and Thomas Hummel. “Retronasal and Orthonasal Olfactory Function in Relation to Olfactory Bulb Volume in Patients with Posttraumatic Loss of Smell.” The Laryngoscope 116, no. 6 (June 2006): 901–5.
  43. Sayette, Michael A., and Dominic J. Parrott. “Effects of Olfactory Stimuli on Urge Reduction in Smokers.” Experimental and Clinical Psychopharmacology, vol. 7, no. 2, May 1999, pp. 151–59.
  44. Schofield, Peter, Tammie Moore, and Andrew Gardner. “Frontiers | Traumatic Brain Injury and Olfaction: A Systematic Review | Neurology.” Accessed June 29, 2021.
  45. Sivam, Anita, Kristen E Wroblewski, Gorka Alkorta-Aranburu, Lisa L Barnes, Robert S Wilson, David A Bennett, and Jayant M Pinto. “Olfactory Dysfunction in Older Adults Is Associated with Feelings of Depression and Loneliness.” OUP Academic. Oxford University Press, January 24, 2016.
  46. SoP. “Temporal Lobes.” The Science of Psychotherapy (blog), October 1, 2018.
  47. Thirtyacre, Logan. “Dealing with Parosmia after Coronavirus.” April 24, 2021.
  48. Von der Brelie, Linda, Christoph Becker, and Christian von der Brelie. 2020. “Parosmia As An Early Symptom Of Acute SARS-CoV-2 Infection”. Deutsches Aerzteblatt Online. doi:10.3238/arztebl.2020.0328.
  49. Watson, Kathryn, and Seunggu Han. “Parosmia: Symptoms, Causes, Diagnosis, Treatment, and Recovery.” Healthline, September 17, 2018.
  50. Willander, Johan, and Maria Larsson. 2006. “Smell Your Way Back To Childhood: Autobiographical Odor Memory”. Psychonomic Bulletin & Review 13 (2): 240-244. doi:10.3758/bf03193837.
  51. Willander, Johan, and Maria Larsson. “Olfaction and Emotion: The Case of Autobiographical Memory.” Memory & Cognition, vol. 35, no. 7, Oct. 2007, pp. 1659–63. Springer Link, doi:10.3758/BF03193499.
  52. Zald, D. H., and J. V. Pardo. “Functional Neuroimaging of the Olfactory System in Humans.” International Journal of Psychophysiology: Official Journal of the International Organization of Psychophysiology 36, no. 2 (May 2000): 165–81.

Predictive Load Balancing for Financial Exchanges

Blog, Journal for High Schoolers, Journal for High Schoolers 2021


Akshat Parikh, Sohrab Hassibi


With the continual development of the modern stock market and new stock exchanges, there has been a growing trend of individual participation in our capitalistic financial system. Amid these changes, many participants on certain stock exchanges have elected to use HFTs, or high-frequency traders, taking advantage of powerful computer programs to reach higher levels of success in the stock market. Furthermore, some stocks are traded more frequently than others and thus have higher trading volumes. From this arises an inefficiency: stocks with lower trading volumes may be given the same computational resources as stocks with higher trading volumes. In other words, there is a discrepancy between the resources smaller stocks require and what they are given, which takes resources away from stocks with higher volumes. Matching adequate computational resources to the symbols on a stock exchange can help equalize the latency needed to transact across symbols and reduce the ability of latency-advantaged traders to cause short-term price fluctuations. However, performing this match requires an expectation of future trading volume. In this project, we leverage predictive models to forecast stock symbols’ daily volumes and use this information to effectively shard a synthetic exchange.


Today’s equity markets give hundreds of millions of users across the world the opportunity to capitalize millions of businesses. Modern stock exchanges are heavily engineered to provide a high level of service for these users. While the New York Stock Exchange reported an average of 3.6 billion shares traded per day in 2016, several trends point to increased demand for high availability, performance, and fair access in equity exchanges. In 2020, during the COVID-19 pandemic, these equity markets added more than 10 million new users. As interest in market participation grows, providing reliable, fast, and fair access to exchanges becomes more challenging.

A recent example that illustrates the challenges large trading volumes pose for stock exchanges is the GameStop debacle of January 2021. GameStop, a video game retail company, saw a sharp increase of more than 5000% in trading volume as the stock ticker was promoted on various social media platforms, including Reddit and Twitter. These users, among others, used online brokerages to buy and sell large quantities of the stock. Due to the high trading volume in GameStop, several brokerages faced latency issues in reaching exchanges, and exchanges themselves halted the shares from trading. A measure to mitigate the effects of such halts in the future was introduced by the SEC in June 2021, outlining “regulatory halts” that enable exchanges to halt shares from trading when significant latency or outages are observed. Although trading was unusually high for only a handful of tickers, latency and access to financial markets across a broad range of symbols were hampered.

Modern exchanges operate a central matching engine, which maintains data structures pertaining to client orders to buy or sell various stock symbols. For each symbol, the aggregation of these orders results in a limit order book — a list of buy and sell orders which are prioritized first by price and second by time.
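
Price-time priority can be sketched with a pair of heaps, one per side of the book (an illustrative model, not any exchange’s production implementation):

```python
import heapq
import itertools

class OrderBook:
    """Minimal limit order book for one symbol: orders are prioritized
    first by price, then by arrival time."""

    def __init__(self):
        self._seq = itertools.count()  # arrival-order tiebreaker
        self._bids = []  # max-heap via negated price
        self._asks = []  # min-heap

    def add(self, side: str, price: float, qty: int) -> None:
        # Negating buy prices turns Python's min-heap into a max-heap,
        # so the best (highest) bid sits at the root.
        key = -price if side == "buy" else price
        heap = self._bids if side == "buy" else self._asks
        heapq.heappush(heap, (key, next(self._seq), price, qty))

    def best_bid(self):
        return self._bids[0][2] if self._bids else None

    def best_ask(self):
        return self._asks[0][2] if self._asks else None

book = OrderBook()
book.add("buy", 100.0, 10)
book.add("buy", 101.5, 5)
book.add("sell", 102.0, 7)
print(book.best_bid(), book.best_ask())  # 101.5 102.0
```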

As orders enter the central exchange server, they are queued and await processing by the matching engine. Since the limit order books for different symbols share no information with one another, symbols may be sharded and routed to separate queues and matching engine shards for processing, improving the total throughput of the exchange. Doing so results in each shard of the central matching engine receiving and fulfilling orders for a disjoint subset of the symbols the exchange offers. If volume across shards is roughly equivalent, the queue lengths, and consequently the queueing delays, will be roughly equal. However, sorting symbols into shards so as to balance shard volume is challenging for a few reasons. Firstly, volumes for stock symbols are not uniform: some tickers can receive substantially higher volume than others. Secondly, these allocations must be performed before orders are received, which necessitates predicting volumes for stock tickers.
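
One simple way to turn volume forecasts into such an assignment, shown here as a sketch of the balancing idea rather than a claim about our exact algorithm, is greedy balancing: take symbols in descending order of predicted volume and always place the next symbol on the currently lightest shard:

```python
import heapq

def shard_symbols(predicted_volume: dict, num_shards: int):
    """Assign each symbol to one of `num_shards` shards so that total
    predicted volume per shard is roughly balanced."""
    # Min-heap of (current_load, shard_index, member_symbols).
    shards = [(0, i, []) for i in range(num_shards)]
    heapq.heapify(shards)
    # Largest symbols first: classic longest-processing-time heuristic.
    for symbol, vol in sorted(predicted_volume.items(),
                              key=lambda kv: kv[1], reverse=True):
        load, i, members = heapq.heappop(shards)  # lightest shard
        members.append(symbol)
        heapq.heappush(shards, (load + vol, i, members))
    return sorted(shards, key=lambda s: s[1])

forecast = {"AAPL": 90, "MSFT": 70, "GME": 260, "TSLA": 120}
for load, i, members in shard_symbols(forecast, 2):
    print(i, load, members)
# 0 260 ['GME']
# 1 280 ['TSLA', 'AAPL', 'MSFT']
```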

Our research aims to analyze several predictive algorithms for forecasting daily volume from previous volume data for stock symbols. We then implement an algorithm that uses these forecasts to balance the load across matching engine shards within an exchange: after we predict the volume data for all stocks, the stocks are organized into shards, each a collection of several stocks.

Methods and Dataset


To evaluate approaches for load forecasting and balancing, we use historical volume data from the Nasdaq 200 — the 200 largest companies by market capitalization listed on the Nasdaq exchange. For each symbol, we compile daily volume across the exchange from June 15, 2016 to June 14, 2021. In some cases, data may be missing for specific days; for example, Affirm only listed on Nasdaq in 2021. In these cases, we exclude the ticker from analysis.

Figure 1 (Small subset of the full dataset)

To process the dataset, we created two arrays, one containing the “input” data and one the “output” data. The input data is pairs of subsequent volume values, and the output data is the volume that follows each pair. For example, if we wanted to process the Apple stock for the first five days, the input data would look as follows: the first pair is 117780800 and 125307200; the second pair is 125307200 and 244032800; the third pair is 244032800 and 137647600. The output data matches the corresponding input data: the first output is 244032800; the second output is 137647600; the third output is 142185600. We construct such input/output pairs for each symbol before applying the several algorithms described below.
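
The pairing step can be sketched as a sliding window over a symbol’s daily volume series (the series below mirrors the five-day Apple example):

```python
def make_pairs(volumes):
    """Turn a daily volume series into (input pair, next-day output)
    training examples: (v[i], v[i+1]) predicts v[i+2]."""
    inputs = [(volumes[i], volumes[i + 1]) for i in range(len(volumes) - 2)]
    outputs = [volumes[i + 2] for i in range(len(volumes) - 2)]
    return inputs, outputs

aapl = [117780800, 125307200, 244032800, 137647600, 142185600]
inputs, outputs = make_pairs(aapl)
print(inputs[0], outputs[0])  # (117780800, 125307200) 244032800
```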


After processing the dataset, we applied algorithms for three purposes. First, we used several machine learning algorithms to predict the next day's volume: linear regression, Bayesian regression, and nearest neighbor regression. We chose linear regression as a baseline against which to compare the other algorithms. We chose nearest neighbor regression because it suits our data: it makes predictions from the similarity between data points. Consider several data points on a standard x-versus-y graph and one point whose value we are trying to predict. The nearest neighbor regression algorithm calculates the distance from that point to every other point on the graph, selects the closest points (how many depends on the chosen number of neighbors), and averages their values to produce the prediction for the unknown point. In our case, the “neighbors” are previous stock volumes.

Figure 2 (Example of Nearest Neighbor Regression with Three Neighbors)
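A minimal sketch of the averaging procedure just described, on toy one-dimensional data rather than real volume pairs (`knn_predict` is a hypothetical name, not a library routine):

```python
# Minimal 1-D nearest neighbor regression: find the k training points closest
# to the query, then average their target values.
def knn_predict(xs, ys, query, k):
    # Sort training points by distance to the query value.
    by_dist = sorted(zip(xs, ys), key=lambda p: abs(p[0] - query))
    nearest = by_dist[:k]
    return sum(y for _, y in nearest) / k

# Toy data: y is roughly double x.
xs = [1.0, 2.0, 3.0, 10.0]
ys = [2.0, 4.0, 6.0, 20.0]
print(knn_predict(xs, ys, 2.5, k=2))  # averages targets of x=2 and x=3 -> 5.0
```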

To implement the nearest neighbor algorithm, we had to devise a way to find the optimal number of neighbors for any stock and any number of days. Our solution was to generate predictions for every possible neighbor count, from one up to one less than the index of the day being predicted. Then, for each prediction, we calculated the mean percent error, the percentage by which the predicted amount differs from the actual amount. Finally, our algorithm chooses the number of neighbors with the smallest mean percent error. Optimizing the number of neighbors let us create strong predictions with the nearest neighbor algorithm, though it caused some issues further down the road when creating figures.
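The neighbor-count search can be sketched as follows; `predict_with_k` stands in for whatever prediction routine is being tuned, and the predictor and values below are made up purely for illustration:

```python
# Mean percent error: average percentage by which predictions miss actuals.
def mean_percent_error(preds, actuals):
    return sum(abs(p - a) / a for p, a in zip(preds, actuals)) / len(preds) * 100

# Try every neighbor count from 1 to n_max, keep the one with least error.
def best_k(predict_with_k, actuals, n_max):
    errors = {k: mean_percent_error(predict_with_k(k), actuals)
              for k in range(1, n_max + 1)}
    return min(errors, key=errors.get)

# Hypothetical predictor: pretend k=3 yields the closest predictions.
actuals = [100.0, 200.0]
fake = {1: [150.0, 260.0], 2: [120.0, 230.0], 3: [105.0, 205.0]}
print(best_k(lambda k: fake[k], actuals, 3))  # 3
```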

Our second goal was to find the correlation, if any, between the stocks in our dataset. We could then use the correlation and the volumes of two stocks to predict the volume of one of them. We calculated the correlation matrix, a table of the correlation coefficients between all variables in the dataset, using a pairwise correlation algorithm. Before feeding our data to the algorithm, we converted the entire dataset into percent increase/decrease values, using the previous day to calculate each percentage change. This leads to better correlation results: large companies can have large average volumes and small companies small ones, and computing correlation on percentage changes removes this wide gap between average values.
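The conversion to percentage changes and the pairwise correlation can be illustrated with a hand-rolled Pearson coefficient (in practice a library routine such as pandas' `DataFrame.corr` does this across all pairs); the two series below are invented to show that scale differences vanish:

```python
# Day-over-day percent changes of a volume series.
def pct_changes(vols):
    return [(b - a) / a for a, b in zip(vols, vols[1:])]

# Pearson correlation coefficient of two equal-length series.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Two made-up symbols whose volumes move together despite different scales.
big = [1000000, 1100000, 990000, 1200000]
small = [1000, 1100, 990, 1200]
print(round(pearson(pct_changes(big), pct_changes(small)), 4))  # 1.0
```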

Our third and final goal was to use the predicted volume data from the machine learning or correlation algorithms to balance the volume on the shards. We used a greedy allocation approach: before assigning each symbol, our algorithm checks which shard has the smallest total volume and adds the symbol to that shard. This process runs until all 200 symbols are assigned, roughly equalizing the total volume across the shards.
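The greedy allocation step can be sketched as follows (a minimal sketch with made-up predicted volumes; `allocate` is a hypothetical helper name):

```python
# Greedy shard balancing: place each symbol on the shard whose running
# total volume is currently smallest.
def allocate(predicted, n_shards):
    shards = [{"total": 0, "symbols": []} for _ in range(n_shards)]
    for symbol, volume in predicted.items():
        target = min(shards, key=lambda s: s["total"])
        target["symbols"].append(symbol)
        target["total"] += volume
    return shards

demo = {"AAPL": 120, "MSFT": 80, "GOOG": 70, "TSLA": 60, "AMZN": 50}
for shard in allocate(demo, 2):
    print(shard["total"], shard["symbols"])
```

A common refinement of this heuristic is to assign symbols in descending order of predicted volume, which tends to tighten the balance further.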


Results

The first results show the accuracy of the three machine learning algorithms when predicting volume data for 80 days (July 2, 2016 to September 2016). The accuracy was generated by calculating the average mean percent error between the predicted and actual data for all symbols. The average mean percent error is the average of all the mean percent errors over the specified number of days; the mean percent error is the percentage by which the predicted amount differs from the actual amount.

In this chart, Bayesian and linear regression do not differ much in accuracy, which persists with the full data; their average mean percent error mainly hovers between 30 and 35 percent. The nearest neighbor algorithm appears less accurate here, with its average mean percent error hovering between 38 and 45 percent. This is because, to generate this figure, we had to limit the nearest neighbor optimization to between 2 and 8 neighbors: as the number of days increased, the number of candidate neighbors also increased, exponentially increasing the computing time of the optimization. With the full dataset and optimizations, the nearest neighbor algorithm is much more accurate than the other two algorithms, with its average mean percent error hovering around 10 to 30 percent.

Figure 3 (Accuracy of Bayesian Linear and Nearest Neighbor Regression Algorithms when Predicting Volume Data)

This next graph shows our results when balancing the shards using the different algorithms. We chose 5 shards in total, which meant 40 stocks per shard. We landed on this number because increasing the total number of shards decreases the stocks per shard, which leaves less room to balance: with many shards, one large misprediction can throw off all the other members of its shard. We calculated the mean percent error between the balanced volume and the actual volume for each stock over 96 days, using the same optimizations as in the previous results, since the nearest neighbor algorithm is slow to compute over so many days. Our results show that, when balanced, Bayesian regression has lower error than linear regression: its mean percent error is more tightly packed, from around 5% to 35%, versus 8% to 50% for linear regression. The nearest neighbor algorithm is slightly more accurate still when balancing, with mean percent error values around 4% to 30%, although it has some outliers around 60% error. Notably, when running the same calculations with the full dataset and optimizations, the accuracy of the Bayesian and linear regression algorithms stayed the same, but the nearest neighbor algorithm's error was brought down to 2% to 20%.

Figure 4 (Accuracy between Balanced Sharding and Actual Amount for all Three Algorithms)

This next diagram shows the correlation matrix between stocks in our dataset. The key quantity in a correlation matrix is the correlation coefficient, which shows how strongly two variables are related. Coefficients are decimals spanning -1 to 1, with -1 representing a strong negative (inverse) relationship and 1 representing a one-to-one direct relationship. As correlation coefficients approach 0 from either side, the variables involved become less and less related.

We have filtered the dataset to include only stocks that have a large positive or negative correlation with another symbol (magnitude 0.5 or greater). From this, we can see that symbols like GOOG and GOOGL have a high positive correlation: around 0.9337, to be precise. There are 19 pairs of stocks with a positive correlation above 0.5. No stocks have a strong negative correlation, showing that inverse relationships in volume largely do not exist in our dataset.

Figure 5 (Correlation Matrix between Symbols with a Correlation Coefficient higher than 0.5)

Our final diagram shows the accuracy of randomly allocating shards versus balancing the shard allocation. In this chart, the mean percent error is measured between the balanced sharding volume and the actual amounts, and between the random sharding volume and the actual amounts. The balanced sharding volume is calculated by balancing all the shards and taking the average total volume of each shard; we then compute the mean percent error between that average volume and the actual amount for each symbol.

This chart also uses only data for the 96 days after the initial day, with 5 shards for both balanced and random sharding. It shows that with random shard allocation the mean percent error is in the thousands, whereas balanced sharding has a mean percent error of 5 to 40 percent. The mean percent error values for balanced sharding were calculated by averaging the mean percent errors across all the algorithms. The second figure in this section shows a zoomed-in version of the balanced sharding accuracy data.

Figure 6 (Accuracy of Random versus Balanced Sharding Allocation)
Figure 7 (Accuracy of Balanced Sharding Allocation)


Conclusion

Through our research, we devised an algorithm that predicts the volume of a stock using historical data and then balances the load a server or stock exchange would receive. This is supported by our data comparing random sharding against balanced sharding: balanced sharding brought the percent error down to 7% to 35% relative to the actual data, while randomly allocating the shards produced errors of several thousand percent. Random allocation would thus lead to an improper allocation of resources in the stock exchange, causing highly skewed latency across symbols.

We also found that the nearest neighbor algorithm was the strongest for this dataset, followed by Bayesian regression and then linear regression. To generate the figures for the presentation, we had to limit the nearest neighbor optimization, which introduced some error visible in the chart; with full optimizations, however, the nearest neighbor algorithm gives the most accurate results of the three. A possible explanation is that volume varies only slightly across successive days, but these variations compound over time. The nearest neighbor algorithm fares best because it predicts from a selective group of neighboring days, while the other algorithms fit a broader trend than that selective group.

Our research also provided insight into relationships between stocks. Some of the highest correlations were between different share classes of the same company. For example, the Google Class C and Class A stocks had a correlation of 0.9337; the ViacomCBS Class A and common stocks formed a similar pair. This makes sense, as both stocks come from the same company and would be traded in similar volume. A more peculiar relationship was between the Coinbase and AppLovin stocks, which had a high correlation coefficient of 0.85. One possible reason is that both companies went public around the same time and both were highly anticipated, so they were traded in similar volumes.

Our final load balancing algorithm could assist stock exchanges by using machine learning to predict volume data and then effectively balance the load. This could improve conditions for stock traders and reduce some of the unfair advantages that high-frequency traders hold.

Future Directions

There are still improvements needed, as our algorithm still produces some inaccuracies. These could be attributed to missing values for many of the symbols in the dataset, or to companies, like Coinbase, that went public only recently, leaving too little data for strong analysis. Future research could use a larger dataset with more stocks, as well as other machine learning algorithms such as ridge regression. Currently, our correlated stock prediction model predicts using the two stocks' input data and their corresponding output data. This works in some situations, but it does not take the correlation coefficient into consideration. A better option for variables that exhibit multicollinearity is ridge regression, which can handle correlated variables.
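As a sketch of why ridge regression helps with multicollinearity, here is the closed-form two-feature version, w = (XᵀX + λI)⁻¹Xᵀy, on perfectly collinear toy inputs (all names and values below are illustrative assumptions): with λ = 0 the normal equations would be singular, while λ = 1 splits the weight evenly between the two correlated features.

```python
# Two-feature ridge regression via the normal equations with an explicit
# 2x2 inverse: w = (X'X + lam*I)^-1 X'y. The penalty lam shrinks the weights,
# stabilising the fit when the two inputs are (nearly) collinear.
def ridge2(X, y, lam):
    a = sum(x1 * x1 for x1, _ in X) + lam     # X'X[0][0] + lam
    b = sum(x1 * x2 for x1, x2 in X)          # X'X off-diagonal
    d = sum(x2 * x2 for _, x2 in X) + lam     # X'X[1][1] + lam
    c1 = sum(x1 * t for (x1, _), t in zip(X, y))  # X'y[0]
    c2 = sum(x2 * t for (_, x2), t in zip(X, y))  # X'y[1]
    det = a * d - b * b
    return ((d * c1 - b * c2) / det, (a * c2 - b * c1) / det)

# Perfectly collinear features; lam=0 would make det = 0.
print(ridge2([(1, 1), (2, 2), (3, 3)], [2, 4, 6], 1.0))  # both weights = 28/29
```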

Another future direction is to apply our algorithm to a real-world scenario, more specifically an actual stock exchange such as CloudEx. This would let us observe how the allocation of resources and the actual latency depend on the algorithm. Finally, we can use our research for abuse detection. Traders who move large amounts of a single stock in one day through high-frequency trading can directly contribute to higher latencies and other problems, as servers lack adequate resources to handle their trading volume. We can use our algorithm to predict the expected volume for a particular day and flag any sharp increase in trading volume, which could signify abuse of the stock exchange. In all, our research this summer used machine learning to predict the volume data of a stock and then balance the total volume across the stock exchange, which can help exchanges and reduce unfair advantages in the financial realm.


References

  1. Budish, Eric, et al. “The High-Frequency Trading Arms Race: Frequent Batch Auctions as a Market Design Response.” The Quarterly Journal of Economics, vol. 130, no. 4, 23 July 2015, pp. 1547–1621, 10.1093/qje/qjv027.
  2. Geng, Yilong, et al. “Exploiting a Natural Network Effect for Scalable, Fine-Grained Clock Synchronization.” Proceedings of the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’18), 2018. —. “SIMON: A Simple and Scalable Method for Sensing, Inference and Measurement in Data Center Networks.” Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’19), 2019.
  3. Jagannathan, Ravi. “On Frequent Batch Auctions for Stocks.” Journal of Financial Econometrics, 10 Jan. 2020, 10.1093/jjfinec/nbz038. Accessed 6 Aug. 2021.
  4. Vala, Kushal. “Regression: Kernel and Nearest Neighbor Approach.” Medium, Towards Data Science, 26 Feb. 2019. Accessed 6 Aug. 2021.
  5. Lamport, Leslie. “Time, Clocks, and the Ordering of Events in a Distributed System.” Communications of the ACM, vol. 21, no. 7, 1 July 1978, pp. 558–565, 10.1145/359545.359563.
  6. Lin, Yow-Jian, et al. “Sync-MS: Synchronized Messaging Service for Real-Time Multi-Player Distributed Games.”
  7. Sviridov, German, et al. “Demo Abstract: Leveraging AI Players for QoE Estimation in Cloud Gaming.”
  8. “Trading Algorithms – Resources.” Google Docs, 2021. Accessed 6 Aug. 2021.
  9. Securities and Exchange Commission, Release No. 34-92071; File No. S7-24-89, 2021. Accessed 13 Aug. 2021.
  10. “Local broker Stake forced to suspend GameStop trading,” 2021, 2-p56yqz.html. Accessed 13 Aug. 2021.

Analyzing Mutations of SARS-CoV-2 Variants of Concern

Blog, Journal for High Schoolers, Journal for High Schoolers 2021


Jenny Lam, Andrew Hong, Riley Kong, HoJoon Lee, Stephanie Greer


The SARS-CoV-2 virus was discovered in late December 2019, and more than 200 million cases of COVID-19 have been detected as of August 2021. Because SARS-CoV-2 is an RNA virus, it is highly susceptible to mutations in its genome. Some mutations give the virus a replicative advantage and allow it to pose a greater threat to humans, leading to variants of concern such as the Alpha, Beta, Gamma, and Delta variants. Combating COVID-19 depends heavily on understanding the mutations of SARS-CoV-2 and how they spread. In this study, we analyzed different aspects of the COVID-19 genomic metadata from GISAID to identify patterns in the variants of concern and mutations in the spike protein. We focused on mutation frequency over time, the possibility of combinations of variants of concern, and visualizing mutation frequency in the spike protein. Through this analysis, we identified patterns that can be used to further investigate the virus, setting a framework for more in-depth analysis of SARS-CoV-2 and other viruses in the future. We found that mutation frequencies over time in the Delta variant displayed a different pattern, with mutations increasing in frequency more slowly; that variants combining mutations from two variants of concern are most likely not a threat; and we identified positions in the spike protein that could be a potential focus for future treatment.

1. Background

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the RNA virus that causes the global pandemic of coronavirus disease 2019 (COVID-19), has led to over 200 million confirmed cases and 4 million deaths worldwide [1] as of August 13, 2021. As a member of the family of coronaviruses, SARS-CoV-2 contains four primary structural proteins, the spike (S), envelope (E), membrane (M), and nucleocapsid (N) proteins [2]. The S protein, located on the surface of the virus, plays an important role in viral infection, as it is the part that attaches the virus onto the receptor sites of the host cells. The S protein mediates receptor binding and the fusion of viral and host cell membranes in order for the virus to release its genome into the host cell and replicate itself [3].

As the virus passes on its genome and spreads, errors in replication of its genome result in mutations, giving rise to variants, or different strains of the virus. The majority of mutations are benign, but several, especially amino acid changes that affect the S protein, can affect the virus’s transmissibility and virulence. The World Health Organization (WHO) and Centers for Disease Control and Prevention (CDC) have established criteria to categorize SARS-CoV-2 variants based on threat level. These designations include variants of interest (VOIs), variants of concern (VOCs), and variants of high consequence (VOHCs). This study primarily focuses on VOCs, which are characterized by evidence of increased transmissibility, reduction of neutralization by antibodies from vaccination or previous infection, reduced effectiveness of treatment, or reduced accuracy in COVID-19 detection [4]. These variants are characterized by specific amino acid substitutions in the S protein. As of August 13, 2021, four VOCs have been identified and are known as the Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), and Delta (B.1.617.2) variants.

Genomic sequencing is the process used to identify the order of nucleotides in a genome. It is an essential part of understanding and combating COVID-19: it allows mutations in the genome to be identified and new variants to be spotted as they arise, letting researchers and public health officials prepare and potentially limit the spread of new or current variants. It may also enable future treatment options by revealing which areas of the SARS-CoV-2 genome are essential, so that those areas can be targeted.

2. Methods

2.1 k-mer Mutation Table

We analyzed the SARS-CoV-2 reference genome by creating a mutation table consisting of every possible amino acid substitution, as well as unique k-mers, which are substrings of the genome of length k associated with each possible substitution. Our k-mer mutation table uses the SARS-CoV-2 reference genome (accession number: NC_045512.2) from the National Center for Biotechnology Information (NCBI) [5]. For each position in the reference genome, we applied all possible substitution mutations to generate an altered genomic sequence, which we then used to produce mutant 10- and 20-mers. Correspondingly, we described each mutation with the resulting amino acid change and the relevant gene. We created a Python script in Google Colaboratory, an online collaborative programming environment, to generate a .csv file containing this information. Each row in the resulting file corresponds to one substitution mutation in the genome (Figure 1).

Figure 1: First Several Lines of the SARS-CoV-2 k-mer Mutation Table
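The k-mer generation step can be sketched as follows (a toy genome string stands in for the actual NC_045512.2 sequence, `mutant_kmers` is a hypothetical helper name, and the row layout is simplified relative to the real .csv):

```python
# For each position in the genome, substitute each alternative base and emit
# the mutant k-mer centred on that position, clamped to the genome boundaries.
def mutant_kmers(genome, k):
    rows = []
    half = k // 2
    for pos, ref in enumerate(genome):
        for alt in "ACGT":
            if alt == ref:
                continue
            mutant = genome[:pos] + alt + genome[pos + 1:]
            start = max(0, min(pos - half, len(genome) - k))
            # (1-based position, reference base, substituted base, mutant k-mer)
            rows.append((pos + 1, ref, alt, mutant[start:start + k]))
    return rows

rows = mutant_kmers("ACGTACGTAC", 4)
print(len(rows))  # 10 positions x 3 substitutions = 30
print(rows[0])    # (1, 'A', 'C', 'CCGT')
```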

2.2. Metadata Analysis

Our study utilizes COVID-19 genomic metadata (updated August 6, 2021) with over 2.6 million cases from the Global Initiative on Sharing Avian Influenza Data (GISAID), a publicly available reference database, in order to characterize the VOCs and S protein mutations through data such as collection date, collection location, patient age, patient gender, variant, and amino acid substitutions. The metadata also includes additional information for each reported case, which was not used for our research, namely the virus name, type, accession ID, additional location information, sequence, host, clade, Pango lineage, if the genome is a reference, if the genome is complete, if the genome is high coverage or low coverage, N-content, and GC-content.

To parse through the metadata, we used Python with Google Colaboratory, and utilized libraries such as Matplotlib and Pandas to generate graphs to visualize the data (Figure 2).

Figure 2: Code snippet utilizing Matplotlib and Pandas
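A sketch of the kind of parsing code shown in Figure 2, using only the standard library for illustration; the column names and the miniature tab-separated sample below are assumptions, not the real GISAID schema:

```python
import csv
import io
from collections import Counter

# Hypothetical miniature of the metadata; the real file is far larger and its
# column names differ. The grouping by (month, variant) mirrors our plots.
tsv = """date\tvariant\tsubstitutions
2021-05-01\tDelta\tD614G,L452R
2021-05-09\tAlpha\tD614G,N501Y
2021-06-02\tDelta\tD614G,T478K
"""

counts = Counter()
for row in csv.DictReader(io.StringIO(tsv), delimiter="\t"):
    month = row["date"][:7]  # e.g. '2021-05'
    counts[(month, row["variant"])] += 1

print(counts[("2021-05", "Delta")])  # 1
```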

Our code can be found at

3. Results

3.1 Overview of S Protein Mutations in VOCs Over Time

For each VOC, we created a graph that showed the relative frequency of S protein mutations over several months (Figures 3-6). The data points for each variant were determined by dividing the number of samples that contained the mutation by the total number of samples for each month. The graphs do not display every mutation found in the variants; they only include the most frequent mutations for each variant. Months with a small sample size (< 30 cases for that variant), which were the months near the start of the COVID-19 pandemic, were excluded from the graphs.
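The monthly relative-frequency computation can be sketched as follows (the case records below are fabricated purely to exercise the function; the 30-sample cutoff mirrors the exclusion rule described above):

```python
from collections import defaultdict

# Relative frequency of one mutation per month: (cases carrying the mutation)
# / (total cases of that variant that month). `cases` is a list of
# (month, set-of-mutations) records for a single variant.
def monthly_frequency(cases, mutation, min_samples=30):
    totals, hits = defaultdict(int), defaultdict(int)
    for month, muts in cases:
        totals[month] += 1
        if mutation in muts:
            hits[month] += 1
    # Months with fewer than min_samples cases are excluded, as in the text.
    return {m: hits[m] / totals[m] for m in totals if totals[m] >= min_samples}

cases = [("2021-04", {"D614G", "L452R"})] * 40 + [("2021-04", {"D614G"})] * 10
print(monthly_frequency(cases, "L452R"))  # {'2021-04': 0.8}
```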

Nine mutations remain dominant in the Alpha variant: D614G, A570D, P681H, T716I, D1118H, S982A, N501Y, H69del/V70del, and Y144del. Similarly, the Gamma variant has several substitutions, including D614G, H655Y, N501Y, V1176F, T1027I, E484K, P26S, L18F, K417T, T20N, D138Y, and R190S, that were prominent near its rise and have been maintained among Gamma variant genomes. However, the relative mutation frequencies of the Beta and Delta variants show greater variation in the graphs. Prominent S protein mutations in the Beta variant include D614G, A701V, K417N, D80A, E484K, D215G, L242del/A243del/L244del, and L18F. In contrast to the first five mutations mentioned, D80A, D215G, and L18F were at a low relative frequency in September 2020 but have since increased to become dominant among the population. This increase may suggest that these mutations, along with those that hold a constantly high relative frequency, play a role in giving the virus a replicative advantage and are beneficial to its survival. L242del/A243del/L244del, the deletion at amino acid positions 242-244 of the S protein, appears to decrease in relative frequency among Beta variant genomes; mutations that decrease in relative frequency over time likely no longer give the virus a replicative advantage. Relative mutation frequency over time for the Delta variant is less consistent, especially compared to the Alpha and Gamma variants. This may have made the variant more difficult to track until more cases could be sequenced, allowing it to infect many people before it was properly detected; it also suggests a potentially high rate of mutation that changed the virus frequently, potentially making it harder for antibodies from previous infection to recognize.
Overall, the relative mutation frequencies of D614G, P681R, T19R, L452R, T478K, D950N, F157del/F158del, and E156G increase until they become dominant mutations for the variant, with G142D and T95I present in some sequences. However, the smaller sample sizes of the earlier months compared to the later months may have affected the relative frequencies.

In general, we also noticed that mutations that become dominant in an area and then decrease do not tend to reappear in the population. This pattern suggests that mutations are unlikely to resurface once they have declined after becoming dominant, and it can help predict the behavior of mutations found in currently dominant variants. This could be because the mutations no longer provide a beneficial adaptation, as a result of environmental changes or improvements in treatment, and is a potential topic of further investigation.

3.2 Identifying Combination of VOCs

To assess whether variants that have mutations from multiple VOCs are a possible threat, we analyzed the prevalence of cases in the COVID-19 metadata that were labeled as a VOC in the metadata, but contained mutations of another VOC that are hypothesized to have the strongest effect on the virus or are of the most concern. The mutations chosen for each variant are as follows: Alpha: P681H, N501Y, H69del/V70del [6], Beta: K417N, N501Y, E484K [7], Gamma: K417T, N501Y, E484K [8], and Delta: P681R [9], L452R [10], T478K [11].
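The combination check can be sketched as a subset test against each VOC's signature mutations listed above (`combinations` is a hypothetical helper name; H69del/V70del is split into two tokens here for simplicity):

```python
# Signature mutations per VOC, as chosen in the text.
SIGNATURES = {
    "Alpha": {"P681H", "N501Y", "H69del", "V70del"},
    "Beta": {"K417N", "N501Y", "E484K"},
    "Gamma": {"K417T", "N501Y", "E484K"},
    "Delta": {"P681R", "L452R", "T478K"},
}

# Return the other VOCs whose full signature appears in a sequence
# labelled as `label` and carrying the given set of mutations.
def combinations(label, mutations):
    return sorted(v for v, sig in SIGNATURES.items()
                  if v != label and sig <= mutations)

print(combinations("Alpha", {"P681H", "N501Y", "K417N", "E484K", "D614G"}))
# ['Beta']
```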

Variant + mutations: number of cases (countries) per month

Alpha + K417N, N501Y, E484K (Beta): 3/21: 5 (Sweden, USA); 4/21: 7 (Croatia); 5/21: 2 (USA, France)

Alpha + K417T, N501Y, E484K (Gamma): 4/21: 2 (USA); 5/21: 11 (USA)

Alpha + P681R, L452R, T478K (Delta): 5/21: 1 (Netherlands); 6/21: 2 (USA, Czechia); 7/21: 7 (Czechia, Spain, South Korea, Sweden, France)

Beta + P681H, N501Y, H69del/V70del (Alpha): 1/21: 1 (Israel); 2/21: 1 (Bosnia and Herzegovina); 3/21: 1 (USA); 5/21: 5 (USA, Belgium)

Beta + K417T, N501Y, E484K (Gamma): 1/21: 1 (Chile); 3/21: 5 (USA, Turkey); 4/21: 13 (USA, Germany, Colombia); 5/21: 11 (USA); 6/21: 1 (Mexico); 7/21: 1 (Ecuador)

Beta + P681R, L452R, T478K (Delta): 6/21: 2 (Botswana); 7/21: 2 (Botswana)

Gamma + P681H, N501Y, H69del/V70del (Alpha): 5/21: 3 (USA)

Gamma + K417N, N501Y, E484K (Beta): 3/21: 1 (Turkey)

Gamma + P681R, L452R, T478K (Delta): none

Delta + P681H, N501Y, H69del/V70del (Alpha): 3/21: 5 (Germany); 4/21: 3 (United Kingdom, USA); 5/21: 5 (Germany, United Kingdom, South Africa); 6/21: 4 (USA, United Kingdom, South Korea); 7/21: 1 (Spain)

Delta + K417N, N501Y, E484K (Beta): 3/21: 1 (Sweden)

Delta + K417T, N501Y, E484K (Gamma): 5/21: 2 (USA, Luxembourg); 6/21: 5 (USA, Canada)

Figure 7: Table of combinations of VOCs. The first column lists the combination of the variant (using the ‘variant’ attribute of the metadata) and S protein mutations; the second lists, for each month with more than 0 sequences, the number of sequences and the countries in which they were found.

Overall, there were very few sequences containing a combination of two VOCs as defined above. These particular combinations do not seem to persist in the population or grow past 15 cases in one month. This suggests that combinations of VOCs may not pose a serious threat to humans, as they have not spread widely, and that these specific combinations of mutations do not give the virus a replicative advantage or aid its transmissibility. Although there is little evidence of the impact of these variants, more data would give a better understanding of how these strains behave. However, there are several emerging strains of the VOCs that include one additional mutation characteristic of another VOC. One such example is the Delta Plus (AY.1) variant found in India, which has acquired the K417N mutation, one that is also found in the Beta variant [12]. No current evidence suggests that Delta Plus will pose a larger threat than Delta, but it is too early to assess the risk of this variant. In addition to Delta Plus, several hundred Delta variant cases in the GISAID metadata have been found with H69/V70 deletions, mutations that are also found in the Alpha variant. Furthermore, several hundred Alpha variant cases contain the E484K mutation characteristic of the Beta and Gamma variants, and several hundred Gamma variant cases contain the P681H mutation, a dominant mutation in the Alpha variant [13]. In general, many newly discovered variants contain many of the same mutations found in older variants, suggesting that most of the mutations providing the most effective adaptations may already have been discovered.

3.3 Graphing Mutation Frequency by Position in S Protein

We created a log plot of the relative frequency of mutations at each position in the S protein to get a better visual understanding of S protein mutations by position (Figure 8). The figure is color-coded according to the value of each frequency, obtained by dividing the number of cases with a mutation at each position in each month by the total number of cases for that position and month; the maximum frequency across all months was used as the relative frequency for each position. In addition, we created a similar graph of the total number of mutations at each position (Figure 10).
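The color coding used in the figure amounts to binning each position's maximum monthly frequency (the per-position frequencies below are invented for illustration):

```python
# Map a relative mutation frequency to the colour band used in the figure:
# red > 50%, yellow 5-50%, green 1-5%, blue < 1%.
def color_band(f):
    if f > 0.50:
        return "red"
    if f > 0.05:
        return "yellow"
    if f > 0.01:
        return "green"
    return "blue"

# Hypothetical maximum frequencies for a few S protein positions.
freqs = {614: 0.97, 484: 0.12, 80: 0.03, 492: 0.000017}
print({pos: color_band(f) for pos, f in freqs.items()})
# {614: 'red', 484: 'yellow', 80: 'green', 492: 'blue'}
```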

Figure 8: Relative mutation frequencies (f) at each position in the S protein (red = f>50%, yellow = 5%<f<50%, green = 1%<f<5%, blue = f<1%)

Relative mutation frequency (f) and color from Figure 8; number of positions; amino acid positions in S protein:

f > 50% (red), 18 positions: 19, 69, 70, 142, 144, 156, 157, 158, 452, 478, 501, 570, 614, 681, 716, 950, 982, 1118

5% < f < 50% (yellow), 18 positions: 5, 18, 20, 26, 27, 95, 138, 152, 190, 222, 417, 477, 484, 655, 792, 909, 1027, 1176

1% < f < 5% (green), 37 positions: 8, 12, 13, 21, 49, 54, 80, 98, 153, 189, 215, 242, 243, 244, 251, 253, 262, 272, 367, 439, 583, 675, 677, 701, 719, 732, 769, 772, 780, 859, 936, 957, 1163, 1167, 1191, 1263, 1264

Figure 9: Table of all positions with a relative mutation frequency greater than 1% (red, yellow, or green)
Figure 10: Log plot of total mutations at each position in the S protein

There are 18 positions in the S protein where mutations have exceeded a relative frequency of 0.5; these are labeled in red in Figure 8. They are mutations found in the Alpha and Delta variants, most likely reflecting those variants' global spread, which indicates that these mutations were linked to the proliferation of the Alpha and Delta variants. Another 18 positions have a relative frequency between 0.05 and 0.5 and are labeled in yellow in Figure 8; these are mainly found in the Gamma variant, with some from the Beta variant. The 37 positions labeled in green have a relative mutation frequency between 0.01 and 0.05. Some of these positions are not found in any specific VOC, meaning that the mutations could be either benign or helpful to every type of variant, which could be investigated through further research. These positions could also be sites of key mutations for future VOCs, so they merit careful investigation to better understand their mutation patterns.

The vast majority of positions had a mutation frequency of less than 1%, indicating that these positions are essential to SARS-CoV-2 function. Importantly, some positions had virtually no mutations, such as position 492 with a relative frequency of 0.000017, 1105 with 0.000018, 488 with 0.000021, 379 with 0.000023, and 56 with 0.000024. The median relative frequency across all positions is 0.00034. Because these positions have a much lower relative frequency than the median, they may be essential to the virus and could thus be targeted by future treatments, since any mutation there would disrupt the function of the S protein.

Analysis of general regions of high mutation rates indicates that one of the areas with the highest total number of mutations spans positions 350-500, known as the receptor-binding domain (RBD) [14]. This is the area of the spike protein that allows entry into the human cell and causes a person to become infected. High mutation rates in this region can allow higher rates of infection, making the comparatively large number of mutations reasonable. Furthermore, the 0-250 region of the S protein also contains a very high number of mutations. This may indicate a yet-to-be-investigated area of high importance to the virus, as high mutation rates in this region may similarly allow higher rates of infection.

4. Conclusions

Through analysis of COVID-19 sequencing metadata from GISAID, we were able to visualize the relative mutation frequencies of each VOC over time, investigate whether combinations of mutations from different VOCs pose a possible threat, and depict mutation frequencies by graphing each position in the S protein. We found that the relative frequencies of mutations in the Delta variant were less consistent over time compared to the other variants, and its mutations took longer to become dominant. The variability of these mutations’ relative frequencies may have posed challenges to detection early on. We also found that combinations of mutations between two variants of concern most likely do not pose a threat to humans due to their low prevalence and inability to persist in the population. Although the mutations in the VOCs may increase the transmissibility and/or infectivity of the virus, a combination of them may not give the virus a stronger evolutionary advantage over the VOCs on their own. Additionally, certain regions of the S protein with high or low mutation rates could indicate, respectively, regions allowing for higher infection rates and regions integral to the function of the virus. Positions in the S protein with a relative frequency significantly lower than the median, such as 492, 1105, and 488, may be essential to the function of the S protein and are potential targets for treatment and vaccination.

5. Future Directions

For future research, we hope to leverage k-mers to more efficiently and accurately identify mutations in SARS-CoV-2 genomes, as well as identify and analyze insertions and deletions, since using k-mers is more efficient than traditional sequence alignment methods. This would be accomplished by utilizing the k-mer table we have already created as a base for a method of comparison between mutations in the genome.
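As a rough sketch of how such a k-mer comparison might work (the function names and the choice of k here are our own illustration, not the k-mer table described above), novel k-mers in a variant genome flag mutated regions without a full alignment:

```python
def kmer_set(seq, k=8):
    """Return the set of all k-mers (length-k substrings) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def candidate_mutation_kmers(reference, variant, k=8):
    """K-mers present in the variant but absent from the reference.

    A substitution, insertion, or deletion perturbs up to k overlapping
    k-mers, so the novel k-mers cluster around mutated positions.
    """
    return kmer_set(variant, k) - kmer_set(reference, k)
```

For example, a single substitution in an 8-base toy sequence produces four novel 4-mers, all overlapping the changed base.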

In addition, we hope to be able to further analyze patterns of SARS-CoV-2 mutations in conjunction with various meteorological factors in order to examine how they affect mutation frequency, spread and other factors of SARS-CoV-2. Investigating the weather during specific months or days in various geographical regions would greatly expand the specificity of our knowledge of the effects that weather patterns have on mutations.

6. Acknowledgements

We would like to thank our mentors Hojoon Lee and Stephanie Greer for their guidance throughout this research project. We would also like to thank Cindy Nguyen, Professor Tsachy Weissman, and everyone who helped coordinate different events that allowed the STEM to SHTEM program to be possible, and gave us our first steps into the world of academic research.

We also gratefully acknowledge the Authors from the Originating laboratories responsible for obtaining the specimens and the Submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which this research is based (see attached PDFs).

7. References

  1. WHO Coronavirus (COVID-19) Dashboard. (n.d.). Retrieved August 13, 2021, from
  2. Astuti, I., & Ysrafil. (2020). Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2): An overview of viral structure and host response. Diabetes & Metabolic Syndrome, 14(4), 407–412.
  3. Huang, Y., Yang, C., Xu, X., Xu, W., & Liu, S. (2020). Structural and functional properties of SARS-CoV-2 spike protein: Potential antivirus drug development for COVID-19. Acta Pharmacologica Sinica, 41(9), 1141–1149.
  4. CDC. (2020, February 11). Coronavirus Disease 2019 (COVID-19). Centers for Disease Control and Prevention.
  5. Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome (1798174254; Version 2). (2020). [Data set]. NCBI Nucleotide Database.
  6. B.1.1.7: What We Know About the Novel SARS-CoV-2 Variant. (n.d.). ASM.Org. Retrieved August 13, 2021, from
  7. Corum, J., & Zimmer, C. (2021, January 18). Inside the B.1.1.7 Coronavirus Variant. The New York Times.
  8. Hirotsu, Y., & Omata, M. (2021). Discovery of a SARS-CoV-2 variant from the P.1 lineage harboring K417T/E484K/N501Y mutations in Kofu, Japan. The Journal of Infection, 82(6), 276–316.
  9. Saito, A., Nasser, H., Uriu, K., Kosugi, Y., Irie, T., Shirakawa, K., Sadamasu, K., Kimura, I., Ito, J., Wu, J., Ozono, S., Tokunaga, K., Butlertanaka, E. P., Tanaka, Y. L., Shimizu, R., Shimizu, K., Fukuhara, T., Kawabata, R., Sakaguchi, T., … Sato, K. (2021). SARS-CoV-2 spike P681R mutation enhances and accelerates viral fusion (p. 2021.06.17.448820).
  10. Motozono, C., Toyoda, M., Zahradnik, J., Saito, A., Nasser, H., Tan, T. S., Ngare, I., Kimura, I., Uriu, K., Kosugi, Y., Yue, Y., Shimizu, R., Ito, J., Torii, S., Yonekawa, A., Shimono, N., Nagasaki, Y., Minami, R., Toya, T., … Sato, K. (2021). SARS-CoV-2 spike L452R variant evades cellular immunity and increases infectivity. Cell Host & Microbe, 29(7), 1124-1136.e11.
  11. Giacomo, S. D., Mercatelli, D., Rakhimov, A., & Giorgi, F. M. (2021). Preliminary report on severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Spike mutation T478K. Journal of Medical Virology, 93(9), 5638–5643.
  12. SARS-CoV-2 variants of concern and variants under investigation. (n.d.).
  13. Emergence and spread of SARS-CoV-2 P.1 (Gamma) lineage variants carrying Spike mutations 𝚫141-144, N679K or P681H during persistent viral circulation in Amazonas, Brazil—SARS-CoV-2 coronavirus / nCoV-2019 Genomic Epidemiology. (2021, July 4). Virological.
  14. What is a Receptor-Binding Domain (RBD)? News-Medical.Net.

Analyzing the Information Density of Various Tokenizations for the Optimization of Natural Language Processing Models

Blog, Journal for High Schoolers, Journal for High Schoolers 2021


Riya Bhatia, Angela Yuan, Shivam Syal, Komal Keesara, Tawshia Choudhary, Dmitri Pavlichin


Tokenization, a pre-processing step for Natural Language Processing (NLP) models, involves the breaking down of a text into smaller strings to build context for the model. There has been a range of tokenizers and tokenization methods designed to improve NLP model efficiency; however, there has been little research examining which tokenization methods are effective for various categories of inputs. Users are required to pay per token to utilize large-scale NLP models, and advice on which tokenizers to utilize can ultimately allow them to save money.

In this study, we aim to determine tokenization methods that are effective for different languages, focusing on texts written in languages that are more morphologically complex than English. To accomplish this, two metrics of quantification are used: the tokenization length, which is the total number of tokens, and the token dictionary size, which is the number of distinct tokens that occur in the tokenized text. Seven tokenization methods were tested: three pre-trained tokenizers trained on English corpora, three tokenizers trained on multilingual data, and byte-pair encoding. Two distinct categories of languages were then analyzed to determine their information density and repeatability, and the M2M100 tokenizer was determined to be the best for multilingual language processing when tested on six languages. Among the pre-trained tokenizers trained on English, the BERT tokenizer was the most effective when evaluated with the two metrics mentioned. We further analyzed genomes, where n-grams are comparable to tokens; the number of n-grams is compared to the length of the sequence to measure repeatability.

Keywords: Natural Language Processing, optimization, byte-pair encoding, tokenization, genomes, pre-trained tokenizers, multilingual, repetitiveness


NLP is a subcategory of Artificial Intelligence that involves the understanding of natural language by computers, combining the computer science and linguistics fields. While understanding a language is intuitive for humans, it is a complicated task for computers due to varying syntax and word choice [1]. NLP is a broad field that spans many areas of high interest, including machine translation, human-computer interaction, and sentiment analysis. NLP is also utilized to aid the hearing-impaired and those who have learning disabilities in communication and understanding. Generally, the input to NLP models is text. However, before texts can be fed into larger models, such as BERT and GPT-2, they need to be processed.

Thus, tokenization is a valuable pre-processing step for developing more efficient NLP models. Tokenization is a method of processing text by breaking inputs down into smaller strings, which can be individual characters, chunks of text, or entire words. Machine learning models accept numerical inputs, and tokenization converts text into a sequence of numbers, where the integer-valued tokens correspond to a vocabulary of strings. Tokenization aids computers in understanding the underlying meaning of a set of words or sentences. For instance, the sentence “It is sunny” can be broken down into [“It,” “ is,” “ sunny”] with word tokenization, while a word can be broken down further, as in “lowest” becoming the strings [“low”, “est”] through subword tokenization. In general, texts can be broken up at the character level, subword level, or word level.

“Hello there!” → [“H”, “e”, “l”, “l”, “o”, “ “, “t”, “h”, “e”, “r”, “e”, “!”]

“Hello there!” → [“Hello”, “ “, “there”, “!”]

“Hello there!” → [“He”, “el”, “ll”, “lo”, “o “, “ t”, “th”, “he”, “er”, “re”, “e!”]

Figure 1. Three tokenizations are seen, with the first being character tokens, the second being the GPT-2 tokenization, and the third being that every pair of consecutive bytes is a token.

In Figure 1, the first example highlights the scenario where each letter is its own token. The tokenization length and the number of distinct tokens are both large: there are 12 total tokens and 9 distinct tokens (“H” and “h” count separately). In the next example, a sample GPT-2 tokenizer output is seen, where each word is its own token, besides the symbols. Because GPT-2 is trained to capture patterns in natural English, like many other pre-trained tokenizers on English corpora, it recognizes the words in the input string. The tokenization length and token dictionary size are the smallest here, with both being four. Further, in the last example, we separate the string into all possible 2-grams, where each pair of consecutive bytes is its own token. Here, the tokenization length is 11, but the dictionary size is especially large, as the pairs do not repeat frequently. This is because the total number of combinations of letters in English when a text is broken into n-grams is:


26ⁿ

where n is the number of letters per token in the tokenization. For a 2-gram, there are 26² = 676 combinations that can appear, while for a 1-gram (where every character is a token), there are simply 26 possibilities (excluding symbols), so the probability that tokens are repeated is higher for a text broken into 1-grams than into 2-grams.

As mentioned above, these tokenizers are trained to capture patterns in natural English. Thus, text in other languages will result in longer tokenizations, as the tokenizer will be unable to recognize such inputs and will break them into smaller sub-strings.

Broadly, a token dictionary is a collection of strings that can tile a text, but a good token dictionary should meet the following criteria: 1) a short tokenization length, and 2) a small vocabulary size [1]. The first lowers the total number of tokens, which increases the speed of applying a model. The second leads to a smaller model, given that a larger vocabulary would result in more model parameters. Both factors allow for faster, cheaper model evaluation.
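As a concrete illustration of the two metrics used throughout this paper (the helper below is our own sketch, not part of any tokenizer library):

```python
def tokenization_metrics(tokens):
    """Return (tokenization length, token dictionary size) for a token list.

    Tokenization length = total number of tokens;
    token dictionary size = number of distinct tokens.
    """
    return len(tokens), len(set(tokens))

# The GPT-2-style tokenization from Figure 1: both metrics equal four.
length, vocab_size = tokenization_metrics(["Hello", " ", "there", "!"])
```

Applied to a character-level tokenization (`list("Hello there!")`), the same function gives a length of 12 with 9 distinct tokens, since repeated letters shrink the dictionary but not the length.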


N-Gram Analysis for Genomes

In our analysis, we viewed n-grams in genomes as tokens. Our goal was to determine whether different genomes differ in their repetitiveness, as measured by the number of distinct tokens.

To analyze genome repeatability, we compared the number of distinct n-grams (strings of length n) to the length of the sequence scanned. We began by examining 6 different genomes: Aaosphaeria arxii (Eukaryota), Abiotrophia defectiva (Bacteria), Abditibacterium utsteinense (Bacteria), SARS-CoV-2 and Tai Forest ebolavirus (viruses), and Homo sapiens chromosome 21 (open reading frame 91). After downloading these as FASTA files, we filtered out the line breaks, initial text descriptions, and capitalization so that we could solely analyze the base pairs.
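The filtering step can be sketched roughly as follows (a minimal illustration of the cleanup described above; the actual notebook code may differ):

```python
def clean_fasta(lines):
    """Strip FASTA description lines, line breaks, and case differences,
    leaving only the base-pair string."""
    parts = []
    for line in lines:
        line = line.strip()               # drop line breaks / whitespace
        if not line or line.startswith(">"):
            continue                      # skip the initial text description
        parts.append(line.upper())        # normalize capitalization
    return "".join(parts)

# e.g. clean_fasta([">chr21 example", "acgT", "ggca"]) gives "ACGTGGCA"
```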

We used Python and Jupyter Notebook to plot the number of distinct n-grams against the length of the genome examined so far. The first algorithm we created displays these two values directly on the y-axis and x-axis, respectively. Some of the downloaded genomes, such as Aaosphaeria arxii, had large file sizes when imported into the Jupyter Notebook, so we examined both entire and partial genome lengths in their base-pair count.

Afterwards, we modified this initial algorithm to display the data more distinctly. The new algorithm returns the fraction of non-unique 8-grams. With NumPy and Matplotlib, we created a superimposed plot of the 6 genomes, showing the increasing repetition found in the 8-grams of each genome sequence. We mainly analyzed the first 90,000 base pairs to focus on the sections of the sequence that display the most change in the slope of the plot.

For example, if a sequence only consists of A’s and n = 3, then the only 3-gram seen would be “AAA,” and the number of distinct 3-grams would always equal 1. If a genome sequence is ACGACGACGACGACG and n = 3, then the following would be computed:

# of distinct n-grams | Base-pair distance from the beginning of the genome
1 | 3 (ACG)
2 | 4 (ACG, CGA)
3 | 5 (ACG, CGA, GAC)
3 | 6 (the same 3-grams keep repeating onwards)

This demonstrates some of the quantities tracked in the repeatability analysis, showing that genomes eventually reach their maximum number of distinct n-grams, at which point the plots take on smaller slopes and flatten. Here, at 5 base pairs, the genome has already exhibited all of its distinct 3-grams.
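The counting illustrated in the table above can be sketched as follows (our own minimal version of the first algorithm):

```python
def distinct_ngrams_by_prefix(seq, n):
    """Number of distinct n-grams seen within each prefix of the sequence."""
    seen = set()
    counts = []
    for i in range(n, len(seq) + 1):
        seen.add(seq[i - n:i])       # n-gram ending at position i
        counts.append(len(seen))     # distinct count after scanning i bases
    return counts

# For "ACGACGACGACGACG" with n = 3 the counts run 1, 2, 3, 3, 3, ...:
# the third and final distinct 3-gram appears at 5 base pairs, and the
# curve stays flat from then on, exactly as in the table.
```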

We defined “n_non_unique” as the number of 8-grams in the genome sequence that repeat at least once, along with “n_unique” as the number of 8-grams that occur only once. Dividing the former by the sum of the two gives the fraction of non-unique, or repeated, 8-grams. This amounts to dividing the accumulated number of repeated 8-grams by the total number of 8-grams at the scanned sequence length. We appended this to our Python dictionary using this equation:

(n_non_unique) / (n_non_unique + n_unique)

These values are shown on the superimposed plot, using logarithmic axes. Based on the equation, a smaller value indicates less repeatability in the genome’s base pairs, whereas a greater value indicates more repeatability.
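One plausible reading of this computation, scanning the sequence left to right (a sketch only; the variable names follow the text above, but the exact accumulation in the notebook may differ):

```python
def repeated_fraction(seq, n=8):
    """Fraction of n-gram occurrences that repeat an earlier n-gram,
    i.e. n_non_unique / (n_non_unique + n_unique)."""
    seen = set()
    n_non_unique = 0
    total = 0
    for i in range(len(seq) - n + 1):
        gram = seq[i:i + n]
        total += 1
        if gram in seen:
            n_non_unique += 1    # this n-gram repeats an earlier one
        else:
            seen.add(gram)       # first occurrence: unique so far
    return n_non_unique / total if total else 0.0
```

A run of ten identical bases with n = 3, for instance, yields one unique 3-gram and seven repeats, giving a fraction of 7/8; a sequence with no repeated n-grams gives 0.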

Analysis of Various Natural Languages

Byte-Pair Encoding

In information theory, Byte-Pair Encoding (BPE) is a method of data compression that iteratively replaces repeated pairs of consecutive bytes with a byte that does not occur in the data. BPE is also used in NLP models including GPT-2 and Transformer models for tokenization. [9] An example is shown in Figure 2.

Original string: “hello hello there”

[“h”, “e”, “l”, “l”, “o”, “ “, “h”, “e”, “l”, “l”, “o”, “ “, “t”, “h”, “e”, “r”, “e”] (1)

[“he”, “l”, “l”, “o”, “ “, “he”, “l”, “l”, “o”, “ “, “t”, “he”, “r”, “e”] (2)

[“hel”, “l”, “o”, “ “, “hel”, “l”, “o”, “ “, “t”, “he”, “r”, “e”] (3)

[“hell”, “o”, “ “, “hell”, “o”, “ “, “t”, “he”, “r”, “e”] (4)

[“hello”, “ “, “hello”, “ “, “t”, “he”, “r”, “e”] (5)

[“hello “, “hello “, “t”, “he”, “r”, “e”] (6)

Figure 2. Byte-Pair Encoding Applied to an Input String.

As in Figure 2, BPE first separates the input string into a character-level tokenization, where each letter is its own token, as seen in (1). It then iterates through the tokenization, counting the frequency with which each pair of adjacent tokens occurs. If the algorithm notices that a pair occurs more than once, it merges the two tokens into one item in the list, as in (2). Continuing through this process, BPE stops when no pair of consecutive tokens occurs more than once.

It is important to note that the token dictionary size and tokenization length change from (1) to (6). In the beginning, the tokenization length is large, while the token dictionary size is small. As the iterations continue, the tokenization length becomes smaller, while the token dictionary size increases. This is why BPE plots trace out a curve.

In this work, BPE is used as a method to analyze the repeatability of texts, especially those texts that are in different languages and families.
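A minimal version of the procedure in Figure 2 can be sketched like this (our own illustration of the merge loop, not the exact implementation used in GPT-2 or in our experiments):

```python
from collections import Counter

def byte_pair_encode(text):
    """Repeatedly merge the most frequent adjacent token pair
    until no pair of consecutive tokens repeats."""
    tokens = list(text)                      # (1): character-level tokens
    while True:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break                            # no adjacent pair repeats: done
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])  # merge the pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens
```

Running it on the Figure 2 input, `byte_pair_encode("hello hello there")` terminates at [“hello ”, “hello ”, “t”, “he”, “r”, “e”], matching step (6).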

Text Translation and Applying BPE

To allow for consistency between the languages, we chose one text, “The Monkey’s Paw” by W.W. Jacobs, to convert into seven different languages. Specifically, we tested the seven most popular languages used in NLP: English, Chinese (Traditional), Urdu, Persian, Arabic, French, and Spanish. This text was chosen because it is a short story, which the algorithm could process in a small amount of time, and because the range of vocabulary in the story made it a good fit for this experiment.

This text was translated into seven different languages using the Google Translate API. BPE was then applied to these translated texts. The tokenization length and token dictionary size were recorded after each iteration of BPE and plotted for easier visualization.

Analyzing Categories of Languages

To understand whether these trends in content repeatability occur across a variety of texts in the same families, we applied BPE to two distinct language subgroups based on the most common NLP inputs: the Romance and Indo-Iranian families. Among these, we chose the most commonly available languages to apply BPE to. The same text used in step 2.2.2 was converted into a larger set of languages in the aforementioned families.

Applying Pre-Trained Tokenizers to Various Languages

Since NLP systems rarely use BPE alone, but rather pre-trained tokenizers made for a variety of languages, we then applied pre-trained multilingual tokenizers and compared each tokenizer’s output tokenization length and token dictionary size for the most common language of each language family identified above. We sought to identify which tokenizer provides a balance between the two metrics, and would thus be the most effective for such categories of inputs.

Pre-Trained Tokenizers for the English Language

We aimed to test pre-trained tokenizers on English texts of similar sizes. We used English texts because these tokenizers are historically trained on large English corpora, so testing them on texts in other languages would be uninformative.

We chose three commonly used pre-trained tokenizers, which are applied in developmental settings for various uses: BERT, WordPiece, and GPT-2 [3, 7, 9].

We used a MacBook Pro (2017) with a 3.1 GHz Dual-Core Intel Core i5 processor and 8 GB 2133 MHz LPDDR3 RAM. For testing each tokenizer, we used the transformers and tokenizers libraries from Hugging Face in an iPython Notebook environment.

For testing, the texts were first transformed from “.txt” files into strings after parsing and removing all control characters (\n, \t, etc.). Secondly, we ran each tokenizer on the same system and recorded the tokenization length generated, alongside the size of the token dictionary generated. Finally, to visualize the results, we used NumPy and Matplotlib to compare the pre-trained tokenizers with a dot plot.


Genome Repetitiveness by Analyzing N-Grams

After developing our superimposed plot of the 6 genomes, we found the fraction of repeated 8-grams. As the length of the genome increases, the plots flatten because, at some point in the sequence, the genome has exhibited all of its distinct n-grams. We are particularly interested in the various rates at which different genomes achieve this. Because this plot shows the repeated n-grams instead of the distinct n-grams, it still follows this pattern, simply shown in a different manner. Given the equation used for our Python dictionary in our second algorithm, (n_non_unique / (n_non_unique + n_unique)), and since all of the genomes eventually exhibit all possible 8-grams, the computed value for the following plot approaches 1 as the numerator increases.

Figure 3. 8-gram repetition analysis applied to 6 filtered genomes. A logarithmic scale is used. For example, when x = 0, log(x) equals -∞, not shown in the plot. Conversely, the log of a positive input is detailed above.

In Figure 3, all genomes tend to be more repetitive towards the beginning of their sequences, then become less repetitive, and finally more repetitive again. The fraction of repeated n-grams eventually goes to 1, as more repetition causes the lines to increase from -∞.

The purple line (the human genome) is the most repetitive out of the 6 genomes, meaning that it has more frequent repetitions than bacteria or viruses on long-length scales. This could be because human cells reproduce rarely and are rather large. There could be less of a need for human cells and eukaryotic cells to have short genomes, in comparison to viruses that need to compact their genetic material into small capsules. Further, since the human genome is involved in producing a multitude of proteins to support fundamental body functions, the repetitiveness of the sequence could be useful in ensuring that vital human reactions and systems are constantly in progress, also considering the exposure to potential mutagens.

Both viruses, shown by the red and brown lines representing SARS-CoV-2 and the Tai Forest ebolavirus (Ivory Coast), are much more repetitive toward the beginning of their sequences in comparison to the other genomes. There could be an aspect of the viral sequence that is more important for serving the virus’s function, leading to its high repetitiveness earlier on.

Figure 4. Using our initial algorithm to directly plot the number of distinct 8-grams. No logarithmic scale is used.

As the length of the sequence from the beginning of Genome 1 (Aaosphaeria arxii) increases, the plot depicts how the distinct 8-gram count eventually flattens. All possible 8-grams are eventually shown, given that 4⁸ = 65,536. Here, 4 represents the four types of bases (A, C, G, T) and 8 is the length of each n-gram. The result is the number of all possible 8-grams, which matches the plot above, as the number of distinct 8-grams eventually reaches this value.

Figure 4 presents the results in a slightly different manner than Figure 3: Figure 4 plots the number of 8-grams that are unique, the graph increasing as new 8-grams appear and leveling out once all 8-gram possibilities are shown. Conversely, Figure 3 plots the repeated fraction of 8-grams, showing increasing slopes as more repetitions occur and flattening towards 1 as there are more non-unique 8-grams.

Byte-Pair Encoding and Language Analysis

To determine the information density and repetitiveness of various languages and their corresponding language families, byte-pair encoding was applied to them. The languages analyzed include the seven most popular languages used in NLP systems as well as two language families. We looked for trends in repetitiveness by plotting the tokenization length and token dictionary size for the languages and language families.

Figure 5. Analyzing the Repeatability of the Seven Most Popular Languages Used in NLP.

From this, we see that Arabic has the smallest tokenization length at the byte level, meaning that it has the smallest number of individual characters in the text. We speculate that this is because, in Arabic, short vowels are not written but long vowels are. Arabic also works with a root system, so combinations of root letters can have multiple meanings; only when conjugated does the meaning become more specific and apparent. The final token dictionary size is also the smallest for Arabic. This could be for a number of reasons: articles such as “a/an” are not written, and verbs are usually one word. This shows that the language is written compactly, allowing NLP systems to process it more efficiently. French, on the other hand, has the largest token dictionary size and tokenization length. We assume that this is because silent letters are written in addition to consonants.

After this, we sought to understand whether this trend occurs in the languages’ corresponding language families as well. We tested two distinct families: the Indo-Iranian languages and the Romance languages.

Figure 6. BPE Applied to Five Indo-Iranian Languages. Hindi, Bengali, and Urdu belong to the Indo-Aryan family, while Persian and Pashto are part of the Iranian family.

As depicted in Figure 6, the languages in the Indo-Iranian family have generally similar characteristics. However, Pashto has a smaller tokenization length and token dictionary size, implying that information can be conveyed in fewer words, while Urdu has the largest values of the two metrics. Additionally, Bengali, Hindi, and Persian are especially similar in their starting and final tokenization lengths and token dictionary sizes. This is especially interesting because Bengali is written from left to right while Persian is written from right to left. We speculate that this is because of the origins of the languages: Persian came from Old Persian, while Bengali came from Sanskrit, and Old Persian and Sanskrit were especially close in many of their qualities.

Figure 7. BPE is Applied to Five Romance Languages.

In Figure 7, we notice that Spanish, Portuguese, and Romanian are especially close in terms of their initial and final tokenization lengths and token dictionary sizes. French and Italian, on the other hand, have the largest initial tokenization length and final token dictionary size, and their curves almost overlap each other. This is because the lexical similarity between the two languages is around 85-90% based on prior analysis, meaning that almost 90% of their words are similar [6]. As a result, they have similar characteristics in terms of repeatability. However, Italian is more difficult grammatically, which was not taken into account in the graph but can impact NLP systems’ effectiveness when text in that language is inputted.

Multilingual Pre-Trained Tokenizers

We tested three multilingual tokenizers that were pre-trained on various data: the M2M100 Tokenizer, mBART Tokenizer, and BERT Tokenizer. [4, 5, 3]

Figure 8. Three Distinct Pre-Trained Tokenizers are Applied to Seven Languages.

In Figure 8, the tokenization length and token dictionary size of six distinct languages are shown when three multilingual pre-trained tokenizers are applied. As seen in the graph, the BERT tokenizer did not perform as well, as it did not provide a balance between tokenization length and token dictionary size; rather, it provided a smaller token dictionary size in many cases, such as Urdu, but an especially large tokenization length. The M2M100 tokenizer, however, did provide this balance, which would allow for cheaper and faster evaluation of NLP systems as well as effective training, without compromising one metric for the other. Thus, we suggest that for most languages, the M2M100 tokenizer should be utilized.

A possibility as to why the M2M100 tokenizer performed more effectively is that it is a sequence-to-sequence model made for translation tasks. It can translate between 100 languages, and since it is created specifically for translation, it may have been more effective for our text, which was itself translated from English into the other languages.

Pre-Trained Tokenizers for the English Language

After running each tokenizer on the English texts, there was a clear pattern in the efficiency of each pre-trained tokenizer. Overall, BERT proved to be more efficient than the other tokenizers, as its tokenization length and token dictionary size were smaller in nearly every text, as shown in Figure 9. There were some instances where the other two tokenizers were a bit more efficient than BERT; however, in general, GPT-2 performed the worst, and WordPiece performed almost the same as BERT while sometimes being less efficient.

Figure 9. Pre-trained tokenizers applied to various English texts, with their tokenization length and token dictionary size recorded.

GPT-2 performed the worst possibly because the functionality of the model is quite different from what we are testing. GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence, but less efficient at producing the smallest number of tokens when breaking up English text.

BERT and WordPiece performed quite similarly, as BERT was actually trained on the WordPiece tokenization. There were odd cases, such as Text 4 and Text 6, where the two tokenizers were on opposite ends of the spectrum. However, this could be a case of bias in the text, as BERT prefers simpler text, while WordPiece can span larger, more complex texts easily.



Comparing the number of repeated n-grams (representative of tokens) to the length of the sequence scanned offers a view of how unique a subset, or the whole, of a genome is. All genomes eventually reach a limit on the number of unique n-grams, causing the plots to level out at a particular value. It was particularly insightful to look at the different rates at which different types of genomes exhibited this behavior.

From the genomes we analyzed, we found that the human genome, looking at chromosome 21, is more repetitive than the viruses on long-length scales, while the viruses are much more repetitive towards the beginning of their sequences in comparison to the bacteria and eukaryotes.

While working through different approaches for plotting and displaying the results, we encountered some that were hard to read due to the parameters we had set in the Jupyter Notebook. Our first algorithm therefore looks at the direct count of distinct 8-grams, while the finalized superimposed genome plot uses our second algorithm, with a logarithmic scale and the new equation used for our dictionary of distinct n-grams, (n_non_unique / (n_non_unique + n_unique)).

The foundation of our analysis into genome repeatability can be largely useful for practitioners in the genomics and bioinformatics fields interested in the uniqueness spectra of genomes, and how unique one section is compared to another. Particularly, for certain engineering practices, it is significant to target unique sites that do not have homologous base pairs in another genetic region [2].

BPE and Various Languages

For this study, byte-pair encoding was used to measure the repeatability of a variety of languages and language families. Through visualization and comparison of the token dictionary size and tokenization length, Arabic was found to be the most compact and efficient language for NLP tasks when compared with the six other most commonly used languages in NLP. This is a result of short vowels and articles not being written, as well as the language’s reliance on the root system.

When language families were compared, the Romance languages had many characteristics in common: Italian and French overlapped each other with their individual BPE curves, and Spanish, Italian, and Romanian also showed comparable behavior. The Indo-Iranian languages, by contrast, differed from one another on both metrics.

The languages and families analyzed here can inform the design of more efficient NLP systems. Our findings, such as French being less repetitive than English and other languages, suggest that languages conveying less information in more words may be less effective for NLP tasks. By analyzing the information density of text, ambiguity issues in NLP models can be reduced.

Multilingual Pre-trained Tokenizers

We also analyzed three popular pre-trained tokenizers that were trained on a variety of languages. We found that the M2M100 tokenizer was the most effective, as it balanced tokenization length against token dictionary size; this balance ultimately saves users money while still allowing effective training.

The BERT tokenizer performed worst when evaluated on these metrics: it provides a small token dictionary but an especially large tokenization length (the total number of tokens). The mBART tokenizer was also ineffective, but for the opposite reason, producing a smaller tokenization length at the cost of a larger token dictionary.

Thus, for NLP tasks involving one of the six popular languages examined, the M2M100 tokenizer is generally the most effective. This generalization may also apply to texts in other languages whose characteristics are similar to those of a language shown. Further research would be needed to examine other, less common languages and the trends across the pre-trained tokenizers used.

English-Specific Pre-trained Tokenizers

Although each pre-trained tokenizer has its own purpose and can be used for a variety of developmental applications, BERT surpassed the other tokenizers in efficiency for short English texts. Knowing that BERT is more efficient at tokenizing small texts, we can use it for more specialized applications (NLP apps, machine translation, etc.) until a better algorithm, such as an improved BPE, can be widely implemented.

Specifically, there are many potential applications for improving NLP tools for disabled people. In a study by Svensson et al. (2019), 149 students from various urban and rural schools in Sweden took part in a study “to investigate if an intervention using assistive technology for students with reading disabilities affected their reading ability, ability to assimilate text, and motivation for schoolwork” [8]. Over the four weeks of testing, students used NLP applications such as SayHi (speech-to-text, STT), VoiceDreamReader (text-to-speech, TTS), Prizmo (scanning written text into digitized text), Skolstil-2 (a simple word processor and text-to-speech app that pronounces each sound-letter, word, sentence, and the whole text while writing), Legimus (an audiobook reader), and Ruzzle (a word game). The results show gains in reading ability despite the students using nothing but assistive technology during the intervention period, and these gains were comparable to the improvement a control group achieved over a full year. Approximately 50% of the students and their parents reported an increase in motivation for schoolwork after finishing the interventions.

With tokenizers like BERT, we can improve the situation in schools for children, or even in the workforce. NLP applications can become more efficient with better-implemented tokenizers. This research is promising for future work in advancing current technology into the mainstream consumer market for NLP apps.

Future Directions

NLP Applications for the Disabled

We hope to apply our findings to devise new teaching methods for intellectually disabled and hearing-impaired students. This could include a new hybrid between an improved BPE algorithm and a historically efficient tokenizer like BERT to optimize the NLP models that developers currently use for NLP assistive technology.

Evaluation Framework for Developers

We would also like to construct an evaluation framework that provides real-time recommendations to NLP model users on which tokenizer to use for the task at hand. In this study, we determined that the BERT and M2M100 tokenizers are the most effective for English and multilingual texts, respectively; we would like to expand these findings and identify trends between language families and the behavior of these tokenizers. Because so many people use and continually improve NLP models, it would be valuable for users to understand how to best allocate their resources and which tokenizer best fits their task, whether that is translating between languages, working with English texts, or working with specific texts translated into another language.


We would like to thank our incredible mentor, Dr. Dmitri Pavlichin, for providing guidance throughout the project and sharing his knowledge with us. We would also like to thank Cindy Nguyen and Professor Tsachy Weissman for organizing and coordinating the STEM to SHTEM program. They, along with many others, have been especially influential in our research.


  1. Bostrom, K., & Durrett, G. “Byte Pair Encoding Is Suboptimal for Language Model Pretraining.” Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, doi:10.18653/v1/2020.findings-emnlp.414.
  2. Cho, Seung Woo, et al. “Analysis of off-Target Effects of CRISPR/Cas-Derived RNA-Guided Endonucleases and Nickases.” Genome Research, Cold Spring Harbor Laboratory Press, Jan. 2014.
  3. Devlin, Jacob, et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” 11 Oct. 2018, arXiv:1810.04805.
  4. Fan, Angela, et al. “Beyond English-Centric Multilingual Machine Translation.” Journal of Machine Learning Research 22, 2021, arXiv:2010.11125.
  5. Liu, Yinhan, et al. “Multilingual Denoising Pre-training for Neural Machine Translation.” Transactions of the Association for Computational Linguistics, 2020.
  6. Schepens, Job, et al. “Cross-language distributions of high frequency and phonetically similar cognates.” PLoS ONE, vol. 8, 10 May 2013, doi:10.1371/journal.pone.0063006.
  7. Schuster, Mike, and Nakajima, Kaisuke. “Japanese and Korean Voice Search.”
  8. Svensson, Idor, et al. “Effects of Assistive Technology for Students with Reading and Writing Disabilities.” Disability and Rehabilitation: Assistive Technology, vol. 16, no. 2, 2019, pp. 196–208, doi:10.1080/17483107.2019.1646821.
  9. Radford, Alec, et al. “Language Models are Unsupervised Multitask Learners.” OpenAI.

Diagnosing Tuberculosis through Analysis of X-ray Scans by Neural Networks

Blog, Journal for High Schoolers, Journal for High Schoolers 2021


David J. Florez Rodriguez, Nam Dao, Kathryn Xiong, Srividya Koppolu, Teaghan Knox


Pulmonary tuberculosis (PTB) is a potentially fatal infectious lung disease most commonly found in developing countries. Early diagnosis of PTB can help limit its spread and increase its treatability. Doctors commonly diagnose PTB using a skin test or blood test followed by an x-ray or CT scan of the lungs. In developing countries, where PTB is most prevalent, there are not always personnel with the proper training to interpret the x-rays accurately. Therefore, a computer model that diagnoses tuberculosis without requiring intensive processing power could improve diagnosis rates in such places, as well as decrease the cost. In this project, we first tested basic predictive models like logistic regression, then moved on to CNNs. We explored hyperparameter tuning and various data preparation methods in an attempt to achieve an accuracy suitable for patients’ needs. The best model we tested was an Sklearn Random Forest Classifier, which correctly classified x-rays as PTB positive or negative 83.7% of the time. While not yet accurate enough for medical purposes, it required minimal processing power and is a promising start. With more tuning and an expansion of the types of models tested, a machine learning algorithm could realistically be applied to diagnosing PTB in the field in the near future.


Pulmonary tuberculosis is a contagious airborne disease that occurs in people with weakened immune systems when the bacterium Mycobacterium tuberculosis attacks the lungs. When left untreated, M. tuberculosis causes tissue destruction and lung lesions that are often fatal (World Health Organization, 2020). Tuberculosis is a global problem, ranking as one of the top ten causes of death worldwide with 1.4 million deaths in 2019 alone. Developing countries have experienced the largest outbreaks, with over 95% of cases and deaths (World Health Organization, 2020). With the rise of new drug-resistant strains, the problem has only become more urgent, making rapid, accurate diagnosis of tuberculosis vital both for ensuring the best treatment outcome for the patient and for public health interventions to reduce further spread in the community (Ryu, 2015).

Currently, diagnosing active pulmonary tuberculosis efficiently and accurately remains a problem. The common tests for pulmonary tuberculosis are the tuberculin skin test, or an IGRA blood test used in combination with a chest x-ray (CDC, 2016). Despite the popularity of chest x-rays as a means of diagnosis, detecting abnormalities in the lungs that are consistent with pulmonary tuberculosis is difficult due to the variety of features that can occur and the similarity of the features with that of other lung diseases (Pasa et al., 2019). This is especially true in developing countries where tuberculosis is most prevalent and where there is often a lack of properly trained physicians (Melendez et al., 2016).

In such cases, the role of a well-trained physician in chest x-ray analysis could be filled by a cheaper and more effective machine learning algorithm. Given these benefits, it is no surprise that numerous groups have researched the application of machine learning algorithms to medical imaging diagnoses, including chest x-rays. Chest x-rays suit machine-learning-based tuberculosis diagnosis because of their many similarities with natural images, their relative cheapness, and the availability of data sets such as the Montgomery and Shenzhen collections.

Because of the abundance of research on tuberculosis diagnosis a variety of techniques involving machine learning have been applied. One common subject of deep learning research is the application of existing powerful natural image classifiers such as ResNet, AlexNet, and GoogLeNet (Lakhani & Sundaram, 2017). While these models achieved high accuracy they are time and memory intensive, requiring high end equipment to run (Pasa et al., 2019). Our goal is to experiment with less complex algorithms in an attempt to achieve a high accuracy chest x-ray based pulmonary tuberculosis diagnosis model that requires less processing power.

To diagnose tuberculosis using x-ray scans, we utilized a variety of deep learning classification algorithms, one per model. Models were trained on data containing both X-ray images and their corresponding labels (tuberculosis positive or negative). After processing the data, the models were used to make predictions on a test set, producing results that were verified against the true labels. From this, we extracted accuracy rates for each model, which we compared as a metric of predictive power (and thus effectiveness). We then focused on the models with the greatest potential for improvement and adjusted their hyperparameters for optimal results.

Methods and Materials:

Prior to modeling with X-rays, we used smaller images from the Tensorflow CIFAR-10 dataset and importable models from the Sklearn library. This step allowed us to conserve time and memory while testing possible data conversion methods and image recognition models. Because Sklearn’s models only accept one-dimensional arrays as input data, we had to flatten our three-dimensional numerical arrays into one-dimensional arrays. We first averaged the third dimension, color, to create black-and-white photos, matching X-rays, which are also black-and-white images. This reduced the dimensions of the images from height × width × color channels to simply height × width. We then appended each row of the resulting two-dimensional array onto the first row, creating a one-dimensional array. We tested Sklearn’s KNearestNeighbors algorithm with five neighbors, its Logistic Regression algorithm with default hyperparameters, and its MLPClassifier algorithm with default hyperparameters. The three models had accuracies of 20 percent, 28 percent, and 10 percent, respectively. While experimenting with models, we also got an accuracy of 67 percent from a Keras model with default hyperparameters.
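The flattening described above can be sketched in a few lines of NumPy (the shapes here are illustrative):

```python
import numpy as np

def flatten_for_sklearn(images: np.ndarray) -> np.ndarray:
    """Average the color axis to grayscale, then concatenate each image's
    rows into one long feature vector, as Sklearn's models require.

    images has shape (n_examples, height, width, 3).
    """
    gray = images.mean(axis=3)               # (n, height, width)
    return gray.reshape(gray.shape[0], -1)   # (n, height * width)
```

Each image thus becomes a single row of pixel intensities, which any Sklearn classifier can accept.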

Our composite dataset consisted of chest X-rays from the National Library of Medicine of patients located in Shenzhen, China, and Montgomery County, Maryland, USA, although we have only applied our models to the Shenzhen dataset. While working with the X-ray images, we first created a training set for our models. We acquired X-ray scans of people’s lungs (along with labels) from Kaggle. We stored all the images in a folder on Google Drive, which allowed for collaboration using Google Colab, and using an indicator in each of their corresponding addresses, we created the output array indicating whether each patient had tuberculosis. Because the X-ray images were not uniform in size and shape, and some contained a color dimension, we standardized the data; similar training examples help the models generalize and develop better weights. Using the same process as for our CIFAR-10 data, we converted the x-ray scan images to black and white with PIL’s getBands function, and using PIL’s resize function, we shrank all the images into 1800-by-1800 arrays. This reduction in data helped alleviate pressing concerns about the lack of computational power for processing heavier data. Furthermore, since the color of X-ray images should have no bearing on whether an image is ‘diagnosed’ with tuberculosis, removing extraneous data helps prevent model overfitting. Finally, we used Sklearn’s train_test_split function to split our data into training and testing groups; this division let us quantify the predictive accuracy of the models later. In total, we created four groups of data: X_train (the image arrays used for training), y_train (the corresponding labels to X_train), X_test (the image arrays used for testing), and y_test (the corresponding labels to X_test). The X data consisted of 3D matrices: number of examples by 1800 by 1800. The dimensions of our X_train matrix were 496 by 1800 by 1800, and 166 by 1800 by 1800 for X_test. Additionally, we used NumPy’s save and load functions to store the images, allowing for easier access to our data when building our models.
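The grayscale-and-resize step can be illustrated without PIL. This NumPy-only stand-in uses nearest-neighbor resampling (cruder than PIL's resize, and the function name is ours), but it shows the same shape transformation:

```python
import numpy as np

def to_gray_square(img: np.ndarray, size: int) -> np.ndarray:
    """Average any color channels away, then nearest-neighbor resample
    the image to a (size, size) array."""
    if img.ndim == 3:
        img = img.mean(axis=2)       # collapse color to grayscale
    h, w = img.shape
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    return img[rows][:, cols]
```

Applied to every scan with size=1800, this yields the uniform arrays described above.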

Before our image data was ready, however, we needed to complete one last step. Since specific models require specific data input, we reshaped the size of our training and testing data to fit specific requirements. For Convolutional Neural Networks (CNNs), which require 4D matrices, we added an extra dimension to our input (X_train and X_test) data. For all other models, which require 2D matrices, we reduced the dimension using the function numpy.reshape().
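With small placeholder shapes standing in for 496 × 1800 × 1800, the two reshapes look like this:

```python
import numpy as np

# Placeholder batch: 4 grayscale "images" of 8x8 pixels.
X = np.zeros((4, 8, 8))

# CNNs expect a trailing channel axis: (examples, height, width, channels).
X_cnn = X.reshape(4, 8, 8, 1)

# All other models expect 2-D input: (examples, features).
X_flat = X.reshape(4, 8 * 8)
```

Both calls only change the array's view of its data, so they cost no extra memory.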

Next, we tested several models to experiment with effectiveness and accuracy. The models we worked with were logistic regression, random forest classifier, extra trees, naive Bayes, CNN, and KNN. Where available, we set hyperparameters to fine-tune their behavior.

Most of these models had few or no hyperparameters. Our naive Bayes model, for example, relied entirely on mathematical prediction, devoid of personal input. In the case of our K-Nearest Neighbors model, the only intuitive number of neighbors was 2, as the distinction in the data is binary: the images are either labeled with tuberculosis or not. We used warm start on our logistic regression, random forest, and extra trees classifiers. The warm-start hyperparameter allowed us to create a model on the first fit and continue training it with new data; we therefore split our data into three batches to conserve RAM while still training efficiently. Other models contained many more hyperparameters. In particular, we had to decide on the optimal architecture for our CNN model, which we constructed from a series of layers imported from the Python library Tensorflow. Although we could have implemented a more sophisticated architecture that would likely yield better predictions, we purposefully picked a simpler design with fewer layers (Figure 3). This is in accordance with our research goal: to create a model that doesn’t require absurd amounts of computational power and time, allowing it to be implemented on computers in developing countries. Furthermore, an overly complex architecture could cause problems with overfitting.

Figure 3. Architecture of CNN

To maximize accuracy, we compiled our CNN with the Adam optimizer and a binary cross-entropy loss function. For our final dense layer, we used sigmoid activation. These are all standard hyperparameters for CNNs that classify binary labels. Despite high accuracy on the training data, our first CNN model didn’t perform well on unseen data. Thus, we included a validation_split argument when fitting our second CNN model. Besides this change, the two models are identical.
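The sigmoid activation and binary cross-entropy loss named above can be written out directly; this NumPy sketch mirrors what the framework computes for each batch:

```python
import numpy as np

def sigmoid(z):
    """Squash a logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; y_pred are sigmoid outputs.
    Clipping avoids log(0) for saturated predictions."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))
```

The loss is near zero for confident correct predictions and grows without bound as a prediction confidently misses, which is what drives the network's weight updates.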


Through trial and error, our accuracy improved. For the random forest classifier model, the model with the best results, we got an accuracy of 83.7%. The first CNN model tested successfully categorized about 81.3% of the x-ray scan data, classifying them as tuberculosis positive or negative. After changes, the second iteration of the CNN was only able to categorize around 79.5% of the data.

We also tested several other models that resulted in varying levels of accuracy (Table 1). It is hard to pinpoint exactly why one kind of model outperformed another. However, the K-Nearest Neighbors, logistic regression, and naive Bayes models likely had the lowest accuracies because of the nature of the input data: these models couldn’t detect patterns in pixel variations. By merely comparing the values of individual pixels across images and trying to detect overarching differences, these models lack the sophistication to detect the diverse forms that tuberculosis can take in an X-ray. Although convolutional neural networks should theoretically be the best of all these algorithms at image classification, issues with overfitting likely caused their predictive accuracy to drop.

Additionally, we used confusion matrices to visually group correct and incorrect diagnoses: true positives, true negatives, false positives, and false negatives. Confusion matrices are useful for visualizing the performance of a classification model and for judging its usefulness. The confusion matrix for the second CNN model showed no improvement; in fact, it got worse in terms of correct predictions. Theoretically, the only change made, a validation split, should improve accuracy, but in this case accuracy decreased. In Figures 4 and 5, rows indicate true values and columns depict predicted values. For imbalanced data, however, accuracy alone can be misleading, since a model can score well simply by predicting the majority class for every point.
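A 2×2 confusion matrix of this kind is straightforward to build. In this sketch (the function name is ours), rows are true labels and columns predictions, ordered [negative, positive] to match the convention described for Figures 4 and 5:

```python
import numpy as np

def confusion_matrix_2x2(y_true, y_pred):
    """Counts of (true label, predicted label) pairs for binary labels.
    m[0, 0] = true negatives,  m[1, 1] = true positives,
    m[0, 1] = false positives, m[1, 0] = false negatives."""
    m = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[int(t), int(p)] += 1
    return m
```

Separating the four cell counts makes imbalance visible in a way a single accuracy number cannot.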

It seems plausible that by increasing the amount of training data or tuning hyperparameters on a bigger dataset, we could obtain improved results with our architecture. The use of high-capacity models is, however, out of the scope of this work.



K-Nearest Neighbors (KNN)
Logistic Regression
Naive Bayes
Extra Trees Classifier
Random Forest Classifier

Table 1. Models tested and accuracies in order of increasing precision


The best algorithm we tested achieved an 84% accuracy rating, which, while good, requires improvement before it could be applied in a medical setting. It is lower than the accuracy achieved by powerful industry models, which often reach the upper 90s. We are confident that, given more resources (more computational power), our models could be fine-tuned to predict better without compromising runtime. We also attribute the lower predictive accuracy of our two CNN models to overfitting due to suboptimal data: the models performed well on training data but experienced a sudden drop in accuracy on the testing data. This gap is likely the result of faulty image processing and data splitting. For example, it is an often-recommended practice in deep learning to have an equal number of examples for each label in a training set; for our project, this would have meant training data with one half tuberculosis images and one half without. However, our training data had markedly more examples labeled no-tuberculosis than ones labeled tuberculosis. Because Google Colab had limited RAM, we also had to compromise on other data-processing techniques that might have increased the efficacy of our models; primarily, the input data couldn’t be normalized on Colab without stopping the kernel. Finally, in trying to achieve a homogeneous image size of 1800 by 1800, we sacrificed valuable data that could have improved all of our models: since most images were much larger than 1800 by 1800 pixels (closer to 3000 by 3000), our downsampling discarded a lot of useful information.

Another source of error in our project was running each model only once. Since computational learning can vary from one run to another, there would be slight variations in predictive accuracy had each model been run multiple times. Each model should have been run a standardized number of times and had its accuracies averaged to produce the most reliable number. However, lack of time (some models took many hours to fit and compile) meant that this couldn’t be accomplished.

However, we did achieve our goal of using much smaller amounts of processing power. All of the models (besides our CNNs) were designed and implemented via Google Colab, a platform with a very small amount of available RAM for processing data and running models. We are confident that, since our models were able to run on Colab, the models will be available for implementation in developing countries as we originally intended. Although currently inapplicable for consistently and accurately diagnosing patients with tuberculosis, our models are open to a plethora of future fine tuning to achieve higher efficacy. With these improvements, we are confident that our models will be eventually fit for wide commercial usage.

Future Directions:

For our future plans, we will work on experimenting with various hyperparameters on our models. For example, more (and different kinds of) layers can be added to our CNNs. When compiling our CNN models, we could also add validation data, which would theoretically improve the image classification process. Furthermore, we can compare the results of our models with those of well-known general purpose CNNs like Google’s Inception. These models are well known go-tos for their generalized image classification abilities and are also highly computationally demanding. Along with emphasis on the design of new smaller-scale models, more research can be conducted into reducing the sophistication of general purpose CNNs so that they require less processing power. This alternative method to achieving the right combination of computational power and predictive accuracy may prove more complex but also highly rewarding.

Finally, we will use metadata for better model learning. The data set we used also included extra labels for each image, including the gender and age of the patient in each photo. The data set we utilized originated from Shenzhen, China, but there is another, identically structured data set from Montgomery, MD, USA. By including data from the U.S. data set and utilizing the metadata labels, we can train our models to be more sensitive to variations across gender, age, and geography.


  1. Candemir S, Jaeger S, Palaniappan K, Musco JP, Singh RK, Xue Z, Karargyris A, Antani S, Thoma G, McDonald CJ. Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration. IEEE Trans Med Imaging. 2014 Feb;33(2):577-90. doi: 10.1109/TMI.2013.2290491. PMID: 24239990
  2. Centers for Disease Control and Prevention (CDC). (2016, April 14). Testing & diagnosis. Tuberculosis (TB).
  3. Jaeger S, Karargyris A, Candemir S, Folio L, Siegelman J, Callaghan F, Xue Z, Palaniappan K, Singh RK, Antani S, Thoma G, Wang YX, Lu PX, McDonald CJ. Automatic tuberculosis screening using chest radiographs. IEEE Trans Med Imaging. 2014 Feb;33(2):233-45. doi: 10.1109/TMI.2013.2284099. PMID: 24108713
  4. Lakhani, P., & Sundaram, B. (2017, April 24). Deep learning at Chest Radiography: Automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology.
  5. Melendez, J., Sánchez, C., Philipsen, R. et al. An automated tuberculosis screening strategy combining X-ray-based computer-aided detection and clinical information. Sci Rep 6, 25265 (2016).
  6. Pasa, F., Golkov, V., Pfeiffer, F. et al. Efficient Deep Network Architectures for Fast Chest X-Ray Tuberculosis Screening and Visualization. Sci Rep 9, 6268 (2019).
  7. Ryu Y. J. (2015). Diagnosis of pulmonary tuberculosis: recent advances and diagnostic algorithms. Tuberculosis and respiratory diseases, 78(2), 64–71.
  8. World Health Organization. (2020, October 14). Tuberculosis. World Health Organization.