It’s playful and vibrant. It’s TANGO... because bug triaging takes two.
It is no secret that mobile phone apps are a significant portion of the software market. There are roughly 7 billion mobile users worldwide. 2020 saw annual mobile app downloads hit 218 billion. Worldwide spending on mobile apps hit $143 billion in 2020, up 20% from 2019.
Apps are not simple pieces of software either. Some modern apps can replace desktop applications, and there are 3D mobile games that can hold their own against their desktop counterparts. There is also an increasing trend of tablets (e.g., iPad) replacing laptops. Most tablets are limited to mobile operating systems and apps, so those who switch must find apps to replace their laptop applications.
Like all complex software, apps often have bugs, and bugs are often complicated to reproduce. Therefore, modern apps let users report bugs graphically. For example, Android and iOS apps can integrate screen-recording capabilities for bug reporting. Users increasingly use these capabilities, and more bug-tracking tools (e.g., GitHub) support embedded videos.
Once a bug report is received, developers must triage it. In other words, they must determine whether it describes a new or a known bug. Triaging must be done for both written and video bug reports. For written reports, developers can read the report to see if it is a duplicate. With video reports, developers have to watch the video, which is laborious and time-consuming. Cooper et al. (link) conducted a user study showing that participants spent about 20 seconds watching a video bug report before triaging it. Extrapolating those 20 seconds per report to the numerous reports an app may receive daily, one has to conclude that this is a challenge worth addressing.
Cooper et al. proposed Detecting Duplicate Screen Recordings of Software Bugs (TANGO) to automate the identification of duplicate video bug reports. TANGO leverages both computer vision and machine learning techniques. When a video bug report comes in, TANGO scores it against each existing report, producing a similarity score between the new report and each known one. Scoring considers the video's visual and textual features.
TANGO is a dual-pipeline system. One pipeline deals with the video report's textual features, and the other deals with its visual features. Upon receiving a new bug report, TANGO decomposes the video into individual frames (sampled at 5 FPS). Every existing video bug report used as a point of comparison goes through the same process. Both pipelines then receive the frames.
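The frame-sampling step can be sketched in pure Python. This is a minimal illustration (the function name and rounding scheme are my own, not TANGO's): it computes which frame indices to keep when downsampling a video to 5 FPS.

```python
def sample_frame_indices(total_frames, source_fps, target_fps=5.0):
    """Return the indices of frames to keep when downsampling a video
    to target_fps (TANGO samples frames at 5 FPS)."""
    step = source_fps / target_fps  # e.g., 30 FPS source -> keep every 6th frame
    indices = []
    i = 0.0
    while round(i) < total_frames:
        indices.append(int(round(i)))
        i += step
    return indices

# A 2-second clip recorded at 30 FPS yields 10 frames at 5 FPS:
print(sample_frame_indices(60, 30.0))  # -> [0, 6, 12, 18, 24, 30, 36, 42, 48, 54]
```

In practice, a library such as OpenCV or FFmpeg would read the video and extract the frames at these indices.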
The first component of the visual pipeline is the Visual Feature Extractor. This component, unsurprisingly, extracts the visual features in the individual frames (e.g., edges). It does that using the SimCLR or SIFT algorithms. After the Visual Feature Extractor, the pipeline splits into Unordered and Ordered branches. The Visual Indexer and Video Overlap Identifier components ingest the extracted features.
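As a toy illustration of what "visual features" means here, consider edge strength. The following NumPy sketch is a stand-in of my own devising, not the actual SIFT or SimCLR extractors TANGO uses: it computes per-pixel gradient magnitudes, which is the raw signal edge detectors build on.

```python
import numpy as np

def edge_features(frame):
    """Toy stand-in for a visual feature extractor (TANGO uses SIFT or
    SimCLR): per-pixel gradient magnitude, i.e., edge strength."""
    gy, gx = np.gradient(frame.astype(float))  # vertical and horizontal gradients
    return np.hypot(gx, gy)

# A frame with a vertical step edge produces strong responses at the boundary:
frame = np.zeros((4, 4))
frame[:, 2:] = 255.0
print(edge_features(frame).max() > 0)  # -> True
```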
Following the Unordered branch, we reach the Visual Indexer. The Visual Indexer applies Bag of Visual Words (BoVW) to the extracted features. BoVW is for visual features what Bag of Words (BoW) is for text. BoVW starts with clustering: the Visual Indexer clusters the features using k-means clustering, and the cluster centroids make up the vocabulary of BoVW. After clustering, each video's features are tallied into a frequency histogram over that vocabulary.
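The histogram step can be sketched as follows. This assumes the k-means vocabulary has already been learned (the function names and the tiny two-word vocabulary are illustrative, not from TANGO): each feature is assigned to its nearest visual word, and the counts form the histogram.

```python
import numpy as np

def bovw_histogram(features, vocabulary):
    """Assign each frame feature to its nearest visual word (a k-means
    centroid) and count occurrences -- the Bag of Visual Words histogram."""
    # Pairwise distances: (n_features, n_words)
    d = np.linalg.norm(features[:, None, :] - vocabulary[None, :, :], axis=2)
    words = d.argmin(axis=1)
    return np.bincount(words, minlength=len(vocabulary))

# Hypothetical 2-word vocabulary; three features near word 0, one near word 1:
vocab = np.array([[0.0, 0.0], [10.0, 10.0]])
feats = np.array([[0.1, 0.2], [0.0, 0.1], [0.3, 0.0], [9.8, 10.1]])
print(bovw_histogram(feats, vocab))  # -> [3 1]
```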
Continuing along the Unordered branch, we come to the Visual Encoder. The Visual Encoder is a simple component. It takes the histograms produced by the Visual Indexer and applies Term Frequency-Inverse Document Frequency (TF-IDF) to them. The term frequency is simply the count of each feature group in the BoVW histogram. The inverse document frequency down-weights feature groups that appear across many videos in TANGO's training set. The intuition for TF-IDF comes from BoW. In a language (e.g., English), some words are common (e.g., the, a, of). These frequent words are noise; they are not effective for discriminating between documents. So, we rely on rarer words to identify documents. The Visual Encoder creates a visual vector for each set of extracted features. In other words, the Visual Encoder creates an embedding that represents each video bug report.
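A minimal TF-IDF pass over BoVW histograms might look like this. The smoothing details below are a common textbook variant, not necessarily TANGO's exact formulation:

```python
import numpy as np

def tfidf_vectors(histograms):
    """Turn BoVW histograms (one row per video) into TF-IDF embeddings.
    TF: relative frequency of each visual word within a video.
    IDF: log of the inverse fraction of videos containing that word
    (a common variant; the exact TANGO formulation may differ)."""
    H = np.asarray(histograms, dtype=float)
    tf = H / H.sum(axis=1, keepdims=True)
    df = (H > 0).sum(axis=0)                      # videos containing each word
    idf = np.log(len(H) / np.maximum(df, 1)) + 1  # +1 keeps ubiquitous words nonzero
    return tf * idf
```

A visual word that appears in every video contributes little to the embedding, exactly as "the" contributes little to identifying an English document.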
Now, we explore the Ordered branch. The intuition behind this branch is simple. First, duplicate video bug reports should have similar videos. Second, differing bug reports may have similar beginnings but should have distinct endings. After all, a bug report video will likely end after the bug occurs.
The first component on the Ordered branch is the Video Overlap Identifier. It extracts the longest common sequence of frames between two videos using modified longest common substring (LCS) algorithms. The algorithms use fuzzy matching to compare video sequences. One of the algorithms weights frames closer to the end of a video more heavily, since the bug will likely appear towards the end. The Video Overlap Identifier calculates the LCS scores and passes them to the Sequential Comparison Process module, which computes a similarity score for two bug reports by normalizing the LCS scores.
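The fuzzy-matching idea can be sketched with a standard LCS dynamic program in which two frames "match" when their embeddings are similar enough. This is a simplified sketch of my own (the threshold value and accumulation scheme are assumptions, not TANGO's exact algorithm):

```python
import numpy as np

def fuzzy_lcs_overlap(a, b, threshold=0.7):
    """LCS dynamic program over two sequences of frame embeddings, where
    frames 'match' if their cosine similarity exceeds a threshold.
    Returns the accumulated similarity of the matched subsequence
    (the 'overlap'). A sketch; TANGO's exact scoring may differ."""
    m, n = len(a), len(b)
    dp = np.zeros((m + 1, n + 1))
    for i in range(m):
        for j in range(n):
            sim = np.dot(a[i], b[j]) / (np.linalg.norm(a[i]) * np.linalg.norm(b[j]))
            if sim >= threshold:
                dp[i + 1][j + 1] = dp[i][j] + sim   # extend the matched sequence
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

# Identical 3-frame videos overlap completely:
print(fuzzy_lcs_overlap(np.eye(3), np.eye(3)))  # -> 3.0
```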
TANGO’s text pipeline compares the texts in two bug report videos. The texts are from GUI components (e.g., labels, titles, messages). TANGO does not look at texts that are not in a video. That means it ignores any user-written messages included in the video bug reports.
The Textual Extractor uses optical character recognition (OCR) on the individual extracted frames to build a “document” with the texts in a video. The “document” goes through a standard NLP preprocessing pipeline (e.g., tokenization, lemmatization, stop-word removal). The Textual Indexer indexes the processed documents, with Lucene, for faster similarity comparisons. Finally, the Textual Encoder computes an embedding vector for a document using its TF-IDF and index.
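The preprocessing half of this step can be sketched in pure Python. OCR itself (and lemmatization) are elided; the toy stop-word list below is illustrative, not the one TANGO uses:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "is", "in"}  # toy stop-word list

def build_document(frame_texts):
    """Concatenate per-frame OCR output into one 'document', then apply a
    minimal preprocessing pass: lowercasing, tokenization, and stop-word
    removal. A real pipeline would also lemmatize."""
    raw = " ".join(frame_texts)
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Texts pulled (e.g., via OCR) from two frames of a hypothetical settings screen:
print(build_document(["Settings", "Enable the Dark Mode"]))
# -> ['settings', 'enable', 'dark', 'mode']
```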
The Similarity Computation & Ranking unit is where all pipelines and branches converge. It calculates the similarities for each branch and pipeline. Then, the individual scores are combined. The algorithm for the Unordered branch of the visual pipeline is:
Sᵥᵢₛ = Sₑ
So, the visual similarity is the similarity between the embeddings produced by the Unordered branch. We use cosine similarity to compare the embeddings.
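Cosine similarity itself is one line: the cosine of the angle between two embedding vectors, 1.0 when they point the same way and 0.0 when they are orthogonal.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # -> 1.0
```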
The similarities for the Ordered branch are computed as follows:
Sₗ = Sₑ
Sₗ꜀ₛ = overlap / [ ∑ (i / min) × ((max - i) / max) ]
Sₗ is the similarity from LCS fuzzy matching without weights. Sₑ is the cosine similarity of the embeddings from SimCLR or the Visual Encoder. Sₗ꜀ₛ is the similarity for LCS with weights. overlap is the amount of overlap between the two videos, min is the number of frames in the shorter video, and max is the number of frames in the longer video.
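Assuming the sum in the denominator runs over frame positions i = 1 … min (an assumption on my part; the formula above does not state the bounds), the weighted normalization can be sketched as:

```python
def weighted_lcs_similarity(overlap, n_min, n_max):
    """Normalize a weighted-LCS overlap score: divide by the sum over frame
    positions i of (i / min) * ((max - i) / max). The (i / min) factor
    emphasizes positions near the end of the shorter video, where the bug
    is likely to appear. Bounds i = 1..min are an assumption."""
    denom = sum((i / n_min) * ((n_max - i) / n_max) for i in range(1, n_min + 1))
    return overlap / denom
```

For example, with a 2-frame and a 4-frame video, the denominator is (1/2)(3/4) + (2/2)(2/4) = 0.875, so an overlap of 0.875 normalizes to a similarity of 1.0.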
The similarity for the text pipeline is Lucene’s scoring function (i.e., cosine similarity and document length normalization). Combining the similarities from the two pipelines is straightforward:
Sₜₒₜ = (1 - w) × Sᵥᵢₛ + w × S꜀
Sₜₒₜ is the combined similarity, w is a weighting value to favor one pipeline over another, Sᵥᵢₛ is the similarity from the visual pipeline, and S꜀ is the similarity from the text pipeline.
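The combination is a plain linear interpolation between the two pipeline scores:

```python
def combined_similarity(s_vis, s_txt, w=0.5):
    """S_tot = (1 - w) * S_vis + w * S_txt. With w = 0.5, the visual and
    textual pipelines are weighted equally; w = 0 ignores the text pipeline."""
    return (1 - w) * s_vis + w * s_txt
```

For instance, a visual similarity of 0.8 and a textual similarity of 0.4 combine to 0.6 under equal weighting.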