How Do Phones Identify Potential Spam Calls?
If you have a phone in the United States, you’ve probably gotten a call from a recorded voice pitching you a product to buy, a cause to donate to, or a free trial to sign up for.
Spam and scam calls are ubiquitous. In 2021, First Orion estimated that approximately 110 billion scam calls were placed in that year alone, which may explain why 90 percent of calls from unknown numbers go unanswered.
The increasingly common spam warning message that pops up on your phone when it rings is part of the ongoing battle against such calls. These warnings are the result of machine learning efforts deployed by voice service providers, device companies, and third-party app makers. Not only can these warnings alert users before they pick up a call, but they can also help catch the scammers.
How Machine Learning Generates Spam Call Warnings
When your phone’s caller ID says “Spam Risk” or “Scam Likely,” that is based on a machine learning analytics engine used by the carrier, Mike Rudolph, chief technology officer at YouMail, said. The big three carriers all partner with different analytics engine vendors: AT&T with Hiya, Verizon with TNS, and T-Mobile with First Orion.
“All three of those guys have used machine learning based upon the data set they operate from to give you that ‘spam risk’ indication on those three mobile operators,” Rudolph said.
The data sets carriers use for this process come from call detail records. Calls made over the phone or via voice over internet protocol systems generate call detail records, which are logged by voice service providers (also called carriers) and telephone exchanges (also known as switches). The call detail records contain basic metadata about the call, such as the call origin, destination, type of media (audio, SMS), call duration, and connection status.
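As a rough illustration of the metadata described above, a call detail record could be modeled as a small Python data class. The class name and field names are assumptions chosen for readability, not any carrier's actual schema, and the timestamp field is an assumed addition that the list above does not mention.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallDetailRecord:
    """One logged call. All field names here are hypothetical."""
    origin: str            # calling number as presented to the network
    destination: str       # called number
    media_type: str        # e.g. "audio" or "sms"
    duration_seconds: int  # 0 if the call never connected
    connected: bool        # connection status
    started_at: datetime   # assumed extra field: when the call attempt began


record = CallDetailRecord(
    origin="+15550100001",
    destination="+15550100002",
    media_type="audio",
    duration_seconds=0,
    connected=False,
    started_at=datetime(2023, 1, 2, 9, 0),
)
```

Note that nothing in such a record describes what was said on the call; it is metadata only.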
“The behavioral analytics have been trained that a number it hasn’t seen before that makes 50,000 calls at 9 a.m. on a Monday is suspicious.”
Analytics engine vendors typically use behavioral analytics to identify suspicious callers by examining the reach (the number of people a particular number is calling) and the frequency (the number of calls made in a specific time).
Rudolph gave an example of a new number suddenly making tens of thousands of calls within a network around 9 a.m. on certain days.
He said, “We have trained the behavioral analytics to mark a high volume of calls at an unusual time as ‘spam likely.’”
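The reach and frequency signals can be sketched in a few lines of Python. This is a simplified stand-in that uses fixed thresholds, whereas the engines described here are trained on data; the function names, thresholds, and one-hour window are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical thresholds; a real engine would learn what counts as unusual.
REACH_THRESHOLD = 3
FREQUENCY_THRESHOLD = 4
WINDOW = timedelta(hours=1)


def reach_and_frequency(calls, window_start):
    """Return {caller: (reach, frequency)} for calls inside the window.

    calls is an iterable of (caller, callee, started_at) tuples.
    Reach is the number of distinct callees; frequency is the call count.
    """
    window_end = window_start + WINDOW
    callees = defaultdict(set)
    counts = defaultdict(int)
    for caller, callee, started_at in calls:
        if window_start <= started_at < window_end:
            callees[caller].add(callee)
            counts[caller] += 1
    return {caller: (len(callees[caller]), counts[caller]) for caller in counts}


def flag_suspicious(calls, window_start, known_callers=frozenset()):
    """Flag previously unseen callers that meet both thresholds."""
    stats = reach_and_frequency(calls, window_start)
    return sorted(
        caller
        for caller, (reach, frequency) in stats.items()
        if caller not in known_callers
        and reach >= REACH_THRESHOLD
        and frequency >= FREQUENCY_THRESHOLD
    )


nine_am = datetime(2023, 1, 2, 9, 0)  # a Monday
# One new number calls five different people; another makes a single call.
calls = [
    ("+15550109999", f"+1555010{n:04d}", nine_am + timedelta(seconds=n))
    for n in range(5)
]
calls.append(("+15550100001", "+15550100002", nine_am))
suspicious = flag_suspicious(calls, nine_am)
```

Here only the number that made five calls to five different people is flagged. Passing a set of established numbers as `known_callers` mimics the idea that a number the system has seen before is treated differently from a brand-new one.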
In addition to the data in call detail records, phones offer built-in tools to identify and label spam calls. This additional data stream can be used in machine learning processes to identify potential spam calls. Apple, for example, has its “Silence Unknown Callers” feature on phones running iOS 13 and later operating systems. The Google Phone app for Android similarly includes caller ID and spam protection options that allow users to mark calls as spam.
Carriers similarly have their own systems: T-Mobile has ScamShield powered by First Orion, Verizon has Call Filter powered by TNS’ Call Guardian, and AT&T has Call Protect powered by Hiya. Third-party apps like YouMail, RoboKiller, CallApp, and those put out by Hiya, TNS, and First Orion also allow users to mark calls as spam.
Albar Wahab, a data scientist trainee at Data Science Dojo, said that calls marked this way are added to a database as spam call entries, alongside regular calls. Feature engineering can then help identify the best indicators of spam calls, he explained. Traditional machine learning classification algorithms, such as support vector machines, can be used to predict whether an incoming call is potentially spam. Deep learning algorithms like convolutional neural networks and long short-term memory networks, he added, can effectively automate the feature engineering process.
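To make the classification step concrete, here is a minimal sketch of a linear support vector machine trained by subgradient descent on the hinge loss, written in plain Python so it runs without extra libraries. It is a toy stand-in for a production classifier, not anyone's actual system. The two features (a scaled call rate and the share of users who marked the number as spam) and all sample values are invented for illustration.

```python
def train_linear_svm(samples, labels, epochs=200, learning_rate=0.1, reg=0.01):
    """Fit a linear SVM by subgradient descent on the regularized hinge loss.

    labels must be +1 (spam) or -1 (not spam).
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            score = sum(w * x for w, x in zip(weights, features)) + bias
            margin = label * score
            # L2 regularization: shrink the weights slightly on every step.
            weights = [w * (1 - learning_rate * reg) for w in weights]
            if margin < 1:
                # Inside the margin or misclassified: move toward this sample.
                weights = [
                    w + learning_rate * label * x
                    for w, x in zip(weights, features)
                ]
                bias += learning_rate * label
    return weights, bias


def predict(weights, bias, features):
    """Return +1 for likely spam, -1 otherwise."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else -1


# Hypothetical engineered features, both scaled to the range 0..1:
# (call rate, share of recipients who marked the number as spam)
samples = [
    (0.90, 0.80), (0.80, 0.90), (0.95, 0.70),  # entries marked as spam
    (0.05, 0.00), (0.10, 0.05), (0.02, 0.10),  # regular calls
]
labels = [1, 1, 1, -1, -1, -1]

weights, bias = train_linear_svm(samples, labels)
prediction = predict(weights, bias, (0.85, 0.85))
```

In practice the feature set would be far richer, and choosing those features is the feature engineering work Wahab describes.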
How Do Spoofed Calls and Robocalls Work?
Not all spam calls are robocalls, and not all robocalls are spam, though there can be a lot of overlap. When spammers carry out robocalls, they usually spoof the calls, which means that the number appearing on your caller ID is not the actual number from which the call originated.
There are legitimate reasons for call spoofing, such as when a doctor calling you back from a personal phone wants to protect their privacy. However, scammers use spoofed robocalls to avoid being detected and tracked down.
“Offshore, and even onshore, less desirable-type companies you wouldn’t want to work with don’t want you to find out who they are, and they don’t want you to call them back on their real phone number, so they spoof a phone number,” said Brian Podolak, CEO at Vocodia, an AI sales and customer service platform.
Spoofing is a growing part of robo-scammers’ arsenals, and this can dull the edge that machine learning puts on scam detection efforts. The short version, as Omer Khan, chief technology officer at Vocodia, put it, is that machine learning suffers from the “garbage in, garbage out” issue.
Spoofed numbers can result in a lot of noise in a spam-detection machine-learning model, Vrabec said. This could result in false signals.
“I could use your phone number and start spamming people with it,” she added. “If you started to collect the wrong data about that number, you could easily mess up somebody’s landline connection and deliverability of those calls.”
Apps to Identify and Block Spam Calls
Privacy laws limit voice service providers to call detail records data when identifying potential spam calls, while third-party apps and services can provide users with more information about calls.
Audio Fingerprinting Apps
YouMail, which makes robocall-blocking software, uses an audio fingerprint system to analyze the content of a call to identify known and scam-likely robocalls without anyone listening to the call.
“We are 100 percent based on the audio in calls and we do nothing related to reach or frequency of calls,” Rudolph said. “For us, because we are an over-the-top information service, we can train machine learning based upon what the call said. That’s a completely different machine learning.”
YouMail, Rudolph explained, takes the audio of calls and turns it into images using the fast Fourier transform, or FFT, and the constant-Q transform, or CQT. The resulting image is the audio fingerprint of a call. Using both supervised and unsupervised machine learning algorithms, YouMail plots the auditory differences between sample calls’ fingerprints and those of known scam calls. The smaller the auditory difference between a sample or ongoing call’s fingerprint and that of known scam calls, the more likely that call is to be a scam.
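The idea of turning audio into an image and measuring how far apart two such images are can be sketched as follows. This is not YouMail's system: it uses a naive discrete Fourier transform (an FFT computes the same values faster), leaves out the constant-Q transform, replaces the machine learning step with a plain Euclidean distance, and uses synthetic tones in place of call audio. All names and constants are assumptions.

```python
import cmath
import math

SAMPLE_RATE = 800  # Hz; deliberately tiny so the naive transform stays fast
FRAME_SIZE = 64    # samples per frame


def dft_magnitudes(frame):
    """Magnitude spectrum of one frame via a naive discrete Fourier transform."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


def fingerprint(samples):
    """Spectrogram-like image: one row of frequency magnitudes per frame."""
    starts = range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE)
    return [dft_magnitudes(samples[i:i + FRAME_SIZE]) for i in starts]


def distance(fp_a, fp_b):
    """Euclidean distance between two equally sized fingerprints."""
    return math.sqrt(
        sum(
            (a - b) ** 2
            for row_a, row_b in zip(fp_a, fp_b)
            for a, b in zip(row_a, row_b)
        )
    )


def tone(frequency_hz, amplitude=1.0, n_samples=256):
    """Synthetic stand-in for recorded call audio."""
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * t / SAMPLE_RATE)
        for t in range(n_samples)
    ]


known_scam = fingerprint(tone(100))
similar_call = fingerprint(tone(100, amplitude=0.9))  # same content, quieter
different_call = fingerprint(tone(250))               # unrelated content

is_closer = distance(known_scam, similar_call) < distance(known_scam, different_call)
```

The quieter copy of the same tone lands much closer to the known fingerprint than the unrelated tone does, which is the property a fingerprint comparison relies on.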
The audio fingerprints can also help identify potential new scams as they occur, either based on a new cluster of very similar content or because of the content itself.
“For example, our machine learning knows some things that are binary,” Rudolph said. “If you get a call that says it is the [Internal Revenue Service] or the [Social Security Administration], that’s unequivocally going to be a fraudster calling you.”
The ability to identify scam-likely calls in progress using audio fingerprints also allows for faster reporting to potentially identify bad actors, or at least the voice service provider that carried the call.
Rudolph explained that when YouMail identifies a call with audio fingerprints of known scam calls, it can send it to the Industry Traceback Group within seconds. The Industry Traceback Group can then track the scam call back to the provider that enabled the call. The TRACED Act, signed into law in 2019, requires voice service providers to deactivate accounts that make illegal calls.
The Future of Spam Call Detection
Complicating the use of data to identify spam calls is the current nature of the telephony landscape. Various entities such as voice service providers, device companies, third-party apps, states, and even countries (consider the National Do Not Call Registry) collect call information in numerous databases, each with its own spam detection system, leading to slight variations across the board.
“Everybody’s doing it their way and thinks they have the better mousetrap,” Podolak said.
While there are some public registries of information on numbers, most of the carrier- or device-level registries do not interact. Both Khan and Podolak said that would have to change for machine learning to make bigger advances against spam calls.