Usage-based insurance: Big data, machine learning, and putting telematics to work

By Marcus Looft, Terry Wade, Scott C. Kurban | 06 May 2015

Telematics (the art and science of information involving vehicular technologies, road transportation, and safety, gathered from onboard sensors and wireless technologies) is fast becoming smart business across many different industry sectors. This technology enables more efficient vehicle, trailer, and container tracking for enhanced fleet management, for example. More importantly for auto insurers, it opens up an unprecedented, granular view of the road habits of individual drivers, providing a portrait from which risk may be judged more accurately on a case-by-case basis and making possible the effective implementation of usage-based insurance (UBI).

In Europe, a slow but steady turn toward UBI is unmistakable. Italy, an early adopter, presently sees about 4% of its auto insurance business affected by information derived via telematics. Italian insurers are beginning to offer discounts based on this data, for example to low-mileage drivers. Carriers in Italy and elsewhere in Europe, where UBI penetration hovers closer to the 1% mark, are exploring similar offerings: maintenance plans, vehicle diagnostics, new car discounts, precision discounts for safe driving, more attractive and efficient roadside assistance as part of their policies, and even feedback to parents on their teen drivers.[1] In the next five years, the penetration of UBI is expected to rise to around 15% in Europe, a trend that is expected to continue with increasingly rapid and widespread adoption.[2]

There's just one problem: There's too much information. While it's arguable that too much information is a good problem to have, the reality for auto insurers is that, in many cases, they are sitting on vast troves of data that they aren't using at all. They simply don't have a way to get anything meaningful out of these resources. Many insurers today have already made the commitment to telematics. They've got the devices and they are gathering the data. Yet for now they are simply letting the gigabytes and terabytes and petabytes grow staggeringly high. They haven't found a way to conquer the problem of extracting useful and actionable information out of that data.

"Water, water, everywhere"

It's an intriguing problem, reminiscent of the old sailor's lament, "Water, water, everywhere, and not a drop to drink." Big data remains the promise of the future, but it turns out that traditional generalized linear model (GLM) techniques can be all but useless with data this varied and massive. With so many unique new variables in play, identifying and taking advantage of the most meaningful correlations becomes very difficult. In many cases, GLM techniques are simply unable to penetrate deeply into these giant stores. Even when they can, the time required to uncover the critical correlations tends to be onerous: days, weeks, and even months of analysis.

The potential benefits of telematics and UBI are apparent to everyone, which is part of what makes solving the problem so urgent. They hold the promise of opening up new marketing avenues in which low-risk drivers can be offered discounts and high-risk drivers can be more quickly and accurately identified and priced appropriately. Winnowing out those with risky driving habits from safer drivers is becoming more realistic with big data. Efficient, cost-effective plunges into these vast amounts of data should now be able to uncover a broad canvas of driving-behavior patterns with the highest correlations to lower-risk and higher-risk customers. These critical factors may be as obvious as raw mileage driven per year or average highway traveling speeds, or as subtle and nuanced as patterns of braking. More likely, the answer is some combination of all three, with many other factors involved as well.

Enter machine learning

But it remains crucially important for insurers to get at the most useful and actionable information, a maddeningly difficult and elusive task with traditional GLM. Enter machine learning, a developing computer science discipline focused on algorithms that help machines actually learn from data as they go. For auto insurers, machine learning holds the promise of enabling carriers to explore hundreds, if not thousands, of factors involved in calculating the potential risk of individual customers. Moving beyond GLM and introducing machine learning techniques with telematics data may enable insurers to leverage key competitive advantages.

UBI pricing necessarily supersedes traditional GLM methods because the complex interactions of the factors in play require machine learning to uncover them within a reasonable timeframe in order to be cost-effective. Pricing differences often cannot be fitted by GLM distributions. And correlations between telematics and non-telematics effects will tend to disturb the clarity of results in a single GLM. Distribution over different frequency and severity models only confuses the analysis of differences within telematics policies.

GLM techniques will thus always show how a business differs in terms of its dependencies on a limited set of specific factors, such as age, or how much mileage goes on a car, or other very high-level factors. But if a whole model is taken and everything modeled together to try to understand the risk for all of the policies, in the end there tends to be only a mixed bag of different effects. An insurer still hasn't been able to look very deep into its business.

Machine learning is designed to look for the differences inside of a single population, previously identified as a larger unit, e.g., the customer base. Essentially, UBI pricing is a next step, refining data and data patterns in ways that GLM cannot. Now we are looking for the subtle yet potentially most impactful differences. Where are these policyholders different, and where do these differences come from? With GLM techniques, we are already stuck, because by their very nature they don't allow us to see very deeply into the business.
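To make "looking for differences inside a single population" concrete, here is a minimal sketch of one common building block: a single regression-tree-style split search. The article does not say which algorithm is used, so the technique is an assumption chosen purely for illustration. The records, field order, and function names are hypothetical toy values, not real portfolio data.

```python
from statistics import mean

# Made-up toy records: (annual_km, highway_share, loss_ratio).
# Illustrative only; not taken from any real portfolio.
POLICIES = [
    (8_000, 0.7, 0.50),
    (12_000, 0.2, 0.60),
    (15_000, 0.6, 0.55),
    (18_000, 0.3, 0.65),
    (22_000, 0.5, 1.50),
    (26_000, 0.4, 1.60),
    (30_000, 0.8, 1.80),
]


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, feature_index, target_index=-1):
    """Return (threshold, total_sse) for the single threshold on one feature
    that leaves the two resulting groups as homogeneous as possible."""
    values = sorted({row[feature_index] for row in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # candidate midway between neighbours
        left = [r[target_index] for r in rows if r[feature_index] <= threshold]
        right = [r[target_index] for r in rows if r[feature_index] > threshold]
        score = sse(left) + sse(right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


if __name__ == "__main__":
    print("annual_km split:", best_split(POLICIES, 0))
    print("highway_share split:", best_split(POLICIES, 1))
```

On this toy data the mileage split lands at 20,000 km and separates the loss ratios far better than any split on highway share. A real tree-based method would repeat the search recursively within each group and across many more variables.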

A tale of not so identical twins

The way we have begun using machine learning to approach UBI with telematics is to first build the risk model, the claims cost model, in order to create a baseline rate. Then, relative to those rates, we look for the differences among individuals within that population.
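The two-step approach described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the field names (`segment`, `baseline_premium`, `claims`), the segment labels, and all amounts are hypothetical. The baseline premium is simply taken as given, standing in for the output of the claims cost model.

```python
from collections import defaultdict

# Hypothetical toy records. "baseline_premium" stands in for the rate produced
# by the baseline claims cost model; "segment" is a telematics-derived label.
POLICIES = [
    {"driver": "A1", "segment": "fixed commute", "baseline_premium": 1000.0, "claims": 400.0},
    {"driver": "A2", "segment": "fixed commute", "baseline_premium": 1000.0, "claims": 600.0},
    {"driver": "B1", "segment": "varying routes", "baseline_premium": 1000.0, "claims": 1200.0},
    {"driver": "B2", "segment": "varying routes", "baseline_premium": 1000.0, "claims": 1800.0},
]


def loss_ratio_by_segment(policies):
    """Aggregate claims divided by baseline premium for each segment.

    Assumes every segment has a positive total baseline premium.
    """
    claims = defaultdict(float)
    premium = defaultdict(float)
    for p in policies:
        claims[p["segment"]] += p["claims"]
        premium[p["segment"]] += p["baseline_premium"]
    return {seg: claims[seg] / premium[seg] for seg in premium}


if __name__ == "__main__":
    for seg, ratio in loss_ratio_by_segment(POLICIES).items():
        print(f"{seg}: {ratio:.0%}")
```

Every policy here carries the same baseline premium, mirroring drivers who look identical to the baseline model. Any spread in the resulting loss ratios therefore comes from behavior the baseline does not capture.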

This is where, for example, we might find a pair of identical twins—exactly the same age, with identical high-performance cars, perhaps even married to wives who are themselves identical twins. Everything appears to be exactly the same. But with UBI data, as illustrated in the table in Figure 1, we could well find that one twin is a far greater risk than the other, even though they appear to be completely the same in terms of the original traditional GLM rate-setting.

Figure 1: Loss Ratios and Associated Factors

Loss Ratio   Volume   Associated Factors
51%          6.25%    Moderate mileage; mostly highway driving; powerful vehicles
66%          5.87%    Moderate mileage; mostly country lanes; powerful vehicles; specific district excluded; small number of inhabitants
63%          13.33%   Moderate mileage; mostly highway driving
152%         19.80%   High mileage; yearly kilometers driven are greater than 20,000
157%         5.09%    High mileage; yearly kilometers driven are greater than 20,000; multiple long-distance trips; powerful vehicles
184%         5.35%    High mileage; yearly kilometers driven are greater than 20,000; few long-distance trips

The twins themselves may not even be particularly aware of any differences in their driving habits. But machine learning can quickly help to reveal them. "Good twin" Adam drives his Toyota SUV to his job in the city, commuting on the same route every day. "Evil twin" Bart works as a seasonal construction worker who is required to drive great distances to different construction sites. He often stays at these sites during the week and makes the long drive home only for weekends. As a result, Bart may actually put the same amount of mileage on his car in a year as Adam, again emphasizing their similarities. Mileage over 20,000 kilometers per year is highly correlated with poor loss ratios, as shown in Figure 1, and raw mileage tends to continue to be a primary indicator for most carriers.

But raw mileage driven per year isn't the sole difference separating the risk profiles of Adam and Bart. It may not even be the most significant. Other factors uncovered by machine learning could be at least as relevant. For example, Bart's driving routes change all the time, which suggests much different levels of driving risk and comfort than Adam experiences in his usual daily commute. Putting all the factors together soon paints a picture of Adam and Bart as not really all that identical.

This information would not have been recovered efficiently, if at all, using traditional GLM methods, which often do not even consider some of the very factors that machine learning has shown to be critical variables. If Adam is sticking to the country lanes for Sunday drives and traveling the same route every day for his commute, and Bart is roaming all over creation in his Toyota SUV, they represent significantly different risks, even if their raw mileage is very similar, and should thus be priced accordingly.

Moving ahead with machine learning and UBI

These new machine learning techniques are useful in another way for insurers, helping them identify which variables are highly significant and which are less so. For example, some insurance companies tend to treat certain powerful types of vehicles as high-risk factors in and of themselves. But in terms of real-world results, as reflected in the analysis of the data via machine learning, it appears more likely that the effect of vehicle power is actually split. In our example of the twins above, it may not matter that both Adam and Bart are driving powerful vehicles. Powerful vehicles are correlated with both good and poor loss ratios, and therefore represent an overall neutral factor for pricing analysis.
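The point about powerful vehicles can be checked directly against the rows of Figure 1. The short sketch below transcribes the table (loss ratio and volume in percent) and lists the loss ratios of every segment that carries a given factor. The data structure and the function name are illustrative assumptions; only the figures come from the table.

```python
# Rows of Figure 1: (loss_ratio_pct, volume_pct, associated factors).
FIGURE_1 = [
    (51, 6.25, {"moderate mileage", "mostly highway driving", "powerful vehicles"}),
    (66, 5.87, {"moderate mileage", "mostly country lanes", "powerful vehicles",
                "specific district excluded", "small number of inhabitants"}),
    (63, 13.33, {"moderate mileage", "mostly highway driving"}),
    (152, 19.80, {"high mileage", "yearly km > 20,000"}),
    (157, 5.09, {"high mileage", "yearly km > 20,000",
                 "multiple long-distance trips", "powerful vehicles"}),
    (184, 5.35, {"high mileage", "yearly km > 20,000", "few long-distance trips"}),
]


def loss_ratios_with(factor, segments=FIGURE_1):
    """Sorted loss ratios (in percent) of all segments that list the factor."""
    return sorted(lr for lr, _volume, factors in segments if factor in factors)


if __name__ == "__main__":
    print(loss_ratios_with("powerful vehicles"))  # [51, 66, 157]
    print(loss_ratios_with("high mileage"))       # [152, 157, 184]
    print(loss_ratios_with("moderate mileage"))   # [51, 63, 66]
```

Powerful vehicles show up at both ends of the range, whereas high mileage appears only in segments with loss ratios above 150%. This is a simple tabulation of the published segments, not a statistical test of significance.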

It appears clear now that big data is about to change the way many people do business in areas such as auto insurance, and for persuasive reasons. Knowledge is still power, as the old saying goes, and time is still money, which means it's more important than ever for insurers and others to reach a better understanding of exactly what they know and don't know about their customers, and the sooner the better. It's great to set up telematics devices and gather data, but that data is virtually useless if actionable business direction can't be drawn from it. Machine learning techniques are showing an exciting new way forward in the use of big data for insurers.

[1] PRNewswire (October 21, 2014). Insight Report: Technology in Action - A Roadmap for Insurance Telematics. Retrieved April 29, 2015, from

[2] Ptolemus Consulting Group. UBI Global Study 2013. Retrieved April 29, 2015, via (registration and/or purchase required).