Transportation planning decisions are usually made based on data that is collected from a variety of sources, such as traffic counts, surveys, and census data.
“The transport sector has a long history of using data to formulate travel plans and strategies,” confirms Peter Lindgren, head of digitisation of transport at TRL. “Data is collected at various points in the transport network, for example at induction loops in the ground, which collect data on the vehicles that drive over them. Transport planners combine this data with other datasets such as the Census data. The transport industry has had to piece together these fragmented datasets for a long time. People have created strategies to deal with that, but the data is still flawed.
“We are now seeing new datasets that can help plug into existing data, for example mobile networks, which provide data on the movements of people. From this data you can derive insights into where someone might have started and ended a journey, and even the modes of transport used.”
Beyond transport planning, this data can also be used to train machine learning systems. These systems use algorithms and statistical models to analyse and draw inferences from patterns in data. Most algorithms in transport AI are currently based on what’s known as ‘supervised learning’, in which the model is trained using data that has already been labelled in some way.
“Computer vision algorithms are a good example of this,” says Lindgren. “The machine is shown millions of images and videos that have already been classified by a human and eventually it learns to classify the objects itself. In automated vehicles, these algorithms are being used to identify objects such as pedestrians or other vehicles.”
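As a rough illustration of supervised learning, the sketch below trains a very simple nearest-centroid classifier on a handful of labelled examples and then classifies a new one. The two features (object height in metres and speed in metres per second), the labels and every value are invented for illustration; real computer vision systems learn from images and use far larger models.

```python
from math import dist
from statistics import fmean

# Hypothetical labelled training data: (height_m, speed_mps) -> label.
# All values are invented for illustration only.
TRAINING = [
    ((1.7, 1.4), "pedestrian"),
    ((1.6, 1.2), "pedestrian"),
    ((1.8, 1.6), "pedestrian"),
    ((1.5, 13.0), "vehicle"),
    ((1.4, 11.0), "vehicle"),
    ((2.0, 15.0), "vehicle"),
]


def train(examples):
    """Learn one centroid (mean feature vector) per label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(fmean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def classify(centroids, features):
    """Assign the label whose centroid is closest to the new example."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


centroids = train(TRAINING)
prediction = classify(centroids, (1.7, 1.3))  # a new, unlabelled example
```

The model only knows what the labels tell it: if the human-supplied labels were wrong or one-sided, the centroids, and every later prediction, would inherit that.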
“Lots of models in machine learning and statistics study data to make certain conclusions,” adds Mark Bell, senior statistician, TRL. “We call it training data because the model is looking at the data and trying to work out what relationships are present in this particular data set. There's a myth that because datasets are getting bigger, bias is becoming less important. It's actually the opposite. If anything, bias is becoming more important. If the training data has biases in it, the model and the results from the model will carry that bias forward.”
With machine learning, the quality and accuracy of the model are determined by the quality of the data that is fed into it. “If the data hasn't been classified correctly or is weighted towards a particular demographic, the model won't know that,” says Lindgren. “It will create its inferences based on the data that it is fed. Therefore, it might not be accurate and it might not be fair. Also, these models are constantly learning, so even when actions are taken to minimise these biases, the models are susceptible to becoming biased later on. This is known as model drift.”
Types of data bias
There are several different types of bias that can impact the quality of a machine learning algorithm. Sampling bias is where training data is not representative of the population or target sample. So, for example, if the training data only includes data from mobile phones, that’s only sampling people who own mobile phones.
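One simple way to check for sampling bias is to compare each group's share of the sample with its share of the population. The sketch below does that and derives post-stratification weights that rebalance the sample. The age groups, population shares and sample counts are all hypothetical, chosen only to show the arithmetic.

```python
# Hypothetical population shares by age group (invented for illustration).
POPULATION_SHARE = {"16-34": 0.30, "35-64": 0.45, "65+": 0.25}

# Hypothetical sample: number of mobile-phone traces observed per age group.
SAMPLE_COUNTS = {"16-34": 500, "35-64": 400, "65+": 100}


def sample_shares(counts):
    """Convert raw counts per group into shares of the whole sample."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def representation_ratio(counts, population_share):
    """Sample share divided by population share; 1.0 means representative."""
    shares = sample_shares(counts)
    return {g: shares[g] / population_share[g] for g in population_share}


def poststratification_weights(counts, population_share):
    """Per-record weight so weighted group shares match the population."""
    shares = sample_shares(counts)
    return {g: population_share[g] / shares[g] for g in population_share}


ratios = representation_ratio(SAMPLE_COUNTS, POPULATION_SHARE)
weights = poststratification_weights(SAMPLE_COUNTS, POPULATION_SHARE)
```

In this invented sample the oldest group is under-represented, so each of its records is weighted up. Reweighting can only rebalance groups that appear in the data; it cannot recover people who are absent altogether, such as those who do not own a mobile phone.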
Temporal bias is where the training data is skewed by a particular time period. “The census data is a great example of this,” says Lindgren. “It's only collected once every 10 years and the last census was in 2021, when we were in the middle of the pandemic. The key question within the census is ‘how do you normally travel to work?’ So, of course, that data set is only going to reflect how people were travelling - if they were travelling at all - when Coronavirus was preventing people from moving around.”
Other biases include measurement bias, where the method of measuring may not be representative, and selection bias, where the training data is not randomly selected, or is biased towards a certain type of data. Geographical bias is also important, where data is skewed towards one particular region, as is demographic bias, where the data doesn't reflect the demographics of the population. Demographic bias includes gender bias, which Bell says is inherent in the transportation industry. (For more on gender bias in transport, don’t miss Intertraffic World 2024.)
“If datasets are biased towards one group or another, they may not meet the needs of other groups,” says Lindgren. “If we take journey planning for example, we know that women often feel safer on certain routes, and they'll choose different routes to men. If the systems we're creating don't take these things into account, then the outputs that they suggest may not be suitable for some groups. Therefore, those groups may not adopt those technologies or may have more resistance to them. Transportation is an enabler in society, so we need to make sure that the outputs of these models and the technologies we are producing are serving the whole of society.”
Mitigating data bias
It is important to be aware of the potential for bias in machine learning models, and to take steps to mitigate it. “It all comes down to ensuring the quality of the data and the approach,” says Lindgren. “It’s about making sure that there aren't biases in the data, or that biases are being addressed. And that the people involved aren't willingly — or otherwise — imposing biases in the system through their own subconscious bias.”
There are several ways to do this. The first is to use a diverse training dataset. This will help to ensure that the model is exposed to a wide range of features and experiences. In transportation, the data is not always available, so to address holes in data sets, Lindgren suggests using data augmentation.
This is where you take one real-world example and use other processes to create many other plausible examples. “You also need to make sure that you're processing the data correctly before feeding it into your machine, that you're cleaning it correctly, and that you're using the right pre-processing techniques,” he says. For more on data augmentation methods, don’t miss Intertraffic World 2024.
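To make the idea of data augmentation concrete, the sketch below takes a single image and produces several plausible variants by mirroring it and shifting its brightness. It is a minimal illustration, not the method used by TRL: the image is a tiny grid of invented greyscale values, and the function names and brightness steps are assumptions.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def shift_brightness(image, delta):
    """Add delta to every pixel, clamped to the 0-255 greyscale range."""
    return [[max(0, min(255, pixel + delta)) for pixel in row] for row in image]


def augment(image, deltas=(-40, 0, 40)):
    """Return variants of one image: each brightness level, unflipped and flipped.

    With delta 0 the unflipped variant is a copy of the original image.
    """
    variants = []
    for delta in deltas:
        adjusted = shift_brightness(image, delta)
        variants.append(adjusted)
        variants.append(flip_horizontal(adjusted))
    return variants


# Hypothetical 2x3 greyscale image (invented pixel values).
original = [
    [10, 120, 250],
    [30, 140, 200],
]
augmented = augment(original)  # six variants from one real example
```

Augmentation of this kind adds variety in lighting and orientation, but it cannot add people or situations that were never captured in the first place.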
“It is also important to use fair evaluation metrics,” adds Bell. “This will help to ensure that the model is not being biased towards certain groups of people. A vital part of this is making sure that you have a lot of diversity within the group of people working on these models to make sure biases don't creep in that way.”
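One possible reading of a fair evaluation metric is to report accuracy separately for each group rather than as a single overall figure. The sketch below does this for two hypothetical groups, "A" and "B", using invented evaluation records; the gap between the best and worst group is one simple signal, not a complete fairness assessment.

```python
from collections import defaultdict

# Hypothetical evaluation records: (group, true_label, predicted_label).
# Groups and labels are invented for illustration only.
RECORDS = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 0),
    ("B", 1, 0), ("B", 0, 0), ("B", 1, 1), ("B", 1, 0),
]


def accuracy_by_group(records):
    """Fraction of correct predictions, computed separately for each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, predicted in records:
        total[group] += 1
        correct[group] += int(truth == predicted)
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(per_group):
    """Difference between the best-served and worst-served group."""
    return max(per_group.values()) - min(per_group.values())


overall = sum(truth == predicted for _, truth, predicted in RECORDS) / len(RECORDS)
per_group = accuracy_by_group(RECORDS)
gap = accuracy_gap(per_group)
```

Here the overall accuracy looks respectable, yet the model is right every time for group A and only half the time for group B, which a single headline figure would hide.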
It is essential to be transparent about the model's development process and to monitor the model's performance over time. “It is important to have regular reviews and audits to help identify biases and ensure the model is performing as intended,” says Lindgren. “Even once a model has been trained and has been put into production, it's constantly learning and so can change its inferences and we need to make sure then that biases don't creep in. Consistently applying the best practices, as they evolve, is important as well.”
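A regular audit could include a check on whether the data a model sees in production still resembles the data it was trained on. The sketch below uses the population stability index (PSI), a common drift measure, on hypothetical counts of trips by mode. The counts are invented, and the 0.25 alert level is a widely quoted rule of thumb treated here as an assumption, not a standard.

```python
from math import log


def shares(counts, categories, floor=1e-6):
    """Share of each category, floored so empty categories do not break log()."""
    total = sum(counts.get(c, 0) for c in categories)
    return {c: max(counts.get(c, 0) / total, floor) for c in categories}


def population_stability_index(expected_counts, actual_counts):
    """PSI = sum over categories of (actual - expected) * ln(actual / expected)."""
    categories = sorted(set(expected_counts) | set(actual_counts))
    expected = shares(expected_counts, categories)
    actual = shares(actual_counts, categories)
    return sum(
        (actual[c] - expected[c]) * log(actual[c] / expected[c])
        for c in categories
    )


# Hypothetical trip counts by mode (invented for illustration).
training = {"car": 600, "bus": 250, "cycle": 50, "walk": 100}
production = {"car": 400, "bus": 200, "cycle": 200, "walk": 200}

ALERT_LEVEL = 0.25  # assumed rule-of-thumb threshold, not a standard

psi_unchanged = population_stability_index(training, training)
psi_shifted = population_stability_index(training, production)
needs_review = psi_shifted > ALERT_LEVEL
```

A high PSI does not prove the model has become biased; it flags that the inputs have moved and that a human review of the model's outputs is due.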
Bell believes it's important to note the difference between modern machine learning algorithms and traditional statistical modelling techniques, particularly with regard to transparency.
“A statistical model will usually want to understand the relationships that you have,” he says. “For example, if I was asked to model the relationship between the volume of traffic on the roads and the number of KSIs [people killed or seriously injured], the model would be very explainable – if we increase the traffic flow this much, we would expect this many more KSIs. You get a concrete and explainable output from the model in terms of that relationship. You can also validate the model using historical data.
“With machine learning, there's less of a focus on quantifying relationships. Machine learning models are mainly judged on their predictive accuracy. To assess how good that algorithm is, it will be heavily trained and then tested on previously unseen data, and if it performs well on the unseen data, that will be taken as a success.”
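The contrast Bell draws can be sketched with a straight-line model of KSIs against traffic flow. The fitted slope is the explainable part (how many more KSIs to expect per unit of extra traffic), while the error on held-out data is the kind of predictive score by which machine learning models are mainly judged. All figures below are invented for illustration and do not reflect real casualty data.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(slope, intercept, xs, ys):
    """Average absolute gap between predicted and observed values."""
    return fmean(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys))


# Invented data: traffic flow (thousand vehicles per day) and annual KSIs.
train_flow = [10, 20, 30, 40, 50]
train_ksi = [12, 21, 33, 41, 52]

# Invented held-out data the model has not seen during fitting.
test_flow = [15, 35, 45]
test_ksi = [17, 36, 47]

slope, intercept = fit_line(train_flow, train_ksi)
holdout_error = mean_absolute_error(slope, intercept, test_flow, test_ksi)
```

With these made-up numbers the slope says that each extra thousand vehicles per day is associated with about one more KSI per year, a statement a planner can inspect and challenge. The held-out error only says how close the predictions were, not why.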
One of the challenges with machine learning algorithms is that it is often difficult to know why the model has behaved in a certain way. “They are known as ‘black box’ algorithms,” says Bell. “If we try to look under the hood to explain what's going on, it's not easy. We know that the model has identified an object as a pedestrian, but we may not know why it's saying that. Machine learning can vastly outperform classic statistical approaches. It’s an approach that makes sense in lots of ways. But because you can't explain what's going on under the hood, that often makes you less able to quantify and understand any biases that the model may have.”
“It’s important that we try to understand what the models are doing, and are able to explain and assess this,” adds Lindgren. “At the moment engineers write the code, but the models train themselves and create their own connections, which are opaque to everyone else, even to the engineers.”
TRL believes there is a need to implement a set of standards in the AI and machine learning industry, to overcome data bias. “A set of standards would allow everyone to benefit from best practice,” says Lindgren. “It would mean that entrants to the market have a starting point. It would also help to prevent organisations going down paths that appear to be generating results, but end up being wrong due to biases that they didn’t know about.”
“A set of guidelines is important now and will become increasingly important as datasets grow and we rely on them more and more,” adds Bell. “All of these machine learning algorithms are becoming a bigger part of our decision-making processes, so bias will become an even more important factor going forward. This increases the need for a rigorous set of guidelines to be in place.”