With the recent boom in awareness of artificial intelligence (AI) and its use cases, the conversation about data quality remains as important as ever. It’s easy to be awestruck by many of these new AI systems and to forget how the data used to train them shapes their performance and limits their usefulness in real-world scenarios.
With AI, you get out what you put in. After all, AI and machine learning (ML) models are just really powerful extrapolation machines. They “understand” patterns in the data you train them on, then identify those same patterns in new data that resembles what they’ve seen before.
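To make “you get out what you put in” concrete, here is a minimal, hypothetical sketch (using scikit-learn, not any particular vendor’s platform) that trains the same model twice: once on clean labels and once on a copy where a fraction of the labels has been corrupted. The gap in test accuracy illustrates how degraded input quality directly degrades output quality.

```python
# Minimal sketch: identical model, identical features, but one training
# set has 30% of its labels corrupted. The noisy model scores worse on
# the same held-out test set -- garbage in, garbage out.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Corrupt 30% of the training labels to simulate poor data quality.
rng = np.random.default_rng(0)
noisy = y_train.copy()
flip = rng.random(len(noisy)) < 0.30
noisy[flip] = 1 - noisy[flip]

clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
noisy_model = LogisticRegression(max_iter=1000).fit(X_train, noisy)

print("clean labels:", clean_model.score(X_test, y_test))
print("noisy labels:", noisy_model.score(X_test, y_test))
```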
As with athletes and their nutrition, when the quality of the input drops, the quality of the output (in this case, performance) drops with it, no matter how talented the athlete might be. If an athlete eats only fast food for a week, their mile time will increase and they will be more likely to sustain injuries. Likewise, companies cannot successfully implement AI without high-quality data.
Data quality issues and bias
To understand how data quality impacts AI utility, let’s look at attempts to automate the hiring process. A few years ago, Amazon tried to streamline recruitment using AI. It built an experimental AI system that took in hundreds of resumes and spit out the best few candidates, allowing human recruiters to skip the menial work of filtering out underperforming resumes and quickly pick the best ones. Sounds like real automation, right?
Not quite. Amazon trained the underlying model on the resumes the company had received in the past, most of which came from male candidates. Because more men had been hired at Amazon, particularly in areas such as software development, the system associated the language and attributes prevalent in male resumes with being a better candidate. Conversely, with less training data from female candidates, the system tended to demote or misinterpret the attributes of strong female candidates.
This is a clear example of poor data quality leading to a biased AI system. Other hiring, bank lending, and mortgage origination systems have likewise been flagged for bias against certain demographics. These systems were intended to add tremendous value to the organizations that use them – and to make the loan or mortgage process easier for consumers – but the quality of their training data has prevented widespread adoption.
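To see the mechanism concretely, consider a hedged, entirely synthetic sketch. The data below is invented for illustration and has no relation to Amazon’s actual system; it simply shows that if historical outcomes correlate with gender, a model trained on those outcomes learns the correlation and reproduces it.

```python
# Illustrative only: invented data showing how a model trained on
# skewed historical outcomes reproduces the skew in its predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 5000
gender = rng.integers(0, 2, n)        # 0 = male, 1 = female
skill = rng.normal(0, 1, n)           # true, gender-neutral ability

# Historical decisions favored men regardless of skill:
hired = (skill + 0.8 * (gender == 0) + rng.normal(0, 1, n)) > 0.5

# The model sees gender (or a proxy for it) among its features.
X = np.column_stack([skill, gender])
model = LogisticRegression().fit(X, hired)

for g, name in [(0, "male"), (1, "female")]:
    mask = gender == g
    print(f"predicted hire rate, {name}: {model.predict(X[mask]).mean():.2f}")

# Equal skill distributions, unequal predictions: the bias in the
# historical labels has been learned, not corrected.
```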
Data Profiling: How H2O.ai Cleans and Verifies Data
H2O.ai, one of the top 10 companies on the Acceleration Economy AI/hyperautomation shortlist, is a strong proponent of “data profiling,” or cleaning and verifying data from existing sources. This means H2O.ai applies quality control not only to its own AI Cloud platform but also to the data fed into it.
Data profiling can take many forms. Even before you get close to the dataset, it’s important to examine your own assumptions: what are you testing, which data points indicate the desired state, how can data points best be labeled, and how can datasets be filtered or expanded to better cover edge cases? Then you can leverage platforms like H2O.ai’s that use machine learning to filter and fix bad data points. A preliminary analysis can surface weaknesses in the data or uncover unwanted biases.
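As a concrete starting point, a preliminary profiling pass can be as simple as a few pandas checks: missing values, duplicates, and how the target label is distributed across a sensitive attribute. This is a hypothetical sketch – the file and column names are invented – not a description of H2O.ai’s tooling.

```python
# A minimal, hypothetical profiling pass over a tabular dataset.
# File and column names ("applicants.csv", "gender", "hired") are
# invented for illustration; swap in your own data source.
import pandas as pd

df = pd.read_csv("applicants.csv")

print(df.isna().sum())          # missing values per column
print(df.duplicated().sum())    # exact duplicate rows

# Label balance overall and per group: a large gap here is an early
# warning that a model trained on this data may inherit a bias.
print(df["hired"].mean())
print(df.groupby("gender")["hired"].mean())

# Rough outlier check on numeric columns via summary statistics.
print(df.describe())
```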
How “synthetic” data increases data quality and data volume
These post-hoc data sanitization tactics aren’t always enough. Sometimes you simply need to collect better data from the start, which can be an expensive and difficult proposition. That’s why companies are turning to synthetic data to train their AI systems. Unlike real-world data, which must be captured, synthetic data is created by algorithms or even by generative AI models.
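In its simplest form, generating synthetic tabular data can mean fitting distributions to the real columns and sampling new rows from them. The sketch below (assumed file name, and a deliberately naive column-independence assumption) shows only the core idea; production generators, including GANs and other generative models, also preserve correlations between columns.

```python
# Naive synthetic-data sketch: sample each numeric column from a
# distribution fitted to the real data. Real generators also preserve
# correlations; this version assumes independence purely for illustration.
import numpy as np
import pandas as pd

real = pd.read_csv("loans.csv")   # assumed file with numeric columns
rng = np.random.default_rng(7)
n_synthetic = 10_000

synthetic = pd.DataFrame({
    col: rng.normal(real[col].mean(), real[col].std(), n_synthetic)
    for col in real.select_dtypes("number").columns
})
print(synthetic.head())
```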
For example, companies building the AI systems behind self-driving cars require vast amounts of visual data about driving situations. This visual data – covering every angle, lighting condition, and weather condition a car might encounter – is not only tedious to capture and store, but often includes faces and license plates that are subject to a variety of privacy regulations. To sidestep this, researchers at MIT developed an algorithm that creates video clips from 3D models of roadside objects, people, and traffic situations. Models trained on these synthetic video clips actually performed better than models trained on real driving footage.
This is partly because synthetic datasets can include more of the rare situations that seldom occur in reality, helping the model handle unusual challenges. You can also avoid real-world biases or spurious correlations that might otherwise shape the final model, such as the granting of smaller loans to people of color.
Synthetic data lets us build models that push the world in the direction we want it to go. This can be used for good or ill. For example, if we want to increase economic mobility in Region X, we could ensure that our training set includes people in Region X who receive larger loans. Of course, the same exercise can also be run in reverse.
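A hedged sketch of that idea (invented file and column names again): append synthetic records of larger approved loans in Region X, so the training set reflects the outcome we want the model to learn rather than only the historical pattern.

```python
# Illustrative only: deliberately augment a loan dataset with synthetic
# examples of larger approved loans in "Region X" so a trained model is
# not anchored to the smaller historical amounts for that region.
import numpy as np
import pandas as pd

real = pd.read_csv("loans.csv")   # assumed columns: region, amount, approved
rng = np.random.default_rng(1)

region_x = real[real["region"] == "X"]
augmented = region_x.sample(1000, replace=True, random_state=1).copy()
augmented["amount"] = rng.normal(
    real["amount"].mean() * 1.5,  # target: larger loans than history shows
    real["amount"].std(),
    len(augmented),
)
augmented["approved"] = 1         # assumed 0/1 approval label

training_set = pd.concat([real, augmented], ignore_index=True)
```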
Conclusion
To reiterate: AI is just an extrapolation machine. Its results often merely reflect how things were identified and handled in the past, and many of our human prejudices carry over.
But it doesn’t have to be that way. Through focused efforts to improve data quality and build strategic training datasets, we can create AI systems that push the world in the direction we want it to go. These systems will improve the way we live and add value to the organizations that build them.