Why data remains the greatest challenge for machine learning projects

To further strengthen our commitment to providing industry-leading data technology coverage, VentureBeat is pleased to welcome Andrew Brust and Tony Baer as regular contributors. Watch out for their articles in the Data Pipeline.

Quality data is at the heart of enterprise artificial intelligence (AI) success. And accordingly, it remains the main challenge for companies that want to use machine learning (ML) in their applications and operations.

The industry has made impressive strides in helping organizations overcome the barriers of sourcing and preparing their data, according to Appen’s recent State of AI Report. But there is still work to be done at various levels, including organizational structure and company policies.

The costs of data

The enterprise AI lifecycle can be divided into four phases: data acquisition, data preparation, model testing and deployment, and model evaluation.

Advances in computing and ML tools have helped automate and speed up tasks such as training and testing different ML models. Cloud computing platforms make it possible to train and test dozens of different models of different sizes and structures at the same time. But as machine learning models increase in number and size, they will require more training data.


Unfortunately, obtaining and annotating training data still requires significant manual effort and is largely application-specific. According to Appen’s report, “there’s a lack of data for a given use case, new machine learning techniques that require larger amounts of data, or teams don’t have the right processes in place to get to the data they need easily and efficiently.”

“Accurate model performance requires high-quality training data; and large, inclusive datasets are expensive,” Sujatha Sagiraju, Appen’s chief product officer, told VentureBeat. “However, it is important to note that valuable AI data can increase the chances that your project will move from pilot to production, so the expenses are required.”

ML teams can start with pre-labeled datasets, but will eventually need to collect and label their own custom data to scale their efforts. Depending on the application, labeling can become very expensive and labor-intensive.

In many cases, companies have enough data but cannot deal with quality issues. Biased, mislabeled, inconsistent, or incomplete data reduces the quality of ML models, which in turn hurts the ROI of AI initiatives.
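As a rough illustration of the quality issues listed above, the following Python sketch audits a small labeled dataset for missing labels, identical inputs that carry conflicting labels, and class imbalance. The record format, the function name `audit_labels`, and the sample data are assumptions made for this example; they do not come from Appen’s report.

```python
from collections import Counter, defaultdict


def audit_labels(records):
    """Summarize basic quality problems in a list of (text, label) pairs.

    A label of None marks an example that was never annotated.
    """
    missing = sum(1 for _, label in records if label is None)

    # Group the labels seen for each distinct input to spot disagreements.
    labels_by_input = defaultdict(set)
    for text, label in records:
        if label is not None:
            labels_by_input[text].add(label)
    conflicting = sorted(
        text for text, labels in labels_by_input.items() if len(labels) > 1
    )

    # The share of each class among labeled examples hints at possible imbalance.
    counts = Counter(label for _, label in records if label is not None)
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()} if total else {}

    return {"missing": missing, "conflicting": conflicting, "class_shares": shares}


# Hypothetical sample data, for illustration only.
records = [
    ("great product", "positive"),
    ("great product", "negative"),  # same input, different label
    ("arrived late", "negative"),
    ("works fine", None),           # never annotated
    ("love it", "positive"),
]
report = audit_labels(records)
print(report)
```

A check like this only catches mechanical problems. Whether a label is actually wrong, or whether a dataset underrepresents a group of users, still calls for human judgment.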

“If you train ML models with bad data, the model predictions become inaccurate,” Sagiraju said. “To ensure their AI performs well in real-world scenarios, teams need to have a mix of high-quality datasets, synthetic data, and human-in-the-loop evaluation in their training kit.”

The gap between data scientists and business leaders

According to Appen, company leaders are far less likely than technical staff to view data acquisition and preparation as the main challenges of their AI initiatives. “There are still gaps between technologists and business leaders when it comes to understanding the key bottlenecks in implementing data for the AI lifecycle. This leads to a misalignment of priorities and budget within the organization,” according to the Appen report.

“What we do know is that some of the biggest bottlenecks for AI initiatives lie in the lack of technical resources and executive buy-in,” Sagiraju said. “If you look at these categories, you can see that the data scientists, machine learning engineers, software developers, and executives are spread across different areas, so it’s not hard to imagine a lack of aligned strategy due to conflicting priorities between the different teams within the organization.”

The diversity of people and roles involved in AI initiatives makes achieving this alignment difficult. From the developers who manage the data, to the data scientists who tackle specific problems, to the executives who make strategic business decisions, everyone has different goals in mind and therefore different priorities and budgets.

However, Sagiraju sees the gap in understanding the challenges of AI slowly narrowing year after year, as organizations increasingly recognize the importance of quality data to the success of AI initiatives.

“The emphasis on the importance of data — particularly high-quality data consistent with use cases — for the success of an AI model has brought teams together to solve these challenges,” Sagiraju said.

Data challenges are not new in the realm of applied ML. However, as ML models grow in size and data becomes more readily available, scalable solutions must be found to assemble high-quality training data.

Fortunately, a few trends are helping organizations overcome some of these challenges, and Appen’s AI report shows that the average time spent managing and preparing data is trending down.

One example is automated labeling. Object detection models, for instance, require specifying the bounding box of each object in the training samples, which takes significant manual effort. Automated and semi-automated labeling tools use a deep learning model to process the training samples and predict the bounding boxes. The automated labels aren’t perfect, and a human labeler needs to check and adjust them, but they speed up the process significantly. In addition, the automated labeling system can be further trained and improved as it receives feedback from human labelers.
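The semi-automated workflow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular labeling tool works: the prediction format, the 0.8 confidence threshold, and the function names `split_for_review` and `iou` are all hypothetical. Proposals are split into a quick-check queue and a careful-review queue, and intersection over union (IoU) measures how far a human correction moved a proposed box, which is the kind of feedback that could be used to retrain the labeling model.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def split_for_review(predictions, threshold=0.8):
    """Split model proposals into a quick-check queue and a careful-review queue."""
    quick, careful = [], []
    for p in predictions:
        (quick if p["score"] >= threshold else careful).append(p)
    return quick, careful


# Hypothetical model output: one confident proposal and one uncertain one.
predictions = [
    {"image": "img_001", "box": (10, 10, 50, 50), "score": 0.95},
    {"image": "img_002", "box": (0, 0, 20, 20), "score": 0.40},
]
quick, careful = split_for_review(predictions)

# A human adjusts the uncertain box; the overlap shows how far off the model was.
human_box = (0, 0, 20, 40)
overlap = iou(careful[0]["box"], human_box)
print(len(quick), len(careful), overlap)
```

Both queues still go to a person, in line with the point that a human labeler needs to check the automated labels. The threshold only decides how much attention each proposal gets.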

“While many teams are beginning to manually label their records, more teams are turning to time-saving methods to partially automate the process,” Sagiraju said.

At the same time, there is a growing market for synthetic data. Businesses use artificially generated data to complement the data they collect from the real world. Synthetic data is particularly useful in applications where obtaining data from the real world is costly or dangerous. One example is makers of self-driving cars, who face regulatory, safety, and legal challenges in sourcing data from real roads.

“Self-driving cars need incredible amounts of data to be safe and prepared for anything once they hit the road, but some of the more complex data isn’t readily available,” Sagiraju said. “Synthetic data allows practitioners to consider edge cases or dangerous scenarios such as accidents, crossing pedestrians and emergency vehicles to effectively train their AI models. Synthetic data can create instances to train data when there is not enough human-derived data. It is crucial to fill in the gaps.”

At the same time, the evolution of the MLOps market is helping companies address many machine learning pipeline challenges, including labeling and versioning datasets; training, testing, and comparing different ML models; deploying models at scale and tracking their performance; and collecting new data and updating the models over time.
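To make the idea of dataset versioning concrete, here is a minimal Python sketch that derives a version fingerprint from a dataset’s contents, so that any change to the examples or their labels produces a new version. It illustrates the concept only and is not how any specific MLOps product does it; the function name `dataset_version`, the example format, and the 12-character truncation are assumptions.

```python
import hashlib
import json


def dataset_version(examples):
    """Return a short fingerprint that changes whenever data or labels change."""
    # Serialize each example deterministically, then sort so order does not matter.
    lines = sorted(json.dumps(ex, sort_keys=True) for ex in examples)
    digest = hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
    return digest[:12]


# Hypothetical examples: relabeling a single item yields a different version.
v1 = dataset_version([{"text": "arrived late", "label": "negative"}])
v2 = dataset_version([{"text": "arrived late", "label": "positive"}])
print(v1 != v2)
```

Recording a fingerprint like this next to each trained model makes it possible to tell which version of the data a given model was trained on when comparing results later.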

However, as ML plays an increasingly important role in companies, human control becomes more important.

“Human-in-the-Loop (HITL) assessments are essential to provide accurate, relevant information and avoid bias,” Sagiraju said. “Despite what many believe about humans taking a backseat when it comes to AI training, I think we will see a trend toward more HITL assessments to empower responsible AI and have more transparency into what organizations are doing in building their models to ensure models perform well in the real world.”
