
Enhancing your company’s data with LLM


How to maximize your data utility with LLMs and deal with data-related issues when building ML models and adopting AI in business: read our team's analysis and ValueXI cases

June 20, 2024

Artificial Intelligence and Machine Learning allow you to get the most out of your data and client information, turning it into profit. However, the ideal scenario of having accurate, well-labeled, and ample data is rarely met. How can we address these data-related issues when launching an AI project?

Data quality: the cornerstone of AI

Frequent challenges in data processing, such as data scarcity, incomplete datasets, data errors, insufficient data on specific examples, and unreliable labeling, can jeopardize a project’s success. Poor-quality data almost invariably leads to subpar results, regardless of the sophistication of the development team or the algorithms employed. We will share some practical experiences from our projects to ensure your AI initiatives start on solid footing.



1. Data scarcity or no data

Data scarcity is a significant barrier to training effective machine learning models: insufficient data can produce models that perform poorly or fail to generalize beyond their training sets. Large Language Models (LLMs) like ChatGPT and Llama offer a solution by generating synthetic data that mimics real-world scenarios, or by enriching the dataset with real, contextually relevant samples from open sources. These capabilities make it possible to enhance datasets when collecting more data is impractical or impossible.
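As a minimal illustration of the synthetic-data approach, the sketch below asks an LLM to emit records as JSON and validates the response before use. Here `call_llm` is a stub standing in for whatever client you actually use (the OpenAI SDK, a local Llama server, etc.), so the example runs offline; the schema and field names are purely illustrative.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM client (OpenAI SDK, local Llama endpoint, etc.).
    Returns a canned response so this sketch is runnable offline."""
    return json.dumps([
        {"title": "Marketing Manager", "years_experience": 5},
        {"title": "Content Strategist", "years_experience": 3},
    ])

def generate_synthetic_records(description: str, n: int) -> list[dict]:
    """Ask the LLM for n synthetic records matching a schema description,
    then keep only well-formed entries."""
    prompt = (
        f"Generate {n} synthetic records as a JSON array. "
        f"Each record: {description}. Respond with JSON only."
    )
    raw = call_llm(prompt)
    records = json.loads(raw)
    return [r for r in records if isinstance(r, dict)]

samples = generate_synthetic_records(
    "a resume summary with 'title' and 'years_experience'", n=2
)
print(len(samples))  # 2 with the canned response above
```

In practice the validation step matters as much as the prompt: LLMs can return malformed JSON or hallucinated fields, so every generated record should be checked against the expected schema before it enters the training set.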


Using LLMs can improve the robustness and accuracy of machine learning models. This approach helps overcome data shortages and accelerates the development and refinement of ML applications across various fields.


However, LLMs have their downsides. They can lose context, hallucinate, respond inappropriately, and thereby disrupt business processes. Privacy concerns are also significant. Successfully implementing an LLM-based project requires substantial intellectual and financial investment, including prompt-engineering skills.

ValueXI case No data

Client: HR unit of a major German marketing agency.


Business goal: ML-based resume processing


  • Collect relevant resumes as part of a candidate search project.
  • Facilitate preliminary analysis of a larger number of resumes to understand how well a job seeker fits a vacancy.

Solution: A machine learning model was created to evaluate resumes and rank them by how well they match the vacancy.
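The details of the production model are not public; as a rough baseline sketch of the ranking idea, the snippet below scores resumes against a vacancy by keyword overlap (Jaccard similarity). A real system would use embeddings or TF-IDF rather than raw word sets; the vacancy and resume texts are invented for illustration.

```python
def keywords(text: str) -> set[str]:
    """Lowercase word set; a real system would use embeddings or TF-IDF."""
    return {w.strip(".,") for w in text.lower().split() if len(w) > 2}

def match_score(vacancy: str, resume: str) -> float:
    """Jaccard overlap between vacancy and resume keywords (0..1)."""
    v, r = keywords(vacancy), keywords(resume)
    return len(v & r) / len(v | r) if v | r else 0.0

def rank_resumes(vacancy: str, resumes: list[str]) -> list[tuple[float, str]]:
    """Sort resumes by descending match score."""
    return sorted(((match_score(vacancy, r), r) for r in resumes), reverse=True)

vacancy = "Senior marketing manager with SEO and content strategy experience"
resumes = [
    "Backend developer experienced in Python and SQL",
    "Marketing manager, 6 years of SEO and content strategy",
]
ranking = rank_resumes(vacancy, resumes)
print(ranking[0][1])  # the marketing resume ranks first
```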


The client did not have a database to train the model, so we decided to use an LLM to generate job requests and to analyze incoming vacancies in the first phase of the campaign.


Since the client works with personal data, we used Llama 2, which can be run in-house and thus meets the requirements of the General Data Protection Regulation (GDPR).


Result: The solution helped analyze resumes effectively and select the most suitable candidates twice as fast and with greater accuracy.

2. Incomplete data

A common issue occurs when the available data, while seemingly adequate, does not fully cover the problem due to the subject area being too narrow. There are two ways to address this:

  1. Iterative development: Launch a new development phase to refine the ML system.
  2. Comprehensive initial data provision: Provide a complete set of expected data at the project's outset to prevent issues.


For example, we developed a ValueXI-powered system predicting vehicle routes in a designated area, which enhanced dispatchers’ ability to manage and respond to incidents, thereby reducing accident rates and environmental impact. However, post-deployment, the client wished to apply the system to a different geographical area, which led to performance issues because the system was not initially designed to adapt to varied locations.


We initiated a separate project to adapt the system to the new area's specific requirements. This rework could have been avoided by supplying comprehensive initial data.



3. Data errors

Data errors, particularly in systems where information is manually entered, are a common challenge that can compromise the integrity of machine learning models.


To combat this, it is essential to undertake meticulous data cleansing which allows for the identification and correction of errors before they influence the training process. Implementing these stringent checks ensures that the data used for machine learning is as accurate and reliable as possible, paving the way for more dependable model outcomes.
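The cleansing checks described above can be sketched as simple per-record validators run before training. The fields, ranges, and formats below are invented examples; a real project would derive them from the domain schema and often use a dedicated validation library.

```python
from datetime import datetime

def validate_row(row: dict) -> list[str]:
    """Return a list of problems found in a manually entered record."""
    problems = []
    if not row.get("name", "").strip():
        problems.append("missing name")
    age = row.get("age")
    if not isinstance(age, int) or not 0 < age < 120:
        problems.append("implausible age")
    try:
        datetime.strptime(row.get("visit_date", ""), "%Y-%m-%d")
    except ValueError:
        problems.append("bad date format")
    return problems

rows = [
    {"name": "Anna", "age": 34, "visit_date": "2024-03-01"},
    {"name": "", "age": 208, "visit_date": "03/01/2024"},
]
# Keep only rows that pass every check; log or repair the rest.
clean = [r for r in rows if not validate_row(r)]
print(len(clean))  # 1: the second row fails all three checks
```

Logging the rejected rows, rather than silently dropping them, is usually worthwhile: systematic failures (e.g. one date format from one data-entry tool) often point to fixes upstream of the model.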



4. Limited data on specific examples

High data volume does not guarantee a functional ML model if little is known about the samples. Common solutions include:

  • Data augmentation: Acquire more data to enhance model training.
  • Task segmentation: Break down the primary task into smaller, manageable parts.
  • Goal reframing: Shift the project focus to a more achievable objective if necessary.


We faced such a challenge in one of our ValueXI projects, implemented for a travel agency. The company's internal system was overwhelmed by rapid client-base growth, leading to reduced conversion rates and potential customer dissatisfaction.


Initially, we aimed to identify customer requests where clients were most inclined to make purchases. However, this goal proved impractical: the data we had access to included only the time of the request, the departure and arrival countries and cities, and the type and class of the flight, with no details about the clients themselves.


Consequently, we shifted focus to a different metric: estimating the likelihood that customers would promptly answer calls. This new approach prioritized high-potential requests for quick follow-up by agents, resulting in a significant reduction in lead response time and a substantial increase in sales.
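The trained model itself is not shown here; as a stand-in, the sketch below uses a simple heuristic over the available request metadata to order the follow-up queue. The specific feature weights (daytime hours, flight class) are assumptions for illustration, not the agency's actual model.

```python
from datetime import datetime

def answer_likelihood(request: dict) -> float:
    """Heuristic stand-in for the trained model: score how likely the
    customer is to answer a prompt call, using only request metadata."""
    score = 0.5
    hour = datetime.fromisoformat(request["time"]).hour
    if 9 <= hour < 20:              # daytime requests are answered more often
        score += 0.3
    if request["flight_class"] == "business":
        score += 0.1                # assumed signal, for illustration only
    return min(score, 1.0)

requests = [
    {"time": "2024-05-01T02:15", "flight_class": "economy"},
    {"time": "2024-05-01T11:40", "flight_class": "business"},
]
# Agents call the highest-scoring requests first.
queue = sorted(requests, key=answer_likelihood, reverse=True)
print(queue[0]["time"])  # the daytime business-class request comes first
```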



5. Inaccurate labeling

Machine learning models depend heavily on accurately labeled data for training. Inaccurate labeling can skew the learning process, leading to poorly performing models.


To address this issue, it is beneficial to involve your in-house experts in the labeling process. Collaborating closely with external developers, these experts can ensure that labels are both accurate and relevant, significantly enhancing the model’s effectiveness and reducing the likelihood of errors that could arise from completely outsourcing this critical task.

ValueXI case Inaccurate labeling

In a project for a medical equipment supplier, we had to label ultrasound images for a carotid artery screening solution on our own. This required domain expertise and increased both costs and timelines, which could have been reduced with the client's in-house expertise.


6. Data privacy concerns

Businesses rightly treat their data as a valuable asset and may hesitate to share it with third parties. Anonymized data can be efficiently used to develop a proof of concept without exposing sensitive information. This approach ensures data security while allowing the development of a reliable model.
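A minimal sketch of the anonymization step: masking obvious identifiers before data leaves the company. The regexes below catch only emails and phone numbers and are for illustration; a production system would use a dedicated PII-detection or NER tool and cover names, addresses, and IDs as well.

```python
import re

# Simple PII masks; production systems should use a proper PII/NER library.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def anonymize(text: str) -> str:
    """Replace emails and phone numbers with placeholder tokens."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

msg = "Contact Jane at jane.doe@example.com or +49 30 1234567."
print(anonymize(msg))  # Contact Jane at [EMAIL] or [PHONE].
```

Using labeled placeholders such as `[EMAIL]` rather than deleting the text preserves sentence structure, which keeps the anonymized data useful for model development.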

ValueXI case Data privacy concerns

Client: A leading global manufacturer of household appliances and electronics (Forbes-2000).


Business goal: Develop a tool that provides analytical information based on dialogues between the client's technical support staff and customers.


Solution:

  • Structured extraction of entities from source dialogues using ChatGPT
  • Vectorization of extracted entities (topics)
  • Clustering of vectorized entities
  • Interpretation of clusters
  • Formation and visualization of results
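The actual pipeline used ChatGPT for extraction and proper embeddings for vectorization; as a self-contained offline sketch of the vectorize-then-cluster steps, the snippet below uses bag-of-words vectors, cosine similarity, and a greedy threshold-based grouping. The example topics stand in for what an LLM might extract from support dialogues.

```python
from collections import Counter
from math import sqrt

def vectorize(topic: str) -> Counter:
    """Bag-of-words vector; real pipelines would use sentence embeddings."""
    return Counter(topic.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(topics: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy clustering: attach each topic to the first cluster whose
    seed is similar enough, otherwise start a new cluster."""
    clusters: list[tuple[Counter, list[str]]] = []
    for t in topics:
        v = vectorize(t)
        for seed, members in clusters:
            if cosine(seed, v) >= threshold:
                members.append(t)
                break
        else:
            clusters.append((v, [t]))
    return [members for _, members in clusters]

# Illustrative topics, as an LLM might extract them from dialogues.
topics = [
    "washing machine not spinning",
    "washing machine not draining",
    "warranty extension question",
]
groups = cluster(topics)
print(len(groups))  # the two washing-machine topics cluster together
```

Interpreting each resulting cluster (the next step in the list above) can itself be delegated to an LLM, e.g. by asking it to summarize the topics in each group into a single FAQ candidate.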


Data preparation:

  • All dialogues were translated into English for LLM efficiency (Global Support Unit)
  • All dialogues were anonymized to exclude any personal data (Llama)
  • Extraction prompts were adapted for the chosen LLM


Result: The dialogue analysis tool helped to:

  • Identify the most frequent questions and add them to the customer’s FAQ
  • Assess customer satisfaction rate
  • Evaluate the performance of technical support staff

ValueXI comes into play

One of the great things about ValueXI is its ability to tackle various data-related challenges:

  • It ensures thorough data preparation and checks the dataset for completeness and quality, which saves time and money.
  • It automates the identification of unusual data patterns.
  • It supports data privacy by enabling anonymization or keeping data entirely internal.
  • It offers control over how results are interpreted and ensures their transparency, helping to avoid mistakes in the final data output.
  • It opens up opportunities for companies short on data or those handling sensitive data that cannot be shared, as well as for businesses looking to test hypotheses with LLMs.
  • It continually checks how the model performs with new data.


ValueXI’s Dataset Validation feature lets you check if your dataset is suitable and ready for AI training. The Dataset Analytics report gives you detailed insights into your dataset and suggestions for improvement. With just a click on the “Start Training” button, ValueXI guides you through a proven project creation pipeline. Furthermore, our Data Science team stands ready to support you and resolve any outstanding issues. It’s as simple as that!

Recap

It is no surprise that data scientists consider data preprocessing the most time-consuming, and therefore most expensive, part of DS projects. Adopting AI requires a strategic approach to data handling and model management. Properly processed, well-structured data and efficient, domain-specific features are key success factors.


Embrace AI with ValueXI, and watch your data transform into your most valuable asset! Request a demo to see how LLMs can be used to solve data-related issues when adopting AI with the ValueXI ML platform.
