
BLOG 3 "Wrangling data – from raw data to value": Data accessibility in AI-driven B2B sales

The value of data can be fully realized through careful preprocessing and exploratory analyses. Achieving this requires continuous collaboration between data scientists and data owners. The process begins with understanding the data, followed by analysis and interpretation of results.

In this blog post, we focus on data preprocessing steps, specifically data wrangling. These steps are essential before data can be transformed into data-driven insights. To maximize the production value of data (Figure 1), it is crucial to ensure that the derived insights are actionable and aligned with business priorities.

This article is Part III of the InnoSale blog series, in which we explore different perspectives on data usage and sharing within collaborative research projects. Earlier posts discussed use cases (Part I) and stakeholders (Part II), while the upcoming posts will tackle the confidentiality of the AI model (Part IV) and the business benefit (Part V). The blogs are published at https://www.innosale.eu/. Please also join our webinar on 29.5.2024, 14:00–15:30 Finnish time (13:00–14:30 CET) – registration link.


Figure 1. Data wrangling increases the value of data (picture: MS Copilot)

Efficient data wrangling accelerates exploratory analytics, enabling more in-depth analyses and a smoother transition to production. In the data value funnel (Figure 2), data wrangling corresponds to preparing data sources for exploratory analyses, which ultimately deliver valuable insights.


Figure 2. Data value funnel [1]

Assisting a salesperson requires data from many sources

In the context of our project, we consider a use case in which the aim is to support an inexperienced salesperson who wishes to present a preliminary offer to a large customer using a data-driven tool. The customer has provided only a high-level description of their needs. Based on all earlier orders, the data-driven tool recommends a product configuration that aligns with the customer's requirements.

Former orders from the same customer can be retrieved from the CRM (Customer Relationship Management) system, and details of the delivered products from the ERP (Enterprise Resource Planning) system. Since this is the first time the customer has requested tailoring, we need to determine whether any existing customers have had similar requirements. Additionally, we explore whether any additional information, or even blueprints, is available.

To achieve this, we analyze former discussions between salespersons and the supporting engineering department, stored in the IT Service Management ticket database. This use case demonstrates the versatility of real-life data, which consists of both numerical and textual information. It also highlights the functionalities required of data-driven tools.
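As a simplified illustration of the "similar requirements" search described above, the sketch below matches a new free-text requirement against earlier order descriptions using TF-IDF and cosine similarity. This is a minimal sketch, not the project's actual tool; the field names and toy data are assumptions.

```python
# Minimal sketch: find past orders similar to a new free-text requirement.
# Field names and example data are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

past_orders = [
    {"order_id": "A-101", "requirement_text": "compact conveyor, stainless steel, food grade"},
    {"order_id": "B-202", "requirement_text": "heavy-duty conveyor for mining, dust sealing"},
    {"order_id": "C-303", "requirement_text": "modular conveyor line with quick-release belts"},
]
new_requirement = "food-safe stainless conveyor for a compact production line"

vectorizer = TfidfVectorizer(stop_words="english")
corpus = [o["requirement_text"] for o in past_orders]
tfidf = vectorizer.fit_transform(corpus + [new_requirement])

# Similarity of the new requirement (last row) against all past orders.
scores = cosine_similarity(tfidf[-1], tfidf[:-1]).ravel()
for order, score in sorted(zip(past_orders, scores), key=lambda x: -x[1]):
    print(f"{order['order_id']}: similarity {score:.2f}")
```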

In the following sections, we present a simplified overview of the data wrangling steps, starting from data discovery and followed by structuring, cleaning, enriching, validating, and publishing.

Data discovery

The data discovery process involves collaboration between use case owners, data owners, and data scientists. Use case owners define the business problems, data owners provide access to relevant datasets, and data scientists analyze the data to generate actionable insights. This fosters an iterative collaboration that aligns the analyses with business goals. In the context of our project, notable effort has been put into identifying use cases where the adoption of Artificial Intelligence methods adds value (Part I of the blog series) and into stakeholder communications (Part II).

Structuring

When data is obtained from real-world sources, such as data lakes, it can be in unstructured or arbitrary formats. Before analysis methods can be applied, the data must be transformed into a structured format, which, depending on the case, may require considerable time and effort. Combining data from different sources into a single structured format might also be necessary. The basis for all this preprocessing work lies in understanding the data. If data from different sources belongs to the same entity (e.g., sales orders), a way to combine the data instances must be implemented, such as through ID-number mappings. Additionally, data types must be decided when converting data into a structured format: data fields can be expressed in various forms, including free text, categorical values, or continuous values. Examples of structured data formats include SQL databases, CSV files, and JSON files.
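To make the ID-mapping step concrete, here is a minimal sketch that joins CRM and ERP extracts describing the same sales orders into one structured table. It assumes pandas; the column names and values are invented for illustration, and the real schemas will differ.

```python
# Minimal sketch: combine CRM and ERP extracts into one structured table.
# Column names and values are illustrative assumptions, not the project schema.
import pandas as pd

crm = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "customer": ["Acme Oy", "Beta GmbH", "Acme Oy"],
    "order_date": ["2023-05-02", "2023-06-17", "2023-09-30"],
})
erp = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "product_code": ["CONV-S", "CONV-H", "CONV-M"],
    "delivered_qty": [2, 1, 4],
})

# Decide data types explicitly while structuring.
crm["order_date"] = pd.to_datetime(crm["order_date"])

# ID-number mapping: a shared order_id links instances of the same entity.
orders = crm.merge(erp, on="order_id", how="inner")
print(orders)
```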

Cleaning

Data cleaning is a crucial step in the data wrangling process, as it lays the foundation for meaningful analysis. It’s important to note that the approach to data cleaning is inherently case-specific, as each dataset comes with its unique challenges and intricacies.

One fundamental aspect of data cleaning is the identification and handling of missing values. Detecting instances where samples or features are missing is important. Depending on the severity and context, one viable option is to remove those incomplete samples. However, a more nuanced strategy involves considering various imputation methods to replace missing values. This approach ensures that the integrity of the dataset is maintained, taking into account the specific requirements of the analysis at hand.

Data cleaning also involves identifying anomalous values. For instance, timestamps that project thousands of years into the future are likely erroneous and require attention. Such anomalies can compromise the integrity of the dataset, emphasizing the need for a review.

In scenarios involving free text, another layer of complexity arises. Errors in inputted data, whether due to typos or other inaccuracies, may manifest and warrant correction or, in some cases, exclusion. The key is to strike a balance between preserving valuable information and ensuring data accuracy.

The goal of data cleaning is to produce a dataset that is more amenable to the subsequent stages of the analysis pipeline. By addressing missing values, anomalous entries, and inaccuracies in free text, the result is a cleaned dataset that serves as a solid foundation for more accurate and meaningful insights. This curation not only enhances the reliability of the data but also streamlines the analytical process.
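The hedged sketch below applies the two checks discussed above, imputing missing values and flagging implausible timestamps, to a toy pandas table. The column names, the median imputation, and the cut-off date are assumptions chosen only for illustration.

```python
# Minimal sketch: handle missing values and anomalous timestamps.
# Columns, imputation strategy, and the cut-off date are illustrative.
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount_eur": [1200.0, None, 860.0, 990.0],
    "created_at": ["2023-01-05", "2023-02-11", "2123-03-01", "2023-04-22"],
})
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")

# Impute missing numeric values instead of dropping the whole row.
df["amount_eur"] = df["amount_eur"].fillna(df["amount_eur"].median())

# Flag timestamps far in the future as anomalous for review.
anomalous = df["created_at"] > pd.Timestamp("2030-12-31")
print(df[anomalous])   # rows needing attention
df = df[~anomalous]    # or exclude them from the cleaned set
```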

Enriching

Data enrichment is the process of enhancing or improving the quality of a dataset by adding additional information or context. The goal is to make the data more valuable and informative for analysis or machine learning. This additional information can come from various sources and is mapped to the data fields of the original dataset. Data enrichment is particularly valuable in situations where the dataset is incomplete or lacks certain details deemed necessary for analysis or model training. By enhancing the dataset with additional information, it is often possible to improve the performance and accuracy of machine learning models and gain more comprehensive insights from the data.

Examples of ways to enrich a dataset with external data (a small sketch follows below):

  • Geospatial Data: The original CRM data incorporates customer countries and sales addresses. By leveraging an open map service, it is possible to extract more detailed information, such as the province or municipality of the customers.
  • Temporal Data: The original CRM data contains sales dates. Gathering “ambient” economic information, such as the GDP of the country at the time, may enable a better understanding of the success probability of a sales case.
  • Firmographic Data: The original CRM data contains the name and VAT number of each client. Utilizing a B2B prospecting platform (e.g., Vainu or Crunchbase) enables automatic generation of key firmographic information, including revenue and trends.

Data enrichment not only enhances the reliability of the dataset but also provides a solid foundation for more accurate and meaningful insights in subsequent analyses.
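As a deliberately simplified illustration of the temporal enrichment above, the sketch below joins a GDP-growth lookup table onto CRM sales rows by country and year. The figures, column names, and source are assumptions, not project data.

```python
# Minimal sketch: enrich CRM rows with ambient economic context.
# The GDP figures and column names are made-up illustrations.
import pandas as pd

sales = pd.DataFrame({
    "order_id": [1001, 1002],
    "country": ["FI", "DE"],
    "sale_year": [2022, 2023],
})

# External lookup table, e.g., collected from a public statistics source.
gdp = pd.DataFrame({
    "country": ["FI", "FI", "DE", "DE"],
    "year": [2022, 2023, 2022, 2023],
    "gdp_growth_pct": [1.6, -0.9, 1.8, -0.1],
})

enriched = sales.merge(
    gdp, left_on=["country", "sale_year"],
    right_on=["country", "year"], how="left",
).drop(columns="year")
print(enriched)
```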

Validating

After the previous steps, an initial version of the dataset exists. However, to ensure the correctness of the dataset, it must undergo validation. This process involves going through the dataset using a set of rules and conditions. These checks include verifying data types (numerical, alphabetical, etc.), ensuring that values fall within specified numerical ranges, and confirming the existence of all required data fields or files. The validation procedure is case-specific and must be adapted to the dataset at hand and its unique characteristics.
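A hedged sketch of such rule-based checks is shown below. The specific rules, required columns, expected types, and value ranges are invented examples of the case-specific conditions described above.

```python
# Minimal sketch: rule-based validation of a wrangled dataset.
# The rules are illustrative; real checks are case-specific.
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    errors = []
    # All required data fields must exist.
    for col in ("order_id", "amount_eur", "created_at"):
        if col not in df.columns:
            errors.append(f"missing column: {col}")
            return errors
    # Data types must match expectations.
    if not pd.api.types.is_numeric_dtype(df["amount_eur"]):
        errors.append("amount_eur must be numeric")
    # Values must fall within plausible ranges.
    if (df["amount_eur"] < 0).any():
        errors.append("negative amounts found")
    return errors

df = pd.DataFrame({
    "order_id": [1, 2],
    "amount_eur": [1200.0, -5.0],
    "created_at": pd.to_datetime(["2023-01-05", "2023-02-11"]),
})
print(validate(df))  # -> ['negative amounts found']
```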

Publishing

Publishing refers to the process of making the cleaned, transformed, and validated data available for analysis, reporting, or sharing with others. This step involves creating a final, polished dataset that is ready for use by analysts, data scientists, or other stakeholders during the analysis phase. The key aspects of publishing include:

  • Documentation: Any changes made to the dataset in the previous steps are reported. Metadata describing variable names and types is also included.
  • Data Formatting: The final published dataset is converted to a format that is compatible with the storage system.
  • Accessibility and Sharing: In this step, access permissions and sharing platforms are defined.
  • Version Control: Any changes made to the dataset are documented to enable transparency and traceability.

By following these steps, the dataset becomes a reliable foundation for subsequent analyses, ensuring accurate and meaningful insights.
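To illustrate, the sketch below "publishes" a cleaned table by writing it to a shared format together with a small metadata file. The file names, the CSV-plus-JSON choice, and the metadata fields are assumptions, not a prescribed format.

```python
# Minimal sketch: publish a cleaned dataset with accompanying metadata.
# File names, format, and metadata fields are illustrative choices.
import json
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2],
    "amount_eur": [1200.0, 860.0],
})

# Data formatting: convert to a format compatible with the storage system.
df.to_csv("orders_clean_v1.csv", index=False)

# Documentation and version control: record variables, types, and changes.
metadata = {
    "dataset": "orders_clean",
    "version": "v1",
    "variables": {col: str(dtype) for col, dtype in df.dtypes.items()},
    "changes": ["imputed missing amount_eur with median",
                "removed anomalous future timestamps"],
}
with open("orders_clean_v1.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```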

VTT research aims

At VTT, our commitment to advancing AI applications is grounded in rigorous applied research that leverages cutting-edge AI methods. In this project, our focus lies in identifying use cases where AI can deliver added value. We collaborate closely with industrial partners to implement AI components that specifically support salespersons.

The central research question guiding our efforts revolves around how to provide timely, well-formatted information to the right user. Our solutions are designed for broader applications across diverse domains and industries. By harnessing the power of AI, we aim to empower sales teams and enhance decision-making processes.

Best practices

Effective communication between data owners and data scientists is crucial for improving data quality and for defining in greater detail the goals of data utilization. This ongoing dialogue also fosters trust between the involved parties, which is essential for achieving the planned targets.

Below are examples of how technical analysis challenges were addressed in the InnoSale project:

  • Understanding Data Fields and Completing Missing Elements: In the initial data delivery, data scientists had difficulty understanding the meaning of all data fields, and some relevant elements were missing. To address this, collaborative discussions were held to clarify the contents of the data. Missing data elements were identified and iteratively added to the distributed dataset.
  • Sensitive Data Anonymization: Certain data fields contained personal or highly confidential information. Prior to data delivery, data scientists and data owners discussed these sensitive fields and devised a solution based on different levels of anonymization. While strong anonymization protects data owners by concealing the actual information from data scientists, it also poses challenges during development and analysis, since the interpretation of results becomes limited to the data owner. By carefully considering various anonymization models and their consequences, an appropriate level of anonymization was chosen that balances safety and efficient development (a simplified sketch of level-based anonymization follows this list).
  • Refining AI Goals through Detailed Data Analysis: Initially, the goals for applying AI were stated at a high level, lacking concrete targets for development. Subsequent meetings between data scientists and data owners focused on analyzing different aspects of the data. The analysis results provided deeper insights, enabling more precise planning of AI tools that are both useful and realistic for implementation.
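The sketch below shows one simple way to realize such anonymization levels: keep a field as-is, pseudonymize it with a salted hash, or suppress it entirely. The levels, salt handling, and field names are illustrative assumptions, not the project's actual scheme.

```python
# Minimal sketch: field anonymization at different levels.
# Levels, salt handling, and field choices are illustrative assumptions.
import hashlib

def anonymize(value: str, level: int, salt: str = "project-secret") -> str:
    if level == 0:
        return value                      # no anonymization
    if level == 1:
        # Pseudonymize: a stable hash lets analysts link records
        # without seeing the actual value.
        return hashlib.sha256((salt + value).encode()).hexdigest()[:12]
    return "<redacted>"                   # full suppression

record = {"customer": "Acme Oy", "contact": "Jane Doe", "country": "FI"}
levels = {"customer": 1, "contact": 2, "country": 0}

anonymized = {key: anonymize(val, levels[key]) for key, val in record.items()}
print(anonymized)  # e.g. {'customer': '<hash>', 'contact': '<redacted>', 'country': 'FI'}
```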

By addressing these challenges, the InnoSale project demonstrates the importance of effective communication, data refinement, and goal-oriented planning in successful data analysis and AI implementation.

Authors

Sari Järvinen, Arttu Lämsä, Tuomas Sormunen, Jussi Liikka, Johannes Peltola and Marko Jurvansuu from VTT.

[1] J. M. Hellerstein, T. Rattenbury, J. Heer, S. Kandel, and C. Carreras, Principles of Data Wrangling, O'Reilly Media, Inc., July 2017.

