Transforming Data
June 18, 2024

Combined Data Sets

Walkthrough of combined data sets in left join format

Most APIs output singular data resources.  Most of these data resources reference other related data sets, and this is usually done through an identifier. One will see this identifier, a unique string of characters that points to a related data set.

Data resources in applications exist separately

Most APIs typically do not return a combination of these related data sets by default. Instead, they present one with a single, distinct data set.

For a clearer picture, let's consider a concrete example.

Suppose you're working with a customer resource that includes data for each customer such as:

  • Name
  • Email
  • Sign-up Date

This resource might be critical for your company to track information about clients. At the same time, another equally vital resource might be a list of all transactions made by these customers, with each transaction featuring details like:

  • Transaction date
  • Amount
  • Product or service purchased

In a typical scenario, if you query your API for customer information, it will return just that - the data from the customer resource. Similarly, if you ask for transaction details, you're likely only to see the pertinent transactions data. They’re distinct data sets, each with its own exclusive API response. Each query lives in its silo, autonomously presenting standalone, unrelated data sets.

But what if you want to see both together? What if you need a comprehensive summary of each customer and their transaction activities to analyze consumer behavior or make forecast predictions? Then, this disconnectedness between data sets presents a challenge.

The left-joined combo data set is the solution.  It integrates the matching entries from both data resources.

This perspective gives you a broader, more complete view of your data, and especially essential when you can't afford to drop any entry from your first data set.

Contuining with the example, working to get insights out of a combination all your customers (left data set) and their transaction activities (right data set). A left join comes in handy to ensure that you retain all customers in the final data set, including those who have not made any transactions. You can visualize the impact of every customer and strategize your outreach even if they're yet to be a part of the transactional data.

Left-joined combined data set

Here's an example: you have a table 'Customers' listing unique customer IDs, and another table 'Transactions' containing transaction details linked to customer IDs. When you perform a left join on 'Customers' with 'Transactions', the result is a comprehensive table that maintains all customers, paired with their corresponding transaction details where applicable. Customers who have not transacted will still exist in this final dataset, with 'NULL' or empty values in the transaction columns, signifying the absence of a transaction. You have now successfully combined your datasets while preserving all your invaluable customer data.

The best ETL solutions will provide pre-combined data sets so that you can get more insights.  Less automated solutions will let you manually blend data sets, albeit through a complex process of mapping and validating.

Recent blog