Crosspost: Data discovery doesn’t belong in ad hoc queries

Data teams may struggle to quantify the benefits of good data documentation, but running countless ad hoc validation queries incurs both computational and cognitive costs.
data
workflow
elt
crosspost
Author

Emily Riederer

Published

July 18, 2024

Credible documentation is the best tool for working with data. Short of that, labor-intensive (and computationally expensive) validation may be required. Recently, I had the opportunity to expand on these ideas in a cross-post with Select Star. I explore how a “good” data analyst can interrogate a dataset with expensive queries and, more importantly, how best-in-class data products eliminate the need to do so.

My post is reproduced below.


In the current environment of decreasing headcount and rising cloud costs, the benefits of data management are more objective and tangible than ever. Done well, data management can reduce the cognitive and computational costs of working with enterprise-scale data.

Analysts often jump into new-to-them tables to answer business questions. Without a robust data platform, this constant novelty leads analysts down one of two paths. Either they boldly gamble that they have found intuitive and relevant data or they painstakingly hypothesize and validate assumptions for each new table. The latter approach leads to more trustworthy outcomes, but it comes at the cost of human capital and computational power.

Consider an analyst at an e-commerce company asking the question “How many invoices did we generate for fulfilled orders to Ohio in June?” while navigating unfamiliar tables. In this post, we explore prototypical queries analysts might have to run to validate a new-to-them table. Many of these are “expensive” queries requiring full table scans. Next, we’ll examine how a data discovery platform can obviate this effort.

The impact of this inefficiency may range from a minor papercut to a major cost sink depending on the sizes of your analyst community, historical enterprise data, and warehouse.

6 Preventable Data Discovery Queries

1. What columns are in the table?

Without a good data catalog, analysts will first need to check what fields exist in a table. While there are lower-cost ways to do this, such as looking at a pre-rendered preview (à la BigQuery), using a DESCRIBE statement (à la Spark), or limiting the query to the first few rows, some analysts may default to requesting all the data.

select *
from invoices;
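
For contrast, the lower-cost alternatives mentioned above might look something like the following sketch. Exact syntax varies by engine, and on some warehouses (BigQuery, for example) a limit clause does not actually reduce the bytes scanned.

-- metadata only, no table scan (Snowflake/Spark-style syntax)
describe table invoices;

-- small preview; cheap on most engines, though some may still bill a full scan
select *
from invoices
limit 10;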

2. Is the table still live and updating?

After establishing that a table has potentially useful information, analysts should next wonder if the data is still live and updating. First they might check a date field to see if the table seems “fresh”.

select max(order_date) 
from invoices;

But, of course, tables often have multiple date fields. For example, an e-commerce invoice table might have fields for both the date an order was placed and the date the record was last modified. So, analysts may guess-and-check a few of these fields to determine table freshness.

select max(updated_date) 
from invoices;

After identifying the correct field, there’s still a question of refresh cadence. Are records added hourly? Daily? Monthly? Lacking system-level metrics and metadata on upstream table freshness, analysts are still left in the dark. So, once again, they can check empirically by counting records per value of the date field.

select updated_date, count(1) as n
from invoices
group by 1;

3. What is the grain of the table?

Now that the table is confirmed to be usable, the question becomes how to use it. Specifically, to credibly query and join the table, analysts next must determine its grain. Often, they start with a guess informed by the business context and data modeling conventions, such as assuming an invoice table is unique by order_id.

select count(1) as n, count(distinct order_id)
from invoices;

However, if they learn that order_id has a different cardinality than the number of records, they must ask why. So, once again, they scan the full table to find examples of records with shared order_id values.

select *
from invoices
qualify count(1) over (partition by order_id) > 1
order by order_id
limit 10;

Eyeballing the results of this query, the analyst might notice that the same order_id value can coincide with different ship_id values, as a separate invoice is generated for each part of an order when a subset of items is shipped. With this new hypothesis, the analyst iterates on the validation of the grain.

select count(1) as n, count(distinct order_id, ship_id)
from invoices;

4. What values can categorical variables take?

The prior questions all involved table structure. Only now can an analyst finally begin to investigate the table’s content. A first step might be to understand the valid values for categorical variables. For example, if our analyst wanted to ensure only completed orders were queried, they might inspect the potential values of the order_status_id field to determine which values to include in a filter.

select distinct order_status_id
from invoices;

They’ll likely repeat this process for many categorical variables of interest. Since our analyst is interested in shipments specifically to Ohio, they might also inspect the distinct values of the ship_state field to ensure they format the state identifier correctly (e.g., “OH” versus “Ohio”).

select distinct ship_state
from invoices;

5. Do numeric columns have nulls or ‘sentinel’ values to encode nulls?

Similarly, analysts may wish to audit other variables for null handling or sentinel values (such as -1 or -999) by inspecting column-level statistics. For a hypothetical numeric invoice_amount column, that audit might look like:

-- invoice_amount is a hypothetical numeric column on the invoices table
select min(invoice_amount), max(invoice_amount), count(1) - count(invoice_amount) as n_null
from invoices;

6. Is the data stored with partitioning or clustering keys?

Inefficient queries aren’t only a symptom of ad hoc data validation. More complex and reused logic may also be written wastefully when table metadata like partitioning and clustering keys is not available to analysts. For example, an analyst might be able to construct a reasonable query filtering either on a shipment date or an order date, but if only one of these is a partitioning or clustering key, the two seemingly equivalent queries can have substantially different costs.
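
As a sketch, suppose (hypothetically) that invoices is partitioned on order_date but not on a ship_date column. Two queries that look nearly interchangeable can then differ dramatically in cost:

-- prunes partitions: only June's partitions are read
select count(1) as n
from invoices
where order_date between '2024-06-01' and '2024-06-30';

-- no pruning: ship_date is not the partitioning key, so the full table is scanned
select count(1) as n
from invoices
where ship_date between '2024-06-01' and '2024-06-30';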

Understanding Your Data Without Relying on Queries

Analysts absolutely should ask themselves these types of questions when working with new data. However, it should not be analysts’ job to individually answer these questions by running SQL queries. Instead, best-in-class data documentation can provide critical information through a data catalog like Select Star.

1. What columns are in the table? And do we need a table?

Comprehensive search across all of an organization’s assets can help users quickly identify the right resources based on table names, field names, or data descriptions. Even better, search can incorporate observed tribal knowledge of table popularity and common querying patterns to prioritize the most relevant results. Moreover, when search also includes downstream data products like pre-built reports and dashboards, analysts might sometimes find an answer to their question exists off the shelf.

2. Is the table still live and updating? And are its own sources current?

Data is not a static artifact, so metadata should not be either. After analysts identify a candidate table, they should have access to real-time operational information like table usage, table size, refresh date, and upstream dependencies to help confirm whether the table is a reliable resource.

Ideally, analysts can interrogate not just the freshness of a final table but also its dependencies by exploring the table’s data lineage.

3. What is the grain of the table? And how does it relate to others?

Table grain should be clearly documented at the table level and emphasized in the data dictionary via references to primary and foreign keys. Beyond basic documentation, entity-relationship (ER) diagrams help analysts build a richer mental model of table grains and of how primary-foreign key relationships can link tables to craft results with the desired grain and fields. Alternatively, they can glean this information from the wisdom of the crowds if they have access to how others have queried and joined the data previously.
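
For instance, if the catalog documents a primary-foreign key relationship between a hypothetical orders table and invoices on order_id, an analyst can confidently roll invoices up to the order grain:

-- orders is a hypothetical upstream table keyed by order_id
select o.order_id, count(1) as n_invoices
from orders o
inner join invoices i
  on o.order_id = i.order_id
group by o.order_id;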

4. What values can categorical variables take? Do numeric columns have nulls or ‘sentinel’ values to encode nulls?

Information about proper expectations and handling of categorical and null values may be captured in field definitions, linked to lookup tables, implied by data tests, or illustrated in past queries. To drive consistency and offload redundant work from data producers, such field definitions can be propagated from upstream tables.
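
As one sketch of how such expectations might be encoded, a SQL-based data test (with hypothetical status values) can return any rows that violate the documented definition:

-- hypothetical accepted values; a non-empty result fails the test
select *
from invoices
where order_status_id not in ('placed', 'shipped', 'delivered', 'cancelled')
   or order_status_id is null;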

5. Is the data stored with partitioning or clustering keys?

Analysts cannot write efficient code if they don’t know where the efficiency gains lie. Table-level documentation should clearly highlight the use of clustering or partitioning keys so analysts can use the most impactful variables in filters and joins. Here, consistency of documentation is paramount; analysts may not always be incentivized to care about query efficiency, so if this information is hard to find or rarely available, they can be easily dissuaded from looking.

Beyond a poor user experience, weak data discoverability creates inefficiency and added cost. Even if you don’t have large-scale historical data or a broad data user community today, slow queries and tedious work still detract from data team productivity while introducing context-switching and chaos. By focusing on improving data discoverability, you can streamline workflows and enhance the overall efficiency of your data operations.