Joseph Naso

The Data Ecosystem is kind of a mess?

Every time I look at data-focused communities online, a new tool is being launched or a new funding announcement is being made. It’s both impressive and astonishing. It kind of feels like the wild west - or, more accurately, a land grab.

It seems that the influx of VC capital has led to the commoditization of the “data stack”. A good thing insofar as it makes access and use easier; a bad thing in that it adds a lot of noise.

The commoditization of software has generally been viewed as a good thing. Hopefully the data ecosystem will follow suit.

Joseph Naso

An Opinionated Primer on “Data” (Part 1)

I’m throwing this together for a friend who doesn’t work in a traditional tech role, but is interested in “data”. A broad subject indeed. Here are the basics; if you’re looking for more information or if I missed something entirely, let me know! I have plenty more to share.

This is by no means comprehensive, and purposely skips over a great deal of historical detail. Why? Because chances are 1) you don’t need to know about the history of distributed computing, 2) we want to get you up to speed on what’s worth knowing now, and 3) any one of the topics in these posts could be its own series of content. Let’s keep it simple.

We can bucket the concepts in data land into two categories - Principles & Frameworks (Part 1) and Infrastructure & Tools (Part 2). We’ll get to those in a minute. First, let’s lay out some ground rules.

Pre-script: For anyone reading this who thinks I am oversimplifying things and glossing over details a practitioner should know, you are right; that’s exactly what I’m doing and it is intentional. This isn’t meant for data practitioners; it’s meant for the rest of the internet who just wants to learn a thing or two and not be confused when someone starts talking about the data ecosystem.


Data Systems vs. Application Development

Web Applications

The apps and websites you use online are overwhelmingly “transactional”. Some might literally be processing a transaction (for instance, buying a book or small appliance on Amazon, or sending money to your friend using Venmo), but a “transaction” in a web application context doesn’t require that money be exchanged. Instead, it means that a record of some sort is created and stored in a somewhat complex chain of events - one that starts with you signing up and ends with the order confirmation email you receive after that online purchase. The same concept applies when you visit websites like Facebook or Reddit, or even when you log into your account on The Wall Street Journal or The New York Times - a record is stored and later modified. When you come back to the site, the “state” of your account reflects how you left it. It picks up from where you left off.

The most important thing here is that web applications care about how things are right now.

It’s similar to how you receive an Excel spreadsheet from that team you’re working with on the new big project; you open it in Excel and make some changes using the same document format in which it was originally created. You’re mimicking the way a web app works.
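
If you’re curious what that looks like under the hood, here’s a minimal sketch using Python’s built-in sqlite3 module. The table and fields are invented for illustration; the point is that one record gets created, then modified in place, and the app only ever cares about its current state.

import sqlite3

# An in-memory database stands in for the app's production database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, profile_photo TEXT)")

# Signing up creates a record...
conn.execute("INSERT INTO users (id, username, profile_photo) VALUES (1, 'jane', 'old.jpg')")

# ...and changing your profile picture later modifies that same record in place.
conn.execute("UPDATE users SET profile_photo = 'new.jpg' WHERE id = 1")
conn.commit()

# The app only cares about how things are right now.
print(conn.execute("SELECT username, profile_photo FROM users WHERE id = 1").fetchone())
# -> ('jane', 'new.jpg')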

Data Systems

This type of application differs from the typical data systems found in businesses ranging from mega-corporations to small startups. Are there cross-over examples? Of course there are, but we’ll get to those later. The most important thing to note is that data systems don’t typically follow this transactional pattern. Instead, they can use any of these designs:

  1. Timeseries

  2. Change logs (sequence of changes over time)

  3. Event-based

  4. Completely unstructured blob of information that might be used later on

    1. Some people call this a Data Lake. More on that later.

  5. A combination of any of the above

Most of this write-up will touch on the basics of what these bullets mean and how the pieces come together for companies to make use of them.
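
To make the change log and event-based bullets a little more concrete, here’s a tiny sketch with invented events. Instead of overwriting a single record, the system keeps every change and derives the state at any point in time by replaying the log.

# An append-only log of changes - nothing ever gets overwritten.
events = [
    {"ts": "2021-01-05", "field": "profile_photo", "value": "old.jpg"},
    {"ts": "2021-03-15", "field": "profile_photo", "value": "new.jpg"},
]

def state_as_of(events, date):
    """Replay the log up to a point in time to reconstruct the state."""
    state = {}
    for event in sorted(events, key=lambda e: e["ts"]):
        if event["ts"] <= date:
            state[event["field"]] = event["value"]
    return state

print(state_as_of(events, "2021-02-01"))  # {'profile_photo': 'old.jpg'}
print(state_as_of(events, "2021-12-31"))  # {'profile_photo': 'new.jpg'}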

Data Applications & Cross-overs

There are plenty of applications you touch on a regular basis that mix transactional patterns and data patterns. Facebook is a great example. Instagram and Twitter, too. Ever wonder how IG shows an approximate number (1.1M, for instance) of likes for super-popular accounts but your pictures show the exact number (97 likes, for instance)? Or how Twitter knows exactly which tweets to show in your feed despite hundreds of millions of tweets getting sent out every day? Data systems, and their specific design patterns, power that stuff. How that stuff works is a topic for another day.


Principles & Frameworks

Structured & Unstructured Data

Unstructured and semi-structured data is everywhere. The books you read and the websites you visit fall into the semi-structured category. Things like the music you listen to and UGC (user generated content) fall into the unstructured category. This stuff makes up a lot of the internet. A lot.

Structured data, however, is what powers the online experiences you know and love. The photos on Instagram, the products you view on an online store - doom scrolling and retail therapy are powered by structured data. What does that look like? Here’s one example.

Most simply, APIs are what power the internet as we know it, and their job is to provide structured data from one place on the internet to the place you’re viewing it (usually a browser or app). API stands for “Application Programming Interface” - a software intermediary that allows two separate systems to talk to each other. It’s common for them to spit out JSON - that’s how systems frequently exchange data. That Igloo Coolers link above is a bunch of JSON of their individual products.
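
Here’s a tiny, hand-written example of the kind of JSON a retail API might return (the fields and values are made up; this isn’t Igloo’s actual payload), parsed with Python’s standard json module:

import json

# A made-up payload, loosely in the shape an e-commerce API might return.
payload = '{"products": [{"id": 101, "name": "54 Qt Cooler", "price": 49.99, "in_stock": true}]}'

data = json.loads(payload)      # text from the API -> Python data structures
for product in data["products"]:
    print(product["name"], product["price"])   # 54 Qt Cooler 49.99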

There are different ways to make use of each of these categories of data. Some are better for ML, some for powering online experiences.

There is plenty online about APIs and their use cases. For this article, just know that all of these categories can be consumed and transformed into something useful in the modern data ecosystem.

The challenge isn’t usually where to find the data or how to make use of it; the challenge is how to make use of it well.

Warehouse Decoupling

In the past, databases were less user-friendly than they are today. Many operations that we now expect from our DBs were not available. Processing JSON like the Igloo Coolers link mentioned above is a good example of this.

Today it’s commonplace; 5-10 years ago? Not so much. Some of this very well might be tied to the advent of “the cloud” and its growing popularity.

Similarly, our data processing and storage patterns have changed as a result of the migration to “the cloud”. Where once implementing a database and its surrounding systems required numerous Database Administrators (DBAs) and Infrastructure Engineers - the precursors to DevOps - now a few clicks in the AWS console gets things going.

What does getting things going entail? In the past, it meant provisioning actual physical space and hardware, setting up the machines and connecting them to the internal network. Now, it’s just a matter of creating an AWS or Google Cloud account, selecting the type of database you want, adding some security rules and hitting the launch button.

Days, possibly even weeks, of work distilled down to mere minutes.
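
For a flavor of what those “few clicks” look like when scripted, here’s a rough sketch using AWS’s Python SDK (boto3). It assumes you already have an AWS account and credentials configured, and every name and size below is a placeholder, not a recommendation.

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Ask AWS for a small managed Postgres instance - no racks, no cabling, no data center visit.
rds.create_db_instance(
    DBInstanceIdentifier="analytics-db",     # placeholder name
    Engine="postgres",
    DBInstanceClass="db.t3.micro",           # a small, cheap instance size
    AllocatedStorage=20,                     # in GB
    MasterUsername="app_admin",              # placeholder credentials
    MasterUserPassword="change-me-please",
)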

Both types of databases have their associated costs, though. Storage has become increasingly cost-efficient, especially with the file-like storage systems of AWS S3 and GCP’s Cloud Storage - two Dropbox-like file structures in the sky, the only significant operational differences being the scale and the fact that 99% of interactions come from machines. But after you store the data, you need to retrieve it.

A database is essentially just a computer with a lot of storage and a specific set of rules its users must follow. And as the cloud continues to overtake on-premises infrastructure, this is where things really start to evolve.

DBs 1) require storage and 2) need to process the data. We refer to these as “storage” and “compute”. In traditional data systems, those two operations were tightly coupled. The engine used to load new data and process it was the same as the engine used to query your financial metrics or product usage reports. It’s a setup that is ripe for queues of work and frustrated end users.

This is where tools like Snowflake and BigQuery have made their mark. These systems essentially split the way they store data (in those giant file systems in the sky) from how they access data (compute). By separating the two, you supposedly get cost efficiency, and you certainly introduce greater flexibility for the end user. No more worrying about resource contention, at least in theory.

This separation of concerns means processes can dump new data into your warehouse, without impacting the queries you’re running for your quarterly reports or the feature engineering being run for your new ML project.
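
As a rough illustration, here’s what that can look like with BigQuery’s Python client. It assumes a GCP project and credentials are already set up, and the dataset and table names are placeholders. The load job and the analytics query are separate jobs on separate compute, so one doesn’t block the other.

from google.cloud import bigquery

client = bigquery.Client()   # assumes GCP credentials are configured

# Process A: land newly extracted rows in the warehouse (made-up data).
load_job = client.load_table_from_json(
    [{"order_id": 1, "amount": 49.99}],
    "my_project.raw.orders",                 # placeholder dataset/table
)
load_job.result()                            # wait for the load to finish

# Process B: run the quarterly report - unaffected by the load above.
query_job = client.query("SELECT SUM(amount) AS revenue FROM `my_project.raw.orders`")
for row in query_job.result():
    print(row.revenue)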

Is this new setup a panacea? No. In fact, some poor implementation choices can cost you tens of thousands of dollars a month. But it does enable us to handle data in a new way.

ETL vs ELT

There is plenty of information about the Extract-Transform-Load paradigm that data systems have employed until recently. In short, you grab some data from a database or vendor, modify it on the fly, and then dump it into your warehouse in its modified state.

Why isn’t this the standard anymore? For a few reasons, including:

  1. More transformation logic earlier in the process means it’s super difficult to rewind or triage bugs

  2. Your processing logic has to be perfectly in sync with the system sending you the data

  3. It violates the best practice of maintaining separation of concerns and single responsibility

So, what is Extract-Load-Transform? It’s the same process, but redesigned to be cleaner. We swap the order of loading and transforming, something that is made easier by the wide availability of our new cloud warehouses (storage + compute separation). This ELT pattern is better for a few reasons (there’s a small sketch after this list):

  1. We retain the original data before making any modifications to it

  2. Processing logic can be rerun or rebuilt from scratch

  3. It more closely resembles software engineering best patterns and practices

  4. We take full advantage of the separation of storage and compute available in warehouses like Snowflake and BigQuery
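
Here’s a small sketch of the two patterns side by side. The API URL and the load_to_warehouse / run_sql_in_warehouse helpers are made up - stand-ins for whatever extraction and warehouse tooling you actually use.

import requests   # assumes the 'requests' package is installed

def etl(load_to_warehouse):
    """Old pattern: transform on the fly, then load only the modified data."""
    raw = requests.get("https://api.example.com/orders").json()   # placeholder API
    transformed = [{"id": r["id"], "amount_usd": round(r["amount"], 2)} for r in raw]
    load_to_warehouse("orders", transformed)    # the original payload is gone forever

def elt(load_to_warehouse, run_sql_in_warehouse):
    """New pattern: land the raw data untouched, then transform inside the warehouse."""
    raw = requests.get("https://api.example.com/orders").json()
    load_to_warehouse("raw_orders", raw)        # keep the original, always
    run_sql_in_warehouse(
        "CREATE TABLE orders AS SELECT id, ROUND(amount, 2) AS amount_usd FROM raw_orders"
    )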

Dimensional Modeling

In software engineering, it is common to design each element of a system to maintain a single responsibility, to be extensible without requiring modifications to the original implementation, and to rely on abstractions when appropriate, among other principles. Collectively, these are referred to as the SOLID design principles (more on that here). In layman’s terms, that means:

  • The code you write should be responsible for one thing - for instance, fetching bank balances or logging in the user. By connecting a handful of these individual parts, you’re able to support the full application experience (in Venmo’s case: log in, request money from a contact, process the payment, send the payment to your bank)

  • New features shouldn’t replace or modify existing functionality. If Venmo wanted to add a new feature to pay with crypto, it should be a supplement to the existing “process the payment” step. We don’t want to require that all payments on Venmo use crypto

  • Similarly, systems should make it easier to interact with these components by not requiring low-level operations. If Venmo introduces this new crypto payment feature, we don’t want users to have to provide their credentials every time. Instead, the app could let the user log in once and then access their bank or crypto assets securely, again and again.

In data work, we follow a similar set of implementation details, but without the official catchy acronym. Most important are a separation of concerns and the use of abstractions.

Data systems rely heavily on abstractions, in many cases because they are a reflection of their data sources. Dimensional modeling is a structured way to organize datasets in a fashion that

  • Maintains a separation of data

  • Allows flexibility and extensibility through the introduction of new datasets and tables

  • Provides abstractions by doing a bunch of the work for the end-user (aggregations like sums and averages, generating statistical coefficients, distilling complex logic into simple yes/no fields)

So what does this all look like in real life? Here’s a simple example showing a hypothetical set of tables related to some Venmo accounts.

We can mimic Venmo's design using 4 tables

  • Account
  • Balance
  • External Transfer
  • Inbound Transfer

Note: The table format (YAML-esque, sorta) below isn't typical, but I don't want to introduce a bunch of code in this post.

table: Account
---
field:          id  
type:           integer
description:    Unique identifier for the account 

field:          username
type:           text
description:    Plain text name of the account

field:          current_balance
type:           Balance.current_balance
description:    Total cash balance available in-app

table: InboundTransfer
---
field:          id
type:           integer
description:    Unique identifier for this transfer

field:          account_id
type:           integer
description:    Account.id of the account making the transfer

field:          recipient_id
type:           integer
description:    Account.id of the recipient of this Transfer

field:          total_amount
type:           decimal
description:    Total dollar amount of this transfer

field:          is_processed
type:           boolean (true/ false)
description:    A true/ false indicator of whether this payment has been processed

table: ExternalTransfer
---
field:          id
type:           integer
description:    Unique identifier for this transfer

field:          account_id
type:           integer
description:    Account.id of the account making the transfer

field:          recipient_id
type:           integer
description:    Account.id of the recipient of this Transfer

field:          total_amount
type:           decimal
description:    Total dollar amount of this transfer

field:          is_processed
type:           boolean (true/ false)
description:    A true/ false indicator of whether this payment has been processed

table: Balance
---
field:          id
type:           integer
description:    Unique identifier

field:          account_id
type:           integer
description:    Account.id

field:          total_inbound
type:           number
description:    SUM of all processed, inbound transfers

field:          total_outbound
type:           number           
description:    SUM of all processed, outbound transfers

field:          current_balance
type:           number
description:    total_inbound - total_outbound

In this super basic example, we have 4 tables - Account, InboundTransfer, ExternalTransfer and Balance. Each has a set of dimensions (fields) with various types, and the tables relate to each other because they specify relationships such as

Account.id = InboundTransfer.account_id (and likewise for ExternalTransfer)

and

Balance.account_id = Account.id

There is a lot more that would need to exist under the hood but a model like this is dimensional modeling at its core. Each table can be extended to include more information, without breaking the relationship between the current tables.

Additionally, we can make this easier for end-users by adding more fields to these tables - for instance, defining an “active” field so we can aggregate total active accounts, or adding fields to determine if an account is delinquent (outbound transfers > inbound transfers).
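
To make that abstraction point concrete, here’s a rough sketch of how a Balance table like the one above could be derived from the transfer tables using pandas. The rows are invented, and this is just one reasonable interpretation of the model.

import pandas as pd

# Made-up transfer records matching the tables above.
inbound = pd.DataFrame([
    {"account_id": 1, "total_amount": 50.0, "is_processed": True},
    {"account_id": 1, "total_amount": 20.0, "is_processed": False},   # unprocessed -> ignored
])
external = pd.DataFrame([
    {"account_id": 1, "total_amount": 30.0, "is_processed": True},
])

# The abstraction the Balance table provides: do the aggregation for the end-user.
total_inbound = inbound[inbound["is_processed"]].groupby("account_id")["total_amount"].sum()
total_outbound = external[external["is_processed"]].groupby("account_id")["total_amount"].sum()

balance = pd.DataFrame({"total_inbound": total_inbound, "total_outbound": total_outbound}).fillna(0)
balance["current_balance"] = balance["total_inbound"] - balance["total_outbound"]
print(balance)   # account 1: total_inbound 50.0, total_outbound 30.0, current_balance 20.0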

This example is quite simplistic and could easily be refactored (fancy word for revising code design + implementation) to use fewer tables, but that’s for another day.

Database vs Data Warehouse vs Data Lake

This is a bit of a square/rectangle situation. All squares are rectangles, not all rectangles are squares. Like hot dogs and sandwiches, or cereal and soup.

Just kidding. I don’t believe either of those. But the analogy still stands.

All data warehouses are databases, but not all databases are data warehouses. Nor should they be. For the most part, the progression for “modern tech companies” (or anyone aspiring to be one) follows a flow like this:

  1. The Production DB gets used for analytics - some execs or others start getting curious and asking questions. Most commonly, it’s MySQL or Postgres

  2. The queries are too intensive and use too many resources (technology + people). If you have some read replicas, you might be able to delay the next step for a while.

  3. You implement the easiest + quickest warehouse solution - Postgres if you’re really trying to slum it; Redshift if you’re on AWS; BigQuery if you’re on GCP

  4. You hit scale problems. This is very unlikely if you’re on BigQuery.

  5. You migrate to another data warehouse. Things are peachy - you don’t bog down production with your analytics work.

  6. The team wants more complex, obscure or highly specific datasets. Perhaps the machine learning bug bit your Execs. You start implementing a data lake.

  7. Everything you read online treats them as different things, except for tools like Databricks, Firebolt, Snowflake.

  8. You’re confused because deep down you know it can’t be as complicated as people make it out to be.

By the time you reach step 8, you’re probably at a decently mature stage in the company’s evolution.

The distinguishing factor between a warehouse and a database is primarily how it stores and processes data. In traditional databases (MySQL, Postgres, etc), data is stored in a single record - like a row in a spreadsheet. That record might be modified a bunch of times; if you change your profile picture on Twitter, that updates one part of that row of data.

With data warehouses, data is stored in columns, rather than rows. You might even hear them called columnar databases. This isn’t a great format for repeated updates (at least not yet), but it makes it way easier to do aggregation and special lookups.

You know how Excel or Google Sheets will apply a formula to an entire column if you enter it in the first cell, but that same thing doesn’t apply to a row? Same idea here.

The actual under-the-hood details are a bit complex - and some of the details get blurry based on the tool and scale - but the major takeaway is that the way the data is stored matters. And data warehouses make it easier to do stuff like add numbers over a period of time, find the average value for data that matches a set of filters, or do complex math on data in a time-series format.
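
Here’s a toy illustration in Python - the same three records stored row-wise and column-wise. Real engines are far more sophisticated than a couple of lists, but the shape of the data is the key idea.

# Row-oriented: each record lives together - great for "update Jane's photo".
rows = [
    {"id": 1, "username": "jane", "amount": 50.0},
    {"id": 2, "username": "omar", "amount": 20.0},
    {"id": 3, "username": "lee",  "amount": 30.0},
]
rows[0]["username"] = "jane_doe"      # touch one record in place

# Column-oriented: each field lives together - great for "sum every amount".
columns = {
    "id":       [1, 2, 3],
    "username": ["jane", "omar", "lee"],
    "amount":   [50.0, 20.0, 30.0],
}
print(sum(columns["amount"]))         # scan one compact column, skip everything else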

The odd man out is the Data Lake.

Data Lakes - or Datalakes, it doesn’t really matter - are somewhere in between a design pattern and a technical solution. Frankly, Databricks, Firebolt and Snowflake could all likely serve as your team’s “datalake”. This isn’t because they are highly specialized tools that require a PhD or years of engineering know-how to use.

They just make it easy to store a bunch of data in a variety of formats in ways that are relatively easy to access. It all goes back to the idea of decoupling storage and compute.

Datalakes provide 1) a storage mechanism for data and 2) a way to access it. These are storage and compute - literally.

Most commonly, people are referring to the system design of a datalake more than they are referring to the storage and access themselves. What does that mean? It means they want a relatively centralized place with some lightweight structure indicating where they can find the datasets they need or want. Many times this is all facilitated by a system like Snowflake.

So, fear not. Datalakes aren’t always super complex systems. They’re just a place for all kinds of data to be stored without too much operational overhead.
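
A minimal sketch of that “just store it” idea, using boto3 and S3. The bucket name and objects are placeholders (and you’d need real AWS credentials); the point is that structured, semi-structured, and unstructured data can all land in one place, with a little naming convention providing the lightweight structure.

import boto3

s3 = boto3.client("s3")                    # assumes AWS credentials are configured
BUCKET = "my-company-datalake"             # placeholder bucket name

# Structured (CSV), semi-structured (JSON), and an unstructured blob - all in one place.
s3.put_object(Bucket=BUCKET, Key="raw/orders/2021/09/orders.csv", Body=b"order_id,amount\n1,49.99\n")
s3.put_object(Bucket=BUCKET, Key="raw/events/2021/09/app_events.json", Body=b'{"event": "login", "user_id": 7}')
s3.put_object(Bucket=BUCKET, Key="raw/audio/2021/09/support_call.mp3", Body=b"...binary audio bytes...")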

Machine Learning

I’m including this under the Frameworks section primarily because the number of tools that fall into this space is too large to inventory. That, and many machine learning systems pretty much follow the same structure. There are numerous flavors of ML systems - supervised, semi-supervised, unsupervised learning - but there are some common themes throughout. Here’s the super brief primer on how ML works -

  1. You need a decent sample of data that is applicable to your use case. You can’t use Twitter threads to predict whether or not that photo you’re evaluating is of a dog or of a cat.

  2. Depending on the type of machine learning method you’re applying, you likely need your data to be tagged/ labelled. You’ll use this for training your predictive model.

  3. Speaking of data and training, don’t use up all the data you have at your disposal to train your model. Set some aside for testing.

How might this get used in the real world? Many, many different ways. But, we’ll stick with our Venmo example from earlier.

There are millions of transactions happening on Venmo every day. It’s impossible for a person or group of people to monitor those transactions accurately. This is where ML comes in.

A common use case for machine learning is to detect fraud, or more accurately, the likelihood of fraud. Using various data points - perhaps things like location, Venmo account lifespan, number of connections, the type of bank, frequency of linked bank account changes - an ML model can be used to sort the “good” transactions from the fraudulent ones. It’s never a sure thing, but by identifying patterns and similarities, an ML model can evaluate more data points for more transactions on Venmo than any human could possibly dream of.
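
Here’s a heavily simplified sketch of that flow with scikit-learn. The features and labels are fabricated and far too small to be meaningful, but the steps match the list above: labelled data, a train/test split, fitting a model, and scoring a new transaction for the likelihood of fraud.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Fabricated features: [account_age_days, num_connections, bank_changes_last_30d]
X = np.array([[700, 120, 0], [5, 2, 4], [365, 60, 1], [2, 1, 6], [900, 200, 0], [10, 3, 5]])
y = np.array([0, 1, 0, 1, 0, 1])   # labels: 1 = known fraud, 0 = legitimate

# Hold some data back for testing instead of training on everything.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

model = LogisticRegression().fit(X_train, y_train)

# The model outputs a likelihood of fraud, not a certainty.
new_transaction = np.array([[3, 1, 7]])   # brand-new account, lots of bank changes
print(model.predict_proba(new_transaction)[0][1])   # estimated probability of fraud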

Data Governance

Data governance is a fancy sounding term for knowing where specific data lives, knowing who can access that data, and knowing under which conditions that data will change.

This topic (along with probably each section in this post) could be its own individual write up. But let’s keep this pithy.

There are a lot of legal standards and government-imposed frameworks that impact data governance.

GDPR, SOC2, ISO-27001, HIPAA.

The list goes on.

Despite their differences, a few topics/ talking points will take you a long way:

  1. Right to be forgotten: A method for users to be “purged” or removed from a system (often, a software product)

  2. Right to access your data: Do you want to see what data this product has collected about you? This idea lets you access it

  3. Principle of Least Privilege: Does your role require that you have access to secure financials or highly sensitive patient data? No? Well, we’re not going to grant you that access

  4. Data Providers and Processors: Knowing where this dataset comes from (for instance, a vendor) and knowing how it’s used

  5. Defined ownership: Which person/team/ entity is responsible for operations and maintenance of this dataset?

Baked within each of these bullets are potentially dozens of processes and checks-and-balances that have to be orchestrated across the entire data org at a given company. There is a reason this is a huge concern for companies large and small, and a particularly big focus for tech companies in hyper-growth. It’s easy to slip up.
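
To make the first couple of bullets slightly more concrete, here’s a made-up, minimal sketch. Real implementations span many systems and far more process than two small functions, but the ideas are the same.

# Made-up role-to-dataset permissions.
ROLE_PERMISSIONS = {
    "analyst": {"product_usage"},
    "finance": {"product_usage", "financials"},
}

def can_access(role, dataset):
    """Principle of Least Privilege: only grant what the role actually needs."""
    return dataset in ROLE_PERMISSIONS.get(role, set())

def forget_user(records, user_id):
    """Right to be forgotten: purge a user's records from a (toy) dataset."""
    return [r for r in records if r["user_id"] != user_id]

print(can_access("analyst", "financials"))                        # False
print(forget_user([{"user_id": 7, "email": "a@b.com"}], 7))       # []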


Alright, that wraps up Part 1. Stay tuned for more in Part 2.

If there is something interesting in this write up that you want to hear more about, or a topic that I’ve yet to touch on that you find intriguing, let me know.

You know where to find me.
