Opportunities in Data Versioning

Published in

rippleventures

7 min read · Jun 13, 2022

⚡️ Introduction

Ripple Ventures is an early-stage venture fund focused on enterprise software, developer tools, and web3 infrastructure & tooling. Within each vertical focus, our team conducts in-depth research on areas of interest to find the best companies to partner with. This post is part of a new blog series where our team shares internal research on categories that we’re actively investing in.

If you are a founder or investor building in any relevant categories and want to share notes and discuss, please reach out to our team here: https://airtable.com/shreRM7YdROtmEi5r.

You can follow us here: https://linktr.ee/rippleventures.

💡 Data Versioning

Data tooling has been a major focus for companies building software over the past several years. As companies grow, they run into challenges maintaining and scaling their databases because their existing infrastructure cannot support the increased volume. We believe the successful tools in this space will be targeted point solutions that can scale with companies, rather than generalized software that tries to cover a variety of use cases.

One of the first areas where we see an opportunity is data versioning systems for machine learning applications. We’re excited about this space because databases typically sit at the top of the technical hierarchy, next to CI/CD (continuous integration and deployment) and MLOps (machine learning operations). Most data versioning tools and systems focus on the top layer (MLOps), where users interact with the data to build and deploy ML models (or other applications), but there is a lot of value in moving further down the stack and getting closer to the data stores themselves.

🔭 Point of View

Why We’re Excited:

Data versioning will become an embedded component of data storage systems, logging details about how data is being used and how it is being changed. This creates more use cases for versioned data, seamless integrations with core database technologies, and easier, cheaper storage that works with versioned datasets.
Applications will increasingly be written to scale around the data they are built upon, rather than being built first and having data fitted into the schema the application dictates. As ML algorithms approach peak performance, the differences between them become negligible compared with differences in the underlying data, and the volume and complexity of data reach a threshold at which the data has to be the core of the model.
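To make the idea of a storage layer that logs how data changes more concrete, here is a minimal sketch in Python. It is purely illustrative: the `VersionedStore` class, its methods, and the example keys are hypothetical names invented for this post, not the API of any real database or versioning tool.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class VersionedStore:
    """Toy key-value store that records every write in an append-only log."""

    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def put(self, key, value, author):
        """Write a value and log who changed what, returning a version id."""
        self.data[key] = value
        # Hash the full state so each version gets a stable identifier.
        snapshot = json.dumps(self.data, sort_keys=True).encode()
        version = hashlib.sha256(snapshot).hexdigest()[:12]
        self.log.append(
            {"version": version, "key": key, "value": value, "author": author}
        )
        return version

    def checkout(self, version):
        """Rebuild the state as of `version` by replaying the log."""
        state = {}
        for entry in self.log:
            state[entry["key"]] = entry["value"]
            if entry["version"] == version:
                return state
        raise KeyError(version)


store = VersionedStore()
v1 = store.put("label:img_001", "cat", author="alice")
v2 = store.put("label:img_001", "dog", author="bob")
```

Because every write lands in the log, `store.checkout(v1)` can rebuild the earlier state and `store.log` shows who changed each value. A production system would need far more (concurrency, compaction, access control), which is part of why we think this belongs close to the database itself.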

Where We’re Cautious:

Data versioning may simply become a commodity feature of other end-user applications. It may become a prerequisite, built-in step of many data systems, with minimal added value as a standalone offering. MLOps tools could win a larger market share by building versioning tools that connect the other way, down into databases. If versioning is implemented inefficiently, data storage costs could rise to the point where versioning data systems meaningfully becomes infeasible.
Data versioning could become unimportant if most ML-powered applications are derived from a handful of large, pre-trained, generalized models that are applied effectively across several different domains. There have been significant technological advancements and cost advantages from GPT-3-like ML models. Further, the shortage of data scientists and the limited value of building completely custom models for each individual use case reinforce this trend.

🧠 Our Market Observations

An overview of how the different segments interact with each other:

MLOps: This is the services-based approach to managing machine learning development systems. It can (but does not have to) include data versioning as a step within the process, but it does require integrations with the database systems and the CI/CD systems in real production settings. The MLOps market is incredibly saturated, and there is little differentiation between players in the space beyond methodology, user experience, and relatively minor technical differences.
Databases: These are the central data repositories and storage mechanisms for data used in ML applications as well as other end-use cases. They are the backbone where data largely lives, separate from core application logic and code. Data versioning can be integrated effectively here, right alongside the stored data itself. Database tooling integrations are a very attractive investment opportunity. Databases have seen innovations in recent years that have helped them specialize in certain types of data storage. Data tooling is also rising in popularity and will continue to grow as individuals struggle to wrangle the vast amounts of data they have to deal with.
CI/CD: Continuous integration and delivery tooling, along with core code base versioning, is the primary mechanism by which organizations create standardization and scale for their development efforts. Implementing a data versioning application in this layer can be challenging because it involves creating and executing code in a pipeline fashion, and these systems are typically designed for relatively minimal code execution. However, this layer is well placed to connect directly to the databases and the end operations tooling before anything is deployed to the final output systems. CI/CD integrations for data versioning are also a very attractive market opportunity. They can bake seamlessly into the versioning system that teams already use for code. This makes it much easier to embed data versioning into the code updates individuals make when they deploy to production systems.
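As a rough illustration of how data versioning could ride along with code versioning in a CI/CD pipeline, the sketch below records a hash of each data file in a small manifest that would be committed with the code, then checks whether the data has drifted from it. The function names, the manifest format, and the file names are assumptions made up for this example, not the behavior of any specific tool.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def write_manifest(data_files, manifest_path):
    """Record each data file's digest; the manifest is committed with the code."""
    manifest = {str(p): file_digest(p) for p in data_files}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


def changed_files(manifest_path):
    """Return the files whose current contents no longer match the manifest."""
    manifest = json.loads(Path(manifest_path).read_text())
    return [
        p for p, digest in manifest.items()
        if not Path(p).exists() or file_digest(p) != digest
    ]


# Demo in a throwaway directory (file names are made up for illustration).
workdir = Path(tempfile.mkdtemp())
data_file = workdir / "train.csv"
manifest_file = workdir / "data.manifest.json"

data_file.write_text("id,label\n1,cat\n")
write_manifest([data_file], manifest_file)
before = changed_files(manifest_file)  # data matches the manifest

data_file.write_text("id,label\n1,dog\n")
after = changed_files(manifest_file)   # data edited, manifest not updated
```

A pipeline step could fail the build whenever `changed_files` returns a non-empty list, so that a data change always ships with a matching manifest update in the same commit. The check itself is cheap to run, which suits pipelines designed for minimal code execution.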

🚀 Industry Drivers

ML Adoption: The pace at which companies are adopting ML has skyrocketed over the last 5–10 years. More than 80% of Fortune 500 companies believe that ML will be a critical component of their business in the next 3–5 years. These trends have only accelerated with the COVID-19 pandemic. ML as a core technology has proven to be only as valuable as the underlying data used to train the models. The mantra of “garbage in, garbage out” holds true when creating ML systems. One critical area that many struggle with when working in ML is the data processing side of projects. Data scientists frequently cite data quality and the difficulty of working with data as two key challenges they face.
Amount of Data: 2.5 quintillion bytes of data are produced every day. Further, historical data is not being destroyed, which means that the amount of data available to developers is astoundingly large and growing by the minute. This data volume challenge has a direct impact on many applications created for web2 and for ML use cases. It is estimated that a highly complex model developed by OpenAI required more than $10M in processing power due to the amount of data used to train it. However, this massive volume of data is a gold mine of opportunity for businesses and will prove critical to the success of ML in these organizations.
Unstructured Data: Image, text, and audio data sets are increasingly becoming a standard set of technical resources for many businesses. As more complex ML models are developed, these unstructured data sets become critical feeds into ML applications and have a huge impact on the performance of these organizations. Unstructured data simply can't fit into many existing data storage applications and requires a new approach to structure and storage.

🗺️ Market Map

MLOps: The market is incredibly saturated. There is little differentiation between players in the space beyond methodology, user experience, and relatively minor technical differences.
Databases: Tooling that integrates with core databases is a very attractive investment opportunity. Databases have seen new innovations in recent years that have helped them specialize in certain types of data storage. Data tooling is also seeing a rise in popularity and will continue to grow as individuals struggle with wrangling the vast amounts of data they have to deal with.
CI/CD: Data versioning integrations at this layer are also a very attractive market opportunity. CI/CD integrations can bake seamlessly into the versioning system that teams already use for code. This makes it much easier to embed data versioning into the code updates individuals make when they deploy to production systems.

🌐 Market Dynamics

Companies are spending large portions of their IT and services budgets on cloud infrastructure today
Budgets for cloud services and data storage are already large, so versioning is only a minor additional cost with large benefits from workflow automation
Users are looking for ways to integrate data versioning into existing workflows, especially for ML
The most common way to version is simply to duplicate the entire database, but this carries high costs
Data tools are typically quick to ramp up, and new solutions are quick to develop
Many new tools have been developed in under 10 years and have established themselves as huge companies today
Data versioning has been a problem for quite a while, but technologies are only now trying to tackle it because it has become infeasible to sync and back up the massive datasets needed for ML applications
We are beginning to see stronger adoption in the market as open-source tools like DVC catch on among many users
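To illustrate why duplicating an entire dataset for every version is costly, the sketch below compares full copies with a toy content-addressed store that keeps each unique chunk only once. The `ChunkStore` class, the tiny 4-byte chunk size, and the sample payloads are all hypothetical and chosen for readability; this is not a description of how DVC or any other tool stores data.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}    # digest -> chunk bytes
        self.versions = {}  # version name -> ordered list of digests

    def commit(self, name, payload):
        """Split the payload into fixed-size chunks and store the new ones."""
        digests = []
        for i in range(0, len(payload), self.chunk_size):
            chunk = payload[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.versions[name] = digests

    def read(self, name):
        """Reassemble a version from its chunk digests."""
        return b"".join(self.chunks[d] for d in self.versions[name])

    def stored_bytes(self):
        """Total bytes actually held across all versions."""
        return sum(len(c) for c in self.chunks.values())


store = ChunkStore(chunk_size=4)
v1 = b"aaaabbbbccccdddd"
v2 = b"aaaabbbbXXXXdddd"  # one 4-byte chunk differs from v1
store.commit("v1", v1)
store.commit("v2", v2)

full_copy_bytes = len(v1) + len(v2)  # cost of duplicating the whole dataset
dedup_bytes = store.stored_bytes()   # cost when unchanged chunks are shared
```

In this toy case two full copies take 32 bytes while the shared-chunk store holds 20, and both versions can still be read back in full. Fixed-size chunking is the simplest possible choice and handles insertions poorly, so treat this only as a picture of the cost argument.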

🔗 Connect With Us

If you are a founder or investor building in any relevant categories and want to share notes and discuss, please reach out to our team here: https://airtable.com/shreRM7YdROtmEi5r.

You can follow us here: https://linktr.ee/rippleventures.


Written by Ripple Ventures
