Happy Internal Customers through Diffgram’s Development System

Anthony Chaudhary
14 min read, Oct 12, 2022

An article for CTOs, CIOs, VPs of Engineering, VPs of R&D, Directors of MLOps, or similar, about Training Data (also called Data Labeling, Data Annotation, or Data Ops).

So, the concept of Training Data has come to your attention. At this point, you likely already have a general impression of what Training Data is. And perhaps your teams are already evaluating tooling.

  • Do you get a feeling the choice is more important than it first seems?
  • Why is there so much attention around this area?
  • Is it more than just another tool?

Like a city, Training Data has many complexities that take time to unwrap.

Pictured, New York, Sept 2022

Diffgram Development System

The story to surface here is that Diffgram is not just a routine tool purchase. Diffgram is a Development System you use to create and maintain your own platform.

The Development System is a way to use, combine, manage, order, and maintain Training Data for your ML Programs. From tried and true methods to the latest Training Data innovations. Not a single one off implementation. The Diffgram Development System is a way to keep your internal customers happy, today and tomorrow.

Why should you listen to me? After 4 years of work, over a hundred installations, roughly 500,000 lines of non-library code, and being the most popular fully open source technology in this space, we know a thing or two about this area. You can read about some of the existing features here.

In this article I present the case for why you need a way to develop your platform, then explain how the Diffgram Development System works, and finally tie some of the benefits back to your Internal Stakeholders.

Why you need a way to develop it

It’s tempting to think of Training Data as just something to leave up to the data science team, but the reality is this is a choice that affects your entire company.

As a growing area, there are unknown unknowns. You need a way to be able to control and adapt as needed. A way to be proactive for upcoming needs, and react under your own control to problems.

Why does Training Data not feel like just another IT tool?

Consider that most of the following is probably true for your business:

  • Training Data touches your most sensitive raw data
  • Training Data is a key part of your R&D, or even core product, efforts
  • Training Data touches nearly everyone in your IT department, from engineering through infra and data science.
  • Training Data has changing business requirements and a growing volume of users across multiple business units.
  • Training Data, as a growth area, has an increased level of executive awareness

A lot of data, change, unknowns, people, and attention.

That nagging sense of something being “off”

As you have been exploring options you have likely seen slick websites, videos, and sales presentations. Yet you may have a nagging feeling of “if it’s really so good, why isn’t it flying off the shelves?”. You may also be wondering why “all-in-one” solutions for training, modeling, etc. aren’t taking off either.

Something doesn’t add up…

Well, it would be like asking your team to use Paint instead of Photoshop. The problem space is simply too complex for a single pre-defined approach.

A few moments glancing through the community help sections of the vendors shows that many features are far from ready for prime time. But it’s more than any specific set of marketing claims or features. Let’s zoom out from any product-specific thing and think a little bigger.

The reality is that most tools cater to only a handful of your Internal Customers. Most of the focus is usually on Data Science. However, Training Data is an area that touches so much more. Let’s expand on the areas mentioned before and map those to an Internal Customer group.

Internal Customer to Area of Training Data Map

This table maps an Internal customer group and leader to the areas training data affects.

Map of Internal customer group to the areas training data affects.

Just a glance at this table makes it clear that this is an important purchase: it’s not only a data science tool, but a system for the whole company.

Your marketing tool may access a database or two, but it’s not looking at your most sensitive documents. Your technical CI/CD suite is not being used by line of business users or subject matter experts outside of IT.

Many People, a Major Choice

Training Data is software that touches:

  • tens of people in an administrative way
  • tens to hundreds of end users today
  • In the future, a noticeable percent of your company’s activity will in some way involve ML programs, which will in some way involve training data.

Training Data competency effects cascade through your company

The only other types of software that touch so many people are things like email, which, by the way, people can still have problems with. So just imagine the volume of challenges in this complex, new, rapidly developing area.

Probably no other new system in your company in the next 10 years will have as much effect as your choice of Training Data software.

The Software is Not There Yet — Out of the Box

What most everyone in the Training Data industry knows, but rarely says, is that there is no such thing as a complete, out of the box, solution yet.

However, there are solutions like Diffgram that offer a wide range of existing features, and a path for you to develop your own.

An analogy here is that the area is at least as complex as Adobe + Salesforce. And we are still years, if not a decade, away from that level of product maturity.

No vendor, including Diffgram, will work exactly as you need, across every division of the business, out of the box, to the degree that you can sustain heavy R&D or revenue on it. Diffgram is the best starting point, and here I want to show you why the Development System is the right strategy to get the final pieces in place.

Training Data software is still being Developed.

As a small analogy, after 30+ years of development, Postgres is widely considered one of the best database options. But it doesn’t develop itself. You still have to develop on top of it.

Let’s be very specific here: the bar is sustaining heavy R&D or revenue. Yes, many things will work. Yes, your Proof of Concept may go far. But are you comfortable running multi-million dollar or even billion dollar workloads on it?

The development cycles for ML Programs are long, so this is not something you can swap to a different system at the last minute. (Not to mention, as noted above, that such a system doesn’t exist.) So you need to have a plan starting today, a strategy, for handling this development at your company.

Unknown Unknowns

As a growing area, the requirements are expected to change over time. How can you be in a strong position to adapt to those requirements? Just trust the vendor? You have been around the block, so I’ll let that one speak for itself…

There are unknown requirements that won’t be clear until many months and years from now. In many cases there is knowledge built up around how a system works and switching after the fact can be harder.

The solution is the Diffgram Development System. Something that your team has a degree of direct control over, something that goes beyond the moment in time configuration choices.

This area is far too complex to develop your own from scratch. It would take years of risk and millions just to get to the baseline of what Diffgram is out of the box.

So what do you do? I think the best strategy is a Development System, which I will explain in more detail below.

Summary: the need for a Development System

So, to summarize:

  1. Training Data is more than a routine IT purchase
  2. Training Data affects the majority of your internal customers
  3. Training Data software is not there yet, so a Development System is needed.

Ok, so let’s say you are convinced that there is a need for a Development System here, what does that mean in practice?

Development System

Here’s how it works.

Your team uses the Diffgram Development System to fill in the gaps needed to support heavy R&D or revenue-bearing workloads. We supply you with a best-in-class baseline platform, plus the sub components to develop it into your own Training Data platform.

The illustration uses scaffolding as a metaphor for AI’s quest in unearthing the underlying logic and structure of complex organic matter. Artist: Khyati Trehan.

This further enables you to invest in the technology space to create your own unique IP separate from your competitors.

Adopting the Development System is a strategic direction. There are immediate benefits, with many more benefits accruing over time.

ML programs are a combination of many things. This includes UIs, data concepts, ML concepts, integrations, and more. The Diffgram Development System is a way to repeatedly develop and maintain these ML programs.

First I’ll talk about how it works a bit, and then how those concepts provide benefits for Internal Customers.

Blueprint

The Development System has these parts:

  1. Baseline Platform, with End User level customization
  2. Develop Within Diffgram Frameworks
  3. Develop Novel Software using Diffgram Components
  4. Develop Novel Software on top of Diffgram
  5. Prioritized feature requests (Diffgram Developing)

Each option has trade-offs depending on the need, going from end user realm through custom engineering led by your team.

Baseline Platform

Diffgram offers a standard set of features similar to others. Diffgram has one of the most customizable and configurable End User accessible platforms. What is described in this article is “extra”. I won’t spend any more time on the Baseline Platform or End User Configurations in this article, those are both well described elsewhere.

Diffgram Frameworks

In theory everything is an integration. In reality, as you know, the degree of pre-existing work matters greatly.

Diffgram divides this problem into manageable frameworks for classes of use cases.

  1. Scripts for Frontend
  2. Workflows for Backend

Workflow example apps to install

As examples of using those frameworks, there is first-class support for:

  1. Installing open source ML Programs (e.g. Hugging Face, Deepchecks)
  2. Big tech cloud providers (e.g. Google Vertex AI)
  3. Your own in-house methods (e.g. interactive engagement by calling an API from a script).

Because Diffgram focuses on the “Install”, going beyond just integration and standardizing the connection and data processes, we are able to support all of these methods.

Back to the Photoshop example: yes, there is a one-time install process, and yes, Photoshop needs its own updates. But then you get what you want: Photoshop, not Paint.

With Diffgram you get all the ML programs you want, “installable” on your own Diffgram environment. These installs go beyond what is reasonable with a normal API/SDK integration. For example, they include a series of standards, checks, controls, UI components, primitives, real-time options, etc.

Develop Novel IP using Diffgram components

The Diffgram Frameworks provide standard ways to install important sub processes. What happens when you want to start working with the UI, or other parts, and developing directly in novel ways?

We are rolling out new public Storybooks for our UI components.

This means concepts like dataset selectors, annotation screens, and more are visible and modifiable in Storybook. I’m not saying Storybook is rocket science. But we are the only one in our space to offer it.

Storybook, Example of a Dataset Selector with Mock Request.

This is a small example of what we mean by develop on top. We are not just providing configuration, but true development, true engineering capabilities to your team.

PS: If you aren’t familiar with Storybook, it allows you to take individual components out of their context and explain them. Sort of like this scene from the Matrix:

Construct Scene from the Matrix

Continuing the Postgres analogy, we aren’t asking you to contribute to Postgres itself, or to invent materialized views etc. Rather, we are providing those components to your team to build with as you see fit.

This is beyond “configuration”, as it’s still free form programming, but is within the limits of existing components.

Directly Develop Novel Software from Diffgram

Optionally, we offer the most flexible option possible, which is engineering directly on top of Diffgram source code.

Because Diffgram is open source, your team can learn from the codebase and make actual contributions directly to it (or maintain your own private repo, inheriting updates from the primary).

It’s worth considering here that Diffgram is the only option that reasonably allows you to do this.

This means that your existing staff can be re-allocated to work on this, and you don’t have to pay a vendor for custom consulting.

By Developing, to be very clear here, I don’t just mean the normal API/SDK integrations or in-application configurations. I mean actually developing the core software itself to suit your needs.

This is just an option, and it provides a clear outlet to address needs that can’t be addressed by the easier options. It is especially relevant for the unknowns mentioned earlier, so you always have an option within your control.

Vendor Developing

We have been continuing to refine our process for working directly with your team on Prioritized Feature requests. Please note this is not blanket consulting. Rather, we take your needs and abstract them into more general forms. This allows our development to be grounded in business need, and lowers the development cost for you. I won’t belabor that process here, but it is much more budget friendly than regular consulting add-on costs.

Happy Internal Customers

So how does this all come together to keep your internal customers happy? What are the benefits?

Happy CFO

Investing in your own Diffgram platform keeps the CFO happy in a few ways:

  1. Keeps custom development costs under your control
  2. Keeps predicted future costs more under your control

Development Costs

You have more control over development costs, and can divide and conquer sub costs.

For example, maybe a project needs a more expensive modeling approach one day but not the next. That choice can be made independently, so you aren’t locked into figuring it all out upfront.

Predicted Future costs

Pretend for a moment that you sign up with someone that offers $x per y units of something. Maybe the price seems reasonable today, but what happens when another business unit joins on?

What happens if a new ML program comes out that means you want to use 10x or 100x more units? In such a new area there are many unknowns for capacity. Yet many vendors still push for these types of unit-based prices. Diffgram offers Unlimited usage. We can do this in part because the software runs on your own hardware, and because we have a different cost structure.
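To make the scaling concern concrete, here is a minimal sketch of the arithmetic. Every number in it is a made-up assumption for illustration (the baseline volume, the per-unit price, and the flat fee are not real vendor or Diffgram prices), and it leaves out the cost of your own hardware, which still grows with usage.

```python
# Hypothetical figures for illustration only; not real vendor or Diffgram pricing.
BASELINE_UNITS = 1_000_000        # assumed units used per year today
PRICE_CENTS_PER_UNIT = 5          # assumed per-unit price, in cents
FLAT_LICENSE_DOLLARS = 200_000    # assumed flat yearly fee with unlimited usage


def unit_priced_cost(units: int, price_cents_per_unit: int) -> int:
    """Return the yearly cost in whole dollars under per-unit pricing."""
    return units * price_cents_per_unit // 100


def compare(multipliers=(1, 10, 100)):
    """Return (multiplier, per-unit cost, flat cost) for each usage multiplier."""
    rows = []
    for m in multipliers:
        units = BASELINE_UNITS * m
        rows.append((m, unit_priced_cost(units, PRICE_CENTS_PER_UNIT), FLAT_LICENSE_DOLLARS))
    return rows


if __name__ == "__main__":
    for m, unit_cost, flat in compare():
        print(f"{m:>4}x usage: per-unit ${unit_cost:,} vs flat ${flat:,}")
```

Under these assumed numbers the per-unit bill is lower at today's volume, but it grows in step with usage while the flat fee stays put. The point is the shape of the curve, not the specific figures.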

Happy CISO

Security has many layers. Trusting a SaaS solution just doesn’t make sense for this type of sensitive data and its company-reaching effects.

Read our detailed CISO Brief for more. A super short summary is that the security team can develop and configure custom controls, and lean more on their existing controls, essentially creating a DevSecMLOps process.

Happy CEO

  1. Better Control
  2. Better Aligned Economics

Better Control

With Diffgram, your team can invest in your own platform.

  1. Your own existing staff can be engaged for better control
  2. You have a great degree of control, meaning you can trust high value workloads to it
  3. The future usage is less directly dependent on a 3rd party vendor

Of course we will remain the stewards of Diffgram’s overall direction, and continue to develop it, but that’s a lot different from being dependent on a specific SaaS vendor.

Better Aligned Economics

There is a growing consolidation in this space. The effect is that

  1. Some vendors are being acquired by annotation firms (two in the past few years).
  2. Some vendors will likely be acquired by defense contractors

Acquisitions usually reduce, or even kill, innovation and increase costs.

In the context of this article, I have been talking about the need for custom development.

In the annotation firm case, the desire is to sell outsourced labor, meaning a technology customization that helps your company, but doesn’t help sell more outsourced labor, is unlikely to get much focus.

As you know, commercial interests and defense interests rarely align. Cases where they do are rare enough to have a name: ‘Dual Use’ tech. Do you want your Development System tied to companies (ScaleAI, Labelbox, etc) likely to be acquired by a General Dynamics or Raytheon?

With Diffgram we intend to keep our technology direction for the foreseeable future.

Of course things change, but we are stating clearly we see the technology direction itself as an important thing to protect.

Being able to configure so much yourself gives you a level of control that doesn’t exist in other systems.

Happy CTO

Own Installation by Default

Diffgram is based around you running your own installation. This means we have first class support for it, and that the community continues to test and discover ways to improve the install and update processes.

Happy Data Scientists and SMEs

Of course we cannot forget some of the most central users of the system: the data science teams and subject matter experts. Many innovations are directly for these groups. Most of this article has focused on the other groups that aren’t normally discussed as much.

Happy Engineering and Development

At Diffgram we take pride in continually refactoring and improving the codebases and documentation. The Storybook example is one small sample of that. Our aim is that Diffgram is a pleasure to install, maintain, and develop on. More generally, this creates a deeper level of engagement, and potential opportunities for using Training Data, that otherwise would not be as obvious.

Happy Internal Customers

As most of this article has argued, investing in the Diffgram Development System is a way to plan for happiness for all of your stakeholders.

Summary

The Diffgram Development System has emerged out of years of work. It’s a bow tie, wrapping many concepts that solve individual business unit problems. We plan to expand the Diffgram Development System over time.

  • Diffgram is a Development System you use to create and maintain your own platform.
  • Training Data affects the majority of your internal customers. The Software is not there yet out of the box by any vendor and some of the needs are still unknown. Meaning that a Development System, a strategy, is needed.
  • The Development System gives your team a higher level of Development control, and other benefits, than customization or API integration alone.
  • There are many tangible and obvious benefits to your Internal Customers: managing development costs and future costs for the CFO, security customization and re-use of existing controls for the CISO, unlocking new product cases for Engineering and R&D, and overall providing better control and direction to the CEO’s office.

Getting Started Adopting the Diffgram Development System

Here’s the best news. Getting started is as easy as choosing to adopt the mindset of the Diffgram Development System. The technical implementations and team training etc. can be adopted incrementally over time as needed.

When ready, your team can take the first steps of getting a baseline installation up with Open Source Diffgram.

We encourage you to contact sales to set up a discovery call for a guided experience.

Thanks for reading!
