Scaling ML Algorithms for Enterprise

David discusses how he enjoys switching hats between ML and software, and why he finds Treasure Data’s “extensive ecosystem” so much fun.

With
David Landup

As he’ll tell you himself, David Landup, Staff Machine Learning/Software Engineer at Treasure Data, loves to wear many hats. In addition to his full-time role, he’s a part-time contractor for one of Google’s teams and has recently become a Google Developer Expert for open-source contributions. He’s also developed Yomu, a text analysis and language learning app that he describes as a “minimalist Japanese reader.”

This desire to juggle multiple roles and switch back and forth between specialties was what drew David to Treasure Data. “Treasure Data had an opportunity I was looking for at the time,” he explained. “I’ve worked both as a software engineer and as a machine learning research engineer, and these are very different hats in very different fields. But at the same time, they’re also very similar fields in some regards. I felt that switching between those two hats and helping teams bridge those gaps . . . was really [the best] use of my efforts, if that makes sense.”

So you have the exciting, unknown ML research, where you don’t really know if you can find a solution. You don’t even know if you’ve formulated the problem well. And then you have nice, stable, predictable software development . . . you can actually follow a nice routine there. I found that doing only one of those either gets predictable or too stressful.

Stepping into leadership

Before coming to Japan, David worked as a remote engineer from his hometown in Serbia. He moved to Tokyo in 2023 and was employed as a machine learning research engineer working on combining traditional statistical machine learning models with physical constraints and priors, before accepting his current position with Treasure Data, which was just what he was searching for on multiple levels.

“I wanted to step more into engineering leadership,” said David, “to see if I can train people, define vision, processes, spend more time with my team, and if I can take more research into production. I was looking for a not super-large place, kind of a medium-sized company, that does both research and production at the same time, where I could switch between those two. [That’s all] surprisingly rare, and Treasure Data was one of the places that had that balance I was looking for.”

What Treasure Data does

“The primary focus for Treasure Data is creating a foundational platform for customer data,” said David. “It’s breaking down the silos between different services and databases people and companies use, to unify them into a single foundational piece.

“Then we essentially have a suite of products on top of that foundational data. Whether that’s more on the software side, more on the well-trodden paths, or something which we’re now experimenting with, ML, you can plug services and products on top of that foundation.”

In the current product (AI Signals), Treasure Data doesn’t train and put out its own machine learning models, David clarified. What they offer clients is the infrastructure, platform, and solutions needed to collect their own data and extract its value. “We don’t own the models internally, but instead we create end-to-end machine learning pipelines for training and prediction that are run by customers on their own data.”

If they have a data scientist on the team, then the data scientist can work with the data as is. But for less technical audiences and customers, for example marketers, we also provide a more prebuilt, plug-and-play experience.

The ML team

The ML team at Treasure Data juggles many different aspects of the product. “We do the modeling and the software side and the infrastructure side, all within the team. That was actually one of the first things I did when joining the company, [defining] how this tech stack works together.”

“When I joined,” he explained, “ML was still relatively young in the company. There was an initial attempt at building ML into the platform, and it was more of an exploratory [project] where we wanted to hear how clients used it, how they planned to use it, to give them a tangible taste. . . . I joined right around the time when we stopped collecting the feedback and started planning for the next phase.”

The role that I was filling at the time was to sketch out the interactions between all these systems, and to make proposals on how we could do both the distribution for the models themselves as well as how we would maintain the overall interactions. That was basically the first thing I did when I joined the company.

David’s first big task

Because David started working for Treasure Data in November 2024, he didn’t have much time to get feedback. “We got into the December slowdown where half the people are on PTO. You don’t have many people to review your work. We had a release freeze, as well. . . . I actually used that time to do everything.”

“I made all the architecture design documentation, all the system/process proposals, sent them for a review, and started working on the POC before people came back from the year-end PTO. By the time they were back from PTO, we had a working MVP deployed internally for dogfooding.”

Most of the work was done by David himself, but he wasn’t operating entirely alone. “I also had a chance to work with one of our SREs, actually Tyler, who was interviewed before, to help set up a lot of things regarding deployment and distribution within the company. We also already had the MVP deployed online within the internal environment for us to experiment and play with. And once we were satisfied with how it looked internally, we then slowly started making it more external and then actually opening it to customers.”

Scaling ML algorithms

David initially made it sound like a smooth process, but later stressed that wasn’t the case. “There was a lot of time and a lot of effort required to really dive into both how Treasure Data itself works, because it’s a large ecosystem, and how these problems are addressed in the industry.”

We’ve realized that we’re actually doing something that’s not super common in the industry. We do what we now like to call ‘level two MLOps’: we don’t deliver models as artifacts. We don’t even deliver pipelines to run internally. We deliver a continuous integration of end-to-end pipelines on client account environments, which is something we haven’t really been able to find information about online, because very few companies actually do that.

“It comes with its own set of challenges that we’re still trying to fully overcome,” he said.

One of those challenges has been scaling. “The libraries you use in ML are oftentimes not very scalable. They’re kind of built from a research perspective, and even when they’re labeled production ready, they often have some assumptions that you wouldn’t normally have in most software systems.

“To achieve enterprise scale, we needed to go from taking hours to train and predict a few million profiles to doing billions of profiles multiple times per day. A lot of these libraries tend to break at scale, or they have some architectural assumptions or hard limits that make it very difficult to scale them . . . Sometimes it’s very, very difficult to patch these, because if it’s one of the core tenets of the development of that specific library, you can’t really patch it as it would require rewriting half of it from scratch. You have to find a good way to work around those.”

In the end, we managed to achieve a multi-billion profile scale with ~2h runtimes at a fraction of the cost. During MVP development, for solutions like RFM, we stress tested the new architecture to 10B profiles concurrently over hundreds of runs simulating many customer requests per minute.

When asked for more detail on these manual interventions, David had a ready response. “One example would be a library that shipped its own CUDA kernels but had a hard requirement for CUDA 11, while the hardware we rely on and the images we use ship with CUDA 12. Furthermore, the library encoded a quadratically scaling vector into an int32 representation, which imposed a hard limit of 2^31 indices (about 2.14 billion), but our vector during training could skyrocket to much higher scales, causing an integer overflow. This is done at the CUDA kernel level, so we also had to think about how to recompile the dependency for installation.

“What we ended up doing is shipping containers with CUDA 12, performing dynamic patches on the internals of the library by re-assigning original methods into modified ones, and ‘hacking’ it into thinking it’s actually running on top of CUDA 11. I’m kind of surprised it actually worked, but those patches decreased the time it takes to make predictions by >80%, and decreased cost per profile by 40% for our Next-Best-Product solution.”
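The fix David describes, re-assigning a library’s original methods to modified ones at runtime, is a standard monkey-patching pattern in Python. Here is a minimal sketch under stated assumptions: `PairwiseIndexer` and its method names are hypothetical stand-ins, since the interview doesn’t name the actual library or its CUDA internals, and the 32-bit ceiling shown is the same 2^31 − 1 index limit David mentions.

```python
import array

# Hypothetical stand-in for the third-party library being patched; the
# real library and its method names are not given in the interview.
class PairwiseIndexer:
    """Builds a flat index buffer whose length grows quadratically."""

    def build_indices(self, n):
        # Original behavior: indices stored as C ints ("i", 32-bit on
        # common platforms), so any value past 2**31 - 1
        # (2,147,483,647) would overflow.
        return array.array("i", range(n * n))


def _patched_build_indices(self, n):
    # Replacement method: widen the element type to signed 64-bit ("q")
    # so quadratic growth no longer hits the 32-bit ceiling.
    return array.array("q", range(n * n))


# The dynamic patch: re-assign the method on the class before any
# instances are created. Every existing call site, including code deep
# inside the library, now runs the widened version.
PairwiseIndexer.build_indices = _patched_build_indices

indexer = PairwiseIndexer()
indices = indexer.build_indices(1_000)  # 1,000,000 indices
print(indices.itemsize)  # 8 bytes per index after the patch
```

Patching the class (rather than one instance) is what makes this work transparently: the library’s own internal calls pick up the replacement without any of its source being rewritten.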

Bringing down costs

While overcoming these challenges, David still managed to find a way to reduce the costs and complexity of the new product. “Between the time of our first product and the new product we were trying to build, we started off by reusing the infrastructure from the previous product. We’ve had some overhead there, primarily from changed requirements.”

Over time, they’d gained a better understanding of their clients’ requirements. “So we were really looking for a vastly different way to build things at that time, even though in the transition period, we already started building the new product on top of the old one. One of the things I was doing in the beginning was defining how we’ll be building the next generation, but [I was] also migrating the old solutions onto both the new infrastructure and into the new engineering paradigm we defined for our new architecture. So there was a lot of planning there as well. A lot of back and forth.”

But that then eventually grew to be a pretty general compute platform that we can use for a lot of different things. It’s fully homegrown in the sense that we’ve built everything from scratch, and it allows us [the flexibility] to have more products on the line as new requirements come in.

Though it obviously paid dividends, it’s remarkable that Treasure Data permitted David so much freedom so early on—at least, that’s what David himself thinks. “There was a lot of trust and respect placed [in me] at that time and I was very grateful for that. Admittedly, I don’t know why they put so much trust [in me]. I think I would be way more skeptical of somebody that just came in . . . but I’m definitely very grateful for that.”

David’s Treasure Data map

Perhaps Treasure Data’s well-organized onboarding process helped smooth David’s way. “In the first two weeks, essentially, it’s mostly getting to know people,” he said. “We had a lot of one-on-ones at the time. . . . [My manager] introduced me to a lot of people, and in most of those discussions you clarify who the person is, where they work, and what sort of things we come to them for.”

David really appreciates Treasure Data’s communication style. “We have a culture of people just asking questions. Channels are very open. . . . Seeing a lot of other people asking questions and getting answers was definitely encouraging for me as well.”

Treasure Data has a strong culture of service owners. Each team owns something. . . . It was really in the beginning kind of like making this internal map of the organization, [and] getting to know a couple of contact points from each of those teams, which made it easier to reach out to them.

Developing that “internal map” has been one of David’s favorite experiences at Treasure Data. “[The company] has a lot of working and moving parts,” he said. “Getting to work in a pretty extensive ecosystem I think is always fun. You get to learn new things, you get to interact with a lot of people. You get to ask questions and dive into new code bases.”

We have a lot of very experienced people, and we have a highly concentrated talent pool. Some of those people have been coding for basically as long as I’ve been alive, so getting to talk to them and exchange experiences has been a really fun, illuminating, and gratifying experience.

Shaping AI policy

As Technical Advisor to the company’s internal DevAI Unit, David is helping shape the company’s policy on AI use, particularly by exploring the intersection between AI productivity tools and engineering. “We’re mostly all aware, I think, of the new wave of the so-called ‘vibe coding’ and the effects of Claude Code and other coding tools. . . . There have been some very mixed stories on how these have panned out in different organizations.”

We decided to try to be one step ahead there, and set up a unit that would essentially keep track of [AI use] and impact, create guidelines, and build a small group of people that have enough experience in this new emerging field to help guide us in the right direction . . . to make sure that the AI adoption within the company doesn’t go wrong. There are many ways it could go wrong.

David describes himself as a skeptic of vibe coding. “Primarily because I work with these models on a slightly lower level . . . from that bottom-up perspective, I’ve generally been very skeptical of LLMs. I do see a lot of use cases for them, and I think there are definitely ways you can use them tastefully. But I think there are ways where you just don’t want to use them at all.”

“There are obviously the overarching looming ethical questions of how these models are trained,” he added, “how they’re evaluated, the obvious competing interests of the benchmarks and funding for these, which does give it quite a bad taste, and which is one of the reasons I was initially very hesitant to adopt them.

“But I think we have been able to find tasteful ways to augment an engineer’s capabilities and work on new fields through the use of AI, within a measurable framework which allows us to understand AI’s impact, review promises and perils, and shape the policy going forward, without forcing or propagating unreasonable expectations.”

Who Treasure Data will hire next

Treasure Data is hiring now, and being a fairly recent hire himself, David has a good grasp of what sort of applicants the company is looking for. “The [developers] that we’re generally expecting to join are big data engineers and product engineers, because we do a lot of product engineering here, and especially [those] with . . . experience in working with Japanese clients.”

I think the atmosphere is pretty good. We’re primarily engineer-oriented within the Research and Development department that I’m in. We have a lot of flexibility to work on things. There’s a lot of respect in the air, especially given the talent concentration that we have.

“There was absolutely no specific reason for anyone to trust me coming in,” said David, “especially as a lot of the people here are way more experienced than I am. But I’ve been given a lot of trust and respect from basically day one. . . . It’s simply the culture that we have here.”

Open Jobs at Treasure Data