Funnily enough, I joined Scale two years ago, but I only just wrote this article about why I joined this startup. I’m reposting it from my company’s blog because I think it captures some of my strongest feelings about engineering and building in general.
Aerin is an engineering manager at Scale AI, leading a Catalog ML team. Below are Aerin’s stories about what she does and why she loves working at Scale.
What brought you to Scale?
Before I joined Scale, I worked at Microsoft on a project called NL2SQL, which stands for “Natural Language to Structured Query Language.” NL2SQL is a translation problem: automatically converting a user’s natural-language query into SQL.
Just a few months after my team had started working on it, we were beating the SOTA (state-of-the-art) performance on research benchmarks. We achieved this by evaluating different ML model architectures and improving them. In other words, we somewhat solved the NL2SQL problem in research after only a few months of work.
However, in production, we weren’t breaking the state of the art. The model still made a few mistakes on users’ input. It became clear to me that changing the model architecture would only get us so far, and that the most crucial thing we could do to improve NL2SQL’s performance was to improve the training data. One way to do that was to generate edge cases synthetically and thereby increase the training data’s coverage.
So, we started synthetically generating training data (NL-SQL pairs) by following syntax requirements, the snowflake schema, etc. However, the data that we had artificially created couldn’t possibly cover every scenario or edge case, because the true edge cases are the ones you can’t predict or plan for, like the questions that people actually typed into our search query bar. Another way to increase the coverage of the training data was to label the user inputs and feed them back into the training data. Taking care of that lifecycle, also called an auto-labeling pipeline, was quite an operation. Even though I was given a lot of resources to manage the auto-labeling pipeline, it required so much overhead that I couldn’t take on any other engineering work.
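To make the idea of synthetic NL-SQL pairs concrete, here is a minimal Python sketch of template-based generation. It is only an illustration, not the pipeline we actually built: the toy schema, the templates, and the `generate_pairs` function are all hypothetical.

```python
from itertools import product

# Hypothetical toy schema (table name -> numeric columns); not from the real project.
SCHEMA = {
    "orders": ["amount", "quantity"],
    "customers": ["age", "balance"],
}

# Each template pairs a natural-language pattern with the SQL it should map to.
TEMPLATES = [
    ("what is the average {col} in {table}", "SELECT AVG({col}) FROM {table}"),
    ("show the highest {col} in {table}", "SELECT MAX({col}) FROM {table}"),
]


def generate_pairs(schema, templates):
    """Return (question, sql) pairs for every template/table/column combination."""
    pairs = []
    for (nl, sql), (table, cols) in product(templates, schema.items()):
        for col in cols:
            # Fill the same table and column into both sides of the template.
            pairs.append(
                (nl.format(col=col, table=table), sql.format(col=col, table=table))
            )
    return pairs


pairs = generate_pairs(SCHEMA, TEMPLATES)
for question, sql in pairs[:2]:
    print(question, "->", sql)
```

A generator like this can only produce the phrasings its templates anticipate, which is exactly the coverage limit described above: real user questions fall outside any fixed set of patterns.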
So I started looking for the best third-party training data provider in the market, and that’s how I met Scale, and the rest is history.