Otter.ai (or Otter) is an AI-powered speech-to-text app that captures and transcribes audio, and makes the resulting transcriptions searchable. Simon Lau, its VP of Product, joined the company in 2017, one year after it was founded. Otter started out in B2C before evolving to B2B, and today it is doubling down on helping its users capture everything said during meetings. That includes everything from virtual meetings to physical ones in coffee shops, all the way up to large conferences such as TechCrunch Disrupt and Web Summit.

I’m unashamed to say that I’m something of an Otter superfan. I’ve been a professional journalist since 2012 and have been writing for the web since 2003, which has involved countless hours transcribing dictaphone-recorded interviews by hand. As such, I was skeptical that any online service could take my recordings and transcribe them accurately enough that they would need only light edits. After striking up a conversation with Otter’s VP of Product Simon Lau on Twitter one day in 2019, I was convinced to give Otter a try – and my professional life was instantly transformed.

I asked Simon for a chat to find out more about how Otter’s patented technology works and how the company plans to continue differentiating itself from its competitors. And – yes – the interview was transcribed in Otter. Very meta, I’m sure you’ll agree.

Hi Simon. You’re a one-man dynamo helping Otter’s users and promoting the company on Twitter. What is currently your biggest challenge?

Simon Lau: Of course – we are very scrappy! I wear many hats and juggle many things – that’s how it is at Otter while we are a small company. Nobody’s role is defined in a little box – we go beyond the box and fill in the gaps.

One of the challenges we face is hiring – we are based in the San Francisco Bay Area, which is a very competitive market for recruiting. We have been very lucky in attracting the right talent, and it’s a blessing that we have developed a very good company culture by hiring just the right people with the right mindset, which lets us move faster with fewer people.

As the company grows bigger, we may need to slow down slightly to put the right processes in place and make sure that communication remains aligned with our goals. But we are enjoying this sweet spot, when we are small enough that everybody is on the same page and remains well-connected. It means that we can focus on solving a single problem, unlike a larger company, which would struggle to do that.

How long did it take to develop the technology behind Otter?

It took us about a year to get to the point where we had something we could put out into the market. We started off with iOS, Android and web. The first version of Otter was completely free, and our objective was to get the product out there and see if there was market demand. A few months later we introduced a paywall, called Otter Pro, for people who use more than 600 minutes. That’s for people who love and have a need for the product. Fast-forward to the next year and we have Otter Business, which is geared toward small teams and businesses that want to purchase a few licenses all at once – from companies to startups and schools.

In what areas are you scaling the company?

We are continuing to raise investment, having raised a Series A round and more recently a Series B. As we grow our user base, we go through the growing pains of making sure that our technology and server infrastructure can scale to meet the demand of growing traffic. Growing the business, revenue and users while scaling up our technology infrastructure is a continuous process, and it marks four years of our journey so far.

What makes Otter’s speech-to-text technology different from its competitors’?

To develop accurate transcription, our cofounders Yun Fu and Sam Liang decided to build our own technology instead of depending on third-party speech APIs from the likes of Google and Amazon, which they evaluated and decided weren’t good enough for long-form, multi-person conversations. When we launched, a lot of the technology out there was only good enough for short commands such as ‘Ok Google!’, used to check the weather and get traffic updates. As such, Otter’s initial technology challenge lay in transcribing long-form conversations that involve multiple people – such as meetings.

Otter’s transcription of my interview with Simon

How easily could another company copy your technology?

We have patents around our IP. Due to the proliferation and democratisation of data, there is certainly a lot of training data out there. It all depends on what type of data you collect, how you clean the data, how you train your data model and so on – so the methods and approaches are going to be unique to each company. Companies are probably patenting their own techniques for generating speech models. But the barrier to entry is still big enough that we can continue to extend our lead in speech recognition technology.

Does Otter get ‘smarter’ the more data it collects?

Yes, absolutely – by mixing the data that we collect in public domains with the data that we collect from our user base. The policy around that is opt-in, so you’re opting in to share content that you want to contribute towards that training data. Otter also learns when users tag speakers: after a meeting is over, any edits that a user makes are recognised by Otter, which picks up the voice sample and trains on it to better recognise voices from that user. Any edits and corrections to a user’s transcript also benefit the rest of Otter’s user base.

How are you improving speaker identification in Otter?

We already support and have been focusing more on speaker identification. As Otter transcribes the conversation we are having now, it is not labelling it ‘Simon’ versus ‘Kane’ in real time. Currently, after a meeting has finished, Otter goes through the transcript as a post-processing step and labels who said what. We are working on doing that in real time, which is a very tough problem. In the speech industry there isn’t really any standard benchmark around identifying speakers.

There are benchmarks around transcription accuracy and word error rate, etc., but speaker identification is a new frontier, and we are committed to being the leader in providing accurate speaker identification without the need for a microphone array. Otter currently gets mixed audio of you and me speaking into our respective microphones. If this were a larger meeting with five to 10 people, Otter would aim to separate that one mixed signal into all the different speakers, accurately identifying them in real time. This is the second technology hurdle that we’re building our IP around.