In this chapter, we're going to wire up the entire text-to-speech pipeline. We're going to learn how to self-host Chatterbox TTS using FastAPI and Python on a Modal serverless GPU. This will provide us with a typed API client that we can then use in our tRPC procedures in a fully type-safe way. We're also going to learn how to create audio using that Chatterbox server and automatically store it in our Cloudflare R2 storage. Then, once we have that wired up, we're going to focus on building the UI.
We're going to use wavesurfer.js to create cool waveform visualizations of the generated audio, as well as seek controls and download functionality. So by the end, you will be able to type in some text, pick a voice, hit generate speech, and it will play back the generated speech using the voice you selected, with a cool waveform visualization and all of these other controls. The key concepts are going to be: self-hosting Chatterbox TTS, tRPC mutations, using FastAPI and Python for OpenAPI type generation, signed URL proxies to listen to generated audio, storing audio in Cloudflare R2, and using a cool package called wavesurfer.js to create amazing waveform visualizations. The build order is: first, self-hosting Chatterbox TTS and generating a typed API; then building the text-to-speech tRPC generation procedure; and finally, the WaveSurferPlayer for the waveform and controls. So the first step is learning how to self-host an open-source model.
So first let's discuss what model we are self-hosting. We are going to be self-hosting Chatterbox text-to-speech. Chatterbox is a family of three state-of-the-art open-source text-to-speech models by Resemble AI. And more specifically, we're going to be using Chatterbox Turbo, their most efficient model yet. So, where are we going to self-host that?
We're going to self-host it on Modal. Modal is an AI infrastructure platform that allows us to self-host, train, and do many, many other cool things with open-weights or open-source models. So, let's go ahead and create an account on Modal. And once you create an account, you will see a screen like this. So the first thing I want to clear up is the pricing.
So at the top you will see your credits remaining. At the time of me making this tutorial, Modal actually offers $30 every month for your free account. That's right. But right now it says $5. That is because you have to add a credit card to unlock the rest.
So if you want to, I would recommend adding a credit card. You can see that they will only charge a small one-time fee and then refund it back to you. And then you will unlock an additional $25, and it will renew every month. So, a pretty good deal if you ask me, but if for whatever reason you cannot do that, you should still have $5 available. What we have to do next is install Modal on our machine.
So before you run any of these commands, I'm going to show you what it looks like when it's finished. So I have the ability to run `modal --version`. Now, I'm not a Python developer and I'm not experienced with Python tooling. So for example, you can see on my machine I don't have `pip`. I don't even have `python`, but I do have `python3` and I do have `pip3`.
Now as I said, I'm not a Python developer, so I'm not sure whether `pip3 install modal` and `pip install modal` are the same thing. But one thing that I found that made it so much easier for me as a JavaScript developer to install Modal is this package called pipx. And it's purposely named like this, because if you scroll down a bit to "What is pipx", it's basically a tool that helps you install and run end-user applications written in Python. It's roughly similar to macOS's brew and JavaScript's npx, which is exactly the experience I want. So, the way I installed Modal was using `pipx install modal`, and the way you can install pipx is by following their guide.
So the package name is pipx, and in here you can see how to install it on macOS, for example using Homebrew: `brew install pipx`, then `pipx ensurepath`, and then `sudo pipx ensurepath --global`. Then on Ubuntu, Fedora, and Arch Linux. And also on Windows, of course, using Scoop. So use whatever method you prefer. If you know Python tooling and you know how to do this, feel free to do that, right?
So, just make sure that you have Modal installed in the end. You can see the version I am on. Again, I don't think you need to be on the same version as me; I just want to make you aware of what the version was at the time of me making this tutorial. And if you installed it using pipx, I think you can also run `pipx list`.
And then in here you will see all the packages installed using pipx. All right, once you have Modal, let's go ahead and run `modal setup`. It will basically ask you to create a token, you will select your workspace, and you will click authorize. And then in here you are now logged in to that new account that you have just created. Okay, so if you want to change your account, or if you want to log out, you can just run `modal setup` again.
Now as I said, if you did this using pip, you're going to have to run `python3 -m modal setup`, right? But if you did it using pipx, you can just run `modal` from now on. I think that's easier, but again, I'm not experienced with Python tooling, so be careful. And by careful I mean: research and try what works for you. We're not doing anything dangerous here. So what should we do now? Well, let's see if this works. I'm going to go ahead and copy this small snippet, which computes the square of 42 using serverless functions.
So what I'm going to do, inside of my project here in the root of my app, is create that file: `get_started.py`. And I'm going to paste the script in here. So: `import modal`, then `app = modal.App("example-get-started")`, then an `@app.function()` decorator, and in here we have a simple function to do squares.
Okay. So we don't have to do this part, because we just did it using an editor. And now what we should do is try and run it. So let's try `modal run get_started.py`. And you can see that this is now building an image and running an app. And here it is, the result.
And now in here, you can see that this has changed because we successfully did this. And inside of your stopped apps, you will see that example right here. All right, so that was a super simple example. So what should we do next? I recommend visiting their documentation page and in here click on the guide.
Okay, so in here they actually have some more advanced guides. For example, this one teaches you how to self-host Qwen, an open-source model; more specifically, it's LLM inference. The one that we are going to do is right here under examples, where you can find a bunch of cool examples of what you can self-host on Modal. And you can select Audio here.
And you can see everything that they actually offer you. You can make music. You can fine-tune Whisper. What we are interested in is deploying a text-to-speech API with Chatterbox. You actually don't need to worry if you can't find this documentation page, because I will provide you with the final Python code for running the Chatterbox TTS API, simply because we will have to modify it from what currently exists.
The reason I'm showing you this documentation page is so that you can learn how this works. Okay? What are they doing differently here from what we will be doing? So let me try and collapse this. One very important thing to understand: they are using their own storage here, called `modal.Volume`.
They're using that for simplicity's sake because they have the functionality to do it, and this is actually where you have the download button for all of those voices which I told you to add to your repository. Right, so inside of scripts we have the system voices. Keep in mind that this download does have some other folders; basically, you just have to find the audio files and add them here. You don't need the other things that are inside the folder. They basically upload all of those voices to their volume, and then in their Python script they connect to that volume. We are not going to be doing that, for a very simple reason: we need our voices to be loadable in our Next.js app.
So because of that, I figured we should use either AWS S3 or R2; basically, some reliable S3-compatible storage. So what we're going to do is take this script that they created right here, but instead of connecting to their volume, which you can see right here, we are going to connect to R2 storage. All right? And then we're just going to test whether this works or not. So, how do we test this?
Well, here's the thing. I'm not a Python developer, so I don't feel too comfortable teaching you Python. So instead, what I'm going to do is provide you with the source code of the final script, and I will explain the changes I have made from their script so you can understand the code that you are running. So, in the root of your app, you will have `chatterbox_tts.py`. Okay, and you can remove the get started one.
And you can also remove the Python cache folder if you want to; you can just add it to .gitignore and then it should be omitted. All right, so inside of `chatterbox_tts.py` — this is where you would usually add their example from the documentation (examples, audio, text-to-speech), and in here you would just copy the entire script. But we are going to do something different. You can go to my source code (the link is on the screen) and find `chatterbox_tts.py` in the root of my app. This is essentially a more advanced version of what they are offering in their documentation. It is optimized for the following things.
The first thing you immediately notice here is that we are using Modal's `CloudBucketMount` for R2. In here we should add our R2 bucket name, and in here we should add our R2 account ID. You also have to verify — you remember this, we already had it a couple of times in our code, right? If I search through my code, you can see I have instances of that in the seed script and in the R2 lib. So just confirm that you are not using the .eu one.
If you are, no problem. You will just have to modify the script to use that as well. Okay. And you can also see that we have to add some secrets here. Now, what is the difference between this secret and the `.env` file?
Well, this script right here will be running on Modal serverless functions. So in order for this file to access secret variables, we need to upload those secret variables to Modal. Okay? And in fact, they even have that process in this basic documentation right here. So you can see they teach you how to create volumes, which we don't need, because we have our voices uploaded to R2 storage.
But later, somewhere here, they should be teaching you about Hugging Face. Where is it? `modal.Secret`? Yes, you need to provide a Hugging Face token using a `modal.Secret`. I'm just not sure where exactly the script for that is, but never mind.
I know what the script is. All right. So what do we have to do next here? Well, let's do the easy thing first. Let's go ahead and fill in the bucket name and the account ID, because we already know that information.
So I'm going to go inside of the `.env` file here, copy the account ID, and add it here. Then we're going to copy the bucket name, and I'm going to change the bucket name here to our resonance app bucket. So with the account ID in place, we now mount that R2 bucket — this bucket, with this ID, and the secret which we are going to upload to Modal. So I think this is it. I'm pretty sure this is the secret access key that it needs. We're going to upload that later.
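As a side note, the S3-compatible endpoint that the bucket mount talks to is derived from your account ID, which is why the script needs both values. Here's a tiny sketch of that relationship; the account ID below is a made-up placeholder, not a real one:

```python
# Sketch: how Cloudflare R2's S3-compatible endpoint is derived from the
# account ID. The account ID here is a placeholder for illustration only.
def r2_endpoint(account_id: str, eu: bool = False) -> str:
    """Build the S3-compatible endpoint URL for a Cloudflare R2 account."""
    region = ".eu" if eu else ""
    return f"https://{account_id}{region}.r2.cloudflarestorage.com"

# The CloudBucketMount combines the bucket name with this endpoint, and
# Modal reads the credentials from the secret we upload later.
print(r2_endpoint("0123abcd0123abcd0123abcd0123abcd"))
```

This is also why the .eu check above matters: an EU-jurisdiction bucket lives at a different hostname, so the script would need the `.eu` variant of the endpoint.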
Then in here we set up a very slim Debian image. And in here we define a couple of packages: chatterbox-tts, FastAPI, and this other package, which I'm not sure about. We then create the normal `modal.App` instance, as they do in their example right here. But what we do differently next is protect this Python server with something called an `X-API-Key` header.
Basically, once we deploy this to Modal, I want to make sure no one can access this server except those who have the Chatterbox API key. So we are protecting the API in general, every single endpoint, with that API key. And the only app that's going to have it will be our Next.js app. So our Next.js application will be the only one that can communicate with this Modal serverless function. Another thing that we do, which is not shown in this documentation, is enable all of these extra params. So besides the prompt, which is basically the text users want to generate audio for, we also add a voice key, temperature, top p, top k, repetition penalty, and loudness. And if those sound familiar, it's because we have them inside of our features text-to-speech data sliders. Here they are. Let me go ahead and find them.
Temperature, top p, top k, repetition penalty. So those are the properties which we allow our users to modify, right? They translate to creativity, voice variety, and expression range; that's what those fields are. They are not exposed in their basic example here, but by reading the Chatterbox TTS source code, you can find that they do offer those fields. Okay.
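To make the parameter set concrete, here's a rough sketch of the request payload our endpoint accepts. The field names, defaults, and ranges here are my illustrative assumptions, not the exact schema from the deployed script — check the real source for the authoritative version:

```python
from dataclasses import dataclass

# Sketch of the generation parameters the /generate endpoint accepts.
# Names, defaults, and ranges are assumptions for illustration.
@dataclass
class GenerateParams:
    prompt: str                      # text the user wants spoken
    voice_key: str                   # R2 object key of the reference voice
    temperature: float = 0.8         # the "creativity" slider
    top_p: float = 0.95              # nucleus-sampling cutoff ("voice variety")
    top_k: int = 50                  # top-k sampling cutoff
    repetition_penalty: float = 1.2  # discourages repeated tokens
    loudness: float = -24.0          # target loudness for normalization

    def validate(self) -> None:
        """Reject obviously invalid values before hitting the model."""
        if not self.prompt:
            raise ValueError("prompt must not be empty")
        if not 0.0 < self.temperature <= 2.0:
            raise ValueError("temperature out of range")
        if not 0.0 < self.top_p <= 1.0:
            raise ValueError("top_p out of range")

params = GenerateParams(prompt="Hello from Chatterbox!", voice_key="voices/aaron.wav")
params.validate()  # passes silently for valid input
```

In the real script this shape is declared with FastAPI's request models, which is also what feeds the OpenAPI spec we generate types from later.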
So what do we do then? Well, for example, this part is assigning a GPU instance to this serverless function. I'm not exactly sure what this represents; it's just the name of the GPU that we are using, and it's the exact same one they show in their example. So it's one that's supported by the free tier. They of course do have more powerful ones.
If you are more experienced with this, you will probably know that, right? And this is the scale-down window. This basically means how long until this serverless function goes cold. Once it goes cold, it will take some time to start up during initialization. You can completely eliminate cold starts, but you will have to upgrade to a higher tier on Modal.
So for the free tier, you will simply have to live with cold starts, which I think is a fair deal given the amazing service they are providing. So, now we have the secrets. They also have secrets, but they only use one: the Hugging Face token. A Hugging Face token is free to obtain, and it's needed for Chatterbox.
Okay, so we have that as well. But we extend it by also requiring the Chatterbox API key, which is basically the protection so that only our Next.js app can communicate with this serverless function, and the Cloudflare R2 secret, so that this Python server can connect to our Cloudflare R2 storage where we actually have the system voices. All right. Then we load Chatterbox Turbo TTS with `from_pretrained` on the CUDA device, exactly the same as they do.
You can see this is obviously modified from their instance, right? And then in here we define the OpenAPI (Swagger) documentation. We set up CORS middleware. What I do here is allow everyone to connect; it works for me — I mean, it works for us for a tutorial.
Later, you could probably restrict this to just your deployed URL, but that's kind of already protected because of the Chatterbox API key here. Then we define some endpoints. We create a POST endpoint at `/generate`, and in here we simply attempt to find the voice key the user has selected inside of our R2 cloud storage. If we cannot find the selected voice within R2, we raise an exception: voice not found. Otherwise, we go ahead and generate, using the prompt, the voice path, the temperature, top p, top k, repetition penalty, and normalized loudness.
And we return back an audio response. If anything fails, for whatever reason, we throw a generic 500: failed to generate audio. And down here we have a local entrypoint, which I'm not sure is usual practice for FastAPI servers at all. Again, I'm not a Python developer, and I'm not sure how best practices and tooling work here, but they have this example, and I think it's a cool one, because it will allow us to test whether this works from the CLI. It's basically a copy of the code above, scoped not to an API endpoint but to a local entrypoint you can run with the CLI.
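The `/generate` flow I just described — API-key check, voice lookup in the mounted bucket, then generation with a 500 fallback — can be sketched in plain Python like this. The paths, names, and the stubbed-out generation step are illustrative assumptions, not the real script:

```python
import hmac
from pathlib import Path

# Plain-Python sketch of the /generate flow: check the X-API-Key header,
# look the requested voice up in the mounted bucket, and fail with the
# right status at each step. BUCKET_ROOT and API_KEY are placeholders.
BUCKET_ROOT = Path("/bucket")   # where the CloudBucketMount exposes R2
API_KEY = "super-secret-key"    # in reality, read from the Modal secret

def handle_generate(headers: dict, voice_key: str, prompt: str):
    # 401 if the caller doesn't present the shared secret.
    # compare_digest avoids timing side-channels on the comparison.
    sent = headers.get("x-api-key", "")
    if not hmac.compare_digest(sent, API_KEY):
        return 401, "invalid API key"
    # 404 if the requested voice file isn't in the R2 mount.
    voice_path = BUCKET_ROOT / voice_key
    if not voice_path.exists():
        return 404, "voice not found"
    # Otherwise we'd run the model; any unexpected error becomes a 500.
    try:
        return 200, f"audio for {prompt!r} using {voice_key}"
    except Exception:
        return 500, "failed to generate audio"

print(handle_generate({"x-api-key": "wrong"}, "aaron.wav", "hi"))
# → (401, 'invalid API key')
```

In the real script these returns are FastAPI `HTTPException`s and a streaming audio response, but the decision order is the same.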
All right, so I wanted to go through the code because I told you to copy and paste it; I wanted to show you that it's not some random thing I found. I just modified it with AI assistance and with my general knowledge of APIs and how they're built. But as I said, I'm not a Python developer, so for serious production instances I would definitely recommend reviewing this code once again. But for tutorial purposes, it is more than enough.
And it is generally very safe, because only our Next.js app will be able to access it — no one else. As long as no one else gets access to your API key, it is completely protected from malicious actors. All right, so once we have that, we have to go ahead and add some secrets. So, what secrets should we add?
Well, the first one that we have to add is the Hugging Face secret. To obtain the Hugging Face token, we need to create a Hugging Face account. So using the link on the screen, you can go to Hugging Face and create an account. Make sure that after you create your account, you also verify your email. And since this is mostly an AI model hosting and research website, don't be surprised by a huge number of "verify you're human" prompts.
They are normal. Once you're logged in, go ahead and find your profile at the top here. Okay, I just zoomed in too much. Find your profile and find access tokens right here. In here, click create a new token.
I'm going to simply call this chatterbox text to speech. And since I'm not really sure which permissions we need, I tend to select everything. But let me actually try and find the exact one we need, because I'm not sure if we should select all of these or only some of these. The permission I gave it in my first iteration of building this app was Write. So go ahead and select Write for the access token.
So let's click create token here, and let's copy the value of this token. Now, if you want to preserve it, you can add it to your `.env` file, even though it's not exactly going to be used in this project — it will be used on Modal, right? But you can store it under a name like `HUGGING_FACE_ACCESS_TOKEN`, simply because this file is not committed, so you don't lose it.
And let's click done. There we go. So now we have that token, and now we should learn how to add it. You can see that in my project here I do have something called secrets, but I don't have any secrets yet. So let's learn how to add one: `modal secret create hf-token HF_TOKEN=<paste your token here>`. Let's go ahead and see why we need to use this exact name. So, it's not made up.
It exactly matches the contents of `chatterbox_tts.py`. Okay. So let me try and find HF_TOKEN. Here it is: HF_TOKEN. That's what we need — the secret named hf-token, with the key HF_TOKEN. All right, let's add it and see if it works. There we go: created a new secret hf-token with the key HF_TOKEN. Now if you go here and refresh, you should see your secret. Here it is.
And they even show you how to use it. And from here, I'm not sure if you can see the value of it or not. But basically, yes, you should now have access to this secret. Okay, that's the first one. Now we need to add the second one.
The second one is going to be `modal secret create chatterbox-api-key CHATTERBOX_API_KEY=<value>`, with the key in capitals. And in here, you simply write your super secret key. This should definitely be something generated. If you are going to make this into a real product, please don't forget to change it. I'm adding something simple now just for tutorial purposes, so it doesn't get lost.
Okay, so let's add this secret, and let's immediately go to our `.env` file here, because this one actually will be used in our app. Let's add `CHATTERBOX_API_KEY` and just copy the super-secret-key value. That's the one we're going to use. And you can see now we have two secrets here, okay? And then there is one more that we need. And that is the following.
So we are going to add `modal secret create cloudflare-r2`. Okay. Let me try and do this without autocomplete, sorry, because it's interrupting me. cloudflare-r2. Go ahead and add a backslash to continue on the next line.
Maybe it would be simpler for me to just — well, you have to do it all at once, and I'm not experienced enough with command-line interfaces to tell you how to go to the next line without submitting. So what I'm going to do inside of `chatterbox_tts.py` is add this command, comment it out, and write: use this to add the R2 tokens. So you will be able to see this in the source code, okay?
So go ahead and uncomment this, copy it, comment it again, and then simply paste it. Okay? And now we're going to modify it. So, `AWS_ACCESS_KEY_ID` should hold our R2 access key ID. And yes, we are using AWS variable names even though we are working with R2, simply because the `CloudBucketMount` that Modal provides us with reads from those variables.
So they need to be called like this, okay? I'm going to copy the secret access key and paste it here. And then I'm going to copy the account ID and paste that above. Now, maybe you could add these separately, but when I tried, the moment I ran the cloudflare-r2 command and submitted a single value, I could no longer append another one to it.
That's why you have to do it at once. In case you already submitted, no problem, you can always delete a key from here. Right? So let's go ahead and add those two keys. Here they are.
We now have `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, and now we are ready to try this out. Perfect. First, just make sure that command stays commented out, because otherwise it's going to break your app. Then the first thing we're going to do is deploy the Chatterbox app: `modal deploy chatterbox_tts.py`. And let's see whether this succeeds or not. The first time it might take a while, because it's building the entire image, downloading all the packages it needs, and uploading to Modal. So let's give it a moment to do that. And once it successfully deploys, you will see your endpoint right here.
And here's a little tip. The first time you attempt to load this, it will take a while, so don't be surprised. That is because of the cold start; just go ahead and leave it. For now you should see your live app — here it is, the Chatterbox live app. So let's wait for this to load, and then we will see exactly what we can do with it. ...So, my container kept failing, and no matter how long I waited, the errors kept increasing.
And I think I know why. I'm not sure if the same thing will happen to you. Maybe it will, because I told you to try a GET request using your browser, but I forgot that we actually protect everything using the API key. So yours probably failed as well, but mine failed for a very specific reason. You can see that all of these failed at runtime initialization.
And that told me that something was wrong here. So I went to look through the logs and at the error message, which said that AWS S3 is unable to connect: unauthorized. You can see I was doing some testing, like why is it failing, but then I went back to my `modal secret create cloudflare-r2` command.
And I noticed that my `AWS_ACCESS_KEY_ID` is incorrect. I added the wrong one. I added my account ID here. As you can see, that's my account ID. That's why it's incorrect.
So my `AWS_SECRET_ACCESS_KEY` is this, and my `AWS_ACCESS_KEY_ID` is this. In case you are getting the same error, it could very much be this. So I'm going to try and just run this command again. Maybe I will get an error. There we go: cloudflare-r2 already exists. So what I'm going to do now is go inside of secrets, find cloudflare-r2, and delete it entirely. Like this.
All right? And then I'm just going to do this again, but this time with the correct values. And here's what you can do then — and what you should do then. Go ahead and run... is it `modal app list`? There we go: `modal app list`. Find the ID of the deployed app, and run `modal app stop <app-id>`. Then run `modal app list` again, and all of them should be stopped. Okay?
And then run `modal deploy chatterbox_tts.py` again. Okay? This should be much faster than the first time. But this time, hopefully, I will not have all of those errors. So I'm going to still try this via the browser, simply because I want to see whether I'm now getting the proper error or not.
So let me close this and focus on this one. So is it still the same thing? Or is this not failing because of that? So I would still be okay if this fails. But for a different reason, I expect this one to fail because we are missing the API key.
Okay, so let me see. This is at 46, so it looks like I still have the same issue. Okay, I'm going to go ahead and debug further because I thought that would fix it, but looks like that still didn't fix it. I'm not sure if you have the same problem, but I think this is a good opportunity to debug something that is unknown to us. So I'm gonna go ahead and show you exactly what I'm doing to debug this.
Well, I debugged it and you can see I again made a mistake with adding secrets. Okay, this time, hopefully third time's the charm. I should be able to do it. I'm again going to go inside of my secrets and I'm going to delete Cloudflare R2. And this time I'm going to remove the invalid one, the AWS secret access key.
So I'm purposely not cutting this part out simply because you can see these mistakes happen. Okay, let's try again. Again I'm going to go ahead and list all of my apps. I'm going to find the ID of the deployed one. I'm going to forcefully stop it.
And then I'm going to run `modal deploy chatterbox_tts.py` for the third time and see if that fixes it. Okay. I'm going to do the same thing that I always do: attempt to load it. Oh, that looks like it's green.
I don't think I've seen that before, so maybe it works now. We'll see. I will keep track of the containers to see if there are any errors. I'm going to keep track of what is going on with this and let's see what will happen. And this time it works!
Well, kind of. It's a 404 not found, but the server is live. The server works. Let's see if I can go to `/docs` here. I'm not sure if this will be protected.
Looks like it is not protected. Perfect. So we don't have to authorize this. And you can see our generate text-to-speech POST request and everything it accepts. That's what I wanted to show you.
So here's the automatically generated Swagger UI. And here's another thing you can do. Instead of `/docs` — so this is just your normal URL — try going to the OpenAPI route. What's the proper one? Just a second, I completely forgot which one it is. It's `/openapi.json`.
And in here, we get a super important thing: a type-safe OpenAPI spec of our entire deployed Chatterbox server, which will be super useful for our super type-safe Next.js tRPC server. So that's what I wanted to show you. After you have deployed, you should be able to go to `/openapi.json`, and you should be able to go to `/docs`.
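To see why this spec is so useful for type generation, here's a trimmed, hypothetical slice of what `/openapi.json` might return for our server. The real FastAPI output is much larger, and the exact field names are assumptions; the point is that the request schema is machine-readable, which is exactly what a type generator walks:

```python
import json

# A trimmed, hypothetical slice of an /openapi.json response -- just
# enough to show the structure a type generator consumes.
spec = json.loads("""
{
  "openapi": "3.1.0",
  "paths": {
    "/generate": {
      "post": {
        "requestBody": {
          "content": {
            "application/json": {
              "schema": {
                "properties": {
                  "prompt": {"type": "string"},
                  "voice_key": {"type": "string"},
                  "temperature": {"type": "number"}
                },
                "required": ["prompt", "voice_key"]
              }
            }
          }
        }
      }
    }
  }
}
""")

# A generator like openapi-typescript walks exactly this structure to
# emit TypeScript types for every path, body, and response.
props = spec["paths"]["/generate"]["post"]["requestBody"]["content"][
    "application/json"]["schema"]["properties"]
print(sorted(props))  # → ['prompt', 'temperature', 'voice_key']
```

So any time the Python server changes its request model, regenerating from this spec keeps the Next.js side in sync.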
So now, how about we try generating audio? That should be fun, right? So what I'm going to do immediately is go inside of my `.env` file right here, and alongside `CHATTERBOX_API_KEY` I'm going to add `CHATTERBOX_API_URL`, and just paste what we now know is a fully working, correctly deployed Chatterbox URL. You can see it finally succeeded — finally no errors — meaning we are mounted to our R2 storage, which means we should be able to test whether this works. So how do we do that?
Well, we run `modal run chatterbox_tts.py`. And again, I'm going to copy this command and add it to the source code, simply so you don't have to figure out how to do it. So I will add a note: use this to test locally. There we go. And we're going to comment it out.
Like this. Okay? So comment that part out. Copy it. Comment it out again.
And paste it here. And what you have to change here — you don't have to change the prompt, but you do have to change the voice key. And you can find voice keys inside of `npx prisma studio`. Okay, so I think I'm already running that here. Here I am.
So I'm going to open Prisma Studio here, go inside of voices, and just pick a random voice. Make sure it's English, and copy just the last part — or you can copy the entire R2 object key — and then paste it in here. So this will now use Aaron's voice. Then run `modal run chatterbox_tts.py`. Let's let it run and see whether it throws an error or actually generates a voice. I'm not sure if this will render an output; we'll see.
If it does, I think it will just add a file to our repository here, like output or something. So let's just let this be, and let's just test if it works. So if you enter an incorrect voice key, it should throw you an error because it cannot find that in R2 storage. For this example, you don't need the access key, the secret key, simply because we are literally running the server locally. So that's why you don't need it.
In a moment, I'm also going to add another comment here, so you can test a curl call with the access key. And if it works, this should be the result: audio saved to a temporary chatterbox-tts path. So if you go to that file, which is somewhere on your computer — in the temporary directory right here — you should find `output.wav`, and you can play it, and it should sound like the prompt you added.
Keep in mind that AI is non-deterministic, so it can be different every time. All right, so that works — perfect. We have another thing we have to do, and that's the HTTP test. So I'm going to add that here too, and comment it out, right? So all of these are CLI helpers. I'm probably going to leave them here, but I will also have them in the README file, so you will be able to find these commands somewhere. And none of these are really important for the functionality of the project; I just want to make sure that you know the project works. So let's slightly modify this script by adding our real API URL.
Then I'm going to replace this part: our URL plus `/generate`. We're going to replace this with the super-secret key. And I'm going to just copy a system voice, so let's replace the voice key part.
So only inside of the quotes, right? Just paste this. All right, let's copy this. I'm gonna comment it out and let's go ahead and paste it. So you should have super secret key.
You should have a prompt, you should have a voice key, and it should point to your working URL at `/generate`, saving to `output.wav`. And what this should do is simply not fail; that's the only thing we're testing right now. Because if you can do it from the command line, it absolutely works for our tRPC server and all the other places we are going to use it.
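For reference, the HTTP call that curl test performs can be sketched with the standard library like this. The URL, key, and voice key are placeholders — substitute your deployed Modal URL and your `CHATTERBOX_API_KEY`. The request is only constructed here, not sent:

```python
import json
import urllib.request

# Sketch of the HTTP call the curl test makes. All values below are
# placeholders; substitute your own deployed URL, API key, and voice key.
url = "https://example--chatterbox.modal.run/generate"  # placeholder URL
payload = {
    "prompt": "Hello from Chatterbox!",
    "voice_key": "voices/aaron.wav",  # placeholder R2 object key
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "X-API-Key": "super-secret-key",  # must match the Modal secret
    },
    method="POST",
)
print(req.get_method(), req.full_url)
# Actually sending it would be: urllib.request.urlopen(req).read(),
# which returns the WAV bytes you'd write to output.wav.
```

The important part is the `X-API-Key` header: without it, the deployed server rejects the request, which is exactly the protection we set up earlier.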
So what should happen now is it should generate an output audio file inside of this project right here. And this is how it looks when it's finished. Let's go ahead and find it — and here we have output.wav, so you can go ahead and listen to it. My example worked: it said "hello from chatterbox" and it did a slight chuckle.
So it definitely works. Okay, we can now go ahead and remove that audio file — we don't need it — and we can focus on, well, connecting this to our TRPC server. All right, so the only thing I'm gonna do now is change this back to something empty, so that when I push this I don't leak my voice keys.
All right and I think everything else should work. Your API key here. All of this works perfectly fine. Brilliant. Okay.
So, I'm just going to change this to say "use this to test curl". You don't have to modify this file at all — it's provided in the source code. We went over it to explain how it works, of course, and I'll just leave these comments here so you can uncomment, copy, and paste them into your terminal. Beautiful.
So what do we have to do now? Well, let's use the fact that we have this openapi.json and use it to create strict types within our project. In order to do that, we're going to need to install openapi-fetch. That's the first thing — that's how we're actually going to communicate with Chatterbox.
And the second thing we're going to do is install openapi-typescript inside of dev dependencies. Technically, you could just do this with a normal fetch, right? But it's definitely not as type safe, so there's no reason to do that when we have this amazing OpenAPI spec from FastAPI. So what we have to do now is go inside of scripts and create a new file here, which we are going to call synchronize-api.
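After those two installs, package.json gains entries along these lines (the version ranges are placeholders, not from the source):

```json
{
  "dependencies": {
    "openapi-fetch": "^0.12.0"
  },
  "devDependencies": {
    "openapi-typescript": "^7.0.0"
  }
}
```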
So, synchronize-api.ts. And inside of here I'm just going to add a little comment: this fetches the OpenAPI spec from the Chatterbox text-to-speech API and generates TypeScript types. You can use it by adding the Chatterbox API URL and then running npm run sync-api. All right.
Or, if you have an environment file, you just run this. Since this is just a script, feel free to go inside of the source code and copy the content, simply because I'm mostly going to copy and paste this anyway. So: we are importing the packages we need to work with our file system, we're importing dotenv to read the environment file, and then we are using openapi-typescript to actually connect to the FastAPI server's spec and parse it into TypeScript type strings.
So let's go ahead and define the dir name here, and in here we define the output path. The output path will lead from this file into source, then into types — which we don't have yet — and then chatterbox-api.d.ts. Then in here we're going to create an asynchronous function main. I'm gonna define the API URL to be process.env.CHATTERBOX_API_URL. Always make sure that you actually have the Chatterbox API URL in your environment file.
In case you don't have it, we're going to go ahead and throw an error. It's required if you want to generate types. Now we need to modify this URL and add openapi.json to get this format you're seeing on the left screen. And I'm just going to add a console log here. We are fetching the open API specs.
Now let's go ahead and run this function for that spec file, and let's turn it into a string. Then we're going to make sure the output directory exists first, and if it doesn't, we're gonna create it. Then I'm gonna add a header comment to the file so we understand what this is. This header comment will simply say: this file is auto-generated by scripts/synchronize-api.ts, do not edit manually — instead run npm run sync-api to regenerate — and then we're gonna show the URL we used to generate this and the date and time it was generated at.
So your users know that this shouldn't be modified by hand because it makes no sense if they do that. And then we're gonna add an actual write file sync to the output path, use the header and combine it with the contents we have parsed. And we're gonna log that out. Beautiful. And let's go ahead and actually execute that main function and catch any errors if they appear.
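The header-building step above can be sketched as a small pure helper — the wording here is an approximation of what the transcript describes, not the exact script:

```typescript
// Build the "do not edit" header prepended to the generated types file.
// The phrasing is approximated from the description above.
function buildHeader(specUrl: string, generatedAt: Date): string {
  return [
    "/**",
    " * This file is auto-generated by scripts/synchronize-api.ts.",
    " * Do not edit manually - run `npm run sync-api` to regenerate.",
    ` * Source: ${specUrl}`,
    ` * Generated at: ${generatedAt.toISOString()}`,
    " */",
    "",
  ].join("\n");
}

console.log(buildHeader("https://example.modal.run/openapi.json", new Date()));
```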
Beautiful. So I just guided you through this synchronize script, but of course you can just copy and paste it from the source code, since it's not exactly... I mean, it is learning material, but it's not mandatory for completing the project, if I can say so. All right, now let's go inside of source, lib, and let's create chatterbox-client.ts — this is what we'll use to communicate with the API we just created. So let's import createClient from openapi-fetch, let's import the paths type from types/chatterbox-api — which doesn't exist yet — let's import env from our environment file, and let's export const chatterbox using createClient.
It's going to receive the paths from the generated types, which we're going to create in a moment, and it's going to take an options object for this createClient function. We pass in the baseUrl, targeting the Chatterbox API URL from the environment, and to the headers we assign x-api-key to be the Chatterbox API key from the environment. Because of this, our app will be the only one that can communicate with that deployed server, because in the Chatterbox text-to-speech Python code we are very strictly requiring the x-api-key. So unless the credentials get stolen, no one except your app will be able to communicate with that server, and that's what makes it safe. That's also why you don't have to worry too much about rate limiting or anything like that — you don't have to rely on your Python knowledge to handle it; you can rely on your JavaScript knowledge from now on, because your app is the only one that can talk to Chatterbox. Okay? Your backend. Brilliant. So now we're gonna go ahead and add the sync script to package.json here.
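Here's a self-contained sketch of how that client is wired up. Note that `createClient` below is a tiny local stand-in so the snippet runs on its own — in the real file it is imported from the openapi-fetch package, and `paths` comes from the generated chatterbox-api types:

```typescript
// Stand-in for openapi-fetch's createClient, just to show the configuration
// shape; the real client exposes typed GET/POST helpers driven by `Paths`.
type ClientConfig = { baseUrl: string; headers: Record<string, string> };

function createClient<Paths>(config: ClientConfig): { config: ClientConfig } {
  return { config };
}

// Env var names follow the transcript; the fallbacks are illustrative only.
const chatterbox = createClient<object>({
  baseUrl: process.env.CHATTERBOX_API_URL ?? "https://example.modal.run",
  headers: { "x-api-key": process.env.CHATTERBOX_API_KEY ?? "dev-key" },
});

console.log(chatterbox.config.baseUrl);
```

The key design point survives the stub: every request carries the x-api-key header, so only callers holding the key can reach the deployed server.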
Let's go inside of scripts, and we're going to add sync-api: tsx scripts/synchronize-api.ts. Make sure you don't misspell scripts or synchronize-api — I always like to double-check: scripts/synchronize-api.ts, looking good. And then we have to go inside of the env file — my apologies, the source/lib environment file — and in here we have to add the Chatterbox API URL, and it has to be required, so let's go ahead and add a minimum of one.
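The package.json scripts entry described above would look something like this (the exact script name is as the transcript describes it):

```json
{
  "scripts": {
    "sync-api": "tsx scripts/synchronize-api.ts"
  }
}
```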
And we need the chatterbox API key which is also required. Oh, we don't need minimum of one because we are using z.url here. My apologies. All right, I think that should be it. So what should we do now?
Well first, let's just check that Chatterbox client is not throwing errors for the environment file. You can see that now it works because we added those two. So if you have any errors, it means you have incorrectly typed this. Now we have to do this part. So how do we do that?
Well, very easily: we now have a sync-api script. So I'm going to do npm run sync-api, which is going to hit our deployed Chatterbox URL — basically the URL, forward slash, openapi.json. Again, this can take a while, especially if the server has cooled down, which it probably has, because we just wrote the entire thing. And that should generate the types folder and the actual generated types.
And then this file should finally stop showing red, because it should be able to load the paths, and we should have a fully type-safe API. We reduce the chances of mistakes by a huge margin — I mean, you can see the Python server, right? Chatterbox. You could fail very easily by accidentally adding a typo here. So this will be super type safe using the chatterbox client instance, because we are going to use the OpenAPI type spec to make sure we don't misspell any of this, right? It just makes sense to do it this way, because it's an industry-standard approach and our server is already super type safe.
And just like that, no more errors. So now when you go ahead and call chatterbox, you will be able to call only specific endpoints. For example, get doesn't work. Let's try post. Let's see — you can see it automatically auto-suggests generate, because that's the only one we have.
Now, I'm not sure what else we can pass here — this data, temperature — I'm not sure how to use it right now off the top of my head, but you can see that it's already type safe. If I go ahead and try something like this, it will say that it doesn't exist — I mean, it won't auto-suggest it. So that's what we achieved with that type safety.
Amazing. So, here are all the files that we have changed. So this is an auto-generated file. You can see how it looks inside of source types. It's an auto-generated file, do not edit manually.
And in here you can see we have every single detail: what does this request accept, what does it return, what kind of errors does it throw? So we are very aware of the server, and it will not surprise us. It's almost like an extension of our TRPC procedures. Brilliant. So this was a very long lesson, and we barely touched the second part, but it's an important lesson for us to do.
So now let's go ahead and build the TRPC procedure, which is going to communicate with our newly created and hosted chatterbox client. So let's go ahead and make sure our app is running. So npm run dev, and we can now stop looking at model and we can go ahead and focus on localhost 3000. So I'm going to close everything else now. The current state of text-to-speech is that it can accept a value here, here, and we can call generate speech, but it doesn't amount to anything, right?
So what we have to do now is we have to make this actually submit because right now nothing is happening. Okay? In order to do that, we need to build the generate procedure. There are no new packages we have to add at the moment so we can immediately go inside of trpc, routers and we can just go ahead and build generations.ts. Let's go ahead and add our imports.
So it's going to be Zod TRPC error from the TRPC server package. Chatterbox from our newly created lib Chatterbox client, which will allow us to communicate with the deployed TTS. Prisma, upload audio helper from R2 lib. Then we're going to go ahead and add a text maximum length from features text to speech data constants, and then we're gonna add create TRPC router and organization procedure from init file. Let's go ahead and export the generations router using create TRPC router.
In here we're gonna first add getById, an organization-based procedure. So, whoever wants to request a certain generation — first of all, what is a certain generation? That is the history tab, right? Here in the history tab, you will see a list of all the previous things you have generated, and when you click on one, that will trigger the getById.
So the user needs to know the id of the generation they are trying to fetch. This will be a query and it's going to be an asynchronous query with input and context here. So the first thing we're going to do is we're going to use Prisma to find a unique generation. So let's go ahead and do await Prisma generation find unique. And in the where, we're gonna go ahead and use a combination of input.id, referring to this, but only if also the currently logged in user's context organization ID is the organization ID referenced for that generation.
So we shouldn't allow users from outside organizations to access someone else's work when they don't belong to that organization. And since this one is purely presentational — getById is only used to listen to a previous generation — we can omit organizationId and the R2 object key, just so we don't spill more information than we need to, right? Let's not make a malicious user's job easier. Let's omit those two anyway.
If such a generation isn't found, we're gonna throw NOT_FOUND. Otherwise, we're gonna do the following: we return the generation we found, but we also add an audioUrl pointing to forward slash API, forward slash audio, and then generation.id. You will see what this is in a second, but I'll give you a quick hint: that is the signed-URL proxy. We can't just load an audio file from Cloudflare R2 and play it — we need to sign it, and we need to stream it in a way that allows audio elements in HTML to play it. That's why we do it this way, so the backend handles at least this part for us.
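The audioUrl derivation above can be sketched as a tiny helper. The field names on the generation object are illustrative, not the full Prisma model:

```typescript
// Attach a derived audioUrl pointing at the signed-URL proxy route,
// as described for the getById procedure above.
type Generation = { id: string; text: string; voiceName: string };

function withAudioUrl(generation: Generation) {
  return { ...generation, audioUrl: `/api/audio/${generation.id}` };
}

const result = withAudioUrl({ id: "gen_123", text: "Hello", voiceName: "Andy" });
console.log(result.audioUrl); // "/api/audio/gen_123"
```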
Okay, That's it for the getById procedure. Now, we need to go ahead and do a getAll procedure. A getAll procedure is going to be an organization procedure with no input. So it's just going to be a direct query, so we only have access to the context here. And in here, instead of using find unique, we're going to use generations and find many.
So what should we query by here? Well, by the only thing we can: the currently logged-in user's organization, so the user doesn't see anything that doesn't belong to them. Let's order by newest created, and again, let's omit the info we don't need.
And this one is much simpler — we can just return the generations here. Why? Because the user loading this endpoint doesn't intend to listen to all of them. This will be used when we just need to list a bunch of generations.
That's why we don't need to do any modifications. And now we come to the one we've been waiting for, the create procedure. How do we create a new generation? Well, this one is a bit bigger. The first thing we're gonna have to do is create a pretty large params object.
So which params should we put here? Well, all the params that we accept in our Chatterbox text-to-speech server — all of these params are expected. Now, it doesn't really matter that they aren't written exactly the same; you can see they use different casing from us.
So we can use our own terminology for this. For example, we can use text instead of prompt. Let's make it a string, let's make it a minimum of one and a maximum of text maximum length. Then let's go ahead and add a voice ID, let's make it required. Let's go ahead and add a temperature.
And in here I'm using a maximum of 2 and a default of 0.8, simply because that's the exact spec here: default 0.8, less than or equal to 2.0. So that's the same thing here — that's how I'm assigning these values. Then we're going to have top p, then top k, and then repetition penalty.
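As a hedged sketch of those defaults and bounds: in the real router this is a Zod input schema, but the same idea can be shown as a plain function so it runs standalone. `TEXT_MAX_LENGTH` here is a stand-in for the constant imported in the transcript, and its value is an assumption:

```typescript
const TEXT_MAX_LENGTH = 2000; // assumed value, not from the source

// Apply the create-input rules described above: text 1..TEXT_MAX_LENGTH,
// temperature defaulting to 0.8 with a maximum of 2.
function applyDefaults(input: { text: string; temperature?: number }) {
  const temperature = input.temperature ?? 0.8;
  if (temperature < 0 || temperature > 2) throw new Error("temperature out of range");
  if (input.text.length < 1 || input.text.length > TEXT_MAX_LENGTH)
    throw new Error("text length out of range");
  return { ...input, temperature };
}

console.log(applyDefaults({ text: "Hello" }).temperature); // 0.8
```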
Okay, brilliant. So now let's go ahead and add the mutation — asynchronous, with input and context — and the first thing we're going to do is attempt to find the voice the user is trying to generate with. Does this voice even exist? So we're gonna open a where, and we're gonna first add input.voiceId, so we know which voice ID you are trying to use.
But should we allow you to use that voice? You might think, okay, just put this, right? Well, not exactly. What if the user wants to use a system voice? Right?
What if the user selects something like Andy, a system voice that doesn't belong to anyone? Well, in that case, we're going to use something called OR. The first scenario will be if the variant of that voice is system — if it is, you're good to go. The second scenario will be if the variant is custom.
And in that case, we need to check the user's organization ID. All right, I'm going to collapse this so it's easier to read. So basically: if system, we allow it for anyone; if custom, only if you belong to the organization that created it. And then let's go ahead and sparingly select the info we need.
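That access rule can be sketched as plain data — this builds the Prisma-style `where` object described above, with the enum values ("SYSTEM"/"CUSTOM") being assumptions about how the variants are named:

```typescript
// System voices are usable by anyone; custom voices only by members of the
// organization that created them.
function buildVoiceWhere(voiceId: string, organizationId: string) {
  return {
    id: voiceId,
    OR: [
      { variant: "SYSTEM" },
      { variant: "CUSTOM", organizationId },
    ],
  };
}

console.log(JSON.stringify(buildVoiceWhere("voice_1", "org_1"), null, 2));
```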
We need the ID, the name, and the R2 object key. Then, if the voice wasn't found, we're just going to throw. Let me see if I did this correctly. I did. There we go.
Otherwise, let's check if the voice's R2 object key doesn't exist. This is a specific scenario where we have an orphaned voice: the voice audio is not available, meaning the generation would already fail. No need to waste GPU time calling Chatterbox, because it's going to fail there anyway — we did the same check on the server.
Voice not found in that Cloudflare R2 mount path, okay? And now we come to the cool part. You can destructure data and error from await chatterbox.POST. It will auto-complete the generate endpoint for you, and you can open the body here and see that prompt needs to be input.text.
The voice key is the voice's R2 object key, and so on and so on. And that is the type-safety benefit I was telling you about. You might think, well, okay, but we could have just been careful. Sure, you could. But this is how it's done in industry-standard environments. And more importantly, nowadays you cannot avoid agentic work anymore.
You've probably been using AI to some extent. And the more type safe you are, the better results AI will produce in your codebase; the more you ignore types, the more mistakes AI makes. If you add AI to this codebase, it will communicate with this server safely, because it can read the entire types file — so it knows exactly what to expect.
So it's not just that it helps you and your team — it also helps AI. It's an overall win to be type safe. Now let's check if we have any error thrown from Chatterbox. If we do, well, we have to throw our own internal server error here as well. Then let's check whether whatever this server has returned isn't an instance of ArrayBuffer.
If that's the case, we also have to throw, because it's an invalid audio response — we currently expect a buffer so that we can do this: create a Buffer from that data. Then we're gonna have to upload that generation to Cloudflare R2. So let's first prepare the following: the generation ID and the R2 object key. Then we're gonna open a try/catch block.
In the try block, let's create the generation first — let's add it to our database. The data this accepts is the following; I'm going to go block by block. Organization: the currently logged-in user's organization.
Text from input.text. Voice name, which is a hard-coded string, but also the voice ID. Now, I'm not sure if you remember, but that's what we set up when we developed the generation model. Why would we need a separate voice name when we have a very visible relation with voice? Well, because of this onDelete: SetNull.
When a voice gets deleted, we decided that the generations made using it should still exist — they simply won't have info about the deleted voice. This is where voice name comes in handy. Even if a voice is deleted, we can still show the user: this was built using a voice that no longer exists, and it was called Andy, Antonio, Aaron, whatever. So that's why we are adding a voice name along with the voice ID. And now we just add the rest of the properties: temperature, top p, top k, and repetition penalty.
So then, when the user clicks something in history, it's immediately going to change the text to what it was, change the creativity and the voice variety — it will basically be a time machine back to exactly what was selected at the time of that generation. Brilliant! So let's go ahead and create that, and let's only return the ID after we create it — we only make it return the ID, no need to add anything else here. The only thing we didn't add right now is the upload key of the generation — the uploaded audio of the prompt the user wanted to hear, not the voice. So the first thing we can do is populate the generation ID: let's assign generationId to the newly created generation's ID. But we still have the question of the R2 object key. The reason we needed to create the generation first is that the format in which we store generations in R2 is generations, forward slash, orgs, then the organization ID, and then the generation ID.
So we need a generation ID before we can upload; otherwise we wouldn't know whom it belongs to, right? We are deriving from the generation ID to know the ownership and which generation it belongs to. So let's create the R2 object key: generations, forward slash, orgs, the org ID, and then the generation ID. And then, once we have the R2 object key, we can do await uploadAudio with the buffer we created and the key r2ObjectKey.
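The key format just described can be captured in a tiny helper. The path segments follow the transcript ("generations/orgs/&lt;orgId&gt;/&lt;generationId&gt;"); the ".wav" suffix is my assumption:

```typescript
// Build the R2 object key for a generation's audio, deriving ownership
// from the org ID and identity from the generation ID.
function buildGenerationKey(orgId: string, generationId: string): string {
  return `generations/orgs/${orgId}/${generationId}.wav`;
}

console.log(buildGenerationKey("org_1", "gen_1")); // "generations/orgs/org_1/gen_1.wav"
```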
And once that is done, we can do await prisma.generation.update, where the generation ID matches, and simply set the new R2 object key after we upload. Now, these uploads are limited to something like 10 or 20 megabytes, so you don't have to really worry about this failing too often. Still, if you are planning to scale to millions of users, you should probably do this in some kind of background job with retries, right? Make it a little more reliable. I think this can work up to, I don't know, 100,000 users, no problem, because we're capping how long the files can be.
So all of this is pretty fast, but chances are the more users you get, the more of these orphaned generations you will have — the audio upload failed, but we created the record first, right? Still, if we happen to fail here — if something happens in this try/catch — we're gonna check if we have the generation ID. If we have it, it means we have already created a database record. So when this fails, we're simply going to assume the only thing that could have failed was the upload. Technically it could have been the update too, but again, the only thing we can do at this point if something fails is delete the generation we've just added to our database.
Obviously, you can handle this in a more elegant way with background jobs and retries, but for now, this should be just fine. Then, outside of this if clause but still inside of the catch block, let's just throw an internal server error. Great! Once we do that, let's go outside of the try/catch — so still inside of the create mutation, but outside of the try/catch — and check: if there is no generation ID, or if there is no R2 object key, it means something somewhere went wrong, and we have to throw as well. Otherwise, we have successfully created a generated audio and uploaded it to R2.
Amazing — so that is the create procedure. Now, let's do this. If you remember, in my getById I created this audioUrl, but it currently doesn't exist: /api/audio/[generationId] doesn't exist.
So I'm going to create that now. Inside of api, I will create a new folder called audio, and in there [generationId], and then route.ts. Let's import auth from Clerk's Next.js server package, prisma from our database lib, and getSignedAudioUrl from the R2 lib. Why are we doing this instead of a TRPC query? Because TRPC cannot produce the stream that we need right now.
We need to stream this to the browser so it can be played in audio elements. So let's export an asynchronous function GET. We can skip the first parameter, which is of type Request, and go straight to the second parameter: the params. The params are going to be a promise of an object holding the generation ID, which is a string.
Make sure this generationId 100% matches what you typed in the folder name — that's how routing works in Next.js. If you misspell it, this will forever be undefined, or vice versa. And there's no type safety here, okay, so you have to be careful.
And this is one of the reasons I prefer adding TRPC to Next.js: you can't accidentally type one thing here and then forget to match it there. That's why I really like TRPC — it's super type safe. Let's open this function and first verify the user: await auth to get the user ID and organization ID. If either of those two is missing, let's throw a 401.
You are unauthorized to proceed from here. Then let's do a first check. For this audio we want to stream to the browser, we are using a generation ID — does that generation ID even exist? We also have to await params to get to the generation ID.
So, using prisma.generation.findUnique, the where is: the ID is the generation ID, and the organization ID matches the currently logged-in user's. Again, we are not letting anyone access what they shouldn't. Let's throw an error if the generation doesn't exist. Then let's see if the generation has an R2 object key. If it doesn't — so don't forget an exclamation point — the audio is simply not available.
Perhaps it is still in the await uploadAudio phase, right? That's a scenario that could happen. So it doesn't mean it's orphaned — it could mean it's just not uploaded yet; check again in a moment. But if it does exist, we can do await getSignedAudioUrl, which we defined in our R2 lib, right?
So we added an import for that, and we pass in the generation's R2 object key. Then we can finally get the audio response by awaiting a normal fetch to the signed URL. Let's check if the audio response is not okay: if not audioResponse.ok, failed to fetch audio, status 502. And finally, let's return a new Response.
Inside of it, we're going to pass audioResponse.body, and let's add headers. For the headers, we're going to give it the correct content type, but we're also going to give it cache control, so it loads faster the next time the user plays it in the same session. Brilliant! So that's what's going to happen when a user loads an individual generation and attempts to play it.
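A minimal sketch of those response headers — the WAV content type matches the audio the server produces, while the exact cache-control policy (private, one hour) is my assumption, not stated in the source:

```typescript
// Headers for the audio proxy response: correct content type plus
// short-lived private caching so replays in the same session are fast.
function audioResponseHeaders(): Record<string, string> {
  return {
    "Content-Type": "audio/wav",
    "Cache-Control": "private, max-age=3600",
  };
}

console.log(audioResponseHeaders()["Content-Type"]); // "audio/wav"
```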
It will go through this proxy, which streams to the browser after getting a signed URL from Cloudflare, right? Because just by knowing the R2 object key, you shouldn't be able to fetch things by default. Only by using our app can you listen to the generated audio. And the moment that signed audio URL expires — which is in one hour — even if someone manages to inspect-element and steal the WAV file, it's not going to work anymore. So it's super safe, and no one can leak your audio files.
So you don't have to worry about that either. All right, now let's build the interface for that. We just did the second part, so let's start developing the third part. Let's start with source, app, dashboard, text-to-speech, and create a new folder in square brackets, [generationId], and then inside of that, page.tsx. In here we're going to start with the good old import of TRPC, HydrateClient, and prefetch from our TRPC server.
Not the text-to-speech page — that one is for creating something. This one, under [generationId], will be called the text-to-speech detail page. So let's first define the function and the types: export default asynchronous function TextToSpeechDetailPage. It accepts params, and those params are a promise with an object of generationId.
Again, make sure there are no misspellings here. You can copy this and paste it here, just to confirm you didn't misspell it, because otherwise it will always be undefined. Okay, now, first things first, let's await params so we have the generation ID, and then let's prefetch some things. So let's prefetch the TRPC generations — oops, we forgot to add generations to the app router. That's a super easy fix: let's go inside of TRPC routers, _app, and add the generations router from ./generations, the exact same way we did voices; there should be no problems here. All right, just make sure your TRPC routers _app has the generations.
Perfect. And now you can do generations, getById, queryOptions — and this time they're not empty, because we need to specify exactly which one we are trying to prefetch. And for the voices, we still need to prefetch all of them, because this will look identical to this: the text-to-speech view will be almost identical to what we're going to build now with the text-to-speech detail page.
So we still need the voices — otherwise, how would we load the voice that was selected at the time of the generation, right? All right, now we just have to return HydrateClient, and in there we're going to render TextToSpeechDetailView with a generationId prop. The only problem is we don't have TextToSpeechDetailView, so now we're going to go ahead and develop that.
Let's go inside of features, text-to-speech — not components, but views — and you can copy text-to-speech-view, paste it, and rename it to text-to-speech-detail-view. Okay. Then what we're going to do here is rename the component: TextToSpeechDetailView. And this one will not operate via initial values.
Instead, it will operate via a generation ID, okay? So, let's stop here in TextToSpeechDetailView. Make sure you have renamed the function and changed the props. And let's go back inside of our dashboard text-to-speech [generationId] page and import this file.
There we go. We are done with the app folder, and we can focus on building in the features folder. Okay. So now, instead of getting the initial values through props, this one will work differently — it will be able to load the generation.
We already have useTRPC here. So what we're gonna do is change useSuspenseQuery to useSuspenseQueries — multiple of them. You can remove the other one. So: useSuspenseQueries, and in here we're going to open an object, add queries, and then add the first one to the array, like this. That will change the result here to the following — let me just change this to an array too. The first one is going to be the generation query and the second one the voices query. So let's add trpc.generations.getById.queryOptions with the ID, generationId. Beautiful.
So when you hover over the generation query here, you should see audioUrl, id — basically a single object — or an error, and in the voices query you should see custom and system. All right, so that works. Now let's do the following. Let's add data here: data is generationQuery.data. And for this one, let's just do voicesQuery.data.
So, a slight modification. allVoices can then stay the same, and the fallback-voice logic can stay the same. Now let's look at the resolved voice ID — this will need some changes. It's no longer using initial values; it's now using data. So data.voiceId there — let me just check — data.voiceId, and allVoices.some, let's do data.voiceId.
But we also need to check... no, okay, I think we're good here. And data.voiceId here. Great.
:47 And now for the default values, they're going to be completely different. So you can delete that and you can do this instead. Text coming from the data, voice ID coming from the result, Voice ID, Temperature from Data Temperature, Top P from Top P, Top K from Top K, Repetition Penalty, Repetition Penalty. Then there's one more thing we have to do. In case a voice In this generation was deleted, we need to do the following.
:18 We need to use the denormalized voice name snapshot instead of the populated voice relation, so the preview always shows the voice name at the time of generation, even if the voice was later renamed or deleted. So that's what I was explaining to you earlier, right? We're gonna do this: generationVoice.
:38 Its id will be data.voiceId, or undefined in case it's deleted, but we will always have a snapshot of how the voice was named at the time. Beautiful. So now let's go ahead and see what we have to do here. So this is the text-to-speech voices provider and the text-to-speech form. What I want to do here is give the form a `key` prop of the generation ID so it resets on every ID change.
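The snapshot fallback described here can be sketched as a small pure helper. This is my sketch, not the course's exact code — field names like `voiceId` and `voiceName` are assumed from the walkthrough:

```typescript
// Hypothetical shape of a loaded generation: `voiceId` is the populated
// relation id (null once the voice is deleted), `voiceName` is the
// denormalized snapshot stored at generation time.
interface GenerationLike {
  voiceId: string | null;
  voiceName: string;
}

// Resolve the voice to display: the id may be gone, but the name never is.
function resolveGenerationVoice(data: GenerationLike): { id: string | undefined; name: string } {
  return {
    id: data.voiceId ?? undefined, // undefined when the voice was deleted
    name: data.voiceName,          // snapshot survives renames and deletes
  };
}
```

This is exactly why the name is denormalized: the UI can always render something meaningful even when the relation is gone.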
:10 The default values can stay the same. This is the same — the text input panel can stay the same, the settings panel can stay the same, the text-to-speech form can stay the same. So right now we can't really try this, simply because generate speech cannot do anything. But if you go ahead and go to /text-to-speech/123, you should get an error.
:43 Let me just see if it's what I expect. Okay, yeah, it's an error because it cannot find the getById result, but it doesn't throw the generic not-found. That's what I wanted to check, okay? So we can't really test it yet, but basically what we're doing is almost exactly the same as the text-to-speech view page, with the difference that we are not using initial values — instead, we are loading the values from the database. That's the biggest difference here.
:15 And we don't need the default text-to-speech values import here at all. And we're not yet using the generation voice. So what I think we should do next is the following. Let me go ahead and go here... I think we should test this out first, so let's go inside of source/features/text-to-speech/components, and let's go inside of the text-to-speech form. Right now nothing happens if we submit, right? So I want to do this first so we can test whether what we developed works. So `toast` from sonner is an import that we need.
:58 `useRouter` from next/navigation is something we need. Then we need `useMutation` from @tanstack/react-query. We also need `useTRPC` from @/trpc/client. And then let's go ahead and add this. Let me just go ahead and see where we should do this.
:27 Inside of the text-to-speech form — this is where we should do that. So, trpc, then let's add the router. And then let's go ahead and add createMutation by calling useMutation, and in here we're going to call trpc.generations.create with empty mutationOptions inside.
:50 All right. And then inside of the onSubmit, we can delete this comment and open a try/catch. Inside the catch, let's first define the potential error message: depending on whether it's an instance of Error, we can either load the error message directly or fall back to a generic "Failed to generate audio", and then toast that error message. That's the catch. Now let's do the try: we're going to call mutateAsync so we can await it. createMutation uses generations.create, which we just wired up to the text-to-speech service using the OpenAPI specification, and in here we just have to pass the parameters: text is values.text.trim(), and voiceId is values.voiceId — and yes, we need to add values here. Okay. Then we need the temperature.
:02 Then we need topP. After that we need topK, and after that the repetition penalty. And then we're going to show a toast: "Audio generated successfully". And using the data from this async mutation, we can push the user to /text-to-speech/[data.id], which will redirect the user to the generation ID, where we're going to load that generation and pre-fill all of those values.
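The catch-branch logic above boils down to one small helper (a sketch — only the fallback wording is taken from the walkthrough):

```typescript
// Extract a human-readable message from an unknown thrown value,
// falling back to a generic toast message when it isn't an Error.
function getErrorMessage(error: unknown): string {
  return error instanceof Error ? error.message : "Failed to generate audio";
}
```

In the catch block this becomes `toast.error(getErrorMessage(error))`, so server-provided messages surface to the user while anything unexpected degrades to the generic text.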
:42 So I think we should be able to try this now. So I want you to prepare the following: have `npx prisma studio` running — and right now we have no generations, right? So I'm going to type "hello world" and simply select Aaron, and maybe do some modifications. I'm going to do some drastic ones, just to easily recognize them and see if it's working.
:05 Okay, like this. And I'm going to click generate speech. This will probably take a while, okay, because it's actually doing the entire thing right now. So it's currently uploading this to...
:21 It's communicating with Modal right now. You can see it's pending. So right now it's trying to generate our "Hello World". So if you get any errors, maybe open Modal so you can see why it fails — for example, if you added an incorrect API key inside of the Chatterbox client, or if you maybe misspelled or used an incorrect base URL. Although this should definitely work if you managed to create the types, right?
:57 So it's actually building that now, and then it's going to upload it to Cloudflare as well. So have Cloudflare open too, and in here, in the resonance app, you will see a new folder called generations, and you will actually be able to listen to the result. So basically what we previously tested in the command line interface — let me show you that Chatterbox test, which was this, right? We tested this, and we tested this cURL command. Oh, it works. So, okay, I have to just stop here.
:39 We just basically created a UI for this — that's what we just did. We created a TRPC procedure and a UI for that. So now you can see — look at my URL: I am on localhost:3000/text-to-speech/[generation ID].
:57 And in here, I have "hello world", I have Aaron, and I have these dramatic changes, right? And if I refresh, you can see that I'm on the exact same result. So if I go back to the dashboard and then back to text-to-speech, it's completely empty. But if I go inside of my Prisma Studio now, inside of Generations, find the generation ID, and then append that generation ID to the URL, I can load that previous generation. Not only that, but when I refresh Cloudflare... hopefully... there we go.
:33 The generations folder for this specific organization — someone just created an audio. Go ahead and download it, and it should say "Hello World" in Aaron's voice. So now I'm going to go ahead and try another voice style just to see if it works — and yes, subsequent requests should be faster. You can see that this one is faster.
:57 Let me refresh. There we go — the voice style works. Beautiful! So what we have to do now is create some kind of basic audio display here, and then we're going to change it to a full-on waveform visualizer. Let's develop the voice preview panel, which will appear here instead of this placeholder for generations which have an audio.
:24 So I'm going to go ahead and close all of these other files. I'm going to go inside of source/features/text-to-speech/components, and I will create voice-preview-panel.tsx. Let's go ahead and mark this as "use client". Then let's add some imports: useRef, useState, useEffect from React, and the Pause and Play icons from lucide-react.
:52 For the components, we're going to need a Button and a VoiceAvatar. Sorry — the Button is from shadcn/ui; VoiceAvatar is our custom component which uses the useVoiceAvatar DiceBear collection hook. Okay, so make sure you use that one. Let's create an interface: VoicePreviewPanelVoice will accept an optional id and a name.
:20 So let's go ahead. That's specifically for the voice prop — if you forget what I mean here: in the views, this id and name. That's what we'll be accepting for the voice prop. But the actual VoicePreviewPanel function, which we are going to export now, will have some different props.
:47 So let's go ahead and define the types first. It's going to accept audioUrl, then the voice — VoicePreviewPanelVoice or null — and finally the text. And then let's go ahead and extract those three. Perfect. And let's go ahead and open up the function.
:12 Inside of here, we're going to define selectedVoiceName by checking if we have voice.name, otherwise falling back to null. We're gonna do the same thing with selectedVoiceSeed by checking if we have voice.id, or fall back to null. We're then gonna create an audioRef, which has a type of HTMLAudioElement or null by default, using the useRef hook. We're going to create a state which will keep track of whether the audio is playing or not — by default it will not be playing — and we're going to create a temporary useEffect here which will be used to play audio.
:54 Later this will be moved to the WaveSurfer hook, but for now let's just create a super simple audio player. So if the audio is missing, let's stop early. Then let's add a very simple handlePlay that sets isPlaying to true, a handlePause that sets it to false, and a handleEnded that sets it to false as well. And for each of those functions we can add an equivalent listener.
:23 So addEventListener for play, pause, and ended, with their respective functions. And in case any error appears, we can just catch that. And then make sure that you always clean up after yourself — we have to remove all the event listeners we've just added and the bindings to their functions: play, pause, and finally ended.
:51 And in here we need audioUrl in the dependency array. All right. Now let's go ahead and develop the togglePlayPause function. Inside of here, again, let's check if we have audio, and a simple toggle will follow: if isPlaying, audio.pause(), else audio.play() — as simple as that. Then let's go ahead and do a return here. We're going to start with a div which will have the following class names: hidden, h-full, flex-1, flex-col, gap-8, border-t, and lg:flex. Then we're going to develop the header, which is going to have a container of p-6 pb-0 and an h3 element with font-semibold and text-foreground, with the text "Voice Preview". All right.
:53 And then let's go ahead and add a very simple content area here, with relative, flex, flex-1, items-center, justify-center. This is some weird error — I don't think it should matter. Yeah. And in here we display an audio element with the ref audioRef and the source audioUrl.
:13 OK. And then we just have to develop the footer. The footer, again, is going to have a div with flex, flex-col, items-center, and p-6. Then inside of it we're going to have a grid with w-full and three columns inside. The first column will be taken by the metadata about the audio we are playing.
:37 The metadata will have a div with flex, min-w-0, flex-col, and gap-0.5, and inside of there we're going to display the text within a paragraph. The paragraph will be truncate, text-sm, font-medium, and text-foreground. And now in here we have to see if we have a selected name. So if we have a selectedVoiceName, in here we're going to open a container once again. Let me go ahead and close it.
:07 This will be a div with flex, items-center, gap-1, text-xs, and text-muted-foreground. In here, we're going to use the VoiceAvatar: we're going to pass along the seed by using either selectedVoiceSeed, if we have the voice id, or falling back to selectedVoiceName. For the name we're going to use selectedVoiceName, which uses the snapshot of the name, and the class name shrink-0. And then let's go ahead and render the voice name. This is basically why we are doing all of that snapshotting: if someone loads this component but the voice is deleted, we always have a snapshot of the voice name, so we can always display something here. And then the last thing we need to do here is the player controls.
:02 Okay. The player controls are going to have a div with flex, items-center, justify-center, and gap-3. Let's add the first button here, which will have a variant of default, size icon-lg, a class name of rounded-full, and an onClick of togglePlayPause. Depending on whether isPlaying is true or false, we're going to display the respective icons: if isPlaying, render Pause, otherwise render Play — both with the same class name, fill-background. And for now, after this, just add a simple spacer. Okay, we don't have the component for this yet, so just a single self-closing div, like that. Great.
:57 Now let's go ahead and render this. So we're gonna go back to the view — the text-to-speech detail view — and instead of rendering the voice preview placeholder, we're going to render VoicePreviewPanel. So go ahead and import that from components/voice-preview-panel. You can remove the placeholder now, and we're gonna have to add some props here. The props are gonna be audioUrl coming from data.audioUrl, voice from generationVoice, and text from data.text. Keep in mind I'm doing this in the text-to-speech detail view.
:40 Don't accidentally do it in the text-to-speech view — in there it needs to stay the placeholder, and in there you don't have the loaded generation; you don't have data. Data is the generationQuery result — the getById result, okay? That's why we have all of those things here.
:59 And then on desktop mode, you should be able to see this. So if I go ahead and do, I don't know, hello there, how are you, and go ahead and click generate speech, I'm going to pause, you should see the text here and you should be able to listen to the result. And I just confirmed that this works as expected. You can see the text is here, the voice is here, and when I click play I can hear this exact text. Great!
:30 The only problem is it doesn't work on mobile. So mobile has no way of seeing any of those things. So let's go ahead and develop a very, very similar component called VoicePreviewMobile. So I'm going to copy voice-preview-panel, paste it, and rename it to voice-preview-mobile. Then in here, I'm going to change VoicePreviewPanelVoice to VoicePreviewMobileVoice.
:03 And I will change this prop right here. I'm going to change this to be the VoicePreviewMobile function export. It will still accept audioUrl, voice, and text — so all of this is good. One important difference here:
:19 we're going to add the useIsMobile hook. So import it from hooks/use-is-mobile right here. selectedVoiceName and seed stay the same, audioRef stays the same, isPlaying stays the same, the useEffect stays exactly the same — but let's also do the following. Let me just see... this should be in reverse, so let's do audio.pause() and set audio.currentTime to zero. Okay, so kind of in reverse here. Then let's go ahead and see.
:00 We have to add one more useEffect here, after this one, for mobile: if it's not mobile, make sure that the audio is always paused so you don't hear double audio. Okay. Then in togglePlayPause, it should be normal — nothing special here. And in general, if audioUrl is not available, just don't display this component.
:23 Now let's go ahead and change the outer div here. So let's change it to be border-t, p-4, and lg:hidden. Then immediately here we're going to render the audio element, and we can remove everything else inside for now — we will copy some things back if they're similar. So let's go ahead and create a little grid here.
:48 So grid, grid-cols using this specific value, items-center, gap-4. Then let's go ahead and do a width reset using min-w-0. I'm going to go ahead and render the current text of this generation in truncate, text-sm, and font-medium. Then let's render the selectedVoiceName if we have it. The container for the voice will be the following: a div with class names mt-0.5, flex, items-center, gap-1, text-xs, and text-muted-foreground. Let's render the VoiceAvatar in here with the seed, which is either the voice seed or selectedVoiceName, and the class name shrink-0, and next to it we're going to render selectedVoiceName within a span which is truncated. All right, and then outside of this div we're going to render a new div with flex, items-center, and gap-2.
:54 And you can go ahead and go inside of the voice preview panel here, and you can copy this button — just paste it here now. So the button's variant will be default. Let me just fix this and this. Let me see if I did this correctly.
:12 I did. Okay. The size will not be icon-lg; it's just going to be icon. onClick will be the same.
:20 isPlaying will be the same. Great, that's it. Now that we have VoicePreviewMobile, let's go inside of the text-to-speech detail view, and beneath the VoicePreviewPanel you should now also add a VoicePreviewMobile — or above it, it really doesn't matter. Import this.
:47 There we go. And it looks like something is off here, because I can see it immediately here — the voice preview mobile looks like it doesn't have any spacing. So let's go inside of voice-preview-mobile to see exactly what's happening here. So border-t... oh, hidden, p-4.
:15 Okay, I mistyped something with the lg: class. There we go. And when you click play, it should do the exact same thing. Brilliant. So now it works both on desktop and mobile.
:31 Great. What we should do now is add the audio waveform visualizer for desktop mode and enable the download button. Now let's go ahead and install a package called wavesurfer.js. So `npm install wavesurfer.js` — and I'm going to show you what version I'm working with: in package.json, wavesurfer.js 7.12.1. Again, you don't have to be on the same version, but in case something is broken for you, it could be that some breaking change has been introduced.
:10 Now we're going to go and create a hook for initializing WaveSurfer. So we're going to go inside of source/features/text-to-speech, create a hooks folder here, and create use-wavesurfer.ts. Now, keep in mind that this will mostly be rebuilding what we just created here in the voice panels, right? But using the WaveSurfer library. So feel free to copy this file from the source code, simply because I'm not sure there's as much learning value here as there is just, you know, setting it up. But however you prefer — I will still go through the file to explain what we're doing.
:58 So we need useCallback, useEffect, useRef, and useState. Let's get WaveSurfer from wavesurfer.js and useIsMobile from hooks/use-is-mobile. Then let's define UseWaveSurferOptions to be an optional url and an optional autoplay boolean, as well as optional onReady and onError functions. We're then going to create an interface, UseWaveSurferReturn, and in here let's add containerRef with React.RefObject<HTMLDivElement | null>, isPlaying, isReady, currentTime, duration, and then some functions.
:42 These functions are going to be togglePlayPause, seekForward for a specific number of seconds, and seekBackward for a specific number of seconds as well. Using that, we can export a function, useWaveSurfer. We can destructure the UseWaveSurferOptions into url, autoplay, onReady, and onError, and we can define the return type. Now, for the moment I'm just gonna remove the return statement simply so it doesn't error so much, but later we're gonna add it back. Let's define the containerRef using useRef, and the waveSurferRef again with useRef — the container one is an HTMLDivElement; the other holds a WaveSurfer instance or null — and our useIsMobile hook instance.
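As a reference, here is the hook's public surface as just described, sketched as a TypeScript interface with a factory for the initial values. The `containerRef` type is loosened to keep this sketch self-contained; in the real hook it would be a `RefObject<HTMLDivElement | null>`:

```typescript
// Sketch of the return shape of useWaveSurfer as described in the walkthrough.
interface UseWaveSurferReturn {
  containerRef: { current: unknown }; // RefObject<HTMLDivElement | null> in the hook itself
  isPlaying: boolean;
  isReady: boolean;
  currentTime: number;
  duration: number;
  togglePlayPause: () => void;
  seekForward: (seconds?: number) => void;  // defaults to 5 in the walkthrough
  seekBackward: (seconds?: number) => void; // defaults to 5 in the walkthrough
}

// Initial values before WaveSurfer has loaded anything.
function initialWaveSurferState(): UseWaveSurferReturn {
  return {
    containerRef: { current: null },
    isPlaying: false,
    isReady: false,
    currentTime: 0,
    duration: 0,
    togglePlayPause: () => {},
    seekForward: () => {},
    seekBackward: () => {},
  };
}
```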
:34 Then let's define all the states that we're going to need: isPlaying, isReady, currentTime, and duration. The first two are booleans, the latter two are numbers. Then let's create a useEffect. The useEffect will control WaveSurfer in a similar manner as our useEffect did in the voice preview panel.
:59 So first things first, let's check if we can mount the waveform anywhere and if we have the URL. If we don't, we can do an early return. If we do have it, let's go ahead and make sure that we reset everything by destroying and restarting the instance. Then we can go ahead and set destroyed back to false.
:20 And then we can go ahead and create the WaveSurfer instance. In here, we're going to select which container we should mount WaveSurfer to, and then we're going to add some colors to it. So I'm choosing these colors for waveColor, progressColor, and cursorColor simply because they mostly match what we have in our global CSS. And then we set the following things.
:47 The cursor width, for example. Then we set barWidth to 2, barGap to 2, then barRadius and barMinHeight. height we set to "auto".
:05 And normalize we set to true. Then let's go ahead and assign the WaveSurfer instance we just created above to the waveSurferRef, and let's add some events. The first event will be the WaveSurfer "ready" event: we should setIsReady to true and setDuration to waveSurfer.getDuration(). In here we're gonna add the following thing.
:37 If autoplay, let's go ahead and do ws.play() with an empty catch. The reason we're doing that is because otherwise you will have errors in your console — we catch the NotAllowedError thrown when the browser blocks autoplay without user interaction. So this is a simple fix for that. And then beneath that, let's just go ahead and execute the onReady callback.
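Collecting the options mentioned above into one object — only the values actually stated in the walkthrough are filled in; the colors, cursor width, bar radius, and bar min height are project-specific, so they're left as comments rather than guessed:

```typescript
// Sketch of the WaveSurfer.create(...) options as described.
const waveSurferOptions = {
  barWidth: 2,
  barGap: 2,
  height: "auto" as const, // WaveSurfer v7 accepts "auto" here
  normalize: true,
  // container, waveColor, progressColor, cursorColor,
  // cursorWidth, barRadius, barMinHeight: fill in with your project's values
};
```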
:02 Then we have to go ahead and assign all the other WaveSurfer events. On play, pause, and finish, we are changing the state accordingly: on play, setIsPlaying to true; on pause and finish, to false. So, very similar to what we did here in the voice preview panel — we're just adjusting it for this package.
:23 There is another one here which is for duration: on "timeupdate", setCurrentTime. Then let's go ahead and do the error scenario: ws.on("error") — if destroyed, do an early return; otherwise, log the error and trigger the onError callback with new Error(String(error)). Then let's go ahead and load the URL: ws.load(url), and let's catch the error. Again, if destroyed, do an early return; otherwise, log and call onError, which is identical to the one above.
:58 And then we need to do the cleanup. The cleanup is very simple: in the returned function, set destroyed to true — which will prevent all subsequent loads or errors — and call ws.destroy(). And the dependencies array should be the following: url, autoplay, onReady, onError, and isMobile.
:19 Then let's go ahead and define a function togglePlayPause with useCallback: waveSurferRef.current?.playPause(). Then let's define a seekForward function. seekForward will default its seconds parameter to 5. It will read waveSurferRef.current — if it doesn't exist, it's going to do an early return — and it's going to calculate the new time based on the current time of the WaveSurfer instance, adding the seconds to it, and passing the total duration as the second argument of Math.min so it cannot go above that. And then it will use waveSurfer.seekTo with the new time.
:04 Now let's go ahead and do an equivalent seekBackward, which is a very similar function — it just works in the opposite way: we use Math.max and we subtract the seconds instead of adding them. And then again we just call ws.seekTo. We're using useCallback here so we memoize these functions.
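The seek math in both functions is the same clamp. As a pure sketch (note that WaveSurfer v7's `seekTo` takes a 0–1 progress value, so the clamped time is divided by the duration before the call):

```typescript
// Clamp the seek target into [0, duration] seconds.
function clampSeekTime(current: number, deltaSeconds: number, duration: number): number {
  return Math.min(Math.max(current + deltaSeconds, 0), duration);
}

// Convert an absolute time into the 0..1 progress value seekTo expects.
function toProgress(time: number, duration: number): number {
  return duration > 0 ? time / duration : 0;
}
```

seekForward passes a positive delta (+10 in the UI buttons), seekBackward a negative one; the clamp guarantees you can never seek past either end.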
:27 And the last thing we have to do is return all of those values: containerRef, isPlaying, isReady, currentTime, duration, togglePlayPause, seekForward, and seekBackward. Now let's head back inside of the voice preview panel — source/features/text-to-speech/components/voice-preview-panel. And in here we can now remove Pause and Play, and we can remove useRef, useState, and useEffect.
:55 Actually, let's do the following: the only thing we're gonna leave here is useState, and alongside Pause and Play we're gonna add some more icons — Download, Redo, and Undo. Then let's also add format from date-fns here. Let's import the Badge component from components/ui/badge, the Spinner component from components/ui/spinner, and the cn util. Then let's import this new hook we just built: useWaveSurfer from hooks/use-wavesurfer.
:43 So we have to go outside of the components folder, inside of hooks/use-wavesurfer — that's where it's located. Then let's develop a very simple formatTime function here: it accepts the seconds and returns a string. So let's return format, passing in new Date(seconds * 1000) and this specific format string — we're using format from date-fns. Then let's add one more piece of state to the voice preview panel: isDownloading and setIsDownloading, here in the useState. Then we're gonna remove audioRef, remove isPlaying, and remove the entire useEffect. Okay, then go ahead and remove the entire togglePlayPause, and instead let's add useWaveSurfer: for the url, pass along audioUrl, and set autoplay to true. Then in here let's get containerRef, isPlaying, isReady, currentTime, duration, togglePlayPause, seekBackward — but also seekForward, which I don't seem to have. I will see why that is happening.
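The video uses date-fns' `format` for this helper; an equivalent dependency-free version (my sketch, not the course's exact code) looks like:

```typescript
// Format a duration in seconds as m:ss, e.g. 65 -> "1:05".
function formatTime(totalSeconds: number): string {
  const whole = Math.floor(totalSeconds);
  const minutes = Math.floor(whole / 60);
  const seconds = whole % 60;
  return `${minutes}:${String(seconds).padStart(2, "0")}`;
}
```

Either version works for the time display; the date-fns route just reuses a dependency the project already has.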
:12 Oh, we do have seekForward — okay. And that should be it. Now let's go ahead and add a function to download a file. So handleDownload is going to trigger setIsDownloading(true). It will generate a safe name based on the text prop: it's gonna slice to 50 characters and trim it.
:43 It's going to use the following regex to turn it into a safe file name — it will replace any characters that file systems don't support and remove them. Otherwise, it's just going to fall back to "speech". If you don't want to write these regexes, you can just set the safe name to always be "speech". Right — this is just so the user sees exactly what this file is about. And the way the download is going to work is by creating a mock anchor element, appending it, and simulating a click on it.
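One plausible version of that safe-name logic — the exact regex in the course may differ, so treat this as an assumption-laden sketch:

```typescript
// Build a file-system-safe download name from the generation text:
// first 50 chars, trimmed, with characters Windows/macOS/Linux reject
// stripped out; fall back to "speech" when nothing usable remains.
function toSafeFileName(text: string): string {
  const base = text
    .slice(0, 50)
    .trim()
    .replace(/[<>:"\/\\|?*\u0000-\u001f]/g, "") // invalid on common file systems
    .replace(/\s+/g, " ");                      // collapse whitespace runs
  return base.length > 0 ? base : "speech";
}
```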
:26 So we're gonna add it to the body, click on it, and then immediately remove it. And then after one second, we're going to set isDownloading back to false. All right. Now let's go ahead and fix everything we need here. So in the content here, we're gonna replace this audio element with an isReady check, like this.
:55 So if it's ready, what it's gonna do is render a div with absolute, inset-0, z-10, flex, items-center, and justify-center, and then render a Badge component. This Badge component is going to have the following class names: gap-2, bg-background/90, px-3, py-1.5, text-sm, text-muted-foreground, and shadow-sm, and inside we're gonna render a Spinner with size-4 and a span with "Loading audio". And then outside of this — my apologies, I told you this is isReady; this is supposed to be not ready, so make sure you have an exclamation point here. This is like "we are loading your audio". Otherwise, we're going to render the container — the containerRef — here, where WaveSurfer will be initialized.
:06 So this WaveSurfer container will have the following class names: w-full, cursor-pointer, transition-opacity, and duration-200. And if it's not ready, it will simply have opacity-0. The reason we cannot hide the entire element using the isReady conditional is because this element needs to be mounted somewhere — otherwise, WaveSurfer will never load. That's why we have to do it with opacity. Let's see what we have to do next.
:44 So outside of this div here, we have to add the time display. So I'm gonna add flex, items-center, and justify-center. I'm gonna add a paragraph here, which will be text-3xl, font-semibold, tabular-nums (because the numbers will change), tracking-tight, and text-foreground. We're going to display the current time, then an empty space, and render a span with text-muted-foreground as the divider, with formatTime.
:26 So I'm gonna add this here: formatTime(duration). In the footer here, the majority of this will stay the same. Let's go inside of the player controls to see what needs to be changed here. All right, so the player controls will change — we're going to introduce some new buttons.
:47 So the first button here is going to have a variant of ghost, size icon-lg, class name flex-col, and an onClick that seeks backward 10 seconds. It's going to be disabled if the audio is not ready. It's going to use the Undo icon with class names size-4 and -mb-1, and a span with text-[10px] and font-medium containing "10". So we can go ahead and copy this, go after the togglePlayPause button, paste it there, and change it to seekForward. And I think we don't have to change much, except use the Redo icon here.
:29 Everything else should be exactly the same. And then one last thing: instead of this spacer here — we no longer need the spacer — we're gonna add the download control: flex, justify-end, a button with variant outline, size sm, an onClick of handleDownload, disabled while isDownloading, a Download icon, and the "Download" text. So let's go ahead and see this now. Beautiful — you can click anywhere, you can listen to it. Let me refresh to see this.
:04 If you want to, you can add some different prompt. Let me go ahead and try and do something here. I'm gonna go to text-to-speech. I'm gonna add a longer prompt and choose some deep voice and I'm gonna click generate speech to test it out. And here we have the result.
:21 You should be able to change the specific time. You should be able to seek forward. You should be able to seek backwards. You should be able to pause, play. You should be able to download the file, go ahead and play it to see if it works.
:38 And now we just have to do the same thing on mobile — but not all of it, just the download button. Okay, because thankfully for us, mobile doesn't have the capacity for all of it. So let me just confirm that's the only thing we have to do... let me go ahead and find voice-preview-mobile. Yeah. So let's go inside of VoicePreviewMobile. The only thing we're gonna do is add the Download icon to the lucide-react import. Then we're going to copy the entire handleDownload function from here and add it anywhere — I'm gonna do it before this early return.
:28 Let me see. We don't need setIsDownloading. You can remove the setTimeout. So this one will be simpler. Okay.
:36 And then we just have to render the button somewhere. So let me go ahead and see where we should do that. Let's do it here: flex, items-center, gap-2. So right before the togglePlayPause button, we're going to add a button with variant ghost, size icon, and an onClick of handleDownload.
:56 And in here, render a download icon. There we go. Right next to it, There's a download button and it works. Beautiful. Amazing, amazing job.
:07 So go ahead and feel free to play with this. This is, I would say, the main point of the project. I will test this myself to see if there are any visible bugs, but I think we did a pretty good job with this. I love how this waveform looks. I like that it's clickable and that you have these cool controls.
:31 Great! Amazing, amazing job. Let's go ahead and let's properly test the linting and everything. So npm run lint, npm run build and see if it works. Great, this works just fine.
:49 Now here's one tip for you. Inside of the Chatterbox text-to-speech Python file, we kind of leak our bucket name and account ID. When I say leak, I mean to GitHub — nowhere else. So if this is a private repository, you can keep the Chatterbox text-to-speech file included. If it's not, you might want to add chatterbox tts.py to your .gitignore.
:20 Okay? So if it's a public repository you probably want to hide it. If it's a private repository like mine it doesn't matter. Right? I mean as I said no one can really do much without your secret.
:35 And that's the most important part here. But you shouldn't even give them a chance to try and brute-force your account ID. Still, it should work. This file right here can even be maintained in a separate repository — you don't have to keep it in this one.
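If the repo is public, the simplest route is a .gitignore entry. The file name below is assumed from the walkthrough — match whatever you actually named the deployment script:

```gitignore
# The Modal deployment script embeds the R2 bucket name and account id —
# keep it out of a public repo
chatterbox_tts.py
```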
:52 I'm keeping it here for simplicity's sake. Obviously, as this file grows, if you want to do proper FastAPI, you can build a whole file and folder structure around it. So now I'm going to branch for chapter six: git checkout -b 06-tts-generation-and-audio-player, and git push -u origin 06-tts-generation-and-audio-player. Great!
:34 And let's go ahead and open a pull request here and review the changes. Here we have our finished CI/CD. So, the summary: we added text-to-speech generation with customizable voice parameters and audio controls, audio playback functionality with optimized interfaces for mobile and desktop devices, download functionality, and generation detail pages to view and manage text-to-speech outputs. We also have a successfully deployed Railway instance, though if you try to do anything on this web instance, it will instantly fail. The reason is we need to add the new variables here.
:17 So I would recommend that you do that if you deploy it on Railway the same way I did. I'm going to go inside of .env here, and let me just see everything that I am missing. I'm missing a lot here, so I'm going to go ahead.
:36 It feels like I had more of them before; let me check if I'm on the correct one. What I'm going to do is just copy the entire .env file and paste it here; I'll skip removing anything and keep everything in here. But make sure you have SKIP_ENV_VALIDATION set to true, because that's not part of our .env. So I'm going to click Update Variables here and then click Deploy once again, just so this works on the deployed side. Now, about the CodeRabbit comments.
:13 So this one is a major one, which again I'm not too sure about, because ours works. It says that the CORS configuration is invalid because we allow all origins while also setting allow_credentials to true, and per the CORS specification, browsers will reject responses with this combination. Since authentication here is API-key based, not credential/cookie based, it suggests setting allow_credentials to false. Okay, I will look into this and see if we have to modify it for this case.
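The rule CodeRabbit is citing comes from the CORS protocol itself: a wildcard allowed-origin cannot be combined with credentials, and browsers enforce that on the response. Our server side is FastAPI/Python, so this is just an illustration of the rule in TypeScript, not our actual middleware configuration:

```typescript
// Illustration of the CORS constraint: a wildcard origin ("*") together with
// allow-credentials is the combination browsers reject.
interface CorsConfig {
  allowOrigins: string[];
  allowCredentials: boolean;
}

function isBrowserAcceptableCors(cfg: CorsConfig): boolean {
  const wildcard = cfg.allowOrigins.includes("*");
  // Valid unless we pair the wildcard with credentials.
  return !(wildcard && cfg.allowCredentials);
}
```

Note that server-to-server calls (like our tRPC procedure hitting the TTS API with an API key) never go through a browser's CORS checks, which is why everything still works today.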
:46 I think ours works because we access it through the API, but I will verify that for you. It also suggests some additional Python changes here, but as I said, I'm not too familiar with Python, so I'm really not sure; all I know is that the code currently works. In here, it's detecting that we should handle these awaited signed URLs in a better way, which is basically the message I've been trying to get across every time we do anything R2-related: in production, at scale, you should probably do this with some kind of retry logic in some kind of background job.
:25 There are also some accessibility issues we should probably be aware of; these buttons do need accessibility attributes. Same thing here. And in here, we should probably do this: reset the player state and clear the ref when the URL is missing.
:42 So that's a potential edge case in the useWavesurfer hook. In here, it says our guard for the seek operation when duration is 0 is incorrect: the ratio can end up as Infinity, which would break the API. So I will look into that as well. And this is a good one.
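Both of those hook suggestions can be sketched in a few lines. These are hypothetical shapes, the state fields and function names are assumptions and not the actual useWavesurfer code:

```typescript
// Hypothetical sketches of the two suggested useWavesurfer fixes.
interface PlayerState {
  isPlaying: boolean;
  currentTime: number;
  duration: number;
}

// 1) When the audio URL is missing, reset to a clean state instead of
//    leaving stale playback values (and clear the instance ref alongside).
function nextPlayerState(url: string | null, prev: PlayerState): PlayerState {
  if (!url) return { isPlaying: false, currentTime: 0, duration: 0 };
  return prev;
}

// 2) Guard seeks against duration === 0: seconds / 0 is Infinity (or NaN for
//    0 / 0), while wavesurfer's seekTo expects a ratio in [0, 1].
function safeSeekRatio(seconds: number, duration: number): number {
  if (!Number.isFinite(duration) || duration <= 0) return 0;
  return Math.min(Math.max(seconds / duration, 0), 1);
}
```

The clamp also covers the case where a seek target lands past the end of the clip.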
:01 Yes, the Chatterbox client is a server-only lib, so we should definitely use the server-only package for it. That would be a very, very good fix; this is definitely a major issue, because otherwise this code can leak into client files. Very, very good thing. In here, it looks like we're not cleaning up uploaded R2 objects in case the post-upload database write fails.
:27 So we're then left with an orphaned R2 object. We had this discussion, right? Again, this should be handled in some background job with some retries and things like that. Next: add a timeout to the upstream text-to-speech generation call. Okay, I'm not sure what this is.
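For reference, both of those last two comments are common patterns. Here is a hedged sketch of each; every name is an assumption and the storage/DB calls are injected, this is not our actual tRPC procedure code:

```typescript
// 1) Compensating delete: if the DB write after an R2 upload fails, best-effort
//    delete the object so it isn't orphaned. In production, hand the retry to
//    a background job instead of relying on this inline cleanup.
async function uploadWithCleanup(
  putObject: (key: string) => Promise<void>,
  deleteObject: (key: string) => Promise<void>,
  insertRow: (key: string) => Promise<void>,
  key: string,
): Promise<boolean> {
  await putObject(key);
  try {
    await insertRow(key);
    return true;
  } catch {
    await deleteObject(key).catch(() => undefined); // best-effort, swallow errors
    return false;
  }
}

// 2) "Add a timeout to the upstream TTS call" usually means aborting the
//    request after N milliseconds, e.g. via an AbortController whose signal
//    is passed to fetch.
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await run(controller.signal);
  } finally {
    clearTimeout(timer); // don't leave the timer pending on success
  }
}
```

With fetch, usage would look like `withTimeout((signal) => fetch(url, { signal }), 30_000)`, so a hung GPU endpoint fails fast instead of blocking the mutation indefinitely.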
:44 I will research if we need that or not. Other than that, amazing, amazing job. Let's go ahead and merge this; we will look into those major issues in the next one. Then git checkout main and git pull origin main, and that will synchronize our branches.
:04 So in here, we should now be on the main branch, and we should now see chapter six merged to main here in the graph. Beautiful, amazing, amazing job. Great comments by CodeRabbit here, and we have CI/CD building from Railway with updated environment variables. Great!
:32 So I believe that marks the end of this chapter. We are on the main branch and we are ready for chapter 7. Amazing, amazing job!