In this chapter, we're going to learn how to add Firecrawl to our project. And why do we even need it? Well, AI models have something called a knowledge cutoff. That means they only know about things up to the point they were trained, and probably nothing that happened after that. That would include, for example, Next.js 15, Next.js 16, React 19, the newest React use hook, or the new file name for the middleware in the newest version of Next.js.
Those are all real problems when working with AI. So the goal of this chapter is to enable our AI models to improve their knowledge and go beyond their cutoff date by giving them the ability to read anything on the web using Firecrawl. In fact, we already encountered this pattern very early on, so let me give you a quick reminder. In the very first pull request that we created, we noticed this pattern happen, right?
We had this comment from CodeRabbit, which basically warned us: hey, you are using an invalid filename; it's not called proxy, it should be called middleware.ts. I corrected it and said, that's not true, and here's the URL to the newest documentation.
And it agreed with me, because the new name is indeed proxy.ts. So this is a current issue of working with AI models: all of them have a knowledge cutoff date. They do their best to stay up to date through various methods, and this is one of them, allowing users to teach them new things. So there must be some way this amazing tool CodeRabbit somehow read this URL. And while I do not know exactly how they do it, I know how we are going to do it.
Using Firecrawl, which will allow us to turn websites into LLM-ready data. And in fact, I'm going to demonstrate exactly how that works right here. This is the finished project, right? This is what you will have at the end of this tutorial. So let's see how that works.
So I'm going to do the following. What is the current file name of the middleware file in Next.js? And let's see the answer that it gave us. So first of all, it told me this is not a Next.js project. Yes, that is because this AI tool has access to my entire app and this is indeed a Vite project, not a Next.js one.
But still, it answered, and it gave me the wrong information. It says that in Next.js projects, the middleware file is typically named middleware.ts. That is incorrect. So what I'm going to do is copy the URL which demonstrates the new proxy file right here, and I'm going to tell it that this is not correct.
Read this and tell me the new name. So let's see if it will be able to do that. And here we have the answer. You are absolutely right. So the same as CodeRabbit, right?
According to Next.js documentation, starting with Next.js 16, the middleware file has been renamed from middleware to proxy. The documentation states, starting with Next.js 16, middleware is now called proxy to better reflect its purpose. So this is what we will be able to achieve. By default, you are not able to do this just like that. You need a tool like firecrawl to help you achieve this effect.
So that's what we're going to be focusing on. So using the link on the screen, go ahead and create an account with Firecrawl. Once you get to the plan page, feel free to select their free plan. It will be more than enough for this project and in fact you will get even more credits than you think. So just go ahead and click get started on the free plan.
It's more than enough. And then what I want you to do is click on your account here and go into the account settings. So click on Settings, then go to Billing, and in here you will find Apply coupon. Go ahead and enter the Antonio coupon, which will give you a thousand extra credits. So let's just take a look at this.
So successfully redeemed coupon for a thousand credits. You know, you have nothing to lose. Go ahead and enter the Antonio coupon here. So what do we have to do now? Well, I think the best way to use it is by actually opening the documentation as well.
So first things first, we have to establish the Firecrawl singleton, or client, however you want to call it. And for that, we need an API key. So either copy the existing one or create your first one. Then let's install Firecrawl in our project: npm install @mendable/firecrawl-js.
There we go. Now that this has been installed, I'm going to go inside of my source lib folder and create a new file, firecrawl.ts. Inside of here, I'm going to import Firecrawl from our newly installed package, and then I'm going to export const firecrawl with an API key which reads from process.env.FIRECRAWL_API_KEY. Now let's go inside of .env.local, add FIRECRAWL_API_KEY there, copy the key, and paste it in.
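As a rough sketch, that file might look like this. One caveat: the default export name has changed across versions of @mendable/firecrawl-js (older releases use FirecrawlApp), so check the version you installed.

```typescript
// src/lib/firecrawl.ts -- a shared Firecrawl client for the whole app.
// Note: the export name differs between package versions; older releases
// expose `FirecrawlApp` as the default export, newer ones `Firecrawl`.
import Firecrawl from "@mendable/firecrawl-js";

export const firecrawl = new Firecrawl({
  // Reads FIRECRAWL_API_KEY from .env.local -- never hardcode the key.
  apiKey: process.env.FIRECRAWL_API_KEY,
});
```
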
There we go. So we successfully created the Firecrawl client. What we have to do now to test this out is the following. I think the easiest way is by modifying our Inngest function here. Let's make this just a bit more advanced.
So what we're going to do now is add more steps to this background job. First, we're going to allow the user to pass in a custom prompt, and then we are going to use that prompt to extract all URLs the user has pasted. So if the user said, hey, read this URL, we have to specifically extract that URL. Why do we need to extract it? Well, because that is how the Firecrawl API accepts data.
So in their standard features here, we have scrape. Make sure you select Node, because that's what we're going to be using. So this is the npm package which we have installed, and we have this client set up. And basically, this is the function that we are going to call: firecrawl.scrape. We pass in the URL which we extracted from the user's prompt, and we specify the format we want to add to our context.
In our case, Markdown is the one we need. But as you can see, you can do even more advanced things, like scraping to HTML, and much more, which we are going to go through later. For now, let's focus on the simplest and easiest-to-understand feature here. So first things first, let's modify this. Let's add event here, and then let's extract the prompt from the event.
We can get that from event.data. And we are going to define the type of event.data to very simply accept a prompt, which is of type string. There we go. Now let's extract the URLs using await step.run, and let's call the step extract-urls. It's going to be an asynchronous method.
So this is now a separate step. And what we need to define here is a URL regex: /https?:\/\/[^\s]+/g. Or just Google "URL regex" or use AI. Whatever you do, just make sure you have a regex that can match URLs. And what we're going to do now is return prompt.match(urlRegex), or fall back to an empty array.
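The body of that step could be a small pure helper like this; the extractUrls name is just for illustration.

```typescript
// Regex for http:// and https:// URLs: the protocol, then everything
// up to the next whitespace character.
const URL_REGEX = /https?:\/\/[^\s]+/g;

// Pull every URL out of the user's prompt.
export function extractUrls(prompt: string): string[] {
  // String.match returns null when there are no matches,
  // so fall back to an empty array.
  return prompt.match(URL_REGEX) ?? [];
}
```

For example, extractUrls("Read https://nextjs.org/docs/proxy please") gives back just the URL, while a prompt with no links gives back an empty array.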
And this will then be basically an array of strings in the end. So that's what URLs is going to be. Now that we have that, let's go ahead and define our scraped content. And we can do that by doing await step.run scrape-urls. Again, an asynchronous method.
And what we're going to do is get the results by doing await Promise.all over urls.map, so we are iterating over every single URL which we have extracted. For each one, we run await firecrawl.scrape (you can now import firecrawl from lib/firecrawl), pass in the URL as the first argument, and then open an options object with formats set to markdown. From this result, we return result.markdown, or fall back to null in case we were not able to scrape. There we go. And now let's return results.filter(Boolean), which will filter out any of those null, basically unsuccessful, ones. And let's join this with a page break.
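Here's a sketch of that step's body as a standalone helper. To keep it testable without a live API key, the scrape call is injected as a parameter; in the real step you would pass something like (url) => firecrawl.scrape(url, { formats: ["markdown"] }), matching the call described above.

```typescript
// Any function that takes a URL and resolves to an object that may
// carry a markdown field -- e.g. a wrapper around firecrawl.scrape.
type Scraper = (url: string) => Promise<{ markdown?: string | null }>;

// Scrape every URL in parallel, drop failures, and join the pages.
export async function scrapeToMarkdown(
  urls: string[],
  scrape: Scraper
): Promise<string> {
  const results = await Promise.all(
    urls.map(async (url) => {
      const result = await scrape(url);
      // Fall back to null when a page produced no markdown.
      return result.markdown ?? null;
    })
  );
  // filter(Boolean) removes the nulls (unsuccessful scrapes), then we
  // join the surviving pages with a blank line between them.
  return results.filter(Boolean).join("\n\n");
}
```
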
So use "\n\n" as the separator. Great. And now what we can do is structure the final prompt. So the final prompt depends on scrapedContent: if it's available, we're going to do the following.
Context, colon, a line break, the scraped content, then two line breaks, and then the user's question, the prompt. Otherwise, just the prompt alone, because we weren't able to extract any URLs, or maybe there were no URLs to extract. And then, instead of passing the raw prompt here, we pass the final prompt. So what are we doing now?
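Before answering that, the assembly we just described can be sketched as a tiny helper (the buildFinalPrompt name is just for illustration):

```typescript
// Prepend scraped documentation as context when we have any;
// otherwise pass the user's prompt through untouched.
export function buildFinalPrompt(prompt: string, scrapedContent: string): string {
  return scrapedContent
    ? `Context:\n${scrapedContent}\n\n${prompt}`
    : prompt;
}
```
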
We now have three steps. The first step is to extract all URLs from the user's query. The second step is to scrape those URLs using Firecrawl in a markdown format. And then we simply combine that into the prompt, leading us to last and third step to actually generate a response using this new context. So you can do this with of course Google, you can do this with Anthropic, you can do this with OpenAI.
Depending on what model you use, it might know more or it might know less, depending on its knowledge cutoff date. I am assuming from the name of this one that its cutoff date was 2024. I don't know, maybe it is, maybe it isn't. So let's run our project now: npm run dev in here, and in another terminal, npx inngest-cli@latest dev.
I'm just going to expand this so you can see what it looks like in one line. All right, make sure you have all of these running; in fact, you won't even need to see your app. You just need the Inngest dev server, because we can pass data from here. So go inside of Functions, find the demo generate one, and click Invoke. Inside of here, add a prompt, and let's try something simple.
What is 2 plus 2 for example. And let's click invoke function. So right now we are just testing if this works. And this was super fast as you can see. We first attempted to extract URLs and as you can see, no URL is found.
We attempted to scrape URLs, none of them were found and finally we just went ahead and there we go. This is the response, 2 plus 2 is equal to 4. Very simple, right? Basically, our prompt thingy works. So let's try this again.
This time let's pose the same question: what is the name of the middleware file in Next.js? (Let me just use some different quotes here.) So if I am correct, this model shouldn't know that the newest name is proxy.
Depending on when you are watching this tutorial, maybe that's common knowledge for AIs now. But let's see: generate text... Okay, in Next.js there is no specific middleware file; instead, Next.js provides a middleware feature that allows you to intercept... Okay, basically, yes, this is it. This is the middleware file. I guess it just understood me a bit differently. But there we go.
It has no information about the new proxy thing, right? So let's change that. Let's go ahead and let me just prepare the proxy thing. If you want to find it too, go inside of the documentation and just click on proxy here. And then copy the URL.
So let's try again, this time with Firecrawl. Let's give it "here are the docs" and paste the URL. Let's click Invoke function, and this time, as you can see, extract-urls has successfully extracted the URL. Then it scraped the URL, so you can now see all the information from the page in Markdown format.
And then let's see the output. According to the documentation, the file for the middleware functionality in Next.js is now called proxy. So we have successfully extended the context of our very limited model with an early knowledge cutoff. Claude's Haiku is, well, I'm not sure if "the dumbest model" is the correct term to use, but it's meant for very short tasks. So the fact that we were able to extend the knowledge of this very simple model shows you how useful something like Firecrawl is.
And if you weren't impressed enough by its generous free tier and amazing features, did you know it's also open source? That's right, you can actually contribute to Firecrawl yourself. But let's see what other things Firecrawl can do. So we just used the very basic scrape function, right? I think it's self-explanatory what this does.
It can scrape. But it can do so much more than that. For a specific case, perhaps our case, where we need super fast responses within our code editor for suggestions, we might use their faster scraping option. You can see they even thought of that. Just make sure to always click on JavaScript or Node.js so you can see the actual code that you will be using.
They also offer batch scraping for multiple URLs. They offer JSON mode. And this is a cool one. They offer tracking changes on websites. This one is actually super cool.
I can already think of a SaaS you could build around this. Your users could give you URLs of the websites they want to track, and you could build a SaaS with Firecrawl which alerts them every time there's a change, or maybe detects that there is an A/B test going on, perhaps to see how competitors are doing. So you can see how many things they have besides scrape: a stealth mode, proxies, so many other things. And if that wasn't enough, you can also search the web in general. So if you want to, you can extend this even further by not waiting for the user to give us a direct URL, but instead allowing the user to just search for anything they want, like "what is the most up-to-date Next.js version", and we would use Firecrawl's search with that term.
So instead of a URL here, we would search for "latest Next.js version". And we would maybe limit it to the top 3 results so we don't overload the context. What that would do is return 3 items like this, with relevant information about where you can find that info. So if that crossed your mind, you know, "oh cool, but we have to know the exact URL"? Well, no. You can literally tell it, hey, can you search for Next.js proxy, right?
Are there any updates about that file? And it will genuinely find the results just like a Google search would. And then you could extract the URLs from here instead and do the usual process. So it is a very, very advanced tool. If that's not enough, they also offer something called Map.
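Before looking at Map, here is one way the search-then-scrape idea could be sketched. The shape of firecrawl.search and its results is an assumption here (check the Node docs for your version); the result-trimming helper itself is pure:

```typescript
// Minimal shape we assume a search hit to have.
interface SearchHit {
  url: string;
  title?: string;
}

// Keep only the first few hits so we don't overload the model's context.
export function topResultUrls(hits: SearchHit[], limit = 3): string[] {
  return hits.slice(0, limit).map((hit) => hit.url);
}

// Hypothetical wiring inside the background job (API shape not verified):
//   const results = await firecrawl.search("latest Next.js version", { limit: 3 });
//   const urls = topResultUrls(results.web ?? []);
//   ...then feed `urls` into the same scrape-urls step as before.
```
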
Map basically allows you to input a website and get all the URLs on that website extremely fast. And I think there are like a billion things you can build with this. How far they've gone with this is absolutely amazing. And the fact that they're open source means they will only grow more. In fact, in here you can see their GitHub: more than 70,000 stars and an extremely active repository, completely open source.
So I highly recommend that you take a look at this. I am super impressed by everything you can do here. You can even run this locally and you can self-host Firecrawl. Did I mention that? On top of everything else, they offer step-by-step self-hosting guides.
So I hope I explained why we need this: in the very first chapter, when we encountered an AI model through CodeRabbit, you saw how it was limited by its knowledge cutoff date. And one way we can improve that is by allowing it to read the web, which is exactly what we just added to our AI model using Firecrawl. Amazing, amazing job. So what we've done for now is get familiar with Firecrawl, understand what it can do, and grab some free credits.
But later, we're obviously going to plug Firecrawl into the various AI features that I demonstrated at the beginning of this chapter. The agentic chat on the left side will have steps called extract-urls and scrape-urls, and then we're going to have a quick edit where the user will also be able to add their own URL, with the same thing happening there. So that's how we're going to be using Firecrawl in our app. Amazing, amazing job! So let's go ahead and merge these changes.
So, Firecrawl AI. I'm just going to shut down my app. Let's do git add ., git commit -m "04-firecrawl-ai", git checkout -b 04-firecrawl-ai, and git push origin 04-firecrawl-ai. Perfect. Now let's go onto our GitHub repository URL and open a pull request. And this time we don't actually have to review it, simply because this was just a demonstration of Firecrawl, and we are later going to add Firecrawl in its proper place, in its proper functions. So right now it's in a function called demo function, right?
So no need to review that right now, since obviously this will later be added again in a proper function. Because of that, there is no need to go through the entire review process right now; we are going to have the same review later. So let's just go ahead and immediately merge 04-firecrawl-ai. And now we can do git checkout main and git pull origin main.
Let's wait a second and confirm everything is good here. So we are on the main branch, and inside of my source control, in the graph here, I can see that I have... oh, is it 04-firecrawl-ai? Huh, did I make a mistake in the numbering? Oh, this is chapter 5. Okay, my apologies.
This is supposed to be chapter 5. So that's one mistake that I've made, my apologies. Later, when you look at the branches, you will see 04-background-jobs and 04-firecrawl-ai; that second one is supposed to be 05. Okay, one mistake is not going to hurt anyone, but now you know it's supposed to be 05.
Maybe you noticed and corrected it yourself. Either way, not a big deal; let's keep going with this amazing project. So: we demonstrated outdated AI code patterns, we set up Firecrawl web scraping, we successfully extracted URLs from user prompts, and finally, using Firecrawl's scrape function, we enhanced prompts with live documentation.
Amazing, amazing job and see you in the next chapter.