
Using the new AI template to create a chatbot about a website
by: Andrew Lock


In this post I use the new .NET AI Chat Web App template (currently in preview) to create a chat application that ingests the contents of a website (The Modern .NET Show) and uses that data to answer questions in the chat.

This post was partly inspired by a conversation I had with Jamie Taylor at the MVP summit in which he was exploring ways to do exactly this: have a chatbot for discussing the contents of his podcast, The Modern .NET Show. Seeing as Jamie already has full transcripts for the podcast on his website, this scenario seemed like a perfect use case for the new AI template!

The new .NET AI Chat Web App template

In the previous post, I explored the new .NET AI Chat Web App template (currently in preview), walked through the getting started experience, and explored the code it includes.

To install the AI template, you can run the following command:

dotnet new install Microsoft.Extensions.AI.Templates

You can then create a project from the template by running dotnet new aichatweb and providing the required parameters:

dotnet new aichatweb \
    --output ModernDotNetShowChat \
    --provider githubmodels \
    --vector-store local \
    --aspire true

I describe these various options in the previous post, as well as how to configure GitHub models to allow free access for prototyping applications using large language models (LLMs) from OpenAI and others. The generated solution looks like the following:

The solution layout

Inside the solution folder is a README.md file that describes the remaining configuration, such as how to configure the required connection string for GitHub Models:

cd ModernDotNetShowChat.AppHost
dotnet user-secrets set ConnectionStrings:openai "Endpoint=https://models.inference.ai.azure.com;Key=YOUR-API-KEY"

Finally, I showed the working chat application. You run the app by starting the Aspire AppHost project, which launches the web app (and passes in all the required connection strings). The web app then runs an "ingestion" process against two PDF files (about watches) that are available in the content folder. More on this later.

The web app is a "traditional" chat application, just like you've seen with ChatGPT or GitHub Copilot Chat. This interface lets you ask questions about the PDFs that were ingested. In the example below I asked the question "Which watches are available":

Trying out the default template

The chat assistant interprets your question and decides what phrases to search for in the documents. It then answers your question based on the details it finds in the documents, and even provides a link to the file that contains the answer.

This general technique of providing "sources" for the LLM to use, instead of relying on its built-in knowledge, is called retrieval-augmented generation (RAG), and is one way to try to ensure that the LLM provides answers grounded in facts. It involves ingesting source data, encoding it as vectors in a vector store, and making this store available to the LLM.
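
To make that concrete, the query side of RAG looks conceptually something like the sketch below. The SearchNearestAsync and AskLlmAsync helpers are hypothetical stand-ins for the template's vector store search and chat client; only the embedding generator is the real Microsoft.Extensions.AI abstraction.

// Hypothetical sketch of a RAG query: embed the question, retrieve the nearest
// stored chunks, and hand them to the LLM as grounding context.
async Task<string> AnswerAsync(
    string question,
    IEmbeddingGenerator<string, Embedding<float>> embeddingGenerator)
{
    // 1. Encode the question as a vector, in the same way the source data was encoded
    var questionEmbedding = (await embeddingGenerator.GenerateAsync([question]))[0];

    // 2. Find the stored text chunks whose vectors are closest to the question's vector
    //    (SearchNearestAsync is a hypothetical wrapper around the vector store search)
    IReadOnlyList<string> nearestChunks =
        await SearchNearestAsync(questionEmbedding.Vector, top: 5);

    // 3. Ask the LLM to answer using only the retrieved chunks
    //    (AskLlmAsync is a hypothetical wrapper around the chat client)
    var sources = string.Join("\n\n", nearestChunks);
    return await AskLlmAsync($"Answer using only these sources:\n{sources}\n\nQuestion: {question}");
}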

In the previous post I explored the template in more detail, but in this post I take a different approach and customise the template for a slightly different purpose.

Modifying the template to chat about a website

As I mentioned at the start of this post, when I saw this template, I immediately thought of a conversation I had with Jamie Taylor at the MVP summit in which he was exploring ways to do exactly this: have a chatbot for discussing the contents of his podcast, The Modern .NET Show. Jamie already has full transcripts for the podcast on his website, so all we should need to do is tweak the data ingestion details.

I also had to make various tweaks to the UI to show links to URLs instead of PDFs, but those are less interesting so I don't discuss them here. You can see the updated project on GitHub (the modifications are all in this commit).

Creating a new ingestion source

Other than UI changes, the main change needed is to the ingestion code. Instead of ingesting PDF files, the app will ingest web pages from The Modern .NET Show website. To do that, you can implement the IIngestionSource interface which is part of the template:

public interface IIngestionSource
{
    string SourceId { get; }

    Task<IEnumerable<IngestedDocument>> GetNewOrModifiedDocumentsAsync(
        IQueryable<IngestedDocument> existingDocuments);

    Task<IEnumerable<IngestedDocument>> GetDeletedDocumentsAsync(
        IQueryable<IngestedDocument> existingDocuments);

    Task<IEnumerable<SemanticSearchRecord>> CreateRecordsForDocumentAsync(
        IEmbeddingGenerator<string, Embedding<float>> embeddingGenerator, string documentId);
}

As you can see, there are three methods that need implementing, and a property that gives the ingestion source a unique ID. The property is the easy part, so we'll start there. We'll also create an HttpClient instance that we'll use to retrieve data from the website:

public class WebIngestionSource : IIngestionSource
{
    private readonly HttpClient _httpClient;

    public WebIngestionSource(string url)
    {
        // Create a unique source ID based on the type name and the provided URL
        SourceId = $"{nameof(WebIngestionSource)}:{url}";
        _httpClient = new HttpClient()
        {
            BaseAddress = new Uri(url),
        };
    }

    public string SourceId { get; }

    // ...
}

We now have the basic shape of our type, so we can start implementing the required methods, beginning with the one that lists all the pages of the site.

Finding all the pages in the site

To find all the pages on a website, you basically have two options:

  • Crawl the website to find all possible pages.
  • Use a sitemap.xml file if the site provides one.

Luckily for us, The Modern .NET Show site includes a sitemap.xml file, so we can take the easy route and parse that. The following code parses the sitemap.xml file using XmlSerializer, extracting only the values we're interested in: the location (URL) of each page and the date it was last modified:

private async Task<Sitemap> GetSitemap()
{
    var serializer = new XmlSerializer(typeof(Sitemap));
    await using var stream = await _httpClient.GetStreamAsync("sitemap.xml");
    var sitemap = serializer.Deserialize(stream) as Sitemap;
    if (sitemap is null)
    {
        throw new Exception("Unable to read sitemap");
    }

    return sitemap;
}

[XmlRoot("urlset", Namespace = "http://www.sitemaps.org/schemas/sitemap/0.9")]
public class Sitemap
{
    [XmlElement("url")]
    public required List<Entry> Entries { get; set; }

    public class Entry
    {
        [XmlElement("loc")]
        public required string Location { get; set; }

        [XmlElement("lastmod")]
        public required DateTime LastModified { get; set; }
    }
}
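
For reference, a sitemap.xml in the standard format looks something like the following (the URLs here are illustrative, not taken from the real site); the loc and lastmod elements are what the Sitemap and Entry classes above map onto:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/episodes/some-episode/</loc>
    <lastmod>2025-05-01T00:00:00+00:00</lastmod>
  </url>
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2025-04-20T00:00:00+00:00</lastmod>
  </url>
</urlset>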

From this method we get a Sitemap object that includes all the pages available on the site. We'll use that in the GetNewOrModifiedDocumentsAsync() method, which is called by the data ingestion code. This method is passed a collection of IngestedDocument objects, which are the pages ingested during previous executions of the app. Any IngestedDocument you return from the method is queued for subsequent ingestion; the method doesn't do the ingestion itself, it just finds which pages to ingest:

public async Task<IEnumerable<IngestedDocument>> GetNewOrModifiedDocumentsAsync(
    IQueryable<IngestedDocument> existingDocuments)
{
    // Fetch the sitemap for the website
    Sitemap sitemap = await GetSitemap();
    var results = new List<IngestedDocument>();

    // Loop through all of the entries in the sitemap
    foreach (var entry in sitemap.Entries)
    {
        // The "ID" for the page, which is the URL of the page
        string sourceFileId = entry.Location;
        // The "version" for the page, which is the last modified time
        string sourceFileVersion = entry.LastModified.ToString("o");

        // Try to see if we have ingested this page before
        var existingDocument = await existingDocuments
            .Where(d => d.SourceId == SourceId && d.Id == sourceFileId)
            .FirstOrDefaultAsync();

        if (existingDocument is null)
        {
            // If there's no matching document, add this page to the ingestion list
            results.Add(new() { Id = sourceFileId, Version = sourceFileVersion, SourceId = SourceId });
        }
        else if (existingDocument.Version != sourceFileVersion)
        {
            // If we have already ingested this page, but the last modified date has changed,
            // then update the version and add it to the ingestion list
            existingDocument.Version = sourceFileVersion;
            results.Add(existingDocument);
        }
        // Otherwise we have already ingested this version of the document
    }

    return results;
}

After the method is called, all the documents returned are ready to be ingested.

Ingesting new and modified pages

Ingesting a page involves several steps:

  1. Download the page from the website
  2. Parse the HTML for the page
  3. Convert the HTML into plain-text, so that it can be more easily understood by the LLM
  4. Split the page into paragraphs
  5. Generate vector-embeddings for performing RAG for each paragraph

Step 1 is easy, as we can use HttpClient. For step 2, I chose to use AngleSharp, a .NET library for parsing HTML and my go-to for this kind of thing. For step 3, converting the HTML to plain text, a quick bit of googling found this small package called Textify, which seemed to do exactly what I needed: converting an AngleSharp IDocument to plain text.

Step 4 comes courtesy of a utility called TextChunker that's currently built into the Semantic Kernel packages (though I expect these will end up pushed to other abstraction libraries in the future). Step 5 is handled by the template and abstraction libraries provided by Microsoft and OpenAI, in particular the Microsoft.Extensions.AI.Abstractions and implementation libraries.

The code below shows how I achieved all this relatively simply using the abstractions provided by the template:

public async Task<IEnumerable<SemanticSearchRecord>> CreateRecordsForDocumentAsync(
    IEmbeddingGenerator<string, Embedding<float>> embeddingGenerator, string documentId)
{
    // Request the page body for the provided ID (url)
    await using var stream = await _httpClient.GetStreamAsync(documentId);

    // Use AngleSharp to parse the downloaded HTML document
    var config = Configuration.Default.WithDefaultLoader();
    var context = BrowsingContext.New(config);
    IDocument document = await context.OpenAsync(req => req.Content(stream).Address(documentId));

    // Use Textify to convert the HTML to a plain-text document
    string pageText = new HtmlToTextConverter().Convert(document);

    // Split the document into paragraphs using the experimental type from SemanticKernel
#pragma warning disable SKEXP0050 // Type is for evaluation purposes only
    List<(int IndexOnPage, string Text)> paragraphs =
        TextChunker.SplitPlainTextParagraphs([pageText], maxTokensPerParagraph: 200)
            .Select((text, index) => (index, text))
            .ToList();
#pragma warning restore SKEXP0050 // Type is for evaluation purposes only

    // Generate embeddings for all the paragraphs
    var embeddings = await embeddingGenerator.GenerateAsync(paragraphs.Select(c => c.Text));

    // Combine the paragraphs and embeddings, to return a collection of SemanticSearchRecords
    return paragraphs.Zip(embeddings).Select(pair => new SemanticSearchRecord
    {
        Key = $"{documentId}_{pair.First.IndexOnPage}",
        Url = documentId,
        Text = pair.First.Text,
        Vector = pair.Second.Vector,
    });
}

That's the most complex method; the final required implementation handles record deletion.

Finding records to delete

The final method we need to implement is GetDeletedDocumentsAsync(), which returns a list of documents that we previously ingested but which should now be removed from the dataset. In the implementation below we again read the sitemap.xml and extract all the URLs into a HashSet. We then loop through all the existing documents; if a document's ID is not in the set, the page no longer exists on the site, so we return it from the method. These documents will be removed from the vector store.

public async Task<IEnumerable<IngestedDocument>> GetDeletedDocumentsAsync(
    IQueryable<IngestedDocument> existingDocuments)
{
    Sitemap sitemap = await GetSitemap();
    var urls = sitemap.Entries.Select(x => x.Location).ToHashSet();
    return await existingDocuments
        .Where(doc => !urls.Contains(doc.Id))
        .ToListAsync();
}

And with that, we've tackled the majority of the implementation. Note that I made a couple of tweaks to the SemanticSearchRecord type, replacing the Filename property with Url and removing the PageNumber property, to better match our case of ingesting web pages instead of files.
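
For reference, the tweaked record type ends up with roughly the following shape (the template's vector store attributes are unchanged and omitted here for brevity; only the Url property and the removal of PageNumber are my changes):

public class SemanticSearchRecord
{
    public required string Key { get; set; }

    // Replaces the template's Filename property; PageNumber is removed entirely
    public required string Url { get; set; }

    public required string Text { get; set; }

    public ReadOnlyMemory<float> Vector { get; set; }
}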

Configuring the app to use the new ingestion source

The final step we need to take is to update the actual ingestion. This occurs in Program.cs of the web app, just before calling app.Run():

// ... existing setup

// Use the provided DataIngestor, but pass in 
// our custom IIngestionSource, the WebIngestionSource
await DataIngestor.IngestDataAsync(
    app.Services,
    new WebIngestionSource("https://dotnetcore.show"));

app.Run();

With that, we're finished. Time to take it for a spin!

Trying out the new chat app

We can try out our new app by hitting F5, or by running dotnet run on the Aspire AppHost project. This starts the ingestion process, first running GetNewOrModifiedDocumentsAsync() and GetDeletedDocumentsAsync() to work out which pages to ingest or remove, and then running CreateRecordsForDocumentAsync() to ingest each page, for example:

info: ModernDotNetShowChat.Web.Services.Ingestion.DataIngestor[0]
      Processing https://dotnetcore.show/episodes/
info: ModernDotNetShowChat.Web.Services.Ingestion.DataIngestor[0]
      Processing https://dotnetcore.show/season-7/jonathan-peppers-unleashes-code-chaos-how-dotnet-meets-the-nes/
info: ModernDotNetShowChat.Web.Services.Ingestion.DataIngestor[0]
      Processing https://dotnetcore.show/
info: ModernDotNetShowChat.Web.Services.Ingestion.DataIngestor[0]
      Processing https://dotnetcore.show/season-7/google-gemini-in-net-the-ultimate-guide-with-jochen-kirstaetter/
...

This takes a long time (more on that later) but once it's finished, the app starts and we can ask questions about The Modern .NET Show:

Trying out the new app

Just as for the original template, the chatbot includes citations for its answers, including links to the page that contained the answer, and a quote from the page.

There are obviously a lot of tweaks that you could make to the template and the site in general. For example, I increased the size of the quote that the chatbot can include in its citations. I didn't play around with the template much more than that, as the proof of concept seemed to work!

Ingesting web pages is slow

One thing I noticed is that the ingestion process was very slow. It took an average of about 10 seconds to ingest a single page, and when there are ~250 pages to ingest, that adds up to roughly 40 minutes of ingestion time! 😅 I haven't dug into this deeply enough to understand why it's so slow, but I'd be willing to put money on it being the actual embedding step. We generate an embedding for every paragraph in the page, and seeing as that requires a network call, it seems reasonable to think that's the source of the problem.

Another aspect that's a little annoying is that the DataIngestor that ships with the template calls CreateRecordsForDocumentAsync() for all of the pages but doesn't save the results until after it's run for every document. That means that if you interrupt the process for some reason (because it's taking 40 mins perhaps 😉) then you have to start from the beginning again the next time you run the app. To work around that I made a small change to the DataIngestor to save after every 20 documents instead.
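
The change itself is conceptually simple; a sketch of the idea looks something like the following (the real DataIngestor differs in its details, and SaveRecordsAsync is a hypothetical helper wrapping the vector store save):

// Flush the accumulated records every 20 documents instead of once at the very end,
// so an interrupted ingestion run doesn't lose all of its progress.
var batch = new List<SemanticSearchRecord>();
var processed = 0;

foreach (var document in documentsToIngest)
{
    batch.AddRange(await source.CreateRecordsForDocumentAsync(embeddingGenerator, document.Id));

    if (++processed % 20 == 0)
    {
        await SaveRecordsAsync(batch); // hypothetical helper that upserts into the vector store
        batch.Clear();
    }
}

// Save whatever is left over from the final partial batch
await SaveRecordsAsync(batch);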

The final issue is that, by default, the template uses a random file path for the SQLite ingestion cache, which means it redoes all that ingestion work every time the app starts. The simple solution is to provide a path and filename for the database file. That way the database isn't re-created every time your app starts, and you can instead incrementally update the embeddings:

var ingestionCache = builder.AddSqlite(
    "ingestionCache",
    databasePath: @"D:\repos",
    databaseFileName: "modern_dotnetshow_embeddings.db");

That simple change makes the app much more usable. Obviously, for production you likely wouldn't be using a SQLite database and JSON file for your vectors, so it wouldn't be a problem anyway.

Overall, I think this makes an interesting proof of concept, although how useful it is in practice remains to be seen.

Summary

In this post I showed how you can use the new .NET AI Chat Web App template (currently in preview) to create a custom IIngestionSource to ingest data about a website, so that you can chat with an LLM about the site. The resulting chat app provides quotes and citations from the website when answering your questions. You can find the source code for this demo app on GitHub.

