Building an AI Home Security System Using .NET, Python, CLIP, Semantic Kernel, Telegram, and Raspberry Pi 4 – Part 2
by: jamie_maguire
This is the second instalment of a miniseries where I am building an end-to-end AI home security and cross platform system.
I recommend reading Part 1 before reading on. To recap, the main requirements for this system are:
- Detect motion
- Capture photo
- Create message with photo attached
- Send message to defined Telegram bot
- Detect who is in the photo; if the person is me, do not invoke the Telegram bot with a message and image
In this blog post, we implement core functionality that lets us convert images to vectors using OpenAI's free CLIP library and the ViT-B/32 model.
These can then be used to perform real-time similarity checks against new incoming data.
~
What is CLIP
CLIP (Contrastive Language–Image Pre-training) is an AI model developed by OpenAI that can understand the relationship between images and text.
CLIP makes it easy for you to encode images into high-dimensional vectors (embeddings). With images in vector format, you can then perform:
- Semantic similarity search (image-to-image or image-to-text).
- Zero-shot classification (without retraining).
- Content-based image retrieval.
Vector representations of visual content can then be compared using mathematical operations like cosine similarity.
Watch this 2-minute explainer to help you understand how cosine similarity works.
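To illustrate, here is a minimal sketch (not from the original post) of how two embeddings could be compared with cosine similarity using NumPy; the 512-dimensional vectors below are just placeholders:

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their magnitudes.
    # A score of 1.0 means the vectors point in exactly the same direction.
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two placeholder 512-dimensional embeddings (the CLIP ViT-B/32 output size).
embedding_a = np.random.rand(512)
embedding_b = np.random.rand(512)
print(cosine_similarity(embedding_a, embedding_b))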
~
Why Use CLIP
I selected CLIP because:
- no need for custom model training and it works out-of-the-box
- fast and efficient – runs locally with support for CPU or GPU
- versatile – can compare images or map them to descriptive text
For the home security system, this means:
- privacy is maintained as no need to send images to cloud APIs
- responsive and quick as no cloud latency
- can train the system on known people or objects using sample images
- can match new images in real-time to these known entities
This means that when the Raspberry Pi captures an image, the image can be sent to CLIP for classification and recognition.
Within the context of the home security system, it makes it possible to check the newly captured image against a set of locally embedded training images (safe list).
If the person in a captured image is on the “safe list”, the Telegram message will not be sent to my cell phone.
Finally, CLIP is FREE. No cloud costs! Find out more here.
~
Implementing the CLIP Server
The CLIP server runs on a self-hosted Python server and the following technologies are used:
- Python 3.9+
- FastAPI for HTTP handling
- PyTorch for model execution
- OpenAI’s CLIP GitHub package
- Uvicorn to serve the API
The following code listing contains everything you need to spin up a locally running CLIP server:
from fastapi import FastAPI, UploadFile, File
from fastapi.responses import JSONResponse
from PIL import Image
import torch
import clip
import io

app = FastAPI()

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@app.post("/embed")
async def embed_image(file: UploadFile = File(...)):
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert('RGB')
    image_input = preprocess(image).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image_input)
    embedding = image_features[0].cpu().tolist()
    return JSONResponse(content={"embedding": embedding})

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=5003)
You’ll see from the above that a single API endpoint (/embed) is made available.
~
How Does the CLIP Server Work
This CLIP server accepts image uploads and returns 512-dimensional vector embeddings. The main steps are:
- Uploads an image to the /embed endpoint.
- Processes the image with the CLIP model.
- Returns the embedding as a JSON array.
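The response body has the following shape (the values shown are illustrative and truncated; the real array contains 512 floats):

{
  "embedding": [0.1823, -0.0457, 0.2391, ..., 0.0772]
}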
Under the hood, OpenAI’s CLIP ViT-B/32 model is used. This model is fetched the first time the server loads.
It’s then added to the local cache for subsequent runs (see below).
After that first download, it is a fully offline solution.
You can find the model here: %USERPROFILE%\.cache\clip\
~
Running the CLIP Server in Visual Studio Code
We can run the CLIP server in VS Code. The API endpoint is served when the server is started with the following command:
uvicorn.run(app, host="0.0.0.0", port=5003)
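Alternatively, if the code is saved to a file such as clip_server.py (an assumed name), the same server can be started from the terminal with the uvicorn CLI:

uvicorn clip_server:app --host 0.0.0.0 --port 5003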
We can see this in the terminal in VS Code:
With the API running, we can test it using Postman.
~
Testing the CLIP Server using Postman
The API exposes the /embed endpoint at the following address:
http://localhost:5003/embed
The code below handles requests sent to the endpoint:
async def embed_image(file: UploadFile = File(...)):
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert('RGB')
    image_input = preprocess(image).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image_input)
    embedding = image_features[0].cpu().tolist()
    return JSONResponse(content={"embedding": embedding})
A POST request containing the image is required. The API will then return JSON with the vector embedding that represents the image you sent.
The high-level steps are:
- Run the server on port 5003.
- Send a POST request to http://localhost:5003/embed.
- Attach an image as a file in form-data.
- Inspect the returned embedding.
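As an alternative to Postman, a quick sketch using Python's requests library would look something like this (the file name capture.jpg is a placeholder):

import requests

# Post a captured image to the locally running CLIP server.
with open("capture.jpg", "rb") as f:
    response = requests.post(
        "http://localhost:5003/embed",
        files={"file": ("capture.jpg", f, "image/jpeg")},
    )

embedding = response.json()["embedding"]
print(len(embedding))  # 512 values for the ViT-B/32 model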
We check the CLIP server is running in VS Code:
We create a POST request for the /embed endpoint, attaching a file:
We submit the request and vector embeddings are returned:
Perfect.
~
Demo
In this demo, you can see the CLIP server in action. We create a POST request, submit it, and vector embeddings are returned:
~
How Does This Fit Within the Wider Solution?
The CLIP server forms a key component of the wider solution and helps satisfy the following requirement:
- Detect who is in the photo; if the person is me, do not invoke the Telegram bot with a message and image
Process
- When a photo is captured by the Raspberry Pi, it will be sent to a .NET API that consumes the CLIP server.
- An embedding is generated and compared against a collection of existing training images that have already been converted to embeddings (images of me).
- The result of the embedding comparison is a cosine similarity score.
- If the score meets a given threshold against the existing images labelled “me”, the API will return a match with a high score and the Telegram notification will not be invoked.
- If the score does not meet the threshold, the person is unknown and the Telegram message will be sent.
All the above happens in milliseconds and doesn’t require extensive compute.
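As a rough Python sketch of that matching logic (illustrative only; the real comparison will live in the .NET API covered in Part 3, and the 0.85 threshold is an assumed value):

import numpy as np

def normalise(v):
    v = np.asarray(v, dtype=np.float32)
    return v / np.linalg.norm(v)

def best_match_score(new_embedding, safe_list_embeddings):
    # With unit-length vectors, the dot product equals the cosine similarity.
    query = normalise(new_embedding)
    scores = [float(np.dot(query, normalise(e))) for e in safe_list_embeddings]
    return max(scores) if scores else 0.0

THRESHOLD = 0.85  # assumed value; would be tuned using real training images

def should_send_telegram_alert(new_embedding, safe_list_embeddings):
    # Only alert when the capture does NOT match anyone on the safe list.
    return best_match_score(new_embedding, safe_list_embeddings) < THRESHOLD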
~
Summary
In Part 3 of this series, we’ll expose the CLIP server running on Python using a .NET Web API.
The .NET Web API will leverage the CLIP server to generate embeddings in a few different ways:
- Generate embeddings for training image data
- Generate embeddings for real-time data being captured from the camera
Stay tuned.
~