Building an AI Home Security System Using .NET, Python, CLIP, Semantic Kernel, Telegram, and Raspberry Pi 4 – Part 2
by: jamie_maguire
This is the second instalment of a miniseries where I am building an end-to-end AI home security and cross platform system.
I recommend reading Part 1 before reading on. To recap, the main requirements for this system are:
- Detect motion
- Capture photo
- Create message with photo attached
- Send message to defined Telegram bot
- Detect who is in the photo; if the person is me, do not invoke the Telegram bot with a message and image
In this blog post, we implement core functionality that lets us convert images to vectors using OpenAI's free CLIP library and the ViT-B/32 model.
These can then be used to perform real-time similarity checks against new incoming data.
~
What is CLIP
CLIP (Contrastive Language–Image Pre-training) is an AI model developed by OpenAI that can understand the relationship between images and text.
CLIP makes it easy for you to encode images into high-dimensional vectors (embeddings). With images in vector format, you can then perform:
- Semantic similarity search (image-to-image or image-to-text).
- Zero-shot classification (without retraining).
- Content-based image retrieval.
Vector representations of visual content can then be compared using mathematical operations like cosine similarity.
Watch this 2-minute explainer to help you understand how cosine similarity works.
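To illustrate, here is a minimal sketch (not from the original post) of how two embeddings could be compared with cosine similarity using NumPy; the 512-dimensional vectors below are just placeholders:

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their magnitudes.
    # A score of 1.0 means the vectors point in exactly the same direction.
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two placeholder 512-dimensional embeddings (the CLIP ViT-B/32 output size).
embedding_a = np.random.rand(512)
embedding_b = np.random.rand(512)
print(cosine_similarity(embedding_a, embedding_b))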
~
Why Use CLIP
I selected CLIP because:
- no need for custom model training and it works out-of-the-box
- fast and efficient – runs locally with support for CPU or GPU
- versatile – can compare images or map them to descriptive text
For the home security system, this means:
- privacy is maintained as no need to send images to cloud APIs
- responsive and quick as no cloud latency
- can train the system on known people or objects using sample images
- can match new images in real-time to these known entities
This means that when the Raspberry Pi captures an image, the image can be sent to CLIP for classification and recognition.
Within the context of the home security system, it makes it possible to check the newly captured image against a set of locally embedded training images (safe list).
If the person in a captured image is on the “safe list”, the Telegram message will not be sent to my cell phone.
Finally, CLIP is FREE. No cloud costs! Find out more here.
~
Implementing the CLIP Server
The CLIP server runs on a self-hosted Python server and the following technologies are used:
- Python 3.9+
- FastAPI for HTTP handling
- PyTorch for model execution
- OpenAI’s CLIP GitHub package
- Uvicorn to serve the API
The following code listing contains everything you need to spin up a locally running CLIP server:
from fastapi import FastAPI, UploadFile, File
from fastapi.responses import JSONResponse
from PIL import Image
import torch
import clip
import io

app = FastAPI()

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@app.post("/embed")
async def embed_image(file: UploadFile = File(...)):
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert('RGB')
    image_input = preprocess(image).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image_input)
    embedding = image_features[0].cpu().tolist()
    return JSONResponse(content={"embedding": embedding})

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=5003)
You’ll see from the above that a single API endpoint (/embed) is made available.
~
How Does the CLIP Server Work
This CLIP server accepts image uploads and returns 512-dimensional vector embeddings. The main steps are:
- Uploads an image to the /embed endpoint.
- Processes the image with the CLIP model.
- Returns the embedding as a JSON array.
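The response body has the following shape (the values shown are illustrative and truncated; the real array contains 512 floats):

{
  "embedding": [0.1823, -0.0457, 0.2391, ..., 0.0772]
}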
Under the hood, OpenAI’s CLIP ViT-B/32 model is used. This model is fetched the first time the server loads.
It’s then added to the local cache for subsequent runs (see below).
After that first download, it is a fully offline solution.
You can find the model here: %USERPROFILE%\.cache\clip\
~
Running the CLIP Server in Visual Studio Code
We can run the CLIP server in VS Code. The API endpoint is served when the server is started with the following command:
uvicorn.run(app, host="0.0.0.0", port=5003)
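Alternatively, if the code is saved to a file such as clip_server.py (an assumed name), the same server can be started from the terminal with the uvicorn CLI:

uvicorn clip_server:app --host 0.0.0.0 --port 5003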
We can see this in the terminal in VS Code:
With the API running, we can test it using Postman.
~
Testing the CLIP Server using Postman
The API exposes the /embed endpoint at the following address:
http://localhost:5003/embed
The code below handles requests sent to the endpoint:
async def embed_image(file: UploadFile = File(...)):
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert('RGB')
    image_input = preprocess(image).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image_input)
    embedding = image_features[0].cpu().tolist()
    return JSONResponse(content={"embedding": embedding})
A POST request containing the image is required. The API will then return JSON with the vector embedding that represents the image you sent.
The high-level steps are:
- Run the server on port 5003.
- Send a POST request to http://localhost:5003/embed.
- Attach an image as a file in form-data.
- Inspect the returned embedding.
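As an alternative to Postman, a quick sketch using Python's requests library would look something like this (the file name capture.jpg is a placeholder):

import requests

# Post a captured image to the locally running CLIP server.
with open("capture.jpg", "rb") as f:
    response = requests.post(
        "http://localhost:5003/embed",
        files={"file": ("capture.jpg", f, "image/jpeg")},
    )

embedding = response.json()["embedding"]
print(len(embedding))  # 512 values for the ViT-B/32 model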
We check the CLIP server is running in VS Code:
We create a POST request for the /embed endpoint, attaching a file:
We submit the request and vector embeddings are returned:
Perfect.
~
Demo
In this demo, you can see the CLIP server in action. We create a POST request, submit it, and vector embeddings are returned:
~
How Does This Fit Within the Wider Solution?
The CLIP server forms a key component of the wider solution and helps satisfy the following requirement:
- Detect who is in the photo; if the person is me, do not invoke the Telegram bot with a message and image
Process
- When a photo is captured by the Raspberry Pi, it will be sent to a .NET API that consumes the CLIP server.
- An embedding is generated and compared against a collection of existing training images that have already been converted to embeddings (images of me).
- The result of the embedding comparison is a cosine similarity score.
- If the score meets a given threshold against the existing images labelled “me”, the API will return a match with a high score and the Telegram notification will not be invoked.
- If the score does not meet the threshold, the person is unknown and the Telegram message will be sent.
All the above happens in milliseconds and doesn’t require extensive compute.
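As a rough Python sketch of that matching logic (illustrative only; the real comparison will live in the .NET API covered in Part 3, and the 0.85 threshold is an assumed value):

import numpy as np

def normalise(v):
    v = np.asarray(v, dtype=np.float32)
    return v / np.linalg.norm(v)

def best_match_score(new_embedding, safe_list_embeddings):
    # With unit-length vectors, the dot product equals the cosine similarity.
    query = normalise(new_embedding)
    scores = [float(np.dot(query, normalise(e))) for e in safe_list_embeddings]
    return max(scores) if scores else 0.0

THRESHOLD = 0.85  # assumed value; would be tuned using real training images

def should_send_telegram_alert(new_embedding, safe_list_embeddings):
    # Only alert when the capture does NOT match anyone on the safe list.
    return best_match_score(new_embedding, safe_list_embeddings) < THRESHOLD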
~
Summary
In Part 3 of this series, we’ll expose the CLIP server running on Python using a .NET Web API.
The .NET Web API will leverage the CLIP server to generate embeddings in a few different ways:
- Generate embeddings for training image data
- Generate embeddings for real-time data being captured from the camera
Stay tuned.
~