How to create custom nodes in ComfyUI

Dhanush Reddy - Jul 28 - - Dev Community

What is ComfyUI?

ComfyUI is a powerful and flexible user interface for Stable Diffusion, allowing users to create complex image generation workflows through a node-based system. While ComfyUI comes with a variety of built-in nodes, its true strength lies in its extensibility. Custom nodes enable users to add new functionality, integrate external services, and tailor it to their specific needs.

An image showing the interface and working of ComfyUI

In this blog post, we will walk through the process of creating a custom node for image captioning using ComfyUI. This node will take an image as input and return a generated caption using an external API.

We will be using Google Gemini API for generating the caption of an image.

Here is the entire code that does the image captioning using the Gemini API.

You can copy the following code into any file under the custom_nodes folder in ComfyUI; I have named mine gemini-caption.py.

An image showing where to store the file

Complete code for Generating Image Captions

import numpy as np
from PIL import Image
import requests
import io
import base64

class ImageCaptioningNode:
    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {"image": ("IMAGE",), "api_key": ("STRING", {"default": ""})}
        }

    RETURN_TYPES = ("STRING",)
    FUNCTION = "caption_image"
    CATEGORY = "image"
    OUTPUT_NODE = True

    def caption_image(self, image, api_key):
        # Convert the image tensor to a PIL Image
        image = Image.fromarray(
            np.clip(255.0 * image.cpu().numpy().squeeze(), 0, 255).astype(np.uint8)
        )

        # Convert the image to base64
        buffered = io.BytesIO()
        image.save(buffered, format="PNG")
        img_str = base64.b64encode(buffered.getvalue()).decode()
        api_url = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key={api_key}"
        payload = {
            "contents": [
                {
                    "parts": [
                        {
                            "text": "Generate a caption for this image in as detail as possible. Don't send anything else apart from the caption."
                        },
                        {"inline_data": {"mime_type": "image/png", "data": img_str}},
                    ]
                }
            ]
        }

        # Send the request to the Gemini API
        try:
            response = requests.post(api_url, json=payload, timeout=60)
            response.raise_for_status()
            caption = response.json()["candidates"][0]["content"]["parts"][0]["text"]
        except (requests.exceptions.RequestException, KeyError, IndexError) as e:
            caption = f"Error: Unable to generate caption. {str(e)}"

        print(caption)
        return (caption,)


NODE_CLASS_MAPPINGS = {"ImageCaptioningNode": ImageCaptioningNode}

Here is how the node looks on the UI:

Custom ComfyUI Node

Let's go over the code section by section to understand how to create a similar node for your own use case. First of all, implement whatever your node should do as a plain function, so ComfyUI can call it the same way it calls my caption_image function here.

Import the necessary libraries

import numpy as np
from PIL import Image
import requests
import io
import base64

These lines import the necessary libraries for my Image Captioning node:

  • numpy for numerical operations
  • PIL (Python Imaging Library) for image processing
  • requests for making HTTP requests to the Gemini API
  • io for handling byte streams
  • base64 for encoding the image

Defining the class for your ComfyUI node

class ImageCaptioningNode:
    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {"image": ("IMAGE",), "api_key": ("STRING", {"default": ""})}
        }

In my case, I have named it ImageCaptioningNode, since the name says exactly what it does.

The INPUT_TYPES class method defines the inputs for our node:

  • An "image" input of type "IMAGE"
  • An "api_key" input of type "STRING" with a default empty value, needed for sending API requests to Gemini API.

Defining the node's metadata

    RETURN_TYPES = ("STRING",)
    FUNCTION = "caption_image"
    CATEGORY = "image"
    OUTPUT_NODE = True

These class variables define:

  • The return type (a string)
  • The main function to be called ("caption_image")
  • The category in which the node will appear in ComfyUI
  • That this node can be an output node
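ComfyUI also supports an optional RETURN_NAMES class attribute that labels the output sockets in the UI; it is not used in this post's node, but a minimal sketch would look like:

```python
class ImageCaptioningNode:
    RETURN_TYPES = ("STRING",)
    RETURN_NAMES = ("caption",)  # label shown on the output socket instead of "STRING"
    FUNCTION = "caption_image"
    CATEGORY = "image"
    OUTPUT_NODE = True
```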

Implementing the caption function

    def caption_image(self, image, api_key):
        # Convert the image tensor to a PIL Image
        image = Image.fromarray(
            np.clip(255.0 * image.cpu().numpy().squeeze(), 0, 255).astype(np.uint8)
        )

        # Convert the image to base64
        buffered = io.BytesIO()
        image.save(buffered, format="PNG")
        img_str = base64.b64encode(buffered.getvalue()).decode()
        api_url = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?key={api_key}"

        # Prepare the request payload
        payload = {
            "contents": [
                {
                    "parts": [
                        {
                            "text": "Generate a caption for this image in as detail as possible. Don't send anything else apart from the caption."
                        },
                        {"inline_data": {"mime_type": "image/png", "data": img_str}},
                    ]
                }
            ]
        }
        try:
            response = requests.post(api_url, json=payload, timeout=60)
            response.raise_for_status()
            caption = response.json()["candidates"][0]["content"]["parts"][0]["text"]
        except (requests.exceptions.RequestException, KeyError, IndexError) as e:
            caption = f"Error: Unable to generate caption. {str(e)}"

        print(caption)
        return (caption,)


This function takes an image as input and sends it to the Gemini API using the provided API key. The code is straightforward: the image tensor is converted to a PIL image and base64-encoded so it can travel in the JSON payload, and the prompt instructs Gemini to caption the image in detail. The API response is parsed, printed to the console, and returned as a tuple (as required by ComfyUI).
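To see the conversion step in isolation, here is a standalone sketch that mimics the tensor-to-base64 path with a random NumPy array standing in for a real ComfyUI IMAGE tensor (in ComfyUI these are torch tensors of shape [batch, height, width, channels] with float values in [0, 1], hence the .cpu().numpy() call in the node itself):

```python
import base64
import io

import numpy as np
from PIL import Image

# Stand-in for a ComfyUI IMAGE tensor: batch of 1, 64x64 pixels, RGB, floats in [0, 1]
fake_tensor = np.random.rand(1, 64, 64, 3)

# Same conversion as in caption_image: drop the batch dim, scale to 0-255 uint8
pil_image = Image.fromarray(
    np.clip(255.0 * fake_tensor.squeeze(), 0, 255).astype(np.uint8)
)

# Encode as PNG bytes, then base64 so the image fits in a JSON payload
buffered = io.BytesIO()
pil_image.save(buffered, format="PNG")
img_str = base64.b64encode(buffered.getvalue()).decode()
```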

NODE_CLASS_MAPPINGS = {"ImageCaptioningNode": ImageCaptioningNode}

This dictionary maps the class name to the class itself, which is used by ComfyUI to register the custom node.
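Alongside NODE_CLASS_MAPPINGS, ComfyUI also looks for an optional NODE_DISPLAY_NAME_MAPPINGS dictionary in the same file to show a friendlier name in the node menu; the display name below is my own suggestion:

```python
# Optional: maps the internal class name to a human-readable name
# shown in ComfyUI's "Add Node" menu
NODE_DISPLAY_NAME_MAPPINGS = {"ImageCaptioningNode": "Image Captioning (Gemini)"}
```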


Conclusion:

Creating custom nodes for ComfyUI opens up a world of possibilities for extending and enhancing your image generation workflows. In this article, we've walked through the process of building a custom image captioning node, demonstrating how to:

  1. Define input and output types
  2. Integrate with external APIs (in this case, the Gemini API for image captioning)

By following these steps, you can create your own custom nodes to add virtually any functionality you need to ComfyUI. Whether you're integrating new LLM models, adding specialized image processing techniques, or creating shortcuts for common tasks, custom nodes allow you to tailor ComfyUI to your specific requirements.

Remember that while we've focused on image captioning in this example, the same principles can be applied to create nodes for a wide variety of tasks. The key is to understand the structure of a ComfyUI node and how to interface with the expected inputs and outputs.
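As a starting point for your own experiments, the whole pattern boils down to a small skeleton; the node below is a hypothetical example (it just upper-cases a string), not something ComfyUI ships with:

```python
class MyCustomNode:
    """Minimal ComfyUI node skeleton: one string in, one string out."""

    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"text": ("STRING", {"default": ""})}}

    RETURN_TYPES = ("STRING",)
    FUNCTION = "run"
    CATEGORY = "utils"

    def run(self, text):
        # Replace this with your own logic (API call, image op, etc.)
        return (text.upper(),)


# ComfyUI discovers the node through this mapping
NODE_CLASS_MAPPINGS = {"MyCustomNode": MyCustomNode}
```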

If you still have any questions about this post or want to discuss something with me, feel free to connect on LinkedIn or Twitter.

If you run an organization and want me to write for you, please connect with me on my Socials 🙃
