Image description with multimodal LLM in CrewAI

AI Multi-agents with Multimodal Model

In this article, we'll explore how to unlock powerful image description capabilities within CrewAI using the open-source Llava model running with Ollama. By creating a custom tool, we can seamlessly integrate image processing into our multi-agent workflows, enhancing the scope of what AI can achieve in automation tasks. Whether you're an AI enthusiast or a developer looking to push the boundaries of your projects, this guide will provide you with the insights and tools you need to get started.

August 12, 2024 · 5 Minutes · 0 Comments

The CrewAI framework doesn't yet support multimodal models. This article explains a workaround using custom tools, so you can use the image description feature of the Llava model inside your multi-agent crew.

The Challenge: Multimodal Integration in CrewAI

CrewAI is a versatile multi-agent framework, but it currently lacks built-in support for multimodal models that can process both text and images. This limitation posed a challenge for my projects, where I often need to generate detailed descriptions of images. To overcome it, I designed a custom tool that interfaces with the self-hosted Llava model on Ollama, enabling CrewAI to harness its image description capabilities.

The Solution: Custom Image Description Tool

Using the Llava model hosted on an Ollama instance, I created a custom tool in CrewAI that processes images and generates textual descriptions. Here’s the code that makes it all work:

import base64
import json

import requests

from crewai_tools import BaseTool

class ImageUnderstandingTool(BaseTool):
    name: str = "Image description Tool"
    description: str = "This tool acts as your eyes and will textually respond to your prompt about an image"

    def _run(self, prompt: str) -> str:
        # Download the image to describe (replace the URL and token with your own)
        image_url = "https://directus.cloudseb.com/assets/3524b4ca-9474-495d-9101-f1bb7d085751/IMG_6682.jpeg"
        image = requests.get(image_url, headers={"Authorization": "Bearer BaaXbu4RAwLjCE2EZtO6KVnoRCgI4uAR"}).content
        # Encode the image as base64, the format expected by the Ollama API
        image_base64 = base64.b64encode(image).decode('utf-8')
        # Ask the Llava model running on the Ollama instance to answer the prompt
        request_json = {"model": "llava:latest", "prompt": prompt, "images": [image_base64], "stream": False}
        request_response = requests.post("http://192.168.68.71:11434/api/generate", json=request_json)
        response_json = json.loads(request_response.text)
        return response_json["response"]

## TOOLS

# Instantiate the tool
ImageDescriptionTool = ImageUnderstandingTool()

Breaking Down the Code

  1. Importing Necessary Libraries:
    • base64: For encoding the image into a base64 string.
    • requests: For making HTTP requests to fetch the image and communicate with the Llava model API.
    • json: For handling JSON data.
  2. Class Definition:
    • ImageUnderstandingTool inherits from BaseTool, a fundamental component in CrewAI for creating custom tools.
    • The tool is named “Image description Tool” and its purpose is described.
  3. The _run Method:
    • Fetching the Image: The image is downloaded from a specified URL using a GET request with authorization headers.
    • Encoding the Image: The image content is encoded into a base64 string to be sent in the API request.
    • Preparing the Request: A JSON object is created, containing the model type (llava:latest), the prompt, and the encoded image.
    • Making the API Call: A POST request is sent to the Llava model API endpoint hosted on my Ollama instance.
    • Processing the Response: The JSON response is parsed to extract the generated description of the image.
  4. Tool Instantiation:
    • The custom tool is instantiated and ready to be integrated into CrewAI workflows; a quick standalone test is shown below.
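
Before wiring the tool into a crew, you can sanity-check it on its own. A minimal test, assuming your Ollama instance and the image URL are reachable:

# Quick standalone test of the tool (calls _run directly)
tool = ImageUnderstandingTool()
print(tool._run("Describe the image"))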

Putting It All Together

By integrating this custom tool into CrewAI, I can now automate tasks that require image descriptions, enhancing the versatility and capability of my AI agents. This setup is particularly useful in workflows that involve processing and analyzing visual content, making it an invaluable addition to my AI toolkit.
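
As an illustration, here is a minimal sketch of how the tool can be attached to an agent and used in a crew. The role, goal, and task wording are placeholders of my own; only ImageDescriptionTool comes from the code above:

from crewai import Agent, Task, Crew

# Hypothetical agent and task wiring - adapt the roles and prompts to your own crew
image_analyst = Agent(
    role="Image Analyst",
    goal="Describe images in detail for the rest of the crew",
    backstory="You turn visual content into precise textual descriptions.",
    tools=[ImageDescriptionTool],
    verbose=True,
)

describe_image = Task(
    description="Use the Image description Tool to describe the image in detail.",
    expected_output="A detailed paragraph describing the image.",
    agent=image_analyst,
)

crew = Crew(agents=[image_analyst], tasks=[describe_image])
print(crew.kickoff())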

Conclusion

Exploring the intersection of AI and multimodal capabilities opens up a world of possibilities. With the workaround presented here, you can extend the functionality of CrewAI to include image descriptions, leveraging the power of the Llava model. I hope this article inspires you to experiment with similar integrations and unlock new potentials in your AI projects.

Feel free to reach out if you have any questions or need further assistance with your AI explorations. Happy coding!


If you found this article helpful, consider sharing it with your network and subscribing to my blog for more insights into AI, photography, and innovative problem-solving techniques.


Automating AI Image Generation with n8n and ComfyUI

AI Image generation

This blog post explores integrating AI image generation into your n8n workflows using ComfyUI. Whether you're looking to automate visual creation or streamline your content generation process, this guide comes from my various trials and aims to provide a direct solution for integrating ComfyUI into an n8n workflow. In this n8n workflow, we set up an automated system that sends HTTP requests to ComfyUI, monitors the generation status, and retrieves the generated images for further use.

July 21, 2024 · 11 Minutes · 2 Comments

Introduction

For low-code users, this article will explain how to integrate AI image generation into an n8n workflow. This solution leverages the power of open-source and self-hostable setups. In my case, ComfyUI runs on my primary desktop equipped with a recent GPU, while my n8n instance, which I use as a backend for several applications, is self-hosted on my home server.

The ComfyUI documentation does not give much detail about how to set up and reach the API through HTTP requests, but the GitHub repository [GitHub – ComfyUI Repository] provides some socket example scripts that give the basic information about how to format the HTTP calls.

This blog post will first dive into the detailed setup and explanation of the n8n workflow, followed by a sort of ComfyUI API documentation.

Pre-requisites

  1. This post assumes that you have both ComfyUI and n8n already running properly on your systems, whether self-hosted or standalone, inside or outside of a Docker environment. This article will not cover installation details, as plenty of resources are available online for that.
  2. Activate ComfyUI remote access by adding the --listen launch parameter, either at the end of the command used to launch ComfyUI or, on Windows, by editing the launch file (run_nvidia_gpu.bat or run_cpu.bat at the root of the ComfyUI folder) with a text editor; see the example after this list. After restarting the application, ComfyUI will be reachable at your computer's IP address [COMFYUI_IP_ADDRESS] on port 8188 by default [COMFYUI_PORT].
  3. You have on hand the ComfyUI JSON workflow file of your image generation flow. This JSON file can be obtained from the ComfyUI interface by clicking on the "Save (API Format)" button in the application menu.
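
For reference, on the Windows portable build the edited run_nvidia_gpu.bat typically ends up looking like the lines below. The exact original content may vary between releases, so only append --listen to whatever command is already there:

REM run_nvidia_gpu.bat - the --listen flag was appended to the existing command
.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --listen
pause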

n8n workflow

This node allows you to manually start the workflow by clicking on 'Test workflow' in the n8n interface. It can be replaced by a webhook call or included in another workflow of yours.

Node Type: Manual Trigger

A custom JavaScript Code node is used to set up the parameters and configuration for the image generation process. This includes defining the prompts and settings for the AI model. The code generates the prompt structure required by the ComfyUI API to initiate image generation. As mentioned in the pre-requisites, you can extract this ComfyUI JSON workflow from the ComfyUI interface by clicking on the "Save (API Format)" button in the application menu and opening the downloaded file with a text editor.

We will slightly modify this JSON and define two variables at the top of the JS script that need to be referenced inside the ComfyUI JSON:

    1. positivePrompt: This defines the prompt passed to the AI image generation process. We will use this variable inside the ComfyUI JSON.
    2. seednumber: We generate a random seed number to randomize the AI image generation. If you always use the same seed number and repeat the workflow, you will always get the same generated image.

Node Type: Code
Language: JavaScript
JavaScript code:

// VARIABLE DEFINITION
// 1. Positive prompt used for the image generation. If used, you also need to manually edit the ComfyUI JSON workflow below and insert the variable.
const positivePrompt = "A tech-savvy engineer working on a computer with a holographic display, showing AI-generated images and some automation workflow nodes, capturing the essence of integration and automation in a high-tech setting. Accent color should be a warm yellow orange";
// 2. Random seed generation, to randomize the image generation process; otherwise you will get the same image for a given prompt.
const seednumber = Math.floor(Math.random() * 1000000000000000);
const prompt =

//BEGINING - PASTE HERE THE COMFYUI JSON WORKFLOW (and insert the above variable call)
{
  "3": {
    "inputs": {
      "seed": seednumber,
      "steps": 50,
      "cfg": 8,
      "sampler_name": "euler",
      "scheduler": "normal",
      "denoise": 1,
      "model": [
        "4",
        0
      ],
      "positive": [
        "6",
        0
      ],
      "negative": [
        "7",
        0
      ],
      "latent_image": [
        "5",
        0
      ]
    },
    "class_type": "KSampler",
    "_meta": {
      "title": "KSampler"
    }
  },
  "4": {
    "inputs": {
      "ckpt_name": "sd_xl_base_1.0.safetensors"
    },
    "class_type": "CheckpointLoaderSimple",
    "_meta": {
      "title": "Load Checkpoint"
    }
  },
  "5": {
    "inputs": {
      "width": 512,
      "height": 512,
      "batch_size": 1
    },
    "class_type": "EmptyLatentImage",
    "_meta": {
      "title": "Empty Latent Image"
    }
  },
  "6": {
    "inputs": {
      "text": positiveprompt,
      "clip": [
        "4",
        1
      ]
    },
    "class_type": "CLIPTextEncode",
    "_meta": {
      "title": "CLIP Text Encode (Prompt)"
    }
  },
  "7": {
    "inputs": {
      "text": "",
      "clip": [
        "4",
        1
      ]
    },
    "class_type": "CLIPTextEncode",
    "_meta": {
      "title": "CLIP Text Encode (Prompt)"
    }
  },
  "8": {
    "inputs": {
      "samples": [
        "3",
        0
      ],
      "vae": [
        "4",
        2
      ]
    },
    "class_type": "VAEDecode",
    "_meta": {
      "title": "VAE Decode"
    }
  },
  "9": {
    "inputs": {
      "filename_prefix": "ComfyUI",
      "images": [
        "8",
        0
      ]
    },
    "class_type": "SaveImage",
    "_meta": {
      "title": "Save Image"
    }
  }
}
//END - PASTE HERE THE COMFYUI JSON WORKFLOW 
;

const jsonData = {"prompt": prompt};
return [{ json: jsonData }];

This node sends a POST request to the ComfyUI API, initiating the image generation process with the specified parameters.

Node Type: HTTP Request
Parameters:

  • Method: POST
  • URL: http://COMFYUI_IP_ADDRESS:COMFYUI_PORT/prompt
  • Body:
    • Content-Type: JSON
    • Specify Body: Using Fields Below
    • Body parameters:
      • Name: prompt
      • Value: {{ $('2. Set-up the ComfyUI workflow').item.json.prompt }} (the JSON payload generated in the previous step)

This node sends a GET request to the ComfyUI API to check the status of the image generation process, using the prompt_id from the previous node's response. We will create a loop with the next two nodes, Wait and If, to repeat this status check until the generation process in ComfyUI is finished.

Node Type: HTTP Request
Parameters:

  • Method: GET
  • URL: Expression http://COMFYUI_IP_ADDRESS:COMFYUI_PORT/history/{{ $('3. ComfyUI HTTP Request').item.json.prompt_id }}

This node evaluates the status of the image generation process (status.status_str in the response of the previous node). If the status is 'success', it proceeds to the next step; otherwise, it waits and rechecks later until the ComfyUI image generation process is finished.

Node Type: If
Condition:

  • Left Value: Expression{{ $('ComfyUI - Check Image generation status').item.json[$('3. ComfyUI HTTP Request').item.json.prompt_id].status.status_str }}
  • Operator: is equal to
  • Right Value: success

This node pauses the workflow for a few seconds when the previous status check shows that generation is not yet complete, before rechecking the status. I have set this delay to 10 seconds since, from experience, an image generation in ComfyUI takes between 30 and 50 seconds on my system depending on the workflow.

Node Type: Wait
Parameters:

  • Resume: After time interval
  • Wait amount: 10
  • Wait unit: Seconds

Once the image generation is successful, this node retrieves the generated image from the API for viewing or further processing.

Node Type: HTTP Request
Parameters:

  • Method: GET
  • URL: http://COMFYUI_IP_ADDRESS:COMFYUI_PORT/view
  • Query Parameters: Using Fields Below, parsed from the response of the previous step
    • filename: The filename of the generated image: {{$json[$node["3. ComfyUI HTTP Request"].json.prompt_id].outputs[Object.keys($json[$node["3. ComfyUI HTTP Request"].json.prompt_id].outputs)[0]].images[0].filename}}
    • subfolder: The subfolder where the image is stored. {{$json[$node["3. ComfyUI HTTP Request"].json.prompt_id].outputs[Object.keys($json[$node["3. ComfyUI HTTP Request"].json.prompt_id].outputs)[0]].images[0].subfolder}}
    • type: The type of the image: {{$json[$node["3. ComfyUI HTTP Request"].json.prompt_id].outputs[Object.keys($json[$node["3. ComfyUI HTTP Request"].json.prompt_id].outputs)[0]].images[0].type}}

Recovering the filename, subfolder, and type parameters from the previous node is a bit tricky, as the structure of the returned JSON is dynamic and keyed by the prompt_id; the sketch below makes the logic explicit.
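
To make that structure explicit, here is the same extraction written as a Python sketch, assuming history holds the parsed /history response (documented below) and prompt_id comes from the prompt creation request:

# The history JSON is keyed by the prompt_id, and its "outputs" object is
# keyed by the ID of the SaveImage node, which varies between workflows.
entry = history[prompt_id]
first_output_node = next(iter(entry["outputs"]))  # e.g. "9"
image_info = entry["outputs"][first_output_node]["images"][0]
filename = image_info["filename"]
subfolder = image_info["subfolder"]
image_type = image_info["type"]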

ComfyUI – API HTTP Request documentation

Base URL: http://COMFYUI_IP_ADDRESS:COMFYUI_PORT/ where COMFYUI_IP_ADDRESS is the IP address of your ComfyUI instance, running on port COMFYUI_PORT (by default: 8188)

Endpoint: POST /prompt

Parameters

  • BODY prompt (JSON): A JSON ComfyUI workflow exported from the ComfyUI user interface by clicking on the "Save (API Format)" button (or generated from another of your workflows)
Response

  • Code: 200 (Success)
  • Example:
{
  "prompt_id": "782a0bc6-cc9f-4c82-9043-914294a26f8d",
  "number": 0,
  "node_errors": {}
}
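
If you want to test this call outside of n8n, a minimal Python sketch looks like this (the workflow_api.json file name is my assumption; use whatever you named your exported API-format workflow):

import json
import requests

COMFYUI_URL = "http://COMFYUI_IP_ADDRESS:8188"  # replace with your instance

# Load the API-format workflow exported from the ComfyUI interface
with open("workflow_api.json") as f:
    workflow = json.load(f)

# Queue the workflow and keep the prompt_id for the status checks below
response = requests.post(f"{COMFYUI_URL}/prompt", json={"prompt": workflow})
response.raise_for_status()
prompt_id = response.json()["prompt_id"]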

Endpoint: GET /history/{prompt_id}

Parameters

  • URL prompt_id (String): The unique workflow prompt_id created by ComfyUI and returned in the response of the prompt creation request

Response

  • Code: 200 (Success)
  • Example with prompt_id = "91fae807-ff1a-4bf0-a493-0c56b0f8fba2":

  {
    "91fae807-ff1a-4bf0-a493-0c56b0f8fba2": {
      "prompt": [WORKFLOW STEP DESCRIPTION - DELETED FOR MORE CLARITY ] ,
      "outputs": {
        "9": {
          "images": [
            {
              "filename": "ComfyUI_00492_.png",
              "subfolder": "",
              "type": "output"
            }
          ]
        }
      },
      "status": {
        "status_str": "success",
        "completed": true,
        "messages": [
          [
            "execution_start",
            {
              "prompt_id": "91fae807-ff1a-4bf0-a493-0c56b0f8fba2"
            }
          ],
          [
            "execution_cached",
            {
              "nodes": [
                "7",
                "5",
                "4",
                "6"
              ],
              "prompt_id": "91fae807-ff1a-4bf0-a493-0c56b0f8fba2"
            }
          ]
        ]
      }
    }
  }
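
The Wait/If loop of the n8n workflow maps to a simple polling loop. A Python sketch, reusing COMFYUI_URL and prompt_id from the previous snippet:

import time
import requests

# Poll /history until the entry for our prompt_id reports success,
# mirroring the 10-second Wait + If loop of the n8n workflow
while True:
    history = requests.get(f"{COMFYUI_URL}/history/{prompt_id}").json()
    entry = history.get(prompt_id)
    if entry and entry["status"]["status_str"] == "success":
        break
    time.sleep(10)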

Endpoint: GET /view

Parameters

  • QUERY filename (String): The generated image filename created by ComfyUI, returned in the response of the status check request
  • QUERY subfolder (String): The generated image subfolder, returned in the response of the status check request
  • QUERY type (String): The generated image type, returned in the response of the status check request

Response

  • Code: 200 (Success)
  • Returns: The image as a binary file
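
Finally, downloading the image in Python, reusing the variables from the snippets above (image_info is the dict extracted from the /history response as shown earlier):

# Fetch the generated image as binary data and save it locally
params = {
    "filename": image_info["filename"],
    "subfolder": image_info["subfolder"],
    "type": image_info["type"],
}
image_bytes = requests.get(f"{COMFYUI_URL}/view", params=params).content
with open(image_info["filename"], "wb") as f:
    f.write(image_bytes)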

Conclusion

This n8n workflow efficiently integrates AI image generation using ComfyUI, allowing for automation and streamlined content creation. By setting up an automated system to send HTTP requests to ComfyUI, monitor the generation status, and retrieve the generated images, you can leverage AI image generation in your projects seamlessly. The approach described in this post serves as a foundation for further customization and integration based on your specific requirements. You can also very easily reproduce this workflow in other automation tools like Make.com

Resources

n8n: Automate Workflows Easily. n8n is an extendable workflow automation tool that allows you to build powerful workflows really fast with a low-code approach; insert code only when you need it. n8n is self-hostable, and an extended version is also available on their online platform depending on your needs.

ComfyUI: A very powerful and modular Stable Diffusion GUI and backend to personalize and test image and video AI generation workflows. An alternative to AUTOMATIC1111, self-hostable, with a large and active community developing custom nodes for all kinds of extra features: ControlNet, CLIP encoders…

Try it out and share your experience in the comments. Thanks.


Timelapse - Bangkok Street at night, Thailand

Timelapse - Holy grail sunset over Sukhumvit skyline, Bangkok, Thailand

Travel - Australia East Coast - Photo gallery


Exploring AI - My pictures through Midjourney eyes

Ep1 – Japan – Chureito pagoda and Mount Fuji under the sakura sunset

Welcome to this blog post series, where I will jump into the fascinating world of photography and its dynamic relationship with cutting-edge technology. Today, I am excited to share my recent experience in the realm of AI-generated art using the powerful and impressive Midjourney AI image generator. In this post, which I hope to turn into a series of articles, I will showcase several of my photographs and the stunning clones achieved through the application of artificial intelligence.

For my first experiment, I selected one of my favorite photographs, captured in Japan: the breathtaking scenery centered around the Chureito Pagoda set against the backdrop of Mount Fuji during the Sakura season at sunset. This particular image holds sentimental value for me, with its striking blend of red, pink, and purple hues from the sky, and the delicate Sakura blossoms adorning the landscape.

I fed this original photograph into the Midjourney AI image generator, aiming to explore its ability to describe the scene [using the /describe prompt command] and generate an AI-created counterpart from this description. To my amazement, the AI accurately depicted the core elements of the image, amplifying the colors and intensifying the dreamlike atmosphere. The result was a visually stunning interpretation that enhanced the original scene's beauty and evoked an even stronger emotional response.

This captivating experiment highlights the immense potential of AI in the field of artistic creation. The Midjourney AI image generator demonstrated its capability to collaborate seamlessly with human creativity, producing mesmerizing imagery that expands the horizons of visual expression. As I continue to explore the intersection of photography and AI, this series will serve as a platform to share my experiences, insights, and discoveries. Together, we will unravel the possibilities and celebrate the fusion of artistry and technology that lies at the heart of this captivating journey.

Stay tuned for future installments in this series as we embark on new adventures, witnessing the dynamic synergy between photography and artificial intelligence. Join me in embracing the evolution of our craft, and let us push the boundaries of creative exploration.