AI Multi-Agents with a Multimodal Model
Image description with a multimodal LLM in CrewAI
In this article, we'll explore how to unlock powerful image description capabilities within CrewAI using the open-source Llava model running with Ollama. By creating a custom tool, we can seamlessly integrate image processing into our multi-agent workflows, enhancing the scope of what AI can achieve in automation tasks. Whether you're an AI enthusiast or a developer looking to push the boundaries of your projects, this guide will provide you with the insights and tools you need to get started.
The CrewAI framework doesn't yet support multimodal models. This article explains a workaround that uses a custom tool to bring the Llava model's image description capability into your multi-agent crew.
The Challenge: Multimodal Integration in CrewAI
CrewAI is a versatile multi-agent framework, but it currently lacks built-in support for multimodal models that can process both text and images. This limitation posed a challenge for my projects, where I often need to generate detailed descriptions of images. To overcome it, I designed a custom tool that interfaces with the self-hosted Llava model on Ollama, enabling CrewAI to harness its image description capabilities.
The Solution: Custom Image Description Tool
Using the Llava model hosted on an Ollama instance, I created a custom tool in CrewAI that processes images and generates textual descriptions. Here’s the code that makes it all work:
```python
from crewai_tools import BaseTool


class ImageUnderstandingTool(BaseTool):
    name: str = "Image description Tool"
    description: str = (
        "This tool acts as your eyes and will textually respond "
        "to your prompt about an image"
    )

    def _run(self, prompt: str) -> str:
        import base64
        import json
        import requests

        # Download the image from its (access-protected) URL.
        image_url = "https://directus.cloudseb.com/assets/3524b4ca-9474-495d-9101-f1bb7d085751/IMG_6682.jpeg"
        image = requests.get(
            image_url,
            headers={"Authorization": "Bearer BaaXbu4RAwLjCE2EZtO6KVnoRCgI4uAR"},
        ).content

        # Encode the raw image bytes as a base64 string for the Ollama API.
        image_base64 = base64.b64encode(image).decode("utf-8")

        # Build the payload for the Llava model: model name, prompt and image.
        request_json = {
            "model": "llava:latest",
            "prompt": prompt,
            "images": [image_base64],
            "stream": False,
        }

        # Call the Ollama generate endpoint and return the textual answer.
        request_response = requests.post(
            "http://192.168.68.71:11434/api/generate",
            json=request_json,
        )
        response_json = json.loads(request_response.text)
        answer = response_json["response"]
        return answer


## TOOLS
# Instantiate the tool
ImageDescriptionTool = ImageUnderstandingTool()
```
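Before plugging the tool into a crew, it is worth sanity-checking it on its own. A minimal sketch, assuming the Ollama instance referenced above is reachable and already has `llava:latest` pulled; the prompt string is just an example:

```python
# Quick standalone check: call the tool directly, outside of any crew.
# Inside CrewAI the agent decides when to invoke the tool itself.
if __name__ == "__main__":
    print(ImageDescriptionTool._run("Describe the image in one short paragraph."))
```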
Breaking Down the Code
- Importing Necessary Libraries:
  - `base64`: for encoding the image into a base64 string.
  - `requests`: for making HTTP requests to fetch the image and communicate with the Llava model API.
  - `json`: for handling JSON data.
- Class Definition:
  - `ImageUnderstandingTool` inherits from `BaseTool`, a fundamental component in CrewAI for creating custom tools.
  - The tool is named "Image description Tool" and its purpose is described.
- The `_run` Method:
  - Fetching the Image: the image is downloaded from a specified URL using a GET request with authorization headers.
  - Encoding the Image: the image content is encoded into a base64 string to be sent in the API request.
  - Preparing the Request: a JSON object is created, containing the model name (`llava:latest`), the prompt, and the encoded image.
  - Making the API Call: a POST request is sent to the Llava model API endpoint hosted on my Ollama instance (a standalone version of this request is sketched right after this list).
  - Processing the Response: the JSON response is parsed to extract the generated description of the image.
- Tool Instantiation:
  - The custom tool is instantiated and ready to be integrated into CrewAI workflows.
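If the tool misbehaves, a useful first step is to hit the Ollama endpoint directly with the same payload shape the tool builds. The sketch below is a test harness rather than part of the tool: the host address is my own Ollama instance (replace it with yours), the local file name is only an example, and `llava:latest` must already be pulled (`ollama pull llava`).

```python
import base64
import json
import requests

OLLAMA_URL = "http://192.168.68.71:11434/api/generate"  # swap in your own Ollama host

# Any local image works for this check; the file name is only an example.
with open("IMG_6682.jpeg", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "llava:latest",
    "prompt": "Describe the image",
    "images": [image_base64],
    "stream": False,
}

response = requests.post(OLLAMA_URL, json=payload)
print(json.loads(response.text)["response"])
```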
Putting It All Together
By integrating this custom tool into CrewAI, I can now automate tasks that require image descriptions, enhancing the versatility and capability of my AI agents. This setup is particularly useful in workflows that involve processing and analyzing visual content, making it an invaluable addition to my AI toolkit.
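For context, here is roughly what that integration looks like. This is a minimal sketch rather than my production setup: the role, goal, backstory, and task text are placeholders, and the exact `Agent`/`Task`/`Crew` arguments can vary slightly between CrewAI versions.

```python
from crewai import Agent, Crew, Task

# An agent that "sees" through the custom tool defined earlier.
vision_agent = Agent(
    role="Image analyst",
    goal="Produce accurate, detailed descriptions of images",
    backstory="You describe images for teammates that cannot see them.",
    tools=[ImageDescriptionTool],
    verbose=True,
)

describe_task = Task(
    description="Use the Image description Tool to describe the image and list its key elements.",
    expected_output="A one-paragraph description of the image.",
    agent=vision_agent,
)

crew = Crew(agents=[vision_agent], tasks=[describe_task])
result = crew.kickoff()
print(result)
```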
Conclusion
Exploring the intersection of AI and multimodal capabilities opens up a world of possibilities. With the workaround presented here, you can extend the functionality of CrewAI to include image descriptions, leveraging the power of the Llava model. I hope this article inspires you to experiment with similar integrations and unlock new potentials in your AI projects.
Feel free to reach out if you have any questions or need further assistance with your AI explorations. Happy coding!
If you found this article helpful, consider sharing it with your network and subscribing to my blog for more insights into AI, photography, and innovative problem-solving techniques.