MarkItDown: Microsoft's Secret Weapon for Converting Documents to Markdown

How I discovered a powerful tool that makes working with LLMs so much easier

The Problem: Documents That LLMs Can’t Read

I’ve been working with large language models (LLMs) for a while now, and one of the biggest frustrations I’ve encountered is how difficult it can be to get them to work with documents in various formats. PDFs, Word documents, PowerPoint presentations – these are all formats that humans can easily read, but LLMs struggle with them.

When I feed these documents directly to an LLM, I often get responses like “I can’t read the content of this file” or “I don’t have access to the document you’re referring to.” It’s incredibly frustrating, especially when I know the information is right there in the document.

That’s why I was thrilled when I discovered MarkItDown, a Python utility from Microsoft that converts various file formats to Markdown – a format that LLMs can easily understand and process.

The Solution: MarkItDown

MarkItDown is a lightweight Python tool that converts various files to Markdown for use with LLMs and related text analysis pipelines. It’s designed to preserve important document structure and content, including headings, lists, tables, links, and more.

What makes MarkItDown special is its ability to handle a wide range of file formats:

PDF documents
PowerPoint presentations
Word documents
Excel spreadsheets
Images (with EXIF metadata and OCR)
Audio files (with EXIF metadata and speech transcription)
HTML pages
Text-based formats (CSV, JSON, XML)
ZIP files (iterates over contents)
YouTube URLs
EPubs
And more!

Why Markdown?

You might be wondering why Markdown is such a big deal. After all, isn’t it just a simple text format with some basic formatting?

The beauty of Markdown is that it’s extremely close to plain text, with minimal markup, but still provides a way to represent important document structure. Mainstream LLMs, such as OpenAI’s GPT-4o, natively “speak” Markdown, and often incorporate Markdown into their responses unprompted. This suggests that they have been trained on vast amounts of Markdown-formatted text and understand it well.

As a side benefit, Markdown spends very few characters on markup compared with formats like HTML, so the same content uses fewer tokens and you can fit more of it into your LLM’s context window.
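
To get a rough feel for that saving, here is a small sketch that writes the same content once as HTML and once as Markdown and compares their sizes. The sample document and the size_ratio helper are made up for illustration, and character count is only a crude stand-in for token count (a real measurement would use your model's tokenizer).

```python
# The same small document written two ways (hypothetical sample content).
html_doc = (
    "<h1>Quarterly Report</h1>"
    "<p>Revenue grew in <strong>all</strong> regions.</p>"
    "<ul><li>North</li><li>South</li></ul>"
)
markdown_doc = (
    "# Quarterly Report\n\n"
    "Revenue grew in **all** regions.\n\n"
    "- North\n"
    "- South\n"
)


def size_ratio(markdown: str, html: str) -> float:
    """Return the Markdown length as a fraction of the HTML length."""
    return len(markdown) / len(html)


# Character counts are a proxy only; tokenizers do not split on characters.
print(f"HTML: {len(html_doc)} chars, Markdown: {len(markdown_doc)} chars")
print(f"Markdown is {size_ratio(markdown_doc, html_doc):.0%} of the HTML size")
```

The gap usually widens on real pages, where HTML also carries attributes, classes, and wrapper elements that Markdown simply drops.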

Getting Started with MarkItDown

Installing MarkItDown is straightforward:

pip install 'markitdown[all]'

The [all] option installs all optional dependencies for various file formats. If you only need support for specific formats, you can install just those (keep the quotes so your shell doesn’t try to interpret the square brackets):

pip install 'markitdown[pdf,docx,pptx]'

Using MarkItDown from the command line is simple:

markitdown path-to-file.pdf > document.md

Or you can specify an output file:

markitdown path-to-file.pdf -o document.md

You can even pipe content:

cat path-to-file.pdf | markitdown

Using MarkItDown in Python

For more flexibility, you can use MarkItDown in your Python code:

from markitdown import MarkItDown

# Basic usage
md = MarkItDown(enable_plugins=False)
result = md.convert("test.xlsx")
print(result.text_content)

# With Azure Document Intelligence for better PDF conversion
md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
print(result.text_content)

# With LLM for image descriptions
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)
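
In practice I rarely convert just one file. Below is a minimal sketch of a folder-level wrapper around the basic usage shown above. The helper names convert_folder and markitdown_converter are my own, not part of MarkItDown; the only MarkItDown calls used are MarkItDown(enable_plugins=False), convert(), and text_content from the snippet above. The converter is passed in as a plain function so the loop can be tried without any optional dependencies installed.

```python
from pathlib import Path
from typing import Callable, Iterable


def convert_folder(
    src_dir: Path,
    out_dir: Path,
    to_markdown: Callable[[Path], str],
    suffixes: Iterable[str] = (".pdf", ".docx", ".pptx", ".xlsx"),
) -> list[Path]:
    """Convert each matching file in src_dir and write <filename>.md to out_dir."""
    wanted = {s.lower() for s in suffixes}
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src_dir.iterdir()):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        # Keep the original extension in the name so report.pdf and
        # report.docx do not overwrite each other.
        target = out_dir / f"{path.name}.md"
        target.write_text(to_markdown(path), encoding="utf-8")
        written.append(target)
    return written


def markitdown_converter() -> Callable[[Path], str]:
    """Build a converter backed by MarkItDown (imported only when called)."""
    from markitdown import MarkItDown  # requires: pip install 'markitdown[all]'

    md = MarkItDown(enable_plugins=False)
    return lambda path: md.convert(str(path)).text_content
```

With MarkItDown installed, a call such as convert_folder(Path("reports"), Path("reports_md"), markitdown_converter()) would write one Markdown file per supported document; the folder names here are placeholders.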

Real-World Applications

Since discovering MarkItDown, I’ve used it for several projects:

Document Analysis: Converting PDF reports to Markdown for analysis by LLMs
Content Extraction: Extracting text from PowerPoint presentations for summarization
Data Processing: Converting Excel spreadsheets to structured Markdown for data analysis
Image Analysis: Using MarkItDown with OCR to extract text from images
Content Creation: Converting various document formats to Markdown for my blog
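
For the document-analysis case, a long converted report often needs to go to the LLM in pieces rather than all at once. Here is one way I might split the Markdown into sections, sketched under the assumption that the converted text uses #-style headings. The split_by_headings function is my own illustration and not something MarkItDown provides, and it is deliberately naive: a line starting with # inside a code block would be treated as a heading.

```python
import re

# Matches "#"-style headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (heading, body) pairs.

    Text that appears before the first heading is returned with an
    empty heading string.
    """
    sections = []
    title, body = "", []

    def flush() -> None:
        # Skip the initial section if there is no heading and no real text.
        if title or any(line.strip() for line in body):
            sections.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections
```

Each (heading, body) pair can then be summarized or queried on its own, which keeps individual prompts small and makes it easier to trace an answer back to a part of the original document.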

Advanced Features

MarkItDown offers several advanced features that make it even more powerful:

Azure Document Intelligence Integration: For better PDF conversion
LLM Integration: For generating descriptions of images
Plugin System: For extending MarkItDown’s capabilities
Docker Support: For running MarkItDown in containers

Conclusion: A Game-Changer for LLM Workflows

MarkItDown has become an essential tool in my LLM workflow. It bridges the gap between human-readable documents and LLM-processable content, making it much easier to work with various file formats.

Whether you’re building an AI agent that needs to process documents, creating a content analysis pipeline, or just trying to get your LLM to understand the content of a PDF, MarkItDown is a tool worth adding to your toolkit.

Give it a try, and you might just find that it transforms how you work with documents and LLMs.