MarkItDown: Microsoft's Secret Weapon for Converting Documents to Markdown
How I discovered a powerful tool that makes working with LLMs so much easier
The Problem: Documents That LLMs Can’t Read
I’ve been working with large language models (LLMs) for a while now, and one of the biggest frustrations I’ve encountered is how difficult it can be to get them to work with documents in various formats. PDFs, Word documents, PowerPoint presentations – these are all formats that humans can easily read, but LLMs struggle with them.
When I feed these documents directly to an LLM, I often get responses like “I can’t read the content of this file” or “I don’t have access to the document you’re referring to.” It’s incredibly frustrating, especially when I know the information is right there in the document.
That’s why I was thrilled when I discovered MarkItDown, a Python utility from Microsoft that converts various file formats to Markdown – a format that LLMs can easily understand and process.
The Solution: MarkItDown
MarkItDown is a lightweight Python tool that converts various files to Markdown for use with LLMs and related text analysis pipelines. It’s designed to preserve important document structure and content, including headings, lists, tables, links, and more.
What makes MarkItDown special is its ability to handle a wide range of file formats:
- PDF documents
- PowerPoint presentations
- Word documents
- Excel spreadsheets
- Images (with EXIF metadata and OCR)
- Audio files (with EXIF metadata and speech transcription)
- HTML pages
- Text-based formats (CSV, JSON, XML)
- ZIP files (iterates over contents)
- YouTube URLs
- EPubs
- And more!
Why Markdown?
You might be wondering why Markdown is such a big deal. After all, isn’t it just a simple text format with some basic formatting?
The beauty of Markdown is that it’s extremely close to plain text, with minimal markup, but still provides a way to represent important document structure. Mainstream LLMs, such as OpenAI’s GPT-4o, natively “speak” Markdown, and often incorporate Markdown into their responses unprompted. This suggests that they have been trained on vast amounts of Markdown-formatted text and understand it well.
As a side benefit, Markdown conventions are also highly token-efficient: the markup adds very few characters on top of the text itself, which means you can fit more content into your LLM’s context window.
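To make the token-efficiency point concrete, here is a rough sketch that writes the same small table as HTML and as Markdown and counts how many characters are markup rather than content. The table data and the `markup_overhead` helper are made up for illustration, and character count is only a loose proxy for tokens, since real tokenizers behave differently.

```python
# The same two-row table, once as HTML and once as Markdown.
# The data is invented purely for this comparison.
html_table = (
    "<table><thead><tr><th>Name</th><th>Role</th></tr></thead>"
    "<tbody><tr><td>Ada</td><td>Engineer</td></tr>"
    "<tr><td>Grace</td><td>Admiral</td></tr></tbody></table>"
)

markdown_table = (
    "| Name | Role |\n"
    "| --- | --- |\n"
    "| Ada | Engineer |\n"
    "| Grace | Admiral |\n"
)

cells = ["Name", "Role", "Ada", "Engineer", "Grace", "Admiral"]


def markup_overhead(text: str, payload: list[str]) -> int:
    """Count the characters in `text` that are not cell contents."""
    return len(text) - sum(len(cell) for cell in payload)


print("HTML overhead:    ", markup_overhead(html_table, cells))
print("Markdown overhead:", markup_overhead(markdown_table, cells))
```

For an actual token count you would run both strings through the tokenizer of the model you plan to use.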
Getting Started with MarkItDown
Installing MarkItDown is straightforward:
pip install 'markitdown[all]'
The [all] option installs all optional dependencies for various file formats. If you only need support for specific formats, you can install just those:
pip install 'markitdown[pdf,docx,pptx]'
Using MarkItDown from the command line is simple:
markitdown path-to-file.pdf > document.md
Or you can specify an output file:
markitdown path-to-file.pdf -o document.md
You can even pipe content:
cat path-to-file.pdf | markitdown
Using MarkItDown in Python
For more flexibility, you can use MarkItDown in your Python code:
from markitdown import MarkItDown
# Basic usage
md = MarkItDown(enable_plugins=False)
result = md.convert("test.xlsx")
print(result.text_content)
# With Azure Document Intelligence for better PDF conversion
md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
print(result.text_content)
# With LLM for image descriptions
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)
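Once the basic calls work, the next thing I usually want is to convert a whole folder in one go. The helper below is a minimal sketch of that, not part of MarkItDown: `convert_folder` and its `to_markdown` parameter are names I made up. By default it uses the same `MarkItDown(enable_plugins=False)` and `convert(...).text_content` calls shown above, and the converter is passed in as a parameter only so the helper can be tried without MarkItDown installed.

```python
from pathlib import Path
from typing import Callable, Optional


def convert_folder(
    src_dir: str,
    out_dir: str,
    to_markdown: Optional[Callable[[str], str]] = None,
) -> list[Path]:
    """Convert every file in src_dir to a .md file in out_dir.

    `to_markdown` takes a file path and returns Markdown text.
    It is a hypothetical hook for this sketch; when omitted,
    MarkItDown's convert() is used.
    """
    if to_markdown is None:
        # Imported lazily so the helper also works with a custom converter.
        from markitdown import MarkItDown

        md = MarkItDown(enable_plugins=False)
        to_markdown = lambda path: md.convert(path).text_content

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for src in sorted(Path(src_dir).iterdir()):
        if not src.is_file():
            continue
        try:
            text = to_markdown(str(src))
        except Exception as exc:
            # One unreadable file should not stop the whole batch.
            print(f"skipped {src.name}: {exc}")
            continue
        # Keep the original extension in the name to avoid collisions,
        # e.g. report.pdf -> report.pdf.md
        target = out / (src.name + ".md")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_folder("reports", "reports_md")` would then write one Markdown file per input file and return the list of paths it created. The folder names here are placeholders.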
Real-World Applications
Since discovering MarkItDown, I’ve used it for several projects:
- Document Analysis: Converting PDF reports to Markdown for analysis by LLMs
- Content Extraction: Extracting text from PowerPoint presentations for summarization
- Data Processing: Converting Excel spreadsheets to structured Markdown for data analysis
- Image Analysis: Using MarkItDown with OCR to extract text from images
- Content Creation: Converting various document formats to Markdown for my blog
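As an illustration of the document-analysis case: once a report is in Markdown, its headings give you a natural way to break it into sections before handing each one to an LLM. The function below is a small sketch of that idea. `split_by_heading` is my own helper, not part of MarkItDown, and it assumes the converted text actually contains `#`-style headings, which depends on the source format. It also does not account for `#` lines inside code blocks.

```python
import re

# Matches ATX-style headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs.

    Text that appears before the first heading is returned
    with an empty string as its heading.
    """
    sections = []
    title, body = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the section collected so far, if there is one.
            if title or body:
                sections.append((title, "\n".join(body).strip()))
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    if title or body:
        sections.append((title, "\n".join(body).strip()))
    return sections


sample = "intro\n# Summary\nRevenue grew.\n## Risks\nSupply delays.\n"
for heading, text in split_by_heading(sample):
    print(repr(heading), "->", repr(text))
```

Each pair can then be summarized or queried on its own, which helps when the full document would not fit in the context window.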
Advanced Features
MarkItDown offers several advanced features that make it even more powerful:
- Azure Document Intelligence Integration: For better PDF conversion
- LLM Integration: For generating descriptions of images
- Plugin System: For extending MarkItDown’s capabilities
- Docker Support: For running MarkItDown in containers
Conclusion: A Game-Changer for LLM Workflows
MarkItDown has become an essential tool in my LLM workflow. It bridges the gap between human-readable documents and LLM-processable content, making it much easier to work with various file formats.
Whether you’re building an AI agent that needs to process documents, creating a content analysis pipeline, or just trying to get your LLM to understand the content of a PDF, MarkItDown is a tool worth adding to your toolkit.
Give it a try, and you might just find that it transforms how you work with documents and LLMs.