Deep Dive: Building a Custom Markdown-to-HTML Engine from Scratch | AI Systems Design From Scratch


Deep Dive: Building a Custom Markdown-to-HTML Engine from Scratch

Amin Boulouma — Software Engineer

In our previous post, we looked at how to orchestrate a static site generator pipeline using structural design patterns. Now, let’s zoom in on the most critical engine under the hood: the Markdown Parser (MarkdownToHTMLBuilder).

Writing a parser from scratch means bypassing heavy abstract syntax tree (AST) libraries and directly converting raw document lines into valid semantic HTML. Let’s break down how this specific internal component utilizes design patterns, object state manipulation, and explicit token parsing.

The Design Pattern: The Fluent Builder

When constructing a compiler or a conversion tool, you don’t always know your data source up front. Sometimes you read a file directly from a disk, and other times you process an active in-memory text stream.

To solve this cleanly, the subsystem implements the Builder Pattern combined with a fluent interface. Notice how MarkdownToHTMLBuilder abstracts object creation away from the execution class (MarkdownToHTML):

class MarkdownToHTMLBuilder:
    def __init__(self):
        self.markdown_file_path = ""
        self.markdown_text = ""

    def set_markdown_file_path(self, markdown_file_path):
        self.markdown_file_path = markdown_file_path
        return self  # Method chaining enabled here

    def set_markdown_text(self, markdown_text):
        self.markdown_text = markdown_text
        return self

    def build(self):
        return MarkdownToHTML(
            markdown_file_path = self.markdown_file_path,
            markdown_text = self.markdown_text,
        )

Because each setter returns self, the calls can be chained, and the engine is configured through named methods instead of positional arguments:

# Reading from disk
engine = MarkdownToHTMLBuilder().set_markdown_file_path("posts/hello-world.md").build()

# Reading from a raw in-memory string
engine = MarkdownToHTMLBuilder().set_markdown_text("# Quick Title").build()
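
The builder listing above defines only the two setters and build(). If one-call helpers such as create_from_file and create_from_text are wanted, they could be layered on top of the fluent chain. The following is a minimal sketch, not the project's actual code: the MarkdownToHTML dataclass is a stand-in for the real execution class (which this post does not show), and both class methods are assumptions.

```python
from dataclasses import dataclass


@dataclass
class MarkdownToHTML:
    # Stand-in for the real execution class, which this post does not show.
    markdown_file_path: str = ""
    markdown_text: str = ""


class MarkdownToHTMLBuilder:
    def __init__(self):
        self.markdown_file_path = ""
        self.markdown_text = ""

    def set_markdown_file_path(self, markdown_file_path):
        self.markdown_file_path = markdown_file_path
        return self  # returning self is what makes chaining possible

    def set_markdown_text(self, markdown_text):
        self.markdown_text = markdown_text
        return self

    def build(self):
        return MarkdownToHTML(
            markdown_file_path=self.markdown_file_path,
            markdown_text=self.markdown_text,
        )

    # Hypothetical convenience factories: thin wrappers over the fluent chain.
    @classmethod
    def create_from_file(cls, markdown_file_path):
        return cls().set_markdown_file_path(markdown_file_path).build()

    @classmethod
    def create_from_text(cls, markdown_text):
        return cls().set_markdown_text(markdown_text).build()


engine = MarkdownToHTMLBuilder.create_from_text("# Quick Title")
```

Keeping the factories as wrappers means the setters remain the single place where configuration is stored, so a new source type only needs one new setter plus, optionally, one new wrapper.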

Token Standardization via Enums

Instead of scattering hardcoded symbols ("**", "*") through string methods, the framework maps special syntax markers to members of a Python Enum. A change to a Markdown syntax rule then happens in one place.

from enum import Enum 

class MD_SPECIAL_CASES(Enum):
    BOLD = '**'
    ITALIC = '*'
    MULTILINE_CODE = '```'
    INLINE_CODE = '`'
    LINK = '[]()'
    IMAGE = '![]()'
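
As a brief illustration of how such an Enum is consumed, the sketch below reads a marker through .value and looks a member up from its raw string. The contains_token helper is hypothetical and exists only for this example; the Enum is trimmed to three members to keep the block short.

```python
from enum import Enum


class MD_SPECIAL_CASES(Enum):
    BOLD = '**'
    ITALIC = '*'
    INLINE_CODE = '`'


def contains_token(text, token):
    """Hypothetical helper: True if the token's marker appears in text."""
    return token.value in text


line = "some **bold** text"

# .value yields the raw marker string stored on the member.
bold_marker = MD_SPECIAL_CASES.BOLD.value

# Calling the Enum with a value returns the matching member.
looked_up = MD_SPECIAL_CASES('`')

has_bold = contains_token(line, MD_SPECIAL_CASES.BOLD)
# The italic marker is a substring of the bold marker, so this is True as well.
has_italic_marker = contains_token(line, MD_SPECIAL_CASES.ITALIC)
```

Because '*' is contained in '**', a naive check for italics also fires on bold text. This is one reason to handle the longer marker first.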

Line-by-Line Compilation Flow

The processing core uses a stateless pipeline approach inside MarkdownParser.parse(). Here is how a markdown document is split, evaluated, and compiled:

1. Front Matter Extraction

Jekyll files carry YAML configuration bounded by triple dashes (---) at the top of the file. Before any content is parsed, a helper skips past this metadata block:

# Guard against an empty document before inspecting the first line
if lines and lines[0].startswith('---'):
    lines = Helpers._ignore_metadata_line(lines[1:])
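
The body of Helpers._ignore_metadata_line is not shown in this post. One plausible way to skip a front matter block is sketched below; skip_front_matter is a hypothetical name, and the choice to leave the document untouched when no closing delimiter is found is an assumption.

```python
def skip_front_matter(lines):
    """Drop a leading '---' ... '---' block and return the remaining lines."""
    if not lines or not lines[0].startswith('---'):
        return lines
    # Scan forward from the second line for the closing delimiter.
    for index, line in enumerate(lines[1:], start=1):
        if line.startswith('---'):
            return lines[index + 1:]
    # No closing delimiter: treat the document as having no front matter.
    return lines


document = ["---", "title: Hello", "layout: post", "---", "# Heading", "Body text"]
content = skip_front_matter(document)
```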

2. Lexical Routing

Every single line goes through _parse_line(), which acts as a router matching prefixes to target HTML elements:

Markdown Input Prefix	Target Parser Method	Output HTML Wrapper
`# Title`	`_parse_header()`	`<h1>Title</h1>`
`> Quote`	`_parse_quote()`	`<blockquote>Quote</blockquote>`
`* Item`	`_parse_bullet_point()`	`<li>Item</li>`
`` `code` ``	`_parse_inline_code()`	`<pre><code>code</code></pre>`
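
The routing idea in the table can be sketched as a prefix lookup. This is an illustration under assumptions, not the real _parse_line(): route_line and PREFIX_TO_TAG are hypothetical names, the paragraph fallback is invented for the example, and HTML escaping is ignored.

```python
# Ordered list of (prefix, tag) pairs; the first matching prefix wins.
PREFIX_TO_TAG = [
    ("# ", "h1"),
    ("> ", "blockquote"),
    ("* ", "li"),
]


def route_line(line):
    """Map one Markdown line to an HTML fragment based on its prefix."""
    for prefix, tag in PREFIX_TO_TAG:
        if line.startswith(prefix):
            body = line[len(prefix):]
            return f"<{tag}>{body}</{tag}>"
    # Assumed fallback: lines without a known prefix become paragraphs.
    return f"<p>{line}</p>"


fragments = [route_line(l) for l in ["# Title", "> Quote", "* Item", "Plain"]]
```

A table-driven router keeps the prefix rules as data, so adding a new block type means adding one tuple and leaving the control flow alone.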

3. Inline Special Processing (Alternating Split Logic)

Parsing inline decorations like bold (**) or italic (*) without an AST can be tricky. This codebase solves it with list splits.

When a string is split on an inline token, the resulting list alternates between text outside the token pair and text inside it:

Even Indexes (i % 2 == 0): Represent standard text.
Odd Indexes (i % 2 == 1): Represent text captured inside the special tokens.
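
A quick worked example of the alternation, using only str.split. The sample strings are made up for illustration.

```python
text = "plain **bold** plain again"
parts = text.split("**")
# parts is ['plain ', 'bold', ' plain again']

# Label each piece by the parity of its index.
labelled = [
    ("inside" if i % 2 == 1 else "outside", part)
    for i, part in enumerate(parts)
]

# An unclosed marker still produces an odd-indexed piece.
unbalanced = "open **never closed".split("**")
# unbalanced is ['open ', 'never closed']
```

The second case shows the main limitation of the technique: with an odd number of markers, the trailing text lands on an odd index and would be treated as bold even though the marker was never closed.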

@staticmethod
def _parse_text_with_possible_bold(markdown_content):
    html_content = ""
    bold_split = markdown_content.split(MD_SPECIAL_CASES.BOLD.value)
    
    for i, bold_split_element in enumerate(bold_split): 
        if i % 2 == 1: 
            # Odd element means it sat between the '**' tokens!
            html_content += MarkdownParser._parse_bold_html_element(bold_split_element)
        else: 
            # Even element is passed down to check for italics nested inside
            html_content += MarkdownParser._parse_text_with_possible_italic(bold_split_element)
    return html_content
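
The method above relies on two helpers that are not shown here. The following self-contained sketch runs the same bold-then-italic pass end to end. The function names are hypothetical, and the choice of strong and em tags is an assumption, since the post does not show what _parse_bold_html_element emits.

```python
BOLD = '**'
ITALIC = '*'


def wrap_alternating(text, token, wrap_inside, handle_outside=lambda s: s):
    """Split on token; wrap odd-indexed pieces, delegate even-indexed ones."""
    pieces = []
    for i, part in enumerate(text.split(token)):
        if i % 2 == 1:
            pieces.append(wrap_inside(part))
        else:
            pieces.append(handle_outside(part))
    return "".join(pieces)


def parse_italic(text):
    return wrap_alternating(text, ITALIC, lambda s: f"<em>{s}</em>")


def parse_bold(text):
    # Bold runs first so that '**' is consumed before '*' is examined.
    return wrap_alternating(
        text, BOLD, lambda s: f"<strong>{s}</strong>", parse_italic
    )


html = parse_bold("a **b** and *c*")
```

As in the original method, only the text outside bold markers is checked for italics, so italics nested inside a bold span are left untouched.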

Clean Architecture Benefits

By ensuring that MarkdownParser reads lists of strings and outputs single strings, we guarantee complete independence from file-system behavior. The parsing logic is deterministic and completely isolated:

Testability: You can easily pass arrays of mock strings directly into MarkdownParser.parse() to test styling rules without touching a hard drive.
Extensibility: If you want to add support for strikethrough text (~~), you only need to add a member to the MD_SPECIAL_CASES Enum and map a router step inside _parse_line().
Decoupled Architecture: The file layout mechanics are managed by a custom FileOperations boundary wrapper, protecting your core string transformer from breaking if file access privileges change.
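
To make the testability and extensibility points concrete, here is a small sketch that adds a strikethrough marker and exercises it with in-memory strings only. Everything in it is an assumption made for illustration: the STRIKETHROUGH member, the del tag, and the parse function, which is a stand-in for MarkdownParser.parse() that mirrors only its list-in, string-out shape.

```python
from enum import Enum


class MD_SPECIAL_CASES(Enum):
    BOLD = '**'
    STRIKETHROUGH = '~~'  # hypothetical new member


def parse_strikethrough(text):
    """Wrap text between '~~' markers using the alternating split idea."""
    parts = text.split(MD_SPECIAL_CASES.STRIKETHROUGH.value)
    return "".join(
        f"<del>{part}</del>" if i % 2 == 1 else part
        for i, part in enumerate(parts)
    )


def parse(lines):
    """Stand-in parser: takes a list of strings, returns a single string."""
    return "\n".join(parse_strikethrough(line) for line in lines)


# No file access is needed: the input is a plain list of strings.
result = parse(["keep ~~drop~~ keep", "no markers"])
```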
