Deep Dive: Building a Custom Markdown-to-HTML Engine from Scratch
In our previous post, we looked at how to orchestrate a static site generator pipeline using structural design patterns. Now, let’s zoom in on the most critical engine under the hood: the Markdown Parser (MarkdownToHTMLBuilder).
Writing a parser from scratch means bypassing heavy abstract syntax tree (AST) libraries and directly converting raw document lines into valid semantic HTML. Let’s break down how this specific internal component utilizes design patterns, object state manipulation, and explicit token parsing.
The Design Pattern: The Fluent Builder
When constructing a compiler or a conversion tool, you don’t always know your data source up front. Sometimes you read a file directly from disk, and other times you process a string that is already in memory.
To solve this cleanly, the subsystem implements the Builder Pattern combined with a fluent interface. Notice how MarkdownToHTMLBuilder abstracts object creation away from the execution class (MarkdownToHTML):
class MarkdownToHTMLBuilder:
    def __init__(self):
        self.markdown_file_path = ""
        self.markdown_text = ""

    def set_markdown_file_path(self, markdown_file_path):
        self.markdown_file_path = markdown_file_path
        return self  # Method chaining enabled here

    def set_markdown_text(self, markdown_text):
        self.markdown_text = markdown_text
        return self

    def build(self):
        return MarkdownToHTML(
            markdown_file_path=self.markdown_file_path,
            markdown_text=self.markdown_text,
        )
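By returning self from each setter, the builder supports method chaining, so the engine can be configured without complex positional arguments. The calls below go through two static convenience methods, create_from_file and create_from_text, which are not part of the excerpt above; presumably they wrap the chained setters and build():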
# Reading from disk
engine = MarkdownToHTMLBuilder.create_from_file("posts/hello-world.md")
# Reading from raw memory string
engine = MarkdownToHTMLBuilder.create_from_text("# Quick Title")
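The static factories used above do not appear in the builder excerpt, so here is a minimal sketch of how they could sit on top of the fluent setters. The bodies of create_from_file and create_from_text are assumptions, and MarkdownToHTML is a stand-in that only stores its inputs; the real engine will differ.

```python
class MarkdownToHTML:
    """Stand-in for the article's engine; it only stores its inputs."""

    def __init__(self, markdown_file_path="", markdown_text=""):
        self.markdown_file_path = markdown_file_path
        self.markdown_text = markdown_text


class MarkdownToHTMLBuilder:
    def __init__(self):
        self.markdown_file_path = ""
        self.markdown_text = ""

    def set_markdown_file_path(self, markdown_file_path):
        self.markdown_file_path = markdown_file_path
        return self  # returning self is what makes chaining possible

    def set_markdown_text(self, markdown_text):
        self.markdown_text = markdown_text
        return self

    def build(self):
        return MarkdownToHTML(
            markdown_file_path=self.markdown_file_path,
            markdown_text=self.markdown_text,
        )

    # Hypothetical convenience factories (assumed, not shown in the post):
    # each one is just a named shortcut for a chain of setters plus build().
    @staticmethod
    def create_from_file(markdown_file_path):
        return MarkdownToHTMLBuilder().set_markdown_file_path(markdown_file_path).build()

    @staticmethod
    def create_from_text(markdown_text):
        return MarkdownToHTMLBuilder().set_markdown_text(markdown_text).build()


engine = MarkdownToHTMLBuilder.create_from_text("# Quick Title")
print(engine.markdown_text)  # prints "# Quick Title"
```

Written this way, the explicit chain MarkdownToHTMLBuilder().set_markdown_text(...).build() and the static shortcut produce the same object, so callers can pick whichever reads better.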
Token Standardization via Enums
Instead of cluttering string methods with hardcoded symbols ("**", "*"), the framework maps special syntax markers using Python’s native Enum system. This keeps every change to the markdown syntax rules in a single place.
from enum import Enum


class MD_SPECIAL_CASES(Enum):
    BOLD = '**'
    ITALIC = '*'
    MULTILINE_CODE = '```'
    INLINE_CODE = '`'
    LINK = '[]()'
    IMAGE = '![]()'
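Two practical details follow from this Enum: members are not strings, so the marker text is read through .value, and because '*' is a prefix of '**' the longer marker has to be tried first. The sketch below illustrates both points on a trimmed copy of the Enum; markers_longest_first and first_marker are hypothetical helpers, not part of the original codebase.

```python
from enum import Enum


class MD_SPECIAL_CASES(Enum):
    # Trimmed copy of the article's Enum, kept to the two overlapping markers.
    BOLD = '**'
    ITALIC = '*'


def markers_longest_first(*members):
    """Return the marker strings, longest first, so '**' is tried before '*'."""
    return sorted((member.value for member in members), key=len, reverse=True)


def first_marker(text, markers):
    """Return the first marker that `text` starts with, or None."""
    for marker in markers:
        if text.startswith(marker):
            return marker
    return None


markers = markers_longest_first(MD_SPECIAL_CASES.ITALIC, MD_SPECIAL_CASES.BOLD)
print(markers)                            # ['**', '*']
print(first_marker("**bold**", markers))  # **
```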
Line-by-Line Compilation Flow
The processing core uses a stateless pipeline approach inside MarkdownParser.parse(). Here is how a markdown document is split, evaluated, and compiled:
1. Front Matter Extraction
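Jekyll posts start with a YAML front matter block delimited by triple dashes (---) at the top of the file. Before any content is parsed, the parser checks whether the first line is such a delimiter and, if so, hands the remaining lines to a helper that is expected to drop everything up to the closing delimiter: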
# Guard against an empty document before looking at the first line
if lines and lines[0].startswith('---'):
    lines = Helpers._ignore_metadata_line(lines[1:])
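The helper itself is not shown in the post. As a rough sketch of what front matter skipping can look like, assuming the block always ends at the next line that starts with ---, the hypothetical function skip_front_matter below does the whole job in one place:

```python
def skip_front_matter(lines):
    """Return `lines` without a leading '---' ... '---' front matter block."""
    if not lines or not lines[0].startswith('---'):
        return lines
    # Search for the closing delimiter, starting after the opening one.
    for index, line in enumerate(lines[1:], start=1):
        if line.startswith('---'):
            return lines[index + 1:]
    # No closing delimiter: treat the document as having no front matter.
    return lines


post = ['---', 'title: Hello', '---', '# Hello', 'Body text']
print(skip_front_matter(post))  # ['# Hello', 'Body text']
```

Returning the input unchanged when the closing delimiter is missing is a design choice made for this sketch; the original helper may handle that case differently.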
2. Lexical Routing
Every single line goes through _parse_line(), which acts as a router matching prefixes to target HTML elements:
| Markdown Input Prefix | Target Parser Method | Output HTML Wrapper |
|---|---|---|
| # Title | _parse_header() | <h1>Title</h1> |
| > Quote | _parse_quote() | <blockquote>Quote</blockquote> |
| * Item | _parse_bullet_point() | <li>Item</li> |
| `code` | _parse_inline_code() | <pre><code>code</code></pre> |
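The routing table above can be sketched as a small table-driven function. This is an illustration only, not the post's _parse_line(): route_line and ROUTES are hypothetical names, the <p> fallback for unmatched lines is an assumption, and escaping with html.escape is an addition the original may not perform.

```python
import html

# (prefix, tag) pairs taken from the first three table rows.
ROUTES = [('# ', 'h1'), ('> ', 'blockquote'), ('* ', 'li')]


def route_line(line):
    """Wrap one Markdown line in the HTML tag selected by its prefix."""
    for prefix, tag in ROUTES:
        if line.startswith(prefix):
            content = html.escape(line[len(prefix):])
            return f'<{tag}>{content}</{tag}>'
    # Assumed fallback: anything without a known prefix becomes a paragraph.
    return f'<p>{html.escape(line)}</p>'


print(route_line('# Title'))  # <h1>Title</h1>
print(route_line('> Quote'))  # <blockquote>Quote</blockquote>
print(route_line('* Item'))   # <li>Item</li>
```

Keeping the prefixes in a list means a new block type is one more tuple rather than one more branch.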
3. Inline Special Processing (Alternating Split Logic)
Parsing inline decorations like bold (**) or italic (*) without building an AST can be tricky. This codebase handles it with plain string splits.
When you split a string on an inline token, the pieces alternate between text outside the markers and text inside them. For example, splitting "a **b** c" on ** yields ["a ", "b", " c"], so the resulting list can be read by index:
- Even Indexes (i % 2 == 0): Represent standard text.
- Odd Indexes (i % 2 == 1): Represent text captured inside the special tokens.
@staticmethod
def _parse_text_with_possible_bold(markdown_content):
    html_content = ""
    bold_split = markdown_content.split(MD_SPECIAL_CASES.BOLD.value)
    for i, bold_split_element in enumerate(bold_split):
        if i % 2 == 1:
            # Odd element means it sat between the '**' tokens!
            html_content += MarkdownParser._parse_bold_html_element(bold_split_element)
        else:
            # Even element is passed down to check for italics nested inside
            html_content += MarkdownParser._parse_text_with_possible_italic(bold_split_element)
    return html_content
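To try the alternating split in isolation, here is a self-contained sketch of the same idea. The helper wrap_alternating and the <strong>/<em> output tags are assumptions made for illustration; the post does not show what _parse_bold_html_element actually emits.

```python
def wrap_alternating(text, token, tag, parse_plain=lambda s: s):
    """Split on `token`; wrap odd-indexed pieces in <tag>, pass even ones on."""
    pieces = text.split(token)
    out = []
    for i, piece in enumerate(pieces):
        if i % 2 == 1:
            # Odd index: this piece sat between two occurrences of the token.
            out.append(f'<{tag}>{piece}</{tag}>')
        else:
            # Even index: ordinary text, handed to the next inline parser.
            out.append(parse_plain(piece))
    return ''.join(out)


def parse_italic(text):
    return wrap_alternating(text, '*', 'em')


def parse_bold(text):
    # Bold runs first so '**' is consumed before the single '*' split.
    return wrap_alternating(text, '**', 'strong', parse_plain=parse_italic)


print('a **b** c'.split('**'))          # ['a ', 'b', ' c']
print(parse_bold('a **b** and *c* d'))  # a <strong>b</strong> and <em>c</em> d
```

One limitation worth knowing: the split cannot tell an opening marker from a closing one, so an unbalanced input such as "a **b" still produces a bold element for everything after the stray marker.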
Clean Architecture Benefits
By ensuring that MarkdownParser reads lists of strings and outputs single strings, we guarantee complete independence from file-system behavior. The parsing logic is deterministic and completely isolated:
- Testability: You can easily pass lists of mock strings directly into MarkdownParser.parse() to test styling rules without touching a hard drive.
- Extensibility: If you want to add support for strikethrough text (~~), you only need to update the MD_SPECIAL_CASES Enum and map a router step inside _parse_line().
- Decoupled Architecture: The file layout mechanics are managed by a custom FileOperations boundary wrapper, protecting your core string transformer from breaking if file access privileges change.
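The testability point can be made concrete with a small unit test that never opens a file. The parse function below is a hypothetical stand-in for MarkdownParser.parse(), assuming only the contract described above (a list of lines in, a single HTML string out); its header and paragraph rules are simplified for the example.

```python
import unittest


def parse(lines):
    """Hypothetical stand-in: list of Markdown lines in, one HTML string out."""
    html_lines = []
    for line in lines:
        if line.startswith('# '):
            html_lines.append('<h1>' + line[2:] + '</h1>')
        else:
            html_lines.append('<p>' + line + '</p>')
    return '\n'.join(html_lines)


class ParseTest(unittest.TestCase):
    def test_header_and_paragraph(self):
        # The input is an in-memory list, so no file system is involved.
        self.assertEqual(parse(['# Title', 'Body']), '<h1>Title</h1>\n<p>Body</p>')

    def test_empty_document(self):
        self.assertEqual(parse([]), '')


# Run the tests programmatically so the snippet works in any script or REPL.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```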