Tuesday, April 29, 2008

Authoring Cross-Platform Wiki Markup

I'll never think of line delimiters the same way again. Mac uses \r, Unix uses \n, and Windows uses \r\n. Who cares? Well, when it comes to writing platform-independent Wiki markup, it matters.


Take for example a document that was written on a Unix platform. Every line is separated by a single \n character. For Wiki markup, paragraphs are separated by an empty line, which is represented in the document as two consecutive newlines \n\n. When a Wiki markup parser converts this document to HTML, it looks for the empty newline (the second \n) and uses that to close the previous paragraph and start a new one.


Now what happens if the document is opened and edited on a Mac? Mac uses \r as a newline. Suppose the Mac user adds a new empty line just before an existing one in this same document. For example:


Prior to editing:


some text\nmore text


After editing:

some text\r\nmore text


Prior to adding the newline, the document contains a single \n character; afterwards it contains \r\n. In the user's editor (on a Mac) the \r\n appears visually as two lines. However, \r\n happens to be the Windows line delimiter, so Windows-accommodating Wiki markup parsers will parse it as a single line delimiter. In many documents this may not matter, but in Wiki markup it can make a big difference: the empty line that the user sees in the editor never reaches the parser, so no new paragraph is started.
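The effect can be reproduced with a small sketch. The ParagraphCounter class below is hypothetical and is not Textile-J code; it assumes a parser that accepts \r\n, \r or \n as one line delimiter and starts a new paragraph after each empty line.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Textile-J. It counts paragraphs the way a
// Windows-accommodating parser might: \r\n, \r and \n each count as ONE delimiter.
class ParagraphCounter {

    // \r\n is listed first so that the pair is consumed as a single delimiter.
    private static final Pattern LINE_DELIMITER = Pattern.compile("\r\n|\r|\n");

    static int countParagraphs(String markup) {
        // A limit of -1 keeps trailing empty strings, so no line is dropped.
        String[] lines = LINE_DELIMITER.split(markup, -1);
        int paragraphs = 0;
        boolean inParagraph = false;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                inParagraph = false;   // an empty line closes the current paragraph
            } else if (!inParagraph) {
                inParagraph = true;    // the first non-empty line opens a new one
                paragraphs++;
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        System.out.println(countParagraphs("some text\nmore text"));    // prints 1
        System.out.println(countParagraphs("some text\r\nmore text"));  // prints 1
        System.out.println(countParagraphs("some text\n\nmore text"));  // prints 2
    }
}
```

The second call is the edited document from the example above: the user sees two lines in the editor, but the parser still finds only one paragraph.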


Editors that support editing Wiki markup on multiple platforms must be coded carefully to avoid this issue. For Textile-J this means that the end-of-line markers are converted to the platform default when the editor first opens a file.
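The conversion itself can be quite small. The following is a minimal sketch of that kind of normalization, not the actual Textile-J code; the LineDelimiters class and its method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical utility, not taken from Textile-J.
class LineDelimiters {

    // Matches any of the three delimiters; \r\n comes first so it stays one unit.
    private static final Pattern ANY_DELIMITER = Pattern.compile("\r\n|\r|\n");

    // Replaces every line delimiter in the text with the given one.
    static String normalize(String text, String delimiter) {
        // quoteReplacement guards against '$' or '\' in the replacement string.
        return ANY_DELIMITER.matcher(text).replaceAll(Matcher.quoteReplacement(delimiter));
    }

    // Converts the text to the delimiter of the platform the editor runs on.
    static String toPlatformDefault(String text) {
        return normalize(text, System.lineSeparator());
    }
}
```

Calling toPlatformDefault on the file content when it is first opened means that any newline the user types afterwards matches the delimiters already in the document, so a stray \r next to an existing \n cannot occur.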


Who knew that line delimiters could be so important?

Thursday, April 24, 2008

Roller and Textile-J

The Roller Support Project has announced plugins for parsing markup using Textile-J.  This brings support for Textile, MediaWiki, Confluence and TracWiki markup to the Apache Roller blog server.  It's great to see projects like this picking up Textile-J.  Good work, Anil!

Tuesday, April 15, 2008

JUnit Made It Easy!

Recently I posted about evolving the architecture of a markup parser.  During its evolution I ended up completely rewriting the parser to a new architecture.  A solid JUnit test suite made it possible to do this while maintaining confidence in the quality of the new code.  This article describes the characteristics of a JUnit test suite that make such a switch easy.

Methodology:

While developing the Textile-J markup parser I incrementally improved the parser by adding support for markup syntax.  Paragraphs were added first, then bold text, then lists, and so on.  As support was added for each syntax feature, I created one or more JUnit tests to exercise the feature.  I inspected the HTML output produced from the small snippet of Textile used, and when I was satisfied I created one or more assertions in the JUnit test.  This made development easy because:
  • I could isolate the syntax that I wanted to support and test it independently of other syntax
  • Since the tests were small, stepping through the code in the debugger only involved the relevant code
  • It's easy to see when something breaks that worked before
  • I could be confident in the quality of the parser

The Result

The result of using this methodology was a JUnit test suite of over 100 tests, and I could easily measure progress in supporting Textile markup features.

Markup Parser Rewrite

Rewriting the parser first involved stubbing out the new parser API, and then making the old parser API a facade to the new one.  In doing this my new parser API automatically gained over 100 JUnit tests -- which of course were now failing.  By incrementally improving the new parser until all JUnits passed, I could be certain that the new parser supports the same markup language features.
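The facade step might look roughly like the sketch below. Both class names and both method signatures are assumptions made for illustration, and the new parser is stubbed to handle paragraphs only, as it would be early in such a rewrite.

```java
// Hypothetical new parser API: stubbed out first, then improved feature by feature.
class NewMarkupParser {
    String toHtml(String markup) {
        StringBuilder html = new StringBuilder();
        // Only paragraphs (separated by an empty line) are supported so far.
        for (String paragraph : markup.split("\n\n")) {
            if (!paragraph.trim().isEmpty()) {
                html.append("<p>").append(paragraph.trim()).append("</p>");
            }
        }
        return html.toString();
    }
}

// Hypothetical old parser API: the public method keeps its signature, so the
// existing tests compile unchanged, but all work is delegated to the new parser.
class LegacyTextileParser {
    private final NewMarkupParser delegate = new NewMarkupParser();

    String parseToHtml(String markup) {
        return delegate.toHtml(markup);
    }
}
```

Every existing test that calls parseToHtml now exercises the new parser, and the tests for features the stub does not yet support fail until those features are added.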

As I finished the last language feature and saw green for every JUnit test, I had a huge sense of relief.  I now knew that the new parser was working with a guaranteed level of quality.

The Textile-J project now has over 250 JUnit tests.  To get an idea what feedback JUnit can give, take a look at the test report here, and the code coverage report here.

Attributes of a Test Oriented Project

The following are some attributes of my project that make JUnit tests so powerful:
  • A JUnit test for every feature
  • Meaningful assertions
  • An environment that makes running JUnits painless and easy (like Eclipse)
  • A development methodology that encourages or requires tests for all new code
  • An Ant build script that runs the JUnits whenever a new build is created
  • Integrated code coverage (such as Cobertura) so that you can see which code is tested
Once your project is set up to run tests, it's easy to add them... so jump that first hurdle and get testing with JUnit!

Wednesday, April 2, 2008

Evolving a Wiki Markup Parser

Initially my Textile-J open-source project started out with a modest goal: to provide a Textile markup parser for Java. Since the project started, interest in other features has driven the evolution of the code base in some interesting directions. This article discusses architectural choices and the evolution of the Textile-J parser architecture to meet changing requirements.

The initial design of the Textile-J parser was a monolithic stateful parser class that used regular expressions and local variables to parse the markup. The results of parsing markup were passed to the XMLStreamWriter interface as HTML elements and attributes. This architecture worked fine initially, as it allowed me to evolve my understanding of Textile markup structure (blocks, phrases and tokens) and the XMLStreamWriter provided a solid means of outputting XHTML.

Feature: Multiple Output Formats

The first major feature request was to support multiple output formats: HTML and DocBook. This required a major rethink of the parser. While XMLStreamWriter was a great means of outputting XML, the parser had to know that HTML was the output format. After some thinking and reviewing the ever-relevant GoF Design Patterns book, I recognized that the Builder design pattern was an ideal fit. So I created a new interface called DocumentBuilder, with HtmlDocumentBuilder and DocbookDocumentBuilder implementations. Now the parser need not know about the output format, meaning that a single parser could drive multiple output formats.
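Here is a minimal sketch of the idea. The three type names come from the paragraph above, but the method names, the SimpleMarkupParser class and the BuilderDemo class are assumptions for illustration and do not reflect the real Textile-J signatures. Text escaping is left out to keep the sketch short.

```java
// The parser talks only to this interface and never names an output format.
interface DocumentBuilder {
    void beginParagraph();
    void characters(String text);
    void endParagraph();
}

class HtmlDocumentBuilder implements DocumentBuilder {
    private final StringBuilder out = new StringBuilder();
    @Override public void beginParagraph() { out.append("<p>"); }
    @Override public void characters(String text) { out.append(text); }
    @Override public void endParagraph() { out.append("</p>"); }
    @Override public String toString() { return out.toString(); }
}

class DocbookDocumentBuilder implements DocumentBuilder {
    private final StringBuilder out = new StringBuilder();
    @Override public void beginParagraph() { out.append("<para>"); }
    @Override public void characters(String text) { out.append(text); }
    @Override public void endParagraph() { out.append("</para>"); }
    @Override public String toString() { return out.toString(); }
}

// Hypothetical parser that understands paragraphs only.
class SimpleMarkupParser {
    static void parse(String markup, DocumentBuilder builder) {
        for (String paragraph : markup.split("\n\n")) {
            builder.beginParagraph();
            builder.characters(paragraph);
            builder.endParagraph();
        }
    }
}

class BuilderDemo {
    public static void main(String[] args) {
        DocumentBuilder html = new HtmlDocumentBuilder();
        DocumentBuilder docbook = new DocbookDocumentBuilder();
        SimpleMarkupParser.parse("first\n\nsecond", html);
        SimpleMarkupParser.parse("first\n\nsecond", docbook);
        System.out.println(html);     // prints <p>first</p><p>second</p>
        System.out.println(docbook);  // prints <para>first</para><para>second</para>
    }
}
```

Supporting another output format then means writing one more DocumentBuilder implementation, with no change to the parser.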

Feature: Markup Dialects

Next, extensions to the Textile markup language were requested. For example, the Confluence markup dialect is very similar to Textile but has some additional syntax features. With a better understanding of markup structure (blocks, phrases and tokens) I designed a new 'Dialect' concept that allowed for markup extensions to be added to the base Textile language. The 'Dialect' design was object-oriented, making extensions modular and relatively easy to add without disturbing the base markup parser code.

Feature: Markup Languages

While the approach of markup dialects worked well for adding extensions to Textile parsing, it did not solve the problem of fully supporting new markup languages. Dialects at this point were only capable of extending Textile.

Community demand for supporting new markup languages was increasing, including requests for MediaWiki and Markdown. Supporting these languages was not possible using the existing Textile parser, as the markup rules of Textile were embedded in a single monolithic class. After some prolonged hesitation (this would be a big job), I got down to the design of a complete rewrite of the parser architecture.

The first step was to read up on various markup languages. Most languages that I've looked at use a simple line-based parsing approach that consists of dividing the markup into blocks, phrases and tokens. Blocks are usually multi-line constructs that have certain attributes, such as paragraphs and lists. Phrases are modifiers that affect text on a single line, and tokens are match-and-replace elements in the text.
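To make the three levels concrete, here is a deliberately tiny line-based sketch. It is not Textile-J code, and the LineBasedSketch class is hypothetical. It assumes two blocks (paragraphs and "* " bullet lists), one phrase modifier (*bold*) and one token ("(c)" replaced by a copyright entity).

```java
import java.util.regex.Pattern;

// Hypothetical illustration of blocks, phrases and tokens; not Textile-J code.
class LineBasedSketch {

    // Phrase modifier: *text* within a single line becomes strong text.
    private static final Pattern BOLD = Pattern.compile("\\*([^*\\n]+)\\*");
    // Token: a plain match-and-replace element.
    private static final Pattern COPYRIGHT = Pattern.compile("\\(c\\)");

    // Applies phrase modifiers and tokens to the content of one line.
    static String inline(String line) {
        String result = BOLD.matcher(line).replaceAll("<strong>$1</strong>");
        return COPYRIGHT.matcher(result).replaceAll("&#169;");
    }

    // Walks the markup line by line, opening and closing blocks as needed.
    static String toHtml(String markup) {
        StringBuilder html = new StringBuilder();
        String open = null; // element name of the open block: "p", "ul" or null
        for (String line : markup.split("\n", -1)) {
            if (line.trim().isEmpty()) {
                open = close(html, open);          // an empty line ends any block
            } else if (line.startsWith("* ")) {
                if (!"ul".equals(open)) {
                    close(html, open);
                    html.append("<ul>");
                    open = "ul";
                }
                html.append("<li>").append(inline(line.substring(2))).append("</li>");
            } else {
                if (!"p".equals(open)) {
                    close(html, open);
                    html.append("<p>");
                    open = "p";
                } else {
                    html.append(' ');              // continuation line of a paragraph
                }
                html.append(inline(line));
            }
        }
        close(html, open);
        return html.toString();
    }

    // Emits the closing tag for the open block, if any, and returns null.
    private static String close(StringBuilder html, String open) {
        if (open != null) {
            html.append("</").append(open).append('>');
        }
        return null;
    }
}
```

Even at this size the weakness of a single class is visible: the rules of one language are baked into toHtml and inline, so supporting a second language means editing the parser itself.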

Requirements of the new design had to include:
  • ease of adding new markup languages
  • modular object-oriented design for better maintainability
  • facilitate comprehensive JUnit tests
  • easier learning curve for community contributions
  • pluggable architecture
  • output format agnostic
Using my experience with the previous Dialect design, here's what I came up with:




The parser delegates all language-specific parsing to the Dialect. The Dialect defines a language with a collection of Blocks, phrase modifiers (PatternBasedElement) and tokens (PatternBasedElement). Blocks implement rules specific to paragraphs, lists, tables, etc. Concrete PatternBasedElementProcessor classes know how to emit portions of content affected by markup (phrase modifiers or tokens).
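The relationships can be sketched as follows. The names Dialect, Block, PatternBasedElement and PatternBasedElementProcessor come from the description above. Every method signature, and the MarkupParser, HeadingBlock, ParagraphBlock and TextileLikeDialect classes, are assumptions made for illustration. To stay short, the sketch treats each line as its own block, whereas real blocks usually span several lines.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Knows how to emit the content matched by one phrase modifier or token.
interface PatternBasedElementProcessor {
    String emit(Matcher match);
}

// Pairs a regular expression with the processor that handles its matches.
class PatternBasedElement {
    private final Pattern pattern;
    private final PatternBasedElementProcessor processor;

    PatternBasedElement(String regex, PatternBasedElementProcessor processor) {
        this.pattern = Pattern.compile(regex);
        this.processor = processor;
    }

    String apply(String text) {
        Matcher matcher = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            matcher.appendReplacement(out, Matcher.quoteReplacement(processor.emit(matcher)));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}

// A block decides whether it applies to a line and how that line is emitted.
interface Block {
    boolean canStart(String line);
    String emit(String line, Dialect dialect);
}

// A dialect is the full definition of one markup language.
class Dialect {
    final List<Block> blocks;
    final List<PatternBasedElement> elements;

    Dialect(List<Block> blocks, List<PatternBasedElement> elements) {
        this.blocks = blocks;
        this.elements = elements;
    }

    String emitInline(String text) {
        String result = text;
        for (PatternBasedElement element : elements) {
            result = element.apply(result);
        }
        return result;
    }
}

class HeadingBlock implements Block {
    @Override public boolean canStart(String line) { return line.startsWith("h1. "); }
    @Override public String emit(String line, Dialect dialect) {
        return "<h1>" + dialect.emitInline(line.substring(4)) + "</h1>";
    }
}

class ParagraphBlock implements Block {
    @Override public boolean canStart(String line) { return !line.trim().isEmpty(); }
    @Override public String emit(String line, Dialect dialect) {
        return "<p>" + dialect.emitInline(line) + "</p>";
    }
}

// The parser holds no language rules of its own; the first matching block wins.
class MarkupParser {
    private final Dialect dialect;

    MarkupParser(Dialect dialect) { this.dialect = dialect; }

    String parse(String markup) {
        StringBuilder html = new StringBuilder();
        for (String line : markup.split("\n")) {
            for (Block block : dialect.blocks) {
                if (block.canStart(line)) {
                    html.append(block.emit(line, dialect));
                    break;
                }
            }
        }
        return html.toString();
    }
}

class TextileLikeDialect {
    static Dialect create() {
        return new Dialect(
            Arrays.<Block>asList(new HeadingBlock(), new ParagraphBlock()),
            Arrays.asList(
                new PatternBasedElement("\\*([^*]+)\\*", m -> "<strong>" + m.group(1) + "</strong>"),
                new PatternBasedElement("\\(c\\)", m -> "&#169;")));
    }
}
```

With this shape, new MarkupParser(TextileLikeDialect.create()).parse(...) is the only entry point, and adding a language means supplying another Dialect with its own blocks and elements while the parser class stays untouched.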

So far the following languages have been implemented using this new architecture:
  • Textile
  • MediaWiki (of Wikipedia fame)
  • Confluence
I hope to see contributions from the community for supporting other languages, such as Markdown and Creole.