Fun with Simple Text Processing and Directed Graphs

I wanted to turn a nested Markdown list into a directed graph in MATLAB.

Today's post is about simple Markdown text lists and text processing using string patterns and directed graphs. Although I do have an application in mind for this, eventually, I enjoyed getting just this part to work, and I thought these processing techniques might be useful to other MATLAB users.

Useful functions making an appearance: readlines, textBoundary, asManyOfPattern, whitespacePattern, extract, erase, table, digraph, plot, highlight, successors, and outdegree.

For my hypothetical application, I envision having a bunch of nested lists of text labels. I thought that these text labels would be mostly easily written and revised in a plain text file, using Markdown syntax. Here is a simple example with three levels of nesting. (Each level is indented by 4 spaces.)

- A
    - B
        - C
    - D
        - G
        - H
- E
- F
    - Q

Read the Text File

I have this text in a file called sample_list.md. My first step is to read the file contents into a string array, one string per line of the text file. The function readlines is perfect for this.

lines = readlines("sample_list.md")
lines = 10x1 string
"- A"       
"    - B"       
"        - C"       
"    - D"       
"        - G"       
"        - H"       
"- E"       
""          
"- F"       
"    - Q"       

Notice there is a blank line. I've decided to simply ignore blank lines in the file, so I will delete them from the lines array.

lines(lines == "") = []
lines = 9x1 string
"- A"       
"    - B"       
"        - C"       
"    - D"       
"        - G"       
"        - H"       
"- E"       
"- F"       
"    - Q"       

Extract the Labels

Next, I want to extract the label, which is the text to the right of the hyphen and space. My strategy is to remove, from the beginning of each line, any number of whitespace characters, followed by a hyphen, followed by any number of whitespace characters. The MATLAB string pattern functions come in handy for this.

For illustration, I'll break down the pattern formation into several annotated steps.

Match text at the beginning of the line:

p = textBoundary("start");

Followed by any number of whitespace characters:

p = p + asManyOfPattern(whitespacePattern);

Followed by a hyphen:

p = p + "-";

Followed by any number of whitespace characters:

p = p + asManyOfPattern(whitespacePattern)
p = pattern
  Matching:

    textBoundary("start") + asManyOfPattern(whitespacePattern) + "-" + asManyOfPattern(whitespacePattern)

Next, call erase, which can operate on an entire string array at once.

labels = erase(lines,p)
labels = 9x1 string
"A"         
"B"         
"C"         
"D"         
"G"         
"H"         
"E"         
"F"         
"Q"         

Compute the Nesting Levels

I'll assume the normal Markdown convention of using 4 spaces for each indentation level. Form another text pattern that matches spaces at the beginning of the line:

p2 = textBoundary("start") + asManyOfPattern(whitespacePattern);

Extract the beginning spaces, count the number of characters, and divide by 4.

indent_levels = strlength(extract(lines, p2)) / 4
indent_levels = 9x1
     0
     1
     2
     1
     2
     2
     0
     0
     1

Form a Directed Graph of the Labels

A directed graph will be useful for further processing of the nested list labels. Each label will be a node, and there will be an additional node that serves as the root of the list tree.

I won't be using the labels as node names, because, in my anticipated application, list labels will not necessarily be unique. Therefore, I'll use the NodeProperties table to store each label. The label of the root node will be an empty string, "".

nodes = table([labels ; ""], VariableNames = "Label")
Label
1 "A"
2 "B"
3 "C"
4 "D"
5 "G"
6 "H"
7 "E"
8 "F"
9 "Q"
10 ""

The edges of the directed graph will represent the nested list hierarchy. For example, the node labeled "B" should have the node "A" as a parent, and it should have the node "C" as a child. The node labeled "E" should have the root node as a parent, and it should have no children.

I figure this out in a loop.

num_list_items = length(labels);
root_node = num_list_items + 1;
edge_nodes = zeros(num_list_items, 2);
for k = 1:length(labels)
    level_k = indent_levels(k);
        if (level_k == 0)
            parent = root_node;
        else
            % This node's parent is the nearest previous node whose list
            % level is one less.
            previous_node_indent_levels = indent_levels(1:(k-1));
            parent = find(previous_node_indent_levels == (level_k - 1), ...
                1, "last");
        end
        edge_nodes(k, :) = [parent k];
end

edges = table(edge_nodes, VariableNames = "EndNodes")
EndNodes
1 2
1 10 1
2 1 2
3 2 3
4 1 4
5 4 5
6 4 6
7 10 7
8 10 8
9 8 9

Now the graph can be formed from the tables of nodes and edges.

g = digraph(edges, nodes)
g = 
  digraph with properties:

    Edges: [9x1 table]
    Nodes: [10x1 table]

Working with the Directed Graph

Here are some simple processing and visualization techniques for working with the directed graph.

Visualize the graph using plot:

graph_plot = plot(g, NodeLabel = g.Nodes.Label);

figure_0.png

The highlight function can draw visual attention to specific nodes and edges. For example, here is a graph plot that shows all of the top-level list items:

top_level_items = successors(g,root_node)
top_level_items = 3x1
     1
     7
     8

highlight(graph_plot, top_level_items, Marker = "pentagram", ...
    MarkerSize = 10)

figure_1.png

Here is a graph plot that highlights the "leaf" list items—items that have nothing indented under them.

graph_plot = plot(g, NodeLabel = g.Nodes.Label);
leaf_items = find(outdegree(g) == 0)
leaf_items = 5x1
     3
     5
     6
     7
     9

highlight(graph_plot, leaf_items, Marker = "square", ...
    MarkerSize = 10)

figure_2.png

Finally, for each leaf item, I'd like to be able to show the complete set of list items leading to it. Here's how to do that, using the digraph function shortestpath and the string function join.

num_leaf_items = length(leaf_items);
leaf_paths = strings(num_leaf_items, 1);
for k = 1:num_leaf_items
    p = shortestpath(g, root_node, leaf_items(k));
    leaf_paths(k) = join(g.Nodes.Label(p(2:end)), " → ");
end
disp(leaf_paths)
    "A → B → C"
    "A → D → G"
    "A → D → H"
    "E"
    "F → Q"

I hope you find some of the text processing, graph processing, and graph visualization functions useful for your own applications.