The goal is to parse an HTTP response into two elements: the header and the body. As everyone knows, the header is separated from the body by two CRLFs (\r\n\r\n). Some freestylers (obviously not following the RFC) use two bare \n instead; I won't deal with that here.
First, get an HTTP response:
$ echo -e 'GET / HTTP/1.1\r\n\r\n' | nc google.com 80 | tee /tmp/response
HTTP/1.1 302 Found
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Location: ...
Content-Length: 256
Date: ...
Server: GFE/2.0
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="...">here</A>.
</BODY></HTML>
The response content is copied to /tmp/response. With Erlang, this is easily read with file:read_file/1 (which loads the content into a binary) and then turned into a string using erlang:binary_to_list/1.
Erlang provides two interesting modules that help when dealing with strings: string and lists (since a string is a list in Erlang). Looking for something that would let me split a string into several tokens based on a separator led me to these solutions:
- tokenize using string:tokens/2
- offset and sub-string using string:str/2 and string:substr
- offset and split using string:str/2 and lists:split/2
- coding way
Tokenize
string:tokens/2 lets you split a string into a list of tokens.
For example:
1> string:tokens("this is a test", " ").
["this","is","a","test"]
A disadvantage (or advantage, depending on what you need) of this method is that the separator argument is really a set of separator characters, and adjacent separators are treated as one, as shown here:
1> string:tokens("I want to separate successive c in here", "cc").
["I want to separate su","essive "," in here"]
instead of the expected ["I want to separate su","essive c in here"]
:-(
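Note that the same behaviour also makes empty fields between two separators disappear completely (we'll come back to this for the CSV case):
1> string:tokens("a,,b", ",").
["a","b"]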
Similarly, when parsing the HTTP response, we want to separate the header from the body using the separator \r\n\r\n. As shown above, this won't work with string:tokens, since \r\n\r\n will be treated the same as a single \r\n and thus result in the following:
1> {ok, Bin} = file:read_file("/tmp/response").
...
2> Content = binary_to_list(Bin).
...
3> string:tokens(Content, "\r\n\r\n").
["HTTP/1.1 302 Found","Cache-Control: private",
"Content-Type: text/html; charset=UTF-8",
"Location: ...",
"Content-Length: 256","Date: ...",
"Server: GFE/2.0",
"<HTML><HEAD><meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\">",
"<TITLE>302 Moved</TITLE></HEAD><BODY>",
"<H1>302 Moved</H1>","The document has moved",
"<A HREF=\"...\">here</A>.",
"</BODY></HTML>"]
Let's try something else ...
Offset and sub-string
string:str/2 returns the offset (1-based) of a specific sequence in a string, while string:substr/2,3 extracts a sub-string given its starting offset and, optionally, its length.
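A quick illustration of both functions (offsets in Erlang strings start at 1):
1> string:str("hello world", "wor").
7
2> string:substr("hello world", 7, 3).
"wor"
3> string:substr("hello world", 7).
"world"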
One way of doing it would then be:
1> {ok, Bin} = file:read_file("/tmp/response").
...
2> Content = binary_to_list(Bin).
...
3> Offset = string:str(Content, "\r\n\r\n").
...
4> % first retrieve the header
4> string:substr(Content, 1, Offset+1).
"HTTP/1.1 302 Found\r\nCache-Control: private\r\nContent-Type: text/html; charset=UTF-8\r\nLocation: ...\r\nContent-Length: 256\r\nDate: ...\r\nServer: GFE/2.0\r\n"
5> % then the body
5> string:substr(Content, Offset+4).
"<HTML><HEAD><meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\">\n<TITLE>302 Moved</TITLE></HEAD><BODY>\n<H1>302 Moved</H1>\nThe document has moved\n<A HREF=\"...\">here</A>.\r\n</BODY></HTML>\r\n"
This works, but you must admit it is not very elegant: you have to play with the offset value to get exactly the part you want. Not nice! Moreover, you can only split the string in two, so you couldn't easily split it into more than two fields.
Offset and split
lists:split/2 splits the list given as argument into two lists, the first one containing the first N element(s):
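For example, on a plain string (which is just a list of characters):
1> lists:split(3, "erlang").
{"erl","ang"}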
1> {ok, Bin} = file:read_file("/tmp/response").
...
2> Content = binary_to_list(Bin).
...
3> Offset = string:str(Content, "\r\n\r\n").
...
4> {Header, Body} = lists:split(Offset+1, Content).
{"HTTP/1.1 302 Found\r\nCache-Control: private\r\nContent-Type: text/html; charset=UTF-8\r\nLocation: ...\r\nContent-Length: 256\r\nDate: ...\r\nServer: GFE/2.0\r\n",
"\r\n<HTML><HEAD><meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\">\n<TITLE>302 Moved</TITLE></HEAD><BODY>\n<H1>302 Moved</H1>\nThe document has moved\n<A HREF=\"...\">here</A>.\r\n</BODY></HTML>\r\n"}
This is a bit better, but the body still starts with the second \r\n of the separator, which we have to strip off. Moreover, as in the previous example, you cannot easily split a string into more than two fields.
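If you really wanted to, that leftover \r\n could be dropped with a simple pattern match, continuing the shell session above (the variable name RealBody is just illustrative):
5> "\r\n" ++ RealBody = Body.
...
But that is yet another manual step.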
Coding way
This is IMHO the best way for our use case. Unlike the other solutions, this:
- is easily customizable
- handles the case where no match is found
- looks very elegant and uses Erlang's power
- [bonus] allows splitting into more than two fields
Without further ado, here's how I did it:
-define(CRLF, "\r\n").

parse_response(?CRLF ++ ?CRLF ++ Rest, Acc) ->
    [lists:reverse(Acc) ++ ?CRLF, Rest];
parse_response([H|T], Acc) ->
    parse_response(T, [H|Acc]);
parse_response([], _Acc) ->
    ["", ""].
This uses pattern matching (remember, a string is just a list) to nicely separate the header from the body:
1> {ok, Bin} = file:read_file("/tmp/response").
...
2> Content = binary_to_list(Bin).
...
3> tmp:parse_response(Content, []).
["HTTP/1.1 302 Found\r\nCache-Control: private\r\nContent-Type: text/html; charset=UTF-8\r\nLocation: ...\r\nContent-Length: 256\r\nDate: ...\r\nServer: GFE/2.0\r\n",
"<HTML><HEAD><meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\">\n<TITLE>302 Moved</TITLE></HEAD><BODY>\n<H1>302 Moved</H1>\nThe document has moved\n<A HREF=\"...\">here</A>.\r\n</BODY></HTML>\r\n"]
A little explanation of how that works: while no two successive CRLFs are found, we walk through the string char by char (using [H|T]) and prepend the processed char (H) to the accumulator. Once two successive CRLFs are found, we reverse the accumulator (since it was built in reverse order) and construct the resulting list. If no \r\n\r\n is found at all, a list of two empty strings is returned.
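As a small convenience, one could also hide the accumulator behind a one-argument entry point so callers don't have to pass [] themselves. A possible module layout (the module name tmp comes from the shell session above; the /1 wrapper is an addition of mine):

-module(tmp).
-export([parse_response/1]).

-define(CRLF, "\r\n").

%% public entry point: hides the accumulator from callers
parse_response(Content) ->
    parse_response(Content, []).

parse_response(?CRLF ++ ?CRLF ++ Rest, Acc) ->
    [lists:reverse(Acc) ++ ?CRLF, Rest];
parse_response([H|T], Acc) ->
    parse_response(T, [H|Acc]);
parse_response([], _Acc) ->
    ["", ""].

The call then simply becomes tmp:parse_response(Content).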
This function can then be adapted to split into several fields, as for example a CSV line where fields are separated by a comma. We couldn't use string:tokens here since we want to be able to see empty fields:
-define(SEP, ",").

parse_line(?SEP ++ Rest, Acc) ->
    [string:strip(lists:reverse(Acc)) | parse_line(Rest, [])];
parse_line([H|T], Acc) ->
    parse_line(T, [H|Acc]);
parse_line([], Acc) ->
    [string:strip(lists:reverse(Acc))].
resulting in (note the difference with string:tokens):
1> Line = "field1,ofjeofje ,field3, foefjoejfe, field5,,".
"field1,ofjeofje ,field3, foefjoejfe, field5,,"
2> tmp:parse_line(Line, []).
["field1","ofjeofje","field3","foefjoejfe","field5",[],[]]
3> string:tokens(Line, ",").
["field1","ofjeofje ","field3"," foefjoejfe",
" field5"]
Since this is easily customizable, one could think of parsing the header's fields (the field separator is \r\n and the key-value separator is :) and putting them in a map for easy retrieval. I leave that coding exercise to the reader ;-)
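For the impatient, here's one possible starting point for that exercise (a sketch only: header_to_map/1 and split_field/1 are names I made up, and maps:from_list/1 requires an Erlang release with maps support):

%% turn the raw header block into a map of "Key" => "Value"
header_to_map(Header) ->
    Lines = string:tokens(Header, "\r\n"),
    maps:from_list([split_field(L) || L <- Lines]).

split_field(Line) ->
    case string:str(Line, ":") of
        0      -> {status, Line};   % the status line contains no colon
        Offset -> {string:substr(Line, 1, Offset - 1),
                   string:strip(string:substr(Line, Offset + 1))}
    end.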
Happy coding!!