The goal is to parse an HTTP response into two elements: the header and the body. As everyone knows, the header is separated from the body by two CRLFs (\r\n\r\n). Some freestylers (obviously not following the RFC) use two bare \n instead; I won't deal with that here.
First, get an HTTP response:
$ echo -e 'GET / HTTP/1.1\r\n\r\n' | nc google.com 80 | tee /tmp/response
HTTP/1.1 302 Found
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Location: ...
Content-Length: 256
Date: ...
Server: GFE/2.0
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="...">here</A>.
</BODY></HTML>
The response content is copied to /tmp/response. With Erlang, this is easily read with file:read_file/1 (which loads the content into a binary) and then turned into a string using erlang:binary_to_list/1.
Erlang provides two interesting modules that help when dealing with strings: string and lists (since a string is a list in Erlang). Looking for something that would let me split a string into several tokens based on a separator led me to these solutions:
- tokenize using string:tokens/2
- offset and sub-string using string:str/2 and string:substr
- offset and split using string:str/2 and lists:split/2
- coding way
Tokenize
string:tokens/2 lets you split a string into a list of tokens.
For example:
1> string:tokens("this is a test", " ").
["this","is","a","test"]
A disadvantage (or advantage, depending on what you need) of this method is that the separator argument is really a set of separator characters, and adjacent separators are treated as one, as shown here:
1> string:tokens("I want to separate successive c in here", "cc").
["I want to separate su","essive "," in here"]
instead of the expected ["I want to separate su","essive c in here"]
:-(
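Note that the same behaviour also makes empty fields between two separators disappear completely (we'll come back to this for the CSV case):
1> string:tokens("a,,b", ",").
["a","b"]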
Similarly, when parsing the HTTP response, we want to separate the header from the body using the separator \r\n\r\n. As shown above, this won't work with string:tokens, since \r\n\r\n will be treated the same as a single \r\n and thus result in the following:
1> {ok, Bin} = file:read_file("/tmp/response").
...
2> Content = binary_to_list(Bin).
...
3> string:tokens(Content, "\r\n\r\n").
["HTTP/1.1 302 Found","Cache-Control: private",
"Content-Type: text/html; charset=UTF-8",
"Location: ...",
"Content-Length: 256","Date: ...",
"Server: GFE/2.0",
"<HTML><HEAD><meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\">",
"<TITLE>302 Moved</TITLE></HEAD><BODY>",
"<H1>302 Moved</H1>","The document has moved",
"<A HREF=\"...\">here</A>.",
"</BODY></HTML>"]
Let's try something else ...
Offset and sub-string
string:str/2 returns the offset (1-based) of a specific sequence in a string, while string:substr/2,3 extracts a sub-string given its starting offset and, optionally, its length.
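A quick illustration of both functions (offsets in Erlang strings start at 1):
1> string:str("hello world", "wor").
7
2> string:substr("hello world", 7, 3).
"wor"
3> string:substr("hello world", 7).
"world"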
One way of doing it would then be:
1> {ok, Bin} = file:read_file("/tmp/response").
...
2> Content = binary_to_list(Bin).
...
3> Offset = string:str(Content, "\r\n\r\n").
...
4> % first retrieve the header
4> string:substr(Content, 1, Offset+1).
"HTTP/1.1 302 Found\r\nCache-Control: private\r\nContent-Type: text/html; charset=UTF-8\r\nLocation: ...\r\nContent-Length: 256\r\nDate: ...\r\nServer: GFE/2.0\r\n"
5> % then the body
5> string:substr(Content, Offset+4).
"<HTML><HEAD><meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\">\n<TITLE>302 Moved</TITLE></HEAD><BODY>\n<H1>302 Moved</H1>\nThe document has moved\n<A HREF=\"...\">here</A>.\r\n</BODY></HTML>\r\n"
This works, but you must admit it is not very elegant: you have to play with the offset value to get exactly the part you want. Not nice! Moreover, you can only split the string in two, so you couldn't easily split it into more than two fields.
Offset and split
lists:split/2 splits the list given as argument into two lists, the first one containing the first N element(s):
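For example, on a plain string (which is just a list of characters):
1> lists:split(3, "erlang").
{"erl","ang"}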
1> {ok, Bin} = file:read_file("/tmp/response").
...
2> Content = binary_to_list(Bin).
...
3> Offset = string:str(Content, "\r\n\r\n").
...
4> {Header, Body} = lists:split(Offset+1, Content).
{"HTTP/1.1 302 Found\r\nCache-Control: private\r\nContent-Type: text/html; charset=UTF-8\r\nLocation: ...\r\nContent-Length: 256\r\nDate: ...\r\nServer: GFE/2.0\r\n",
"\r\n<HTML><HEAD><meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\">\n<TITLE>302 Moved</TITLE></HEAD><BODY>\n<H1>302 Moved</H1>\nThe document has moved\n<A HREF=\"...\">here</A>.\r\n</BODY></HTML>\r\n"}
This is a bit better, but the body still starts with the second \r\n of the separator, which we have to strip off. Moreover, as in the previous example, you cannot easily split a string into more than two fields.
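If you really wanted to, that leftover \r\n could be dropped with a simple pattern match, continuing the shell session above (the variable name RealBody is just illustrative):
5> "\r\n" ++ RealBody = Body.
...
But that is yet another manual step.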
Coding way
This is IMHO the best way for our use case. Unlike the other solutions, this:
- is easily customizable
- handles the case where no match is found
- looks very elegant and uses Erlang's power
- [bonus] allows splitting into more than two fields
Without further ado, here's how I did it:
-define(CRLF, "\r\n").

parse_response(?CRLF ++ ?CRLF ++ Rest, Acc) ->
    [lists:reverse(Acc) ++ ?CRLF, Rest];
parse_response([H|T], Acc) ->
    parse_response(T, [H|Acc]);
parse_response([], _Acc) ->
    ["", ""].
This uses pattern matching (remember, a string is just a list) to nicely separate the header from the body:
1> {ok, Bin} = file:read_file("/tmp/response").
...
2> Content = binary_to_list(Bin).
...
3> tmp:parse_response(Content, []).
["HTTP/1.1 302 Found\r\nCache-Control: private\r\nContent-Type: text/html; charset=UTF-8\r\nLocation: ...\r\nContent-Length: 256\r\nDate: ...\r\nServer: GFE/2.0\r\n",
"<HTML><HEAD><meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\">\n<TITLE>302 Moved</TITLE></HEAD><BODY>\n<H1>302 Moved</H1>\nThe document has moved\n<A HREF=\"...\">here</A>.\r\n</BODY></HTML>\r\n"]
A little explanation of how that works: while no two successive CRLFs are found, we walk through the string char by char (using [H|T]) and prepend the processed char (H) to the accumulator. Once two successive CRLFs are found, we reverse the accumulator (since it was built in reverse order) and construct the resulting list. If no \r\n\r\n is found at all, a list of two empty strings is returned.
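As a small convenience, one could also hide the accumulator behind a one-argument entry point so callers don't have to pass [] themselves. A possible module layout (the module name tmp comes from the shell session above; the /1 wrapper is an addition of mine):

-module(tmp).
-export([parse_response/1]).

-define(CRLF, "\r\n").

%% public entry point: hides the accumulator from callers
parse_response(Content) ->
    parse_response(Content, []).

parse_response(?CRLF ++ ?CRLF ++ Rest, Acc) ->
    [lists:reverse(Acc) ++ ?CRLF, Rest];
parse_response([H|T], Acc) ->
    parse_response(T, [H|Acc]);
parse_response([], _Acc) ->
    ["", ""].

The call then simply becomes tmp:parse_response(Content).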
This function can then be adapted to split into several fields, as for example a CSV line where fields are separated by a comma. We couldn't use string:tokens here since we want to be able to see empty fields:
-define(SEP, ",").

parse_line(?SEP ++ Rest, Acc) ->
    [string:strip(lists:reverse(Acc)) | parse_line(Rest, [])];
parse_line([H|T], Acc) ->
    parse_line(T, [H|Acc]);
parse_line([], Acc) ->
    [string:strip(lists:reverse(Acc))].
resulting in (note the difference with string:tokens):
1> Line = "field1,ofjeofje ,field3, foefjoejfe, field5,,".
"field1,ofjeofje ,field3, foefjoejfe, field5,,"
2> tmp:parse_line(Line, []).
["field1","ofjeofje","field3","foefjoejfe","field5",[],[]]
3> string:tokens(Line, ",").
["field1","ofjeofje ","field3"," foefjoejfe",
" field5"]
Since this is easily customizable, one could think of parsing the header's fields (the field separator is \r\n and the key-value separator is :) and putting them in a map for easy retrieval. I leave that coding exercise to the reader ;-)
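For the impatient, here's one possible starting point for that exercise (a sketch only: header_to_map/1 and split_field/1 are names I made up, and maps:from_list/1 requires an Erlang release with maps support):

%% turn the raw header block into a map of "Key" => "Value"
header_to_map(Header) ->
    Lines = string:tokens(Header, "\r\n"),
    maps:from_list([split_field(L) || L <- Lines]).

split_field(Line) ->
    case string:str(Line, ":") of
        0      -> {status, Line};   % the status line contains no colon
        Offset -> {string:substr(Line, 1, Offset - 1),
                   string:strip(string:substr(Line, Offset + 1))}
    end.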
Happy coding!!