Parse url which contains unicode query, using urlparse.parse_qs

Task: get dictionary of URL GET query. For example, we have following url:

http://example.com/?key=value&a=b

it is needed to get a dict:

{'key': ['value'], 'a': ['b']}

Values are lists, because one key may have several values:

In: http://example.com/?key=value&a=b&a=c
Out: {'key': ['value'], 'a': ['b', 'c']}

In python there is a function urlparse.parse_qs for that purpose:

>>> import urlparse
>>> query = "key=value&a=b"
>>> urlparse.parse_qs(query)
{'a': ['b'], 'key': ['value']}

So, on input parse_qs receives query, without “http://exapmle.com/?”. To get only query we can use urlparse.urlparse:

>>> import urlparse
>>> url = "http://example.com/?key=value&a=b"
>>> query = urlparse.urlparse(url).query
>>> query
'key=value&a=b'
>>> params = urlparse.parse_qs(query)
>>> params
{'a': ['b'], 'key': ['value']}

Lets restore original url, using urllib.urlencode:

>>> import urllib
>>> urllib.urlencode(params, doseq=True)
'a=b&key=value'

The order of parameters doesn’t matter, so it’s ok.

URL with unicode parameter

According to RFC3986, URL can contain only limited set of characters consisting of digits, letters, and a few graphic symbols from US-ASCII set. And some of characters are reserved (":", "/", "?", "#", "[", "]", "@", "!", "$", "&", "'", "(", ")", "*", "+", ",", ";", "="). If it is needed to send nonprintable or reserved characters in URL (for example as query param value), they must be Percent-Encoded: %HH, where HH is hexadecimal digits.

Suppose we need to send u”значение”. In python string u"значение" contains unicode code points and we need to get bytes, to be able to percent-encode them. So first lets encode the unicode string using, for example, utf8 encoding:

>>> value = u'значение'
>>> value_utf8 = value.encode('utf8')
>>> value_utf8
'\xd0\xb7\xd0\xbd\xd0\xb0\xd1\x87\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5'

Now encode those bytes, using Percent-Encoding (%HH) to be able to include in url:

>>> value_url = urllib.quote(value_utf8)
>>> value_url
'%D0%B7%D0%BD%D0%B0%D1%87%D0%B5%D0%BD%D0%B8%D0%B5'

Full URL:

>>> url = "http://example.com/?key=%s&a=b" % value_url
>>> url
'http://example.com/?key=%D0%B7%D0%BD%D0%B0%D1%87%D0%B5%D0%BD%D0%B8%D0%B5&a=b'

Again, lets get the query dict:

>>> query = urlparse.urlparse(url).query
>>> query
'key=%D0%B7%D0%BD%D0%B0%D1%87%D0%B5%D0%BD%D0%B8%D0%B5&a=b'
>>> params = urlparse.parse_qs(query)
>>> params
{'a': ['b'], 'key': ['\xd0\xb7\xd0\xbd\xd0\xb0\xd1\x87\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5']}

As we can see, parse_qs decoded value from Percent-Encoding and returned bytes. Now we can get unicode, as we remember, that encoding was utf8:

>>> params['key'][0].decode('utf8')
u'\u0437\u043d\u0430\u0447\u0435\u043d\u0438\u0435'
>>> print params['key'][0].decode('utf8')
значение

Ok. Restore original query from the dict:

>>> urllib.urlencode(params, doseq=True)
'a=b&key=%D0%B7%D0%BD%D0%B0%D1%87%D0%B5%D0%BD%D0%B8%D0%B5'

We’ve got same parameters as it was in original url.

Same steps with URL, that was returned from django’s request.get_full_path(). For some reason, request.get_full_path() returns not the str string, but unicode (tried on django 1.4, 1.5):

>>> request.get_full_path()
u'/?key=%D0%B7%D0%BD%D0%B0%D1%87%D0%B5%D0%BD%D0%B8%D0%B5&a=b'

Repeat same steps with this URL:

>>> url = request.get_full_path()
>>> query = urlparse.urlparse(url).query
>>> query
u'key=%D0%B7%D0%BD%D0%B0%D1%87%D0%B5%D0%BD%D0%B8%D0%B5&a=b'
>>> params = urlparse.parse_qs(query)
>>> params
{u'a': [u'b'], u'key': [u'\xd0\xb7\xd0\xbd\xd0\xb0\xd1\x87\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5']}

Interesting, that value for u’key’ is unicode string, that contains bytes! Of course, decoding of that string will fail:

>>> params['key'][0].decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-15: ordinal not in range(128)

Same error using urlencode:

>>> urllib.urlencode(params, doseq=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\urllib.py", line 1337, in urlencode
    l.append(k + '=' + quote_plus(str(elt)))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-15: ordinal not in range(128)

For me there are two surprises:

django returned url as unicode (what for? why not just str as url can contain only ascii characters)
parse_qs returned unicode string, that contains bytes.

Solution is simple, just always give str to parse_qs:

>>> url = request.get_full_path()
>>> url = url.encode('ascii')
>>> url
'/?key=%D0%B7%D0%BD%D0%B0%D1%87%D0%B5%D0%BD%D0%B8%D0%B5&a=b'

Or, which is the same:

>>> url = request.get_full_path()
>>> url = str(url)
>>> url
'/?key=%D0%B7%D0%BD%D0%B0%D1%87%D0%B5%D0%BD%D0%B8%D0%B5&a=b'

Links:

Question about this problem on stackoverflow
Great presentation about python strings and encoding: http://nedbatchelder.com/text/unipain.html