Issues with https proxy in Python via suds and urllib2
I recently had the need to access a SOAP API to obtain some data. SOAP works by POSTing an XML file to a service URL in a format defined by the API's schema. The API then returns data, also in the form of an XML file. Based on this post, I figured suds was the easiest way to use Python to access the API so I could query data repeatedly in a scripted fashion (and hence parallelize the queries). suds did turn out to be relatively easy to use:
```{python}
from suds.client import Client
url = 'http://www.ripedev.com/webservices/localtime.asmx?WSDL'
client = Client(url)
print client
client.service.LocalTimeByZipCode('90210')
```
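For context, what suds does under the hood is build a SOAP envelope and POST it to the service URL. Here is a rough, hand-written sketch of what the envelope for the call above might look like; the element names and namespace are my assumptions, not taken from the actual WSDL:

```{python}
import xml.etree.ElementTree as ET

# Hand-written sketch of the kind of envelope suds generates for the call
# above. The LocalTimeByZipCode/ZipCode element names and the namespace URI
# are assumptions for illustration, not verified against the real WSDL.
envelope = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <LocalTimeByZipCode xmlns="http://ripedev.com/webservices/">
      <ZipCode>90210</ZipCode>
    </LocalTimeByZipCode>
  </soap:Body>
</soap:Envelope>"""

# Confirm the sketch is well-formed XML and pull the argument back out.
root = ET.fromstring(envelope)
zip_code = root.find('.//{http://ripedev.com/webservices/}ZipCode').text
print(zip_code)  # '90210'
```

suds spares you from writing this by hand: it reads the WSDL and exposes each operation as a Python method.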
This worked on my home network. At work, I had to use a proxy to access the outside world; otherwise, I'd get a connection refused error: urllib2.URLError: <urlopen error [Errno 111] Connection refused>. The modification to use a proxy was straightforward:
```{python}
from suds.client import Client
proxy = {'http': 'proxy_username:proxy_password@proxy_server.com:port'}
url = 'http://www.ripedev.com/webservices/localtime.asmx?WSDL'
# client = Client(url)
client = Client(url, proxy=proxy)
print client
client.service.LocalTimeByZipCode('90210')
```
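As far as I can tell, suds's default HttpTransport hands this proxy dict to urllib2's ProxyHandler. A minimal sketch of that mechanism in isolation, written against urllib.request (the Python 3 name for urllib2) so it runs on a current interpreter; no network connection is made here:

```{python}
# Sketch of the mechanism suds relies on, shown with urllib.request
# (the Python 3 successor of urllib2). No request is actually sent.
import urllib.request

# Same scheme-to-proxy-URL mapping suds accepts via its proxy keyword.
proxies = {'http': 'proxy_username:proxy_password@proxy_server.com:port'}
proxy_handler = urllib.request.ProxyHandler(proxies)

# The handler simply records the per-scheme proxy map; an opener built
# with it rewrites outgoing requests to go through the proxy.
print(proxy_handler.proxies)

opener = urllib.request.build_opener(proxy_handler)
```

The per-scheme dict is why the http example works: the handler only knows how to route schemes that appear as keys in the map.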
The previous examples were from a public SOAP API I found online. Now, the site I actually wanted to hit uses SSL for encryption (i.e., an https site) and requires authentication. I thought the fix would be as simple as:
```{python}
from suds.client import Client
proxy = {'https': 'proxy_username:proxy_password@proxy_server.com:port'}
url = 'https://some_server.com/path/to/soap_api?wsdl'
un = 'site_username'
pw = 'site_password'
# client = Client(url)
client = Client(url, proxy=proxy, username=un, password=pw)
print client
client.service.someFunction(args)
```
However, I got the error message Exception: (404, u'/path/to/soap_api'). Very weird to me. Is it an authentication issue? Is it a proxy issue? If a proxy issue, how so, given that my previous toy example worked? I tried the same site on my home network, where there is no firewall, and things worked:
```{python}
from suds.client import Client
url = 'https://some_server.com/path/to/soap_api?wsdl'
un = 'site_username'
pw = 'site_password'
# client = Client(url)
client = Client(url, username=un, password=pw)
print client
client.service.someFunction(args)
```
Conclusion? Must be a proxy issue with https. I used the following prior to calling suds to help with debugging:
```{python}
import logging
logging.basicConfig(level=logging.INFO)
logging.getLogger('suds.client').setLevel(logging.DEBUG)
logging.getLogger('suds.transport').setLevel(logging.DEBUG)
logging.getLogger('suds.xsd.schema').setLevel(logging.DEBUG)
logging.getLogger('suds.wsdl').setLevel(logging.DEBUG)
```
My initial thoughts after some debugging: there must be something wrong with the proxy, as the log shows Python sending the request to the target url, but the response that comes back says the path (minus the domain name) was not found. What happened to the domain name? I notified the firewall team to look into this, as it appeared the proxy was modifying something (the url is not complete?). The firewall team investigated and found that the proxy was returning a message warning that the ClientHello message is too large. That was one clue. The log also showed that the user was never authenticated and that the ssl handshake was never completed. My thought: still a proxy issue, since the Python code works at home. However, the proxy team was able to access the https SOAP API through the proxy using the SOA Client plugin for Firefox. That convinced me that something else might be the culprit.
Googled for help, and thought this ActiveState recipe for tunneling through a proxy with CONNECT would be helpful:
```{python}
import urllib2
import urllib
import httplib
import socket

class ProxyHTTPConnection(httplib.HTTPConnection):
    _ports = {'http': 80, 'https': 443}

    def request(self, method, url, body=None, headers={}):
        # request is called before connect, so we can interpret the url and
        # get the real host/port to be used in the CONNECT request to the proxy
        proto, rest = urllib.splittype(url)
        if proto is None:
            raise ValueError, "unknown URL type: %s" % url
        # get host
        host, rest = urllib.splithost(rest)
        # try to get port
        host, port = urllib.splitport(host)
        # if port is not defined, try to get it from proto
        if port is None:
            try:
                port = self._ports[proto]
            except KeyError:
                raise ValueError, "unknown protocol for: %s" % url
        self._real_host = host
        self._real_port = port
        httplib.HTTPConnection.request(self, method, url, body, headers)

    def connect(self):
        httplib.HTTPConnection.connect(self)
        # send proxy CONNECT request
        self.send("CONNECT %s:%d HTTP/1.0\r\n\r\n" % (self._real_host, self._real_port))
        # expect a HTTP/1.0 200 Connection established
        response = self.response_class(self.sock, strict=self.strict, method=self._method)
        (version, code, message) = response._read_status()
        # probably here we can handle auth requests...
        if code != 200:
            # proxy returned an error: abort connection and raise exception
            self.close()
            raise socket.error, "Proxy connection failed: %d %s" % (code, message.strip())
        # eat up the header block from the proxy
        while True:
            # should probably not use fp directly
            line = response.fp.readline()
            if line == '\r\n':
                break

class ProxyHTTPSConnection(ProxyHTTPConnection):
    default_port = 443

    def __init__(self, host, port=None, key_file=None, cert_file=None, strict=None, timeout=0):  # vinh added timeout
        ProxyHTTPConnection.__init__(self, host, port)
        self.key_file = key_file
        self.cert_file = cert_file

    def connect(self):
        ProxyHTTPConnection.connect(self)
        # make the sock ssl-aware
        ssl = socket.ssl(self.sock, self.key_file, self.cert_file)
        self.sock = httplib.FakeSocket(self.sock, ssl)

class ConnectHTTPHandler(urllib2.HTTPHandler):
    def do_open(self, http_class, req):
        return urllib2.HTTPHandler.do_open(self, ProxyHTTPConnection, req)

class ConnectHTTPSHandler(urllib2.HTTPSHandler):
    def do_open(self, http_class, req):
        return urllib2.HTTPSHandler.do_open(self, ProxyHTTPSConnection, req)

from suds.client import Client
# from httpsproxy import ConnectHTTPSHandler, ConnectHTTPHandler  ## i.e., the classes defined above
import urllib2, urllib
from suds.transport.http import HttpTransport

opener = urllib2.build_opener(ConnectHTTPHandler, ConnectHTTPSHandler)
urllib2.install_opener(opener)
t = HttpTransport()
t.urlopener = opener
url = 'https://some_server.com/path/to/soap_api?wsdl'
proxy = {'https': 'proxy_username:proxy_password@proxy_server.com:port'}
un = 'site_username'
pw = 'site_password'
client = Client(url=url, transport=t, proxy=proxy, username=un, password=pw)
client = Client(url=url, transport=t, proxy=proxy, username=un, password=pw, location='https://some_server.com/path/to/soap_api?wsdl')  ## some sites suggest specifying location
```
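A side note on the recipe itself: socket.ssl and httplib.FakeSocket are both gone in Python 3. On a modern interpreter, the equivalent of that wrap-the-tunneled-socket step goes through the ssl module; the sketch below shows the modern call, not a drop-in replacement for the recipe:

```{python}
# Modern equivalent of the socket.ssl(...) / FakeSocket step above: after
# the CONNECT tunnel to the proxy is established, wrap the raw socket in TLS.
import ssl

context = ssl.create_default_context()

# With an already-connected socket `raw_sock` through the proxy tunnel,
# the wrap would be (not executed here, since there is no live tunnel):
#   tls_sock = context.wrap_socket(raw_sock, server_hostname='some_server.com')
```

Passing server_hostname also gives you SNI and hostname verification, which the old socket.ssl call never did.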
This too did not work. I continued to google and found that lots of people have issues with https and proxies. I knew suds depended on urllib2, so I googled about that as well; people had issues with urllib2 in terms of https and proxies too. I then decided to investigate using urllib2 directly to contact the https url through a proxy:
```{python}
## http://stackoverflow.com/questions/5227333/xml-soap-post-error-what-am-i-doing-wrong
## http://stackoverflow.com/questions/34079/how-to-specify-an-authenticated-proxy-for-a-python-http-connect

### at home this works
import urllib2
url = 'https://some_server.com/path/to/soap_api?wsdl'
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None,
                          uri=url,
                          user='site_username',
                          passwd='site_password')
auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)
page = urllib2.urlopen(url)
page.read()

### work network, does not work:
url = 'https://some_server.com/path/to/soap_api?wsdl'
proxy = urllib2.ProxyHandler({'https': 'proxy_username:proxy_password@proxy_server.com:port',
                              'http': 'proxy_username:proxy_password@proxy_server.com:port'})
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None,
                          uri=url,
                          user='site_username',
                          passwd='site_password')
auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(proxy, auth_handler, urllib2.HTTPSHandler)
urllib2.install_opener(opener)
page = urllib2.urlopen(url)

### also tried re-doing the above, but with the custom handler classes defined in the previous code chunk (http://code.activestate.com/recipes/456195/) run first
```
No luck. I re-read this post that I had run into before, and really agreed that urllib2 is severely flawed, especially when using an https proxy. At the end of the page, the author suggested using the requests package. I tried it out, and was able to connect through the https proxy:
```{python}
import requests
import xmltodict

p1 = 'http://proxy_username:proxy_password@proxy_server.com:port'
p2 = 'https://proxy_username:proxy_password@proxy_server.com:port'
proxy = {'http': p1, 'https': p2}

site = 'https://some_server.com/path/to/soap_api?wsdl'
r = requests.get(site, proxies=proxy, auth=('site_username', 'site_password'))
r.text  ## works

soap_xml_in = """<?xml version="1.0" encoding="UTF-8"?>
...
"""
headers = {'SOAPAction': u'""', 'Content-Type': 'text/xml; charset=utf-8'}
soap_xml_out = requests.post(site, data=soap_xml_in, headers=headers, proxies=proxy, auth=('site_username', 'site_password')).text
```
My learnings?

- suds is great for accessing SOAP, just not when you have to access an https site through a firewall.
- urllib2 is severely flawed. Things only work in very standard situations.
- The requests package is very powerful and just works. Even though I have to deal with raw xml files instead of leveraging suds' pythonic structures, the xmltodict package helps translate the xml into dictionaries, so extracting the relevant data takes only marginal extra effort.
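To illustrate that last point, here is roughly how xmltodict turns a SOAP-style response into nested dictionaries; the response body below is made up for the example:

```{python}
import xmltodict

# A made-up SOAP-style response body, just to show the shape of the result;
# the element names are hypothetical, not from any real API.
soap_xml_out = """<?xml version="1.0"?>
<Envelope>
  <Body>
    <someFunctionResponse>
      <result>42</result>
    </someFunctionResponse>
  </Body>
</Envelope>"""

doc = xmltodict.parse(soap_xml_out)
# Nested elements become nested dicts; element text becomes strings.
result = doc['Envelope']['Body']['someFunctionResponse']['result']
print(result)  # '42'
```

From there it is ordinary dictionary traversal rather than XML parsing.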
NOTE: I had to install libuuid-devel
in cygwin64 because I was getting an installation error.