Sunday, February 13, 2011

Health 2 code-a-thon in DC Feb 12, 2011

http://health2challenge.org/code-a-thon/washington-dc/

Someone sent a link about this to the DC Python Meetup Group
(http://meetup.zpugdc.org/) a few weeks ago. It looked like fun and a
way to learn about a new domain, so I signed up. I don't know whether
any other Python folks were there; I didn't bump into any.

I didn't really know what to expect. I knew pretty close to nothing
about the field. I wondered what technology would be used. It wasn't
clear how teams would be assembled.

A major motivation of this event was to leverage a growing collection
of health-related databases:
http://health2challenge.org/code-a-thon/data-resources/

The event was fun, if a bit chaotic. It was hard to find an
appropriate team and contribute. I gather some teams had formed ahead
of time, but as an outsider, I couldn't see any way to get hooked up
beforehand.

I spent some time brainstorming with one loose team that was
interested in raising community-level awareness of the economic
impact of health issues on a community. There were some ideas thrown
around that didn't seem very realistic. The "public" aren't likely to
visit dedicated health policy sites or even play health policy games.

I suggested that a good way to reach people in communities might be
through their community newspapers and web sites. The idea was to
develop database-based content in the form of mini applications,
possibly augmented by prose written by health professionals that could
be leveraged by community newspapers. Making this database-based
meant that the content could be relevant to the local community.

This idea was well received. This was a pleasant surprise, since it's
actually kinda close to my day job.

I worked for a while on a prototype application that would provide a
small bit of content of the form:
The hospital readmission rates in MYCOMMUNITY are X.
This compares to a rate of Y in MYSTATE and Z nationally.
To find out more, see http://services.healthindicators.gov.
where obviously MYCOMMUNITY and MYSTATE are community specific and X,
Y and Z are provided by a health database. We used data from
http://services.healthindicators.gov. The idea is that this blurb
would be published as an app that community newspapers could use to
create content. The specific blurb was just a proof of concept.
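
Here is a minimal sketch of how such a blurb might be filled in. The
function name, argument names, and the placeholder values are all
hypothetical; in a real app the three rates would come from the
health database rather than being passed in by hand.

```python
# Template for the proof-of-concept blurb. The wording follows the
# example above; the field names are illustrative assumptions.
BLURB = (
    "The hospital readmission rates in {community} are {local_rate}.\n"
    "This compares to a rate of {state_rate} in {state} "
    "and {national_rate} nationally.\n"
    "To find out more, see http://services.healthindicators.gov."
)


def render_blurb(community, state, local_rate, state_rate, national_rate):
    """Fill in the blurb with community-specific names and rates."""
    return BLURB.format(
        community=community,
        state=state,
        local_rate=local_rate,
        state_rate=state_rate,
        national_rate=national_rate,
    )


# Placeholder values only; these are not real data.
blurb = render_blurb('MYCOMMUNITY', 'MYSTATE', 'X', 'Y', 'Z')
print(blurb)
```

Keeping the template separate from the lookup means a newspaper could
reword the text without touching the code that fetches the numbers.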

The database provides SOAP and REST interfaces. I ended up using
suds (http://pypi.python.org/pypi/suds) to access the SOAP
interface. This was really easy:
from suds.client import Client
url = 'http://services.healthindicators.gov/v1/SOAP.svc?wsdl'
client = Client(url)
To get a list of all of the methods:
print(client)
To call a method:
client.service.SomeMethod()
(All of the methods in this API have camel-case names with initial
upper case letters.)

Of course, since this is Python, I could do all of this interactively!
(I say this for the benefit of any Health 2.0 folks reading this.)
I was exploring the API in a few minutes. Nice!

For some reason, the API breaks most requests into pages. Each
request comes as three methods:

    foo(some_args, page)
        Get one page of data.
        For example: GetLocales, GetIndicatorsByLocaleID, GetGenders.

    fooCount(some_args)
        Get the result count.
        For example: GetLocalesCount, GetIndicatorsByLocaleIDCount,
        GetGendersCount.
        (In case you're wondering, client.service.GetGendersCount()
        returns 2.)

    fooPageCount(some_args)
        Get the page count.
        For example: GetLocalesPageCount,
        GetIndicatorsByLocaleIDPageCount, GetGendersPageCount.

I ended up creating a helper function:
def paged(client, name, *args):
    r = []
    service = client.service
    for page in range(1, getattr(service, name+'PageCount')(*args)+1):
        r.extend(getattr(service, name)(*(args+(page, )))[0])
    return r
(If you're paying close attention, you might be wondering about the
[0] in the code above. For some reason, each "page" of data was
returned by suds as a sequence object with one item containing a
list of the actual data. I don't know if this is a quirk of the API or
of suds.)

This let me deal with the paged data. For example, to get all
locales:
locales = paged(client, 'GetLocales')
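
To try the paging helper without touching the network, here is a
minimal sketch using a stand-in service. FakeService and FakeClient
are hypothetical, and the locale names are made up; the stand-in only
mimics the three-method pattern and the one-item wrapper described
above, not the real API.

```python
def paged(client, name, *args):
    # Same helper as in the post, repeated so this sketch runs on its own.
    r = []
    service = client.service
    for page in range(1, getattr(service, name + 'PageCount')(*args) + 1):
        r.extend(getattr(service, name)(*(args + (page,)))[0])
    return r


class FakeService(object):
    """Hypothetical stand-in for client.service, serving pages of 2."""
    _data = ['A', 'B', 'C', 'D', 'E']  # made-up locale names
    _page_size = 2

    def GetLocalesPageCount(self):
        # Ceiling division: 5 items in pages of 2 gives 3 pages.
        return -(-len(self._data) // self._page_size)

    def GetLocales(self, page):
        start = (page - 1) * self._page_size
        # Wrap the page in a one-item sequence, like the results I saw.
        return [self._data[start:start + self._page_size]]


class FakeClient(object):
    service = FakeService()


locales = paged(FakeClient(), 'GetLocales')
print(locales)
```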

As is to be expected, the database is challenging. Data are not
uniformly available. Some data are available down to the county
level, but other data aren't. For example, hospital readmission rates
are available at the level of "Health Referral Region", which is
typically (always?) much larger than a county. Different localities
have different amounts of data. Prince William County has on the
order of 300 health indicators available, while DC has around 10,000.

Speaking of "indicators", as with any domain, this one has confusing
jargon. There were "indicator descriptions", like "Acute Hospital
Readmission Rate" and "indicators", like "the value in Arlington is
17%". As it was explained to me, the indicator descriptions are the
questions and the indicators are the answers. The answers are
qualified and adjusted in various ways, probably based on whatever
studies they came out of. I suspect that there will be lots of naive
and misleading uses of this data. I hope these automated
applications get some careful review by domain experts.

Using the database effectively requires either familiarity
with the data, or the ability to quickly browse. The SOAP interface
to the database is pretty slow and doesn't provide very targeted
queries. For example, there's no way to request one type of indicator
for a locale. You can pick an indicator, and get data for all locales,
or pick a locale and get all indicators for it. Getting all of the
indicators for DC took several minutes. They're working on their
search capabilities, so I'm sure this will improve over time.
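
Until targeted queries exist, the workaround is to fetch everything
for a locale and filter on the client side. Here is a rough sketch of
that filtering step. The record shape and the attribute name
IndicatorDescriptionID are assumptions for illustration; the objects
suds actually returns may be laid out differently.

```python
from collections import namedtuple

# Hypothetical record shape standing in for the objects returned by the
# service. The real attribute names may differ.
Indicator = namedtuple('Indicator', 'IndicatorDescriptionID LocaleID Value')


def indicators_of_type(indicators, description_id,
                       attr='IndicatorDescriptionID'):
    """Keep only the indicators that answer one indicator description."""
    return [i for i in indicators if getattr(i, attr, None) == description_id]


# Made-up records standing in for "all indicators for one locale".
everything = [
    Indicator(1, 42, '17%'),
    Indicator(2, 42, '3.1'),
    Indicator(1, 42, '16%'),
]

readmissions = indicators_of_type(everything, 1)
print(len(readmissions))
```

This does nothing about the slow fetch itself; it only narrows the
results once they have arrived.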

These sorts of databases will be used for a variety of
applications and run-time use of the databases will likely prove to be
impractical. Taking snapshots is unattractive, as data will
be out of date. Probably, a download model with update
subscriptions would be a better way to go. In other words,
applications might be well served by downloading a database and either
polling for updates or getting updates sent to them.
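
As a rough sketch of the polling variant, an application could keep a
local copy and only re-download once that copy gets old. Everything
here is an assumption for illustration: the table layout, the one-day
max_age, and the fetch callable, which stands in for whatever actually
downloads the data.

```python
import sqlite3
import time


def open_cache(path=':memory:'):
    """Open (or create) a local copy of the data. Hypothetical schema."""
    db = sqlite3.connect(path)
    db.execute('CREATE TABLE IF NOT EXISTS locales (name TEXT PRIMARY KEY)')
    db.execute(
        'CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value REAL)')
    return db


def refresh(db, fetch, max_age=24 * 60 * 60, now=None):
    """Re-download via fetch() if the local copy is older than max_age
    seconds. Returns True if a download happened."""
    now = time.time() if now is None else now
    row = db.execute("SELECT value FROM meta WHERE key = 'fetched'").fetchone()
    if row is not None and now - row[0] < max_age:
        return False  # local copy is still fresh enough
    names = fetch()
    with db:  # commit the replacement as one transaction
        db.execute('DELETE FROM locales')
        db.executemany('INSERT INTO locales VALUES (?)',
                       [(n,) for n in names])
        db.execute("INSERT OR REPLACE INTO meta VALUES ('fetched', ?)",
                   (now,))
    return True


calls = []


def fake_fetch():
    # Stand-in for the slow remote download; returns made-up names.
    calls.append(1)
    return ['Arlington', 'DC']


db = open_cache()
first = refresh(db, fake_fetch, now=0)                   # nothing cached yet
second = refresh(db, fake_fetch, now=60)                 # still fresh
third = refresh(db, fake_fetch, now=2 * 24 * 60 * 60)    # stale, re-download
```

Having updates pushed to the application would avoid the polling, but
that needs support from the database side.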

We decided to bail on our prototype because we didn't feel the data
was local enough. This was a mistake! We should have finished the
prototype. The actual data didn't matter. The presentation of the
prototype would have been a good time to discuss the issues. Dang.

I wandered over to another team that was working with the same
database. They were working on a system for looking at local policy
decisions based on county government databases and connecting these to
outcomes via the health indicators database. I think this is a cool
idea and they were led by a domain expert who had a pretty definite
idea of what he was trying to accomplish. I'm pretty sure that this
will lead to success.

I was hoping to provide some help because I had gained some
familiarity with the database. Unfortunately, they were bogged down
accessing the database using some Java-based SOAP
interface. Gaaaa. Their Java programmer was obviously good, but he was
still using Java. Most of the developers were just sitting around
waiting for the Java programmer. I tried to explain some of the issues
with the data, but the Java programmer was just too busy hacking
Java. I ended up learning the Google chart API so I could help them
eventually display the data.

I eventually got bored and left early. I wish, in hindsight, I'd
finished the prototype I was working on. Hopefully, this blog post will
be useful and make up for this a little bit. :)

I wouldn't mind doing this again, especially if I could hook up with a
team ahead of time. I'd even be willing to finish that prototype if
there was interest. I can't spend too much time on this though, as I
have too many other interesting projects.