In this example, we will use a system-wide ODBC connector which will be used to retrieve data from the DHIS2 database. There are some disadvantages with this approach, as ODBC is slower than other methods and it does raise some security concerns by providing a system-wide connector to all users. However, it is a convenient method to provide a connection to multiple users. The use of the R package RODBC will be used in this case. Other alternatives would be the use of the RPostgreSQL package, which can interface directly through the Postgresql driver described in Section C.4, “Mapping with R and Postgresql”
Assuming you have already installed R from the procedure in the previous section. Invoke the following command to add the required libraries for this example.
apt-get install r-cran-rodbc r-cran-lattice odbc-postgresql
Next, we need to configure the ODBC connection. Edit the file to suit your local situation using the following template as a guide. Lets create and edit a file called odbc.ini
[dhis2] Description = DHIS2 Database Driver = /usr/lib/odbc/psqlodbcw.so Trace = No TraceFile = /tmp/sql.log Database = dhis2 Servername = 127.0.0.1 UserName = postgres Password = SomethingSecure Port = 5432 Protocol = 9.0 ReadOnly = Yes RowVersioning = No ShowSystemTables = No ShowOidColumn = No FakeOidIndex = No ConnSettings = Debug = 0
Finally, we need to install the ODBC connection with odbcinst -i -d -f odbc.ini
From the R prompt, execute the following commands to connect to the DHIS2 database.
> library(RODBC)
> channel<-odbcConnect("dhis2")#Note that the name must match the ODBC connector name
> sqlTest<-c("SELECT dataelementid, name FROM dataelement LIMIT 10;")
> sqlQuery(channel,sqlTest)
name
1 OPD First Attendances Under 5
2 OPD First Attendances Over 5
3 Deaths Anaemia Under 5 Years
4 Deaths Clinical Case of Malaria Under 5 Years
5 Inpatient discharges under 5
6 Inpatient Under 5 Admissions
7 Number ITNs
8 OPD 1st Attendance Clinical Case of Malaria Under 5
9 IP Discharge Clinical Case of Malaria Under 5 Years
10 Deaths of malaria case provided with anti-malarial treatment 1 to 5 Years
>
It seems R is able to retrieve data from the DHIS2 database.
As an illustrative example, lets say we have been asked to calculate the relative percentage of OPD male and female under 5 attendances for the last twelve months.First, lets create an SQL query which will provide us the basic information which will be required.
OPD<-sqlQuery(channel,"SELECT p.startdate, de.name as de, sum(dv.value::double precision)
FROM datavalue dv
INNER JOIN period p on dv.periodid = p.periodid
INNER JOIN dataelement de on dv.dataelementid = de.dataelementid
WHERE p.startdate >= '2011-01-01'
and p.enddate <= '2011-12-31'
and de.name ~*('Attendance OPD')
GROUP BY p.startdate, de.name;")
We have stored the result of the SQL query in an R data frame called "OPD". Lets take a look at what the data looks like.
> head(OPD)
startdate de sum
1 2011-12-01 Attendance OPD <12 months female 42557
2 2011-02-01 Attendance OPD <12 months female 127485
3 2011-01-01 Attendance OPD 12-59 months male 200734
4 2011-04-01 Attendance OPD 12-59 months male 222649
5 2011-06-01 Attendance OPD 12-59 months male 168896
6 2011-03-01 Attendance OPD 12-59 months female 268141
> unique(OPD$de)
[1] Attendance OPD <12 months female Attendance OPD 12-59 months male
[3] Attendance OPD 12-59 months female Attendance OPD >5 years male
[5] Attendance OPD <12 months male Attendance OPD >5 years female
6 Levels: Attendance OPD 12-59 months female ... Attendance OPD >5 years male
>
We can see that we need to aggregate the two age groups (< 12 months and 12-59 months) into a single variable, based on the gender. Lets reshape the data into a crosstabulated table to make this easier to visualize and calculate the summaries.
>OPD.ct<-cast(OPD,startdate ~ de) >colnames(OPD.ct) [1] "startdate" "Attendance OPD 12-59 months female" [3] "Attendance OPD 12-59 months male" "Attendance OPD <12 months female" [5] "Attendance OPD <12 months male" "Attendance OPD >5 years female" [7] "Attendance OPD >5 years male"
We have reshaped the data so that the data elements are individual columns. It looks like we need to aggregate the second and fourth columns together to get the under 5 female attendance, and then the third and fifth columns to get the male under 5 attendance.After this, lets subset the data into a new data frame just to get the required information and display the results.
> OPD.ct$OPDUnder5Female<-OPD.ct[,2]+OPD.ct[,4]#Females > OPD.ct$OPDUnder5Male<-OPD.ct[,3]+OPD.ct[,5]#males > OPD.ct.summary<-OPD.ct[,c(1,8,9)]#new summary data frame >OPD.ct.summary$FemalePercent<- OPD.ct.summary$OPDUnder5Female/ (OPD.ct.summary$OPDUnder5Female + OPD.ct.summary$OPDUnder5Male)*100#Females >OPD.ct.summary$MalePercent<- OPD.ct.summary$OPDUnder5Male/ (OPD.ct.summary$OPDUnder5Female + OPD.ct.summary$OPDUnder5Male)*100#Males
Of course, this could be accomplished much more elegantly, but for the purpose of the illustration, this code is rather verbose.Finally, lets display the required information.
> OPD.ct.summary[,c(1,4,5)]
startdate FemalePercent MalePercent
1 2011-01-01 51.13360 48.86640
2 2011-02-01 51.49154 48.50846
3 2011-03-01 51.55651 48.44349
4 2011-04-01 51.19867 48.80133
5 2011-05-01 51.29902 48.70098
6 2011-06-01 51.66519 48.33481
7 2011-07-01 51.68762 48.31238
8 2011-08-01 51.49467 48.50533
9 2011-09-01 51.20394 48.79606
10 2011-10-01 51.34465 48.65535
11 2011-11-01 51.42526 48.57474
12 2011-12-01 50.68933 49.31067
We can see that the male and female attendances are very similar for each month of the year, with seemingly higher male attendance relative to female attendance in the month of December.
In this example, we showed how to retrieve data from the DHIS2 database and manipulate in with some simple R commands. The basic pattern for using DHIS2 and R together, will be the retrieval of data from the DHIS2 database with an SQL query into an R data frame, followed by whatever routines (statistical analysis, plotting, etc) which may be required.