By: Alex Baumberg & Amos (Kippi) Bordowitz
One of the most basic problems we run up against when using data from several organizations is the heterogeneity of data formats and of the means to access the data. In addition, data tends to become fragmented, even within the same organization. During a research project, this problem can become acute.
Before we delve into the solution, let us make sure we understand the problem. Think of a pregnant person; imagine all the procedures they must go through during their pregnancy and leading up to the delivery. They might find themselves having tests and procedures in a variety of medical centers, from their GP all the way to the hospital in which they will give birth. In today’s world, unfortunately, every organization maintains its data differently, meaning that data preparation for Federated Learning* models becomes a slow and costly process, as all data must be standardized prior to ingestion.
*According to Wikipedia, Federated Learning is “a machine learning technique that trains an algorithm across multiple decentralized edge devices or servers holding local data samples, without exchanging them”.
The problem is compounded when one wants to run a research project using data from several organizations. Research data must be precise, and all data of a certain type must be represented in exactly the same way; otherwise the final results may be skewed, or obtaining results at all may become considerably more difficult and time-consuming. This can lead to the loss of grant money and many person-hours. To add another level of complication, most medical research requires that the data be de-identified, which is harder still when the data isn’t standardized.
Here at Outburn, we have developed our own in-house solution for on-demand data de-identification, while simultaneously preparing the data for federated learning models.
In this article, we will outline our solution and its architecture.
Business workflow:
Our project, a collaboration with HUJI, is meant to assess the link between exposure to Cytomegalovirus (CMV) during pregnancy and the health of the newborn children post-partum.
Once the researchers have received the Helsinki committee guidelines for their project, the next step is to build a de-identification profile for the data according to those guidelines. In this project we received a de-identification profile set covering data from pregnant women (including items such as number of pregnancies and pregnancies carried to term) and their newborns, along with several other data sources, such as medications taken by the mother and lab results.
Our de-identification
repository now contains the created profile for use in future iterations of the
research project, using different target populations. This allows us to work
hard once and then reuse the products of our labor. We then built FHIR profiles
that represent all the data required for the project and used them to populate a
research FHIR server. The server can be repopulated as often as needed with new
data according to researcher requirements.
De-identification process:
As seen in the
diagram above, the researcher must be authenticated in order to make sure that
only people who have received the proper clearance can start the de-identification
process, as they will also be the recipients of the research results.
Once logged in,
the user defines the research population. The research query is then
transferred to the authentication unit (step 1) for approval by the data security
manager (step 2). Next, the researcher initiates the de-identification service (step
3) which queries the FHIR server and retrieves the data pertinent to the
specific research question (step 4). The service then retrieves the de-identification
profile and applies it to the data (step 5). The products are then stored in a
secured, dedicated data storage unit (step 6) which will be accessed by the
federated learning machine (“FLM” – step 7).
This entire process is run in each of the organizations taking part in the research project. The FLM in each organization creates a “model-result”, which is then exchanged between the organizations. Thus, the federated machine learning technique trains an algorithm across multiple organizations through “model-result” exchange between the partner organizations participating in the project.
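To make the “model-result” exchange more concrete, here is a minimal sketch of one common way such results can be combined: a sample-weighted average of model parameters, often called federated averaging. The article does not state which aggregation method the FLMs use, so the function name `federated_average`, the flat list-of-floats representation of a model, and the example numbers are all assumptions made for illustration.

```python
from typing import List, Tuple

# A "model-result" is assumed here to be a flat list of parameters
# plus the number of local samples it was trained on.
ModelResult = Tuple[List[float], int]


def federated_average(results: List[ModelResult]) -> List[float]:
    """Combine per-organization parameters into one sample-weighted average."""
    if not results:
        raise ValueError("at least one model-result is required")
    dim = len(results[0][0])
    if any(len(weights) != dim for weights, _ in results):
        raise ValueError("all model-results must have the same number of parameters")
    total_samples = sum(n for _, n in results)
    if total_samples <= 0:
        raise ValueError("total sample count must be positive")
    # Each organization contributes in proportion to its local sample count.
    return [
        sum(weights[i] * n for weights, n in results) / total_samples
        for i in range(dim)
    ]


# Two hypothetical organizations; only parameters and counts are shared,
# never the underlying patient records.
org_a = ([1.0, 2.0], 100)
org_b = ([3.0, 4.0], 300)
combined = federated_average([org_a, org_b])
print(combined)
```

The point of the sketch is that only parameters and sample counts cross organizational boundaries; the de-identified records themselves stay in each organization's research storage.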
Technical Workflow:
Let’s take a deep
dive into the diagram above:
1. Data Source: Organizational data sources of identified data. These are fed into the ESB (Enterprise Service Bus, a type of integration server). Each organization is responsible for transforming its native data into FHIR transactions, according to the FHIR data model specified for the particular research project. The data is then fed into the FHIR server.
2. FHIR Server: Stores all the resources required for the project. The references between these resources make up the data model.
3. FADE (FHIR Advanced Data Extract): Proprietary data preparation and de-identification tool. The following describes FADE’s workflow:
The “Client” component, the application used by the researcher, is used to manage the entire data preparation and de-identification process. The component is accessible by secure communication over SSL VPN and requires authorization through the Loopback access control lists (ACL) component. The ACL component controls user credentials (roles) linked to specific research projects, so that only authorized researchers can initiate data preparation and de-identification. The Client component uses methods exposed by the REST services of the data preparation component and provides the researcher with the following capabilities:
- Define research population: the system allows for the construction of complex queries with include/exclude rules. This allows maximum flexibility for building complex research group criteria that cannot be expressed in a single FHIR search query. An example of a complex query containing include and exclude rules: “All female patients between 20 and 30 years old, including patients having given birth in the last 5 years, including pregnancies having CMV IgG Ab findings, excluding patients having gestational diabetes mellitus.”
- Process initiation and monitoring
- Refresh population. This method
collects the patient population as currently stored on the FHIR
server. This explicit list of patients will be used when the population
is exported, allowing subsequent runs to re-use the exact same population.
- Get the number of patients used for the current run.
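As an illustration of the “Define research population” capability, the sketch below models a complex query as a list of include/exclude rules, each of which renders to a simple FHIR search URL. The `Rule` class and the helper functions are hypothetical and do not reflect FADE's actual interface. `gender`, `birthdate`, and `code` are standard FHIR search parameters, but the code values used here are placeholders rather than real terminology codes, and the mapping of each clinical criterion to a resource type depends on the project's data model.

```python
from dataclasses import dataclass
from datetime import date
from urllib.parse import urlencode


@dataclass(frozen=True)
class Rule:
    """One atomic rule of a research-population query (hypothetical structure)."""
    action: str          # "include" or "exclude"
    resource_type: str   # FHIR resource type to search
    params: tuple        # (name, value) pairs for the search

    def to_search_url(self) -> str:
        # Render the rule as a relative FHIR search URL.
        return f"{self.resource_type}?{urlencode(self.params)}"


def years_before(day: date, years: int) -> date:
    """Return the same calendar day `years` earlier (Feb 29 falls back to Feb 28)."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        return day.replace(year=day.year - years, day=28)


def age_range_params(min_age: int, max_age: int, today: date) -> tuple:
    """Translate an inclusive age range into FHIR `birthdate` search parameters."""
    latest = years_before(today, min_age)        # born on or before: at least min_age
    earliest = years_before(today, max_age + 1)  # born after: not yet max_age + 1
    return (("birthdate", f"le{latest.isoformat()}"),
            ("birthdate", f"gt{earliest.isoformat()}"))


def build_example_rules(today: date) -> list:
    # The code values below are placeholders, not real LOINC or SNOMED CT codes.
    return [
        Rule("include", "Patient",
             (("gender", "female"),) + age_range_params(20, 30, today)),
        Rule("include", "Observation", (("code", "CMV-IGG-AB-PLACEHOLDER"),)),
        Rule("exclude", "Condition", (("code", "GDM-PLACEHOLDER"),)),
    ]


for rule in build_example_rules(date(2024, 6, 15)):
    print(rule.action, rule.to_search_url())
```

Keeping each rule small enough to be a single search is what makes it possible to run the rules separately and combine their results afterwards.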
The data validation and approval component allows an organizational data security professional to review and approve the project’s population and the de-identification profile chosen for the project, as well as to assess the de-identified data post export.
The data preparation process retrieves a research population from the FHIR server, as defined by the researcher, using RESTful patient-level queries, for further de-identification. The entire data retrieval process is designed to allow the target population data to be exported from the FHIR server according to the complex query defined by the researcher. The Bulk Export process, natively supported by the FHIR server, lacks support for complex query execution. To bypass this limitation, we developed and deployed our own in-house solution. The complex query is broken down into simple, atomic queries. For example, loosely based on the query discussed above: first query, “All female patients between 20 and 30 years old”; second, “gave birth in the last five years”; and so on for each remaining rule. Each query populates a “Group” resource. We then merge the resulting sub-populations into a single, coherent population Group.
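The merge step can be sketched with plain set operations. This is not FADE's actual code: it assumes each atomic query has already been resolved to a set of patient IDs, that include rules narrow the population by intersection, and that exclude rules are subtracted afterwards. The function names and the patient IDs are hypothetical, and the output follows the FHIR R4 shape of a `Group` resource.

```python
from typing import Iterable, List, Set


def merge_populations(includes: List[Set[str]],
                      excludes: Iterable[Set[str]] = ()) -> Set[str]:
    """Intersect all include sub-populations, then remove every excluded patient."""
    if not includes:
        raise ValueError("at least one include sub-population is required")
    merged = set(includes[0])
    for ids in includes[1:]:
        merged &= ids   # patient must satisfy every include rule
    for ids in excludes:
        merged -= ids   # patient must satisfy no exclude rule
    return merged


def to_group_resource(patient_ids: Set[str], name: str) -> dict:
    """Build a FHIR R4-style Group resource listing the merged population."""
    return {
        "resourceType": "Group",
        "type": "person",
        "actual": True,
        "name": name,
        "quantity": len(patient_ids),
        "member": [
            {"entity": {"reference": f"Patient/{pid}"}}
            for pid in sorted(patient_ids)
        ],
    }


# Hypothetical results of three atomic queries.
female_20_to_30 = {"p1", "p2", "p3", "p4"}
gave_birth_recently = {"p2", "p3", "p4"}
gestational_diabetes = {"p4"}

population = merge_populations([female_20_to_30, gave_birth_recently],
                               [gestational_diabetes])
group = to_group_resource(population, "cmv-study-population")
print(group["quantity"], [m["entity"]["reference"] for m in group["member"]])
```

Storing the merged result as an explicit member list is also what allows later runs to reuse exactly the same population.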
The de-identification profile repository keeps the de-identification profile (mask) used for data de-identification, as approved by the organizational “Helsinki data committee.” The profile repository can store multiple de-identification profiles per research project.
The de-identification process applies a de-identification mask to the identified data previously retrieved from the FHIR server; the result is then submitted for approval by the organization.
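A rough sketch of what applying a mask to a single resource might look like follows. The profile format shown here (a mapping from top-level field names to actions) is purely an assumption for illustration, since the article does not describe FADE's profile format; the action names, the secret key, and the sample patient are likewise made up. A real profile would also need to handle nested elements and extensions.

```python
import copy
import hashlib
import hmac

# Hypothetical de-identification profile: top-level field -> action.
PROFILE = {
    "name": "remove",
    "telecom": "remove",
    "address": "remove",
    "id": "pseudonymize",
    "birthDate": "year_only",
}


def pseudonymize(value: str, secret: bytes) -> str:
    """Replace a value with a keyed hash so the same input maps to the same token."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def apply_mask(resource: dict, profile: dict, secret: bytes) -> dict:
    """Return a de-identified copy of `resource`; the input is left untouched."""
    result = copy.deepcopy(resource)
    for field, action in profile.items():
        if field not in result:
            continue
        if action == "remove":
            del result[field]
        elif action == "pseudonymize":
            result[field] = pseudonymize(str(result[field]), secret)
        elif action == "year_only":
            result[field] = str(result[field])[:4]  # keep only the YYYY part
        else:
            raise ValueError(f"unknown action: {action}")
    return result


patient = {
    "resourceType": "Patient",
    "id": "12345",
    "name": [{"family": "Example", "given": ["Jane"]}],
    "gender": "female",
    "birthDate": "1992-03-14",
}
masked = apply_mask(patient, PROFILE, secret=b"demo-key-not-for-production")
print(masked)
```

Using a keyed hash rather than deleting the identifier keeps references between resources of the same patient consistent after masking, provided the same key is used throughout a run.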
Research Storage stores the de-identified ndJSON files used as the source of data for the federated learning server.
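For readers unfamiliar with the format, ndJSON (newline-delimited JSON) holds one complete JSON document per line, which lets a consumer read a large export one resource at a time. The following minimal sketch shows serialization and parsing; the helper names and the sample resources are hypothetical.

```python
import json
from typing import Iterable, List


def to_ndjson(resources: Iterable[dict]) -> str:
    """Serialize each resource as compact JSON on its own line."""
    return "".join(
        json.dumps(resource, separators=(",", ":")) + "\n"
        for resource in resources
    )


def from_ndjson(text: str) -> List[dict]:
    """Parse ndJSON text back into a list of resources, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Hypothetical de-identified resources.
resources = [
    {"resourceType": "Patient", "id": "a1b2", "gender": "female", "birthDate": "1992"},
    {"resourceType": "Observation", "id": "c3d4", "status": "final"},
]
text = to_ndjson(resources)
print(text, end="")
```

Because `json.dumps` escapes newline characters inside strings, each serialized resource is guaranteed to occupy exactly one line.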
Federated Learning. The federated machine learning technique trains an algorithm across multiple organizations, each holding data samples retrieved from its research storage. This requires an exchange of “Model Results” between the partner organizations participating in a particular research project. Data exchange between different organizations is performed by a secure Proxy server installed in the DMZ of the organizational network.
4. Proxy server: Used for secure “Model Results” data exchange between Federated Learning nodes located at partner organizations. The data exchange process is managed through secure REST API calls.