Program




Making life easier: Demo session 1

Friday 20th November, 10:00 - 11:30

Accelerating the application of AI: Automating data fusion and knowledge curation

Mr Sam Adams (RTI International) - Presenting Author
Mr James Rineer (RTI International)

The single greatest barrier to the successful application of AI in Data Science is not the availability of hardware, software or algorithms. It is not even the scarcity and expense of skilled practitioners. It is the accepted fact that the end-to-end process of applying AI and Data Science to multiple sources of data is highly inefficient. Approximately 70-90% of the time to value is spent acquiring, cleaning, integrating, and otherwise wrangling the required data into submission. For all the industry excitement about Machine Learning techniques, the largest area for improvement lies in dramatically reducing the time before Data Scientists can apply their magic.

RTI has developed a cloud-based, automated workflow system for both data wrangling and data science. By automating the data ingestion and integration process, we can efficiently generate and reuse knowledge graph layers that span space, time, entities such as people and organizations, and many other research domains. This creates a “plug and play” environment for complex knowledge capture and curation, which then drives downstream Machine Learning and Analytics processes. The entire workflow, from raw data acquisition to AI insight, can be replicated, customized and deployed on different computing platforms based on the contractual requirements of the research project.

Description:
RTI’s session will provide an overview of our approach and demonstrate several core technologies required for automated knowledge curation and representation. We will also cover the integration with downstream machine learning and simulation systems. Examples will include a national-scale knowledge graph we created to accelerate research into the Opioid Crisis.


Knowledge graph representation for experience sampling data: A proof of concept

Mr Ammar Ammar (Maastricht University) - Presenting Author
Mr Remzi Celebi (Maastricht University)
Miss Lianne Ippel (Maastricht University)
Mr Philippe Delespaul (Maastricht University)

How data are represented is crucial for researchers who want to analyze and explore a dataset. Recently, knowledge graph representation has gained popularity among researchers and large companies as a way to model their data for different applications (e.g., web search, question answering). In a knowledge graph, a researcher creates a network of linked concepts, entities, and instances, from which data-driven rules can be derived. Knowledge graphs are therefore powerful tools for exploring and understanding data via an underlying network or ‘graph structure’. A semantic model, in which each node and relation has a type, is used to interpret the data in a knowledge graph. While knowledge graphs are often encountered in the health sciences, their application within the social sciences is rare. This presentation demonstrates how to model longitudinal survey data using the Resource Description Framework (RDF). RDF is a standard knowledge graph representation on the web that researchers can use to interpret and inspect data. RDF uses the following data structure: Subject - Predicate - Object, e.g., Patient - feels - unhappy. For this study, we use data collected via the Experience Sampling Method (ESM) smartphone application Psymate™. This app notifies the user 10 times a day to fill out a short questionnaire about their current mood and whereabouts. The notifications, or beeps, are unevenly spread throughout the day, though on average 90 minutes apart.
To represent the ESM data, we created an RDF data model consisting of the classes Subject, SurveyBeep, Answer, and Date. Subject represents an individual who takes part in a survey. SurveyBeep represents a survey completed by an individual at a specific moment in time (called a beep). Answer represents an individual's answer to a question in a survey. Each question in a survey is defined as a relation between SurveyBeep and Answer and is a sub-type of the ‘question’ relation. The Date class indicates the date on which a survey was completed. In addition, the nextDate relation captures the longitudinal relationship between two surveys (SurveyBeep) completed on consecutive dates.
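As a rough illustration of this data model, the sketch below turns one completed questionnaire into Subject - Predicate - Object triples using plain Python tuples. The identifier scheme and the predicate names completedBy, completedOn, and value are assumptions made for this example and are not taken from the authors' model; rdf:type and rdfs:subPropertyOf are standard RDF vocabulary terms. The nextDate relation links two beeps and is left out here.

```python
def beep_to_triples(subject_id, beep_id, date, answers):
    """Convert one completed questionnaire into (subject, predicate, object) triples.

    All identifiers and the predicates completedBy, completedOn and value
    are hypothetical names chosen for this sketch.
    """
    subject = f"subject/{subject_id}"
    beep = f"beep/{beep_id}"
    day = f"date/{date}"
    triples = [
        (subject, "rdf:type", "Subject"),
        (beep, "rdf:type", "SurveyBeep"),
        (day, "rdf:type", "Date"),
        (beep, "completedBy", subject),
        (beep, "completedOn", day),
    ]
    for question, value in answers.items():
        answer = f"{beep}/answer/{question}"
        relation = f"question/{question}"
        triples.append((answer, "rdf:type", "Answer"))
        # Each question is its own relation, declared as a sub-type of 'question'.
        triples.append((relation, "rdfs:subPropertyOf", "question"))
        # The question relation connects the SurveyBeep to its Answer.
        triples.append((beep, relation, answer))
        triples.append((answer, "value", str(value)))
    return triples


example = beep_to_triples(
    "s01", "b001", "2020-11-06", {"mood": "unhappy", "location": "home"}
)
for triple in example:
    print(triple)
```

In a real pipeline an RDF library would normally replace the tuples, with proper IRIs and typed literals in place of the strings.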
After modeling the data as RDF, we used the GraphDB tool (http://graphdb.ontotext.com/) to store, query, and visualize the data. In addition to visualizing a small part of the knowledge graph, we can use GraphDB to analyze the data with the SPARQL query language, which allows us to query the RDF data more easily than in relational databases. For example, we can use a SPARQL query to find subjects whose mood changed dramatically over a few consecutive days. This knowledge representation allows us to express the behavior change of an individual over time in the form of RDF triples, as paths and subgraphs in the knowledge graph. In the future, we plan to use graph mining algorithms to extract meaningful patterns from this knowledge graph. We will focus on mining frequent paths and subgraphs, which could help us understand the subjects’ behavior change over time.
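The kind of question described above (whose mood changed sharply between consecutive dates) can be sketched in plain Python by following nextDate edges between beeps. This is a stand-in for a SPARQL query, not the authors' query. It assumes, for brevity, that a numeric mood score is attached directly to each beep; the predicate names mood and completedBy and the scores themselves are invented for the example.

```python
def mood_changes(triples, threshold):
    """Find pairs of beeps linked by nextDate whose mood differs by >= threshold.

    Returns a list of (subject, earlier_beep, later_beep, change) tuples.
    Assumes at most one object per (subject, predicate) pair.
    """
    by_predicate = {}
    for s, p, o in triples:
        by_predicate.setdefault(p, {})[s] = o

    mood = by_predicate.get("mood", {})
    owner = by_predicate.get("completedBy", {})
    results = []
    for earlier, later in by_predicate.get("nextDate", {}).items():
        if earlier in mood and later in mood:
            change = mood[later] - mood[earlier]
            if abs(change) >= threshold:
                results.append((owner.get(earlier), earlier, later, change))
    return results


# Invented example data: one subject, three beeps on consecutive dates.
triples = [
    ("beep/1", "completedBy", "subject/A"),
    ("beep/2", "completedBy", "subject/A"),
    ("beep/3", "completedBy", "subject/A"),
    ("beep/1", "mood", 6),
    ("beep/2", "mood", 5),
    ("beep/3", "mood", 1),
    ("beep/1", "nextDate", "beep/2"),
    ("beep/2", "nextDate", "beep/3"),
]

changes = mood_changes(triples, threshold=3)
print(changes)
```

Only the step from beep/2 to beep/3 crosses the threshold here. Chaining several nextDate edges in the same way would give the longer paths mentioned as a target for graph mining.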


From files to views: Expanding the analytical capabilities of the NIBRS dataset

Mr Ian Thomas (RTI International) - Presenting Author
Dr Marcus Berzofsky (RTI International)
Dr Dan Liao (RTI International)
Dr Alexia Cooper (Bureau of Justice Statistics)
Mr Jason Nance (RTI International)
Mr Alex Giarrocco (RTI International)
Mr Alex Harding (RTI International)
Mr Chew Rob (RTI International)
Dr Kelle Barrick (RTI International)

The National Incident-Based Reporting System (NIBRS) is an incident-based reporting system used by law enforcement agencies in the United States for collecting and reporting data on crimes. Every year the Federal Bureau of Investigation (FBI) releases a NIBRS dataset that represents over 6 million crime incidents per year. Traditionally this has been provided in a “flat file” format, which is computationally intensive to analyze and difficult to use. Recently the FBI has been delivering this data in a new relational database format that makes it easier to access and transform using modern data engineering practices. Taking advantage of this new delivery format, we have created an analysis database that is much faster to use and an interactive tool, called NIBRS Explorer, that makes it much easier to use this data for research.

The NIBRS incident structure, which is the blueprint for the reports filled out by police in NIBRS-certified law enforcement agencies, defines multiple segments and allows for multiple instances of particular segments for each incident. Each of these segments contains multiple fields, adding up to a significant amount of data per incident record. Representing each incident as a long list of values (flat file) requires a great deal of computing power to parse and combine fields in order to analyze millions of records. Representing each incident in the FBI’s new normalized relational database is more efficient, but still requires significant computing power to select and join the fields needed for analysis.

Our analysis database allows for faster analyses by creating a set of tables that structure the data in a format optimized for computation. By using this structure, researchers can get results instantly or within minutes, instead of having to wait hours or days for a result.

In addition to long run times, a steep learning curve, caused by the sheer number of fields contained in each record, creates another barrier to working with both the old and new delivery formats. This is further complicated by the fact that the indicators of interest are sometimes represented by a combination of fields or values. These combinations are defined across multiple documentation sources. For example, a “Part 1 Violent Crime” consists of the offenses named ‘Murder and Nonnegligent Manslaughter’, ‘Rape’, ‘Robbery’, or ‘Aggravated Assault’. Other offenses should not be included when counting “Part 1 Violent Crime”. Calculating the “Part 1 Violent Crime” indicator involves not only learning the definition, but also parsing all the offenses to count only the ones that meet the definition.
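To make the counting step concrete, here is a minimal Python sketch that applies the “Part 1 Violent Crime” definition quoted above to a handful of made-up incidents. The dictionary layout and incident identifiers are assumptions for illustration and do not reflect the actual NIBRS file or database structure.

```python
# Offense names taken from the definition quoted in the text.
PART1_VIOLENT_OFFENSES = frozenset({
    "Murder and Nonnegligent Manslaughter",
    "Rape",
    "Robbery",
    "Aggravated Assault",
})


def count_part1_violent(incidents):
    """Count offenses, across all incidents, that meet the definition."""
    return sum(
        1
        for offenses in incidents.values()
        for offense in offenses
        if offense in PART1_VIOLENT_OFFENSES
    )


# Made-up incidents: each maps a hypothetical incident id to its offenses.
incidents = {
    "incident-1": ["Robbery", "Simple Assault"],
    "incident-2": ["Shoplifting"],
    "incident-3": ["Rape", "Aggravated Assault"],
}

part1_count = count_part1_violent(incidents)
print(part1_count)
```

Every analyst who writes this filter by hand has to get the list of offense names exactly right, which is the burden a shared definition is meant to remove.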

To reduce the learning curve and encourage the use of standard definitions, we created an indicator recoding layer in our analysis database where key concepts are defined and made available for analysts to query. NIBRS Explorer allows users to browse the table structure, build fast analyses and review the definitions of key indicators.
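One way such a recoding layer could look is a database view that defines the indicator once so that analysts query the indicator rather than the raw offense names. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table name offense, the view name incident_indicators, and the sample rows are hypothetical and are not the schema of the analysis database described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE offense (incident_id INTEGER, offense_name TEXT);
INSERT INTO offense VALUES
  (1, 'Robbery'), (1, 'Simple Assault'),
  (2, 'Shoplifting'),
  (3, 'Rape'), (3, 'Aggravated Assault');

-- Hypothetical recoding layer: one row per incident, with a 0/1 flag
-- that is 1 when any offense meets the Part 1 Violent Crime definition.
CREATE VIEW incident_indicators AS
SELECT incident_id,
       MAX(CASE WHEN offense_name IN (
                 'Murder and Nonnegligent Manslaughter', 'Rape',
                 'Robbery', 'Aggravated Assault')
           THEN 1 ELSE 0 END) AS part1_violent
FROM offense
GROUP BY incident_id;
""")

# Analysts query the indicator without restating its definition.
rows = conn.execute(
    "SELECT incident_id, part1_violent FROM incident_indicators "
    "ORDER BY incident_id"
).fetchall()
violent_incidents = conn.execute(
    "SELECT SUM(part1_violent) FROM incident_indicators"
).fetchone()[0]
print(rows, violent_incidents)
```

A production system might materialize such a view as a table to avoid recomputing it on each query, in line with the speed goal stated above.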

This live demonstration will run analyses that illustrate the speed improvements gained and exhibit the interactive tool used to explore the data, learn definitions, and generate analysis queries.


A SAS Macro for Calibration of Survey Weights

Dr Tony An (SAS Institute Inc.) - Presenting Author

In survey sampling, calibration is commonly used for adjusting weights to ensure that estimates for covariates in the sample match known auxiliary information, such as marginal totals from census data. Calibration can also be used to adjust for unit nonresponse. This presentation discusses a macro for calibration that was developed using SAS/STAT 15.1 and SAS/IML software. The macro enables you to input the design information, the controls for the auxiliary variables, and your preferred calibration method, either linear or exponential. Because unbounded calibration methods can result in extreme calibration weights, the macro also supports bounded versions of both the linear and exponential calibration methods. The macro creates calibration replication weights according to the sample design and the specified calibration method. Examples are given to illustrate how to use the macro.
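For readers unfamiliar with the technique, the following Python sketch shows unbounded linear calibration in its textbook form: each design weight d_i is multiplied by 1 + x_i'λ, where λ solves (Σ d_i x_i x_i') λ = T − Σ d_i x_i for the known totals T. It is not the SAS macro and does not cover the exponential method, bounds, or replicate weights; the function names and the toy data are assumptions for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def linear_calibrate(design_weights, aux, totals):
    """Return weights w_i = d_i * (1 + x_i' lam) that reproduce the totals."""
    p = len(totals)
    pairs = list(zip(design_weights, aux))
    # Weighted cross-product matrix: sum of d_i * x_i x_i'.
    cross = [[sum(d * x[j] * x[k] for d, x in pairs) for k in range(p)]
             for j in range(p)]
    # Gap between the known totals and the design-weighted estimates.
    gap = [totals[j] - sum(d * x[j] for d, x in pairs) for j in range(p)]
    lam = solve(cross, gap)
    return [d * (1 + sum(l * xj for l, xj in zip(lam, x))) for d, x in pairs]


# Toy sample: auxiliary vector is (1, female indicator); the known
# population totals are 50 people, 30 of them female (invented numbers).
d = [10.0, 10.0, 10.0, 10.0]
x = [(1, 1), (1, 1), (1, 0), (1, 0)]
w = linear_calibrate(d, x, totals=[50, 30])
print(w)
```

In this toy case the weights of the two female respondents rise while the others stay unchanged, so both totals are met. With less balanced data the unbounded linear method can produce very large or even negative weights, which is the motivation for the bounded versions mentioned in the abstract.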