Lumen: Leveraging Unstructured Data to Enhance Crime Analysis

I.  Introduction to the Problem

Police records contain an enormous amount of unstructured data. For example, one suburban Denver agency has generated an average of 14,000 incident records, with 20,000 written narratives, each year over the last ten years.  In many cases, often not by design, critical information such as M.O. details, property descriptions, gang names, and even the names of people and organizations appears only in the narrative.

And yet, crime analysis to date focuses almost exclusively on information that is not found in a narrative but rather in structured data.  For example, hotspot analysis based on crime types, risk terrain modeling, DDACTS, and all predictive policing methodologies rely exclusively on data ultimately derived from structured fields in a database. While these types of analysis are known to be quite useful and powerful, they are fundamentally limited when dealing with unstructured, contextual information, such as that found in report narratives.

Consider the following examples:

  • An analyst has been asked to study the impact of increased enforcement efforts on a particular problematic bar in the agency's jurisdiction. The bar can be associated with incidents in any number of ways, including the type of incident (e.g., assaults), DUI and public intoxication arrests of people who had been drinking at the bar, and others. Most of this information is not represented in any particular RMS field, but is stated by an officer in the narrative.
  • An analyst has been asked to determine any trends and spatial patterns associated with the theft of copper or backflow devices. This property is often described only in the narratives.
  • A police chief is interested in assessing the spread of heroin in the community. Heroin is present not only through actual possession or intent-to-distribute offenses but also through officers finding signs of heroin use during searches, and through field interviews in which interviewees, witnesses, and suspects discuss their own or others’ heroin addiction. This information is contained in fields as well as in narratives.

In all of these cases, critical information resides in multiple places, and the specific locations where the information resides will vary from one case to the next even in the same records management system (RMS) in one agency.

Traditional tools used in crime analysis, whether Microsoft Access, Crystal Reports, or many of the specialized crime-analysis tools currently in use, all approach this problem in essentially the same way: through structured queries of one or more fields, using query languages such as SQL.  The difficulty with this approach is that it cannot easily find data without first knowing where to look.  Essentially, the user must enumerate all possible fields where the data of interest might be stored.  This is a labor-intensive and error-prone task.  In many cases, such analyses are not even attempted due to the level of effort required.  This lack of system integration suggests the need for a tool that examines the narrative portions of the data, where qualitative information is recorded, in order to gain a deeper understanding of the problem being analyzed.
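
To make the field-enumeration burden concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are hypothetical and far simpler than any real RMS schema; the point is only that every text column that might hold the term has to be listed by hand.

```python
import sqlite3

# Hypothetical, simplified RMS table; a real system has many more text fields.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE incidents (
           incident_id INTEGER PRIMARY KEY,
           narrative TEXT,
           location_comment TEXT,
           property_description TEXT
       )"""
)
conn.executemany(
    "INSERT INTO incidents VALUES (?, ?, ?, ?)",
    [
        (1, "Suspect cut copper pipe from the basement.", "", ""),
        (2, "Victim reported a theft from the job site.", "", "spool of copper wire"),
        (3, "Noise complaint, no report taken.", "rear lot", ""),
    ],
)

# Every column that might contain the term must be enumerated explicitly.
TEXT_FIELDS = ["narrative", "location_comment", "property_description"]


def find_incidents(term: str) -> list[int]:
    """Return IDs of incidents where any enumerated field contains the term."""
    where = " OR ".join(f"{field} LIKE ?" for field in TEXT_FIELDS)
    sql = f"SELECT incident_id FROM incidents WHERE {where} ORDER BY incident_id"
    params = [f"%{term}%"] * len(TEXT_FIELDS)
    return [row[0] for row in conn.execute(sql, params)]


copper_ids = find_incidents("copper")
print(copper_ids)
```

If a relevant column is left out of TEXT_FIELDS, matching incidents are silently missed, which is the error-prone step described above.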

II.  Enterprise Search

In recent years, enterprise search has become a tool of great value to many institutions.  In the enterprise search model, data from many internal sources is indexed and made searchable, in much the same way that Google or Bing indexes web pages and provides a search interface.

Enterprise search can provide very high-performance queries and return search results that would be difficult or impossible to obtain using structured SQL queries. Furthermore, through the use of stemming and other techniques, queries of written text can identify relevant records even if the text does not exactly match what the user is searching for.  For example, a user might search for “copper theft,” and a properly configured enterprise search engine can return documents containing “copper theft” as well as “copper thefts” automatically.  Typically, an enterprise search engine is also capable of returning results ordered by relevance.  This can help when combing through large numbers of records.
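
The stemming and relevance ideas can be illustrated with a toy Python sketch. It is not how any particular search engine works: crude_stem is a deliberately naive plural stripper standing in for a real stemmer, the score is a simple term count standing in for real relevance ranking, and the sample narratives are invented.

```python
import re
from collections import Counter


def crude_stem(word: str) -> str:
    """Naive plural stripping; an illustrative stand-in for a real stemmer."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphanumeric words and stem each one."""
    return [crude_stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())]


def search(query: str, documents: dict[int, str]) -> list[int]:
    """Return IDs of documents containing every query stem, most matches first."""
    query_stems = set(tokenize(query))
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        if all(counts[stem] > 0 for stem in query_stems):
            score = sum(counts[stem] for stem in query_stems)
            scored.append((score, doc_id))
    # Higher score first; ties broken by lower document ID.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Invented sample narratives keyed by a hypothetical incident ID.
narratives = {
    1: "Reporting party described a copper theft at the construction site.",
    2: "Third of several copper thefts this month; copper wire taken again.",
    3: "Vehicle theft reported at the mall.",
}
print(search("copper theft", narratives))
```

Because both the query and the text are stemmed, the narrative that says “copper thefts” matches a search for “copper theft,” while the vehicle theft report is excluded because it never mentions copper.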

The downside of generic enterprise search for crime analysis is that the focus is primarily on search, and there is little in the way of analytical capabilities provided.  The users of such systems are left to provide their own analysis.  Furthermore, there is a great deal of valuable information in the structured data that should not be neglected, but a generic enterprise search system is incapable of using this structure.

III.  Lumen: Hybrid Enterprise Search and Structured Analytics

In order to address these issues, Numerica Corporation developed Lumen, tailored specifically for law enforcement and crime analysis users.  Lumen offers the following capabilities related to unstructured and structured search and analytics:

  • Full text enterprise search of records from SQL databases as well as documents and files.   RMS, CAD, LPR, Intel system, attachments, and even file archives can all be indexed and made searchable with a Google/Bing-like search interface.
  • Structured search and analytics of structured data such as SQL databases. Unstructured search can be combined with structured search, to provide more accurate and tailored results. Furthermore, analytics based on times, locations, and field values from structured data such as RMS incident records can be combined with unstructured full text searches, providing for a more robust searching capability.
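
The combination of unstructured and structured search described in the list above can be sketched in Python as follows. This is an illustration under stated assumptions, not Lumen's implementation: the Incident fields, the hybrid_search function, the bar name “Rusty Anchor,” and all sample records are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Incident:
    incident_id: int
    occurred: date          # structured field
    charge: str             # structured field
    narrative: str          # unstructured text
    location_comment: str   # unstructured text


def hybrid_search(incidents, phrase, start, end):
    """Match the phrase in any text field, restricted to a structured date range."""
    needle = phrase.lower()
    hits = []
    for inc in incidents:
        text = f"{inc.narrative} {inc.location_comment}".lower()
        if needle in text and start <= inc.occurred <= end:
            hits.append(inc)
    return hits


# Invented sample records.
records = [
    Incident(1, date(2011, 3, 5), "DUI",
             "Driver stated he had been drinking at the Rusty Anchor.", ""),
    Incident(2, date(2012, 7, 14), "Assault",
             "Fight in the parking lot.", "Rusty Anchor, north entrance"),
    Incident(3, date(2009, 1, 2), "DUI",
             "Driver left the Rusty Anchor around midnight.", ""),
    Incident(4, date(2011, 9, 9), "Theft",
             "Bicycle taken from a front porch.", ""),
]

hits = hybrid_search(records, "rusty anchor", date(2010, 1, 1), date(2012, 12, 31))
# Structured analytics on the full-text result: tally hits by charge.
charge_counts = Counter(inc.charge for inc in hits)
print([inc.incident_id for inc in hits], dict(charge_counts))
```

The text match finds the bar whether it appears in the narrative or the location comment, and the structured fields then narrow and summarize the result.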

Consider the examples described in Section I.   For each example, we show below how Lumen can be used to generate the analysis required.

Example 1: Problem Bar

In this case, the analyst is searching for incidents in which the bar is involved. The name of the bar is unique enough that a search for the name of the bar finds incidents associated with it and not spurious associations that happen to match the name.  The name of the bar appears in both narratives and location comments, so a search of a single field would be insufficient.  Figure 1 shows the results that were obtained (results have been anonymized).  Most of the search results appear near the bar, but a significant number are in other locations.  Many of these are DUI and other incidents in which one or more people involved had been drinking at the bar previously.

Figure 1. Full text search for a bar name in incidents since 2010.

Figure 2 shows the trend over time.  The number of incidents matching the search is plotted on a yearly basis.  (An important point to note is that, even though a single incident may contain multiple matches against the search term, it is counted only once in this chart.)  The dramatic spike in incidents from 2010-2012 led to increased efforts to deal with the problem; the steep drop off in later years is likely a result of that effort.
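
The counting rule noted above, where an incident counts once no matter how many times the term appears in it, can be sketched in Python. The function name, the bar name “Rusty Anchor,” and the sample narratives are hypothetical.

```python
import re
from collections import Counter


def yearly_incident_counts(incidents, term):
    """Count matching incidents per year; each incident counts at most once."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    counts = Counter()
    for year, narrative in incidents:
        # One match is enough; additional matches in the same narrative are ignored.
        if pattern.search(narrative):
            counts[year] += 1
    return dict(sorted(counts.items()))


# Invented sample data: (year, narrative) pairs.
sample = [
    (2010, "Subject was drinking at the Rusty Anchor. "
           "Staff at the Rusty Anchor called police."),
    (2010, "Disturbance outside the Rusty Anchor."),
    (2011, "Driver said he came from the Rusty Anchor."),
    (2011, "Shoplifting at a grocery store."),
]

per_year = yearly_incident_counts(sample, "rusty anchor")
print(per_year)
```

The first narrative mentions the bar twice but adds only one to the 2010 total, so the chart reflects incidents rather than mentions.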

Figure 2. Trend since 2005 showing increased prevalence of incidents related to the bar of interest.

Figure 3 shows the nature of the charges in each incident.  In this analytic, there can be more than one charge per incident, and the total number of charges of each type is displayed.  Many of the incidents are directly related to alcohol, as expected.

Figure 3. Charges in incidents related to the bar over the last six years.

Example 2: Copper Theft and Backflow Devices

In this case, the spatial and temporal trends of two different types of property theft are analyzed: theft or burglary of copper pipes and wires, and theft or burglary of backflow devices. In both cases, the words “copper,” “wire,” “pipe,” and “backflow” can show up in narratives but also in property descriptions. As seen in Figure 4, the number of incidents in recent years for these two searches appears highly correlated.  (In fact, both are fairly well correlated with metal commodity spot prices.)
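
One common way to quantify how closely two yearly series move together is the Pearson correlation coefficient. The Python sketch below assumes that measure; the article does not say which one was used, and the yearly counts are made-up placeholders, not the agency's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Placeholder yearly incident counts, invented purely for illustration.
copper_per_year = [4, 7, 12, 9, 5]
backflow_per_year = [2, 3, 6, 5, 2]

r = pearson(copper_per_year, backflow_per_year)
print(round(r, 3))
```

A value near 1 means the two series rise and fall together. It says nothing about whether the incidents occur in the same places, which is why the maps are examined separately.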

Figure 4. Comparison of copper theft vs. backflow device theft over a 10-year period.

Figures 5 and 6 make clear, however, that while there is some overlap in incident locations, there are a number of locations that are different between the two series.

Figure 5. Choropleth map showing locations of copper theft.

Figure 6. Choropleth map showing backflow device theft locations.

Example 3: Heroin

Heroin use is increasing in many jurisdictions across the United States. Heroin’s appearance in an incident can take many forms, ranging from actual heroin possession to possession of drug paraphernalia to a suspect simply admitting to an officer during an interview that he or she is a heroin addict.  A full text search can find all of these.  Figure 7 shows a breakdown by year and charge description for all incidents involving heroin over the last six years.  “Assist other agency” is high on the list because this particular agency has a narcotics dog that is used by other agencies. Many charge types show a significant increase in the number of incidents.

Figure 7. Analysis of incidents involving heroin over time.

IV.  Takeaways

Records management systems in policing house tremendous amounts of data, yet crime mapping and analysis have depended on the traditional data found in coded fields in these databases. Police narratives, however, contain rich data about incidents that is not captured in coded fields.  This information has traditionally been an untapped resource for analyzing crime patterns. Narratives and other unstructured data sources in police reports have significant utility in modern data-driven policing, even though they have traditionally seen little use because of the sheer volume of text and the lack of available tools for examining it. Today, through hybrid enterprise search and structured analytics tools such as Lumen, analysts can create products that exploit the value of unstructured data in combination with the rich structured data they already use every day. Ultimately, by using both data sources together with analytic tools, crime analysts will be better equipped to identify critical aspects of incidents and facilitate greater problem solving.

Nick Coult

As vice president of the Interactive Intelligence Systems business unit at Numerica Corporation, Nick oversees the teams creating innovative intelligence solutions for customers in law enforcement, defense, and private industry. Nick previously served as a program director for integrated air and missile defense at Numerica. Prior to joining the company in 2008, Nick spent ten years working with leading mathematicians, scientists, and engineers on innovative, unique solutions and products for problems in seismic exploration, space physics, data compression, and image processing. He has led the development and launch of five separate product offerings and is the author or co-author of eight published papers and one patent. Education: M.B.A. – Massachusetts Institute of Technology; Ph.D. – Applied Mathematics, University of Colorado at Boulder; M.S. – Applied Mathematics, University of Colorado at Boulder; B.A. – Mathematics, Carleton College
