Beyond Keywords: Is Keyword Search Becoming Obsolete In The New Age Of Forensic Digital Investigation?

by James Billingsley

Keyword searching is the primary tool investigators use to identify relevant evidence in a data set. However, poorly chosen keywords can miss important items or return too many irrelevant results. As data volumes grow, investigators must find better ways to focus on the items of interest within very large data sets. Expert forensic technician and investigator James Billingsley explains how visualising communication networks, timelines, maps and links between data sources can rapidly establish key players, their locations and their involvement in a matter of interest – all supported by forensic artefacts required for provenance.

Searching for the answers

Before the advent of computing, investigators who sought evidentiary documents that were relevant to their case faced the painstaking task of sifting through all the available pieces of paper and handwritten notes until only the significant ones remained.

The global adoption of computers and digitisation introduced a time-saving tool like no other: keyword searching. Investigators didn’t even have to read the documents; they would simply compile a list of keywords relevant to the investigation and use computer searches to find any instances of these words on the electronic media.

Making searches more effective

Keywords can be a powerful examination technique, especially when search queries are well crafted. Over many years, investigators and technology vendors developed ways to make keyword searching more effective:

  • Forensic carving. We developed sophisticated ways to extract searchable text from areas of unused data, deleted data and volatile (temporarily stored) data. This would help uncover evidence that was created by automated processes, or deleted or modified in attempts to hide it.
  • Decoding and text extraction. Not all data can easily be searched. Often it must first be decoded and presented in a text-searchable form. Decryption, decoding, optical character recognition (OCR) and many other techniques ensure the largest number of evidence sources can be searched for keywords.
  • Indexing. To make keyword searching faster, repeatable and more accurate, most software tools moved to indexing the data on electronic devices. This process entailed recording the location of every word within the data before carrying out any searches. Indexing sped up the overall search process and made it possible to construct complex search criteria and receive near-instantaneous results.
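
To make the indexing idea concrete, here is a minimal sketch of an inverted index in Python. It is an illustration only, not how any particular forensic tool works; the file names, sample text and function names are all hypothetical.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lower-cased term to the (doc_id, word_position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, match in enumerate(re.finditer(r"\w+", text.lower())):
            index[match.group()].append((doc_id, position))
    return index

def search_all(index, *terms):
    """Return the sorted ids of documents containing every term (a simple AND query)."""
    doc_sets = [{doc_id for doc_id, _ in index.get(term.lower(), [])} for term in terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []

# Hypothetical sample data, for illustration only.
documents = {
    "email_1.txt": "Transfer the invoice to the new account",
    "chat_7.txt": "The invoice was deleted yesterday",
    "notes.txt": "Meeting moved to Friday",
}
index = build_index(documents)
hits = search_all(index, "invoice", "deleted")
```

Because the text is scanned once when the index is built, each later query is a dictionary lookup and a set intersection rather than a fresh pass over the data.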

Common keyword problems

Effective keyword searching also requires experience, and it can be largely ineffective if not applied appropriately. Specifically, the keyword search technique has always presented the same two problems:

  • Too few results. If investigators fail to predict the exact keywords that will lead to the relevant data, they will likely miss important evidence. At worst this could lead them to an inaccurate investigation conclusion. Even a simple typing error could have a substantial effect on results.
  • Too many results. Large lists of untargeted keywords often return huge volumes of irrelevant data. Reviewing these false positives wastes investigation time and money. Compiling large lists of keywords, in an attempt to capture all relevant data, further exacerbates the problem of returning results irrelevant to the case.

The following are some examples of bad keywords.

  • “dave” – Using a suspect’s name as a keyword may seem a sensible approach. However, if the suspect has named their user account on the computer after themselves, searching for “dave” will return masses of irrelevant data as the term forms a key part of the file system directory structure.
  • “GE” – Using short keywords or acronyms – including company names, slang words or a person’s initials – will often return huge volumes of irrelevant data. Such a short term will occur with great frequency within the data. For example, the letters “ge” occur more than 30 times just in this article.
  • “window” – A term such as this will return large volumes of irrelevant data. It forms a key part of the directory structure and documentation of Microsoft’s Windows operating system, for example.
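
The “GE” problem can be sketched in a few lines of Python. This example compares a raw substring search with a whole-word search; the sample sentence and function names are hypothetical and chosen only to illustrate the difference.

```python
import re

def count_substring(term, text):
    """Count every occurrence of term, even inside longer words."""
    return len(re.findall(re.escape(term), text, flags=re.IGNORECASE))

def count_whole_word(term, text):
    """Count occurrences of term only where it stands alone as a word."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Hypothetical sample text, for illustration only.
sample = "GE agreed to manage the large budget change."

substring_hits = count_substring("ge", sample)    # also matches manage, large, budget, change
whole_word_hits = count_whole_word("ge", sample)  # matches only the standalone "GE"
```

Whole-word matching reduces the noise from short terms, but it would not help with the “dave” example: a user name appears as a complete word in every file path beneath that user's profile directory.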

Seeking a better way

Even with the most finely crafted searches, the number of results can be hard for a human being to take in. As we continue to aggressively digitise every facet of our lives, the explosion of data demands new, better ways to present the results to investigators so they can more easily digest and comprehend the data. This is where data visualisation comes into play.

The mind’s eye

“The purpose of visualisation is insight, not pictures.”[1]

Visualisations are one of the easiest ways for the human brain to receive and interpret large quantities of data. In investigations, visualising data achieves three primary goals:

  • Intelligence gathering: discovering and establishing facts significant to the investigation.
  • Interpretation: understanding and giving meaning to the intelligence.
  • Communication: effectively reporting findings to a wider audience.

While many fields of knowledge have enthusiastically embraced data visualisation, its adoption has been slow in digital forensics and investigation. Those who have made the transition have seen their investigation workflows transformed by the way they can quickly present and interpret vast and growing volumes of keywords, statistics and raw data.

For example, Figure 1 shows a large quantity of communication records and messages, extracted from mobile phones and a desktop computer, presented to the investigator in a tabular text fashion.

Figure 1: A list of communication records and associated metadata in tabular form.

Discerning any patterns in these communications would require painstaking work and attention. This is because our brains interpret this data using verbal processing. However, by visualising this data, we can quickly extract value from it and communicate it in our investigation findings.

In Figure 2 we see the same communication data displayed in a network visualisation. Immediately we can comment on the primary communicators, the people they speak to, and the frequency of contact.

Figure 2: The same data as Figure 1 presented as a network of communications between devices.
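
The data behind a network view like this can be sketched as a weighted graph. The following Python is a minimal illustration, assuming the communication records have already been reduced to (sender, recipient) pairs; the names and records are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (sender, recipient) records, for illustration only.
records = [
    ("alice", "bob"), ("bob", "alice"), ("alice", "bob"),
    ("alice", "carol"), ("dave", "alice"),
]

def build_network(records):
    """Return message counts per undirected pair and the contact set of each party."""
    edge_weights = Counter()
    contacts = defaultdict(set)
    for sender, recipient in records:
        edge_weights[frozenset((sender, recipient))] += 1
        contacts[sender].add(recipient)
        contacts[recipient].add(sender)
    return edge_weights, contacts

edge_weights, contacts = build_network(records)

# The party with the most distinct contacts is a candidate "primary communicator".
primary = max(contacts, key=lambda person: len(contacts[person]))
```

The edge weights correspond to frequency of contact and the size of each contact set to how widely a person communicates, which are the two properties a network diagram makes visible at a glance.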

In Figure 3 we see the same communication data displayed as a timeline. At a glance we can comment on how frequently each mobile device was used, how recently it was used and if there were any gaps in usage.

Figure 3: The same data as Figure 1 presented in a timeline.
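
Finding gaps in usage is simple to express once each device's activity has been reduced to a list of dates. The sketch below is one possible approach; the function name, the two-day threshold and the sample dates are assumptions made for illustration.

```python
from datetime import date

def usage_gaps(active_days, min_gap_days=2):
    """Return (last_seen, next_seen) pairs separated by at least min_gap_days."""
    days = sorted(set(active_days))
    return [
        (earlier, later)
        for earlier, later in zip(days, days[1:])
        if (later - earlier).days >= min_gap_days
    ]

# Hypothetical activity dates for one device, for illustration only.
phone_activity = [date(2016, 3, 1), date(2016, 3, 2), date(2016, 3, 9), date(2016, 3, 10)]
gaps = usage_gaps(phone_activity)
```

Here the device shows no activity between 2 March and 9 March, which is the kind of gap a timeline makes obvious without any calculation by the investigator.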

A number of forensic artefacts, such as the EXIF data in digital photographs, contain geolocation coordinates. Other data sources may contain IP addresses which can be resolved to geographical coordinates. These coordinates can be extracted and plotted onto a map.
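
EXIF stores GPS latitude and longitude as degrees, minutes and seconds, with a separate reference letter (N/S or E/W) for the hemisphere. Most mapping tools expect signed decimal degrees, so a conversion step is needed before plotting. The sketch below shows that arithmetic only; the function name and sample values are hypothetical, and reading the tags from an image file is left out.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # Southern latitudes and western longitudes are negative by convention.
    return -value if ref in ("S", "W") else value

# Hypothetical coordinate values, for illustration only.
lat = dms_to_decimal(51, 30, 0, "N")
lon = dms_to_decimal(0, 7, 30, "W")
```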

Figure 4 shows a map of the locations extracted from Skype data on the desktop computer, showing where the user logged onto Skype using specific devices and the locations of the other people contacted. This provides an example of how we might gather information to understand the movements of a suspect. With this information, we can better understand the context of the suspect’s communications.

Figure 4: Geographical coordinates extracted from IP addresses plotted onto a map.

In Figure 5, we have extracted and visualised references to company names, countries and sums of money in the Skype communications extracted from the desktop computer. This very quickly shows the investigator that these communications mention significant sums of money alongside particular companies and countries.

Figure 5: Visualising connections between company names, countries, and sums of money.
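
One simple way to approximate this kind of view is to pull entities out of each message and link those that appear together. The Python below is a rough sketch under that assumption, not a description of how the tool behind the figure works; the regular expression, the country watch list and the sample message are all hypothetical.

```python
import re
from itertools import combinations

# Currency symbol, then digits with optional thousands separators and decimals.
MONEY = re.compile(r"[$£€]\s?\d+(?:,\d{3})*(?:\.\d+)?")
COUNTRIES = ("Cyprus", "Panama")  # hypothetical watch list, for illustration only

def extract_entities(text):
    """Return the sums of money and watch-list countries mentioned in text."""
    entities = MONEY.findall(text)
    entities += [country for country in COUNTRIES if re.search(rf"\b{country}\b", text)]
    return entities

def cooccurrence_links(messages):
    """Link every pair of entities that appear in the same message."""
    links = set()
    for text in messages:
        links.update(combinations(sorted(extract_entities(text)), 2))
    return links

# Hypothetical message, for illustration only.
message = "Wire $25,000 to the Cyprus account, then $1,200.50 next week."
entities = extract_entities(message)
links = cooccurrence_links([message])
```

Each link becomes an edge in the diagram, so a message that mentions several entities of interest shows up as a densely connected node.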

In Figure 6 the investigator has focused solely on these interesting communication files from the desktop computer and exposed the names of any people connected with those files. This provides intelligence about new identities that may be involved in the case.

We can also clearly see that the message exchange highlighted in the centre contains references to all the other items of interest. We have narrowed our search from a massive and incomprehensible list down to a single communication very quickly and efficiently.

Figure 6: Drilling down to the Skype communications extracted from a desktop PC and revealing the names of people mentioned in them.

In Figure 7, the investigator has focused on a file of interest and exposed linked forensic artefacts that speak to the activity history of that file.

Figure 7: Showing links between a file of interest and forensic artefacts such as link files and Windows Registry keys.

You will note that throughout this approach we did not employ a single keyword search. This avoided the need for guesswork around which keywords may or may not be relevant to our case. We have also avoided a long-winded linear text review of masses of responsive data. Visualisations have evolved our investigation technique beyond sifting through masses of information. Instead, they allow the most relevant information to bubble to the surface as we dynamically alter our visual point of reference.

Moving beyond traditional keyword searching

“The greatest value of a picture is when it forces us to notice what we never expected to see.”[2]

Traditional keyword searching has underpinned digital forensics since its inception. Given manageable data quantities and sufficient available time, keyword search analysis provides a great opportunity to identify digital evidence essential to an investigation.

However, over the past 10 years typical case sizes have grown from a few gigabytes to multiple terabytes. This means investigators have no choice but to embrace new technologies and techniques if they are to deliver results efficiently today and through the next 10 years of technological advance.

Every information management discipline is actively grappling with how best to manage masses of varied electronic data. Digital forensics is under the same pressure. Practitioners must look beyond their traditional techniques in order to best deal with the continued data growth, increasing case backlogs and growing financial and organisational pressures.

Data visualisation is the next logical step forward for digital forensics. As technology changes at an incredible rate, digital forensics must keep pace with the world around it and be open to exploring new techniques, methods and tools, while adhering to its strong scientific traditions.

[1] Ben Shneiderman, “Research Agenda: Visual Overviews for Exploratory Search”, National Science Foundation Workshop on Information Seeking Support Systems, June 26-27, 2008.
[2] John Tukey, Exploratory Data Analysis, Addison-Wesley, 1977.

About the author

James Billingsley
Principal Solutions Consultant, Cybersecurity & Investigations, Nuix

James has more than a decade of experience in computer forensics. Before joining Nuix, he worked in 7Safe’s Security Investigation & Assessment team as a senior breach investigation consultant and a senior eDiscovery consultant. He previously worked as a senior computer forensics investigator at CCL-Forensics. James has contributed to web browser forensics software tools which law enforcement agencies and international corporations around the world use.

Nuix is running a global survey of digital forensic investigators to find out what investigative methodologies you rely on – if any – beyond keyword searching. Your experience is important to us and all participants will enter a prize draw for a £300/€400 gift card. You will also receive a copy of the report once completed. Complete our short survey here.

Discussion

2 thoughts on “Beyond Keywords: Is Keyword Search Becoming Obsolete In The New Age Of Forensic Digital Investigation?”

  1. Very impressive detailing of the use of visualization to enhance and refine keyword search strategies, thereby streamlining and improving the process.

    Posted by Richard Stewart | March 18, 2016, 7:28 pm
  2. Nice representation. Can i get more details or scenarios where visualization is needed in digital investigation?

    Posted by Humaira | October 20, 2016, 6:40 am
