Metadata: The Hidden Treasure

June 12, 2012 by Sudhanshu Chauhan

In today's information age, data is crucial to every organization. From an information security point of view, data is what everybody is after, be it the hacker or the pentester. Data loss can hurt an organization badly, both financially and in terms of reputation. Organizations are generally aware of the information they reveal through different online media, but what about data that is exposed without the organization's knowledge and that could be crucial from a security perspective? In this article we are going to learn about the information hidden in documents and files in the public domain, why it can be sensitive from a security perspective, and how to deal with it.

First, here are some basic terms that need to be understood before going any further.

Metadata: Data can be described as raw values that need to be processed in order to generate information and derive knowledge. Metadata is commonly described as 'data about data'; however, this definition is incomplete and does not cover all the properties of metadata. A better definition, as given by Wikipedia (http://en.wikipedia.org/wiki/Metadata), is as follows.

Metadata (metacontent) is defined as data providing information about one or more aspects of the data, such as:

  • Means of creation of the data

  • Purpose of the data

  • Time and date of creation

  • Creator or author of data

  • Location on a computer network where the data was created

  • Standards used

Metadata has been used for various purposes, from cataloging archives and data virtualization to SEO (Search Engine Optimization) for web sites. All of this metadata is added intentionally by the owner for better and easier management of information. This article, in contrast, is about the metadata that users publish without being aware of it (most of the time).

Information gathering: This is the first and an essential phase of any security assessment project. The focus is on collecting as much information as possible about the target. The success of any pentest relies heavily on this phase, as it is the information collected here that is leveraged in later stages for the purpose of intrusion. Information can be gathered using various methods, such as OSINT (Open Source Intelligence) tools, e.g. search engines, scanners and fingerprinting tools (active and passive).

OSINT (Open Source Intelligence): Open source intelligence involves finding, selecting and acquiring information from publicly available sources. This information can be analyzed to produce insight on which critical decisions can be based. Open source intelligence can be collected from a variety of sources such as newspapers, web based content and public documents. From a cyber security point of view, web based content is the main source. The advantage of open source intelligence is that it is present in the public domain and hence easy to access. It is a crucial part of the information gathering phase of security testing.

The first tool of the trade in the list is Metagoofil.

Metagoofil: Metagoofil is a Linux based tool, developed in Python, which extracts metadata from public documents available on the target website(s). Metagoofil supports document types such as pdf, doc, xls, ppt, odp, ods, docx, xlsx and pptx. The tool uses Python libraries such as GoogleSearch, Hachoir and PdfMiner to locate the files and extract metadata. The output is a report in HTML format, which can be easily viewed in a browser. BackTrack 5 comes with the application preinstalled. The latest version of the tool is 2.1 and can be downloaded from http://code.google.com/p/metagoofil/downloads/detail?name=metagoofil-2.1_BH2011_Arsenal.tar.gz&can=2&q=

Steps of operation:

  • Perform Advanced Google Search (Google Dorks) to find the documents on the target website.

  • Download the number of files as specified from the search output to local disk.

  • Extract Metadata using different libraries.

  • Save and display the output (extracted metadata) in HTML format on the web browser.
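
The first step above can be sketched in a few lines of Python. This is only an illustration of how a search query can be restricted to one site and one file type using the common `site:` and `filetype:` search operators; the function names `build_dork` and `build_queries` are hypothetical and are not taken from Metagoofil's source code.

```python
def build_dork(domain, filetype):
    """Return a search query limited to one domain and one file type."""
    return f"site:{domain} filetype:{filetype}"


def build_queries(domain, filetypes):
    """Build one query per file type of interest."""
    return [build_dork(domain, filetype) for filetype in filetypes]


if __name__ == "__main__":
    # example.com is a placeholder target, not a real assessment scope.
    for query in build_queries("example.com", ["pdf", "doc", "xls"]):
        print(query)
```

Each resulting string can be pasted into a search engine to list the matching documents hosted on the domain.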

The result contains user names, software versions, e-mail addresses, servers and paths found during the operation. The latest version also extracts MAC (Media Access Control) addresses from Microsoft Office documents. With this open source information one can prepare a better pentest plan. The extracted information can be used to perform a brute force attack on various services or to execute a social engineering attack. Figure 1 displays the Metagoofil interface along with the available options.

Figure 1. Metagoofil Interface

Metagoofil options as listed in the application:

        -d: domain to search

        -t: filetype to download (pdf,doc,xls,ppt,odp,ods,docx,xlsx,pptx)

        -l: limit of results to search (default 200)

        -h: work with documents in directory (use "yes" for local analysis)

        -n: limit of files to download

        -o: working directory

        -f: output file
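
To show how the options listed above fit together, here is a minimal sketch that assembles a command line from them. The flags come from the list above; the script name `metagoofil.py`, the helper name `metagoofil_args`, and the example values are assumptions made for illustration.

```python
import shlex


def metagoofil_args(domain, filetypes, search_limit, download_limit, workdir, report):
    """Assemble an argument list from the options listed above."""
    return [
        "python", "metagoofil.py",      # script name is an assumption
        "-d", domain,                   # domain to search
        "-t", ",".join(filetypes),      # file types to download
        "-l", str(search_limit),        # limit of results to search
        "-n", str(download_limit),      # limit of files to download
        "-o", workdir,                  # working directory
        "-f", report,                   # output file
    ]


if __name__ == "__main__":
    args = metagoofil_args("example.com", ["pdf", "doc"], 200, 50,
                           "downloads", "results.html")
    # Print the command instead of running it, so nothing is executed here.
    print(shlex.join(args))
```

The sketch only prints the command; passing the list to `subprocess.run` would execute it on a machine where the tool is installed.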

Figure 2 shows the tool in action. The tool searches for the files using search engine libraries and then downloads the specified number of files for metadata extraction.

Figure 2. Metagoofil in action

Figures 3, 4 and 5 display the result of the tool. The result consists of lists of user names, software versions, e-mails, servers, paths and files analyzed. The HTML result file also shows the output as a bar graph.

Figure 3. Metagoofil Result 1

Figure 4. Metagoofil Result 2

Figure 5. Metagoofil Result 3
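
To see where fields such as user names and software versions come from, consider a .docx file. It is a ZIP archive, and the standard parts `docProps/core.xml` and `docProps/app.xml` hold the author and application properties. The sketch below reads them with the Python standard library only. It is not how Metagoofil itself does the extraction (the tool relies on the libraries mentioned earlier); the helper names and the sample values are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties",
}


def read_docx_metadata(docx_file):
    """Return author and software fields from a .docx path or file object."""
    with zipfile.ZipFile(docx_file) as archive:
        core = ET.fromstring(archive.read("docProps/core.xml"))
        app = ET.fromstring(archive.read("docProps/app.xml"))
    return {
        "creator": core.findtext("dc:creator", "", NS),
        "last_modified_by": core.findtext("cp:lastModifiedBy", "", NS),
        "application": app.findtext("ep:Application", "", NS),
        "app_version": app.findtext("ep:AppVersion", "", NS),
    }


def build_sample_docx():
    """Build a tiny in-memory archive with made-up values for the demo."""
    core = (
        f'<cp:coreProperties xmlns:cp="{NS["cp"]}" xmlns:dc="{NS["dc"]}">'
        "<dc:creator>jdoe</dc:creator>"
        "<cp:lastModifiedBy>asmith</cp:lastModifiedBy>"
        "</cp:coreProperties>"
    )
    app = (
        f'<Properties xmlns="{NS["ep"]}">'
        "<Application>Microsoft Office Word</Application>"
        "<AppVersion>12.0000</AppVersion>"
        "</Properties>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("docProps/core.xml", core)
        archive.writestr("docProps/app.xml", app)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(read_docx_metadata(build_sample_docx()))
```

Run against real documents downloaded from a site, the same fields reveal internal account names and the exact office suite version in use.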

The second tool in the list is Exif Tool.

Exif Tool: Exif Tool is a software application which can read, write and edit metadata in an extensive variety of files. It is a platform-independent Perl library and is also available as a command-line application. The tool supports many different metadata formats, including EXIF, GPS, IPTC, XMP, JFIF, GeoTIFF, Photoshop IRB, FlashPix and ID3, as well as the manufacturer specific notes of many digital cameras. The list of supported file types is very extensive and can be found at http://www.sno.phy.queensu.ca/~phil/exiftool/. The download link of the tool is http://www.sno.phy.queensu.ca/~phil/exiftool/exiftool-8.92.zip. An online version of the tool is also available at http://regex.info/exif.cgi. Figure 6 shows the Exif Tool interface. Figure 7 shows the output of the tool: the metadata extracted from an image file.

Figure 6. Exif tool interface

Figure 7. Exif tool displaying the extracted metadata
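
When the command-line application is installed, its output can also be consumed from a script. The sketch below assumes `exiftool` is on the PATH and uses its `-json` option, which prints one JSON object per file. The helper names and the choice of tags to keep are illustrative assumptions, and the sample record is made up.

```python
import json
import subprocess

# Tags picked for illustration; real output contains many more.
WANTED_TAGS = ("Make", "Model", "Software", "CreateDate",
               "GPSLatitude", "GPSLongitude")


def run_exiftool(path):
    """Run exiftool on one file and return its JSON output as text."""
    completed = subprocess.run(
        ["exiftool", "-json", path],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout


def interesting_tags(exiftool_json, wanted=WANTED_TAGS):
    """Keep only the tags of interest from each record in the JSON output."""
    records = json.loads(exiftool_json)
    return [{tag: record[tag] for tag in wanted if tag in record}
            for record in records]


if __name__ == "__main__":
    # A made-up record, so the demo runs without exiftool installed.
    sample = '[{"SourceFile": "photo.jpg", "Make": "Apple", "Model": "iPhone 4"}]'
    print(interesting_tags(sample))
```

In practice the two functions are chained: `interesting_tags(run_exiftool("photo.jpg"))`.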

Next in the list is FOCA.

FOCA: FOCA means seal in Spanish. FOCA, or Fingerprinting Organizations with Collected Archives, is a tool that discovers files on a target website and extracts metadata from them. It is a Windows based tool and, unlike the previous tools, provides a GUI for ease of use. It is similar in operation to Metagoofil, discussed previously: it uses search engines to discover files and extracts metadata that can be used in the subsequent steps of pentesting. An online version of the application also exists at http://www.informatica64.com/foca/. The new features in the latest version (i.e. 3.0.2), as described on the official website (http://www.informatica64.com/foca.aspx), are as follows:

  • New Interface

  • Panel multithreaded tasks.

  • Search proxy service.

  • Search for services registered in the DNS

  • Search and analysis of ICA and RDP files.

  • Search for proxies

  • Leakage analysis (based on output errors).

  • Search for domains with anti-spam policy.

  • Search DS_Store files in each folder.

  • URL Search in files "robots.txt"

  • Autosave project, among others.

Figure 8 shows the home screen of the application. The user needs to enter a project name and the domain to be parsed for the discovery and extraction of metadata.

Figure 8. FOCA Home Screen

Figure 9. FOCA interface displaying the list of files on the defined domain

Figure 9 shows the list of the files found on the domain specified by the user. FOCA utilizes different search engines for the purpose of discovering the list of files. After discovering the list, the user needs to download the file(s) so that the metadata can be extracted from them. Figure 10 displays the extracted metadata from the files downloaded.

Figure 10. FOCA displaying the extracted metadata

The last tool in the list is Exif2maps.sh.

Exif2maps.sh: This is a script that pulls GPS location data from images. The iPhone stores GPS Exif data with its images. The tool simply extracts the coordinates and outputs a Google Maps link containing them. The script can be found at http://www.securityaegis.com/stealing-gps-data-from-images-in-pentests/. Figure 11 shows the output of the tool as a link to Google Maps. Figure 12 displays the resulting location on Google Maps.

Figure 11. Result of Exif2maps.sh script

Figure 12. Location displayed on google maps
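
The idea behind such a script can be sketched as follows. Exif stores latitude and longitude as degrees, minutes and seconds plus a hemisphere reference (N/S/E/W), and these convert to signed decimal degrees that can be placed in a map URL. This is not the Exif2maps.sh script itself; the function names, the URL form and the sample coordinates are assumptions for illustration.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus N/S/E/W to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative.
    return -value if ref in ("S", "W") else value


def maps_link(latitude, longitude):
    """Build a map link from decimal coordinates (URL form is an assumption)."""
    return f"https://maps.google.com/?q={latitude:.6f},{longitude:.6f}"


if __name__ == "__main__":
    # Made-up coordinates, as they might appear in an image's GPS tags.
    lat = dms_to_decimal(40, 26, 46, "N")
    lon = dms_to_decimal(79, 58, 56, "W")
    print(maps_link(lat, lon))
```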


As we have seen, a great deal of critical information is revealed through uploaded documents and files without us realizing it. One solution to this problem is DLP (Data Loss Prevention) tools. Some of these tools are as follows:

MetaShield Protector: MetaShield Protector is a solution which helps to prevent data loss through office documents published on the web site. It is installed and integrated at Web Server level of the web site. On a request for any document, it cleans it on the fly and then delivers it. MetaShield Protector can be found at http://www.metashieldprotector.com/.

MAT: MAT, or Metadata Anonymisation Toolkit, is a solution for metadata removal. It is developed in Python and uses the Hachoir library. The formats supported by the toolkit, as listed on the official website (https://mat.boum.org/), are:

  • Portable Network Graphics (.png)
  • JPEG (.jpg, .jpeg, ...)
  • Open Documents (.odt, .odx, .ods, ...)
  • Office OpenXml (.docx, .pptx, .xlsx, ...)
  • Portable Document Fileformat (.pdf)
  • Tape ARchives (.tar, .tar.bz2, .tar.gz, ...)
  • Zip (.zip)
  • MPEG AUdio (.mp3, .mp2, .mp1, ...)
  • Ogg Vorbis (.ogg, ...)
  • Free Lossless Audio Codec (.flac)
  • Torrent (.torrent)
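
As a rough illustration of what metadata removal involves for one of these formats, the sketch below copies a .docx archive and blanks the author fields in `docProps/core.xml`. It is not MAT's implementation and covers far less than a real toolkit would (for example, it leaves `docProps/app.xml`, comments and embedded images untouched). The function names are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

NAMESPACES = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}
# Keep the usual prefixes when the XML is written back out.
for prefix, uri in NAMESPACES.items():
    ET.register_namespace(prefix, uri)

PERSONAL_FIELDS = ("dc:creator", "cp:lastModifiedBy")


def scrub_core_xml(xml_bytes):
    """Return core.xml with the author fields emptied."""
    root = ET.fromstring(xml_bytes)
    for path in PERSONAL_FIELDS:
        for element in root.findall(path, NAMESPACES):
            element.text = None
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def scrub_docx(source, target):
    """Copy a .docx (path or file object) to target with author fields blanked."""
    with zipfile.ZipFile(source) as src, \
            zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "docProps/core.xml":
                data = scrub_core_xml(data)
            dst.writestr(item, data)
```

A call such as `scrub_docx("report.docx", "report_clean.docx")` (file names are placeholders) would write the sanitized copy next to the original, which is left unchanged.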

MyDLP: A free data leakage prevention solution with multi-site configuration. It provides a comprehensive open source DLP solution. MyDLP is available under GPL license. The community and the enterprise version of the solution are hosted at http://www.mydlp.com/products.

OpenDLP: A complete DLP suite with centralized web frontend for the purpose of management. OpenDLP is hosted at https://code.google.com/p/opendlp/.

Doc Scrubber: A freeware tool that scrubs hidden data from Word documents (.doc). Doc Scrubber can be downloaded from http://www.javacoolsoftware.com/dsdownload.html.

Exif Tool: A software application which can read, write and edit metadata in an extensive variety of files.

Removing Geo-tags: Picasa, the image organizing and editing application by Google can help to remove geo-tags from images. The link to the help and support page is



Data loss prevention is a serious matter for organizations today. If data gets into the hands of a malicious intruder, the damage could cost more than the data itself. We saw how we reveal sensitive information through the documents and files we upload, without even realizing it. This information can be exploited by an attacker or pentester for the purpose of intrusion, and it can be the difference between a successful and a failed penetration. It has been demonstrated that metadata extraction can be accomplished easily with the help of the tools mentioned earlier (all of which are free). This puts a lot of power in the hands of a skilled pentester or attacker, who can use it to launch a well thought out attack. An IT administrator can use the same tools to detect the organization's metadata leakage and keep it in check.

None of this information needs to be on the open web, but most organizations don't realize that it exists and hence remain unaware of it. Companies implement many policies to prevent data leakage, such as blocking social networking websites and third party e-mail services, but few notice this medium, which leaks information without their knowledge. Policies and procedures need to be developed for sanitizing documents before they are hosted online. Strong policies and the mitigation methods mentioned above, if employed properly, can help to prevent such data loss and help the organization to implement defense in depth.

Best Practices for Data Loss Prevention:

  • Identify and prioritize risk areas

  • Ensure complete coverage

  • Protect all the data (not just the sensitive one)

  • Plan appropriate incident response

  • Awareness and Training


Sudhanshu Chauhan is a researcher at InfoSec Institute. He is a B.Tech (CSE) graduate from Amity University. His areas of interest include (but are not limited to) Web Application Security and Bypassing Security Measures (IDS/IPS, AV etc.).