Open-Source Intelligence Collection in Cloud Platforms

August 21, 2018 by Frank Siemons

Threat Intelligence

One of the most popular specialized fields within the security domain is threat intelligence. In recent years, organizations have focused more and more on proactive, preventative security. Within that space, threat intelligence analysis is one of the most successful tools available.

Information is collected on observed malicious infrastructure, such as IPs and domains, and on malware, via hashes and other indicators of compromise (IOCs). This information can be collected through commercial, paid subscription services or through free data feeds.

These feeds allow for preventative and often automated blocks; they assist in operations such as threat hunting, provide context for ongoing attacks and can even lead to successful attacker attribution. Despite the cost of operating a threat intelligence team as part of an organization's broader security posture, this is a very valuable service.
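To make the idea of automated matching concrete, here is a minimal Python sketch of checking an observed event against an indicator feed. The feed layout (a CSV with `type` and `value` columns), the sample indicators and the function names are all assumptions made for illustration; real feeds each have their own formats.

```python
import csv
import io

# Hypothetical feed format: one indicator per row, "type,value".
# The addresses and domain below are documentation/example values.
SAMPLE_FEED = """type,value
ip,203.0.113.45
domain,malicious.example
sha256,9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
"""

def load_indicators(feed_text):
    """Parse the feed into a dict mapping indicator type to a set of values."""
    indicators = {}
    for row in csv.DictReader(io.StringIO(feed_text)):
        kind = row["type"].strip().lower()
        value = row["value"].strip().lower()
        indicators.setdefault(kind, set()).add(value)
    return indicators

def match_event(event, indicators):
    """Return the (type, value) pairs of an event that appear in the feed."""
    hits = []
    for kind, value in event.items():
        if value.lower() in indicators.get(kind, set()):
            hits.append((kind, value))
    return hits

indicators = load_indicators(SAMPLE_FEED)
event = {"ip": "203.0.113.45", "domain": "intranet.example"}
hits = match_event(event, indicators)
print(hits)
```

In practice the resulting hits would feed a block list or an alert queue; that part depends entirely on the tooling already in place.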

Open-Source Intelligence

The next step up in the threat intelligence area is to gather intelligence from public sources on the Internet that could indicate something suspicious is going on, without (yet) having access to specific indicators of recent or ongoing attacks. Think, for instance, of chatter about a leaked router configuration on a dark web cracker forum, or a dumped database on Pastebin. Another example could be an employee threatening on their Twitter page to hack their employer's systems before resigning.

Any potentially targeted company should be aware of these threats before they materialize into an actual attack. After all, it is much better to prevent a breach than to detect it and subsequently clean up the damage via consultants and lawyers.

The big challenge is how to collect the relevant data from such a wide range of sources. A potential attacker will not post a clear message on their Twitter account announcing that they will "attack web server X tonight at 7:00 PM." Correlation is needed. An attacker might mention "web server X" tomorrow, might have mentioned the targeted organization last month on a different forum and might have posted a router configuration on Pastebin a week before that (indicating that they are active in the offensive security space).

For this reason, both the amount of data that needs to be extracted and monitored and its retention window will be very significant. Another important requirement is to automate as much as possible. Manually browsing the web for this content is impractical, let alone manually correlating it across different platforms and over long time windows.
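The correlation described above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the posts are invented, the same handle is assumed to identify the same person on every platform (which is often not true), and the 60-day window and two-keyword threshold are arbitrary choices.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical, already-collected posts; handles, sources and text are invented.
posts = [
    {"author": "h4x0r", "source": "pastebin", "when": datetime(2018, 7, 1),
     "text": "router config dump"},
    {"author": "h4x0r", "source": "forum", "when": datetime(2018, 7, 20),
     "text": "anyone know Example Corp?"},
    {"author": "h4x0r", "source": "twitter", "when": datetime(2018, 8, 9),
     "text": "web server X looks interesting"},
    {"author": "other", "source": "twitter", "when": datetime(2018, 8, 9),
     "text": "web server X is down again"},
]

KEYWORDS = ("router config", "example corp", "web server x")

def correlate(posts, keywords, window=timedelta(days=60), min_keywords=2):
    """Flag authors who hit several distinct keywords within the time window."""
    hits = defaultdict(list)
    for post in posts:
        text = post["text"].lower()
        for kw in keywords:
            if kw in text:
                hits[post["author"]].append((post["when"], kw, post["source"]))
    flagged = {}
    for author, items in hits.items():
        items.sort()
        newest = items[-1][0]
        # Keep only the hits that fall inside the window ending at the newest hit.
        recent = [item for item in items if newest - item[0] <= window]
        if len({kw for _, kw, _ in recent}) >= min_keywords:
            flagged[author] = recent
    return flagged

flagged = correlate(posts, KEYWORDS)
for author, items in flagged.items():
    print(author, [(kw, source) for _, kw, source in items])
```

A single mention of "web server X" is not flagged on its own; only the combination of several keywords from one author inside the window is. This is why a long retention window matters: shrinking the window to ten days would hide the pattern entirely.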

Options and Automation

A level of automation is therefore essential to successful OSINT collection and analysis. Many specialized OSINT providers collect data from many different sources, both in response to customer-specific queries and using preconfigured broad terms of the vendor's choice. Recorded Future is currently the best-known paid service in this space, but there are alternatives such as Digital Shadows Searchlight and Norse DarkWatch. These commercial offerings can be expensive, but they can add visibility by providing information that is only available in the underground forums in which they have gained a foothold (something nearly impossible for a smaller organization). And because most of these platforms are hosted within the provider's cloud, they require little to no infrastructure maintenance, while providing high availability and API management.

Another option is to use the many customizable open-source tools that are available on the Internet or to develop custom scripts from the ground up. A good example of what is freely available is Tweepy, a Python library to interact with the Twitter API.

Of course, access to an API needs to be granted first. At the time of writing, this is free for Twitter, but other platforms such as Pastebin require a once-off fee or an ongoing paid subscription.
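As a rough illustration of API-based collection, the sketch below wraps Tweepy's `Client.search_recent_tweets` call. It assumes a recent Tweepy release, a valid bearer token and an API access tier that permits recent search; the helper names, the example keywords and the `TWITTER_BEARER_TOKEN` environment variable are hypothetical.

```python
def build_query(keywords):
    """Join keywords into one OR query, quoting each term and excluding retweets."""
    quoted = ['"{}"'.format(kw) for kw in keywords]
    return "({}) -is:retweet".format(" OR ".join(quoted))

def search_mentions(client, keywords, limit=10):
    """Run a recent-tweet search and return plain dicts for later correlation.

    `client` is expected to behave like a tweepy.Client instance.
    """
    response = client.search_recent_tweets(
        query=build_query(keywords),
        max_results=limit,
        tweet_fields=["created_at", "author_id"],
    )
    # response.data is None when nothing matched.
    return [
        {"id": t.id, "author_id": t.author_id, "when": t.created_at, "text": t.text}
        for t in (response.data or [])
    ]

def main():
    """Example entry point; needs network access and real credentials."""
    import os
    import tweepy  # third-party: pip install tweepy

    # TWITTER_BEARER_TOKEN is a hypothetical environment variable name.
    client = tweepy.Client(bearer_token=os.environ["TWITTER_BEARER_TOKEN"])
    for mention in search_mentions(client, ["Example Corp", "example-vpn"]):
        print(mention["when"], mention["text"])
```

Calling `main()` performs the actual search. Keeping the query building and result flattening separate from the client makes it possible to reuse the same record shape for other sources.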

An alternative to using API interaction is the use of a so-called web scraper, which can download information from a site such as Pastebin in an automated, scripted fashion. There can be legal issues around the use of these, however, so some research into obtaining permission is required.
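A scraper can at least be made to respect a site's published crawling rules. The sketch below uses only the Python standard library to test a URL against robots.txt rules and to pull the visible text out of a page. The robots.txt content, the `paste.example` host and the user-agent string are made up, and honoring robots.txt is a courtesy check, not a substitute for reading a site's terms of service or obtaining permission.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

USER_AGENT = "osint-collector"  # hypothetical user-agent string

def is_allowed(robots_txt, url, user_agent=USER_AGENT):
    """Check a URL against robots.txt rules that were already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def page_text(html):
    """Return the visible text of an HTML document as one string."""
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.chunks)

# Invented robots.txt rules and page content for demonstration.
robots = "User-agent: *\nDisallow: /private/\n"
allowed = is_allowed(robots, "https://paste.example/raw/abc123")
blocked = is_allowed(robots, "https://paste.example/private/abc123")
text = page_text("<html><script>x=1</script><body><pre>router config</pre></body></html>")
print(allowed, blocked, text)
```

The download step itself is left out on purpose; it would only run for URLs where `is_allowed` returns `True`.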

Using a Cloud Platform

Because cost is a limiting factor, many organizations will choose to implement some form of scripted OSINT collection, built and maintained in-house. As mentioned, a significant volume of data needs to be collected before any meaningful correlations can be made. That data will need to be stored somewhere.
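For a small in-house setup, storage can start out very simple. The following sketch uses SQLite from the Python standard library to store collected posts, ignore duplicates and purge anything older than a retention window. The table layout, the function names and the 365-day retention period are assumptions for illustration, not recommendations.

```python
import sqlite3
from datetime import datetime, timedelta

def open_store(path=":memory:"):
    """Open (or create) the post store; ':memory:' keeps the demo self-contained."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS posts (
               source TEXT NOT NULL,
               post_id TEXT NOT NULL,
               author TEXT,
               collected_at TEXT NOT NULL,
               body TEXT NOT NULL,
               PRIMARY KEY (source, post_id)
           )"""
    )
    return conn

def save_post(conn, source, post_id, author, collected_at, body):
    """Insert a post; re-collected duplicates are silently ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO posts VALUES (?, ?, ?, ?, ?)",
        (source, post_id, author, collected_at.isoformat(), body),
    )

def purge_older_than(conn, now, retention=timedelta(days=365)):
    """Delete posts outside the retention window; return how many were removed."""
    cutoff = (now - retention).isoformat()
    cursor = conn.execute("DELETE FROM posts WHERE collected_at < ?", (cutoff,))
    return cursor.rowcount

conn = open_store()
now = datetime(2018, 8, 21)
save_post(conn, "pastebin", "abc123", "h4x0r", now - timedelta(days=400), "old dump")
save_post(conn, "pastebin", "def456", "h4x0r", now - timedelta(days=5), "router config")
save_post(conn, "pastebin", "def456", "h4x0r", now, "router config")  # duplicate, ignored
removed = purge_older_than(conn, now)
remaining = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
print(removed, remaining)
```

Timestamps are stored as ISO 8601 strings so that a plain string comparison orders them correctly. A larger deployment would likely move to a managed database or a search index, but the collect, deduplicate and expire cycle stays the same.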

The great thing about running a simple collection of Python scripts, however, is that there is very little system overhead. A standard Ubuntu system with 4 GB of RAM will be enough for most organizations. Combined with the need for 24/7 operation and high availability, this makes such a system an ideal candidate for a public cloud platform.

Another benefit is that using a platform such as Microsoft Azure or Amazon Web Services (AWS) hides the source and intent of the queries from the administrators of forums, social media and other websites. Sharing the keywords in a search query could be an information leak in itself if they are too specific! As long as the search keywords are relatively broad, the area of interest will be hard to link to the specific business that is sending the search queries.

A multi-stage, hybrid OSINT environment could even add to this. For instance, information on 10 randomly selected businesses or products (or even an entire sector) could be downloaded to a staging system located in a public cloud, followed by the extraction of a feed, containing only the organization's actual keywords and brand names, into the organization's own network. In that case, the final step is to run the very specific keyword queries locally before the essential but resource-intensive correlation stage.
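The two stages could look roughly like the Python sketch below. The first function builds the broad, decoy-padded term list that the cloud staging system would query with; the second runs on the local network and applies the sensitive keywords to whatever was staged. All names, terms and posts are hypothetical.

```python
import random

def build_broad_terms(own_sector_terms, decoy_pool, decoys=10, seed=None):
    """Cloud stage: mix the organization's broad terms with random decoys."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(decoy_pool), k=min(decoys, len(decoy_pool)))
    terms = list(own_sector_terms) + chosen
    rng.shuffle(terms)  # avoid revealing which terms are the real ones by position
    return terms

def local_filter(staged_posts, sensitive_keywords):
    """Local stage: keep only posts that mention the real keywords."""
    wanted = [kw.lower() for kw in sensitive_keywords]
    return [
        post for post in staged_posts
        if any(kw in post["text"].lower() for kw in wanted)
    ]

# Broad terms sent from the cloud staging system (invented examples).
terms = build_broad_terms(
    ["industrial routers"],
    {"retail banking", "airline loyalty", "smart meters"},
    decoys=2,
    seed=1,
)

# Posts downloaded to staging, then filtered inside the organization.
staged = [
    {"source": "pastebin", "text": "Dump of Example Corp VPN accounts"},
    {"source": "forum", "text": "airline loyalty points for sale"},
]
relevant = local_filter(staged, ["Example Corp"])
print(terms, relevant)
```

The sensitive keyword list never leaves the local network in this arrangement; only the broad terms are visible to the sites being queried.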


Conclusion

OSINT collection is a very interesting field. On one side it collects technical information, and on the other it collects information on people and events. The real science and power lie in the correlation between the two, which allows for the most dynamic and proactive security posture an organization can obtain. There will be some cost involved, and some effort will be required to build and operate an OSINT monitoring platform.

Once again, the cloud brings some new opportunities, ranging from the availability of commercial products hosted on a cloud platform to the ability to obfuscate queries and gain efficiency via public cloud options.

The first step is to start looking at the requirements. The next step is to find a matching solution, not the other way around.


Frank Siemons

Frank Siemons is an Australian security researcher at InfoSec Institute. His track record consists of many years of systems and security administration, both in Europe and in Australia.

He currently holds many certifications, such as CISSP, and has a master's degree in InfoSys Security from Charles Sturt University. He has a true passion for anything related to pentesting and vulnerability assessment. His Twitter handle is @franksiemons.