PyFlag Manual
Flag (Forensic and Log Analysis GUI) is a tool designed to simplify the examination of forensic evidence in the form of hard disk images, logs and network captures. This manual documents some of the basic aspects of FLAG, but is by no means complete. Complete API documentation, produced by epydoc, can be found in the docs directory; it is designed for developers who wish to contribute to PyFlag development.
Basic Concepts
Flag Cases
A central concept in FLAG is the case. A case is simply an area in which to collect related information regarding a particular incident. Internally, a case is kept in its own database, and tables are added to the case as different forms of evidence are added.
To create a new case, click the Case Management tab and add a new case.
Resetting the case deletes all data from the case, which is essentially equivalent to dropping the case database and recreating it.
IO Sources
Another central concept in FLAG is the IO source. An IO source is simply a way of specifying a source of data for FLAG; the concept is an abstraction over data sources. For example, a hard disk image is a source of data, but it may come in a number of different forms, e.g. dd images, Encase evidence files, etc.
Hence FLAG uses an IO source to handle data, and the specifics of how to access that data are abstracted away. IO Sources are currently heavily utilised in the disk forensics module, but may be extended to other modules in the future. Although the following examples concentrate on hard disk images, similar IO subsystems will eventually be used for other aspects of FLAG, such as log files and network captures.
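The idea can be pictured as a small Python sketch (the class and method names here are purely illustrative, and are not PyFlag's actual API):

class IOSource:
    """An abstract source of raw image data. Callers simply seek and
    read, without caring whether the backing store is a dd file, an
    sgzip file, an EWF set, or something else."""
    def seek(self, offset):
        raise NotImplementedError
    def read(self, length):
        raise NotImplementedError

class DDImage(IOSource):
    """The simplest case: a plain, uncompressed dd image."""
    def __init__(self, filename):
        self.fd = open(filename, "rb")
    def seek(self, offset):
        self.fd.seek(offset)
    def read(self, length):
        return self.fd.read(length)

A compressed format only needs to provide the same seek/read interface; the rest of FLAG never has to know the difference.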
The following IO Source types are currently supported. Other IO sources may be added in the future:
- Standard
- Advanced
- SGZIP
- EWF (Eye Witness Format)

SGZIP
With today's very large hard disks it is sometimes difficult to manipulate dd images. Since dd images are uncompressed, analysing a 120GB hard disk (now commodity hardware in most PCs) requires the analysis platform to handle a single 120GB file, which may also need to be archived.
Many people archive very large dd images using a standard compression program such as gzip or bzip2. This helps with archiving, but the compressed file cannot be used directly in the analysis without decompressing it first, mainly because most general purpose compression formats are not designed for seeking randomly through the file.
Most industry standard forensic packages provide a method for manipulating compressed hard disk images directly. FLAG supports a number of different formats at this time, namely sgzip and the eye witness compression format (which is mainly used by Encase(tm) and FTK(tm)).
The sgzip format is based on gzip, but provides a seekable capability. This is achieved by compressing blocks (default size of 32kb) individually. A seek operation then simply needs to locate the right 32kb block and decompress that. The specific details of the file format are found in the file sglib.h.
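In outline, a read at an arbitrary offset works like the following sketch (a simplification only: it assumes zlib-compressed blocks, an in-memory index of block positions, and reads that do not cross a block boundary; the authoritative details are in sglib.h):

import zlib

BLOCK_SIZE = 32 * 1024  # default sgzip block size

def read_at(image, block_index, offset, length):
    """Seekable read from a block-compressed image. block_index maps a
    block number to (file position, compressed size) -- a hypothetical
    structure standing in for the real sgzip index."""
    block = offset // BLOCK_SIZE               # which 32kb block holds offset
    pos, csize = block_index[block]
    image.seek(pos)
    data = zlib.decompress(image.read(csize))  # decompress just that block
    start = offset % BLOCK_SIZE
    return data[start:start + length]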
Sgzip is a robust format, meaning that if the image file is damaged in some way (e.g. partially corrupted or truncated) it is still possible to retrieve most of the data within it (contrast this with, for example, Encase, which cannot recover from a corrupted evidence file). To create an sgzip file, use the supplied sgzip utility:
dd if=/dev/hda | sgzip -v > image.sgz
It is also possible to decompress the sgzip file back into a regular dd image:
sgzip -vd image.sgz
EWF (Eye Witness Format)
Eye Witness Format is a proprietary format mainly used by Encase and FTK. It also compresses data in 32kb chunks to achieve a seekable compressed file, and it must be split across files smaller than 2GB (generally 640MB segments are used).
Although FLAG can also create EWF files, at this stage they are not (yet) readable by Encase. It is perfectly valid to generate EWF files using FLAG for use within FLAG; however, since the EWF format is fragile (i.e. it cannot tolerate corruption), this is not recommended, and it is better to use sgzip for this purpose. The other major disadvantage of the EWF format is that an EWF file cannot be written into a pipe, so it is not possible to image over the network (using netcat, for example). Sgzip is the better format choice here as well.
Most of the time FLAG is used to analyse images taken using Encase, or to repair corrupted Encase images (the FLAG EWF implementation is quite flexible and can be used to repair Encase images, whereas Encase itself will in most cases refuse to import them). See evtool for examples of how EWF images can be manipulated. To use EWF images in FLAG, simply select all the segment files (with extensions .E01, .E02, etc.) in the IO Sources form.
Table Viewer Widget and the Navigation Bar
The most powerful widget in FLAG is the table viewer. This widget allows for extremely sophisticated searching of the dataset and is so important that an entire section is dedicated to it here. The figure below shows a typical usage of the table widget, although it is used in many places within PyFlag.

The following components can be seen:
- Current Case: This shows the currently selected case name
- Next and Previous Page: Often the rows of data cannot fit on a single screen. In this case the next-page arrow will be unshaded, indicating that another page is available to view.
- Main Menu: The menu button takes you directly to the main menu for the currently selected module. From there you may select another module to look at or simply another report.
- Column Name: Each column in the table has a name. Clicking on the column name orders the results by that column. At any one time a single column will have a pink background, indicating that the results are ordered by this column. Clicking the column again flips the sense of the ordering (from ascending to descending and back again).
Note the colouring of the rows alternating between gray and white: the colour alternates for each unique value of the ordering column. The result is that by ordering the result set on one of the columns, it is quick to see which rows contain identical values in that column, because they appear as groups of the same colour.
- Group by Column: The Group by column allows the user to count how many entries in the result set occur with each value of the column. For example, in the figure above, by grouping by Source IP (and ordering by counts), it is possible to see which source IP produced the greatest number of hits.
Once in the group-by screen, clicking on an individual source IP address allows the user to view all the hits produced by that IP address.
- Search Column: It is often handy to eliminate a subset of the result set from the table and concentrate on those results which match specific criteria. For example, we may want to see those hits produced within a certain date range, or those IP addresses requesting certain file extensions, or both conditions at once.
The search capability within the table widget allows complex search criteria to be expressed. Conditions are added cumulatively, with a logical "and" separating them; so, for example, you can add the conditions date > 2001-10-01 and date < 2002-10-01. The query is entered into the text area, and the sense of the query is determined by the first character of the query string. For example, typing "<2001-10-01" in the date column will show all dates prior to October 1st 2001. The following modifiers are supported (examples follow the list below):
- > The values are greater than the specified value
- < The values are less than the specified value
- = The values are exactly equal to the specified value
- ! The values are not equal to the specified value
- Search term containing a % The values are matched against the term, with % acting as the wildcard
- Plain search term The values are matched against the term with wildcards automatically added before and after it (so it may match in the middle of an entry)
Note that as new searching conditions are added, they will be listed at the top of the table. You may click on any of those links to remove that search term, while still preserving the others.
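For example, entering the following two conditions, one at a time, into the date column restricts the table to a date range:

>2001-10-01
<2002-10-01

while entering a term such as "%.exe" into a request column matches all values ending in .exe.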
Modules
PyFlag has an extensible, open architecture which allows developers to add arbitrary modules to the program core easily. The modules all reside within the plugins directory. PyFlag will automatically import all modules within that directory and make these available to the user via the menu.
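The mechanism is essentially as follows (a minimal sketch of the idea, not PyFlag's actual loader code):

import os, imp

def load_plugins(directory):
    """Import every Python module found in the plugins directory and
    return the loaded module objects."""
    modules = []
    for name in os.listdir(directory):
        if name.endswith(".py"):
            path = os.path.join(directory, name)
            modules.append(imp.load_source(name[:-3], path))
    return modules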
Following is a discussion of each module and the functionality available through it.
Disk Forensics
The FLAG Disk Forensics Module provides the following capabilities:
- Browsing a disk image and searching, viewing and downloading files within it.
- Calculating a timeline of modified, accessed and created timestamps for all files within an image.
- Calculating file types (file magic) and hashes (MD5 sums), and comparing these against the NIST NSRL hash set (if present); a conceptual sketch follows below.
- Browsing/searching Windows registry hives.
These map to the reports which appear in the "Disk Forensics" tab in flag.
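As an illustration, the MD5 hash comparison conceptually amounts to the following (a sketch only; the function names are invented and FLAG's real implementation differs):

import md5  # hashlib.md5 in later Python versions

def md5sum(filename):
    """Compute the MD5 digest of a file, reading it in chunks."""
    m = md5.new()
    f = open(filename, "rb")
    while 1:
        block = f.read(65536)
        if not block:
            break
        m.update(block)
    f.close()
    return m.hexdigest()

def is_known(filename, nsrl_hashes):
    """True if the file's hash appears in the NSRL hash set."""
    return md5sum(filename) in nsrl_hashes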
Before using these reports, the filesystem image must be loaded into FLAG. This is a three-stage process:
- Create a new case, as described above
- Create an IO Source using the "Load IO Data Source" report in the "LoadData" tab. This is done by selecting an appropriate IO subsystem, as described above, and filling out the form appropriately.
- Load the filesystem using the "Load Filesystem Image" report in the "LoadData" tab. The form simply asks for the case to load the data into and the IO Source in which to find the filesystem. This invokes the Sleuthkit software, which extracts filesystem metadata and loads it into the case database. This step can take a while (10 minutes or so) for very large filesystems.
Once the image is loaded, the reports in the DiskForensics tab can be run. Note that the "MD5 Hash Comparison" can take a long time. Currently, before using the "Browse Registry" report, you must extract the registry hives from the image; this can be done by downloading them using the "Browse Filesystem" report.
Unstructured Disk Forensics
Sometimes it is impossible to recover files directly off a hard disk image. This may be because the disk is corrupted, or because the files have been deleted on a filesystem that does not support file undeletion (for example NTFS). In these cases it may be possible to recover some files by looking at the raw disk as a big chunk of binary data, without structure or filesystem, hence the term unstructured forensics.
Most filesystems try to keep files unfragmented as much as possible. This is usually a performance consideration, but on balance it means files reside in sequentially allocated blocks. This property can be exploited to forensically recover files: since most files have a definite file header (sometimes called file magic), it is possible to search the raw disk for this magic and extract the data that follows.
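In essence the technique boils down to a scan like this (a simplified sketch: the magic table is abbreviated, a real tool reads the image in chunks rather than all at once, and FLAG's implementation is considerably more sophisticated):

MAGICS = {
    "\xff\xd8\xff": "jpg",      # JPEG header
    "GIF8": "gif",              # GIF header
    "\xd0\xcf\x11\xe0": "doc",  # Microsoft OLE2 (all Office documents)
}

def scan_image(filename):
    """Scan a raw image for known file headers, yielding (offset,
    extension) pairs -- the same information FLAG uses to name the
    extracted files."""
    data = open(filename, "rb").read()  # a real tool would read in chunks
    for magic, ext in MAGICS.items():
        start = 0
        while 1:
            offset = data.find(magic, start)
            if offset < 0:
                break
            yield offset, ext
            start = offset + 1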
Looking for possible files this way is the purpose of the "Extract Files" report. The technique is not perfect, since sometimes files have been partially overwritten, or fragmentation corrupts them. Often, though, it is enough to establish that contraband files were present (e.g. illegal pornography), or to retrieve document fragments (it is often possible to read the text of office documents despite them being corrupted). The image below illustrates the Unstructured Forensics report.

As can be seen, thumbnails are generated on the fly for each suspected file type. The filename given to each extracted file consists of the offset within the image and an extension based on the file type. (Note that Microsoft Office documents all receive the .doc extension, because all Microsoft Office documents share the same magic.)
By clicking on each image, it is possible to download the file, view a hexdump of the file or see the strings within it.
Log Analysis
FLAG provides simple yet powerful log analysis capabilities based on the FLAG table view. FLAG allows you to load arbitrary plain text log files by first describing the file format. The loading process is as follows:
- Create a new log file type using the "Create Log Preset" report in the "Log Analysis" tab. A form will be displayed which involves the following steps:
- Select a sample log file. Once selected, click "submit" and the form will be redrawn with a preview of the log file.
- Select a field delimiter. Usually a simple delimiter (space, comma) will do, but you can also use a regular expression, in which case the submatches (i.e. the parts between '()' brackets) become the fields (see the sketch after this list).
- Once the delimiter is selected, press "submit" again to update the form, and a preview of the split lines will be displayed. You can now select any prefilter to apply to the text before splitting. Prefilters can be used to perform simple processing on the text, such as changing date formats.
- Assign names and types to fields. Here you must give each field a name; if the name is "ignore", that field will be discarded when the data is loaded. Choose the most appropriate type for each field, i.e. numbers should be "int" and times should be "timestamp". If types are selected appropriately, searching will be more powerful (e.g. you can search for date ">2003-01-01", meaning dates after 01/01/2003; this would not be possible if the date were a varchar). Here you can also choose which fields to index; indexes can significantly speed up searches, but will increase the size of the database.
- Once satisfied that the table preview looks ok, and all fields and types have been assigned, select the checkbox to see a final preview. This will load the data into a temporary database table and display the results. This allows you to see if the types selected are appropriate for the data.
- Give the preset a name, and store it in the flag database. Note that the new preset log type will now be available to all flag cases.
- Load the log file using the "Load Preset Log File" report in the "LoadData" tab. This simply requires choosing a name for the new table, selecting the log file type (as created above) and choosing the log file to load.
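To illustrate the regular expression delimiter described above, the following sketch shows how the submatches become fields (the log line, pattern and field names are invented for the example):

import re

# A hypothetical web server log line.
line = '10.0.0.1 - - [01/Oct/2003:10:00:00 +1000] "GET /index.html HTTP/1.0" 200 1234'

# Each '()' submatch becomes one field in the preset.
delimiter = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "([^"]+)" (\d+) (\d+)')

ip, date, request, status, size = delimiter.match(line).groups()
# In the preset these might be named and typed as: ip (varchar),
# date (timestamp), request (varchar), status (int), size (int).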
Once the log file data is loaded, it can be analysed using the "List Log File Contents" report in the "LogAnalysis" tab. This report simply shows a table of the loaded log. The table can be searched by multiple criteria at a time, and sorted or grouped by any column.
Network Packet Analysis
The network capture analysis is based on dissecting the packets using Ethereal and loading the results into the database for analysis. Network analysis can be performed in two modes, corresponding to the FLAG tabs:
- TCPDump Analysis: this involves loading all packet data into the database for close inspection. This level of analysis allows the investigator to do things such as:
- View statistics such as which protocols are used and their relative quantities.
- Reassemble and view TCP sessions, including replaying HTTP sessions directly to the browser
- View full packet breakdowns, similar to Ethereal
- Protocol specific analysis for several protocols including HTTP and DNS.
To load a full packet capture, use the "Load Pcap File" report in the "Load Data" tab. You can then use the reports in the "TCPDumpAnalysis" tab to analyse the data.
- Knowledge Base Analysis: rather than loading all packet data into the database, the knowledge base mode analyses the traffic and makes assertions based on what it sees, e.g. "ip 1.1.1.1 is talking to 2.2.2.2" (because we saw an IP packet with those src and dst fields) or "ip 1.1.1.1 is listening on TCP port 80" (because we saw a TCP SYN/ACK from that ip/port). This mode is faster and creates a much smaller database. The knowledge base objects and relationships can be queried and graphed in various ways using the reports in the "KnowledgeBase" tab in FLAG. To load the packet capture into FLAG, use the "Build Knowledge Base" report in the "Load Data" tab.
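The flavour of these deductions can be seen in a toy sketch (purely illustrative: the packet representation is invented, and PyFlag's knowledge base is far more general):

def deduce(packets):
    """Turn raw packet observations into knowledge base style
    assertions about hosts. Each packet here is a simple dict;
    real traffic would come from a pcap dissector."""
    facts = set()
    for pkt in packets:
        facts.add("ip %s is talking to %s" % (pkt["src"], pkt["dst"]))
        # A SYN/ACK sent from a host implies a listening TCP port.
        if pkt.get("flags") == "SYN/ACK":
            facts.add("ip %s is listening on TCP port %d"
                      % (pkt["src"], pkt["sport"]))
    return facts

packets = [
    {"src": "1.1.1.1", "dst": "2.2.2.2"},
    {"src": "1.1.1.1", "dst": "2.2.2.2", "flags": "SYN/ACK", "sport": 80},
]
# deduce(packets) yields: "ip 1.1.1.1 is talking to 2.2.2.2" and
#                         "ip 1.1.1.1 is listening on TCP port 80"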
Last modified: Tue Mar 16 21:43:28 EST 2004