Creating fake images and videos has never been easier. Plenty of publicly available tools can be used to manipulate content, ranging from traditional editors (e.g. Photoshop, GIMP) to more sophisticated neural-network-based approaches (e.g. DeepFakes, NeuralTextures). The challenge comprises original images captured with different digital cameras, showing various indoor and outdoor scenes. The images are divided into those we call “pristine” or “never manipulated” and those we call “forged” or “fakes”. The “forged” images were produced with a set of different manipulation techniques, such as copy/pasting and splicing, with different degrees of photorealism, as we describe below.
All the pristine and fake images are divided into a training set and a testing set. In the training set, each image is provided along with its corresponding class and mask. In the testing set, the images are provided without any class or mask.
For a human, it is almost impossible to reliably determine whether digital content is pristine or manipulated. The most common methods of image tampering are copy-move, image splicing, and image inpainting. Copy-move is an attack in which a small part of the image is copied and pasted onto another region of the same image. Image inpainting removes some details from the image and fills the removed regions using the colors and properties of their neighborhood, so that a human cannot recognize that the image was attacked. Finally, image splicing selects a region from one image and pastes it into another suitable image.
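To make the copy-move and splicing definitions concrete, here is a minimal NumPy sketch on random arrays. The helper name `paste_patch`, the array sizes, and the patch coordinates are illustrative assumptions and are not taken from the challenge data; inpainting is not covered. The boolean mask it returns plays the role of a ground-truth mask marking the tampered pixels.

```python
import numpy as np

def paste_patch(target, source, src, dst, size):
    """Paste a size x size patch of `source` at `src` into `target` at `dst`.

    Returns the forged image and a boolean mask marking the tampered pixels.
    (Hypothetical helper, for illustration only.)
    """
    (sr, sc), (dr, dc) = src, dst
    forged = target.copy()
    forged[dr:dr + size, dc:dc + size] = source[sr:sr + size, sc:sc + size]
    mask = np.zeros(target.shape[:2], dtype=bool)
    mask[dr:dr + size, dc:dc + size] = True
    return forged, mask

# Random stand-ins for two photographs (assumed 32 x 32 RGB).
rng = np.random.default_rng(0)
host = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
alien = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Copy-move: the patch is taken from the same image it is pasted into.
copy_move_img, copy_move_mask = paste_patch(host, host, src=(0, 0), dst=(16, 16), size=8)

# Splicing: the patch is taken from a different image.
spliced_img, spliced_mask = paste_patch(host, alien, src=(4, 4), dst=(10, 10), size=8)
```

Outside the masked square both forged images are identical to the original, which is why a classifier alone is not enough and a localization mask is useful.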
There should be an algorithmic system that detects attacked images. Detecting whether an image is manipulated is not sufficient on its own, however, because it does not give a complete solution; the algorithm should also localize the manipulated regions in the given image.
Why Only with AI?
In a manipulated image, the tampered regions show a sudden change in statistical parameters, especially for splicing attacks. If tampering can be detected through changes in statistical parameters, can we solve this problem with traditional image processing techniques alone? Yes, we can, but it might require a lot of manual effort, a lot of research, and years of time. Let me explain this in detail.
For example, suppose my objective is to detect and eliminate the higher frequencies in a given image. This is a very simple objective, right? We can achieve it by attenuating the higher frequencies with a Gaussian filter of some chosen mean and standard deviation. Note that for this objective we have a well-suited filter, but for our main objective (i.e. localizing the tampered region) there is no predefined filter function, and finding the right filter for the objective might take many attempts that would definitely cost years of research and time.
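As a rough illustration of the Gaussian example, the sketch below builds a centered Gaussian kernel and applies it to a checkerboard, the highest-frequency pattern a pixel grid can hold. The kernel size, the standard deviation, and the function names are assumptions chosen for the demo.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Centered 2-D Gaussian kernel, normalized so its weights sum to 1."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def filter2d(image, kernel):
    """Same-size 2-D correlation; borders are handled by edge replication."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image.astype(np.float64), ((ph, ph), (pw, pw)), mode="edge")
    out = np.zeros(image.shape, dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + image.shape[0], j:j + image.shape[1]]
    return out

# Checkerboard of 0s and 1s: neighboring pixels always differ.
rows, cols = np.indices((16, 16))
checker = ((rows + cols) % 2).astype(np.float64)

smoothed = filter2d(checker, gaussian_kernel(size=5, sigma=1.0))
print("std before:", checker.std(), "std after:", smoothed.std())
```

The kernel weights are fixed in advance and identical for every input image.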
Also, in the example above the transfer function of the filter is the same for all kinds of images, so the weights of the filter never adapt to our requirement. Whenever we have a complex objective and no idea which filter or combination of filters to use, engineers and researchers simply choose deep learning techniques.
DL techniques such as CNNs automatically learn the filter transfer functions required for our objective. During the training phase, a CNN continuously updates the filter weights based on the difference between the obtained result and the actual result, and this is the core idea behind such learning-based algorithms.
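The idea that filter weights are learned from the error can be illustrated without a full CNN. The sketch below is a deliberately simplified stand-in, not the model used in this post: a single 3-tap 1-D filter trained by plain gradient descent on the mean squared error. The target kernel, the learning rate, and the number of steps are made-up values.

```python
import numpy as np

def sliding_windows(x, width):
    """Stack every length-`width` window of x as one row."""
    count = x.size - width + 1
    return np.stack([x[i:i + count] for i in range(width)], axis=1)

rng = np.random.default_rng(0)
signal = rng.normal(size=200)
windows = sliding_windows(signal, 3)

# Hypothetical "ideal" filter that the learner does not get to see.
true_kernel = np.array([0.25, 0.5, 0.25])
target = windows @ true_kernel            # the "actual result"

kernel = np.zeros(3)                      # start from an uninformed filter
learning_rate = 0.1                       # assumed value
for _ in range(500):
    error = windows @ kernel - target                 # obtained minus actual result
    gradient = 2.0 * windows.T @ error / error.size   # gradient of the MSE w.r.t. the kernel
    kernel -= learning_rate * gradient                # weight update

print("learned kernel:", kernel)
```

A CNN does the same thing at a much larger scale, with many 2-D filters and gradients obtained through backpropagation.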
- Reading data: reading the fake images, masks, and pristine images.
- Exploratory data analysis: different plots, such as a bar plot of the number of fake and pristine images.
- Applying models for the classification part: using different models such as a simple convolutional model, VGG16, and ResNet.
- Ground truth: computing the ground truth for every image.
- SRM filters: using SRM filters to find the portion of an image that is forged.
- Design a CNN model to predict whether the image is forged or not.
- Design a CNN model to find the portion of the image that is forged.
This post assumes familiarity with basic deep learning concepts and tools such as TensorFlow data pipelines, CNN models such as ResNet and VGG16, and concepts related to segmentation such as SRM filters.
Before going into the dataset overview, let us make the terminology clear:
- Fake image: An image that has been manipulated/doctored using the two most common manipulation operations, namely copy/pasting and image splicing.
- Pristine image: An image that has not been manipulated except for the resizing needed to bring all images to a standard size as per competition rules.
- Image splicing: Splicing operations can combine images of people, add doors to buildings, add trees and cars to parking lots, etc. Spliced images can also contain parts resulting from copy/pasting operations. The image receiving a spliced part is called a “host” image. The parts being spliced into the host image are referred to as “aliens”.
Before going further into the deep learning architectures, we need suitable data for our objective. Here I am using the IEEE IFS-TC Image Forensics Challenge data, which contains both pristine and manipulated images. The manipulated images in the data were created using all three kinds of manipulation methods discussed above. Let us quickly get a feel for the data with some exploratory data analysis.
The exploratory analysis shows that the dataset contains more pristine images than fake images. The images also have different numbers of channels, so we need to convert every image to the same number of channels, which we will do using TensorFlow data pipelines.
Implementing the TensorFlow data pipeline:
Now we convert every image to the same number of channels. We also split the data into training and test sets in an 80:20 ratio.
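The post performs these two steps inside a TensorFlow data pipeline. The sketch below shows only the underlying logic with NumPy and the standard library, so it is not the pipeline code itself; the function names, the handling of RGBA by dropping the alpha channel, and the random seed are assumptions.

```python
import random
import numpy as np

def to_rgb(image):
    """Return an H x W x 3 array from a grayscale, RGB, or RGBA input."""
    if image.ndim == 2:                  # grayscale without a channel axis
        return np.stack([image] * 3, axis=-1)
    if image.shape[-1] == 1:             # grayscale with a channel axis
        return np.repeat(image, 3, axis=-1)
    if image.shape[-1] == 4:             # RGBA: drop the alpha channel (assumption)
        return image[..., :3]
    if image.shape[-1] == 3:             # already RGB
        return image
    raise ValueError(f"unsupported channel count: {image.shape[-1]}")

def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle reproducibly and split into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical inputs with 1, 3, and 4 channels.
samples = [np.zeros((8, 8)), np.zeros((8, 8, 3)), np.zeros((8, 8, 4))]
shapes = [to_rgb(s).shape for s in samples]

train_ids, test_ids = train_test_split(range(100))
print(shapes, len(train_ids), len(test_ids))
```

In a `tf.data` pipeline the same conversion would typically be applied per image in a mapping step, so that every batch has a uniform shape.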
Now we are going to measure classification performance using three different models:
- Convolutional Model
Using the CNN model, we get a precision of 0.63, a recall of 0.69, and an F1 score of 0.60. In this CNN model we used Conv2D layers with a dropout value of 0.5. After this we proceed with the VGG16 and ResNet models.
Using the VGG16 model, we get a precision of 0.75, a recall of 0.78, and an F1 score of 0.73. Here we used the VGG16 model with 32 filters and a kernel size of 3, with ReLU as the activation function. Next we try the ResNet50 model.
Using the ResNet50 model, we get a precision of 0.71, a recall of 0.79, and an F1 score of 0.70.
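For reference, precision, recall, and F1 for a single positive class can be computed as in the sketch below. The labels are made up for the demo, with 1 standing for "fake". The post does not state how its scores were averaged across classes, so this sketch is not expected to reproduce the numbers above.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one class, from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels: 1 = fake, 0 = pristine.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Because the dataset has more pristine than fake images, these per-class scores are more informative than plain accuracy.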
Now we write a function to predict whether an image is fake or not.
Importing images from the fake images folder and predicting their class labels
Importing images from the pristine images folder and predicting their class labels
Detecting the forged portion of images:
The RGB stream models visual tampering artifacts, such as unusually high contrast along object edges. The noise stream first obtains the noise feature map by passing the input RGB image through an SRM filter layer, and leverages the noise features to provide additional evidence for manipulation classification. A bilinear pooling layer after it enables the network to combine the spatial co-occurrence features from the two streams. Finally, passing the results through a fully connected layer, the network produces the required mask output.
Recently, methods based on local noise features, like the steganalysis rich model (SRM), have shown promising performance in image forensics tasks. These methods extract local noise features from adjacent pixels, capturing the inconsistency between tampered and authentic regions. Cozzolino et al. explore and demonstrate the performance of SRM features in distinguishing tampered and authentic regions. They also combine SRM features, by including the quantization and truncation operations, with a Convolutional Neural Network (CNN) to perform manipulation localization. Rao et al. use an SRM filter kernel as initialization for a CNN to boost the detection accuracy. Most of these methods focus on specific tampering artifacts and are limited to specific tampering techniques. We also use these SRM filter kernels to extract low-level noise that is used as the input to a Faster R-CNN network, and learn to capture tampering traces from the noise features. Moreover, a parallel RGB stream is trained jointly to model mid- and high-level visual tampering artifacts.
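To show what an SRM filter layer computes, the sketch below applies three fixed 5 x 5 high-pass kernels to a grayscale image. The kernel values are the three commonly used with two-stream Faster R-CNN style models; treating them as the exact kernels used in this post is an assumption, as are the single-channel input, the synthetic test image, and the function names.

```python
import numpy as np

# Three SRM-style high-pass kernels (assumed values); each sums to zero.
SRM_KERNELS = np.array([
    [[0, 0, 0, 0, 0],
     [0, -1, 2, -1, 0],
     [0, 2, -4, 2, 0],
     [0, -1, 2, -1, 0],
     [0, 0, 0, 0, 0]],
    [[-1, 2, -2, 2, -1],
     [2, -6, 8, -6, 2],
     [-2, 8, -12, 8, -2],
     [2, -6, 8, -6, 2],
     [-1, 2, -2, 2, -1]],
    [[0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0],
     [0, 1, -2, 1, 0],
     [0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0]],
], dtype=np.float64)
SRM_KERNELS[0] /= 4.0
SRM_KERNELS[1] /= 12.0
SRM_KERNELS[2] /= 2.0

def filter2d(image, kernel):
    """Same-size 2-D correlation; borders are handled by edge replication."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image.astype(np.float64), ((ph, ph), (pw, pw)), mode="edge")
    out = np.zeros(image.shape, dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + image.shape[0], j:j + image.shape[1]]
    return out

def noise_map(gray):
    """Stack the three filter responses into a (3, H, W) noise feature map."""
    return np.stack([filter2d(gray, k) for k in SRM_KERNELS])

# Synthetic image: a smooth horizontal ramp with one noisy "tampered" square.
rng = np.random.default_rng(0)
image = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))
image[8:16, 8:16] += rng.normal(0.0, 0.1, size=(8, 8))

energy = np.abs(noise_map(image)).sum(axis=0)
tampered_energy = energy[8:16, 8:16].mean()
clean_energy = energy[20:28, 20:28].mean()
print(tampered_energy, clean_energy)
```

The filters suppress smooth image content and keep only local residuals, so the square with different noise statistics stands out while the clean region gives a response close to zero.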
We obtained an accuracy of 93.5%. The RGB stream models visual tampering artifacts, such as unusually high contrast along object edges, and regresses bounding boxes to the ground truth. The noise stream first obtains the noise feature map by passing the input RGB image through an SRM filter layer, and leverages the noise features to provide additional evidence for manipulation classification. The RGB and noise streams share the same region proposals from the RPN network, which uses only RGB features as input. The RoI pooling layer selects spatial features from both the RGB and noise streams. The predicted bounding boxes (denoted as ‘bbx pred’) are generated from the RGB RoI features. A bilinear pooling [23, 17] layer after RoI pooling enables the network to combine the spatial co-occurrence features from the two streams. Finally, passing the results through a fully connected layer and a softmax layer, the network produces the predicted label (denoted as ‘cls pred’) and determines whether the predicted regions have been manipulated or not.
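The bilinear pooling step can be sketched as follows: at every spatial location of a region, take the outer product of the RGB feature vector and the noise feature vector, then sum over locations. The signed square root and L2 normalization applied afterwards follow common practice for bilinear pooling and are an assumption here, as are the feature map sizes, the random inputs, and the function names.

```python
import numpy as np

def bilinear_features(rgb_feat, noise_feat):
    """Sum over spatial locations of the outer product of both feature vectors.

    rgb_feat: (H, W, C1), noise_feat: (H, W, C2) -> vector of length C1 * C2.
    """
    c1, c2 = rgb_feat.shape[-1], noise_feat.shape[-1]
    a = rgb_feat.reshape(-1, c1)       # one row per spatial location
    b = noise_feat.reshape(-1, c2)
    return (a.T @ b).reshape(-1)       # equals the sum of per-location outer products

def normalize(features):
    """Signed square root followed by L2 normalization (assumed post-processing)."""
    signed_sqrt = np.sign(features) * np.sqrt(np.abs(features))
    norm = np.linalg.norm(signed_sqrt)
    return signed_sqrt / norm if norm > 0 else signed_sqrt

# Hypothetical RoI features: 7 x 7 locations, 4 RGB channels, 3 noise channels.
rng = np.random.default_rng(0)
rgb_roi = rng.normal(size=(7, 7, 4))
noise_roi = rng.normal(size=(7, 7, 3))

pooled = normalize(bilinear_features(rgb_roi, noise_roi))
print(pooled.shape)  # (12,)
```

The resulting vector pairs every RGB channel with every noise channel, which is what lets the fully connected layer use co-occurrence information from both streams.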