An Affine Resistant Watermarking Scheme for Audio Signals


Abstract

In this paper, we present a novel approach for embedding a digital watermark inaudibly into an audio clip, in the time domain, according to the difference between two half blocks of each block. The proposed scheme does not require any host-related information for watermark extraction. The embedded watermark is robust to common audio signal manipulations, such as MP3 compression, time shifting, cropping, time scaling, D/A A/D conversion, insertion, deletion, re-sampling, re-quantization and filtering.

Two kinds of information are hidden in the audio: the owner's information and a synchronization template. The owner's information is a binary image provided by the copyright owner, which can contain words, numbers, a signature, a personal seal or an organization's logo. The synchronization template is generated by a random number generator controlled by a secret key and is used to correct the signal skew caused by time shifting, cropping and time scaling attacks. The two kinds of information are combined and dispersed using another secret key before embedding. The proposed technique can be applied to automatically search an audio database (or the World Wide Web) for protected audio by first matching the synchronization template and then displaying the owner's information if the audio is judged to have been watermarked.
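The key-controlled template and the key-controlled dispersal described above can be pictured with a minimal sketch. Everything specific here is an assumption made for illustration: the template length (TEMPLATE_LEN), the choice to combine the two kinds of information by simple concatenation, the use of a keyed random permutation for dispersal, and all function names. The text does not state how the scheme actually performs these steps.

```python
import numpy as np

TEMPLATE_LEN = 1024  # hypothetical template length; not given in the text


def make_template(template_key, length=TEMPLATE_LEN):
    """Generate a binary synchronization template from a secret key."""
    rng = np.random.default_rng(template_key)
    return rng.integers(0, 2, size=length, dtype=np.uint8)


def combine_and_disperse(owner_image, template, disperse_key):
    """Concatenate template and image bits (assumed), then shuffle by key."""
    payload = np.concatenate([template, owner_image.reshape(-1).astype(np.uint8)])
    order = np.random.default_rng(disperse_key).permutation(len(payload))
    return payload[order]


def gather_and_split(dispersed, disperse_key,
                     template_len=TEMPLATE_LEN, image_shape=(64, 64)):
    """Undo the keyed shuffle and split the payload back into its parts."""
    order = np.random.default_rng(disperse_key).permutation(len(dispersed))
    payload = np.empty_like(dispersed)
    payload[order] = dispersed  # inverse of payload[order]
    return payload[:template_len], payload[template_len:].reshape(image_shape)


# Demo with a synthetic 64x64 binary image standing in for the owner's logo.
owner_image = np.zeros((64, 64), dtype=np.uint8)
owner_image[16:48, 16:48] = 1
template = make_template(template_key=12345)
dispersed = combine_and_disperse(owner_image, template, disperse_key=67890)
recovered_template, recovered_image = gather_and_split(dispersed, disperse_key=67890)
```

With the correct dispersal key the template and the image come back bit for bit; with a different key the permutation differs and the recovered bits are scrambled.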


Experimental results

Fig. 14 shows three typical 64×64 binary images, which are embedded as the owner's information. We demonstrate the inaudibility of the proposed watermarking scheme and its robustness to common audio signal manipulations with three audio signals: two classical music themes and a pop music theme. All of them are sampled at 44.1 kHz with 16 bits/sample in mono. Table 1 details the audio clips and the embedding and extraction information, and Table 2 details the effect of different attacks. A threshold value of 0.45 for template matching was used in all experiments. Embedding and extraction were performed on a Pentium 4 2.53 GHz PC. When embedding the watermark, our system selects the best of 50 random starting points (i.e., P=50) for each slice. When extracting the watermark, our system detects all possible time shifts and time scales from -2% to +2%, and the extraction can be done in real time. The signal-to-noise ratio (SNR) is defined as follows:

\[ \mathrm{SNR} = 10 \log_{10} \frac{\sum_{n} x(n)^2}{\sum_{n} [x(n) - y(n)]^2} \ \text{(dB)} \]

where x(n) and y(n) are samples of the original and the watermarked audio clips, respectively. The SNR values in brackets in Table 1 are calculated on the adjustable blocks only, i.e. where x(n)≠y(n).
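As a minimal sketch of how both SNR figures could be measured, the function below assumes the conventional ratio of signal energy to difference energy in decibels. It also assumes that the bracketed variant can be approximated sample by sample, keeping only positions where x(n)≠y(n); the scheme itself works on blocks, so this mask is a simplification. The function name and parameters are hypothetical.

```python
import numpy as np


def snr_db(original, watermarked, changed_only=False):
    """SNR in dB between an original clip x(n) and a watermarked clip y(n).

    With changed_only=True, only samples where x(n) != y(n) are counted
    (a sample-wise stand-in for the "adjustable blocks" in the text).
    """
    x = np.asarray(original, dtype=np.float64)
    y = np.asarray(watermarked, dtype=np.float64)
    if changed_only:
        mask = x != y
        x, y = x[mask], y[mask]
    noise_energy = np.sum((x - y) ** 2)
    if noise_energy == 0:
        return float("inf")  # identical signals: no embedding distortion
    return float(10.0 * np.log10(np.sum(x ** 2) / noise_energy))


# Tiny example: only the third sample was modified by the embedding.
x = np.array([100, -100, 100, -100], dtype=np.int16)
y = np.array([100, -100, 101, -100], dtype=np.int16)
overall = snr_db(x, y)                 # uses all four samples
restricted = snr_db(x, y, changed_only=True)  # uses the changed sample only
```

Restricting the measurement to modified samples removes untouched signal energy from the numerator, which is why the bracketed values in Table 1 are lower than the overall ones.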

Fig. 14. Three typical 64×64 binary images for representing the owner's information: (a) English words, (b) the logo of the IEEE organization, and (c) four Chinese words corresponding to those in (a).


Table 1. Simulation results for three different audio clips.

Audio/Music Name | Mendelssohn | Mozart | Black or white
Description | Mendelssohn, Violin Concerto in E minor, Op. 64 | Symphonic music, Mozart No. 41 | Pop music, Michael Jackson
Length | 4 min 36 sec | 4 min 38 sec | 3 min 15 sec
File size | 23.2 MB | 23.4 MB | 16.9 MB
Original audio file | Mendelssohn.wav (23.2 MB) | Mozart.wav (23.4 MB) | BlackOrWhite.wav (16.4 MB)
Watermarked audio file | Mendelssohn_1WM.wav (23.2 MB) | Mozart_1WM.wav (23.4 MB) | BlackOrWhite_1WM.wav (16.4 MB)
Time of embedding (sec) | 140.4 | 65.4 | 100.3
SNR (dB) | 50.1 (34.1) | 53.3 (32.3) | 55.5 (37.1)
Averaged score of subjective test, from 0 (bad) to 5 (good) | 4.28 | 4.5 | 4.14
Extracted watermark without any attack (correlation of template matching) | 0.91 | 0.87 | 0.97
Time of extraction (sec) | 268.3 | 262.6 | 193.4


The inaudibility of our watermarking scheme has been verified through subjective tests. The hypothesis is that the difference between a watermarked audio clip and the original is not significant (i.e. inaudible). Fourteen listeners were presented with the original and the watermarked audio clips in random order and were asked to score the difference. The weighted difference is mapped to the five-grade scale used in conventional subjective tests, on which a score of 5 denotes an imperceptible difference and a score of 1 denotes a very annoying one. The resulting average scores corroborate our hypothesis.

The owner's information shown in Fig. 14(a) is adopted for embedding. Table 1 shows the extraction results without any attack, and Table 2 shows those under various attacks. The correlations of template matching listed in Table 1 and Table 2 are high enough to show the applicability of the proposed scheme to searching for watermark-protected audio clips. We test the robustness of our work against several kinds of common audio manipulations (or attacks). CoolEdit™ 2000 [16] is used to generate all of the following attacks.

  • II. Random cropping. The watermarked audio is randomly cropped so that only half of its length remains. Because each slice is an independent processing unit, we can extract watermarks from the remaining frames after block synchronization. Because the number of frames is reduced, the visual quality of the extracted owner's information is affected. As Table 2 shows, we can still successfully recognize the owner's information, and the correlations of template matching are reasonably high.
  • III. Time scaling. The watermarked audio is scaled by 1.23% for testing, using three different kinds of scaling: time stretching (preserves pitch), pitch stretching (preserves tempo) and resampling (preserves neither). Time scaling by resampling has almost no effect on our extraction scheme. Time scaling by time stretching or pitch stretching does have some effect, but simulation results show that the effect is tolerable. The shifting and scaling of each slice can be detected by template matching, as explained in Section 3.1.
  • IV. D/A A/D conversion. The watermarked audio is converted from digital to analog and then recorded back from analog to digital. Note that the signal amplitude changes after D/A A/D conversion. The simulation results show that the proposed scheme can also resist D/A A/D conversion.
  • V. Insertion. A 30-second audio segment that carries no watermark is inserted into the watermarked audio at 30-second intervals, as shown in Fig. 15. The extraction scheme can detect that some audio segments have been watermarked and can also roughly locate the protected segments. The owner's information is then extracted from those protected segments, as shown in Table 2. Note that the correlation of template bits is also computed from those protected segments only.
  • VI. Deletion. Every other 30-second segment of the watermarked audio is deleted. Slices can be synchronized independently, so the bits of each remaining slice can be extracted correctly.
  • VII. MP3 compression. Lossy compression is a very common procedure for efficient transmission and storage of multimedia content. To test the robustness against lossy compression, the watermarked audio is compressed and decompressed with MPEG-1 Layer 3 (MP3) at 64 kbps. The results indicate a decrease in the correlation (see Table 2); however, the correlation is still higher than the threshold. The extracted owner's information can also be easily recognized by the naked eye.
  • VIII. Re-sampling and re-quantization. The watermarked audio, originally sampled at 44100 Hz with 16 bits/sample, is re-sampled down to 11025 Hz and re-quantized down to 8 bits/sample. The low-resolution audio is then up-sampled to 44100 Hz and re-quantized to 16 bits/sample. Although this procedure causes audible noise, it has almost no effect on the correlation of template matching, and the extracted owner's information is hardly affected. These kinds of attack have only a very limited effect on the block polarity, as defined in eqn. (1) or eqn. (2). This explains why the proposed scheme is robust to re-sampling and re-quantization attacks.
  • IX. Low-pass filtering. To test the robustness against filtering, a low-pass filter with a cutoff frequency of 4 kHz was applied to the watermarked audio. The loss of high-frequency components is clearly audible; however, the embedded watermark can still be detected successfully.
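The idea of detecting a time scale by template matching, mentioned under time scaling, can be illustrated with a rough sketch. This is not the procedure of Section 3.1: it assumes a zero-mean normalized correlation as the matching measure, a brute-force search over candidate scales in the -2% to +2% range quoted earlier, and a synthetic template signal in which each bit is held for SAMPLES_PER_BIT samples. All names and the demo signal are hypothetical; only the 0.45 threshold and the 1.23% scaling come from the text.

```python
import numpy as np

THRESHOLD = 0.45       # template-matching threshold quoted in the text
SAMPLES_PER_BIT = 50   # hypothetical number of samples carrying one bit


def normalized_correlation(a, b):
    """Zero-mean normalized correlation in [-1, 1] (assumed measure)."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0


def undo_scale(received, scale, out_len):
    """Read `received` at positions i * (1 + scale) to undo a stretch."""
    positions = np.arange(out_len) * (1.0 + scale)
    return np.interp(positions, np.arange(len(received)), received)


def search_time_scale(received, reference, scales):
    """Return (best_scale, best_correlation) over the candidate scales."""
    best_scale, best_corr = 0.0, -1.0
    for scale in scales:
        candidate = undo_scale(received, scale, len(reference))
        corr = normalized_correlation(candidate, reference)
        if corr > best_corr:
            best_scale, best_corr = float(scale), corr
    return best_scale, best_corr


# Demo: a synthetic +/-1 template signal stretched by 1.23%.
rng = np.random.default_rng(2024)
bits = rng.integers(0, 2, size=200) * 2.0 - 1.0
reference = np.repeat(bits, SAMPLES_PER_BIT)
stretched_len = int(len(reference) * 1.0123)
received = np.interp(np.arange(stretched_len) / 1.0123,
                     np.arange(len(reference)), reference)

scales = np.arange(-20, 21) / 1000.0   # -2% to +2% in 0.1% steps
best_scale, best_corr = search_time_scale(received, reference, scales)
unscaled_corr = normalized_correlation(received[:len(reference)], reference)
detected = best_corr > THRESHOLD
```

In this toy setting the search settles on the grid point nearest to the applied 1.23% stretch, while the correlation computed without any scale compensation is noticeably lower because the bit boundaries drift apart over the length of the signal.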

Table 2. Simulation results of various attacks for three different audio clips. The number at the start of each cell is the correlation of template bits for the extracted watermark, followed by the format and size of the corresponding audio file(s).

Attacks \ Audio/Music Name | Mendelssohn | Mozart | Black or white
Time shifting | 0.91, wav (23.6 MB) | 0.87, wav (23.8 MB) | 0.97, wav (17.3 MB)
Cropping, with half left | 0.90, wav (11.6 MB) | 0.87, wav (11.7 MB) | 0.96, wav (8.5 MB)
Time scaling, time stretch (preserves pitch) | 0.77, wav (22.9 MB) | 0.75, wav (23.1 MB) | 0.83, wav (16.7 MB)
Time scaling, pitch stretch (preserves tempo) | 0.76, wav (23.2 MB) | 0.75, wav (23.4 MB) | 0.83, wav (16.9 MB)
Time scaling, resample (preserves neither) | 0.91, wav (23.5 MB) | 0.87, wav (23.6 MB) | 0.97, wav (17.1 MB)
D/A A/D conversion | 0.74, wav (22.9 MB) | 0.64, wav (23.6 MB) | 0.71, wav (16.7 MB)
Insertion | 0.89, wav (40.3 MB) | 0.87, wav (45.4 MB) | 0.95, wav (34.6 MB)
Deletion | 0.89, wav (12.6 MB) | 0.87, wav (12.6 MB) | 0.95, wav (9.4 MB)
MP3 compression at 64 kbps | 0.85, mp3 (2.1 MB), wav (23.2 MB) | 0.70, mp3 (2.1 MB), wav (23.4 MB) | 0.64, mp3 (1.5 MB), wav (16.9 MB)
Re-sampling to 11025 Hz and re-quantization to 8 bits | 0.85, wav (2.9 MB), wav (23.2 MB) | 0.74, wav (2.9 MB), wav (23.4 MB) | 0.95, wav (2.1 MB), wav (16.9 MB)
Low-pass filter with cutoff frequency at 4 kHz | 0.91, wav (23.2 MB) | 0.88, wav (23.4 MB) | 0.97, wav (16.9 MB)
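For readers who want to reproduce an attack of the re-sampling and re-quantization kind listed in Table 2, the sketch below approximates it with NumPy only. It is a crude stand-in, not the CoolEdit processing used for the reported results: down-sampling is done by averaging groups of four samples, up-sampling by linear interpolation, and the 8-bit stage by discarding the low 8 bits. The function name and parameters are hypothetical.

```python
import numpy as np


def resample_requantize_attack(pcm16, factor=4, drop_bits=8):
    """Approximate 44100 Hz/16 bit -> 11025 Hz/8 bit -> 44100 Hz/16 bit."""
    pcm16 = np.asarray(pcm16, dtype=np.int16)
    n = len(pcm16) - len(pcm16) % factor      # trim to a multiple of factor
    x = pcm16[:n].astype(np.float64)

    # Down-sample: average each group of `factor` samples (crude anti-alias).
    low_rate = x.reshape(-1, factor).mean(axis=1)

    # Re-quantize: keep only the top (16 - drop_bits) bits.
    step = 2 ** drop_bits
    restored = np.floor(low_rate / step) * step

    # Up-sample: linear interpolation back onto the original sample grid.
    coarse_pos = np.arange(len(restored)) * factor + (factor - 1) / 2.0
    up = np.interp(np.arange(n), coarse_pos, restored)
    return np.clip(np.rint(up), -32768, 32767).astype(np.int16)


# Demo: one second of a 440 Hz tone at 44.1 kHz, 16 bits, mono.
t = np.arange(44100) / 44100.0
tone = np.rint(10000 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
attacked = resample_requantize_attack(tone)
max_error = int(np.max(np.abs(attacked.astype(np.int64) - tone.astype(np.int64))))
```

The attacked signal has the same length and sample format as the input but carries quantization and interpolation error, which is the kind of distortion the robustness figures in the table refer to.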


Our approach can also embed multiple owners' information at different time instances with the same template. Fig. 16 shows an example of extracting different owners' information during different extraction time slots; watching this kind of extraction process resembles watching an animation. The correlation, Rs=0.92, is high enough to confirm that the audio has been protected. The meaningful owners' information can be recognized by the naked eye at the different extraction time instances.


Fig. 16. This example demonstrates the embedding of multiple owners' information at different time instances. From left to right, different owners' information can be extracted at different time instances.


Executable programs

  • AWM_v2.4 (108 KB). This version only supports the .wav and .pcm file formats; please remove any redundant file header in the .wav file. The audio must be sampled at 44.1 kHz with 16 bits/sample in mono. The program saves some extraction information to disk, so make sure your disk is writable. You can embed your own watermark, which must be a 64×64 binary image. When extracting a watermark, you can select a different range of time scaling from "Scaling Available" in "Extract Option". If you embed more than one watermark in an audio clip, please select "Show Progress" to extract the watermarks progressively. If your computer is too fast, you can set "Extract Delay" to 1 ms in "Extract Option".
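Before feeding a file to the program, it may help to confirm that it meets the stated format requirements. The helper below is a small sketch using Python's standard wave module; the function name is hypothetical, and it checks only the sample rate, sample width and channel count, not the "redundant file header" condition.

```python
import wave


def check_wav_format(path):
    """Return a list of mismatches against 44.1 kHz, 16 bits/sample, mono."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 44100:
            problems.append("sample rate is %d Hz, expected 44100 Hz"
                            % wav.getframerate())
        if wav.getsampwidth() != 2:  # sample width in bytes; 2 bytes = 16 bits
            problems.append("sample width is %d bits, expected 16 bits"
                            % (8 * wav.getsampwidth()))
        if wav.getnchannels() != 1:
            problems.append("file has %d channels, expected mono"
                            % wav.getnchannels())
    return problems
```

An empty list means the file matches the three stated requirements; otherwise each entry describes one mismatch.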

Web page: http://www.cmlab.csie.ntu.edu.tw/~dynamic/AWM/index.html


The audio clips on this website are for research use only and must never be used commercially.