Spark Word Count program in Python

Here is a word count program in Python using Spark (PySpark) and Hadoop (HDFS). In this tutorial, you will learn how to process data with Spark RDDs, how to store or move a file into Hadoop HDFS, and how to read that file for Spark processing using Python command-line arguments.

Note: Make sure Hadoop is installed on your system and is running. You can check this with the 'jps' command in your terminal. (If Hadoop is not installed, you can follow a step-by-step guide from here.)

1. Python code to count the occurrences of each word in a file.

from __future__ import print_function
import sys
import findspark
findspark.init()
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        sys.exit(-1)

    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

    spark.stop()

2. Move the word file to HDFS.
    hadoop fs -put /home/username/file/path/words.txt /hdfs/path

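3. Run the Python code, passing the HDFS location of the file as the argument (use the HDFS path the file was copied to in step 2).
    python word_count.py hdfs:///usr/words.txt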

4. That's it! The output lists each word along with its number of occurrences.
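
The three RDD transformations used in step 1 can be mirrored in plain Python to see what each stage contributes. The sketch below is only an illustration that runs without Spark or HDFS; the function name word_count and the sample lines are assumptions made for this example and are not part of the program above.

```python
from operator import add


def word_count(lines):
    """Plain-Python imitation of flatMap -> map -> reduceByKey(add)."""
    # flatMap: split every line on spaces and flatten into one list of words
    words = [word for line in lines for word in line.split(' ')]
    # map: pair every word with the count 1
    pairs = [(word, 1) for word in words]
    # reduceByKey(add): combine the counts of identical words
    counts = {}
    for word, n in pairs:
        counts[word] = add(counts[word], n) if word in counts else n
    return counts


sample_lines = ["spark makes counting easy", "counting words with spark"]
for word, count in word_count(sample_lines).items():
    print("%s: %i" % (word, count))
```

Unlike this single-process version, Spark distributes the same three stages across the partitions of the input file.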

What are block ciphers? Explain with examples the ECB and CBC modes of block ciphers.

A block cipher encrypts a block of plaintext as a whole and produces a ciphertext block of equal length. It takes a fixed-length block of text of b bits and a key as input and produces a b-bit block of ciphertext. Typically, a block size of 64 or 128 bits is used. Five modes of operation are defined to cover the wide variety of encryption applications for which a block cipher could be used. They are as follows:
  1. Electronic Codebook (ECB)
  2. Cipher Block Chaining (CBC)
  3. Cipher Feedback (CFB)
  4. Output Feedback (OFB)
  5. Counter (CTR)
Fig. Block Cipher
1) Electronic Codebook (ECB)
    This is the simplest mode, in which plaintext is handled one block at a time and each block of plaintext is encrypted using the same key. The term codebook is used because, for a given key, there is a unique ciphertext for every b-bit block of plaintext. Therefore, we can imagine a gigantic codebook in which there is an entry for every possible b-bit plaintext showing its corresponding ciphertext. For a message longer than b bits, the procedure is simply to break the message into b-bit blocks, padding the last block if required.
    Decryption is done one block at a time, always using the same key. In the figure shown below, the plaintext consists of a sequence of b-bit blocks (P1, P2, ..., PN); the corresponding sequence of ciphertext blocks is C1, C2, ..., CN.
ECB is defined as,
Ci = E(K, Pi)    i = 1,..,N
Pi = D(K, Ci)    i = 1,..,N
Fig. Electronic Codebook (ECB)
The ECB method is ideal for a short amount of data, such as an encryption key. The most significant characteristic of ECB is that if the same b-bit block of plaintext appears more than once in the message, it always produces the same ciphertext. For lengthy messages, this repetition can expose patterns in the plaintext, so ECB may not be secure.
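
The repetition property of ECB can be observed with a small sketch. The block cipher used here is a toy 4-round Feistel construction built on SHA-256; it is an assumption made purely for illustration, stands in for E(K, P) and D(K, C), and is not a real or secure cipher. The names toy_encrypt_block, toy_decrypt_block, split_blocks, ecb_encrypt and ecb_decrypt are likewise hypothetical. Padding is left out, so the message length is assumed to be a multiple of the block size.

```python
import hashlib

BLOCK_SIZE = 16  # bytes, i.e. b = 128 bits
HALF = BLOCK_SIZE // 2


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def _round_fn(key, round_no, half):
    # Toy round function (assumption): a truncated SHA-256 digest
    return hashlib.sha256(key + bytes([round_no]) + half).digest()[:HALF]


def toy_encrypt_block(key, block):
    """Stand-in for E(K, P): a 4-round Feistel network (NOT secure)."""
    left, right = block[:HALF], block[HALF:]
    for i in range(4):
        left, right = right, _xor(left, _round_fn(key, i, right))
    return left + right


def toy_decrypt_block(key, block):
    """Stand-in for D(K, C): the Feistel rounds undone in reverse order."""
    left, right = block[:HALF], block[HALF:]
    for i in reversed(range(4)):
        left, right = _xor(right, _round_fn(key, i, left)), left
    return left + right


def split_blocks(data):
    if len(data) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of the block size")
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def ecb_encrypt(key, plaintext):
    # Ci = E(K, Pi): every block is encrypted independently with the same key
    return b"".join(toy_encrypt_block(key, p) for p in split_blocks(plaintext))


def ecb_decrypt(key, ciphertext):
    # Pi = D(K, Ci)
    return b"".join(toy_decrypt_block(key, c) for c in split_blocks(ciphertext))


key = b"demo key"
message = b"A" * 16 + b"B" * 16 + b"A" * 16  # blocks 1 and 3 are identical
cipher_blocks = split_blocks(ecb_encrypt(key, message))
print(cipher_blocks[0] == cipher_blocks[2])  # True: repeated plaintext block
print(cipher_blocks[0] == cipher_blocks[1])  # False: different plaintext block
```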

2) Cipher Block Chaining Mode (CBC)
    Unlike ECB, CBC doesn't produce the same ciphertext for repeated plaintext blocks. In this mode, the input to the encryption algorithm is the XOR of the current plaintext block and the preceding ciphertext block; the same key is used for each block. The input to the encryption function for each plaintext block therefore bears no fixed relationship to the plaintext block, so repeating patterns of b bits are not exposed. As with the ECB mode, the CBC mode requires that the last block be padded to a full b bits if it is a partial block.
    For decryption, each ciphertext block is passed through the decryption algorithm. The result is XORed with the preceding ciphertext block to produce the plaintext block. To produce the first block of ciphertext, an initialization vector (IV) is XORed with the first block of plaintext. On decryption, the IV is XORed with the output of the decryption algorithm to recover the first block of plaintext. The IV is a data block of the same size as the cipher block.
CBC is defined as,
C1 = E(K, [P1 ⊕ IV])
Ci = E(K, [Pi ⊕ Ci-1])    i = 2,..,N
P1 = D(K, C1) ⊕ IV
Pi = D(K, Ci) ⊕ Ci-1    i = 2,..,N

Fig. Cipher Block Chaining (CBC)
The IV must be known to both the sender and receiver but be unpredictable by a third party. In particular, for any given plaintext, it must not be possible to predict the IV that will be associated with the plaintext before the IV is generated. For maximum security, the IV should be protected against unauthorized changes; this could be done by sending the IV using ECB encryption.
    Therefore, CBC can be used for encrypting messages of length greater than b bits. In addition to achieving confidentiality, the CBC mode can also be used for authentication.

Difference between IIR and FIR systems.

IIR | FIR
Impulse response is infinite. | Impulse response is finite.
May or may not be stable. | Always stable.
Achieves a sharp cut-off with a lower filter order, hence faster. | Comparatively slow.
IIR is recursive. | FIR is non-recursive.
Multirate signal processing is not supported in IIR. | Multirate signal processing is supported in FIR.
Phase response is not linear. | Phase response is linear.
IIR is comparatively less accurate. | FIR is more accurate.
IIR has both zeros and poles. | FIR has only zeros (all poles lie at the origin).
IIR requires less memory than FIR. | FIR requires more memory than IIR.
$y(n) = \sum_{k=1}^{N} a_k\, y(n-k) + \sum_{k=1}^{M} b_k\, x(n-k)$ | $y(n) = \sum_{k=0}^{M} b_k\, x(n-k)$
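
The recursive and non-recursive behaviour listed above can be checked numerically. The sketch below is a minimal illustration, not a production filter routine: the function name apply_filter and the coefficient values are assumptions chosen for this example. Here b holds the input coefficients starting at k = 0 and a holds the feedback coefficients starting at k = 1; setting b[0] = 0 gives a form whose input sum starts at k = 1.

```python
def apply_filter(b, a, x):
    """y(n) = sum_k a[k]*y(n-k) (k >= 1) + sum_k b[k]*x(n-k) (k >= 0)."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        # Non-recursive part: weighted current and past inputs
        for k, bk in enumerate(b):
            if n - k >= 0:
                acc += bk * x[n - k]
        # Recursive part: weighted past outputs (empty for an FIR filter)
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * y[n - k]
        y.append(acc)
    return y


impulse = [1.0] + [0.0] * 7

# FIR: 4-tap moving average, no feedback coefficients
fir_response = apply_filter([0.25, 0.25, 0.25, 0.25], [], impulse)

# IIR: a single feedback coefficient, y(n) = 0.5*y(n-1) + x(n)
iir_response = apply_filter([1.0], [0.5], impulse)

print(fir_response)  # becomes exactly zero after the 4 taps
print(iir_response)  # keeps halving and never reaches exactly zero
```

The FIR response ends after as many samples as there are taps, while the IIR response decays indefinitely because each output feeds back into the next one. With a feedback coefficient of magnitude 1 or more, the same IIR recursion would grow without bound, which is the stability difference noted in the table.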

Absolute Loader


  • The absolute loader is a kind of loader in which relocated object files are created, and the loader accepts these files and places them at the specified locations in memory.
  • This type of loader is called an absolute loader because no relocation information is needed; the absolute addresses are instead supplied by the programmer or assembler.
  • The starting address of every module is known to the programmer, and this starting address is stored in the object file. The task of the loader then becomes very simple: it places the executable form of the machine instructions at the locations mentioned in the object file.
  • In this scheme, the programmer or assembler should have knowledge of memory management. The programmer should take care of two things:
    • Specification of the starting address of each module to be used. If a module is modified, its length may vary. This changes the starting addresses of the modules that immediately follow it, and it is then the programmer's duty to make the necessary changes to the starting addresses of those modules.
    • While branching from one segment to another, the absolute starting address of the respective module must be known to the programmer so that this address can be specified in the respective JMP instruction.
Fig. Process of Absolute Loader
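
The loader's job in this scheme can be sketched as a simple copy to fixed addresses. The record layout (a start address paired with machine code bytes), the function name absolute_load and the byte values are assumptions made for illustration; real object file formats carry more information.

```python
def absolute_load(memory, records):
    """Place each object record at the absolute address stored with it.

    records is a list of (start_address, machine_code) pairs. No relocation
    is performed: the addresses are used exactly as given.
    """
    for start, code in records:
        end = start + len(code)
        if start < 0 or end > len(memory):
            raise ValueError("record does not fit in memory: %d..%d" % (start, end))
        memory[start:end] = code  # copy the bytes to their fixed location
    # In this sketch, execution is assumed to begin at the first record
    return records[0][0]


memory = bytearray(64)
object_records = [
    (0x10, bytes([0x01, 0x02, 0x03])),  # module 1, fixed at address 0x10
    (0x20, bytes([0xAA, 0xBB])),        # module 2, fixed at address 0x20
]
entry_point = absolute_load(memory, object_records)
print(hex(entry_point), memory[0x10:0x13].hex(), memory[0x20:0x22].hex())
```

If module 1 grew beyond 16 bytes, it would overwrite module 2 unless the programmer moved module 2 to a new start address, which is the maintenance burden described above.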

Advantages:
  1. It is simple to implement.
  2. This scheme allows multiple programs, or source programs written in different languages. If there are multiple programs written in different languages, the respective language assembler converts each one to the target language, and a common object file can be prepared with all the address resolution.
  3. The task of loader becomes simpler as it simply obeys the instruction regarding where to place the object code to the main memory.
  4. The process of execution is efficient.
Disadvantages:
  1. In this scheme, it is the programmer's duty to adjust all the inter-segment addresses and do the linking manually. For that, the programmer needs to know about memory management.
  2. If any modification is made to a segment, the starting addresses of the segments that immediately follow may change. The programmer has to take care of this and update the corresponding starting addresses after any modification to the source.

Explain 'Compile and Go' Loader

    In this type of loader, the source program is read line by line, the machine code for each instruction is obtained, and it is placed directly in main memory at some known address. That means the assembler runs in one part of memory, and the assembled machine instructions and data are put directly into their assigned memory locations. After completion, the assembly process assigns the starting address of the program to the location counter. A typical example is WATFOR-77, a FORTRAN compiler that uses such a "load and go" scheme. This loading scheme is also called "assemble and go".

Fig. Compile and Go Loader
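
The scheme can be sketched with a toy machine. Everything in this example is hypothetical: the three-instruction set, the opcode values, the two-byte instruction format and the function names assemble_and_go and run are assumptions chosen only to show translation straight into memory followed by execution from the start address.

```python
# Hypothetical opcodes for a toy accumulator machine
OPCODES = {"LOAD": 0x01, "ADD": 0x02, "HALT": 0xFF}


def assemble_and_go(source_lines, memory, start_address):
    """Translate each line and store its machine code directly in memory."""
    location = start_address
    for line in source_lines:
        parts = line.split()
        operand = int(parts[1]) if len(parts) > 1 else 0
        memory[location] = OPCODES[parts[0]]  # opcode byte
        memory[location + 1] = operand        # operand byte
        location += 2
    # No object file is produced; control goes to the start address
    return start_address


def run(memory, program_counter):
    """Execute the toy machine code and return the accumulator at HALT."""
    accumulator = 0
    while True:
        opcode = memory[program_counter]
        operand = memory[program_counter + 1]
        if opcode == OPCODES["HALT"]:
            return accumulator
        if opcode == OPCODES["LOAD"]:
            accumulator = operand
        elif opcode == OPCODES["ADD"]:
            accumulator += operand
        program_counter += 2


memory = bytearray(32)
start = assemble_and_go(["LOAD 2", "ADD 3", "HALT"], memory, 8)
print(run(memory, start))  # 5
```

Because nothing is saved between runs, the source has to pass through assemble_and_go again every time the program is executed.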
Advantages:
  • This scheme is simple to implement because the assembler is placed in one part of memory and the loader simply loads the assembled machine instructions into memory.
Disadvantages:
  • In this scheme, some portion of memory is occupied by the assembler, which is simply a waste of memory. As this scheme combines the assembler and loader activities, the combined program occupies a large block of memory.
  • No .obj file is produced; the source code is converted directly to executable form. Hence, even when the source program has not been modified, it needs to be assembled and executed each time, which becomes a time-consuming activity.
  • It cannot handle multiple source programs or multiple programs written in different languages. This is because the assembler can translate only one source language to one target language.
  • The execution time is higher in this scheme, as the program is assembled every time before it is executed.

Explain the term authentication with respect to ASP.NET security.


  • Authentication is the process of obtaining identification credentials, such as a name and password, from a user and validating those credentials against some authority.
  • If the credentials are valid, the entity that submitted the credentials is considered an authenticated identity.
ASP.NET provides the following ways of handling authentication:
  • Windows authentication: In this methodology, ASP.NET web pages use local Windows users and groups to authenticate and authorize access to resources.
  • Forms authentication: This is a cookie-based authentication in which an authentication ticket is stored on the client machine as a cookie or is sent through the URL with every request. Forms-based authentication presents the user with an HTML-based web page that prompts the user for credentials.
  • Passport authentication: Passport authentication is based on the Passport website provided by Microsoft. When a user logs in with credentials, the request is forwarded to the Passport website (e.g. Hotmail, Devhood, Windows Live), where authentication happens. If authentication is successful, a token is returned to your website.
  • Anonymous access: If you do not want any kind of authentication, you can go for anonymous access.
To enable a specific authentication provider for an ASP.NET application, you must create an entry in the application's configuration file as follows:
    <!-- web.config file -->
    <authentication mode = "[Windows/Forms/Passport/None]">
    </authentication>

Write short note on Decision Tree based Classification Approach

  • In decision tree induction, the training dataset must be class-labeled so that the decision tree can be learned from it.
  • A decision tree represents rules and is a very popular tool for classification and prediction.
  • Rules are easy to understand and can be directly used in SQL to retrieve records from the database.
  • Recognizing and validating the knowledge discovered by the decision model is a crucial task.
  • There are many algorithms to build decision tree:
    • ID3 (Iterative Dichotomiser)
    • C4.5 (Successor of ID3)
    • CART (Classification and Regression Tree)
    • CHAID (Chi-square Automatic Interaction Detector)
Decision Tree representation
  • A decision tree classifier has a tree structure made up of leaf nodes and decision nodes.
  • A leaf node is the last node of each branch and indicates the class label or value of the target attribute.
  • A decision node is a node of the tree that has leaf nodes or sub-trees below it. A test is carried out on the attribute at each decision node, and its outcome either gives the class label or leads to the next sub-tree.
Fig. Decision tree representation for Play Tennis
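
The structure of decision nodes and leaf nodes can be sketched as nested dictionaries. The figure is not reproduced here, so the tree below is assumed to be the widely used textbook Play Tennis tree (Outlook at the root, then Humidity or Wind); the names PLAY_TENNIS_TREE and classify are hypothetical and used only for this illustration.

```python
# Decision nodes are dicts holding the attribute to test and one branch per
# value; leaf nodes are plain strings holding the class label.
# Assumption: the classic textbook Play Tennis tree.
PLAY_TENNIS_TREE = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {
            "attribute": "Humidity",
            "branches": {"High": "No", "Normal": "Yes"},
        },
        "Overcast": "Yes",
        "Rain": {
            "attribute": "Wind",
            "branches": {"Strong": "No", "Weak": "Yes"},
        },
    },
}


def classify(node, record):
    """Follow the test at each decision node until a leaf node is reached."""
    while isinstance(node, dict):
        value = record[node["attribute"]]  # test the node's attribute
        node = node["branches"][value]     # move to a leaf or a sub-tree
    return node                            # class label stored in the leaf


day = {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}
print(classify(PLAY_TENNIS_TREE, day))  # No
```

Each root-to-leaf path corresponds to one rule, for example: IF Outlook = Sunny AND Humidity = High THEN Play Tennis = No.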