Spark Word Count Program in Python

Here is a word count program in Python using Spark (PySpark) and Hadoop HDFS. In this tutorial, you will learn how to process data in Spark using RDDs, how to store or move a file into Hadoop HDFS, and how to read that file into a Spark job using a Python command-line argument.

Note: Make sure Hadoop is installed on your system and running. You can check this by running the 'jps' command in your terminal, which lists the running Hadoop daemons. (If Hadoop is not installed, follow a step-by-step installation guide first.)
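
For example, on a typical single-node setup, jps should list the Hadoop daemons along with their process IDs; the IDs below are placeholders and the exact daemon list depends on your configuration:

    jps
    4368 NameNode
    4547 DataNode
    4768 SecondaryNameNode
    4985 ResourceManager
    5143 NodeManager
    5410 Jps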

1. Python code to count the occurrences of each word in a file.

from __future__ import print_function
import sys
import findspark
findspark.init()  # make the local Spark installation importable from plain Python
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Expect exactly one argument: the path of the file to count.
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        sys.exit(-1)

    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    # Read the file as a DataFrame of lines, then drop down to an RDD of strings.
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    # Split each line into words, pair each word with a count of 1,
    # and sum the counts per word.
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    # Bring the (word, count) pairs back to the driver and print them.
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

    spark.stop()
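
To see what the flatMap / map / reduceByKey pipeline produces, here is a minimal sketch you can run in local mode, with a couple of hypothetical in-memory lines standing in for the input file:

from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("WordCountDemo").getOrCreate()

# Two sample lines standing in for the file contents.
lines = spark.sparkContext.parallelize(["to be or not to be", "to see or not"])
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)

# Order is not guaranteed; e.g. [('to', 3), ('be', 2), ('or', 2), ('not', 2), ('see', 1)]
print(counts.collect())

spark.stop()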

2. Move the word file to HDFS (the destination path here matches the one used in step 3):
    hadoop fs -put /home/username/file/path/words.txt /usr/words.txt
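
You can confirm the upload with the standard HDFS listing and cat commands, assuming the destination path used above:

    hadoop fs -ls /usr
    hadoop fs -cat /usr/words.txt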

3. Run the Python code:
    python word_count.py hdfs:///usr/words.txt
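
Because the script calls findspark.init(), it can be launched with a plain Python interpreter; an equivalent way is to submit it through spark-submit, which sets up the Spark environment itself:

    spark-submit word_count.py hdfs:///usr/words.txt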

4. Yeah!! You have done it: the output is returned with each word and its number of occurrences.