What is RDD in PySpark?

RDD in Spark

RDD, short for Resilient Distributed Dataset, is a crucial concept in Apache Spark. It represents a read-only collection of records that are distributed and partitioned across nodes within a cluster. RDDs can be transformed into other RDDs through various operations, and once created, they are immutable, meaning they cannot be modified. Instead, new RDDs are generated from existing ones.

An RDD is a fundamental abstraction in Spark that addresses the limitations of Hadoop. Unlike Hadoop, which replicates data redundantly across machines for fault tolerance, Spark’s RDDs store data across the cluster’s nodes and can recover lost data using a lineage graph. This approach provides fault tolerance while minimizing data redundancy.

There are several ways to create an RDD:
1. Loading an external dataset: 

2. Using the Parallelize method: 

3. Transforming an existing RDD:

Launching Spark-Shell

1. Download and unzip Spark: 

Visit the official Spark website and download the latest version of Spark. Unzip the downloaded file to a desired location on your system.

2. Set up Scala: 

Download Scala from scala-lang.org and install it on your machine. Set the SCALA_HOME environment variable and add the Scala bin directory to your PATH variable.

3. Start the Spark shell: 

Open the command prompt and navigate to the bin folder of the Spark installation directory. Execute the command `spark-shell` to launch the Spark shell, which serves as the driver program for Spark.

Please note that the code snippets provided in this introduction are written in Scala, but RDDs support Python, Java, and Scala objects, including user-defined classes.

1. Loading an external data set: 

Spark provides the `textFile` method in the `SparkContext` class, which can be used to load data from various sources such as Hadoop, HBase, Amazon S3, etc. For example, you can create an RDD by loading a text file using `sc.textFile("file_path")`.

2. Parallelizing a collection: 

You can create an RDD by parallelizing an existing collection in your driver program. The `parallelize` method in the `SparkContext` class can be used to convert a collection into an RDD. For example, you can create an RDD from an array of integers using `sc.parallelize(Array(1, 2, 3, 4, 5))`.

3. Transforming an existing RDD: 

You can create a new RDD by applying transformations on an existing RDD. Spark provides various transformation operations like `map`, `filter`, `flatMap`, etc., which can be used to transform the data in an RDD and create a new RDD with the transformed data. For example, you can create a new RDD by applying a `map` transformation on an existing RDD.
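The snippets in this article are Scala, but since the topic is PySpark, here is a minimal pure-Python sketch of what `map`, `filter`, and `flatMap` compute per record. A plain list stands in for an RDD’s records, so no Spark installation is required:

```python
# Element-wise semantics of RDD transformations, illustrated on a
# plain Python list standing in for an RDD's records (no Spark needed).
data = [1, 2, 3, 4, 5]

# map: exactly one output record per input record
mapped = [x * 2 for x in data]

# filter: keep only the records matching a predicate
filtered = [x for x in data if x % 2 == 0]

# flatMap: zero or more output records per input record, flattened
flat = [y for x in data for y in (x, -x)]

print(mapped)    # [2, 4, 6, 8, 10]
print(filtered)  # [2, 4]
print(flat)      # [1, -1, 2, -2, 3, -3, 4, -4, 5, -5]
```

In actual PySpark these would be `rdd.map(...)`, `rdd.filter(...)`, and `rdd.flatMap(...)`, each returning a new RDD rather than a list.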

4. Caching an RDD: 

If you have an RDD that you want to reuse multiple times in your application, you can cache it in memory using the `cache` method. Caching lets Spark store the RDD’s partitions in memory, enabling faster access during subsequent operations. For example, you can create an RDD by loading a text file and then caching it with the `cache` method.

5. Reading from other data sources: 

Spark and its connector libraries provide APIs to read data from various sources directly into an RDD. For example, `JdbcRDD` can be used to read data from a relational database, and connector libraries add methods such as `cassandraTable` (from the Cassandra connector) to create RDDs from other data stores.

These are some of the common ways to create RDDs in Spark. Depending on your specific use case and data source, you can choose the appropriate method to create an RDD that suits your needs.

Introduction to Spark DataFrame

A Spark DataFrame is a distributed collection of data organized into named columns (a tabular format of rows and columns). It supports operations such as filtering, aggregation, and grouping, and integrates seamlessly with Spark SQL. DataFrames can be created from structured data files, existing RDDs, external databases, or Hive tables. They act as an abstraction layer built on top of RDDs, and in later versions of Spark (2.0+) they were complemented by the Dataset API. Unlike Datasets, DataFrames are available not only in Scala but also in PySpark. They provide a logical columnar view that simplifies working with RDDs while offering the same capabilities. Conceptually, DataFrames are similar to relational tables and come with optimization features and techniques for efficient data processing.

A DataFrame can be created in several ways: from Hive tables, external databases, structured data files, or existing RDDs. Each approach yields a collection of named columns ready for data processing in Apache Spark. Applications can use SQLContext or SparkSession to create DataFrames.

Operations on Spark DataFrames:

In Spark, a DataFrame represents a distributed and organized collection of data in named columns, similar to a relational database or a structured data frame in languages like R or Python. It offers enhanced optimizations for efficient data manipulation.

The following are some fundamental operations for structured data processing using DataFrames:

1. Data Input: 

To create a DataFrame, you can read data from a file, such as a JSON file, and the field names will be inferred automatically. Here’s an example:

```scala
val dfs = sqlContext.read.json("student.json")
```

2. Displaying Data: 

To visualize the data in a DataFrame, you can use the `show()` command. For instance:

```scala
dfs.show()
```

This will present the student data in a tabular format.

3. Examining Schema: 

To view the structure or schema of a DataFrame, you can use the `printSchema()` method. Here’s an example:

```scala
dfs.printSchema()
```

This will output the structure of the DataFrame.

Advantages of Spark DataFrames

DataFrames are distributed, partitioned collections of data organized into named columns. They resemble tables in relational databases and offer a wide range of optimizations. DataFrames support SQL queries and can be used in PySpark programs to process both structured and unstructured big data.

The Catalyst optimizer (together with the Tungsten execution engine) simplifies and enhances query optimization. DataFrame APIs are available in multiple languages, including Python, Scala, Java, and R.

DataFrames provide seamless compatibility with Hive, allowing unmodified Hive queries to be executed on existing Hive warehouses. They exhibit excellent scalability, ranging from a few kilobytes on personal systems to petabytes on large clusters.

DataFrames integrate easily with various big data technologies and frameworks. The abstraction they provide over RDDs is efficient, resulting in faster processing.

By employing these operations, you can effectively work with DataFrames in Spark and perform various data manipulation tasks.





Percentage (Quantitative Aptitude)

Percentage

Percent means so many hundredths. Example: z% is z percent, which means z hundredths. It is written as:

z% = z/100

The fraction p/q expressed as a percent: ((p/q) × 100)%
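A quick sanity check of these two conversions in plain Python (function names are mine):

```python
def percent_to_fraction(z):
    """z% means z hundredths: z% = z/100."""
    return z / 100

def fraction_to_percent(p, q):
    """Express the fraction p/q as a percentage: (p/q) * 100 %."""
    return (p / q) * 100

print(percent_to_fraction(25))    # 0.25
print(fraction_to_percent(1, 4))  # 25.0
```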

Commodity

If the price of a commodity increases by R%, the reduction in consumption required to keep the expenditure unchanged is: [(R/(100 + R)) × 100]%

If the price of a commodity decreases by R%, the increase in consumption required to keep the expenditure unchanged is: [(R/(100 − R)) × 100]%
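These adjustments can be verified numerically; a short plain-Python sketch (names are mine) confirms that cutting consumption by R/(100 + R) × 100 % exactly offsets an R% price rise:

```python
def consumption_cut_pct(R):
    """% reduction in consumption that offsets an R% price increase."""
    return R / (100 + R) * 100

def consumption_rise_pct(R):
    """% increase in consumption that offsets an R% price decrease."""
    return R / (100 - R) * 100

# Verify: price up 25% -> cut consumption by 20% -> expenditure unchanged.
price, qty = 100.0, 10.0
R = 25
new_price = price * (1 + R / 100)
new_qty = qty * (1 - consumption_cut_pct(R) / 100)

print(consumption_cut_pct(R))                       # 20.0
print(round(new_price * new_qty), round(price * qty))  # 1000 1000
```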

Population

Let the population of a city be P and suppose it increases at the rate of R% per annum:

Population after t years: P × (1 + R/100)^t

Population t years ago: P / (1 + R/100)^t

  • If P is R% more than Q, then Q is less than P by [(R/(100 + R)) × 100]%.
  • If P is R% less than Q, then Q is more than P by [(R/(100 − R)) × 100]%.
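A small plain-Python sketch of the growth formulas (names are mine); note that the t-years-ago value divides by the growth factor:

```python
def population_after(P, R, t):
    """Population after t years, growing at R% per annum: P*(1+R/100)**t."""
    return P * (1 + R / 100) ** t

def population_ago(P, R, t):
    """Population t years ago, given present population P: P/(1+R/100)**t."""
    return P / (1 + R / 100) ** t

print(round(population_after(10000, 10, 2)))  # 12100
print(round(population_ago(12100, 10, 2)))    # 10000
```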

Depreciation

Let V be the present value of a machine, and suppose it depreciates at the rate of R% per annum:

Machine’s value after t years: V × (1 − R/100)^t

Machine’s value t years ago: V / (1 − R/100)^t
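The depreciation formulas mirror the population ones with a minus sign; a plain-Python sketch (names are mine):

```python
def value_after(V, R, t):
    """Machine's value after t years at R% depreciation: V*(1-R/100)**t."""
    return V * (1 - R / 100) ** t

def value_ago(V, R, t):
    """Machine's value t years ago, given present value V: V/(1-R/100)**t."""
    return V / (1 - R / 100) ** t

print(round(value_after(100000, 10, 2)))  # 81000
print(round(value_ago(81000, 10, 2)))     # 100000
```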

Quantitative Aptitude: Sequences and Series

1. Sequence

A sequence represents numbers formed in succession and arranged in a fixed order defined by a certain rule.

2. Arithmetic Progression (A.P.)

It is a type of sequence in which each term (except the first) differs from its preceding term by a constant. This constant is termed the common difference.

3. A.P. Terminology

  • The first term is denoted ‘a’.
  • The common difference is denoted ‘d’.
  • The nth term is denoted ‘Tn’.
  • The sum of the first n terms is denoted ‘Sn’.

A.P. Examples

  • 1, 3, 5, 7, … is an A.P. where a = 1 and d = 3 – 1 = 2.
  • 7, 5, 3, 1, – 1 … is an A.P. where a = 7 and d = 5 – 7 = -2.

General term of A.P.

Tn = a + (n - 1)d

Where a is the first term, n is the number of terms, and d is the common difference.

Sum of n terms of A.P.

Sn = (n/2)[2a + (n - 1)d]

Where a is the first term, n is the number of terms, and d is the common difference. There is another form of the same formula:

Sn = (n/2)(a + l)

Where a is the first term, n is the number of terms, and l is the last term.
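Both forms of the sum formula can be checked against each other on the example A.P. 1, 3, 5, 7, 9 in a few lines of Python (names are mine):

```python
def ap_term(a, d, n):
    """n-th term of an A.P.: Tn = a + (n - 1) * d."""
    return a + (n - 1) * d

def ap_sum(a, d, n):
    """Sum of the first n terms: Sn = (n/2) * (2a + (n - 1) * d)."""
    return n * (2 * a + (n - 1) * d) / 2

a, d, n = 1, 2, 5        # the A.P. 1, 3, 5, 7, 9
l = ap_term(a, d, n)     # last term

print(ap_term(a, d, n))  # 9
print(ap_sum(a, d, n))   # 25.0
print(n * (a + l) / 2)   # 25.0  (second form, Sn = (n/2)(a + l))
```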

Geometric Progression (G.P.)

It is a type of sequence in which each term (except the first) bears a constant ratio to its preceding term. This constant is termed the common ratio.

G.P. Terminology

  • The first term is denoted ‘a’.
  • The common ratio is denoted ‘r’.
  • The nth term is denoted ‘Tn’.
  • The sum of the first n terms is denoted ‘Sn’.

G.P. Examples

  • 3, 9, 27, 81, … is a G.P. where a = 3 and r = 9 / 3 = 3.
  • 81, 27, 9, 3, 1 … is a G.P. where a = 81 and r = 27 / 81 = (1/3).

General term of G.P.

Tn = ar^(n-1)

Where a is the first term, n is the number of terms, and r is the common ratio.

Sum of n terms of G.P.

Sn = a(1 - r^n)/(1 - r)

Where a is first term, n is count of terms, r is the common ratio and r < 1. There is another variation of the same formula:

Sn = a(r^n - 1)/(r - 1)

Where a is first term, n is count of terms, r is the common ratio and r > 1.
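A quick numeric check of the G.P. formulas against the example 3, 9, 27, 81 (names are mine):

```python
def gp_term(a, r, n):
    """n-th term of a G.P.: Tn = a * r**(n - 1)."""
    return a * r ** (n - 1)

def gp_sum(a, r, n):
    """Sum of the first n terms (r != 1): Sn = a * (r**n - 1) / (r - 1)."""
    return a * (r ** n - 1) / (r - 1)

print(gp_term(3, 3, 4))  # 81
print(gp_sum(3, 3, 4))   # 120.0  (3 + 9 + 27 + 81)
```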

Arithmetic Mean

The arithmetic mean of two numbers a and b is:

Arithmetic Mean = (1/2)(a + b)

Geometric Mean

The geometric mean of two numbers a and b is:

Geometric Mean = √(ab)
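Both means in a few lines of Python (names are mine):

```python
import math

def arithmetic_mean(a, b):
    """AM of a and b: (a + b) / 2."""
    return (a + b) / 2

def geometric_mean(a, b):
    """GM of a and b: sqrt(a * b)."""
    return math.sqrt(a * b)

print(arithmetic_mean(4, 16))  # 10.0
print(geometric_mean(4, 16))   # 8.0
```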

General Formulas

1 + 2 + 3 + ... + n = (1/2)n(n+1)
1^2 + 2^2 + 3^2 + ... + n^2 = n(n+1)(2n+1)/6
1^3 + 2^3 + 3^3 + ... + n^3 = [(1/2)n(n+1)]^2
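Each identity can be checked against a brute-force sum (a plain-Python sketch, names are mine):

```python
def sum_n(n):
    """1 + 2 + ... + n = n(n+1)/2."""
    return n * (n + 1) // 2

def sum_squares(n):
    """1^2 + 2^2 + ... + n^2 = n(n+1)(2n+1)/6."""
    return n * (n + 1) * (2 * n + 1) // 6

def sum_cubes(n):
    """1^3 + 2^3 + ... + n^3 = [n(n+1)/2]^2."""
    return (n * (n + 1) // 2) ** 2

n = 10
print(sum_n(n), sum(range(1, n + 1)))                       # 55 55
print(sum_squares(n), sum(k * k for k in range(1, n + 1)))  # 385 385
print(sum_cubes(n), sum(k ** 3 for k in range(1, n + 1)))   # 3025 3025
```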

