Introduction

Hadoop adoption is sluggish due to a scarcity of talent and the complexity of public cloud platforms.

We at OPEX LABS have simplified Hadoop adoption by launching a self-service platform, Hadoop as a Service (HAAS365 – www.haas365.org).

HAAS365 enables enterprises and data scientists to set up and experience Hadoop tools and distributions in just minutes, with low risk and reduced OPEX, on the VESPER container platform, transitioning them and their teams into the Big Data analytics world in no time.

How to register

Creating an account on HAAS365

  1. Go to www.haas365.org
  2. Select the Register button
  3. Fill in the registration form
  4. Read and confirm the Terms of Use and Privacy Policy
  5. Submit the registration form. The submission sends a confirmation mail to your email account.
  6. Open your email (check the Inbox, Spam and Junk folders) and confirm your email ID by clicking the link in the email.
    Note that the link is valid for only 3 days.
  7. If you have any issues confirming your email ID, please send an email to support@haas365.com
  8. To activate your account, you NEED to send an email to support@haas365.com

Explore HAAS365 at www.haas365.org and let us know your feedback. We are committed to your satisfaction and will strive to make your Hadoop adoption experience a smooth, positive and conclusive one for your enterprise.

Parsing data in Pig

Parsing data in Pig can be challenging when the data is not well structured. Usually the data is formatted with delimiters, but special characters or data-type mismatches can result in wrong records or dropped fields. In this section, we discuss a common mistake observed while parsing such data and how to correct it.

For the entire exercise, we use the BX-Books data.

pig -x local

grunt> books = LOAD '/home/opexlabs/Books00' USING PigStorage(';') AS (isbn:chararray, title:chararray, author:chararray, year:int, publisher:chararray);

grunt> DUMP books;
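Before dumping, it can help to verify what Pig thinks the schema is. DESCRIBE prints the schema declared in the AS clause; the output should resemble the line shown below:

grunt> DESCRIBE books;
books: {isbn: chararray,title: chararray,author: chararray,year: int,publisher: chararray}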

In the above example, year is defined as int in the schema, but the file stores it as a quoted string, as shown below:



Data - "0452264464";"Beloved (Plume Contemporary Fiction)";"Toni Morrison";"1994";"Plume";

Tuple - ("0452264464","Beloved (Plume Contemporary Fiction)","Toni Morrison",,"Plume")

Due to the data type mismatch, Pig replaces the field with a null, as seen in the empty year position in the tuple above. The issue can be avoided in one of the following ways:

  1. Declare year as chararray and strip the double quotes with REPLACE after loading:
    shell> pig -x local
    grunt> books = LOAD '/home/opexlabs/Books' USING PigStorage(';') AS (isbn:chararray, title:chararray, author:chararray, year:chararray, publisher:chararray);
    grunt> books_parsed = FOREACH books GENERATE isbn, title, author, REPLACE(year,'\\"','') AS year, publisher;
    grunt> DUMP books_parsed;
    grunt> DESCRIBE books_parsed;
  2. Declare year as chararray and extract the numeric value with REGEX_EXTRACT:
    shell> pig -x local
    grunt> books = LOAD '/home/opexlabs/Books' USING PigStorage(';') AS (isbn:chararray, title:chararray, author:chararray, year:chararray, publisher:chararray);
    grunt> books_parsed = FOREACH books GENERATE isbn, title, author, REGEX_EXTRACT(year,'([0-9]+)',1) AS year, publisher;
    grunt> DUMP books_parsed;
    grunt> DESCRIBE books_parsed;
  3. Use CSVExcelStorage, which handles most of the quoting and conversion issues in the data and is recommended over PigStorage for CSV files (the piggybank jar must be registered if it is not already on the classpath). The code would look like:
    shell> pig -x local
    grunt> books = LOAD '/home/opexlabs/Books' USING org.apache.pig.piggybank.storage.CSVExcelStorage(';') AS (isbn:chararray, title:chararray, author:chararray, year:chararray, publisher:chararray);
    grunt> DUMP books;
    grunt> DESCRIBE books;
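After either of the first two fixes, year is still a chararray. If numeric operations on it are needed, it can be cast back explicitly; a minimal sketch reusing the books_parsed relation from the examples above (note that the cast yields a null when the string is not numeric):

grunt> books_typed = FOREACH books_parsed GENERATE isbn, title, author, (int)year AS year, publisher;
grunt> DESCRIBE books_typed;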