Sunday, September 8, 2019

AWS EMR: some basic learnings (Python and Spark)

As a new user of this service, I found it a bit confusing to start with, especially as there seems to be an endless number of contradicting articles about how to add steps and what they should execute.

My main issue was with this part of the documentation:

Steps=[
    {
        'Name': 'string',
        'ActionOnFailure': 'TERMINATE_JOB_FLOW'|'TERMINATE_CLUSTER'|'CANCEL_AND_WAIT'|'CONTINUE',
        'HadoopJarStep': {
            'Properties': [
                {
                    'Key': 'string',
                    'Value': 'string'
                },
            ],
            'Jar': 'string',
            'MainClass': 'string',
            'Args': [
                'string',
            ]
        }
    },
],


It looks like Jar is not an optional parameter, so how am I supposed to run Python? I don't have any jars...

So here is what I discovered:
AWS EMR provides two jars:

script-runner.jar and command-runner.jar

script-runner.jar executes a script, so it takes the script file (an S3 path) as its argument.
command-runner.jar is similar to opening an ssh connection to the cluster and running a command.
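
For example, a minimal script-runner step might look like this (a sketch; the bucket and script name are placeholders, and note that script-runner.jar lives at a region-specific S3 path):

{
    'Name': 'Run Shell Script',
    'ActionOnFailure': 'CONTINUE',
    'HadoopJarStep': {
        # Region-specific path; replace us-east-1 with your cluster's region
        'Jar': 's3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar',
        # script-runner.jar takes the script's S3 path as its argument
        'Args': ['s3://my-bucket/my_script.sh']
    }
}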

For my use case I think command-runner was the best fit, so for the simplest case of running a spark-submit command with a Python file, my step becomes:


{
    "Name": "Python Step",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Properties": [],
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit",
            "s3://bucket/my_spark_python_file.py",
        ]
    }
}

Each element of Args is joined with a space, so the command that is going to be executed is: "spark-submit s3://bucket/my_spark_python_file.py"

Any additional parameter I would like to pass, I just add as another list element, constructing the command just as if I were running it in bash.

For example: '--executor-memory', '5g', '--something-else', 'else'
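
So a step with extra spark-submit flags might look like this (a sketch; the bucket path and the --something-else flag are just the placeholders from above). Note that spark-submit flags have to come before the application file:

{
    "Name": "Python Step",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit",
            "--executor-memory", "5g",              # spark-submit flags first
            "--something-else", "else",             # placeholder flag from above
            "s3://bucket/my_spark_python_file.py",  # application file last
        ]
    }
}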

Also, to enable logging and debugging, I needed to add another step:
{
    'Name': 'Setup Hadoop Debugging',
    'ActionOnFailure': 'TERMINATE_CLUSTER',
    'HadoopJarStep': {
        'Jar': 'command-runner.jar',
        'Args': ['state-pusher-script']
    }
}
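
To tie it all together, here is a minimal sketch of submitting both steps to an existing cluster with boto3 (the cluster id, region, and bucket path are placeholders):

import boto3

emr = boto3.client('emr', region_name='us-east-1')  # placeholder region

response = emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',  # placeholder cluster id
    Steps=[
        # Debugging step first, so logs get pushed from the start
        {
            'Name': 'Setup Hadoop Debugging',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['state-pusher-script']
            }
        },
        # The actual Spark job
        {
            'Name': 'Python Step',
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': [
                    'spark-submit',
                    's3://bucket/my_spark_python_file.py',
                ]
            }
        },
    ],
)
print(response['StepIds'])  # step ids, useful for polling with describe_step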