Blog

As part of the digital transformation, companies are moving their infrastructure and applications to the cloud at a faster rate than ever.

There are many approaches to migrating to the cloud – each with their own benefits and drawbacks. It’s important to be knowledgeable on each option in order to make the best decision possible for your organization.

The three primary choices are Rehost, Replatform, and Refactor, which we will walk through now.

 

 

 

Rehost

Rehosting, often also called lift and shift, is the simplest of the migration options. Applications are simply moved from on-premise to the cloud without any code modification. This is considered a beginner’s approach to migration. Some of the benefits include that it’s a very fast option and requires very little resources. There is also minimal application disruption and it is cheaper than maintaining an on-premises environment. Because this migration is so simple, companies don’t typically benefit from cloud-native features like elasticity, which can be achieved from the other migration techniques.

Overall, if a company is looking for a quick and easy migration that doesn’t disrupt the existing application workflow, Rehosting is the best choice. This is a fast solution for organizations that need to reduce their on-premises physical infrastructure costs as soon as possible. Thankfully, companies can always re-architect and optimize their application once they are already in the cloud.

 

Replatform

Replatforming involves moving a company’s assets to the cloud with a little up-versioning. A portion of the application is changed or optimized before moving to the cloud. Even a small amount of cloud optimization (without changing the core application structure) can lead to significant benefits. This approach takes advantage of containers and VMs, only changing application code if needed to use base platform services.

Replatform provides a suitable middle ground between rehosting and refactoring. It allows companies to take advantage of cloud functionality and cost optimization without using the resources required for refactoring. This approach also allows developers to use the resources they are used to working with, including development frameworks and legacy programming languages. This approach is slower than rehosting and doesn’t provide as many benefits as refactoring.

Organizations should choose this approach if they are looking to leverage more cloud benefits and if minor changes won’t change their applications functioning. Also, if a company’s on-premises infrastructure is complex and is preventing scalability and performance, some slight modifications that would allow them to harness these features in the cloud would be very worthwhile.

 

Refactor

The most complex option is refactoring, which includes a more advanced process of rearchitecting and recoding some portion of an existing application. Unlike Replatforming, this option makes major changes in the application configuration and the application code in order to best utilize cloud-native frameworks and functionality. Due to this, refactoring typically offers the lowest monthly cloud costs. Customers who refactor are maximizing operational cost efficiency in the cloud. Unfortunately, this approach is also very time consuming and resource-intensive.

Companies should choose to refactor when there is a strong business need to add features and performance to the application that is only available in the cloud, including scalability and elasticity. Refactoring puts a company in the best position to boost agility and improve business continuity.

 

There is no migration approach that is always the best option for every case. Rather, companies should take into consideration their short- and long-term business goals and choose what is right for their current situation. If you need help deciding, contact us to discuss options – we’re always happy to talk!

 

 Like

If you’re here, you’re probably experiencing a common issue: trying to access a certain port on an EC2 Instance located in a private subnet of the Virtual Private Cloud (VPC). A couple of months ago, we got a call from one of our customers that was experiencing the same issue. They wanted to open up their API servers on the VPC to one of their customers, but they didn’t know how. In particular, they were looking for a solution that wouldn’t compromise the security of their environment. We realized this issue is not unique to our customer, so we thought a blog post explaining how we solved it would be helpful!

To provide some context, once you have an API server within your VPC, it is closed to the outside world. No one can access or reach that server because of the strong firewall around it. There are a few ways around this, including Virtual Private Network (VPN) connections to your VPC, which allows you to open up private access. Unfortunately, this is not a viable solution if you need to open up your API server to the world, which was the case with our customer. The goal was to provide direct access from the internet outside the VPC for any user without VPN connection.

In order to solve this issue for our customer, one of the architecture changes we recommended was adding an internet-facing AWS TCP Network Load Balancer on the public subnet of the VPC. In addition to this load balancer, we also needed to create an instance-based target group.

Keep reading to learn how you can do this – we even included an architecture diagram to make things easier! Please note that our example includes fake IP addresses.

Problem: Accessing an API endpoint in an EC2 Instance in a Private Subnet from the Internet.

Suggested AWS Architecture Diagram:

 

 

Features of our diagram:

  • Multi AZ: we used a private and public subnet in the same VPC in two different availability zones.
  • Multi EC2 (API Servers): we deployed an API server in each private subnet in each availability zone.
  • Multi NAT Gateways: a NAT gateway will allow the EC2 instances in the private subnets to connect to the internet and achieve high availability. We deployed one NAT gateway in the public subnets in each availability zone.
  • TCP Load balancer health checks: a TCP load balancer will always redirect any user’s requests to the healthy API servers. In case one AZ goes down, there will be another AZ that can handle any user’s requests.

Although we did not make this change, you can also implement Multi-Region to handle a region failure scenario and enable higher availability.

VPC Configurations:

SubnetAZCIDRIGW Route OutNAT GW

Route

Out

public-subnet-aus-east-1a172.16.0.0/24YesNo
public-subnet-bus-east-1b172.16.3.0/24YesNo
private-subnet-aus-east-1a172.16.1.0/24NoYes
private-subnet-bus-east-1b172.16.2.0/24NoYes

 

EC2 Configuration:

NameAZSubnetPrivate IPSecurity Group
API Server1us-east-1aprivate-subnet-a172.16.1.**Allow inbound traffic to TCP port 5000 from 0.0.0.0/0 or any specific source IP address on internet.
API Server2us-east-1bprivate-subnet-b172.16.2.**Allow inbound traffic to TCP port 5000 from 0.0.0.0/0 or any specific source IP address on internet.

 

Solution:

  • Create a TCP network load balancer:
    • Internet facing
    • Add listener on TCP port 5000
    • Choose public subnets with same availability zone (AZ) as your private subnets
  • Create an instance based target group:
    • Use TCP protocol on port 5000
    • For health check, either use TCP on port 5000 or HTTP health check path
    • Add the two API servers to the target instances to achieve high availability and balance the request load between different servers
  • Once the target instances (API servers) become healthy, you will be able to access the API endpoints from the public internet directly using the new TCP load balancer DNS name or elastic IP address on port 5000

 

References:

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/create-network-load-balancer.html

 Like

IDC recently produced a report analyzing the Windows Server Market (access a summary here). The report discovered that more and more organizations are transitioning their Windows workloads to the public cloud. Windows Server Deployments in the cloud more than doubled from 8.8% in 2014 to 17.6% in 2017. Migrating to the cloud allows organizations to grow and work past the limitations of on-premises data centers, providing improved scalability, flexibility, and agility.

The report found that of all Windows Cloud Deployments, 57.7% were hosted on Amazon Web Services (AWS). AWS’s share of the market was nearly 2x that of the nearest cloud provider in 2017. Second in line was Microsoft Azure, which hosted 30.9% of Windows instances deployed on the public cloud IaaS market. Over time, the Windows public cloud IaaS market will continue to expand because of the growing usage of public cloud IaaS among enterprises and the movement of Windows workloads into public cloud IaaS.

When it comes to Windows, AWS helps you build, deploy, scale, and manage Microsoft applications quickly and easily. AWS also has top-notch security and a cost-effective pay-as-you-go model.  On top of this, AWS provides customers with a fully-managed database service to run Microsoft SQL Server. These services give customers the ability to build web, mobile, and custom business applications.

This begs one to ask the question: why is AWS the most sought-after cloud hosting service, considering there are other options in the market? We believe it is because of the plethora of services AWS offers. As a software company in particular, if you’re looking to migrate and rearchitect your legacy system, AWS provides an excellent set of services that companies can utilize to make their application cloud native. The richness of these services often allows companies to decouple their legacy architecture without needing to overhaul the entire legacy application within their Windows platform.

Let’s walk through a few examples. For starters, take an application that would traditionally host storage-intensive data such as video, pictures, and documents. This application can keep its Windows platform as is and move the files into Amazon S3 with minimal architecture change. AWS allows its customers to decouple their storage and move it into S3 in a cost-effective, secure, and efficient manner. Next, applications with heavy back-end processing can benefit from services such as lambda, EMR, and Spark. AWS allows customers to decouple the compute needs into relevant services for their application. Lastly, imagine an application that hosts significant historical data with requirements to search or query that data. Traditionally, some customers would need to keep all the historical and archived data in a database, but with AWS they can maintain the same architecture and move the archived data to S3 and use services like Amazon Athena and Elasticsearch to query and search this data without a need for a database. This helps to reduce their footprint on a database and cut down on costs.

Examples like these demonstrate why we believe AWS is a superior cloud-hosting service for Windows workloads. If a company is looking at their architecture holistically, AWS provides a comprehensive solution for compute, networking, storage, security, analytics, and deployment. This has been proven to us time and time again through customer migrations.

dbSeer has a strong track record of helping customers successfully migrate to the AWS Cloud. Contact us anytime to learn more!

 Like

Earlier this year, we wrote a blog about how to use AWS Auto Scaling with Logi Analytics Applications. In that blog, we promised to release a step-by-step guide outlining the technical details of how a Logi Application can be configured to harness the scalability and elasticity features of AWS. If you were wondering when that would be released, the wait is over and you have come to the right place! Without further ado…

Enabling a multi-web server Logi application on AWS Windows instances requires the right configuration for some of the shared Logi files (cache files, secure key, bookmarks, etc.). To support these shared files, we need a shared network drive that can be accessed by the different Logi webservers. Currently EFS (Elastic File Storage) is not supported on Windows on AWS. Below we have defined how EFS can be mounted on Windows servers and setup so that you can utilize the scalability feature of Logi.

Setting Up the File Server
Overview:
In order for our distributed Logi application to function properly, it needs access to a shared file location. This can be easily implemented with Amazon’s Elastic File System (EFS). However, if you’re using a Windows server to run your Logi application, extra steps are necessary, as Windows does not currently support EFS drives. In order to get around this constraint, it is necessary to create Linux based EC2 instances to serve as an in-between file server. The EFS volumes will be mounted on these locations and then our Windows servers will access the files via the Samba (SMB) protocol.

Steps:

  1. Create EC2
    • Follow the steps as outlined in this AWS Get Started guide and choose:
      • Image: “Ubuntu Server 16.04 LTS (HVM), SSD Volume Type”
      • Create an Instance with desired type e.g.: “t2.micro”
  1. Create AWS EFS volume:
    • Follow the steps listed here and use same VPC and availability zone as used above
  2. Setup AWS EFS inside the EC2:
    • Connect to the EC2 instance we created in Step 1 using SSH
    • Mount the EFS to the EC2 using the following commands:
      sudo apt-get install -y nfs-common
      mkdir /mnt/efs
      mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 EFS_IP_ADDRESS_HERE:/ /mnt/efs
  3. Re-export NFS share to be used in Windows:
    • Give our Windows user access to its files. Let’s do this using samba. Again, drop the following to your shell for installing SMB(Samba) services in your Ubuntu EC2
    • Run the following commands:
      apt-get install -y samba samba-common python-glade2 system-config-samba
      cp -pf /etc/samba/smb.conf /etc/samba/smb.conf.bak
      cat /dev/null -> /etc/samba/smb.conf
      nano /etc/samba/smb.conf
    • And then, paste the text below inside the smb.conf file:
      [global]
      workgroup = WORKGROUP
      server string = AWS-EFS-Windows
      netbios name = ubuntu
      dns proxy = no
      socket options = TCP_NODELAY[efs]
      path = /mnt/efs
      read only = no
      browseable = yes
      guest ok = yes
      writeable = yes
    • Create a Samba user/password. Use the same credentials as your EC2 user
      sudo smbpasswd –a ubuntu
    • Give Ubuntu user access to the mounted folder:
      sudo chown ubuntu:ubuntu /mnt/efs/
    • And finally, restart the samba service:
      sudo /etc/init.d/smbd restart

 

Setting up the Application Server

Overview:
Logi applications require setup in the form of settings, files, licenses, and more. In order to accommodate the elastic auto-scaling, we’ll set up one server – from creation to connecting to our shared drive to installing and configuring Logi – and then make an Amazon Machine Image (AMI) for use later.

Steps:

  1. Create EC2:
    • Follow the steps as outlined in this AWS Get Started guide and choose:
      • Image: “Microsoft Windows Server 2016 Base”
      • Instance type: “t2.micro” or whatever type your application requires
  2. Deploy code:
    • Clone your project repository and deploy the code in IIS
  3. Set Up User Access:
    • Allow your application in IIS to access the shared folder (EFS) that we created inside the File server
    • From the control panel, choose users accounts → manage another account → add a user account
    • Use same username and password we created for the samba user in Ubuntu file server
    • In IIS, add the new Windows user we created above to the application connection pool, IIS → Application Pools → right click on your project application pool → identity → custom account → fill in the new username and password we created earlier.
  4. Test EFS (shared folder) connection:
    • To test the connection between Windows application server and Ubuntu file server, go to:
      • This PC → computer tap → map network drive → in folder textbox type in “\\FILE_SERVER_IP_ADDRESS\efs” → If credentials window appears for you, just use the new username and password we created earlier.

 

Configuring the Logi Application

Sticky and Non-Sticky Sessions
In a standard environment with one server, a session is established with the first HTTP request and all subsequent requests, for the life of the session, will be handled by that same server. However, in a load-balanced or clustered environment, there are two possibilities for handling requests: “sticky” sessions (sometimes called session affinity) and “non-sticky” sessions.

Use a sticky session to handle HTTP requests by centralizing the location of any shared resources and managing session state. You must create a centralized, shared location for cached data (rdDataCache folder), saved Bookmark files, _metaData folder, and saved Dashboard files because they must be accessible to all servers in the cluster.

Managing Session State
IIS is configured by default to manage session information using the “InProc” option. For both standalone and load-balanced, sticky environments, this option allows a single server to manage the session information for the life of the session.

Centralization of Application Resources
In a load-balanced environment, each web server must have Logi Server installed and properly licensed, and must have its own copy of the Logi application with its folder structure, system files, etc. This includes everything in the _SupportFiles folder such as images, style sheets, XML data files, etc., any custom themes, and any HTML or script files. We will achieve this by creating one instance with all the proper configurations, and then using an AMI.

Some application files should be centralized, which also allows for easier configuration management. These files include:

Definitions: Copies of report, process, widget, template, and any other necessary definitions (except _Settings) can be installed on each web server as part of the application, or centralized definitions may be used for easier maintenance (if desired).

The location of definitions is configured in _Settings definition, using the Path element’s Alternative Definition Folder attribute, as shown above. This should be set to the UNC path to a shared network location accessible by all web servers, and the attribute value should include the _Definitions folder. Physically, within that folder, you should create the folders _Reports, _Processes, _Widgets, and _Templates as necessary. Do not include the _Settings definition in any alternate location; it must remain in the application folder on the web server as usual.

“Saved” Files: Many super-elements, such as the Dashboard and Analysis Grid, allow the user to save the current configuration to a file for later reuse. The locations of these files are specified in attributes of the elements.

As shown in the example above, the Save File attribute value should be the UNC path to a shared network location (with file name, if applicable) accessible by all web servers.

Bookmarks: If used in an application, the location of these files should also be centralized:

As shown above, in the _Settings definition, configure the General element’s Bookmark Folder Location attribute, with a UNC path to a shared network folder accessible by all web servers.

 

Using SecureKey security:
If you’re using Logi SecureKey security in a load-balanced environment, you need to configure security to share requests.

In the _Settings definition, set the Security element’s SecureKey Shared Folder attribute to a network path, as shown above. Files in the SecureKey folder are automatically deleted over time, so do not use this folder to store other files. It’s required to create the folder rdSecureKey under myProject shared folder, since it’s not auto created by Logi.

Note
: “Authentication Client Addresses” must be replaced later with subnet IP addresses ranges of the load balancer VPC after completing the setup for load balancer below.
You can Specify ranges of IP addresses with wildcards. To use wildcards, specify an IP address, the space character, then the wildcard mask. For example to allow all addresses in the range of 172.16.*.*, specify:
172.16.0.0 0.0.255.255

Centralizing the Data Cache
The data cache repository is, by default, the rdDataCache folder in a Logi application’s root folder. In a standalone environment, where all the requests are processed by the same server, this default cache configuration is sufficient.

In a load-balanced environment, centralizing the data cache repository is required.

This is accomplished in Studio by editing a Logi application’s _Settings definition, as shown above. The General element’s Data Cache Location attribute value should be set to the UNC path of a shared network location accessible by all web servers. This change should be made in the _Settings definition for each instance of the Logi application (i.e. on each web server).

Note: “mySharedFileServer” IP/DNS address should be replaced later with file servers load balancer dns after completing the setup for load balancer below.

 

Creating and Configuring Your Load-Balancer

Overview:
You’ll need to set up load balancers for both the Linux file server and the Windows application/web server. This process is relatively simple and is outlined below, and in the Getting Started guide here.

Steps:

  1. Windows application/web servers load balancer:
    • Use classic load balancers.
    • Use the same VPC that our Ubuntu file server’s uses.
    • Listener configuration: Keep defaults.
    • Health check configuration: Keep defaults and make sure that ping path is exists, i.e. “/myProject/rdlogon.aspx”
    • Add Instances: Add all Windows web/application servers to the load balancer, and check the status. All servers should give “InService” in 20-30 seconds.
    • To enable stickiness, select ELB > port configuration > edit stickiness > choose “enable load balancer generated cookie stickiness”, set expiration period for the same as well.
  2. Linux file servers load balancer:
    • Use classic load balancers.
    • Use the same VPC that the EFS volume uses.
    • Listener configuration: 
    • Health check configuration: Keep defaults and make sure that ping path is exists, i.e. “/index.html” 
      • NOTE: A simple web application must be deployed to the Linux file servers, in order to set the health check. It should be running inside a web container like tomcat, then modify the ping path for the health checker to the deployed application path.
    • Add Instances: Add all Ubuntu file servers to the load balancer, and check the status, all servers should give “InService” in 20-30 seconds.

 

Using Auto-Scaling

Overview:
In order to achieve auto-scaling, you need to set up a Launch Template and an Auto-Scaling Group. You can follow the steps in the link here, or the ones outlined below.

Steps:

  1. Create Launch Configuration:
    • Search and select the AMI that you created above. 
    • Use same security group you used in your app server EC2 instance. (Windows)
  2. Create an Auto Scaling Group
    • Make sure to select the launch configuration that we created above.
    • Make sure to set the group size, aka how many EE2 instances you want to have in the auto scaling group at all times.
    • Make sure to use same VPC we used for the Windows application server EC2s.
    • Set the Auto scaling policies:
      • Set min/max size of the group:
        • Min: minimum number of instances that will be launched at all times.
        • Max: maximum number of instances that will be launched once a metric condition is met.
    • Click on “Scale the Auto Scaling group using step or simple scaling policies” 
    • Set the required values for:
      • Increase group size
        • Make sure that you create a new alarm that will notify your auto scaling group when the CPU utilization exceeds certain limits.
        • Make sure that you specify the action “add” and the number of instances that we want to add when the above alarm triggered.
      • Decrease group size
        • Make sure that you create a new alarm that will notify your auto scaling group when the CPU utilization is below certain limits.
        • Make sure that you specify the action and the number of instances that we want to add when the above alarm is triggered.
    • You can set the warm up time for the EC2, if necessary. This will depend on whether you have any initialization tasks that run after launching the EC2 instance, and if you want to wait for them to finish before starting to use the newly created instance.
    • You can also add a notification service to know when any instance is launched, terminated, failed to launch or failed to terminate by the auto scaling process. 
    • Add tags to the auto scaling group. You can optionally choose to apply these tags to the instances in the group when they launch. 
    • Review your settings and then click on Create Auto Scaling Group.

 

We hope this detailed how-to guide was helpful in helping you set up your Logi Application on AWS. Please reach out if you have any questions or have any other how-to guide requests. We’re always happy to hear from you!

 

References:

https://en.wikipedia.org/wiki/Load_balancing_(computing)

https://devnet.logianalytics.com/rdPage.aspx?rdReport=Article&dnDocID=2222

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html

https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/EC2_GetStarted.html

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancer-getting-started.html

https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-sticky-sessions.html

https://docs.aws.amazon.com/autoscaling/ec2/userguide/autoscaling-load-balancer.html

https://docs.aws.amazon.com/toolkit-for-visual-studio/latest/user-guide/tkv-create-ami-from-instance.html

 

 

 Like

We recently set up a Spark SQL (Spark) and decided to run some tests to compare the performance of Spark and Amazon Redshift. For our benchmarking, we ran four different queries: one filtration based, one aggregation based, one select-join, and one select-join with multiple subqueries. We ran these queries on both Spark and Redshift on datasets ranging from six million rows to 450 million rows.

These tests led to some interesting discoveries. First, when performing filtration and aggregation based queries on datasets larger than 50 million rows, our Spark cluster began to out-perform our Redshift database. As the size of the data increased, so did the performance difference. However, when performing the table joins – and in particular the multiple table scans involved in our subqueries – Redshift outperformed Spark (even on a table with hundreds of millions of rows!).

The Details

Our largest Spark cluster utilized an m1.Large master node, and six m3.xLarge worker nodes. Our aggregation test ran between 2x and 3x faster on datasets larger than 50 million rows. Our filtration test had a similar level of gains, arriving at a similar level of data size.

However, our table scan testing showed the opposite effect at all levels of data size. Redshift generally ran slightly faster than our Spark cluster, the difference in performance increasing as the volume increased, until it capped out at 3x faster on a 450 million row dataset. This result was further corroborated by what we know about Spark. Even with its strengths over Map Reduce computing models, data shuffles are slow and CPU expensive. 

We also discovered that if you attempt to cache a table into RAM and there’s overflow, Spark will write the extra blocks to disk, but the read time when coming off of disk is slower than reading the data directly from S3. On top of this, when the cache was completely full, we started to experience memory crashes at runtime because Spark didn’t have enough execution memory to complete the query. When the executors crashed, things slowed to a standstill. Spark scrambled to create another executor, move the necessary data, recreate the result set, etc. Even though uncached queries were slower than the cached ones, this ceases to be true when Spark runs out of RAM.

Takeaway

In terms of power, Spark is unparalleled at performing filtration and aggregation of large datasets. But when it comes to scanning and re-ordering data, Redshift still has Spark beat. This is just our two cents – let us know what you guys think!

 Like

Spark SQL (Spark) provides powerful data querying capabilities without the need for a traditional relational database. While AWS Elastic Map Reduce (EMR) service has made it easier to create and manage your own cluster, it can still be difficult to set up your data and configure your cluster properly so that you get the most out of your EC2 instance hours. This blog will discuss some of the lessons that we learned when setting up our Spark. Hopefully we can spare you some of the time we experienced when setting ours up! While we believe these best practices should apply to the majority of users, bear in mind that each use case is different and you may need to make small adaptations to fit your project.

 

First Things First: Data Storage

Unsurprisingly, one of the most important questions when querying your data is how that data is stored. We learned that Parquet files perform significantly faster than CSV file format. Parquet is a columnar data storage format commonly used within the Hadoop ecosystem. Spark reads Parquet files much faster than text files due to its capability to perform low-level filtration and create more efficient execution plans. When running on 56 million rows of our data stored as Parquet instead of CSV, speed was dramatically increased while storage space was decreased (see the table below).

Storing Data as Parquet instead of CSV

Partitioning the Data

Onto the next step: partitioning the data within your data store. When reading from S3, Spark takes advantage of partitioning systems and requests only the data that is required for the query. Assuming that you can partition your data along a frequently used filter (such as year, month, date, etc.), you can significantly reduce the amount of time Spark must spend requesting your dataset. There is a cost associated with this, though – the data itself must be partitioned according to the same scheme within your S3 bucket. For example, if you’re partitioning by year, all of your data for 2017 must be located at s3://your-bucket/path/to/data/2017, and all of your data for 2016 must be located at s3://your-bucket/path/to/data/2016, and so on.

You can also form multiple hierarchical layers of partitioning this way (s3://…/2016/01/01 for January 1st, 2016). The downside is that you can’t partition your data in more than one way within one table/bucket and you have to physically partition the data along those lines within the actual filesystem.

Despite these small drawbacks, partitioning your data is worth the time and effort. Although our aggregation query test ran in relatively the same time on both partitioned and unpartitioned data, our filtration query ran 7x faster, and our table scan test ran 6x faster.

Caching the Data

Next, you should cache your data in Spark when you can fit it into the RAM of your cluster. When you run the command ‘CACHE TABLE tablename’ in Spark, it will attempt to load the entire table into the storage memory of the cluster. Why into the storage memory? Because Spark doesn’t have access to 100% of the memory on your machines (something you likely would not want anyways), and because Spark doesn’t use all of the accessible memory for storing cached data. Spark on Amazon EMR is only granted as much RAM as a YARN application, and then Spark reserves about one third of the RAM that it’s given for execution memory, (i.e. temporary storage used while performing reads, shuffles, etc.). As a rule of thumb, the actual cache storage capacity is up to 1/3 of the total memory in your RAM.

Furthermore, when Spark caches data, it doesn’t use the serialization options available to a regular Spark job. As a result, the size of the data actually increases by about two to three times when you load it into memory. Any overflow that doesn’t fit into Spark storage does get saved to disk, but as we will discuss later, this leads to worse performance than if you read the data directly from S3.

 

Configuring Spark – The Easy Way and the Hard (the Right) Way 

The Easy Way

Spark has a bevy of configuration options to control how it uses the resources on the machines in your cluster. It is important to spend time with the documentation to understand how Spark executors and resource allocation works if you want to get the best possible performance out of your cluster. However, when just starting out, Amazon EMR’s version of Spark allows you to use the configuration parameter “maximizeResourceAllocation” to do some of the work for you. Essentially, this configures your cluster to make one executor per worker node, with each executor having full access to all the resources (RAM, cores, etc.) on that node. To do this, simply add “maximizeResourceAllocation true” to your “/usr/lib/spark/conf/spark-defaults.conf” file.

While this may be an ‘acceptable’ way to configure your cluster, it’s rarely the best way. When it comes to cluster configuration, there are no hard and fast rules. How you want to set up your workers will come down to the hardware of your machines, the size and shape of your data, and what kind of queries you expect to run. That being said, there are some guidelines that you can follow.

 The Right Way: Guidelines

First, each executor should have between two and five cores. While it’s possible to jam as many cores into each executor as your machine can handle, developers found that you start to lose efficiency with a core count higher than five or six. Also, while it is possible to make an individual executor for each core, this causes you to lose the benefits of running multiple tasks inside a single JVM. For these reasons, its best to keep the number of cores per executor between two and five, making sure that your cluster as a whole uses all available cores. For example, on a cluster with six worker machines, each machine having eight cores, you would want to create twelve executors, each with four cores. The number of cores is set in spark-defaults.conf, while the number of executors is set at runtime by the “–num-exectors #” command line parameter.

Memory is the next thing you must consider when configuring Spark. When setting the executors’ memory, it’s important to first look at the YARN master’s maximum allocation, which will be displayed on your cluster’s web UI (more on that later). This number is determined by the instance size of your workers and limits the amount of memory each worker can allocate for YARN applications (like your Spark program).

Your executor memory allocation needs to split that amount of memory between your executors, leaving a portion of memory for the YARN manager (usually about 1GB), and remaining within the limit for the YARN container (about 93% of the total).

The following equation can be used for shorthand:

One Last Thing – the Spark Web UI

Earlier in this post we mentioned checking the YARN memory allocation through the Spark Web UI. The Spark Web UI is a browser-based portal for seeing statistics on your Spark application. This includes everything from the status of the executors, to the job/stage information, to storage details for the cluster. It’s an incredibly useful page for debugging and checking the effects of your different configurations. However, the instructions on Amazon’s page for how to connect are not very clear.

In order to connect, follow Amazon’s instructions – located here –  but in addition to opening the local port 8157 to a dynamic port, open a second local port to port 80 (to allow http connections), as well as a third local port to port 8088. This port – the one connected to your cluster’s port 8088 – will give you access the YARN resource manager, and through that page, the Spark Web UI.

Note: the Spark Web UI will only be accessible when your Spark application is running. You can always access the YARN manager, but it will show an empty list if there is no Spark application to display.

 

Now it’s Your Turn!

Unlocking the power of Spark SQL can seem like a daunting task, but performing optimizations doesn’t have to be a major headache. Focusing on data storage and resource allocation is a great starting point and will be applicable for any project. We hope that this post will be helpful to you in your SparkSQL development, and don’t forget to check out the Spark Web UI!

 Like

Hey there! This summer, dbSeer has been keeping pretty busy. We completed a database migration project with one of our customers, Subject7, and then turned it into a case study to share with our great supporters like you.

In the project, our certified AWS architects (and all-around awesome people) designed a new network architecture from the ground-up and moved 50 database instances to Amazon RDS. They did all this while still reducing Subect7’s costs by 45%. If that’s not amazing, tell me what is…I’m waiting.

We know you want to learn more, so you can see the full case study here.

 

If you’re short on time, check out below to see the project at a glance:

Who was the client?

Subject7, who created a no-coding, cloud-based automated testing solution for web and mobile applications.

What was the opportunity?

Subject 7 sought to enhance their back-end architecture with the most optimal resource allocation to prepare for future expansion.

What was dbSeer’s solution?

dbSeer designed a new network architecture from the ground-up, which included moving to Amazon RDS. Once on AWS, dbSeer found the most optimal resource allocation.

What were the results?

dbSeer migrated nearly 50 database instances to RDS with minimum downtime. Subject 7 is now able to scale the back-end server to any size without impact to users. AWS costs decreased by nearly 45% and Subjet7 achieved a positive ROI in only 2 months.

 

If you’re interested in learning more, or have specific questions, or just want to say hi, we always love connecting with our readers. Don’t hesitate to reach out, which you can do here.

 Like
Data Architecture

We’ve got some pretty exciting news to share. Earlier this year, we completed a Migration and Big Data project with one of our customers, the market leader in telecom interconnect business optimization. The project went so well that we made a case study to showcase what we did.

In a nutshell, we demonstrated how they could take advantage of the elasticity and scalability of AWS services to support market expansion. We did this by re-architecting their event processing engine and leveraging AWS elastic services and open source-technologies to provide unlimited scalability. As a result of our work, they increased their processing speeds by 60x at a fraction of the cost.

If you want more details on this project (and trust me, you do) check out this link (Delivring-on-the-AWS-Promise-Migration-Case-Study).

You can learn all you’ve ever wanted about the opportunity client presented, the solution we designed, and the results we earned.

 Like

Amazon RDS is a managed relational database service that provides multiple familiar database engines to choose from (Amazon Aurora, MySQL, MariaDB, Oracle, Microsoft SQL Server, and PostgreSQL). Amazon RDS handles routine database tasks such as provisioning, patching, backup, recovery, failure detection, and repair.

Compared to the hosted databases, RDS is easy to use and the admin effort is very low. Increasing the performance and storage is easy. Monitoring, daily backups and restores can be configured easily.

Existing hosted databases can easily be migrated to AWS using AWS Migration Service. This service supports homogeneous migrations such as Oracle to Oracle, as well as heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora or Microsoft SQL Server to MySQL.

 

RDS Configuration

Above architecture diagram, shows a proposed AWS Architecture for an Enterprise Web Applications.

 

VPC

AWS recommends having your application inside a VPC (Virtual Private Cloud). For a multi-tiered web application, it is recommended to have a private and public subnet within the VPC. The database server should be launched in the private subnet, so that it is isolated and secure. The webservers should be launched in the public subnet. Security and routing needs to be configured so that only the web servers can communicate with the database servers in the private subnet. Since the web application is public facing, there would be an internet route configured for the public subnet.

 

Multi-AZ Deployment

AWS RDS allows multi-AZ deployment to support high availability and reliability.  With this feature, AWS automatically provisions and maintains a synchronous standby replica in a different Availability Zone. AWS synchronously replicates the data from the primary to the secondary database instance. If the primary database instance goes down for any reason AWS will automatically fail over to the secondary database instance.

 

Read Replicas

Read Replicas, can help you scale out beyond the capacity of a single database deployment for read-heavy database workloads. Updates made to the source DB instance are asynchronously copied to the Read Replica. This mechanism would be very useful in case you have a web application and reporting application both using the same database instance. In this scenario, all read only traffic would be routed to the read replicas. The primary database would be used for read and write traffic for the web application.

 

Backup and Maintenance

AWS automatically creates backups of the RDS instance. Amazon RDS creates a storage volume snapshot of your DB instance, backing up the entire DB instance. To reduce performance impact, backups and maintenance should be configured when application usage is low.

 Like

Managing web traffic is a critical part of any web application, and load balancing is a common and efficient solution. Load balancing distributes workloads, aiming to maximize throughput, minimize response time and avoid overloading a single resource. Using auto-scaling in combination with load balancing allows for your system to grow and shrink its distributed resources as necessary, providing a seamless experience to the end user. The scalability and elasticity features of AWS can be easily utilized by any web application built on AWS.

Considerations on setting up a Logi App on AWS

Applications built using Logi have shared components such as the Cached Data Files, Bookmark Files. To provide the seamless experience to the end user, Logi needs to be configured correctly to share these common files across multiple instances as the system scales the resources.

 

To support the shared resources, a shared file system is needed which needs to be shared across the multiple instances as resources get allocated and deallocated. Amazon Elastic File System or EFS is the shared file system on AWS. Currently EFS is only supported on Linux based instances and is not supported on Windows.

To support auto scaling, shared file locations need to be defined in the application settings so that new resources are pre-configured when auto scaling allocates new resources. Whenever you add a new server to the auto-scaling group, it has to be pre-configured with both your Logi application and the correct connections to the distributed resources.

Recommendations with Logi Apps on AWS

The solution to the EFS challenge on Windows Logi Apps involves adding a middle layer of Linux based EC2 instances to the architecture. Mount the EFS volumes on the Linux Instances, and these can be accessed on the Windows servers via the SMB protocol. By adding these Linux instances – and an associated load balancer – it becomes possible to use an EFS volume despite the Windows operating system being unable to mount it directly.

Use of Amazon Machine Images (AMIs) allows the developer to create a single instance with the Logi web app, the server and user settings for accessing the EFS, and their specific Logi application. This AMI can then be used by the auto scaling group to allocate and deallocate new instances. With the setting and report definitions saved in the AMI, and the shared elements such as Bookmarks and data cache saved on the EFS, it becomes possible to use AWS load balancers to implement a distributed and scalable Logi web application.

A detailed, step-by-step guide outlining all the technical details on how a Logi application can be configured to harness the scalability & elasticity features of AWS can be found here.

 Like