
Earlier this year, we wrote a blog about how to use AWS Auto Scaling with Logi Analytics Applications. In that blog, we promised to release a step-by-step guide outlining the technical details of how a Logi Application can be configured to harness the scalability and elasticity features of AWS. If you were wondering when that would be released, the wait is over and you have come to the right place! Without further ado…

Enabling a multi-web server Logi application on AWS Windows instances requires the right configuration for some of the shared Logi files (cache files, secure key, bookmarks, etc.). To support these shared files, we need a shared network drive that can be accessed by the different Logi web servers. Currently, EFS (Elastic File System) cannot be mounted directly on Windows instances on AWS. Below, we describe how EFS can be set up and shared with Windows servers so that you can take advantage of Logi’s scalability features.

Setting Up the File Server
Overview:
In order for our distributed Logi application to function properly, it needs access to a shared file location. This can be easily implemented with Amazon’s Elastic File System (EFS). However, if you’re using a Windows server to run your Logi application, extra steps are necessary, as Windows does not currently support mounting EFS drives. To get around this constraint, we create Linux-based EC2 instances to serve as intermediary file servers. The EFS volumes are mounted on these instances, and our Windows servers then access the files via the Samba (SMB) protocol.

Steps:

  1. Create EC2:
    • Follow the steps as outlined in this AWS Get Started guide and choose:
      • Image: “Ubuntu Server 16.04 LTS (HVM), SSD Volume Type”
      • Instance type: “t2.micro” or whatever type your application requires
  2. Create AWS EFS volume:
    • Follow the steps listed here and use the same VPC and availability zone as used above
  3. Set up AWS EFS inside the EC2:
    • Connect to the EC2 instance we created in Step 1 using SSH
    • Mount the EFS to the EC2 using the following commands:
      sudo apt-get install -y nfs-common
      sudo mkdir -p /mnt/efs
      sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 EFS_IP_ADDRESS_HERE:/ /mnt/efs
  4. Re-export the NFS share for use by Windows:
    • Give our Windows user access to these files using Samba (SMB).
    • Run the following commands to install and configure the Samba services on your Ubuntu EC2:
      sudo apt-get install -y samba samba-common python-glade2 system-config-samba
      sudo cp -pf /etc/samba/smb.conf /etc/samba/smb.conf.bak
      sudo bash -c "cat /dev/null > /etc/samba/smb.conf"
      sudo nano /etc/samba/smb.conf
    • And then, paste the text below inside the smb.conf file:
      [global]
      workgroup = WORKGROUP
      server string = AWS-EFS-Windows
      netbios name = ubuntu
      dns proxy = no
      socket options = TCP_NODELAY

      [efs]
      path = /mnt/efs
      read only = no
      browseable = yes
      guest ok = yes
      writeable = yes
    • Create a Samba user/password. Use the same credentials as your EC2 user
      sudo smbpasswd -a ubuntu
    • Give Ubuntu user access to the mounted folder:
      sudo chown ubuntu:ubuntu /mnt/efs/
    • And finally, restart the samba service:
      sudo /etc/init.d/smbd restart

 

Setting up the Application Server

Overview:
Logi applications require setup in the form of settings, files, licenses, and more. In order to accommodate the elastic auto-scaling, we’ll set up one server – from creation to connecting to our shared drive to installing and configuring Logi – and then make an Amazon Machine Image (AMI) for use later.

Steps:

  1. Create EC2:
    • Follow the steps as outlined in this AWS Get Started guide and choose:
      • Image: “Microsoft Windows Server 2016 Base”
      • Instance type: “t2.micro” or whatever type your application requires
  2. Deploy code:
    • Clone your project repository and deploy the code in IIS
  3. Set Up User Access:
    • Allow your application in IIS to access the shared folder (EFS) that we created on the file server
    • From the Control Panel, choose User Accounts → Manage another account → Add a user account
    • Use the same username and password we created for the Samba user on the Ubuntu file server
    • In IIS, add the new Windows user we created above to the application pool: IIS → Application Pools → right-click your project’s application pool → Identity → Custom account → fill in the new username and password we created earlier.
  4. Test EFS (shared folder) connection:
    • To test the connection between the Windows application server and the Ubuntu file server, go to:
      • This PC → Computer tab → Map network drive → in the Folder textbox, type “\\FILE_SERVER_IP_ADDRESS\efs” → if a credentials window appears, use the new username and password we created earlier.

 

Configuring the Logi Application

Sticky and Non-Sticky Sessions
In a standard environment with one server, a session is established with the first HTTP request and all subsequent requests, for the life of the session, will be handled by that same server. However, in a load-balanced or clustered environment, there are two possibilities for handling requests: “sticky” sessions (sometimes called session affinity) and “non-sticky” sessions.

We will use sticky sessions to handle HTTP requests, centralize the location of any shared resources, and manage session state. You must create a centralized, shared location for cached data (the rdDataCache folder), saved Bookmark files, the _metaData folder, and saved Dashboard files, because they must be accessible to all servers in the cluster.

Managing Session State
IIS is configured by default to manage session information using the “InProc” option. For both standalone and load-balanced, sticky environments, this option allows a single server to manage the session information for the life of the session.

Centralization of Application Resources
In a load-balanced environment, each web server must have Logi Server installed and properly licensed, and must have its own copy of the Logi application with its folder structure, system files, etc. This includes everything in the _SupportFiles folder such as images, style sheets, XML data files, etc., any custom themes, and any HTML or script files. We will achieve this by creating one instance with all the proper configurations, and then using an AMI.

Some application files should be centralized, which also allows for easier configuration management. These files include:

Definitions: Copies of report, process, widget, template, and any other necessary definitions (except _Settings) can be installed on each web server as part of the application, or centralized definitions may be used for easier maintenance (if desired).

The location of definitions is configured in the _Settings definition, using the Path element’s Alternative Definition Folder attribute, as shown above. This should be set to the UNC path to a shared network location accessible by all web servers, and the attribute value should include the _Definitions folder. Physically, within that folder, you should create the folders _Reports, _Processes, _Widgets, and _Templates as necessary. Do not include the _Settings definition in any alternate location; it must remain in the application folder on the web server as usual.

“Saved” Files: Many super-elements, such as the Dashboard and Analysis Grid, allow the user to save the current configuration to a file for later reuse. The locations of these files are specified in attributes of the elements.

As shown in the example above, the Save File attribute value should be the UNC path to a shared network location (with file name, if applicable) accessible by all web servers.

Bookmarks: If used in an application, the location of these files should also be centralized:

As shown above, in the _Settings definition, configure the General element’s Bookmark Folder Location attribute, with a UNC path to a shared network folder accessible by all web servers.

 

Using SecureKey security:
If you’re using Logi SecureKey security in a load-balanced environment, you need to configure security to share requests.

In the _Settings definition, set the Security element’s SecureKey Shared Folder attribute to a network path, as shown above. Files in the SecureKey folder are automatically deleted over time, so do not use this folder to store other files. You must manually create the rdSecureKey folder under the myProject shared folder, since Logi does not create it automatically.

Note: The “Authentication Client Addresses” value must be replaced later with the subnet IP address ranges of the load balancer VPC after completing the load balancer setup below.
You can specify ranges of IP addresses with wildcards. To use wildcards, specify an IP address, a space character, then the wildcard mask. For example, to allow all addresses in the range 172.16.*.*, specify:
172.16.0.0 0.0.255.255

Centralizing the Data Cache
The data cache repository is, by default, the rdDataCache folder in a Logi application’s root folder. In a standalone environment, where all the requests are processed by the same server, this default cache configuration is sufficient.

In a load-balanced environment, centralizing the data cache repository is required.

This is accomplished in Studio by editing a Logi application’s _Settings definition, as shown above. The General element’s Data Cache Location attribute value should be set to the UNC path of a shared network location accessible by all web servers. This change should be made in the _Settings definition for each instance of the Logi application (i.e. on each web server).

Note: The “mySharedFileServer” IP/DNS address should be replaced later with the file servers’ load balancer DNS name after completing the load balancer setup below.

 

Creating and Configuring Your Load-Balancer

Overview:
You’ll need to set up load balancers for both the Linux file server and the Windows application/web server. This process is relatively simple and is outlined below, and in the Getting Started guide here.

Steps:

  1. Windows application/web servers load balancer:
    • Use classic load balancers.
    • Use the same VPC that our Ubuntu file servers use.
    • Listener configuration: Keep defaults.
    • Health check configuration: Keep the defaults and make sure that the ping path exists, e.g. “/myProject/rdlogon.aspx”
    • Add Instances: Add all Windows web/application servers to the load balancer and check the status. All servers should show “InService” within 20-30 seconds.
    • To enable stickiness, select the ELB > Port Configuration > Edit stickiness > choose “Enable load balancer generated cookie stickiness” and set the expiration period as well.
  2. Linux file servers load balancer:
    • Use classic load balancers.
    • Use the same VPC that the EFS volume uses.
    • Listener configuration: 
    • Health check configuration: Keep the defaults and make sure that the ping path exists, e.g. “/index.html”
      • NOTE: A simple web application must be deployed to the Linux file servers in order to support the health check. It should run inside a web container such as Tomcat; then set the health checker’s ping path to the deployed application’s path.
    • Add Instances: Add all Ubuntu file servers to the load balancer and check the status. All servers should show “InService” within 20-30 seconds.

 

Using Auto-Scaling

Overview:
In order to achieve auto-scaling, you need to set up a launch configuration (or launch template) and an Auto Scaling group. You can follow the steps in the link here, or the ones outlined below.

Steps:

  1. Create Launch Configuration:
    • Search and select the AMI that you created above. 
    • Use the same security group you used for your Windows application server EC2 instances.
  2. Create an Auto Scaling Group
    • Make sure to select the launch configuration that we created above.
    • Make sure to set the group size, i.e. how many EC2 instances you want to have in the auto scaling group at all times.
    • Make sure to use the same VPC we used for the Windows application server EC2s.
    • Set the Auto scaling policies:
      • Set min/max size of the group:
        • Min: minimum number of instances that will be launched at all times.
        • Max: maximum number of instances that will be launched once a metric condition is met.
    • Click on “Scale the Auto Scaling group using step or simple scaling policies” 
    • Set the required values for:
      • Increase group size
        • Make sure that you create a new alarm that will notify your auto scaling group when the CPU utilization exceeds certain limits.
        • Make sure that you specify the “add” action and the number of instances that we want to add when the above alarm is triggered.
      • Decrease group size
        • Make sure that you create a new alarm that will notify your auto scaling group when the CPU utilization is below certain limits.
        • Make sure that you specify the “remove” action and the number of instances that we want to remove when the above alarm is triggered.
    • You can set the warm up time for the EC2, if necessary. This will depend on whether you have any initialization tasks that run after launching the EC2 instance, and if you want to wait for them to finish before starting to use the newly created instance.
    • You can also add a notification service to know when any instance is launched, terminated, failed to launch or failed to terminate by the auto scaling process. 
    • Add tags to the auto scaling group. You can optionally choose to apply these tags to the instances in the group when they launch. 
    • Review your settings and then click on Create Auto Scaling Group.

 

We hope this detailed how-to guide was helpful in setting up your Logi application on AWS. Please reach out if you have any questions or have any other how-to guide requests. We’re always happy to hear from you!

 

References:

https://en.wikipedia.org/wiki/Load_balancing_(computing)

https://devnet.logianalytics.com/rdPage.aspx?rdReport=Article&dnDocID=2222

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html

https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/EC2_GetStarted.html

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancer-getting-started.html

https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-sticky-sessions.html

https://docs.aws.amazon.com/autoscaling/ec2/userguide/autoscaling-load-balancer.html

https://docs.aws.amazon.com/toolkit-for-visual-studio/latest/user-guide/tkv-create-ami-from-instance.html

 

 


We recently set up a Spark SQL (Spark) cluster and decided to run some tests to compare the performance of Spark and Amazon Redshift. For our benchmarking, we ran four different queries: one filtration based, one aggregation based, one select-join, and one select-join with multiple subqueries. We ran these queries on both Spark and Redshift on datasets ranging from six million rows to 450 million rows.

These tests led to some interesting discoveries. First, when performing filtration and aggregation based queries on datasets larger than 50 million rows, our Spark cluster began to out-perform our Redshift database. As the size of the data increased, so did the performance difference. However, when performing the table joins – and in particular the multiple table scans involved in our subqueries – Redshift outperformed Spark (even on a table with hundreds of millions of rows!).

The Details

Our largest Spark cluster utilized an m1.large master node and six m3.xlarge worker nodes. Our aggregation test ran between 2x and 3x faster on datasets larger than 50 million rows. Our filtration test showed similar gains, beginning at a similar data size.

However, our table scan testing showed the opposite effect at all levels of data size. Redshift generally ran slightly faster than our Spark cluster, with the performance difference increasing as the data volume increased, until it capped out at 3x faster on a 450 million row dataset. This result was further corroborated by what we know about Spark: even with its strengths over MapReduce computing models, data shuffles are slow and CPU expensive.

We also discovered that if you attempt to cache a table into RAM and there’s overflow, Spark will write the extra blocks to disk, but the read time when coming off of disk is slower than reading the data directly from S3. On top of this, when the cache was completely full, we started to experience memory crashes at runtime because Spark didn’t have enough execution memory to complete the query. When the executors crashed, things slowed to a standstill. Spark scrambled to create another executor, move the necessary data, recreate the result set, etc. Uncached queries were generally slower than cached ones, but that ceased to be true once Spark ran out of RAM.

Takeaway

In terms of power, Spark is unparalleled at performing filtration and aggregation of large datasets. But when it comes to scanning and re-ordering data, Redshift still has Spark beat. This is just our two cents – let us know what you guys think!


Spark SQL (Spark) provides powerful data querying capabilities without the need for a traditional relational database. While the AWS Elastic MapReduce (EMR) service has made it easier to create and manage your own cluster, it can still be difficult to set up your data and configure your cluster properly so that you get the most out of your EC2 instance hours. This blog will discuss some of the lessons that we learned when setting up our Spark cluster. Hopefully we can spare you some of the trouble we ran into when setting ours up! While we believe these best practices should apply to the majority of users, bear in mind that each use case is different and you may need to make small adaptations to fit your project.

 

First Things First: Data Storage

Unsurprisingly, one of the most important questions when querying your data is how that data is stored. We learned that Parquet files perform significantly better than the CSV file format. Parquet is a columnar data storage format commonly used within the Hadoop ecosystem. Spark reads Parquet files much faster than text files due to its capability to perform low-level filtration and create more efficient execution plans. When running on 56 million rows of our data stored as Parquet instead of CSV, speed was dramatically increased while storage space was decreased (see the table below).
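As a rough illustration, the conversion itself is a one-time job; the sketch below assumes PySpark and uses placeholder S3 paths and schema options rather than our actual setup:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

      # Read the raw CSV data (hypothetical path; header/inferSchema depend on your files)
      df = spark.read.csv("s3://your-bucket/path/to/data/csv/", header=True, inferSchema=True)

      # Write the same records back out in the columnar Parquet format
      df.write.mode("overwrite").parquet("s3://your-bucket/path/to/data/parquet/")

      # Later jobs query the Parquet copy, which Spark can filter at a low level
      parquet_df = spark.read.parquet("s3://your-bucket/path/to/data/parquet/")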

Storing Data as Parquet instead of CSV

Partitioning the Data

Onto the next step: partitioning the data within your data store. When reading from S3, Spark takes advantage of partitioning systems and requests only the data that is required for the query. Assuming that you can partition your data along a frequently used filter (such as year, month, date, etc.), you can significantly reduce the amount of time Spark must spend requesting your dataset. There is a cost associated with this, though – the data itself must be partitioned according to the same scheme within your S3 bucket. For example, if you’re partitioning by year, all of your data for 2017 must be located at s3://your-bucket/path/to/data/2017, and all of your data for 2016 must be located at s3://your-bucket/path/to/data/2016, and so on.

You can also form multiple hierarchical layers of partitioning this way (s3://…/2016/01/01 for January 1st, 2016). The downside is that you can’t partition your data in more than one way within one table/bucket and you have to physically partition the data along those lines within the actual filesystem.
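To make that layout concrete, here is a hedged PySpark sketch with placeholder paths and column names. Note that Spark’s own partitionBy writes Hive-style folder names such as year=2017/month=1 rather than the bare 2017/01 shown above, but the pruning behavior is the same:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("partitioning-example").getOrCreate()

      df = spark.read.parquet("s3://your-bucket/path/to/data/parquet/")

      # Physically lay the data out by the columns you filter on most often
      df.write.mode("overwrite").partitionBy("year", "month").parquet(
          "s3://your-bucket/path/to/partitioned/")

      # A filter on the partition columns only touches the matching S3 prefixes
      jan_2017 = (spark.read.parquet("s3://your-bucket/path/to/partitioned/")
                       .where("year = 2017 AND month = 1"))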

Despite these small drawbacks, partitioning your data is worth the time and effort. Although our aggregation query test ran in relatively the same time on both partitioned and unpartitioned data, our filtration query ran 7x faster, and our table scan test ran 6x faster.

Caching the Data

Next, you should cache your data in Spark when you can fit it into the RAM of your cluster. When you run the command ‘CACHE TABLE tablename’ in Spark, it will attempt to load the entire table into the storage memory of the cluster. Why into the storage memory? Because Spark doesn’t have access to 100% of the memory on your machines (something you likely would not want anyways), and because Spark doesn’t use all of the accessible memory for storing cached data. Spark on Amazon EMR is only granted as much RAM as a YARN application, and then Spark reserves about one third of the RAM that it’s given for execution memory (i.e. temporary storage used while performing reads, shuffles, etc.). As a rule of thumb, the actual cache storage capacity is up to about one third of your cluster’s total RAM.
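For example, caching and then querying a table looks roughly like this (the table name and path are placeholders for your own data):

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("cache-example").getOrCreate()

      # Register the data as a table, then pull it into the cluster's storage memory
      spark.read.parquet("s3://your-bucket/path/to/partitioned/") \
           .createOrReplaceTempView("events")
      spark.sql("CACHE TABLE events")

      # Later queries against the cached table are served from memory when it fits
      spark.sql("SELECT year, COUNT(*) FROM events GROUP BY year").show()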

Furthermore, when Spark caches data, it doesn’t use the serialization options available to a regular Spark job. As a result, the size of the data actually increases by about two to three times when you load it into memory. Any overflow that doesn’t fit into Spark storage does get saved to disk, but as we will discuss later, this leads to worse performance than if you read the data directly from S3.

 

Configuring Spark – The Easy Way and the Hard (the Right) Way 

The Easy Way

Spark has a bevy of configuration options to control how it uses the resources on the machines in your cluster. It is important to spend time with the documentation to understand how Spark executors and resource allocation work if you want to get the best possible performance out of your cluster. However, when just starting out, Amazon EMR’s version of Spark allows you to use the configuration parameter “maximizeResourceAllocation” to do some of the work for you. Essentially, this configures your cluster to make one executor per worker node, with each executor having full access to all the resources (RAM, cores, etc.) on that node. To do this, simply add “maximizeResourceAllocation true” to your “/usr/lib/spark/conf/spark-defaults.conf” file.

While this may be an ‘acceptable’ way to configure your cluster, it’s rarely the best way. When it comes to cluster configuration, there are no hard and fast rules. How you want to set up your workers will come down to the hardware of your machines, the size and shape of your data, and what kind of queries you expect to run. That being said, there are some guidelines that you can follow.

 The Right Way: Guidelines

First, each executor should have between two and five cores. While it’s possible to jam as many cores into each executor as your machine can handle, developers found that you start to lose efficiency with a core count higher than five or six. Also, while it is possible to make an individual executor for each core, this causes you to lose the benefits of running multiple tasks inside a single JVM. For these reasons, it’s best to keep the number of cores per executor between two and five, making sure that your cluster as a whole uses all available cores. For example, on a cluster with six worker machines, each machine having eight cores, you would want to create twelve executors, each with four cores. The number of cores is set in spark-defaults.conf, while the number of executors is set at runtime by the “--num-executors” command line parameter.
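For the six-worker example above, one way to express that sizing from code is sketched below; the same values can equally live in spark-defaults.conf or be passed to spark-submit, so treat this only as an illustration:

      from pyspark.sql import SparkSession

      # 6 workers x 8 cores -> 12 executors with 4 cores each
      spark = (SparkSession.builder
               .appName("sizing-example")
               .config("spark.executor.instances", "12")
               .config("spark.executor.cores", "4")
               .getOrCreate())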

Memory is the next thing you must consider when configuring Spark. When setting the executors’ memory, it’s important to first look at the YARN master’s maximum allocation, which will be displayed on your cluster’s web UI (more on that later). This number is determined by the instance size of your workers and limits the amount of memory each worker can allocate for YARN applications (like your Spark program).

Your executor memory allocation needs to split that amount of memory between your executors, leaving a portion of memory for the YARN manager (usually about 1GB), and remaining within the limit for the YARN container (about 93% of the total).

The following equation can be used for shorthand:
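      executor memory ≈ 0.93 × (YARN maximum allocation − ~1 GB for the YARN manager) ÷ executors per node

For example, with, say, a 12 GB YARN maximum allocation and two executors per node, that works out to roughly 0.93 × 11 GB ÷ 2, or about 5 GB per executor.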

One Last Thing – the Spark Web UI

Earlier in this post we mentioned checking the YARN memory allocation through the Spark Web UI. The Spark Web UI is a browser-based portal for seeing statistics on your Spark application. This includes everything from the status of the executors, to the job/stage information, to storage details for the cluster. It’s an incredibly useful page for debugging and checking the effects of your different configurations. However, the instructions on Amazon’s page for how to connect are not very clear.

In order to connect, follow Amazon’s instructions – located here – but in addition to opening the local port 8157 to a dynamic port, open a second local port to port 80 (to allow http connections), as well as a third local port to port 8088. This port – the one connected to your cluster’s port 8088 – will give you access to the YARN resource manager, and through that page, the Spark Web UI.

Note: the Spark Web UI will only be accessible when your Spark application is running. You can always access the YARN manager, but it will show an empty list if there is no Spark application to display.

 

Now it’s Your Turn!

Unlocking the power of Spark SQL can seem like a daunting task, but performing optimizations doesn’t have to be a major headache. Focusing on data storage and resource allocation is a great starting point and will be applicable for any project. We hope that this post will be helpful to you in your SparkSQL development, and don’t forget to check out the Spark Web UI!


Hey there! This summer, dbSeer has been keeping pretty busy. We completed a database migration project with one of our customers, Subject7, and then turned it into a case study to share with our great supporters like you.

In the project, our certified AWS architects (and all-around awesome people) designed a new network architecture from the ground up and moved 50 database instances to Amazon RDS. They did all this while still reducing Subject7’s costs by 45%. If that’s not amazing, tell me what is…I’m waiting.

We know you want to learn more, so you can see the full case study here.

 

If you’re short on time, check out below to see the project at a glance:

Who was the client?

Subject7, who created a no-coding, cloud-based automated testing solution for web and mobile applications.

What was the opportunity?

Subject 7 sought to enhance their back-end architecture with the most optimal resource allocation to prepare for future expansion.

What was dbSeer’s solution?

dbSeer designed a new network architecture from the ground-up, which included moving to Amazon RDS. Once on AWS, dbSeer found the most optimal resource allocation.

What were the results?

dbSeer migrated nearly 50 database instances to RDS with minimal downtime. Subject7 is now able to scale the back-end server to any size without impact to users. AWS costs decreased by nearly 45% and Subject7 achieved a positive ROI in only 2 months.

 

If you’re interested in learning more, or have specific questions, or just want to say hi, we always love connecting with our readers. Don’t hesitate to reach out, which you can do here.


We’ve got some pretty exciting news to share. Earlier this year, we completed a Migration and Big Data project with one of our customers, the market leader in telecom interconnect business optimization. The project went so well that we made a case study to showcase what we did.

In a nutshell, we demonstrated how they could take advantage of the elasticity and scalability of AWS services to support market expansion. We did this by re-architecting their event processing engine and leveraging AWS elastic services and open source-technologies to provide unlimited scalability. As a result of our work, they increased their processing speeds by 60x at a fraction of the cost.

If you want more details on this project (and trust me, you do) check out this link (Delivring-on-the-AWS-Promise-Migration-Case-Study).

You can learn all you’ve ever wanted to know about the opportunity the client presented, the solution we designed, and the results we earned.


Amazon RDS is a managed relational database service that provides multiple familiar database engines to choose from (Amazon Aurora, MySQL, MariaDB, Oracle, Microsoft SQL Server, and PostgreSQL). Amazon RDS handles routine database tasks such as provisioning, patching, backup, recovery, failure detection, and repair.

Compared to self-hosted databases, RDS is easy to use and the administrative effort is very low. Increasing performance and storage is easy, and monitoring, daily backups, and restores can be configured easily.

Existing hosted databases can easily be migrated to AWS using the AWS Database Migration Service. This service supports homogeneous migrations such as Oracle to Oracle, as well as heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora or Microsoft SQL Server to MySQL.

 

RDS Configuration

The architecture diagram above shows a proposed AWS architecture for an enterprise web application.

 

VPC

AWS recommends having your application inside a VPC (Virtual Private Cloud). For a multi-tiered web application, it is recommended to have a private and a public subnet within the VPC. The database server should be launched in the private subnet, so that it is isolated and secure. The web servers should be launched in the public subnet. Security and routing need to be configured so that only the web servers can communicate with the database servers in the private subnet. Since the web application is public facing, there would be an internet route configured for the public subnet.

 

Multi-AZ Deployment

AWS RDS allows multi-AZ deployment to support high availability and reliability.  With this feature, AWS automatically provisions and maintains a synchronous standby replica in a different Availability Zone. AWS synchronously replicates the data from the primary to the secondary database instance. If the primary database instance goes down for any reason AWS will automatically fail over to the secondary database instance.

 

Read Replicas

Read Replicas can help you scale out beyond the capacity of a single database deployment for read-heavy database workloads. Updates made to the source DB instance are asynchronously copied to the Read Replica. This mechanism would be very useful in case you have a web application and a reporting application both using the same database instance. In this scenario, all read-only traffic would be routed to the read replicas. The primary database would be used for read and write traffic for the web application.

 

Backup and Maintenance

AWS automatically creates backups of the RDS instance. Amazon RDS creates a storage volume snapshot of your DB instance, backing up the entire DB instance. To reduce performance impact, backups and maintenance should be configured when application usage is low.


Managing web traffic is a critical part of any web application, and load balancing is a common and efficient solution. Load balancing distributes workloads, aiming to maximize throughput, minimize response time and avoid overloading a single resource. Using auto-scaling in combination with load balancing allows for your system to grow and shrink its distributed resources as necessary, providing a seamless experience to the end user. The scalability and elasticity features of AWS can be easily utilized by any web application built on AWS.

Considerations on setting up a Logi App on AWS

Applications built using Logi have shared components such as cached data files and bookmark files. To provide a seamless experience to the end user, Logi needs to be configured correctly to share these common files across multiple instances as the system scales its resources.

 

To support these shared resources, a shared file system is needed that can be accessed by the multiple instances as resources get allocated and deallocated. Amazon Elastic File System (EFS) is the shared file system on AWS. Currently EFS is only supported on Linux-based instances and is not supported on Windows.

To support auto scaling, shared file locations need to be defined in the application settings so that new resources are pre-configured when auto scaling allocates new resources. Whenever you add a new server to the auto-scaling group, it has to be pre-configured with both your Logi application and the correct connections to the distributed resources.

Recommendations with Logi Apps on AWS

The solution to the EFS challenge on Windows Logi Apps involves adding a middle layer of Linux based EC2 instances to the architecture. Mount the EFS volumes on the Linux Instances, and these can be accessed on the Windows servers via the SMB protocol. By adding these Linux instances – and an associated load balancer – it becomes possible to use an EFS volume despite the Windows operating system being unable to mount it directly.

Use of Amazon Machine Images (AMIs) allows the developer to create a single instance with the Logi web app, the server and user settings for accessing the EFS, and their specific Logi application. This AMI can then be used by the auto scaling group to allocate and deallocate new instances. With the setting and report definitions saved in the AMI, and the shared elements such as Bookmarks and data cache saved on the EFS, it becomes possible to use AWS load balancers to implement a distributed and scalable Logi web application.

A detailed, step-by-step guide outlining all the technical details on how a Logi application can be configured to harness the scalability & elasticity features of AWS can be found here.


Self-service business intelligence (BI) is all the rage. If Google analytics is any indication, it wasn’t until January 2015 that the keyword “self-service BI” appeared consistently on its search radar. There are a lot of claims that point toward self-service BI being the magic bean for business users trying to make sense of huge data volumes.

What self-service can resolve (in a way):

The stalk rising from this magic bean is a fast-track highway that enables business users to help themselves to the information they need without any dependency on more tech-savvy folks or IT departments. Tableau and Qlik are examples of companies that claim to fulfill this highly sought need. No more waiting for IT to pull reports, no more waiting on tech-savvy developers and coders to decipher the volumes of business intelligence data that you have accumulated but have no clue how to digest. Now, you can just go in yourself and pull beautiful visualizations that turn terabytes of raw data into meaningful, presentable information any business person can digest.

What story is the data telling? Self-service models fall short in answering this question.

What self-service mostly does not resolve:

As convenient as it may be, self-service BI is insufficient on its own. Yes, it resolves the issue of dependency, but in what way? In one of our white papers on dbseer.com, we discuss the ideal framework for maturing your analytics platform. It speaks to two approaches: the descriptive versus the diagnostic approach to understanding analytics.

The diagnostic approach & its insufficiencies

Self-service BI can easily fulfill a diagnostic approach. (For more on the framework for understanding your analytics, see this paper.) In the diagnostic approach, you can slice and dice the data on your own. However, the descriptive approach that answers the where, what, and how of your data cannot be so easily fulfilled in this way. What is the story the data is telling? Self-service models fall short in answering this question. After all, data scientists have an expertise in manipulating and extracting information from data that many end users may not have. The opportunity to unveil and attain these findings is lost if and when business end users rely purely on self-service offerings.

The Better Solution:

In no way am I suggesting we get rid of self-service BI. There is a well-established need for it and SaaS vendors in the BI world should definitely offer it. However, the better solution is to use self-service solutions as the exception, and not the rule for your business intelligence needs. There is a lot of business intelligence to be attained that self-service solutions are incapable of unearthing.

Jack set his eyes on what was not his for the taking when he climbed the magic beanstalk: the golden egg, the harp, the coins. I’m not claiming you can’t have them (!), I’m just saying you should get the goods through the appropriate means, rather than necessarily helping yourself in the dark!


Overview

Machine learning algorithms, while powerful, are not perfect, or at least not perfect “right out of the box.” The complex math that controls the building of machine learning models requires adjustments based on the nature of your data in order to give the best possible results. There are no hard and fast rules to this process of tuning, and frequently it will simply come down to the nature of your algorithm and your data set, although some guidelines can be provided to make the process easier. Furthermore, while tuning the parameters of your model is arguably the most important method of improving your model, other methods do exist, and can be critical for certain applications and data sets.

Tuning Your Model

As stated above, there are no hard and fast rules for tuning your model. Which parameters you can adjust will come down to your choice of algorithm and implementation. Generally, though, regardless of your parameters, it’s desirable to start with a wide range of values for your tuning tests, and then narrow your ranges after finding baseline values and performance. It’s also generally desirable, in your first ‘round’ of testing, to test more than one value at once, as the interactions between different parameters can mean that as you adjust one, it only becomes effective within a certain range of values for another. The result is that some kind of comprehensive testing is needed, at least at first. If you’re using the scikit-learn library, I highly recommend using their GridSearchCV class to find your baseline values, as it will perform all of the iterations of your parameters for you, with a default of three-fold cross-validation.
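As a hedged sketch of that first wide pass (the classifier choice, parameter grid, and the X_train/y_train variables are placeholders for your own problem, not a recommendation):

      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import GridSearchCV

      # A deliberately wide first-round grid; X_train and y_train are assumed to exist
      param_grid = {
          "n_estimators": [50, 200, 500],
          "max_depth": [5, 20, None],
          "min_samples_leaf": [1, 5, 20],
      }

      search = GridSearchCV(RandomForestClassifier(random_state=0),
                            param_grid, cv=3, scoring="f1_macro")
      search.fit(X_train, y_train)

      print(search.best_params_)   # baseline values to narrow in the next round
      print(search.best_score_)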

Once you have your baseline, you can begin testing your parameter values one at a time, and within smaller ranges. In this round of testing, it’s important to pay close attention to your evaluation metrics, as your parameter changes will, hopefully, affect the various performance features of your model. As you adjust, you will find patterns, relationships, and learn more about the way in which your individual algorithm and data respond to the changes that you’re making. The best advice I can give is to test extensively, listen to your metrics, and trust your intuitions.

The next best advice I can give is don’t be afraid to start over. It may be that after running several rounds of fine tuning that your performance plateaus. Let yourself try new baseline values. The machine learning algorithms out there today are very complex, and may respond to your testing and your data in ways that only become clear after many failures. Don’t let it discourage you, and don’t be nervous about re-treading old ground once you’ve learned a little more from your previous experiments. It may end up being the key you need to find a breakthrough.

Alternate Methods – Boosting

Aside from tuning your model’s parameters, there are many other ways to try to improve your model’s performance. I’m going to focus on just a few here, the first of them being the concept of boosting. Boosting is the artificial addition of records to your data set, usually to boost the presence of an under-represented or otherwise difficult to detect class. Boosting works by taking existing records of the class in question and either copying them or altering them slightly, and then adding some desired number of duplicates to the training data. It is important that this process happens only with the training data and not the testing data when using cross-validation, or you will add biased records to your testing set.
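A minimal sketch of that idea, assuming X and y are NumPy arrays, that the under-represented class is labeled "B", and that the 500 added records are an arbitrary figure:

      import numpy as np
      from sklearn.model_selection import train_test_split
      from sklearn.utils import resample

      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.3, stratify=y, random_state=0)

      # Duplicate minority-class records drawn from the training split only
      minority = y_train == "B"
      X_extra, y_extra = resample(X_train[minority], y_train[minority],
                                  replace=True, n_samples=500, random_state=0)

      # The test split is left untouched, so evaluation stays unbiased
      X_train_boosted = np.concatenate([X_train, X_extra])
      y_train_boosted = np.concatenate([y_train, y_extra])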

It is also important to use boosting on your data set only up to a certain degree, as altering the class composition too drastically (for example, by making what was initially a minority class into the majority class) can throw off the construction of your models as well. However, it is still a very useful tool when handling a difficult-to-detect target class or an imbalanced class distribution, or when using a decision tree model, which generally works best with balanced class distributions.

Alternate Methods – Bagging

This method of model improvement isn’t actually a way to improve your models, per se, but instead to improve the quality of your system. Bagging is a technique that can be used with ensemble methods, classification systems that employ a large number of weak classifiers to build a powerful classification system. The process of bagging involves either weighting or removing classifiers from your ensemble set based on their performance. As you decrease the impact of the poorly performing classifiers (“bagging” them), you increase the overall performance of your whole system. Bagging is not a method that can always be used, but it is an effective tool for getting the absolute most out of using an ensemble classification method.
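One simple, hedged reading of the weighting variant is sketched below; the member models and the X_train/y_train/X_val/y_val variables are placeholders, and removing the weakest members entirely would work the same way:

      from sklearn.ensemble import VotingClassifier
      from sklearn.linear_model import LogisticRegression
      from sklearn.naive_bayes import GaussianNB
      from sklearn.tree import DecisionTreeClassifier

      members = [("lr", LogisticRegression(max_iter=1000)),
                 ("nb", GaussianNB()),
                 ("dt", DecisionTreeClassifier(max_depth=3))]

      # Weight each member by its held-out accuracy so weak performers count for less
      weights = [clf.fit(X_train, y_train).score(X_val, y_val) for _, clf in members]

      ensemble = VotingClassifier(estimators=members, voting="soft", weights=weights)
      ensemble.fit(X_train, y_train)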

Alternate Methods – Advanced Cleaning

As discussed in our pre-processing post, advanced pre-processing methods, such as feature selection and outlier removal, can also increase the effectiveness of your models.  Of these two methods, feature selection is frequently the easier and more effective tool, although if your data set is particularly noisy, removing outliers may be more helpful to you. Rather than reiterate what was said before, I recommend that you read our full post on these techniques, located here.

Conclusion

The process of model improvement is a slow but satisfying one. Proposing, running, and analyzing your experiments takes time and focus, but yields the greatest rewards, as you begin to see your models take shape, improve, and reveal the information hidden in your data.


Overview

How do we know when a model has learned? The theoretical examinations of this question go both wide and deep, but as a practical matter, what becomes important for the programmer is the ability of a classification model to make accurate distinctions between the target classes. However, making accurate distinctions is not always the same as having a highly accurate set of classifications. If that statement doesn’t make a ton of sense to you, allow me to provide you an example:

You have a data set that is composed of classes A and B, with 90% of records being of class A, and 10% being class B. When you provide this data to your model, it shows 90% accuracy, and you take this to be a good result, until you dig a little deeper into the process and find out that the classifier had given every single record a class A label. This means that even though the model completely failed to distinguish between the two classes, it still classified 90% of records accurately, because of the nature of the data set.

These sorts of examples are more common than you’d think, and they’re why we use a variety of different evaluation metrics when trying to comprehend the effectiveness of our models. In this post, we’ll go over a few of these metrics, as well as how they’re calculated, and how you can apply them both within and across classes.

Our Base Values

The first thing we have to do is create some more robust ways of defining our model’s actions than simply ‘correct classification’ and ‘incorrect classification’. To that end, the following values are calculated for each class as the model runs through the testing data:

  • True Positives (TP): The classifier applies label X, and the record was of class X.
  • False Positives (FP): The classifier applies label X, and the record was not of class X.
  • True Negatives (TN): The classifier applies any label that is not X, and the record was not of class X.
  • False Negatives (FN): The classifier applies any label that was not X, and the record was of class X.

As I said, these values are calculated for each class in your problem, so if, for example, a record is classified as class A, and its actual label was for class B, that would be +1 to Class A’s False Positives, and +1 to class B’s False Negatives. If you have a multi-class dataset, the same rules apply. In that example, if you had a class C as well, you would also add +1 to class C’s True Negatives, as the record was accurately not classified as belonging to C.

Useful Metrics

These four values allow us to get a much more detailed picture of how our classifier is performing. It is still possible to get an accuracy score for each class, by adding all the True Positives and True Negatives and dividing by the total number of records. However, you can also calculate many other metrics with these values. For the purposes of this post, we’re going to focus on just three: precision, recall, and F1 measure.

Precision (TP / (TP + FP)) is a metric that shows how frequently your classifier is correct when it chooses a specific class. The numerator – True Positives – is the number of records correctly classified as the given class, and the denominator – True Positives plus False Positives – is the number of times your classifier assigned that class label, whether correct or incorrect. With this metric, you can see how frequently your model is misclassifying a record by assigning it this particular class. A lower precision value shows that the model is not discerning enough in assigning this class label.

Recall (TP / (TP + FN)) is a metric that shows how frequently your classifier labels a record of the given class correctly. The numerator – True Positives – is the number of records correctly classified as the given class, and the denominator – True Positives plus False Negatives – is the number of records that should have been classified as the given class. With this metric, you can see what percentage of the target records your classifier is able to correctly identify. A lower recall value shows that the model is not sensitive enough to the target class, and that many records are being left out of the classification.

Finally, F1 measure (2 * ( (recall * precision) / (recall + precision))) is a combined score of recall and precision that gives a single measurement for how effective your classifier is. F1 score is most useful when trying to determine if a tradeoff of recall or precision for the other is increasing the general effectiveness of your model. You should not use F1 score as your only metric for model evaluation. Delving into your model’s specific precision and recall will give you a better idea of what about your model actually needs improving.
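These formulas translate directly into code; here is a small sketch with made-up counts for a single class (scikit-learn’s precision_score, recall_score, and f1_score will compute the same values from label arrays):

      def precision(tp, fp):
          return tp / (tp + fp)

      def recall(tp, fn):
          return tp / (tp + fn)

      def f1(p, r):
          return 2 * (r * p) / (r + p)

      # Hypothetical counts for one class
      tp, fp, fn = 6, 4, 2
      p, r = precision(tp, fp), recall(tp, fn)
      print(p, r, f1(p, r))   # 0.6, 0.75, ~0.667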

Micro/Macro Metrics

If your problem is such that you only care about a single target class, then it’s easy to stop at the evaluation of your model as above. However, for multi-class problems, it’s important to have a calculation to show the model’s general effectiveness across all classes, as opposed to each class individually. There are two ways to do this, both with their advantages and disadvantages.

The first is known as macro-averaging, which computes each metric for each class first, and then takes an average of those values. For example, if you have three classes, with precision 0.6, 0.7, and 0.2, you would add those values up to 1.5 and divide by 3 to get a macro-precision of 0.5.

Micro-averaging, on the other hand, takes all the values that would go into each individual metric and calculates a single value from them. This can be a little confusing, so allow me to provide an example. For consistency’s sake, let’s use counts that yield the same precision values as above: your data could have class A with TP = 6, FP = 4; class B with TP = 7, FP = 3; and class C with TP = 25, FP = 100. This gives you the 0.6, 0.7, and 0.2 precision values as above, but performing micro-averaging, which means adding up the individual values across all classes as though they were one class (all TP / (all TP + all FP)), you get a micro-precision of about 0.26.
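Worked out in code with those counts:

      # Per-class true and false positives from the example above
      tp = {"A": 6, "B": 7, "C": 25}
      fp = {"A": 4, "B": 3, "C": 100}

      per_class = {c: tp[c] / (tp[c] + fp[c]) for c in tp}               # 0.6, 0.7, 0.2
      macro = sum(per_class.values()) / len(per_class)                    # 0.5
      micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))    # ~0.26

      print(per_class, macro, micro)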

This is much lower than the 0.5 macro-precision, but this example should not bias you away from one metric or towards another. There are times when either metric might give you more insight into the effectiveness of your classifier, and so you must use your judgment when choosing what metrics to pay attention to.

Conclusion

Building a complete picture of your model’s effectiveness takes more than just looking at the number of misclassified records, and we should be glad of that. As you delve into the various metrics available to you as a data scientist, you can begin to see patterns forming, and use those experiments and your intuitions to build better and more powerful models through the process of tuning, which we will cover in our next blog post.
