
Overview

One of the things you learn quickly while working in machine learning is that there is no “one size fits all.” When it comes to choosing a model, this is truer than ever. However, there are some guidelines you can follow when trying to decide on an algorithm to use, and while we’re on the subject of model creation, it’s useful to discuss good practices and ways in which you can fall into unforeseen mistakes.

Cross-Validation

Before we dig into the topic of model selection, I want to take a moment to address an important idea you should be considering when designing the structure of your project: cross-validation. Cross-validation is essentially a model evaluation process that allows you to check the effectiveness of your model on more than one data set by building multiple training sets from your existing data. This is done by moving a testing data ‘fold’ through your data, building models with whatever isn’t set aside for testing.

For example: let’s say you have 100 records and want to use 10-fold cross-validation, essentially building 10 distinct models. For the first model, you might use the first 10 records as your testing data and the remaining 90 records as your training data. Once you’ve built a model with that training data and tested it on that testing data, you move the testing fold down to the 11th-20th records, and use the 1st-10th and 21st-100th records combined as your training data. This process repeats, moving the testing fold through the dataset, until you have 10 distinct models and 10 distinct results, which gives you a far more robust picture of how well your process has learned than if you had just built one model with all your data.
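To make the mechanics concrete, here is a minimal sketch of that fold rotation using scikit-learn’s KFold helper; the synthetic data and the logistic regression model are placeholders, not recommendations.

```python
# A sketch of 10-fold cross-validation with scikit-learn.
# X and y stand in for your own feature matrix and labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

scores = []
kf = KFold(n_splits=10, shuffle=False)  # the testing fold moves through the data in order
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])  # train on the other 90 records
    scores.append(model.score(X[test_idx], y[test_idx]))          # test on the held-out 10

print("10 fold accuracies:", np.round(scores, 2))
print("mean accuracy:", round(float(np.mean(scores)), 2))
```

Each iteration trains on 90 records and tests on the held-out 10, mirroring the walkthrough above.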

Cross-validation is mostly a straightforward process, but there are a couple of things to watch out for while you’re performing it. The first possible issue is the introduction of bias into your testing data. You have to be careful with the data transformations that you perform while using cross-validation; for example, if you’re boosting the presence of a target class by adding new rows of that class’ data, you have to make sure that the boosting occurs after you perform the testing-training split. Otherwise, your testing results will appear better than they really are, since copies of the boosted rows will leak into your testing data.
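As a hedge against exactly that kind of leakage, any boosting should happen inside the cross-validation loop, on the training fold only. Here is a rough sketch of the idea; the oversampling helper and the factor of 2 are illustrative assumptions, not a specific library API.

```python
# Sketch: boost (oversample) the target class only inside the training fold,
# never before the train/test split, so the testing fold stays untouched.
import numpy as np

def oversample_minority(X_train, y_train, target_class, factor=2):
    """Duplicate rows of `target_class` in the training data only (hypothetical helper)."""
    mask = (y_train == target_class)
    X_extra = np.repeat(X_train[mask], factor - 1, axis=0)
    y_extra = np.repeat(y_train[mask], factor - 1, axis=0)
    return np.vstack([X_train, X_extra]), np.concatenate([y_train, y_extra])

# Inside the cross-validation loop from the previous sketch:
# X_tr, y_tr = oversample_minority(X[train_idx], y[train_idx], target_class=1)
# model.fit(X_tr, y_tr)                   # boosted training data
# model.score(X[test_idx], y[test_idx])   # untouched testing fold
```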

Another thing to consider is whether or not to use stratified cross-validation. Stratified cross-validation is an enhancement to the cross-validation process where, instead of relying on the arbitrary order of your data set or a random sampling function to build your testing and training data, you sample so that each testing and training set preserves roughly the same proportional representation of each class as the whole data set. For example, if your data is 75% class A and 25% class B, stratified cross-validation would attempt to make testing and training samples that maintain that balance of classes. This has the benefit of more accurately depicting the nature of your original problem than a random or arbitrary split.
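scikit-learn ships a stratified variant of the fold splitter; a quick sketch on synthetic data shows each testing fold keeping roughly the 75/25 split described above.

```python
# Sketch: stratified folds keep the class proportions (e.g. 75% A / 25% B)
# roughly the same in every testing and training split.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=100, weights=[0.75, 0.25], random_state=0)

skf = StratifiedKFold(n_splits=10)
for train_idx, test_idx in skf.split(X, y):
    print(Counter(y[test_idx]))  # each testing fold holds roughly 7-8 of class 0 and 2-3 of class 1
```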

Concerns in Selection

The major topic to think about when deciding what machine learning model to use is the shape and nature of your data. Some of the high level questions you might ask yourself: Is this a multi-class or binary class problem? (Note, if you only care about a single target class within a multi-class dataset, it’s possible to treat it as a binary class problem.)  What is the distribution of classes? If the distribution is highly uneven, you may want to avoid certain types of models, such as decision tree based models, or consider boosting the presence of the under-represented class.

Another major question is whether your data is linearly separable or not. While it’s rare that the complicated datasets you will encounter are truly ‘linearly’ separable, datasets that contain more clearly defined class boundaries are good candidates for models such as support vector machines. That being said, it can be difficult to get a good picture of your dataset if it has a high number of features (also known as high dimensionality). In this case, there are still ways to map your dataset onto a 2D plane, and it can be highly useful to do so as an initial step before model selection, in order to give you new insights into your data set. Rather than detail the approaches here, I’ll point you to a post by Andrew Norton which details how you can use the matplotlib library in Python to visualize multi-dimensional data.

One of the final considerations that you have to make when selecting your model is the size of your data, both in terms of volume and dimensionality. Obviously, as these variables increase, so will the runtime of your model training, but it’s worth noting that there are models that will build relatively quickly – such as a Random Forest algorithm – and models that will become prohibitively slow as your data gets larger and larger – such as many neural network implementations. Make sure that you understand your data, your hardware resources, and your expectations of runtime before you start learning and working with a new training algorithm.

Concerns in Construction

When it comes to actually building your models, there’s nothing stopping you from just plugging your data right into your machine learning library of choice and going off to the races, but if you do, you may end up regretting it. It’s important to realize as you’re building the framework for your project that everything – from your number of cross-validation folds to aspects of your pre-processing to the type of model itself – is not only subject to change as you experiment, but also is highly likely to do so.

For that reason, it’s more critical than ever that you write modular, reusable code. You will be making changes. You will want to be able to pass a range of values to any given aspect of your code, such as the percentage of features to select from your raw data set. Make your life easier by starting the project with different pieces in different functions, and by exposing any values that may need to be updated during testing as function parameters.

A similar concept applies to flow controls. It may be that you want to be able to turn on or off your feature selection functionality, or your class boosting, or switch quickly between different models. Rather than having to copy-paste or comment out large chunks of code, simply set up an area at the beginning of your scope with Boolean values to control the different aspects of your program. Then, it’ll be a simple change from True to False to enable or disable any particular part of your process.
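A minimal sketch of that idea in Python follows; the pipeline steps it calls – select_features, boost_minority, train_model – are hypothetical placeholders for your own project functions.

```python
# Sketch: Boolean switches at the top of the script control which pipeline
# stages run, so nothing needs to be commented out or copy-pasted.
USE_FEATURE_SELECTION = True    # flip to False to skip feature selection
USE_CLASS_BOOSTING = False      # flip to True to oversample the target class
MODEL_TYPE = "random_forest"    # e.g. "random_forest" or "svm"

def run_pipeline(X, y):
    # select_features, boost_minority, and train_model are hypothetical
    # placeholders for functions defined elsewhere in your project.
    if USE_FEATURE_SELECTION:
        X = select_features(X, y)
    if USE_CLASS_BOOSTING:
        X, y = boost_minority(X, y)
    return train_model(X, y, MODEL_TYPE)
```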

Conclusion

I hope this post has given you some insights into things to think about before starting the construction of your machine learning project. There are many things to consider before any actual testing begins, and how you answer these questions and approach these problems can make that process either seamless or very frustrating. All it takes are a few good practices and a bit of information gathering for you to be well on your way to unlocking the knowledge hidden in your data.


Overview

One of the things you realize quickly going from guides, classes, and tutorials into hands-on machine learning projects is that real data is messy. There’s a lot of work to do before you even start considering models, performance, or output. Machine learning programs follow the “garbage in, garbage out” principle; if your data isn’t any good, your models won’t be either. This doesn’t mean that you’re looking to make your data pristine, however. The goal of pre-processing isn’t to support your hypothesis, but instead to support your experimentation. In this post, we’ll examine the steps that are most commonly needed to clean up your data, and how to perform them to make genuine improvements in your model’s learning potential.

Handling Missing Values

The most obvious form of pre-processing is the replacement of missing values. Frequently in your data, you’ll find that there are missing numbers, usually in the form of a NaN flag or a null. This could have been because the question was left blank on your survey, or there was a data entry issue, or any number of different reasons. The why isn’t important; what is important is what you’re going to do about it now.

I’m sure you’ll get tired of hearing me say this, but there’s no one right answer to this problem. One approach is to take the mean value of that feature. This has the benefit of creating a relatively low impact on the distinctiveness of that feature. But what if the values of that feature are significant? Or have a wide enough range, and polarized enough values, that the average is a poor substitute for what would have been the actual data? Well, another approach might be to use the mode of the feature. There’s an argument for the most common value being the most likely for that record. And yet, you’re now diminishing the distinctiveness of that answer for the rest of the dataset.

What about replacing the missing values with 0? This is also a reasonable approach. You aren’t introducing any “new” data to the data set, and you’re making an implicit argument within the data that a missing value should be given some specific weighting. But that weighting could be too strong with respect to the other features, and could cause those rows to be ignored by the classifier. Perhaps the most ‘pure’ approach would be to remove any rows that have any missing values at all. This too is an acceptable answer to the missing values problem, and one that maintains the integrity of the data set, but it is frequently not an option, depending on how much data you would have to give up with this removal.
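For reference, here is a small pandas sketch of the four options discussed above, applied to a hypothetical “income” column.

```python
# Sketch: four ways to handle NaNs in a hypothetical "income" column.
import pandas as pd

df = pd.DataFrame({"income": [40000, None, 85000, None, 52000]})

mean_filled = df["income"].fillna(df["income"].mean())     # mean of the feature
mode_filled = df["income"].fillna(df["income"].mode()[0])  # most common value
zero_filled = df["income"].fillna(0)                       # explicit zero weighting
dropped     = df.dropna(subset=["income"])                 # discard incomplete rows
```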

As you can see, each approach has its own argument for and against, and will impact the data in its own specific way. Take your time to consider what you know about your data before choosing a NaN replacement, and don’t be afraid to experiment with multiple approaches.

Normalization and Scaling

As we discussed in the above section with the 0 case, not all numbers were created equal. When discussing numerical values within machine learning, people often refer to them instead as “continuous values.” This is because numerical values can be treated as having a magnitude and a distance from each other (i.e., 5 is 3 away from 2, and more than double the magnitude of 2). The importance of this lies in the math of any sort of linearly based or vector based algorithm. When there’s a significant difference between two values mathematically, it creates more distance between the two records in the calculations of the models.

As a result, it is vitally important to perform some kind of scaling when using these algorithms, or else you can end up with poorly “thought out” results. For example: one feature of a data set might be number of cars a household owns (reasonable values: 0-3), while another feature in the data set might be the yearly income of that household (reasonable values: in the thousands). It would be a mistake to think that the pure magnitude of the second feature makes it thousands of times more important than the former, and yet, that is exactly what your algorithm will do without scaling.

There are a number of different approaches you can take to scaling your numerical (from here on out, continuous) values. One of the most intuitive is that of min-max scaling. Min-max scaling allows you to set a minimum and maximum value that you would like all of your continuous values to be between (commonly 0 and 1) and to scale them within that range. There’s more than one formula for achieving this, but one example is:

X’ = (((X – old_min) / (old_max – old_min)) * (new_max – new_min)) + new_min

Where X’ is your result, X is the value in that row, and the old_max/min are the minimum and maximum of the existing data.
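Here is a short Python sketch of that formula, rescaling a hypothetical income column to the 0–1 range.

```python
# Sketch of the min-max formula above, rescaling a column to [new_min, new_max].
import numpy as np

def min_max_scale(x, new_min=0.0, new_max=1.0):
    old_min, old_max = x.min(), x.max()
    return ((x - old_min) / (old_max - old_min)) * (new_max - new_min) + new_min

income = np.array([20_000.0, 45_000.0, 80_000.0, 150_000.0])
print(min_max_scale(income))  # approximately [0.    0.192 0.462 1.   ]
```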

But what if you don’t know what minimum and maximum values you want to set on your data? In that case, it can be beneficial to use z-score scaling. Z-score scaling is a scaling formula that gives your data a mean of 0 and a standard deviation of 1. This is the most common form of scaling for machine learning applications, and unless you have a specific reason to use something else, it’s highly recommended that you start with z-score.

The formula for z-score scaling is as follows:

X’ = (X – mean) / standard_deviation

Once your data has been z-score scaled, it can even be useful to ‘normalize’ it by applying min-max scaling on the range 0-1, if your application is particularly interested in or sensitive to decimal values.
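A quick sketch of z-score scaling, both by hand and with scikit-learn’s StandardScaler, which applies the same formula column by column:

```python
# Sketch: z-score scaling gives the column a mean of 0 and standard deviation of 1.
import numpy as np
from sklearn.preprocessing import StandardScaler

income = np.array([20_000.0, 45_000.0, 80_000.0, 150_000.0])

z_by_hand = (income - income.mean()) / income.std()

scaler = StandardScaler()
z_sklearn = scaler.fit_transform(income.reshape(-1, 1))  # same result, as a column vector
```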

Categorical Encoding

We’ve focused entirely on continuous values up until now, but what about non-continuous values? Well, for most machine learning libraries, you’ll need to convert your string based or “categorical” data into some kind of numerical representation. How you create that representation is the process of categorical encoding, and there are – again – several different options for how to perform it.

The most intuitive is a one-to-one encoding, where each categorical value is assigned, and replaced by, an integer value. This has the benefit of being easy for a human to understand, but runs into issues when interpreted by a computer. For example: let’s say we’re encoding labels for car companies. We assign 1 to Ford, 2 to Chrysler, 3 to Toyota, and so on. For some algorithms, this approach would be fine, but for any that involve distance computations, Toyota now has three times the magnitude that Ford does. This is not ideal, and will likely lead to issues with your models.

Instead, it could be useful to try to come up with a binary encoding, where certain values can be assigned to 0 and certain values can be assigned to 1. An example might be engine types, where you only care if the engine is gas powered or electric. This grouping allows for a simple binary encoding. If you can’t group your categorical values, however, it might be useful to use what’s called ‘one-hot encoding’. This type of encoding converts every possible value for a feature into its own new feature. For example: the feature “fav_color” with answers “blue”, “red”, and “green”, would become three features, “fav_color_blue”, “fav_color_green”, and “fav_color_red”. For each of those new features, a record is given a 0 or a 1, depending on what their original response was.

One-hot encoding has the benefits of maintaining the most possible information about your dataset, while not introducing any continuous value confusion. However, it also drastically increases the number of features your dataset contains, often with a high cost to density. You might go from 120 categorical features with on average 4-5 answers each, to now 480-600 features, each containing a significant number of 0s. This should not dissuade you from using one-hot encoding, but it is a meaningful consideration, particularly as we go into our next section.
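A small sketch of one-hot encoding with pandas, using the fav_color example from above:

```python
# Sketch: one-hot encoding a hypothetical "fav_color" feature with pandas.
import pandas as pd

df = pd.DataFrame({"fav_color": ["blue", "red", "green", "blue"]})
encoded = pd.get_dummies(df, columns=["fav_color"])  # one new column per original value
print(encoded.columns.tolist())
# ['fav_color_blue', 'fav_color_green', 'fav_color_red']
```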

Feature Selection

Another way in which your data can be messy is noise. Noise, in a very general sense, is any extraneous data that is either meaningless or confuses your model. As the number of features in your model increases, it can actually become harder to distinguish between classes. For this reason, it’s sometimes important to apply feature selection algorithms to your dataset to find the features that will provide you with the best models.

Feature selection is a particularly tricky problem. At its most simple, one can just remove any features that contain a single value for all records, as they will add no new information to the model. After that, it becomes a question of calculating the level of mutual information and/or independence between the features of your data. There are many different ways to do this, and the statistical underpinnings are too dense to get into in the context of this blog post, but several machine learning libraries will implement these functions for you, making the process a little easier.
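As a rough sketch of that two-step idea – drop constant features, then rank the rest – scikit-learn’s VarianceThreshold and mutual_info_classif can do the heavy lifting; the synthetic data here is only for illustration.

```python
# Sketch: drop constant features, then rank the rest by mutual information.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X = np.hstack([X, np.ones((200, 1))])            # add a constant (useless) feature

X_varied = VarianceThreshold(threshold=0.0).fit_transform(X)  # removes the constant column
mi = mutual_info_classif(X_varied, y)            # higher score = more informative feature
top_features = np.argsort(mi)[::-1][:5]          # indices of the 5 most informative features
```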

Outlier Removal

The other main form of noise in your data comes from outliers. Rather than a feature causing noise across multiple rows, an outlier occurs when a particular row has values that are far outside the “expected” values of the model. Your model, of course, tries to include these records, and by doing so pulls itself further away from a good generalization. Outlier detection is its own entire area of machine learning, but for our purposes, we’re just going to discuss trying to remove outliers as part of pre-processing for classification.

The simplest way to remove outliers is to just look at your data. If you’re able to, you can certainly select by hand, for each feature, the values that fall far outside the scope of the rest. If there are too many records, too many features, or you want to remain blind and unbiased to your dataset, you can also use clustering algorithms to determine outliers in your data set and remove them. This is arguably the most effective form of outlier removal, but it can be time consuming, as you now have to build models in order to clean your data, in order to build your models.
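One possible clustering-based sketch uses DBSCAN, which labels points that don’t fit into any cluster as -1; the eps and min_samples values below are illustrative and would need tuning for real data.

```python
# Sketch: flag outliers with DBSCAN, which labels points that don't belong to
# any cluster as -1, then drop those rows before training.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(98, 2)),
               [[8.0, 8.0], [-9.0, 7.0]]])   # two obvious outliers

labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(StandardScaler().fit_transform(X))
X_clean = X[labels != -1]                    # keep only the clustered (non-outlier) rows
```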

Conclusion

Pre-processing may sound like a straightforward process, but once you get into the details it’s easy to see its importance in the machine learning process. Whether it’s preparing your data to go into the models, or trying to help the models along by cleaning up the noise, pre-processing requires your attention, and should always be the first step you take towards unlocking the information in your data.


Overview

Getting started on a machine learning project is always a challenge. There are lots of questions to answer, and frequently, you don’t even know what questions to ask. In this post, and the four others linked to in their respective sections, I hope to explain the fundamentals of building a machine learning project from the ground up, what kind of choices you might have to make, and how best to make them.

For the purposes of this blog post, we’re going to be focusing mostly on supervised learning models – ones that learn with labeled data – as opposed to unsupervised. While most of what we’ll talk about can be applied to both sorts of problems, unsupervised learning has some quirks that are worth talking about separately, which won’t be addressed in this overview.

 

Preprocessing

The very first part of any machine learning project is cleaning and reformatting the data, also known as pre-processing. Raw data is almost never ready to go straight from the source into your training program.

The first reason for this is that raw data is frequently very messy. This can mean that there are values missing, or mis-labeled, and they might need to be replaced. Our first decision has now cropped up: what value do we want to substitute in for the ones that are missing? One option is to use the average, if the values are numerical; another is to use the most common, if the values are string-based. Regardless of what you choose, the shape of the data will be impacted, and it’s usually worth trying multiple approaches. Other forms of data cleaning include removing extraneous features, or removing outlier records, in order to decrease the amount of noise in the dataset.

A second reason for pre-processing is that frequently data will need to be reformatted before it is usable by a machine learning library. This often leads to the processes of categorical encoding – changing string-like values to continuous (AKA numerical) values – and numerical scaling. There are many different approaches to both these processes as well, each with their own benefits and tradeoffs. The most important thing to remember in pre-processing is that the choices you make now will impact the effectiveness of your model later, so serious consideration should be given to these questions when beginning your project.

For more information on pre-processing approaches, see our detailed blog post here.

Model Selection and Creation

With the data cleaned up and ready to go, it’s time to pick a method/algorithm to use to build your models. There is rarely a “right” answer when it comes to model selection; as with most things in machine learning, there are only good questions, reasonable approaches, and always some trade-offs.

Some of the big questions to consider when selecting a model are things like: Do I believe my data is linearly separable? What impact will the distribution of classes in my data set have on my model (ie, is the data biased heavily towards one class)? Do I need my model to support multi-class classification, or just a binary classification? How large is my dataset – in terms of both records and features – and how is that going to affect the runtime of my model? The answers you come up with to these questions might point you in the direction of any number of different models, but the key is to not think narrowly in terms of model selection. We’ll discuss how these questions might relate to specific machine learning models in a future post.

Before we continue though, it’s important to discuss a topic that will affect the flow of your project in a significant way: cross-validation. Cross-validation is the concept of building multiple models with different cross sections of your data set in order to determine the validity of your results. By using a subsection of your training data as your initial test data, it allows you to check the performance of your model before you start working with actual unlabeled data. The cross sections are generally created by determining a number of desired folds, n, and using 1/n of the records as your testing data and (n-1)/n of the records as the training data. Cross-validation can be further enhanced by using a stratified cross-validation process, which preserves the class proportions of the whole data set as it builds the training sets, instead of taking a purely random sample.

For more information on model selection and cross-validation, see our detailed blog post here.

Model Testing and Metrics

Once you have your models built, how do you know whether or not they’ve learned anything? This question is harder to answer than it looks. One might want to go on the model’s ability to “accurately” predict testing records. However, this allows for the following example: your model predicts that every record is of class A, 100% of the time, in a binary classification problem. This is not a good classifier. However, when you give your testing data to the model, it will be able to show 90% accuracy if – for whatever reason – your testing data is 90% class A. Your model completely failed to “learn” any other classes, but in this case, the accuracy is still very high. These sorts of thought experiments show why it’s important to look at a more complex system of metrics to determine a model’s quality.

This can most easily be done by calculating the following features: True Positives, False Positives, True Negatives, and False Negatives. These values keep track of how your model’s predictions align with the actual labels of your testing data. So, if a record is of class B, and your model predicts that it is of class A, you add one to the count of False Positives for class A (your model incorrectly classified as A), and one to the count of False Negatives for class B (your model failed to correctly classify as B).

These numbers can then be used to show accuracy ((True Positives + True Negatives) / number of records), your model’s effectiveness at capturing a target class, aka “recall” (True Positives / (True Positives + False Negatives)), and how accurate your model is at predicting a target class, aka “precision” (True Positives / (True Positives + False Positives)). These are only a few of the ways to evaluate model performance, but they are frequently the most useful.
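Here is a small sketch of these calculations in plain Python, using the always-predict-A thought experiment from above to show how a 90% “accurate” model can have zero recall and zero precision for the other class.

```python
# Sketch: accuracy, recall, and precision for one target class from raw labels.
def class_metrics(y_true, y_pred, target):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == target and p == target)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != target and p == target)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == target and p != target)
    tn = len(y_true) - tp - fp - fn
    accuracy  = (tp + tn) / len(y_true)
    recall    = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return accuracy, recall, precision

# A classifier that always predicts "A" looks accurate but never captures "B":
y_true = ["A"] * 9 + ["B"]
y_pred = ["A"] * 10
print(class_metrics(y_true, y_pred, target="B"))  # (0.9, 0.0, 0.0)
```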

For more information on model testing and evaluation metrics, see our detailed blog post here.

Model Improvement Methods

So now we have clean data, multiple models, and ways to peer at what’s going on inside your models (or at least know the quality of what’s coming out), but there’s still one critical step to go: tuning your models.

The exact process of this will depend on the particular algorithm that you’re using, but a good place to start is by choosing a handful of large ranges for your model’s parameters, and narrowing from there via the results of your tests. However, you should not take this advice to mean that you can expect linear relationships from your parameters; in fact, you should not expect this at all. The purpose is more to start your testing broadly, and then narrow your scope as you continue. You should not be afraid to retread old possibilities if something doesn’t work out.
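One way to sketch that broad-then-narrow approach is a coarse grid search followed by a tighter one around the winners; the parameter names below assume a random forest and are only an example.

```python
# Sketch: start with broad parameter ranges, then narrow around the best result.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

broad = {"n_estimators": [10, 100, 500], "max_depth": [2, 10, 50]}
search = GridSearchCV(RandomForestClassifier(random_state=0), broad, cv=5).fit(X, y)
print(search.best_params_)   # e.g. {'max_depth': 10, 'n_estimators': 100}

# Next round: a tighter grid around the winners, e.g.
# {"n_estimators": [75, 100, 150], "max_depth": [5, 10, 20]}
```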

Aside from tuning the parameters of your specific model, there are a handful of general approaches that you can use as well. One of these is boosting, which involves increasing the volume or the weight of your target class or classes. It is important to perform this after splitting your data for cross-validation, or else you will contaminate your model creation with bias. You should not boost your testing data.

Other processes mentioned before, such as outlier removal, feature selection, and changing the amount of cross-validation can also improve the quality of your models. For more information on model improvement methods, see our detailed blog post here.

Conclusion

Hopefully this general outline – and the posts to come – have given you a good starting framework for how to tackle your machine learning problems. There exists much more out there in the world of ML to be sure, but with solid fundamentals, you can start unlocking the secrets hidden in your data and use them to empower your decisions and applications.


In this post, I’ll go over how you can use Oracle WebLogic as your Java based web service to host a Logi Analytics application. The process is similar to hosting a Logi Analytics application on another Java service, such as Apache Tomcat or JBoss, but features a couple of specific quirks that are worth going over in detail.

Installing Java and WebLogic

The first step in hosting your Logi Analytics application is to install the Java Development Kit (JDK) and the web service, WebLogic. To download the latest version of the JDK, go to http://www.oracle.com/technetwork/java/javase/downloads/index.html and select the correct version of the JDK installer for your operating system. Once the installer is downloaded, run it, and make sure that you know where the Java system files are located on your hard drive, as you’ll need to reference them later.


Once you have the JDK, you can download the latest version of WebLogic from http://www.oracle.com/technetwork/middleware/weblogic/overview/index.html. Again, go to the downloads page and select the correct version for your operating system. Then, within a terminal program, go to the folder that contains your WebLogic installer jar, and run the command: “java -jar <weblogic_jar_file_name>”. If the java command has not been added to your terminal path, you may need to use the full file path to your java executable, such as: “C:\Program Files\Java\jdk1.8.0_131\bin\java -jar …..” This command will open the WebLogic installer. Follow the prompts to install WebLogic Server.


Creating a WebLogic Domain

With WebLogic installed, it’s time to set up your server domain. Open a terminal, and navigate to your oracle home location (e.g., “C:\Oracle\Middleware\Oracle_Home”), and then further to: “<ORACLE_HOME>\oracle_common\common\bin”. Run “config.cmd”. This will open the WebLogic configuration wizard, which will allow you to set up a new domain where your server will be hosted. Follow the prompts in the wizard to create a new domain.


Deploying Your Logi Application

This should create a folder at “<ORACLE_HOME>\user_projects\domains\<your domain name>”. Inside your domain files, place your Logi program folder into the folder “autodeploy”. Once the files are copied, add the extension “.war” to the end of your Logi program folder’s filename. Lastly, you need to download an additional xml file provided by Logi, and save it into the WEB-INF folder of your Logi program. If you’re using Oracle WebLogic 12c, download this file at http://devnet.logianalytics.com/downloads/info/java/weblogic12c/weblogic.xml. If you are using BEA WebLogic 10, only versions 10-10.3 are supported. For versions 10-10.2, download the weblogic.xml file here: http://devnet.logianalytics.com/downloads/info/java/weblogic10/weblogic10.xml. If you are using BEA WebLogic 10.3, download the xml file here: http://devnet.logianalytics.com/downloads/info/java/weblogic10/weblogic10.3.xml.


Run Your Server

Finally, in your terminal window, navigate to your domain’s main folder (“<ORACLE_HOME>\user_projects\domains\<your domain name>”) and run “startWebLogic.cmd”. This will start your hosting service, and automatically deploy your Logi application. When the starter’s logs say the server is running, confirm that your application is deployed by visiting its URL, or by going to the WebLogic console, which will be located at http://host:<WebLogic_port>/console.

AWS Database Migration Pitfall #4 of 4: Making Perfect the Enemy of Good

Welcome to the final installment of our blog series on AWS database migration pitfalls. In addition to reading the preceding three blogs, you can download our eBook for further details on all four.

Good Enough is Good Enough


“Good enough is good enough” might not be the kind of statement that engenders admiration, but it’s true nonetheless. It’s important to balance the drive for excellence with the need for momentum.

In blog #1, we discussed the pitfall of failing to set clear goals, and blog #2 covered failing to evaluate all AWS database options. It may seem like pitfall #4, making perfect the enemy of good, contradicts that advice, but it’s important to bear in mind that perfect is always elusive.

Prioritizing your goals and educating yourself on choices are important preliminary steps, but there comes a point where a project becomes thwarted by “analysis paralysis.” The benefits of further planning won’t outweigh the delay in launch.

Making Perfect the Enemy of Good vs. “Lift and Shift”

Making perfect the enemy of good is the opposite of “lift and shift,” in which organizations migrate without making any changes or redesigning the database and data processing engine.

But you must evaluate the tradeoff of each addition. Your biggest gains may come from two to three substantial things. Piling on additional services may not only delay your launch, but unnecessarily inflate your costs.

Incremental Improvements are The Norm in AWS Database Migrations

The TCO benefits of AWS increase over time. According to IDC, “There is a definite correlation between the length of time customers have been using Amazon cloud services infrastructure and their returns. At 36 months, the organizations are realizing $3.50 in benefits for every $1.00 invested in AWS; at 60 months, they are realizing $8.40 for every $1.00 invested. This relationship between length of time using Amazon cloud infrastructure services and the customers’ accelerating returns is due to customers leveraging the more optimized environment to generate more applications along a learning curve.”

Iterative progress is the norm. There’s simply no need to wait for perfection prior to migrating.

Learning and Growing with AWS

The rate at which AWS innovates continues to accelerate. During his 2016 re:Invent keynote, CEO Andy Jassy stated that AWS added around 1,000 significant new services and features in 2016, up from 722 the previous year. Jassy stated, “Think about that. As a builder on the AWS platform, every day you wake up, on average, with three new capabilities you can choose to use or not.”

The nature of AWS necessitates growth and change. Even with the best initial migration, you’ll need to evaluate new product offerings over time.

Perspective

An AWS database migration isn’t a rocket launch. Even with zero rearchitecting, you’ll still realize numerous benefits, such as translating CapEx to OpEx. While failure to consider rearchitecting can result in leaving money on the table, it’s not going to result in a catastrophic failure for your business.

A skilled AWS partner like dbSeer can help you launch successfully in a reasonable time frame. Contact us to discuss your unique priorities.


Welcome to number three in our series of four blogs on mistakes commonly made in AWS database migrations. You can read the first two blogs, “Failing to Set Clear Goals,” and, “Failing to Evaluate All AWS Database Options,” and download our eBook for further details on all four.

Optimize your AWS Database Migration to Reduce Costs

One well-known benefit of cloud computing is the ability to translate capital expenses to operating expenses. Moving to AWS in a simple “lift and shift” manner will achieve this objective. However, optimizing your AWS architecture can further reduce costs by as much as 80%.

Determine Total Cost of Ownership

Weighing the financial considerations of owning and operating a data center facility versus employing Amazon Web Services requires detailed and careful analysis.

AWS has a calculator you can use to estimate your Total Cost of Ownership in moving to the cloud. The TCO website states, “Our TCO calculators allow you to estimate the cost savings when using AWS and provide a detailed set of reports that can be used in executive presentations.” The TCO Calculator is an excellent tool to start off with.

The AWS TCO Calculator can provide an instant summary report which shows you the three-year TCO comparison by cost category.

Include Both Direct and Indirect Costs of AWS Database Migration

Network World published four steps to calculating the true cost of migrating to the cloud. As part of step 1, “Audit your current IT infrastructure costs,” author Mike Chan writes, “You should take a holistic approach and consider the total cost of using and maintaining your on-premise IT investment over time, and not just what you pay for infrastructure. This calculation will include direct and indirect costs.”

And the same considerations must be made in calculating your costs when considering your unique approach to migration.

Pay Only for What You Use, Use Only What You Need

AWS customers pay only for computing resources consumed. But to fully leverage this benefit, you must be sure you’re not consuming more than necessary.

For example, enterprises often size servers to support peak use to ensure high availability. But rather than remaining sized for max processing, you can code your apps to support the elasticity offered by AWS. Such adjustments could include rearchitecting a processing engine so that it can run on a smaller machine, shut down and move to a bigger machine when needed, or process across ten machines if the architecture supports partitioning.

Taking full advantage of this requires the expertise of a knowledgeable AWS partner.

AWS Database Options

While AWS easily hosts a variety of licensed databases, just shifting your databases to an Amazon EC2 instance will require you to continue paying those licensing fees. But by selecting AWS database options, such as Redshift and Aurora, you can eliminate many of those existing license fees. This often requires you to also move your existing logic, structure, and code, which can be both time-consuming and expensive to convert to another platform.

However, the savings can be very high, enabling a quick return on investment.

Ensure You’re Not Leaving Money on The Table

Rearchitecting requires both time and a financial investment. Many organizations lack the expertise in-house to conduct the re-architecture properly. It’s therefore difficult to determine the cost of making changes, and the potential savings that could result.

By analyzing your existing bills and structure, a skilled AWS partner like dbSeer can estimate potential cost savings by taking these types of steps. Contact us and we’ll help figure out how you can save every penny possible.

AWS Database Migration Pitfalls: You must evaluate all the options to see where each piece fits.

We’re addressing four mistakes commonly made in AWS database migrations. You can also read the first blog, “AWS Database Migration Pitfall #1 of 4: Failing to Set Clear Goals,” and download our eBook for further details on all four.

You Don’t Know What You Don’t Know

It’s essential to conduct thorough research prior to making any major purchase. Getting educated about your AWS database options is a necessary prerequisite to making the selections that will best accomplish your migration goals.

Moving your existing infrastructure in a “lift and shift” style might well be ideal. But you can’t know that it is the best path without first considering the alternatives.

Insight from AWS Chief Evangelist

On the AWS blog, AWS Chief Evangelist Jeff Barr wrote, “I have learned that the developers of the most successful applications and services use a rigorous performance testing and optimization process to choose the right instance type(s) for their application.” To help others do this, he proceeded to review some important EC2 concepts and look at each of the instance types that make up the EC2 instance family.

Barr continues, “The availability of multiple instance types, combined with features like EBS-optimization, and cluster networking allow applications to be optimized for increased performance, improved application resilience, and lower costs.”

And it is that availability of numerous options that results in opportunities for optimization. But evaluating the options first is necessary to know which configuration is optimal.

AWS Database Migration Choices: Open Source, Relational, Decoupled?

Just as the EC2 instance family is comprised of numerous instance types, there are multiple database options available to you when migrating to AWS. Choosing between licensed and open source is one of the many AWS database choices you’ll have to make. Be sure to evaluate columnar data store versus relational, and consider NoSQL options.

Additionally, AWS offers a wide variety of storage solutions with a corresponding variety of price points. In rearchitecting systems for the cloud, you may want to consider:

  • Keeping your raw data in cheap data storage, such as S3
    • And using services such as Elasticsearch, Spark, or Athena to meet app requirements
  • Decoupling batch processes from storage and databases
  • Decoupling data stores for various apps from a single on-premise data store
  • Streamlining time-consuming, mundane infrastructure maintenance tasks, such as backup and high availability

Developing an AWS Database Migration Plan to Improve Infrastructure

Failing to plan is planning to fail. Our first pitfall, “Failing to Set Clear Goals,” addressed the risks of not setting priorities. Once objectives have been prioritized, you must consider how to rearchitect your legacy workloads in order to best meet your goals.

In addition to maximizing cost savings, rearchitecting can enable you to improve lifecycle maintenance and incorporate new services like real-time data analytics and machine learning.

Be sure to download our eBook for more information and contact us. We can help you figure out how to not just leverage the cloud, but leverage it properly.

You need to balance your organization’s unique AWS database migration objectives

To be fair, there are likely more than four mistakes that can be made in the process of an AWS database migration. But we’ve grouped the most common errors into four key areas. This blog is the first in a series. You can also download our ebook for further details on all four.

Consider AWS Database Migration Objectives

With constant pressure to improve, organizations sometimes move to the cloud in a frenzied rush. In mistakenly thinking the cloud itself achieves all objectives, some abandon proper upfront planning.

Organizations frequently move to the cloud for benefits such as elasticity and costs savings. While everyone moving to the cloud will be able to translate capital expenses to operating expenses, benefits beyond that can vary.

And this variation requires you to set priorities. Every organization is unique, so you must carefully examine your unique objectives.

Common goals include:

  • Reduce costs
  • Improve scalability
  • Reduce maintenance effort and expense
  • Improve availability
  • Increase agility
  • Speed time to market
  • Retire technical debt
  • Adopt new technology

Set Priorities

You may read the above bulleted list and think, “Yes, exactly! I want to achieve that.”

Unfortunately, it may be challenging to accomplish every goal simultaneously, without delaying your migration. You must prioritize your organization’s particular goals so you can make plans which will appropriately balance objectives.

Take a Realistic Approach to AWS

It’s becoming ubiquitous, but still, AWS is no panacea. Goals may conflict with one another. Without first establishing priorities, you can’t determine the tactics which meet your goals.

“Do it all now,” isn’t an effective strategy for success.

AWS Database Migration Issues to Consider

You will likely need to reevaluate your architecture to fully take advantage of all the possible benefits. Issues to take into consideration include:

  • How much downtime your business could sustain
  • Your current licensing situation
  • Third party support contracts
  • Current use of existing databases. For example, consider the maximum possible number of applications which could use a database
  • Application complexity, where the code is running
  • Skills required – both internally and from an AWS Consulting Partner

 

Various changes are often required, such as shifting to open source services in order to eliminate unnecessary licensing expenses.

 

Tips for Setting AWS Migration Goals

Focus on why you are migrating. Maintaining focus on your specific objectives will impact the way you implement.

Make sure all stakeholders are aligned on prioritization. Take the time to work as a team and get as much consensus as possible from multiple people. Yes, it’s difficult to manage a project by committee, and it can lead to delays. But you should, again, strive to balance these often competing objectives.

Download our eBook for more information and contact us. We can help you determine the ideal pathway for your organization and get started.



As an increasing number of companies are moving their infrastructure to Microsoft’s Azure, it seems natural to rely on its Active Directory for user authentication. Logi application users can also reap the benefits of this enterprise level security infrastructure without having to duplicate anything. Additionally, even smaller companies who use Office365 without any other infrastructure on the cloud, excluding email of course, can take advantage of this authentication.
Integrating Logi applications with Microsoft’s Active Directory produces two main benefits: attaining world class security for your Logi applications, and simplifying matters by having a single source of authentication. The following post describes how this integration is done.
1. Register Your Application with Microsoft
First, register your application with Azure Active Directory v. 2.0. This will allow us to request an access token from Microsoft for the user. To do this navigate to “https://apps.dev.microsoft.com/#/appList,” and click the “Add an app” button. After entering your application name, on the following page, click the “Add Platform” button and select “Web”. Under Redirect URLs, enter the URL of your website logon page (sample format: https:////.jsp). Microsoft does not support redirects to http sites, so your page must either use https or localhost. Make note of the redirect URL and application ID for the next step.
2. Create Custom Log-on Page For Logi Application:
Microsoft allows users to give permissions to an application using their OAuth2 sign-in page. This process returns an access token, which has a name, email address, and several other pieces of information embedded within which we use to identify the user. These next steps show you how to create a log-in page that redirects users to the Microsoft sign-in, retrieves the access token, and passes whichever value you want to use to identify the employee to Logi.
1) Download the rdLogonCustom.jsp file or copy paste the contents into a file. Place it in the base folder of your application.
2) Configure the following settings within the rdLogonCustom.jsp file to match your Logi application:
Change the ‘action’ element in the HTML body to the address of your main Logi application page:

Change the “redirectUri” and “appId” in the buildAuthUrl() function to match the information from your application registration with Azure AD v2.0:

The sample log-on page redirects the user to Microsoft’s sign-in page, allows the user to sign in, and then redirects back to the log-on page. There, it parses the token for the email address and passes that value to the authentication element as a request parameter, using the hidden input.
If you want to use a different value from the access token to identify the user, adjust the key in the “document.getElementById(‘email’).value = payload.” in the bottom of the custom logon file to match your desired value.

3. Configure Logi App:
In your _Settings.lgx file, add a security element with the following settings:

*If your log-on page and failure page have different names, adjust accordingly.
Under the security element, add an authentication element with a data layer that uses the value found in @Request.email~ to identify the user. Optionally, you can add rights and roles elements to the security element as well.
In conclusion, utilizing this integration for your Logi applications can not only make your process more efficient by eliminating a duplicate authentication, but it can also provide an added level of security because of Microsoft’s robust infrastructure.


Recently, options for connecting to distributed computing clusters with SQL-on-Hadoop functionality have sprung up all over the world of big data.

Amazon’s Elastic Map Reduce (EMR) is one such framework, enabling you to install a variety of different Hadoop based processing tools on an Amazon hosted cluster, and query them with software such as Hive, Spark-SQL, Presto, or other applications.

This post will show you how to use Spark-SQL to query data stored in Amazon Simple Storage Service (S3), and then to connect your cluster to Logi Studio so you can create powerful visualization and reporting documents.

 

Amazon EMR

Amazon Elastic Map Reduce is an AWS service that allows you to create a distributed computing analytics cluster without the overhead of setting up the machines and the cluster yourself. Using Amazon’s Elastic Compute Cloud (EC2) instances, EMR creates and configures the requisite number of machines you desire, with the software you need, and (almost) everything ready to go on startup.

Spark-SQL

Spark-SQL is an extension of Apache Spark, an open source data processing engine that eschews the Map Reduce framework of something like Hadoop or Hive for a directed acyclic graph (or DAG) execution engine. Spark-SQL allows you to harness the big data processing capabilities of Spark while using a dialect of SQL based on Hive-SQL. Spark-SQL is further connected to Hive within the EMR architecture since it is configured by default to use the Hive metastore when running queries. Spark on EMR also uses Thriftserver for creating JDBC connections, which is a Spark specific port of HiveServer2. As we will see later in the tutorial, this allows us to use Hive as an intermediary to simplify our connection to Spark-SQL.

Logi Analytics

Logi Analytics is an integrated development environment used to produce business intelligence tools, such as data visualizations and reports. Within Logi Studio, one can create powerful and descriptive webpages for displaying data, while requiring relatively little production of traditional code, as the Studio organizes (and often develops) the HTML, Javascript, and CSS elements for the developer. Logi Analytics can connect to a variety of different data sources, including traditional relational databases, web APIs, and – as this tutorial will demonstrate – a distributed data cluster, via a JDBC connection.

Creating a Cluster

First things first: let’s get our cluster up and running.

  1. For this tutorial, you’ll need an IAM (Identity and Access Management) account with full access to the EMR, EC2, and S3 tools on AWS. Make sure that you have the necessary roles associated with your account before proceeding.
  2. Log in to the Amazon EMR console in your web browser.
  3. Click ‘Create Cluster’ and select ‘Go to Advanced Options’.
  4. On the ‘Software and Steps’ page, under ‘Software Configuration’ select the latest EMR image, Hadoop, Spark, and Hive (versions used at time of writing shown in image below).
  5. On the ‘Hardware Configuration’ page, select the size of the EC2 instances your cluster should be built with. Amazon defaults to m3.xlarge, but feel free to adjust the instance type to suit your needs.
  6. On the ‘General Options’ page, make sure you name your cluster, and select a folder in S3 where you’d like the cluster’s logs to go (if you choose to enable logging).
  7. Finally, on the ‘Security Options’ page, choose an EC2 SSH key that you have access to, and hit ‘Create Cluster’. Your cluster should begin to start up, and be ready in 5-15 minutes. When it’s ready, it will display ‘Waiting’ on the cluster list in your console.

Connecting via SSH

In order to work on your cluster, you’re going to need to open port 22 in the master EC2’s security group so you can connect to your cluster via SSH.

  1. Select your new cluster from the cluster list and click on it to bring up the cluster detail page.
  2. Under the ‘Security and Access’ section, click on the link for the Master’s security group.
  3. Once on the EC2 Security Group page, select the master group from the list, and add a rule to the Inbound section allowing SSH access to your IP.
  4. You should also allow access to port 10001, as that will be the port through which we’ll connect to Spark-SQL later on. You can also do this via SSH tunneling in your SSH client if you’d prefer.
  5. Open an SSH client (such as PuTTY on Windows or Terminal on Mac) and connect to your cluster via the Master Public DNS listed on your cluster page, and using the SSH key you chose during the configuration. The default username on your EMR cluster is ‘hadoop’ and there is no password.

Configuring Permissions

Some of the actions we’re going to take in this tutorial will require the changing of certain file permissions. Make sure that you complete these commands in your SSH terminal before you try to start your Spark service!

  1. Navigate via the cd command to /usr/lib/ and type the command ‘sudo chown hadoop spark -R’. This will give your account full access to the spark application folder.
  2. Navigate via the cd command to /var/log/ and type the command ‘sudo chown hadoop spark -R’. This will give your account access to modify the log folder, which is necessary for when you begin the Spark service.

Importing the Data into Hive from CSV

For this tutorial, we’ll be assuming that your data is stored in an Amazon S3 Bucket as comma delimited CSV files. However, you can use any Hive accepted format – ORC, Parquet, etc – and/or use files stored locally on the HDFS. Refer to the Hive documentation to change the SQL below to suit those scenarios; the rest of the tutorial should still apply.

  1. In your SSH session, use the command ‘hive’ to bring up the hive shell.
  2. Once the shell is opened, use the following SQL to create a table in the Hive metastore – no actual data will be imported yet:

“CREATE EXTERNAL TABLE tablename(<column definitions>) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION 's3://bucket-name/path/to/csv/folder';”

  3. Now that the table exists, use the following command to populate the table with data:

“LOAD DATA INPATH 's3://bucket-name/path/to/csv/folder' INTO TABLE tablename;”

  4. If you provide a path to a specific folder, Hive will import all the data in all the CSV files that folder contains. However, you can also specify a specific .csv in the filename instead, if you so desire.

Starting Spark-SQL Service

Luckily for us, Amazon EMR automatically configures Spark-SQL to use the metadata stored in Hive when running its queries. No additional work needs to be done to give Spark access; it’s just a matter of getting the service running. Once it is, we’ll test it via the command line tool Beeline, which comes with your Spark installation.

  1. In SSH, type the command: “/usr/lib/spark/sbin/start-thriftserver.sh”
    1. Make sure you’ve configured the permissions as per the section above or this won’t work!
  2. Wait about 30 seconds for the Spark application to be fully up and running, and then type: “/usr/lib/spark/bin/beeline”.
  3. In the beeline command line, type the command: “!connect jdbc:hive2://localhost:10001 -n hadoop”. This should connect you to the Spark-SQL service. If it doesn’t, wait a few seconds and try again. It can take up to a minute for the Thriftserver to be ready to connect.
  4. Run a test query on your table, such as: “select count(*) from tablename;”
  5. Spark-SQL is currently reading the data from S3 before querying it, which slows the process significantly. Depending on the size of your dataset, you may be able to use the command “cache table tablename” to place your data into Spark’s local memory. This process may take a while, but it will significantly improve your query speeds.
  6. After caching, run the same test query again, and see the time difference.

Downloading the JDBC Drivers

Spark Thriftserver allows you to connect to and query the cluster remotely via a JDBC connection. This will be how we connect our Logi application to Spark.

  1. Using an SFTP program (such as Cyberduck), connect to your cluster, and download the following JARs from /usr/lib/spark/jars:
     SPARK HIVE JDBC
  2. To test the connection, download SQL Workbench/J.
  3. Make a new driver configuration, with all of the above jars as part of the library.
  4. Format the connection string as: “jdbc:hive2://host.name:10001;AuthMech=2;UID=hadoop” and hit “Test”.
  5. If your Thriftserver is running, you should see a successful connection.

Connecting with a Logi Application

Because we’re using a JDBC connection, you’ll have to use a Java based Logi Application, as well as Apache Tomcat to run it locally. If you don’t already have it, download Tomcat, or another Java web application hosting platform.

  1. Create a Java based Logi Application in Logi Studio, with the application folder located in your Apache Tomcat’s webapps folder.
  2. Add all of the JARs from the previous section to the WEB-INF/lib folder of your Logi application folder.
  3. Within Logi Studio, create a Connection.JDBC object in your Settings definition. The connection string should be formatted as: “JdbcDriverClassName=org.apache.hive.jdbc.HiveDriver;JdbcURL=jdbc:hive2://host.name:10001/default;AuthMech=2;UID=Hadoop”
  4. In a report definition, add a Data Table and SQL Data Layer, connected to your JDBC connection, and querying the table you created in your cluster. Make sure you add columns to your Data Table!
  5. In the command shell, navigate to your Tomcat folder, and start Tomcat via the command: “./bin/catalina.sh start”
  6. Try to connect to your Logi webpage (usually at “localhost:8080/your-Logi-App-name-here”) and see your data.

     

Final Thoughts

Because EMR clusters are expensive, and the data doesn’t persist, it eventually becomes important to automate this setup process. Once you’ve done it once, it’s relatively easy to use the EMR Console’s “Steps” feature to perform this automation. To create the table, you can use a Hive Step and a .sql file containing the commands in this tutorial saved on S3. To change the permissions and start the server, you can use a .sh script saved on S3 via the script-command jar. More information on that process is listed here. Make sure that your .sh script doesn’t include a bash tag, and that its end-of-line characters are formatted for Unix, rather than for a Windows/Mac text document.

Another possible solution to the lack of data persistence is to use a remote metastore. By default, Hive creates a local MySQL metastore to save the table metadata. However, you can override this behavior and use your own MySQL or PostgreSQL database as a remote metastore that will persist after the cluster is shut down. This is also advantageous for the Logi application, as the local metastore – Derby – does not allow for concurrent running of queries. Information on how to configure your cluster to use a remote metastore will be coming in a second tutorial, to be published soon!

     
