Gokhan Atil’s Technology Blog

Boosting Performance in Snowflake: The Power of Automatic Clustering

2023-05-16T00:00:00+00:00

When it comes to managing large data sets in Snowflake, Automatic Clustering is one of the key features that can help improve performance. But before we dive into how it works, let’s first take a look at how Snowflake stores data and what partition pruning means.

Snowflake stores data in the tables as micro-partitions. These are small, contiguous units of storage that Snowflake can use to quickly access the data you need. Snowflake maintains clustering metadata for the micro-partitions in a table, including min and max values of columns in the micro-partition. When you run a query, Snowflake examines the filters to determine which partitions contain the relevant data. For example, if a query filters data by a specific date range, Snowflake can exclude partitions that do not contain data within that range. This is called “partition pruning”. Partition pruning is particularly effective for large and complex data sets, where the amount of data that needs to be scanned can be prohibitively high.

Why is Automatic Clustering important and how exactly does Automatic Clustering help improve performance?

Automatic Clustering continually manages reclustering operations, and redistributse data to micro-partitions based on clustering keys. This will help Snowflake to read fewer micro-partitions to access the relevant data, resulting in faster query processing times and lower costs. When similar data is grouped together, compression algorithms can also work more effectively, further reducing the amount of data that needs to be scanned and processed.

In order to assess the benefits of automatic clustering in Snowflake and its impact on partition pruning and query performance, I will do some tests. First, I create a table with some dummy data:

CREATE or REPLACE TABLE sales (
    sales_date DATE,
    customer_id INTEGER,
    payload VARCHAR
) AS SELECT dateadd( 'day', uniform(0,450, random(0)), '2022-01-01' ),
     uniform(0,10000, random(0)),
     randstr( 1000, random(0) )
     FROM TABLE(generator(ROWCOUNT => 1000000 ));

In the current scenario, it is worth noting that no specific clustering key was defined during the creation of the table, leading to the data being organized based on the order in which it arrived. To assess the effectiveness of micro-partition pruning, I disabled the result cache (a mechanism that stores query results for subsequent use), and run the following statements:

SELECT COUNT( distinct customer_id) FROM sales WHERE sales_date = '2023-03-10'; 
SELECT operator_statistics:pruning  FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID())) WHERE operator_type='TableScan';

+----------------------------------------------------------+
|               OPERATOR_STATISTICS:PRUNING                |
+----------------------------------------------------------+
| {   "partitions_scanned": 41,   "partitions_total": 41 } |
+----------------------------------------------------------+

Upon examining the last query, it becomes apparent that our table comprises 41 micro-partitions, and despite our query targeting data from a single day (‘2023-03-10’), all micro-partitions are being scanned.

In order to demonstrate the functionality of automatic clustering, I create a duplicate of the table and define a clustering key on the sales_date column. It is important to note that Snowflake undertakes automatic reclustering in the background, but the process does not initiate immediately. Therefore, I will allow a brief interval for the maintenance to commence. Subsequently, I will execute queries against the automatic_clustering_history function at regular intervals until confirmation of the completion of the reclustering task:

CREATE or REPLACE TABLE sales_clustered AS SELECT * FROM sales;
ALTER TABLE sales_clustered CLUSTER BY (sales_date);

SELECT * FROM table(information_schema.automatic_clustering_history( TABLE_NAME => 'sales_clustered' )); 

After I see the maintance is completed, I re-run the queries I did earlier on non-clustered query:

SELECT COUNT( distinct customer_id) FROM sales WHERE sales_date = '2023-03-10'; 
SELECT operator_statistics:pruning  FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID())) WHERE operator_type='TableScan';


+---------------------------------------------------------+
|               OPERATOR_STATISTICS:PRUNING               |
+---------------------------------------------------------+
| {   "partitions_scanned": 2,   "partitions_total": 41 } |
+---------------------------------------------------------+

The implementation of clustering has effectively reduced the overall number of micro-partitions required to access the relevant data. I would like to remind that while Automatic Clustering is not essential for benefiting from partition pruning, it proves highly advantageous in maintaining the optimal structure of your table, enabling queries to leverage the advantages of partition pruning. It’s also important that choosing the right clustering key plays a critical role in achieving optimal performance. Here are some guidelines to follow when selecting a clustering key:

Identify the columns used in your most frequent queries: The clustering key should be chosen based on the columns that are most commonly used in your queries. This will ensure that the data is physically organized in a way that optimizes query performance.
Choose columns with high cardinality: Columns with high cardinality (i.e. a large number of distinct values) make better clustering keys because they allow for more precise grouping of similar data, which in turn results in more efficient query processing, but avoid very high cardinality will may increase maintenance costs.
Select columns with a good distribution of values: A good clustering key should have a distribution of values that is roughly uniform. If a column has a skewed distribution, it may result in unevenly distributed data and suboptimal query performance.
Consider the size of the table: You don’t need to define a clustering key for small tables (there is no strict rule but I don’t recommend automatic clustering for the tables under 100 GB).
Test different clustering keys: It may be necessary to test different clustering keys to determine which one provides the best performance for your specific workload.

In conclusion, Automatic Clustering is a powerful feature in Snowflake that can help improve query performance and reduce costs. By clustering your data, you can ensure that similar data is grouped together, making it easier to access. If you’re using Snowflake for data warehousing, be sure to take advantage of this powerful feature to get the most out of your data!

The Future of Data Warehousing: Snowflake vs. Traditional Solutions

2023-04-19T00:00:00+00:00

When it comes to managing and analyzing large amounts of data, traditional data warehouses have been the go-to solution for decades. But in recent years, a new challenger has emerged: Snowflake. So, what exactly is Snowflake, and how does it differ from traditional data warehouses?

To understand the differences, it’s important to first define what we mean by “traditional data warehouse.” Essentially, a traditional data warehouse is a large repository of structured data that’s been preprocessed, organized, and optimized for querying and analysis. Typically, data is loaded into the warehouse from a variety of sources, such as transactional databases, flat files, or external data feeds. The data is then transformed, cleaned, and loaded into the warehouse using ETL (extract, transform, load) processes.

Snowflake, on the other hand, is a cloud-based data warehouse that takes a different approach. Rather than requiring users to provision and manage their own hardware and software infrastructure, Snowflake is provided as a service. This means that all of the hardware, software, and networking is managed by Snowflake, allowing users to focus on analyzing their data rather than managing their infrastructure.

One of the key differences between Snowflake and traditional data warehouses is their architecture. Traditional data warehouses are typically implemented using a “shared-nothing” architecture, which means that data is partitioned and distributed across multiple nodes, each with its own CPU, memory, and storage. In contrast, Snowflake uses a hybrid of traditional shared-disk and shared-nothing architectures. Snowflake stores all data in a central repository and accessed using a virtual warehouse. This allows for greater flexibility and scalability, as users can scale their virtual warehouse up or down depending on their needs, without worrying about the underlying hardware.

Another major difference is the pricing model. Traditional data warehouses typically require a significant upfront investment in hardware and software, and ongoing maintenance costs can be high. In contrast, Snowflake operates on a pay-as-you-go model, meaning that users only pay for the resources they use, without needing to invest in any upfront infrastructure. This can make it a more cost-effective option, particularly for smaller businesses that don’t have the resources to invest in a traditional data warehouse.

Finally, Snowflake offers a number of unique features that traditional data warehouses may not have. For example, Snowflake allows users to query data in real-time, using a combination of traditional SQL and advanced analytics techniques. Additionally, Snowflake offers built-in support for semi-structured data, such as JSON, XML, and Avro, making it easier to work with complex data types.

While traditional data warehouses have been the go-to solution for managing and analyzing large amounts of data for decades, Snowflake represents a new breed of cloud-based data warehouses that offer unique features, scalability, and cost-effectiveness. Whether you’re a small business just getting started with data analysis, or a large enterprise looking to modernize your data infrastructure, Snowflake is definitely worth considering as an alternative to traditional data warehouses.

How to Disable Caching in Snowflake for Testing Query Performance?

2023-02-20T00:00:00+00:00

It’s a common question to ask how to disable caching in Snowflake for testing. Although it’s a very straightforward question, the answer is a bit complicated, and we need to understand the cache layers of Snowflake to answer this question.

There are three cache layers in Snowflake:

1) Metadata cache: The Cloud Service Layer has a cache for metadata. It impacts compilation time and metadata-based operations such as SHOW command. The users may see slow compilation times when the metadata cache required by their query is expired. This cache cannot be turned off and is not visible to end-users if the metadata cache is used.

2) Warehouse cache: Each node in a warehouse has an SSD storage. Snowflake caches recently accessed micro-partitions (from the Cloud storage) in this local SSD storage on the warehouse nodes. So running similar queries may use these cached micro-partitions instead of accessing remote storage. This cache cannot be turned off, but it’s possible to see how much warehouse cache is used via the query profile:

You may try to suspend/wait/resume the warehouse to clean the warehouse cache, but if the same nodes are assigned to the warehouse, your query may continue to use the cached data:

ALTER WAREHOUSE YOU_WH SUSPEND;
ALTER WAREHOUSE YOU_WH RESUME;

You may run the query on a new warehouse to avoid the warehouse cache, but it will be challenging if you need to re-run the query multiple times.

3) Result cache: Snowflake stores the results of queries on Cloud Storage. If a query is re-executed within 24 hours and the underlying wasn’t changed (check other requirements(https://docs.snowflake.com/en/user-guide/querying-persisted-results#retrieval-optimization)), Snowflake can return the result without executing the query. Each time the persisted result for a query is reused, Snowflake resets the 24-hour retention period for the result up to 31 days from the date and time the query was first executed. After 31 days, the result is purged, and the next time the query is submitted, a new result is generated and persisted.

The USE_CACHED_RESULT parameter can control this cache. The following command will disable it for the current session:

ALTER SESSION SET USE_CACHED_RESULT=FALSE;

You can check the query profile to see if a query uses the result cache:

Please note that even result cache will be used for a query, the query still needs to be compiled! Sometimes, people may see that the total execution time is longer when the result cache is used because the metadata cache is expired, and the query execution time is already fast.

So when testing the query performance, I suggest you ignore the metadata and warehouse cache. Just disable the result cache and run the query on a dedicated warehouse multiple times to get an average execution time. This will give you a better estimation of how the query will perform in the production environment.

PySpark Examples

2023-02-16T00:00:00+00:00

This post contains some sample PySpark scripts. During my “Spark with Python” presentation, I said I would share example codes (with detailed explanations). I posted them separately earlier but decided to put them together in one post.

Grouping Data From CSV File (Using RDDs)

For this sample code, I use the u.user file of MovieLens 100K Dataset. I renamed it as “users.csv”, but you can use it with the current name if you want.

Using this simple data, I will group users based on gender and find the number of men and women in the users data. As you can see, the 3rd element indicates the gender of a user, and the columns are separated with a pipe symbol instead of a comma. So I write the below script:

Here is the step-by-step explanation of the above script:

Line 1) Each Spark application needs a Spark Context object to access Spark APIs. So we start with importing the SparkContext library.
Line 3) Then I create a Spark Context object (as “sc”). If you run this code in a PySpark client or a notebook such as Zeppelin, you should ignore the first two steps (importing SparkContext and creating sc object) because SparkContext is already defined. You should also skip the last line because you don’t need to stop the Spark context.
Line 5) sc.TextFile method reads from a file and returns the content as RDD (when we call an action because RDDs have lazy evaluation). The print command will write out the result.
Line 6) I use “map” to apply a function to all rows of RDD. Instead of defining a regular function, I use the “lambda” function. The lambda functions have no name and are defined inline where they are used. My function accepts a string parameter (called X), parses the X string to a list, and returns the combination of the 3rd element of the list with “1”. So we get Key-Value pairs like (‘M’,1) and (‘F’,1). By the way, the index of the first element is 0.
Line 7) reduceByKey method is used to aggregate each key using the given reduce function. The previous “map” function produced an RDD which contains (‘M’,1) and (‘F’,1) elements. So the reduceByKey will group ‘M’ and ‘F’ keys, and the lambda function will add these 1’s to find the number of elements in each group. The result will be a Python list object: [(u’M’, 670), (u’F’, 273)]
Line 8) Collect is an action to retrieve all returned rows (as a list), so Spark will process all RDD transformations and calculate the result.
Line 10) sc.stop will stop the context – as I said, it’s not necessary for PySpark client or notebooks such as Zeppelin.

If you’re not familiar with the lambda functions, let me share the same script with regular functions:

It produces the same result with the same performance. Now let me write another one. This time, I will group the users based on their occupations:

Here is the step-by-step explanation of the above script:

Lines 1,3,14) I already explained them in the previous code block.
Line 5) Instead of writing the output directly, I will store the result of the RDD in a variable called “result”. sc.textFile opens the text file and returns an RDD.
Line 6) I parse the columns and get the occupation information (4th column)
Line 7) I filter out the users whose occupation information is “other”
Line 8) Calculating the counts of each group
Line 9) I sort the data based on “counts” (x[0] holds the occupation info, x[1] contains the counts) and retrieve the result.
Lined 11) Instead of print, I use “for loop” so the output of the result looks better.

Grouping Data From CSV File (Using Dataframes)

This time, I will use DataFrames instead of RDDs. DataFrames are distributed data collections organized into named columns (in a structured way). They are similar to tables in relational databases. They also provide a domain-specific language API to manipulate your distributed data, so it’s easier to use.

The Spark SQL module provides DataFrames, which are primarily used as API for Spark’s Machine Learning lib and structured streaming modules. Spark developers recommend using DataFrames instead of RDDs because the Catalyst (Spark Optimizer) will optimize your execution plan and generate better code to process the data.

I will use the “u.user” file of MovieLens 100K Dataset again. I will find the total number of men and women in the users data. I recommend you compare these codes with the previous ones (in which I used RDDs) to see the difference.

Here is the step-by-step explanation of the above script:

Lines 1-5,12) I already explained them in previous code blocks.
Line 7) I use DataFrameReader object of spark (spark.read) to load CSV data. As you can see, I don’t need to write a mapper to parse the CSV file.
Line 8) If the CSV file has headers, DataFrameReader can use them, but our sample CSV has no headers, so I give the column names.
Line 9) Instead of reduceByKey, I use the groupby method to group the data.
Line 10) I calculate the counts, add them to the grouped data, and show the method prints the output.

What if we want to group the users based on their occupations?

Here is the step-by-step explanation of the above script:

Lines 1-5,14) I already explained them in previous code blocks.
Line 7) I use DataFrameReader object of spark (spark.read) to load CSV data. As you can see, I don’t need to write a mapper to parse the CSV file.
Line 8) If the CSV file has headers, DataFrameReader can use them, but our sample CSV has no headers, so I give the column names.
Line 9) “Where” is an alias for the filter (but it sounds more SQL-ish. Therefore, I use it). I use the “where” method to select the rows whose occupation is not others.
Line 10) I group the users based on occupation.
Line 11) Count them, and sort the output ascending based on counts.
Line 12) I use the show to print the result

Please compare these scripts with RDD versions. You’ll see that using DataFrames is more straightforward, especially when analyzing data.

Spark SQL Module

Spark SQL Module provides DataFrames (and DataSets – but Python doesn’t support DataSets because it’s a dynamically typed language) to work with structured data. First, let’s start creating a temporary table from a CSV file and run a query on it. I will use the “u.user” file of MovieLens 100K Data (I save it as users.csv).

Here is the step-by-step explanation of the above script:

Lines 1-5,13) I already explained them in previous code blocks.
Line 7) I use DataFrameReader object of spark (spark.read) to load CSV data. As you can see, I don’t need to write a mapper to parse the CSV file.
Line 8) If the CSV file has headers, DataFrameReader can use them, but our sample CSV has no headers, so I give the column names.
Line 9) Using the “createOrReplaceTempView” method, I register my data as a temporary view.
Line 11) I run SQL to query my temporary view using Spark Sessions sql method. The result is a DataFrame, so I can use the show method to print the result.

When I check the tables with “show tables”, I see that the “users” table is temporary, so when our session(job) is done, the table will be gone. What if we want to store our data as persistent? If our Spark environment is configured to connect Hive, we can use the DataFrameWriter object’s “saveAsTable” method. We can also save the file as a parquet table, CSV file, or JSON file.

Here is the step-by-step explanation of the above script:

Lines 1-5,21) I already explained them in previous code blocks.
Line 7) I use DataFrameReader object of spark (spark.read) to load CSV data. The result will be stored in df (a DataFrame object)
Line 8) If the CSV file has headers, DataFrameReader can use them, but our sample CSV has no headers, so I give the column names.
Line 10) I use the saveAsTable method of DataFrameWriter (write property of a DataFrame) to save the data directly to Hive. The “mode” parameter lets me overwrite the table if it already exists.
Line 12) I save data as JSON files in the “users_json” directory.
Line 14) I save data as JSON parquet in the “users_parquet” directory.
Line 16) I save data as CSV files in the “users_csv” directory.
Line 18) Spark SQL’s direct read capabilities are incredible. You can directly run SQL queries on supported files (JSON, CSV, parquet). Because I selected a JSON file for my example, I did not need to name the columns. The column names are automatically generated from JSON files.

Spark SQL module also enables you to access various data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data from different data sources.

Discretized Streams (Dstreams)

Spark supports two different ways of streaming: Discretized Streams (DStreams) and Structured Streaming. DStreams is the basic abstraction in Spark Streaming. It is a continuous sequence of RDDs representing a stream of data. Structured Streaming is the newer way of streaming built on the Spark SQL engine.

When you search for any example scripts about DStreams, you find sample codes that read data from TCP sockets. So I decided to write a different one: My sample code will read from files in a directory. The script will check the directory every second and process the new CSV files it finds. Here’s the code:

Here is the step-by-step explanation of the above script:

Lines 1,4) I already explained this in previous code blocks.
Line 2) For DStreams, I import the StreamingContext library.
Line 5) I create a Streaming Context object. The second parameter indicated the interval (1 second) for processing streaming data.
Line 7) Using textFileStream, I set the source directory for streaming, and create a DStream object.
Line 8) This simple function parses the CSV file.
Line 10) This is the action command for the DStream object. pprint method writes the content.
Line 12) Starts the streaming process.
Line 14) Waits until the script is terminated manually.

At every second, the script will check “/tmp/stream” folder, if it finds a new file, it will process the file and write the output. For example, if we put a file that contains the following data in the folder:

Fatih,5
Cenk,4
Ahmet,3
Arda,1

The script will print:

-------------------------------------------
Time: 2023-02-16 13:31:53
-------------------------------------------
['Fatih', '5']
['Cenk', '4']
['Ahmet', '3']
['Arda', '1']

pprint is a perfect function to debug your code, but you probably want to store the streaming data to an external target (such as a Database or HDFS location). DStream object’s foreachRDD method can be used for it. Here’s another code to save the streaming data to JSON files:

Here is the step-by-step explanation of the above script:

Lines 1,5,6,19,21) I already explained them in previous code blocks.
Line 2) Because I’ll use DataFrames, I also import the SparkSession library.
Line 3) For DStreams, I import the StreamingContext library.
Line 7) I create a Streaming Context object. The second parameter indicated the interval (1 second) for processing streaming data.
Line 9) Using textFileStream, I set the source directory for streaming and created a DStream object.
Line 10) This simple function parses the CSV file.
Line 12) I define a function accepting an RDD as parameter.
Line 13) This function will be called every second – even if there’s no streaming data, so I check if the RDD is not empty
Line 14) Convert the RDD to a DataFrame with columns “name” and “score”.
Line 15) Write the data to the points_json folder as JSON files.
Line 17) Assign the saveresult function for processing streaming data

After storing all these data in JSON format, we can run a simple script to query data:

Structured Streaming

Structured Streaming is a stream processing engine built on the Spark SQL engine. It supports File and Kafka sources for production; Socket and Rate sources for testing. Here is a very simple example to demonstrate how structured streams work:

Here is the step-by-step explanation of the above script:

Lines 1-5) I already explained them in previous code blocks.
Line 7) I create a DataFrame to process streaming data.
Line 8) It will read CSV files in the path (/tmp/stream/), and the CSV files will contain the name (string) and points (int) data. By default, Structured Streaming from file-based sources requires you to specify the schema, rather than rely on Spark to infer it automatically.
Line 9) The data will be grouped based on the “name” column, and aggregate points.
Line 10) The data will be ordered based on points (descending)
Line 12) The output will be written to the console and the application will wait for termination.

For testing, I created 2 CSV files:

1.csv:

Fatih,5
Cenk,4
Ahmet,3
Arda,1

2.csv:

Fatih,1
Cenk,1
Ahmet,2
Osman,1
David,2

Then I started the script, and on another terminal, I copied the above files one by one to /tmp/stream/ directory (if you don’t have the directory, you should create it):

cp 1.csv /tmp/stream 
cp 2.csv /tmp/stream

Here is the output of the PySpark script:

-------------------------------------------                                     
Batch: 0
-------------------------------------------
+-----+-----------+
| name|sum(points)|
+-----+-----------+
|Fatih|          5|
| Cenk|          4|
|Ahmet|          3|
| Arda|          1|
+-----+-----------+

-------------------------------------------                                     
Batch: 1
-------------------------------------------
+-----+-----------+
| name|sum(points)|
+-----+-----------+
|Fatih|          6|
| Cenk|          5|
|Ahmet|          5|
|David|          2|
| Arda|          1|
|Osman|          1|
+-----+-----------+

Although I also talked about GraphFrames and Spark’s Machine Learning capabilities in my presentation, I will not include examples of them in this blog post. I hope this blog post will be helpful.

How To Find Storage Occupied by Each Internal Stage in Snowflake?

2022-11-18T00:00:00+00:00

The account usage view has two views related to stages: STAGES and STAGE_STORAGE_USAGE_HISTORY. The STAGES view helps list all the stages defined in your account but does not show how much storage each stage consumes. The STAGE_STORAGE_USAGE_HISTORY view shows the total usage of all stages but doesn’t show detailed use.

I wrote the following script to list the internal stages (and their occupied storage) in all available databases:

Here is the step-by-step explanation of the above script:

Line 1: Decleration of variables
Line 2: A resultset is a SQL data type that points to the result set of a query.
Line 3: A variant variable to hold the list of stages and sizes
Lines 5: Total_size will store the size of a stage
Lines 6: This query will convert the variant data to a table with two columns (stage_name and total_bytes)
Lines 7-10) Defining a cursor to list all internal stages
Line 11) Begin of the anonymous code block
Line 12) Initializing rpt variable as an empty array
Line 13) A loop for each record in cursor (each stage in the database)
Line 14) Begining of the block that we handle exceptions
Line 15) The name of the stage is assigned to the name variable
Line 16) Execute the LS command for the stage and store the result in the RES variable
Line 17) Open a new cursor to process the rows in the RES variable
Line 18) Total size is 0
Lines 19-21) The size of each file in the stage, is added to the total_size variable
Line 22) Add the stage name and the total_size as a new element to the RPT array
Lines 23-25) Exception handling section to return -1 as the size if we can’t access the stage
Line 26) End ot the block that we handle exceptions
Line 27) Repeat this process for each stage. The loops was stated on line 12
Line 28) Convert the array to a table with two columns (stage_name and total_bytes)
Line 29) Return the result as a table
Line 30) Ends the anonymous block

Amazon QLDB and the Missing Command Line Client

2019-09-12T00:00:00+00:00

Amazon Quantum Ledger Database is a fully managed ledger database that tracks all changes in user data and maintains a verifiable history of changes over time. It was announced at AWS re:Invent 2018 and is now available in five AWS regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo).

You may ask why you would like to use QLDB (a ledger database) instead of your traditional database solution. We all know that it’s possible to create history tables for our fact tables and keep them up to date using triggers, stored procedures, or even with our application code (by writing changes of the main table to its history table). You can also say that your database has write-ahead/redo logs, so it’s possible to track and verify the changes in all your data as long as you keep them in your archive. On the other hand, this will create an extra workload and complexity for the database administrator and the application developer. At the same time, it does not guarantee that the data was intact and reliable. What if your DBA directly modifies the data and history table after disabling the triggers and altering the archived logs? You may say it’s too hard, but you know that it’s technically possible. In a legal dispute or a security compliance investigation, this might be enough to question the integrity of the data.

QLDB solves this problem with a cryptographically verifiable journal. When an application needs to modify data in a document, the changes are logged into the journal files first (WAL concept). The difference is that each block is hashed (SHA-256) for “verification” and has a sequence number to specify its address within the journal. QLDB calculates this hash value using the journal block’s content and the previous block’s hash value. So the journal blocks are chained by the hash values! The QLDB users do not have access to the immutable logs. If someone modifies data, they also need to update the journal blocks related to the data. This will cause a new hash to be generated for the journal block, and all the following blocks will have a different hash value than before.

As a devil’s advocate, you may wonder, “what happens if my data is modified without my permission and all the journal blocks are regenerated with new hash values”. It’s a very unlikely situation, but what if it happens? Honestly, this was my first question when I heard about the chain mechanism between the journal blocks. QLDB lets you download (and store) the last generated hash value (the digest), and this digest can be used to verify all previously committed transactions. So you have the key (the digest) to verify the integrity of the data. If any change is made to your journals or your data, even a bite changes, the digest will be different and not match yours!

What else do you need to know about QLDB?

Journal-first: The system of record is the journal instead of the table storage.
It’s immutable. You can see all history of data (even deleted data) because nothing can be deleted from the journal.
Cryptographically verifiable: Hash-chaining provides data integrity.
Highly scalable: It’s serverless, and you don’t need to maintain the underlying structure or resources.
Easy to use: Supports PartiQL – SQL-compatible access to relational, semi-structured, and nested data.
Document based: All records are stored in Amazon Ion format.

If you create a ledger, you’ll see that you can access data in two ways: Write a Java application or use the query editor. Unfortunately, there’s no “data import” tool or a command line client for now. So I wrote a basic command line tool which you can download the JAR file and the sources from the GitHub repository:

It supports importing CSV files to the existing tables in your ledger. Of course, it’s just a sample application and not designed for production work. To be able to use it, you need to configure your AWS CLI and set the region where your QLDB ledger resides.

After that, you can run it with “java -jar qldbcli.jar -l LEDGERNAME” (My ledger name is Deneme):

java -jar qldbcli.jar -l Deneme
----------------------------------------------------------
QLDB "the missing" Command Line Client v0.2 by Gokhan Atil
----------------------------------------------------------
PartiQL (Deneme) > select * from Vehicle
VIN,Type,Year,Make,Model,Color
"3HGGK5G53FM761765","Motorcycle",2011,"Ducati","Monster 1200","Yellow"
"1HVBBAANXWH544237","Semi",2009,"Ford","F 150","Black"
"KM8SRDHF6EU074761","Sedan",2015,"Tesla","Model S","Blue"
"1C4RJFAG0FC625797","Sedan",2019,"Mercedes","CLK 350","White"
"1N4AL11D75C109151","Sedan",2011,"Audi","A5","Silver"

If you want to quit from the client, you can use the “quit” command, and if you’re going to connect to another ledger, you can use the “CONN LedgerName” command. I usually use it for exporting/importing sample data between my tables. You can export your ledger table as CSV. All you need is to run a SELECT query with the “-q” parameter and then redirect the output to a file:

java -jar qldbcli.jar -l Deneme -q "SELECT * FROM Vehicle" > Vehicle.csv

You can also import data from a CSV file to a table you created on the ledger (Please note that I created the NewVehicle table by running the “CREATE TABLE NewVehicle” PartiQL command before running the import). The application expects you to give the filename (with “-f” parameter) and the target table (with “-t” parameter):

java -jar qldbcli.jar -l Deneme -f Vehicle.csv -t NewVehicle

You can enable verbose mode to debug connection or the errors you get with PartiQL commands by the “-v” parameter:

java -jar qldbcli.jar -l Deneme -v

Please keep in your mind that the ledger, table, and field names are all case-sensitive. As I said, my sample application is not designed to use for production, so please do not expect to use it to import millions of records. On the other hand, if you examine the source codes, it might help you to write your own QLDB application.

Sample AWS Lambda Function to Monitor Oracle Databases

2019-09-04T00:00:00+00:00

I wrote a simple AWS Lambda function to demonstrate how to connect an Oracle database, gather the tablespace usage information, and send these metrics to CloudWatch. First, I wrote this lambda function in Python, and then I had to re-write it in Java. As you may know, you need to use the cx_oracle module to connect Oracle Databases with Python. This extension module requires some libraries shipped by Oracle Database Client (oh God!). It’s a little bit tricky to pack it for the AWS Lambda.

Here’s the main class which a Lambda function can use:

Here is the step-by-step explanation of the above script:

Line 15) I created a class named Monitoring – this will be the handler class. To be able to test it, it accepts JSON objects. This is why it implements “RequestHandler, String>”.
Line 16) Definition of the function required to handle requests
Line 18) Taken from the AWS documents. It’s the object which we use to push metrics to CloudWatch
Lines 20-24) Loading the JDBC class to be able to connect an Oracle database
Lines 26-29) Defining the connection and credentials. As you see, I fetch the required parameters from the environment because Lambda lets you define env variables
Lines 32-33) Connecting to the database
Line 35) Creating a statement object to run queries
Line 37) As a sample, I query dba_tablespace_usage_metrics to get the usage percentage for tablespaces
Line 38) Loop for each record
Lines 40-47) Preparing the metric I want to push to CloudWatch – The metric will be stored as “Space Used (pct) for each tablespace in the “Databases” namespace in the customer metrics
Line 49) Pushes the metric data
Lines 53-55) Mandatory exception handling block
Line 57) Returns a result (success)

You may notice a lot of “System.getenv” calls. I used it to read the environment parameters from the Lambda function.

You can download the Maven project from my GitHub repository. After you build the maven project, you need to upload the JAR file to your S3 bucket:

aws s3 cp target/samplelambda-0.1-jar-with-dependencies.jar s3://yourbucketname/

I used this JAR file to create a lambda function through the AWS Lambda console. My package name is “com.gokhanatil.samplelambda”, my class name is “Monitoring”, and the function name is “handleRequest”. This is why I entered “com.gokhanatil.samplelambda.Monitoring::handleRequest” as the handler. Please note that you need to give the required permissions to grant access to CloudWatch (e.g. CloudWatchFullAccess) and AWSLambdaVPCAccessExecutionRole. You also need to enter the Virtual Private Cloud (VPC) information (Subnets: pick two subnets that can reach your Oracle instance. Security groups: select one that gives you access to your Oracle instance).

To schedule this job to run periodically, you can use “CloudWatch events”. This event will be triggered every 15 minutes and launch the lambda function.

After enabling the lambda function, new metrics appear in the CloudWatch dashboard. It’s possible to create an alarm for these metrics using the console or AWS CLI commands.

How to Build A Cassandra Cluster On Docker?

2018-02-13T00:00:00+00:00

In this blog post, I’ll show how to build a three-node Cassandra cluster on Docker for testing. I’ll use official Cassandra images instead of creating my images, so all processes will take only a few minutes (depending on your network connection). I assume you have Docker installed on your PC, have an internet connection (I was born in 1976, so it’s normal for me to ask this kind of question), and have at least 8 GB RAM. First, we need to assign about 5 GB RAM to Docker (in case it has less RAM) because each node will require 1.5+ GB RAM to work properly.

Open the docker preferences, click the advanced tab, set the memory to 5 GB or more, and click “apply and restart” docker service. Launch a terminal window, and run the “docker pull cassandra” command to fetch the latest official Cassandra image.

I’ll use cas1, cas2, cas3 as the node names, and the name of my Cassandra cluster will be “MyCluster” (a very creative and unique name). I’ll also configure cas1 and cas2 like they are placed in datacenter1 and cas3 like it’s placed in datacenter2. So we’ll have three nodes, two of them in datacenter1 and one in datacenter2 (to test Cassandra’s multi-DC replication support). For multi-DC support, my Cassandra nodes will use “GossipingPropertyFileSnitch”. This extra information can be passed to docker containers using environment variables (with -e parameter).

Now it’s time to start the first node:

docker run --name cas1 -p 9042:9042 -e CASSANDRA_CLUSTER_NAME=MyCluster -e CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch -e CASSANDRA_DC=datacenter1 -d cassandra

The -p parameter is for publishing the container’s port to the host, so I could connect to the Cassandra service from the outside of the docker container (for example, using DataStax Studio or DevCenter). After the first node is up, I’ll add the cas2 and cas3 nodes, but I need to tell them the IP address of cas1, so they can use it as the seed node and join the cluster. We can find the IP address of cas1 by running the following command:

docker inspect --format='' cas1

I’ll add it to docker run command strings for cas2 and cas3:

docker run --name cas2 -e CASSANDRA_SEEDS="$(docker inspect --format='' cas1)" -e CASSANDRA_CLUSTER_NAME=MyCluster -e CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch -e CASSANDRA_DC=datacenter1 -d cassandra

docker run --name cas3 -e CASSANDRA_SEEDS="$(docker inspect --format='' cas1)" -e CASSANDRA_CLUSTER_NAME=MyCluster -e CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch -e CASSANDRA_DC=datacenter2 -d cassandra

I gave a different datacenter name (datacenter2) while creating the cas3 node. Run them one by one, give time to the new nodes to join the cluster, and then run the “nodetool status” command from cas1 (or any other node):

docker exec -ti cas1 nodetool status

The above code, connects to cas1 node and runs the “nodetool status” command. If everything went fine, you should see something similar to the below output.

The status column of each node should show UN (node is UP and its state is Normal). If you see “UJ” that means your node is joining, just wait a while and recheck it. If your new nodes didn’t appear in the list, they probably crashed before joining the cluster. In this case, you may restart the missing nodes. For example, if cas3 (the last node) didn’t join the cluster and it’s down, you can run the “docker start cas3” command to start it. It’ll try to join the cluster automatically.

Now Let’s create a keyspace (database) that will be replicated to datacenter1 and datacenter2 and a table in this newly created keyspace. I’ll use NetworkTopologyStrategy for replicating data. Each datacenter will store one copy of data. Here is the CQL (Cassandra query language) commands to create the keyspace and table:

CREATE KEYSPACE mykeyspace
WITH replication = {
	'class' : 'NetworkTopologyStrategy',
	'datacenter1' : 1,
	'datacenter2' : 1
};

CREATE TABLE mykeyspace.mytable (
	id int primary key,
	name text
);

We can execute these commands using cqlsh by connecting one of our nodes:

docker exec -ti cas1 cqlsh

Or we can execute them using a client program such as DevCenter (you need to register to the DataStax website to be able to download it). I tried to find a stable GUI for Cassandra, and DevCenter looks fine to me:

After we created the keyspace, we can run “nodetool status” to check the data distribution:

docker exec -ti cas1 nodetool status mykeyspace

As you can see, I gave the name of the keyspace as a parameter to nodetool, so it will show the distribution of our newly created keyspace.

Did you notice that the nodes at datacenter1 share data almost evenly, while the node at datacenter2 has a replication of all data? Remember the replication strategy of our keyspace: Each datacenter stores one copy. Because there are two nodes in datacenter1, the data will be evenly distributed between these two nodes.

You can shut down nodes using “docker stop cas1 cas2 cas3” and start them again with “docker start cas1 cas2 cas3”. So, we have a working Cassandra cluster that is deployed to multiple data centers.

Oracle Enterprise Manager Cloud Control: Write Powerful Scripts With EMCLI

2016-09-25T00:00:00+00:00

Last week, I attended the Oracle Open World and gave a presentation about writing scripts with EMCLI. If you’re unfamiliar with EMCLI, it’s the command line interface for Oracle Enterprise Manager Cloud Control. Here’s my presentation:

Although EMCLI is a very specific topic that appeals only to advanced users, many people attended my session. I want to thank Ray Smith (IOUG Director of Education) for his support. He did his best to inform people about my session.

If you attended my session, or if you have just seen the presentation slides, and have questions about EMCLI scripting, please do not hesitate to ask me.

How To Recover The Weblogic Administrator Password Of The Enterprise Manager?

2015-03-31T00:00:00+00:00

As you know, Weblogic is a part of the Enterprise Manager Cloud Control environment, and it’s automatically installed and configured by the EM installer. The Enterprise Manager asks you to enter a username and password for Weblogic administration. This information is stored in secure files; you usually do not need them unless you use the Weblogic console. So it’s easy to forget this username and password, and that’s what happened to me. Fortunately, there’s a way to recover them without resetting a new user/password. Here are the steps:

First, we need to know the DOMAIN_HOME directory. My OMS is located in “/u02/Middleware/oms”. You can find yours if you read “/etc/oragchomelist”. If the full path of OMS is “/u02/Middleware/oms”, the middleware home is “/u02/Middleware/”. Under my middleware home, I need to go GCDomains folder:

oracle@db-cloud /$ cd /u02/Middleware
oracle@db-cloud Middleware$ cd gc_inst/user_projects/domains/GCDomain

Then we get the encrypted information from boot.properties file:

oracle@db-cloud GCDomain$ cat servers/EMGC_ADMINSERVER/security/boot.properties

# Generated by Configuration Wizard on Wed Jun 04 10:22:47 EEST 2014
username={AES}nPuZvKIMjH4Ot2ZiiaSVT/RKbyBA6QITJE6ox56dHvk=
password={AES}krCf4h1du93tJOQcUg0QSoKamuNYYuGcAao1tFvHxzc=

The encrypted information starts with {AES} and ends with an equal (=) sign. To decrypt the username and password, we will create a simple Java application:

public class recoverpassword {
 public static void main(String[] args)
 {
  System.out.println(
  new weblogic.security.internal.encryption.ClearOrEncryptedService(
  weblogic.security.internal.SerializedSystemIni.getEncryptionService(args[0]
   )).decrypt(args[1]));
  }
}

Save it as “recoverpassword.java”. To compile (and run) it, we need to set environment variables (we’re still in the GCDomain folder). We’ll give the encrypted part as the last parameter:

oracle@db-cloud GCDomain$ . bin/setDomainEnv.sh
oracle@db-cloud GCDomain$ javac recoverpassword.java
oracle@db-cloud GCDomain$ java -cp $CLASSPATH:. recoverpassword $DOMAIN_HOME {AES}nPuZvKIMjH4Ot2ZiiaSVT/RKbyBA6QITJE6ox56dHvk=
oracle@db-cloud GCDomain$ java -cp $CLASSPATH:. recoverpassword $DOMAIN_HOME {AES}krCf4h1du93tJOQcUg0QSoKamuNYYuGcAao1tFvHxzc=

Correct CLASSPATH and DOMAIN_NAME are set when we issued the “setDomainEnv.sh” command. When we run the last two commands, we should see the WebLogic administrator username and password in plain text. By the way, WebLogic uses the cipher key stored in “security/SerializedSystemIni.dat” file when encrypting and decrypting. So even if you use the same password as me, you can see a different encrypted text.