<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Oleh’s Substack]]></title><description><![CDATA[My personal Substack]]></description><link>https://olehbezhenar.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!W6nz!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Folehbezhenar.substack.com%2Fimg%2Fsubstack.png</url><title>Oleh’s Substack</title><link>https://olehbezhenar.substack.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 09 Apr 2026 02:23:16 GMT</lastBuildDate><atom:link href="https://olehbezhenar.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Oleh Bezhenar]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[olehbezhenar@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[olehbezhenar@substack.com]]></itunes:email><itunes:name><![CDATA[Oleh Bezhenar]]></itunes:name></itunes:owner><itunes:author><![CDATA[Oleh Bezhenar]]></itunes:author><googleplay:owner><![CDATA[olehbezhenar@substack.com]]></googleplay:owner><googleplay:email><![CDATA[olehbezhenar@substack.com]]></googleplay:email><googleplay:author><![CDATA[Oleh Bezhenar]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Data Pipeline Patterns]]></title><description><![CDATA[When you want to prepare application data for analysis, you need to create a process called a data pipeline.]]></description><link>https://olehbezhenar.substack.com/p/data-pipeline-patterns</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/data-pipeline-patterns</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Sat, 07 Oct 2023 06:06:42 
GMT</pubDate><content:encoded><![CDATA[<p>When you want to prepare application data for analysis, you need to create a process called a data pipeline. This process collects, prepares, transforms, and transfers the data from an application to a data storage place like a data lake. It's important to think carefully about the requirements for this data pipeline, because each pattern comes with its own trade-offs.</p><p>There are generally three patterns for data pipelines:</p><ol><li><p>Extract Transform Load (ETL): In this pattern, data is first collected and filtered if necessary. Then, it's aggregated and processed before being stored in a data warehouse. ETL works well when data consistency is crucial, such as with historical data. However, it has downsides such as slower processing, added complexity, and limited scalability. Popular tools for this pattern include Apache Spark, AWS Glue, and Azure Data Factory.</p></li><li><p>Extract Load Transform (ELT): Similar to ETL, data is first collected, but instead of processing it right away, the raw data is stored first. Then, it's transformed within the data warehouse. ELT is suitable for situations that need more flexibility or when the data isn't fully structured. However, it requires a data warehouse with robust transformation capabilities, which adds to management efforts. Most popular solutions support this pattern, except AWS Glue.</p></li><li><p>Extract Transform Load Transform (ETLT): This approach is a hybrid, aiming to balance ETL's consistency and ELT's flexibility. Data is partially pre-processed, then stored, and finally transformed again into its desired format. While it offers some consistency and speed benefits, it demands more planning and effort during the design stage. 
It's useful for scenarios requiring complex data transformations.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[TOML]]></title><description><![CDATA[YAML, JSON, INI, and TOML are popular choices for configuration files, and they are replacing XML for good.]]></description><link>https://olehbezhenar.substack.com/p/toml</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/toml</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Fri, 06 Oct 2023 02:45:01 GMT</pubDate><content:encoded><![CDATA[<p>YAML, JSON, INI, and TOML are popular choices for configuration files, and they are replacing XML for good. While the first three are familiar and straightforward, TOML (Tom's Obvious, Minimal Language) is an interesting choice.</p><p>The official description of TOML says:</p><blockquote><p>TOML aims to be a minimal configuration file format that's easy to read due to obvious semantics. TOML is designed to map unambiguously to a hash table. TOML should be easy to parse into data structures in a wide variety of languages.</p></blockquote><p>TOML lets you create objects, which are called tables here, using a simple syntax that avoids the need for nesting objects. 
For example:</p><pre><code><code>[parent-object]
field1 = "the value"

[parent-object.child-object]
field2 = "another value"</code></code></pre><p>The same goes for arrays of objects:</p><pre><code><code>[[user]]
id = "1"
name = "user1"

[[user]]
id = "2"
name = "user2"</code></code></pre><p>Another interesting feature is its support for date and time:</p><pre><code><code>ldt1 = 1979-05-27T07:32:00
ldt2 = 1979-05-27T00:32:00.999999</code></code></pre><p>In terms of file size, TOML generally falls between JSON and YAML.</p>]]></content:encoded></item><item><title><![CDATA[Personal Productivity System]]></title><description><![CDATA[Organizing your life isn't a one-size-fits-all solution.]]></description><link>https://olehbezhenar.substack.com/p/personal-productivity-system</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/personal-productivity-system</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Thu, 05 Oct 2023 03:34:25 GMT</pubDate><content:encoded><![CDATA[<p>Organizing your life isn't a one-size-fits-all solution. Everyone has their own journey to find what works for them. During my journey, I've tried different methods. Some didn't work, and some did. Here's what didn't work for me:</p><p>Methods That Didn't Work:</p><ol><li><p><strong>Kanban/Scrum for Personal Use:</strong></p><ul><li><p>These methods are great for projects but require a lot of effort when you have various projects, chores, and daily tasks.</p></li><li><p>When the board gets filled with too many tasks, it becomes overwhelming. 
Tools like Trello didn't help me manage my tasks effectively.</p></li></ul></li><li><p><strong>GTD (Getting Things Done):</strong></p><ul><li><p>GTD works well for structured projects, but it started to break down for me over time.</p></li><li><p>The "Someday/Maybe" category became a mess with too many projects and tasks without clear start dates.</p></li></ul></li><li><p><strong>Bullet Journaling:</strong></p><ul><li><p>While I liked the idea of a paper-based system, it didn't work for me because many of my projects are digital.</p></li></ul></li></ol><p>What I'm Looking For:</p><p>Through these experiences, I've figured out what I need in an organizational system:</p><ul><li><p><strong>Quick Capture:</strong>&nbsp;I want a system that lets me record ideas quickly without delay.</p></li><li><p><strong>Inbox:</strong>&nbsp;During idea capture, I don't want to worry about when or where to do something. I need a simple inbox to collect my ideas.</p></li><li><p><strong>Projects and Tags:</strong>&nbsp;Projects are great for grouping tasks by goals, while tags help categorize tasks by context for better planning.</p></li><li><p><strong>Scalability:</strong>&nbsp;The system should be easy for everyday tasks but also flexible for bigger projects.</p></li><li><p><strong>Centralization:</strong>&nbsp;Using multiple tools for personal organization is a hassle. I want one simple system to stick with.</p></li><li><p><strong>Resilience:</strong>&nbsp;Sometimes, I step away from my system, and when I return, I want to pick up where I left off, not start over.</p></li></ul><p>What Works for Me:</p><p>Currently, I'm using a simple to-do app like Todoist or TickTick. It supports projects and tags. I don't use a "someday/maybe" folder; instead, I schedule all tasks I'm actively working on and keep the rest under an "Unscheduled" filter. I have a special task type called "milestones" to remind me of my overall direction, and I can adjust priorities as needed. 
For prioritization, I use the Eisenhower matrix. I find the "Quick Capture" feature in my to-do app very helpful. I also experiment with retrospectives and use project comments to store project-specific notes.</p>]]></content:encoded></item><item><title><![CDATA[Reflecting On 30 Days Of Daily Posting]]></title><description><![CDATA[30 days ago, I started an experiment where I committed to posting daily notes on topics I'm interested in.]]></description><link>https://olehbezhenar.substack.com/p/reflecting-on-30-days-of-daily-posting</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/reflecting-on-30-days-of-daily-posting</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Wed, 04 Oct 2023 04:44:27 GMT</pubDate><content:encoded><![CDATA[<p>30 days ago, I started an experiment where I committed to posting daily notes on topics I'm interested in. Here's a brief look back on this period:</p><ul><li><p>It encouraged me to research areas that interest me.</p></li><li><p>It motivated me to stay updated on industry trends.</p></li><li><p>It pushed me to dive deeper into each topic and double-check the facts I thought I knew.</p></li><li><p>Sometimes, it's disappointing to discard a note halfway through when I realize I don't like it.</p></li><li><p>Other times, it's discouraging to admit my lack of expertise in a specific topic, but I'm trying to post anyway, as it's the area I'm currently investigating.</p></li></ul><p>Overall, I've decided to continue this experiment for at least three more months, and I'll share another update then. Thank you guys for your support.</p>]]></content:encoded></item><item><title><![CDATA[GPT Function Calling]]></title><description><![CDATA[I've been exploring the GPT API, and one of its cool features is called Function Calling. 
Basically, this lets you describe your own functions to the model using a JSON Schema; the model can then respond with a JSON object naming a function to call and the arguments to pass.]]></description><link>https://olehbezhenar.substack.com/p/gpt-function-calling</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/gpt-function-calling</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Tue, 03 Oct 2023 03:24:58 GMT</pubDate><content:encoded><![CDATA[<p>I've been exploring the GPT API, and one of its cool features is called&nbsp;<a href="https://platform.openai.com/docs/guides/gpt/function-calling">Function Calling</a>. Basically, this lets you describe your own functions to the model using a JSON Schema; the model can then respond with a JSON object naming a function to call and the arguments to pass.</p><p>You can do some interesting things with this feature. For example, you can make a program that can understand and run code that you provide. It's similar to how a Code Interpreter works.</p><p>You can also use this to make your chatbot work with other programs. This means you can create a chat-based interface for your app. And if you combine it with a code interpreter, the API can even create code that works with your existing software.</p>]]></content:encoded></item><item><title><![CDATA[Customer Segmentation]]></title><description><![CDATA[Customer segmentation is the process of categorizing a company's customers into groups based on common characteristics.]]></description><link>https://olehbezhenar.substack.com/p/customer-segmentation</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/customer-segmentation</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Mon, 02 Oct 2023 02:18:33 GMT</pubDate><content:encoded><![CDATA[<blockquote><p>Customer segmentation is the process of categorizing a company's customers into groups based on common characteristics. 
This enables companies to effectively and appropriately tailor their marketing efforts to each group.</p></blockquote><p>There are four primary types of customer segmentation:</p><ol><li><p><strong>Demographic segmentation:</strong>&nbsp;This method involves dividing customers into groups based on shared characteristics such as age, gender, income, occupation, education level, marital status, and location.</p></li><li><p><strong>Psychographic segmentation:</strong>&nbsp;In this approach, customers are grouped based on their lifestyle, interests, values, and attitudes.</p></li><li><p><strong>Behavioral segmentation:</strong>&nbsp;This method classifies customers into different groups based on their purchase history, usage patterns, brand loyalty, and responses to marketing campaigns.</p></li><li><p><strong>Geographic segmentation:</strong>&nbsp;Here, customers are divided into groups based on their location, which can include country, region, city, or neighborhood.</p></li></ol><p>Customer segmentation offers various benefits, including optimizing your marketing strategy and defining specific marketing channels that target each segment. 
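</p><p>As a toy illustration, behavioral segmentation can be as simple as bucketing customers by purchase frequency. Here is a minimal Python sketch (the bucket names and thresholds are made-up assumptions, not an industry standard):</p>

```python
# Hypothetical rule-based behavioral segmentation by purchase count.
# Bucket names and thresholds are illustrative assumptions only.

def segment(purchases_last_year: int) -> str:
    """Bucket a customer by how often they purchased in the last year."""
    if purchases_last_year == 0:
        return "dormant"
    if purchases_last_year < 5:
        return "occasional"
    if purchases_last_year < 20:
        return "regular"
    return "loyal"

customers = {"anna": 0, "ben": 3, "carol": 12, "dmytro": 40}
segments = {name: segment(n) for name, n in customers.items()}
```

<p>In practice, the thresholds would come from your own purchase-history data rather than being hard-coded.</p><p>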
It also helps identify ways to improve products tailored to specific segments and even test various pricing options.</p><p>Segmentation can be carried out in various ways, such as through surveys, cold calls, collecting membership data, insights from customer support interactions, purchase history analysis, online analytics, and machine learning.</p><p>Here are some application examples:</p><ul><li><p>Implementing different pricing strategies for students.</p></li><li><p>Offering family discounts.</p></li><li><p>Conducting age and gender-specific marketing campaigns (Netflix serves as a great example).</p></li><li><p>Developing distinct products for various cultural groups.</p></li><li><p>Sending customized messages based on how customers discovered a service.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[LLM as a Standard Text Interface]]></title><description><![CDATA[The Portable Operating System Interface (POSIX) is a family of standards that defines a set of APIs (Application Programming Interfaces) and conventions for building and interacting with operating systems.]]></description><link>https://olehbezhenar.substack.com/p/llm-as-a-standard-text-interface</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/llm-as-a-standard-text-interface</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Sun, 01 Oct 2023 04:40:08 GMT</pubDate><content:encoded><![CDATA[<blockquote><p>The Portable Operating System Interface (POSIX) is a family of standards that defines a set of APIs (Application Programming Interfaces) and conventions for building and interacting with operating systems.</p></blockquote><p>POSIX is designed to enhance the portability of applications. Essentially, this standard defines what a Unix-like operating system is. 
Among various components, such as Error Codes and Inter-Process Communication standards, it includes a list of utilities that are familiar to us, such as&nbsp;<code>cd</code>,&nbsp;<code>ls</code>,&nbsp;<code>mkdir</code>, and many more. These utilities have shaped how people interact with operating systems using text for decades.</p><p>It appears that we are witnessing a resurgence of text-based interfaces in the form of LLMs. Technologies like ChatGPT plugins, Microsoft 365 Copilot, and the recently updated Bard indicate that LLMs might serve as text-based interfaces for a range of services and applications. I'm wondering if we will eventually establish a set of standards to define the interaction between LLMs and extensions, similar to how POSIX standardized Unix-like systems in its time.</p><p>Several factors could contribute to the emergence of such standards. Here are a few:</p><p><strong>1. User Demands:</strong>&nbsp;In a competitive market with multiple chat-based services that support third-party plugins, having a set of standards would enable compatibility across platforms or easy switching between them.</p><p><strong>2. Technology Maturity:</strong>&nbsp;As these interfaces become more mature and their applications span various domains, standardization may naturally evolve. The absence of disruptive changes and widespread usage can lead to the establishment of these standards.</p>]]></content:encoded></item><item><title><![CDATA[Apple and AI]]></title><description><![CDATA[In its latest event, Apple didn't mention AI even once, in contrast to its closest competitor, Google.]]></description><link>https://olehbezhenar.substack.com/p/apple-and-ai</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/apple-and-ai</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Sat, 30 Sep 2023 02:48:31 GMT</pubDate><content:encoded><![CDATA[<p>In its latest event, Apple didn't mention AI even once, in contrast to its closest competitor, Google. 
While Apple refrained from making any AI announcements, the new generations of iPhone and Apple Watch do boast more powerful Neural Engines. I believe Apple will take a different approach from what we currently observe in the market. Instead of further enhancing their existing cloud-based Siri experience, they will shift towards on-device processing. This strategy aligns with their strong stance on security and privacy, as we've already seen them testing on-device Siri processing with the Apple Watch. I'm curious about what they could offer with offline AI and have a few thoughts:</p><ol><li><p>Context-aware on-device search: Imagine being able to search across all types of files, including images, documents, and videos, and retrieve information in any format simply by asking Siri.</p></li><li><p>Context-aware writing assistance: With training based on your email history, typing suggestions could become context-aware, offering email responses that align with your ongoing conversations.</p></li><li><p>Deeper integration with other applications: It would be fascinating to enable any app to leverage an API that creates a "skill" for the local assistant, much like how you can extend ChatGPT with extensions. 
This could potentially open up new niches for apps centered around AI interaction.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[Homeostasis]]></title><description><![CDATA[Homeostasis is a self-regulating process that enables biological systems to maintain stability while adapting to changing environmental conditions.]]></description><link>https://olehbezhenar.substack.com/p/homeostasis</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/homeostasis</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Fri, 29 Sep 2023 05:15:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dWxu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1536d75a-e4ea-4824-9539-eabbe2c1709e_726x470.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em><strong>Homeostasis is a self-regulating process that enables biological systems to maintain stability while adapting to changing environmental conditions. It describes how an organism can keep its internal environment relatively constant, allowing it to adapt and survive in a frequently challenging environment.</strong></em></p></blockquote><p>Homeostasis consists of several key components:</p><ol><li><p><strong>Receptor:</strong>&nbsp;As the name suggests, receptors detect changes in the external or internal surroundings. 
They initiate a cascade of reactions to uphold homeostasis.</p></li><li><p><strong>Control Center:</strong>&nbsp;Also referred to as the integration center, the control center receives information from the receptors and processes it.</p></li><li><p><strong>Effector:</strong>&nbsp;Effectors respond according to the instructions received from the control center, either reducing or enhancing the stimulus as needed.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dWxu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1536d75a-e4ea-4824-9539-eabbe2c1709e_726x470.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dWxu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1536d75a-e4ea-4824-9539-eabbe2c1709e_726x470.png 424w, https://substackcdn.com/image/fetch/$s_!dWxu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1536d75a-e4ea-4824-9539-eabbe2c1709e_726x470.png 848w, https://substackcdn.com/image/fetch/$s_!dWxu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1536d75a-e4ea-4824-9539-eabbe2c1709e_726x470.png 1272w, https://substackcdn.com/image/fetch/$s_!dWxu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1536d75a-e4ea-4824-9539-eabbe2c1709e_726x470.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!dWxu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1536d75a-e4ea-4824-9539-eabbe2c1709e_726x470.png" width="726" height="470" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1536d75a-e4ea-4824-9539-eabbe2c1709e_726x470.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:470,&quot;width&quot;:726,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Homeostasis Diagram&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Homeostasis Diagram" title="Homeostasis Diagram" srcset="https://substackcdn.com/image/fetch/$s_!dWxu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1536d75a-e4ea-4824-9539-eabbe2c1709e_726x470.png 424w, https://substackcdn.com/image/fetch/$s_!dWxu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1536d75a-e4ea-4824-9539-eabbe2c1709e_726x470.png 848w, https://substackcdn.com/image/fetch/$s_!dWxu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1536d75a-e4ea-4824-9539-eabbe2c1709e_726x470.png 1272w, https://substackcdn.com/image/fetch/$s_!dWxu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1536d75a-e4ea-4824-9539-eabbe2c1709e_726x470.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The concept of homeostasis finds widespread application in software engineering across various domains and industries. 
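</p><p>The receptor, control center, and effector map naturally onto a software control loop. Here is a toy sketch in Python (the set point, tolerance, and proportional correction are illustrative assumptions, not from a real system):</p>

```python
# Toy homeostasis loop in the style of a thermostat: a receptor reads a value,
# the control center compares it to a set point, and the effector applies a
# correction until the reading is back inside the stable range.

def control_step(reading: float, set_point: float, tolerance: float) -> float:
    """Return the correction the effector should apply (0.0 if within tolerance)."""
    error = set_point - reading
    if abs(error) <= tolerance:
        return 0.0          # already in the stable range; do nothing
    return 0.5 * error      # proportional correction toward the set point

# One simulated run: the value starts off target and is pulled back toward 37.0.
temp = 33.0
for _ in range(10):
    temp += control_step(temp, set_point=37.0, tolerance=0.5)
```

<p>After a few iterations the reading settles inside the tolerance band, which is exactly the "relatively constant internal environment" from the definition above.</p><p>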
Here are some notable examples:</p><ul><li><p><strong>Configuration as Code:</strong>&nbsp;Technologies like Kubernetes, Terraform, and CloudFormation adopt an approach where users declare the desired system state, and the system autonomously determines how to achieve and maintain it.</p></li><li><p><strong>Elasticity:</strong>&nbsp;Systems can dynamically scale up or down in response to workload fluctuations, ensuring they can efficiently perform their tasks.</p></li><li><p><strong>Self-Healing:</strong>&nbsp;Container orchestrators such as Kubernetes attempt to restart a malfunctioning service if it stops responding to health checks or exhibits unusual behavior.</p></li></ul><p>Overall, the concept of homeostasis closely aligns with the idea of a desired state and a declarative approach to programming. A straightforward and widely used example is markup languages, where developers specify the desired page state, and the browser is responsible for rendering it as closely as possible to that desired state.</p>]]></content:encoded></item><item><title><![CDATA[Quality Gates]]></title><description><![CDATA[A quality gate is a critical checkpoint in the software development lifecycle that assesses whether software meets specific criteria.]]></description><link>https://olehbezhenar.substack.com/p/quality-gates</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/quality-gates</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Thu, 28 Sep 2023 04:38:10 GMT</pubDate><content:encoded><![CDATA[<p>A quality gate is a critical checkpoint in the software development lifecycle that assesses whether software meets specific criteria. Its primary goal is to identify and fix as many issues as possible before releasing the software. 
Quality gates may include, but are not limited to, the following checks:</p><ul><li><p><strong>Build:</strong>&nbsp;Checking if the software builds and compiles without any errors.</p></li><li><p><strong>Linting:</strong>&nbsp;Ensuring that the codebase adheres to accepted best practices.</p></li><li><p><strong>Tests:</strong>&nbsp;Including both functional tests and coverage reports.</p></li></ul><p>Typical locations for implementing quality gates are:</p><ul><li><p><strong>Local Environment:</strong>&nbsp;Usually implemented with pre-commit hooks, this allows for early issue detection during code commit. Among other checks, it's an excellent place to enforce code style using tools like Prettier and to validate branch naming conventions.</p></li><li><p><strong>PR Validation:</strong>&nbsp;These checks duplicate those in the local environment, in case a developer skips pre-commit hooks using the&nbsp;<code>--no-verify</code>&nbsp;option. They also add PR-specific validations. For example, Azure DevOps can check whether an associated work item exists or whether a description was provided for the PR.</p></li><li><p><strong>Main Branch Actions:</strong>&nbsp;This is the best place to run extensive integration and automation tests in addition to the previous checks. 
It ensures that the software continues to meet quality standards after merging into the main branch.</p></li></ul><p>This setup works exceptionally well with temporary teams, such as contractors or outsourcers, to ensure that the codebase complies with defined standards.</p>]]></content:encoded></item><item><title><![CDATA[Model First VS Query First]]></title><description><![CDATA[SQL is a good example of an abstraction that works in most cases (I assume the 80/20 rule is applicable here).]]></description><link>https://olehbezhenar.substack.com/p/model-first-vs-query-first</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/model-first-vs-query-first</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Wed, 27 Sep 2023 04:40:14 GMT</pubDate><content:encoded><![CDATA[<p>SQL is a good example of an abstraction that works in most cases (I assume the 80/20 rule is applicable here). But, like most abstractions, it cracks under pressure, and instead of writing readable, well-structured queries, developers find themselves writing dynamic SQL, tweaking indices, and investigating execution plans.</p><p>I think query-first data modelling, as used in Apache Cassandra, is more transparent compared to model-first, used in SQL:</p><ul><li><p>It doesn't try to hide the physical nature of the query and insists on picking a good index beforehand, which lets it keep performing under huge workloads.</p></li><li><p>It doesn't presume complete data integrity, and thus techniques like partitioning don't seem alien. 
CQL, for instance, insists on picking a good partition key when modelling your data, presuming partitioning from the beginning.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Complexity Analysis]]></title><description><![CDATA[I didn't realize that there are so many ways to do complexity analysis in software engineering.]]></description><link>https://olehbezhenar.substack.com/p/complexity-analysis</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/complexity-analysis</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Tue, 26 Sep 2023 04:46:20 GMT</pubDate><content:encoded><![CDATA[<p>I didn't realize that there are so many ways to do complexity analysis in software engineering. Some of them:</p><ul><li><p><strong>Story Points</strong>: This is a classic one. The team decides on a minimum point and then compares the complexity of other tasks to this minimum. Usually, they use the Fibonacci Sequence and Planning Poker for this.</p></li><li><p><strong>T-Shirt Sizes</strong>: This method uses a set of predefined sizes like XS, S, M, L, XL, etc., to estimate the complexity.</p></li><li><p><strong>Ideal Days</strong>: This is straightforward. The team estimates how many ideal workdays they need to finish a task.</p></li><li><p><strong>Function Point Analysis (FPA)</strong>: This feels more academic. It considers things like the number of external outputs and internal interfaces. You can learn more about it&nbsp;<a href="https://www.geeksforgeeks.org/software-engineering-functional-point-fp-analysis/">here</a>.</p></li><li><p><strong>The Matrix Method</strong>: This is a visual method. 
It uses time on the X-axis and complexity on the Y-axis.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Explicit Error Handling]]></title><description><![CDATA[When I first started using Go, it took me some time to become familiar with its error handling approach.]]></description><link>https://olehbezhenar.substack.com/p/explicit-error-handling</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/explicit-error-handling</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Mon, 25 Sep 2023 04:31:50 GMT</pubDate><content:encoded><![CDATA[<p>When I first started using Go, it took me some time to become familiar with its error handling approach. In comparison to the traditional control flow approach, where exceptions are handled in a separate block (try/catch), modern languages like Go and Rust use a different approach called explicit error handling. In this approach, the error is one of the return values, and the developer is expected to check and handle it right away.</p><p>For example, in Go:</p><pre><code><code>value, err := someFunction()
if err != nil {
    // handle the error
}
</code></code></pre><p>In Rust, you would use&nbsp;<code>Result&lt;T, E&gt;</code>, which is an enum with two variants:&nbsp;<code>Ok(value)</code>&nbsp;and&nbsp;<code>Err(value)</code>:</p><pre><code><code>match some_function() {
    Ok(value) =&gt; {
        // use the value
    },
    Err(e) =&gt; {
        // handle the error
    },
}
</code></code></pre><p>Both languages also have a similar concept called a "panic", which represents unrecoverable errors that should interrupt the execution.</p><p>Explicit error handling arguably helps developers write more readable error handling code because the handling sits right at the place where the error occurs, as opposed to the control flow approach, where the&nbsp;<code>catch</code>&nbsp;block might be in a separate function or buried deep within many other function calls that might produce the error, or even worse, hidden by a generic exception.</p>]]></content:encoded></item><item><title><![CDATA[Red Ocean Blue Ocean]]></title><description><![CDATA["Red Ocean, Blue Ocean" is a concept from the business strategy book "Blue Ocean Strategy" by W.]]></description><link>https://olehbezhenar.substack.com/p/red-ocean-blue-ocean</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/red-ocean-blue-ocean</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Sun, 24 Sep 2023 05:47:23 GMT</pubDate><content:encoded><![CDATA[<p>"Red Ocean, Blue Ocean" is a concept from the business strategy book "Blue Ocean Strategy" by W. Chan Kim and Ren&#233;e Mauborgne that aims to help companies grow under different market conditions and adjust their actions according to the market's "temperature":</p><ul><li><p><strong>Red Ocean</strong>: an already established market with competitors. How to take advantage?</p><ul><li><p>Better cost;</p></li><li><p>Better quality;</p></li><li><p>Focus on a specific niche;</p></li><li><p>Better branding;</p></li><li><p>Relationships;</p></li></ul></li><li><p><strong>Blue Ocean</strong>: an unknown market space, where demand is created rather than fought over. How to win:</p><ul><li><p>Innovate</p></li><li><p>Create new demand</p></li><li><p>Attract new customers. The Blue Ocean almost always looks more appealing due to the lack of competitors; however, you have to be a visionary to see it. 
While the Red Ocean may seem rough, knowing the rules and the market might help you secure your share.</p></li></ul></li></ul>]]></content:encoded></item><item><title><![CDATA[The Paradox Of Choice]]></title><description><![CDATA[In his book "The Paradox of Choice," Barry Schwartz mentioned that having too many choices could lead to less satisfaction and greater regret.]]></description><link>https://olehbezhenar.substack.com/p/the-paradox-of-choice</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/the-paradox-of-choice</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Sat, 23 Sep 2023 04:14:35 GMT</pubDate><content:encoded><![CDATA[<p>In his book "The Paradox of Choice," Barry Schwartz mentioned that having too many choices could lead to less satisfaction and greater regret. The paradox is related to the following effects:</p><ul><li><p><strong>Choice Overload:</strong> Studies <strong><a href="https://www.nature.com/articles/s41562-018-0440-2">have shown</a></strong> that once a certain threshold is reached, interest decreases. This is particularly evident in retail, where stores know how many different brands are enough to keep customers interested, but not so many as to overwhelm them.</p></li><li><p><strong>Escalation of Expectations:</strong> The more choices you have, the more you tend to believe that there must be "the best one" among them.</p></li><li><p><strong>Regret and Opportunity Costs:</strong> This is closely tied to the previous point; people tend to experience more regret when they have to choose among many options.</p></li></ul><p>How to cope with the Paradox of Choice? Schwartz divides decision-makers into "maximizers" - those who constantly seek the best possible option - and "satisficers" - those who are content with a good enough option. 
By proactively limiting the number of choices to those that have been proven to be "good enough," you can reduce decision anxiety for a significant portion of consumer choices.</p>]]></content:encoded></item><item><title><![CDATA[WAL and SQLite]]></title><description><![CDATA[PocketBase, an open-source backend, made an interesting choice for its persistent storage: SQLite.]]></description><link>https://olehbezhenar.substack.com/p/wal-and-sqlite</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/wal-and-sqlite</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Thu, 21 Sep 2023 22:33:33 GMT</pubDate><content:encoded><![CDATA[<p><strong><a href="https://pocketbase.io/">PocketBase</a></strong>, an open-source backend, made an interesting choice for its persistent storage: SQLite. In the industry, SQLite is usually used for simple client-side storage because it's lightweight, portable, and doesn't need much setup. However, it historically lacked some features, like RIGHT and FULL OUTER JOINs (added only in version 3.39), and it doesn't support multiple concurrent writers or user management. This choice raises some questions.</p><p>Here's their <strong><a href="https://pocketbase.io/faq">response</a></strong>:</p><blockquote><p><em><strong>PocketBase uses embedded SQLite (in WAL mode) and has no plans to support other databases. For most queries, SQLite (in WAL mode) performs better than traditional databases like MySQL, MariaDB, or PostgreSQL, especially for read operations. If you need replication and disaster recovery, you can consider using <a href="https://litestream.io/">Litestream</a>.</strong></em></p></blockquote><p>Basically, they're writing into a <strong><a href="https://www.sqlite.org/wal.html">WAL</a></strong> (Write-Ahead Log) instead of using a rollback journal, which addresses some critical issues with SQLite:</p><ol><li><p><strong>Increased Concurrency:</strong> Using WAL allows multiple readers and a single writer to work at the same time. 
Readers don't block the writer, and the writer doesn't block readers.</p></li><li><p><strong>Improved Performance:</strong> Appending a commit to the end of the WAL file is faster than writing into the middle of the main database file, so transactions are faster with WAL.</p></li><li><p><strong>Crash Recovery:</strong> It's more robust if there's a crash. Changes are first written to the WAL file and then transferred to the database file, reducing the risk of database corruption.</p></li><li><p><strong>Disk I/O Reduction:</strong> With WAL, disk operations are more sequential, which reduces the time spent on disk seeks.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[Hofstadter's Law]]></title><description><![CDATA[Hofstadter's Law addresses the common problem of accurately estimating the time it will take to complete a task.]]></description><link>https://olehbezhenar.substack.com/p/hofstadters-law</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/hofstadters-law</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Wed, 20 Sep 2023 22:16:54 GMT</pubDate><content:encoded><![CDATA[<p>Hofstadter's Law addresses the common problem of accurately estimating the time it will take to complete a task.</p><blockquote><p><em><strong>&nbsp;"It always takes longer than you expect, even when you take into account Hofstadter's Law."</strong></em></p></blockquote><p>Bill Gates' <strong><a href="https://www.goodreads.com/quotes/302999-most-people-overestimate-what-they-can-do-in-one-year">interpretation</a></strong> of this law especially resonates with me:</p><blockquote><p><em><strong>"Most people overestimate what they can do in one year and underestimate what they can do in ten years."</strong></em></p></blockquote><ul><li><p>You overestimate what you can achieve in the gym in one month, and underestimate your progress over six.</p></li><li><p>You overestimate your ability to learn 
something in a month, but underestimate what you can learn in one year.</p></li><li><p>You overestimate your saving ability for a particular paycheck, but underestimate what you can save in one year.</p></li></ul><p>While planning something, I try to take this law into account and be more humble in my short-term goals, and more ambitious with long-term ones.</p>]]></content:encoded></item><item><title><![CDATA[Not Invented Here (NIH) Syndrome]]></title><description><![CDATA[Not Invented Here (NIH) syndrome is a tendency to build in-house software instead of utilizing existing options.]]></description><link>https://olehbezhenar.substack.com/p/not-invented-here-nih-syndrome</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/not-invented-here-nih-syndrome</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Tue, 19 Sep 2023 21:57:47 GMT</pubDate><content:encoded><![CDATA[<p>Not Invented Here (NIH) syndrome is a tendency to build in-house software instead of utilizing existing options. In its simplest form, it's a constant need to reinvent the wheel. Here are some notable examples:</p><ol><li><p><strong>Netscape Navigator:</strong> Netscape decided to rewrite its entire codebase for Netscape Navigator 5.0, believing that starting from scratch would enable them to leapfrog the competition. Unfortunately, the project took much longer than expected, and by the time Netscape 6.0 (5.0 was skipped altogether) was released in 2000, Internet Explorer had taken over the browser market. Netscape's market share never recovered.</p></li><li><p><strong>Digg v4:</strong> Social news aggregator Digg decided to rewrite its entire codebase for version 4, moving away from MySQL and Memcache to Cassandra. The move was not well-received by users, and numerous bugs and performance issues led to a mass exodus to competitors like Reddit. 
The company's value plummeted, and they were eventually sold for a fraction of their peak value.</p></li><li><p><strong>Rewriting Quake by id Software:</strong> John Carmack, a co-founder of id Software, decided to rewrite the Quake game engine from scratch in C++, moving away from C. The rewrite ended up taking much longer than anticipated and led to numerous bugs and stability issues, damaging the game's reputation.</p></li><li><p><strong>Friendster:</strong> One of the first social networking sites, Friendster, faced scalability issues as more users joined. Instead of improving and optimizing their existing platform, they decided to rewrite the entire codebase. The result was a buggy, slow platform that frustrated users and led to a rapid decline in the user base.</p></li><li><p><strong>HealthCare.gov:</strong> When the U.S. government launched HealthCare.gov in 2013, it was a disaster due to numerous technical issues. Despite the government's massive resources, the site suffered from poor performance and frequent crashes. A key reason for the site's issues was that the government insisted on custom-building much of the site's functionality rather than using proven existing solutions.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[Cloud IDEs]]></title><description><![CDATA[I was working in StackBlitz and thinking about the potential future of Cloud Integrated Development Environments (IDEs).]]></description><link>https://olehbezhenar.substack.com/p/cloud-ides</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/cloud-ides</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Mon, 18 Sep 2023 21:18:27 GMT</pubDate><content:encoded><![CDATA[<p>I was working in StackBlitz and thinking about the potential future of Cloud Integrated Development Environments (IDEs). 
While they provide a convenient way to quickly spin up preconfigured development environments, I believe they won't entirely replace on-device development environments for everyday engineering tasks. They may, however, find their niche in replacing Virtual Desktop Infrastructure (VDI) software, especially in situations where contractual restrictions prevent storing codebases locally, as is often the case for consulting companies.</p><p>Here are some reasons why I think local development environments will continue to dominate:</p><ol><li><p><strong>Freedom of Tooling Choice</strong>: In your local environment, you have the freedom to select and customize tools and plugins that align perfectly with your workflow. You can even use proprietary tools if necessary, which can be challenging to integrate into online IDEs.</p></li><li><p><strong>Databases</strong>: While it's possible to create a development database in the cloud for certain use cases, having a local database can be indispensable. Whether you need it for testing migrations or simply for experimenting with data, a local environment offers greater flexibility. GitHub Codespaces does allow the use of Docker images, but this can add to your bill, leading to the next point.</p></li><li><p><strong>Pricing</strong>: Cloud IDEs often come with a price tag. GitHub Codespaces, for instance, bills on a per-core, per-hour basis, while CodeSandbox charges $15 per month per editor. AWS Cloud9's pricing is tied to the underlying EC2 instance usage. 
Paying a monthly fee for a tool that offers a subset of the capabilities available in your local environment may not be cost-effective for many developers.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[Tokenization]]></title><description><![CDATA[Tokenization is the process of breaking down text into components, known as tokens.]]></description><link>https://olehbezhenar.substack.com/p/tokenization</link><guid isPermaLink="false">https://olehbezhenar.substack.com/p/tokenization</guid><dc:creator><![CDATA[Oleh Bezhenar]]></dc:creator><pubDate>Sun, 17 Sep 2023 19:31:32 GMT</pubDate><content:encoded><![CDATA[<p><strong><a href="https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html">Tokenization</a></strong> is the process of breaking down text into components, known as tokens. Each token might represent an individual word or phrase. This process is required to make the data more manageable and suitable for various NLP tasks (text mining, ML, or text analysis). Let's have a look at the <strong><a href="https://en.wikipedia.org/wiki/BERT_(language_model)">BERT</a></strong>-like model tokenization process:</p><ul><li><p>Normalization (any clean-up of the text that is deemed necessary, such as removing spaces or accents, Unicode normalization, etc.)</p></li><li><p>Pre-tokenization (splitting the input into words)</p></li><li><p>Running the input through the model (using the pre-tokenized words to produce a sequence of tokens)</p></li><li><p>Post-processing (adding the special tokens of the tokenizer, generating the attention mask and token type IDs)</p></li></ul><p>Example:</p><pre><code><code>"Hello how are U today?" - Input
    |
    v
"hello how are u today?" - Case normalization
    |
    v
["hello", "how", "are", "u", "to", "##day", "?"] - Subword tokenization
    |
    v
["[CLS]", "hello", "how", "are", "u", "to", "##day", "?", "[SEP]"] - Assigning special tokens
</code></code></pre><p>Here <code>today</code> has been split into smaller subword pieces. This technique is known as Subword Tokenization, often used in models like BERT to handle out-of-vocabulary words.</p><blockquote><p><em><strong>Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords. For instance&nbsp;</strong></em><code>"annoyingly"</code>&nbsp;might be considered a rare word and could be decomposed into&nbsp;<code>"annoying"</code>&nbsp;and&nbsp;<code>"ly"</code>. Both&nbsp;<code>"annoying"</code>&nbsp;and&nbsp;<code>"ly"</code>&nbsp;as stand-alone subwords would appear more frequently while at the same time the meaning of&nbsp;<code>"annoyingly"</code>&nbsp;is kept by the composite meaning of&nbsp;<code>"annoying"</code>&nbsp;and&nbsp;<code>"ly"</code>. This is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords. <strong><a href="https://huggingface.co/docs/transformers/tokenizer_summary#subword-tokenization">Source</a></strong></p></blockquote><p>Special tokens are used for classification in BERT-like models. The <code>[CLS]</code> token is used to represent the entire context of the input for tasks like classification, while the <code>[SEP]</code> token is used to separate different sentences or contexts within the same text.</p>]]></content:encoded></item></channel></rss>