Today, companies commonly leverage their vast reserves of data to derive business value and improve strategic performance. Achieving these goals requires a strong set of data science skills.

Equally important – if not more so – is proper management of the data infrastructure. Doing this requires the data to be:

  • Stored such that it can be accessed conveniently


  • Cleaned up properly


  • Kept up to date

Does a data architect earn more than a data scientist?

It is hard to say whether a data architect is more important than a data science professional. The rising importance of the former, however, means that data architect salaries are comparable, and sometimes even higher. According to PayScale, the average annual salary in the US for a data scientist is USD 96,089, while for a data architect it is USD 121,816.

What are the important components of data architecture?

Below are the key components of the data architecture at a company. These handle different functionalities, as explained here:

  • Data lake: A data lake is where the original set of business data is stored. This data should ideally undergo no processing, or only minimal processing, so that any error introduced downstream can be corrected by reprocessing from the original copy.


  • Data warehouse: A data warehouse reflects the global direction of how the data is finally used. It stores the data that has been processed and structured by a managed data model, often in a tabular form.


  • Data mart: This is where the data to be used by a particular business unit or geographical area is stored. It holds an aggregated and/or subpart data set. A good example is summarizing the key performance indicators (KPIs) for a specific line of business, and then visualizing it through a business intelligence (BI) tool. When frequent and regular updating of the data mart is required, it is worth spending time preparing this component after the warehouse step. However, if some data is needed only for one-time ad hoc data analysis, this can be skipped.
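To illustrate the data-mart step, here is a minimal Python sketch that aggregates warehouse-style rows into per-business-line KPIs of the kind a BI tool would visualize. The business lines and figures are invented for illustration:

```python
from collections import defaultdict

# Hypothetical warehouse rows: (business_line, revenue, orders).
warehouse_rows = [
    ("retail", 1200.0, 10),
    ("retail", 800.0, 5),
    ("online", 2500.0, 20),
]

def build_data_mart(rows):
    """Aggregate warehouse rows into per-business-line KPIs,
    the kind of summary a data mart holds for a BI tool."""
    totals = defaultdict(lambda: {"revenue": 0.0, "orders": 0})
    for line, revenue, orders in rows:
        totals[line]["revenue"] += revenue
        totals[line]["orders"] += orders
    # Derive an average-order-value KPI per business line.
    return {
        line: {
            "revenue": t["revenue"],
            "orders": t["orders"],
            "avg_order_value": t["revenue"] / t["orders"],
        }
        for line, t in totals.items()
    }

mart = build_data_mart(warehouse_rows)
print(mart["online"])  # {'revenue': 2500.0, 'orders': 20, 'avg_order_value': 125.0}
```

The mart holds only the aggregated view, so a BI tool never needs to touch the full warehouse tables.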

Here are the key differences:

| Component | Purpose | Built by | Used by | Data speed | Maintenance cost |
| --- | --- | --- | --- | --- | --- |
| Data lake | Data storage | Data architect, data engineer | Data engineer, data scientist | High, real-time | Most expensive |
| Data warehouse | Data availability in structured and managed format | Data architect, data engineer, data scientist | Data scientist, business expert | Medium, regular/real-time | Somewhat expensive |
| Data mart | Cherry-picking for specific business line usage | Data scientist, business expert | Business expert | Low, regular | Least expensive |


What is the utility of separate components?

The purpose of different components is to cater to the requirements at different stages:

  • Data lake: Maintain data as close to original as possible


  • Data warehouse: Structure the data sets properly and put a maintenance plan in place, ensuring good manageability, clear ownership, and an analytics-oriented database type


  • Data mart: Suited to non-tech people who can easily access the final outputs of data journeys

Extracting data from business systems into the data lake, building data models in the data warehouse, and establishing an extract-transform-load (ETL) pipeline fall under the purview of a data engineer. Data scientists, on the other hand, handle the stages from extracting data out of the warehouse to building a data mart and creating business value through further application. This, however, is just the ideal division, and in real life the roles tend to overlap to some degree.
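The division of labor above can be made concrete with a toy ETL pipeline. In this sketch, a plain Python list stands in for the data lake and an in-memory SQLite database stands in for the warehouse; the records and field names are invented:

```python
import sqlite3

# Extract: raw, untouched records, as a data lake would hold them.
raw_events = [
    {"user": " Alice ", "amount": "120.50"},
    {"user": "bob", "amount": "80.00"},
    {"user": "", "amount": "15.00"},  # missing user: dropped in transform
]

def transform(events):
    """Clean and type the raw records into the warehouse's managed model."""
    for e in events:
        user = e["user"].strip().lower()
        if not user:
            continue  # discard records that fail validation
        yield (user, float(e["amount"]))

# Load: insert the cleaned rows into the warehouse's tabular model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user TEXT, amount REAL)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", transform(raw_events))

total = conn.execute("SELECT SUM(amount) FROM purchases").fetchone()[0]
print(total)  # 200.5
```

Note that the raw list is never modified: if the transform turns out to be wrong, it can simply be re-run against the lake.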

What tools are used for each component?

Given the differing purposes of each component, they require different tools to serve their respective purposes. Here are the details:

  • Data lake and data warehouse: The right tool depends chiefly on the size of the data:

| Data size | Typical tools |
| --- | --- |
| GBs and above (managed cloud services) | Snowflake, BigQuery, Redshift, Amazon Athena, Azure Synapse Analytics, Treasure Data, Amazon S3 |
| GBs and above (self-managed) | Hadoop (HDFS, Hive, Presto), ClickHouse |
| 10s to 100s of MBs | Microsoft Excel, Google Sheets |


  • ETL tools: The ETL process comes into the picture once data in the data lake is to be processed for the data warehouse. When data arrives in real time, event-driven messaging tools are preferred.

| Approach | Self-hosted / packaged tools | Managed cloud services |
| --- | --- | --- |
| GUI-based ETL | ASTERIA WARP, Alteryx, CData Sync, Informatica PowerCenter, Talend Platform | AWS Glue, Databricks; Trocco, Reckoner, Azure Data Factory |
| Event-driven messaging / streaming | Fluentd, Apache Kafka, Logstash, Apache Storm | Amazon Kinesis, AWS SNS + SQS + Lambda, Google Cloud Pub/Sub + Cloud Functions, Azure Stream Analytics |
| Code-based processing | Apache Spark, Apache Beam, SQL, Python, Java, Scala, Go, awk | Cloud Dataflow |

  • Workflow engine: This manages the process of pipelining the data. Its uses include using a flow chart to visualize the stage of progress of a process, or triggering automatic retries when there are errors. Common options here include Apache Airflow, Luigi, Digdag, Azkaban, Jenkins, and Rundeck.
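The dependency-ordering and automatic-retry behavior described above can be sketched in a few lines of Python. This is an invented toy scheduler, not the API of any of the engines listed:

```python
def run_pipeline(tasks, deps, max_retries=2):
    """tasks: name -> callable; deps: name -> list of prerequisite names."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):
            run(upstream)  # resolve prerequisites first
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up after exhausting retries
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order

attempts = {"load": 0}

def flaky_load():
    attempts["load"] += 1
    if attempts["load"] == 1:
        raise RuntimeError("transient failure")  # succeeds on retry

order = run_pipeline(
    tasks={"extract": lambda: None, "transform": lambda: None, "load": flaky_load},
    deps={"transform": ["extract"], "load": ["transform"]},
)
print(order)  # ['extract', 'transform', 'load']
```

Production engines add scheduling, logging, and visualization on top of this core loop, but the ordering-plus-retry logic is the essence.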


  • Data mart or BI tools: The choice here depends on the business context, familiarity, aggregated data size, and the warehouse solution currently in use, among other considerations.

| Category | Tools |
| --- | --- |
| Commercial BI tools | Tableau, Looker, Qlik, ThoughtSpot, Domo, Microsoft Power BI |
| Free / open-source BI tools | Redash, Metabase, Kibana, Grafana, Microsoft Power BI, Google Data Studio |
| For use by anyone | Microsoft Excel, Google Sheets |