Data Warehouse Construction Standards - Data Modeling Standards

3.1 Data Modeling Standards

a. Horizontal Layering:

  • Explanation

    Our layered design approach is a cornerstone of our data architecture strategy; adhering to strict standards during the modeling phase is what yields optimal results.

  • Layer Standards

    • The Operational Data Store (ODS)
      • Design Methodology:
        • Store raw data with minimal processing in the data warehouse system, maintaining a structure that is similar to the source system.
      • Main Functions:
        • Synchronize and store source-system data in the data warehouse to eliminate data silos and ensure the integrity of data integration.
        • Ensure data persistence by keeping tables and data fully in sync with the source systems.
        • Synchronize on a regular schedule and add a synchronization timestamp to each table to capture temporal changes.
      • Data Processing:
        • Handle anomalous and erroneous data.
      • Naming Convention:
        • Layer abbreviation_source system_source table name
        • Example: ods_oms_order
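
    As a side note, this convention is mechanical enough to generate or lint programmatically. Below is a minimal Python sketch; the helper name is hypothetical, not part of the standard:

    def ods_table_name(source_system: str, source_table: str) -> str:
        """Build an ODS-layer table name: layer abbreviation_source system_source table name."""
        return f"ods_{source_system.lower()}_{source_table.lower()}"

    # e.g. the order table synchronized from the OMS system:
    assert ods_table_name("oms", "order") == "ods_oms_order"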
More...

Data Warehouse Construction Standards - What

3. What Are the Key Components of Data Warehouse Standards?

a. Data Modeling Standards:

These standards define the guidelines for designing and structuring the data models used in the data warehouse. They include naming conventions, entity-relationship diagrams, data type definitions, and relationships between tables.

b. Data Integration Standards:

These standards focus on the processes and methods used to extract, transform, and load (ETL) data into the data warehouse. They cover data extraction techniques, data cleansing procedures, transformation rules, and data loading strategies.

c. Data Quality Standards:

These standards ensure the accuracy, consistency, completeness, and validity of data within the data warehouse. They include data profiling, data validation rules, data cleansing methodologies, and data quality metrics.

d. Metadata Standards:

Metadata standards define the structure and format of metadata stored in the data warehouse. They cover metadata definitions, metadata repositories, metadata integration, and metadata management processes.

e. Security and Access Control Standards:

These standards focus on protecting the data warehouse from unauthorized access, ensuring data privacy, and enforcing data security policies. They include user authentication, authorization mechanisms, encryption techniques, and data masking methods.

f. Performance and Scalability Standards:

These standards address the performance optimization and scalability aspects of the data warehouse. They cover query optimization techniques, indexing strategies, partitioning schemes, and data archiving processes.

g. Data Governance Standards:

Data governance standards define the overall framework and processes for managing and governing data within the data warehouse. They include data stewardship, data ownership, data lifecycle management, and data governance policies.

h. Documentation Standards:

These standards ensure the documentation of all aspects of the data warehouse, including data models, ETL processes, data dictionaries, and data lineage. They promote understanding, maintainability, and ease of future enhancements.

Data Warehouse Construction Standards - How

2. How To Implement Data Warehouse Standards

a. Standard formulation

This involves defining and documenting the specific rules, guidelines, and best practices that will govern the data warehouse. It is essential to involve stakeholders from different teams and departments to ensure comprehensive coverage and alignment with business requirements.

b. Standard discussion

Once the standards are formulated, it is important to engage in discussions with relevant teams and individuals. This can include conducting workshops, training sessions, or meetings to explain the standards, address any questions or concerns, and gather feedback.

c. Standard implementation

After the standards have been discussed and finalized, they need to be effectively communicated and implemented across the organization. This can involve creating documentation, providing training, and establishing processes and tools to support adherence to the standards.

d. Standard enforcement and supervision

It is crucial to have mechanisms in place to monitor and enforce compliance with the data warehouse standards. This can include regular audits, performance reviews, and ongoing communication and support to ensure that the standards are being followed consistently.

e. Standard refinement

Data warehouse standards should be treated as living documents that are continuously reviewed and refined based on feedback, evolving business needs, and technological advancements. Regular evaluations and updates should be conducted to ensure that the standards remain relevant and effective.

Data Warehouse Construction Standards - Why

1. Why We Need Data Warehouse Standards

The adage “No rules, no success” underscores the critical importance of established standards in ensuring optimal team performance and high-quality deliverables. In the absence of such standards, operational efficiency and collaboration may suffer, leading to potentially chaotic outcomes.

Have you encountered similar issues in your work?

  • I received a requirement and wasn't sure which table to pull data from. Table A seemed feasible, and table B also looked workable. I asked colleague A, who said they always pull from table C. I spent a long time exploring those three tables but couldn't reconcile them. Oh well, let me just recalculate from the source; and then a new table D appeared.
  • I’ve noticed that there are thousands of tables in our database, but I only use a handful. So, what’s the point of all these other ones? I asked my colleagues, but nobody seems to know. Should I just get rid of them? Nobody else is touching them anyway.
  • I got tasked with investigating an error in our process after my boss asked me to take a look. But man, the code is a total mess! I can’t make heads or tails of it. Plus, I’ve been searching for what feels like forever but still can’t find the upstream dependencies. What a headache!
  • My coworker bailed on our project, and now I’m stuck with their share of the work. I’ve been grinding away for weeks, but I just can’t seem to wrap my head around it. It’s like they dumped a ton of unfinished business on me, and I’m feeling pretty frustrated myself now. Maybe it’s time for me to look for a new gig too…

Our data warehouse team’s performance has taken a hit due to all the issues we’ve faced lately. Efficiency, output quality, job satisfaction – you name it. And let me tell you, it’s usually the hardest-working and most loyal employees who bear the brunt of all these problems. It’s just not right.

If you’ve ever worked in data development, you know the pain I’m talking about. I mean, who hasn’t experienced some of these frustrations, right? So, what’s going on here? In my humble opinion, it all boils down to a lack of standards or proper implementation. And hey, I get it – sometimes business demands are tight, and shortcuts gotta be taken. But, that technical debt better be paid off pronto. Blaming employees for that ain’t cool. Leadership needs to own up to it.

Think of a data warehouse like a digital construction project – it’s the intangible result of our data engineers’ hard work. Data standards are like the blueprints for building this system, serving as both the instruction manual and translator for data usage. And just like how you need quality control in construction, we gotta ensure data quality too. But here’s the thing – for our data system to really thrive, we need to move away from relying on individual judgment and toward standardized, tool-driven management. That way, we can scale sustainably and keep the system healthy.

Deploying the Kibana Visualization Tool with Docker (Building an ELK Logging Stack from Scratch)

What Is Kibana?

Kibana is an open-source analytics and visualization platform designed to work with Elasticsearch.

You can use Kibana to search and view data stored in Elasticsearch. Kibana presents that data through a variety of charts, tables, and maps, visualizing it intuitively to enable advanced data analysis.

Deploying Kibana

  1. Pull the Kibana image

    sudo docker pull kibana:7.7.1
  2. Get the Elasticsearch container's IP

    sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' es
    > 172.17.0.2
  3. Create the Kibana configuration directory

    sudo mkdir -p /data/elk/kibana/
  4. Configure kibana.yml

    elasticsearch.hosts sets the address of the ES service:

    server.name: kibana
    server.host: "0"
    elasticsearch.hosts: ["http://172.17.0.2:9200"]
    xpack.monitoring.ui.container.elasticsearch.enabled: true
  5. Start the service

    sudo docker run -d --restart=always --log-driver json-file --log-opt max-size=100m --log-opt max-file=2 --name kibana -p 5601:5601 -v /data/elk/kibana/kibana.yml:/usr/share/kibana/config/kibana.yml kibana:7.7.1
  6. Verify

    Open [server ip]:5601 in a browser.

    When Kibana first loads, the home page will offer to add some sample data; just click the "Try our sample data" button. Then select Dashboard to see charts built from the sample data.

The kibana.yml Configuration File

  1. Service port

    The property is server.port; the default is 5601.

  2. Remote access address

    The property is server.host; by default Kibana only listens on the local machine.

    This is the address Kibana binds to. To make the Kibana service reachable from remote hosts, set it to the server's externally reachable IP address; to allow access from all remote hosts, set it to 0.0.0.0.

  3. Elasticsearch connection

    The property is elasticsearch.url (in Kibana 7.x it is elasticsearch.hosts, as used in the deployment above); it defaults to the local machine's Elasticsearch on port 9200, i.e. localhost:9200. If Elasticsearch is not installed on the same host as Kibana, or its port is not 9200, you need to change this setting.

  4. Elasticsearch username and password

    The properties are elasticsearch.username and elasticsearch.password; by default there are no credentials. If Elasticsearch is configured with a username and password, you need to set these two properties.

  5. Switching the UI to Chinese

    Set i18n.locale: "zh-CN".
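
Putting these settings together, a complete kibana.yml for a remote, password-protected Elasticsearch might look like the following sketch; the address and the credentials are placeholders, not values from this deployment:

    server.port: 5601
    server.host: "0.0.0.0"
    elasticsearch.hosts: ["http://172.17.0.2:9200"]
    elasticsearch.username: "kibana_user"
    elasticsearch.password: "changeme"
    i18n.locale: "zh-CN"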

Setting Up a Python Virtual Environment (venv)

Background

Python has many virtual-environment solutions; here I'll introduce the one that ships with Python itself.

The venv Module

Since Python 3.3, the built-in venv module can replace the virtualenv tool. The advantage is that nothing extra needs to be installed: any Python 3.3 or later can create virtual environments out of the box:

python -m venv myvenv

This creates a virtual environment named myvenv in the current directory.

venv accepts some options, though fewer than virtualenv. Here are a few of them:

  • --without-pip: do not install pip (it is installed by default)
  • --clear: if the target directory already contains another environment, clear it and recreate it

Because a venv is created from, and tied to, a specific Python interpreter, there is no need to specify a Python interpreter version.

Activation

After the virtual environment is created, it must be activated before it can be used in the current shell; you can think of activation as replacing the value of the PATH variable in the current shell session.

Environments created with virtualenv and with the venv module are activated the same way: by running the activation script.

  • On Windows, the activation script is <myvenv>\Scripts\activate.bat; in a PowerShell session, use Activate.ps1 instead. Replace <myvenv> with your own virtual environment directory.
  • On Linux, the activation script is <myvenv>/bin/activate. It has no execute permission by default, so either make it executable or run it with the source command, for example:

$ source myvenv/bin/activate

After activation, the environment's name appears as a marker in the command-line prompt, e.g. (myvenv).

To print PATH, run:

On Linux:

echo $PATH

On Windows:

echo %PATH%

You can see that the virtual environment's script directory has been prepended to PATH.
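
To double-check which interpreter is currently active, you can also print sys.prefix, which points at the virtual environment directory while the environment is active (plain Python behavior, shown here as a quick sanity check):

python -c "import sys; print(sys.prefix)"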

Deactivation

Exiting the virtual environment is simple: just run the deactivate command. This command also lives in the environment's script directory; because activation put that directory on PATH, it can be run directly.

Deactivating is equivalent to restoring PATH to its original value.

Elasticsearch Basics: Operations, Part 1

Background

If everything is working, Elastic runs on the default port 9200. Open another terminal window and request that port, and you will get a description of the instance:

$ curl localhost:9200
{
  "name" : "10eaed25cea5",
  "cluster_name" : "my-es",
  "cluster_uuid" : "dk5aq0evTjaXc2LEHIGtsA",
  "version" : {
    "number" : "7.7.1",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "ad56dce891c901a492bb1ee393f12dfff473a423",
    "build_date" : "2020-05-28T16:30:01.040088Z",
    "build_snapshot" : false,
    "lucene_version" : "8.5.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

Basic Concepts

Node and Cluster

Elastic is essentially a distributed database: multiple servers can work together, and each server can run multiple Elastic instances.

A single Elastic instance is called a node; a group of nodes forms a cluster.

Index

Elastic indexes every field and, after processing, writes them into an inverted index. When looking up data, it queries this index directly.

So the top-level unit of data management in Elastic is called an Index. It is the synonym of a single database, and the name of each Index (i.e., database) must be lowercase.
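
For example, the following request creates a new Index named weather (the name is arbitrary, chosen just for illustration), and the server returns a JSON acknowledgement:

$ curl -X PUT 'localhost:9200/weather'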

The following command lists all the Indexes on the current node:

$ curl -X GET 'http://localhost:9200/_cat/indices?v'

It returns a result like the following:

More...

Deploying the Elasticsearch Database with Docker (Building an ELK Logging Stack from Scratch)

Background

I've recently been playing with the ELK logging platform, Elastic's complete solution for log collection, analysis, and presentation. You only truly master a tool once you've studied it and operated it yourself: it looks simple, but the process involves quite a few steps, so I'm recording the steps and the problems I ran into. Today I'll share how to deploy the Elasticsearch database in a Docker environment.

What Is Elasticsearch?

Elasticsearch (ES for short) is a distributed, highly scalable, near-real-time search and data-analytics engine, and the core of the entire ELK architecture.

It makes it easy to search, analyze, and explore large volumes of data. Taking full advantage of Elasticsearch's horizontal scalability makes data much more valuable in production.

Elasticsearch works in roughly the following steps: first, the user submits data to the Elasticsearch database; a tokenizer then splits the relevant text into terms, and the terms are stored together with their weights. When the user searches, results are ranked and scored according to those weights, and the ranked results are returned to the user.
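
As a rough illustration of the idea only (a heavily simplified Python sketch, not Elasticsearch's actual implementation; the documents and the whitespace "tokenizer" here are made up), an inverted index maps each term to the documents that contain it, so a search only has to visit the posting lists of the query terms:

from collections import defaultdict

docs = {
    1: "elastic is a search engine",
    2: "kibana visualizes elastic data",
    3: "logstash ships logs",
}

# Build the inverted index: term -> {doc_id: term frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.split():  # trivial whitespace tokenizer
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

def search(query):
    # Score each document by summed term frequency, then rank by score
    scores = defaultdict(int)
    for term in query.split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search("elastic data"))  # [(2, 2), (1, 1)]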

Main Advantages

  • Flexible processing: Elasticsearch provides real-time full-text indexing with powerful search capabilities;
  • Relatively simple configuration: Elasticsearch is driven entirely through JSON interfaces, Logstash uses modular configuration, and Kibana's configuration files are simpler still;
  • Efficient retrieval: thanks to its excellent design, even though every query runs in real time, queries over tens of billions of records can respond within seconds;
  • Linear cluster scaling: both Elasticsearch and Logstash can be scaled out flexibly and linearly.

Common Use Cases

  • Full-text search

Elasticsearch provides full-text search, suitable for e-commerce product search, in-app search, enterprise internal information search, IT system search, and more.

  • Log analysis

Elasticsearch can quickly connect to all kinds of common data sources through Beats, Logstash, and the like, and the integrated Kibana makes visual log analysis efficient, letting logs generate value.

  • Operations monitoring

Use Elasticsearch together with Beats, Logstash, or ElasticFlow to centralize and index all logs in real time, then use the integrated Kibana to flexibly build visual operations dashboards from the data.

  • Security analytics

Tackle a company's tangled internal security-audit work through logs: Elasticsearch can analyze and search massive volumes of historical logs to complete security audits efficiently.

Deploying the ES Database

1. Pull the ES image

sudo docker pull elasticsearch:7.7.1

2. Create the mount directories

sudo mkdir -p /data/elk/es/config
sudo mkdir -p /data/elk/es/data
sudo mkdir -p /data/elk/es/logs

3. Grant permissions

sudo chown -R 1000:1000 /data/elk/es

4. Enter config and create the configuration file

cd /data/elk/es/config
sudo vim elasticsearch.yml

Edit the elasticsearch.yml file:

cluster.name: "my-es"
network.host: 0.0.0.0
http.port: 9200

5. Start the service

sudo docker run -it  -d -p 9200:9200 -p 9300:9300 --name es -e ES_JAVA_OPTS="-Xms84m -Xmx512m" -e "discovery.type=single-node" --restart=always -v /data/elk/es/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml -v /data/elk/es/data:/usr/share/elasticsearch/data -v /data/elk/es/logs:/usr/share/elasticsearch/logs elasticsearch:7.7.1

Pay particular attention here: make sure the directories inside the container are correctly mapped to the mount directories created in step 2.

6. Verify the service

curl 127.0.0.1:9200

If you see the JSON server information in the response, the installation succeeded!

Python Lists, Dictionaries, Tuples, and Sets: Differences and Usage

Background

When I was learning Python, I kept mixing up a few data types, so here is a summary of the differences between list, tuple, set, and dict:

1. List: list, []

A list is an ordered data structure that can be read and written; you can add, delete, and search for its elements. Because elements can be added and removed, it is called a mutable type, i.e. a type whose contents can be changed, and lists can be nested.

res = [1, 2, 'yihang', 4, 5]
# Add elements: append and extend
res.append(1)            # [1, 2, 'yihang', 4, 5, 1]
res.extend('6')          # [1, 2, 'yihang', 4, 5, 1, '6']
# Remove elements: del, pop, slicing, remove
del res[1]               # by index: [1, 'yihang', 4, 5, 1, '6']
res.pop(1)               # by index; with no argument, removes the last element: [1, 4, 5, 1, '6']
res = res[:2] + res[3:]  # by slicing: [1, 4, 1, '6']
res.remove(4)            # by value, first occurrence: [1, 1, '6']
# Modify an element
res[1] = 100             # [1, 100, '6']
# Read elements
print(res[0])            # 1
print(res[1])            # 100

Note: if the element you want to add is itself a list, append adds the whole list as a single element, while extend appends the list's items one by one.
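
A quick sketch of the difference:

a = [1, 2]
b = [1, 2]
a.append([3, 4])  # the whole list becomes a single element
b.extend([3, 4])  # the items are appended one by one
print(a)  # [1, 2, [3, 4]]
print(b)  # [1, 2, 3, 4]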

More...
