Bearalise's Blog

Data Warehouse Construction Standards - How

2024-03-20

2.How To Implement Data Warehouse Standards

a. Standard formulation

This involves defining and documenting the specific rules, guidelines, and best practices that will govern the data warehouse. It is essential to involve stakeholders from different teams and departments to ensure comprehensive coverage and alignment with business requirements.

b. Standard discussion

Once the standards are formulated, it is important to engage in discussions with relevant teams and individuals. This can include conducting workshops, training sessions, or meetings to explain the standards, address any questions or concerns, and gather feedback.

c. Standard implementation

After the standards have been discussed and finalized, they need to be effectively communicated and implemented across the organization. This can involve creating documentation, providing training, and establishing processes and tools to support adherence to the standards.

d. Standard enforcement and supervision

It is crucial to have mechanisms in place to monitor and enforce compliance with the data warehouse standards. This can include regular audits, performance reviews, and ongoing communication and support to ensure that the standards are being followed consistently.

Data warehouse standards should be treated as living documents that are continuously reviewed and refined based on feedback, evolving business needs, and technological advancements. Regular evaluations and updates should be conducted to ensure that the standards remain relevant and effective.

Data Warehouse Construction Standards - Why

2024-03-20

当日所学

1. Why We Need Data Warehouse Standards

The adage “No rules, no success” underscores the critical importance of established standards in ensuring optimal team performance and high-quality deliverables. In the absence of such standards, operational efficiency and collaboration may suffer, leading to potentially chaotic outcomes.

Have you encountered similar issues in your work?

Received a requirement and not sure which table to pull data from. Table A seems feasible, while table B also appears to work. Asked colleague A, and they said they always pull from table C. Spent a long time exploring these three tables but couldn’t match them up. Oh well, let me calculate from the source again, and then a new table D appeared.
I’ve noticed that there are thousands of tables in our database, but I only use a handful. So, what’s the point of all these other ones? I asked my colleagues, but nobody seems to know. Should I just get rid of them? Nobody else is touching them anyway.
I got tasked with investigating an error in our process after my boss asked me to take a look. But man, the code is a total mess! I can’t make heads or tails of it. Plus, I’ve been searching for what feels like forever but still can’t find the upstream dependencies. What a headache!
My coworker bailed on our project, and now I’m stuck with their share of the work. I’ve been grinding away for weeks, but I just can’t seem to wrap my head around it. It’s like they dumped a ton of unfinished business on me, and I’m feeling pretty frustrated myself now. Maybe it’s time for me to look for a new gig too…

Our data warehouse team’s performance has taken a hit due to all the issues we’ve faced lately. Efficiency, output quality, job satisfaction – you name it. And let me tell you, it’s usually the hardest-working and most loyal employees who bear the brunt of all these problems. It’s just not right.

If you’ve ever worked in data development, you know the pain I’m talking about. I mean, who hasn’t experienced some of these frustrations, right? So, what’s going on here? In my humble opinion, it all boils down to a lack of standards or proper implementation. And hey, I get it – sometimes business demands are tight, and shortcuts gotta be taken. But, that technical debt better be paid off pronto. Blaming employees for that ain’t cool. Leadership needs to own up to it.

Think of a data warehouse like a digital construction project – it’s the intangible result of our data engineers’ hard work. Data standards are like the blueprints for building this system, serving as both the instruction manual and translator for data usage. And just like how you need quality control in construction, we gotta ensure data quality too. But here’s the thing – for our data system to really thrive, we need to move away from relying on individual judgment and toward standardized, tool-driven management. That way, we can scale sustainably and keep the system healthy.

有用的学习英语插件-1

2024-03-18

有趣资源

语言反应器（Language Reactor）

非常强大，可以提取Youtube字幕，翻译，如图：

字幕可以导出

导出格式如下：

插件：链接
官网：链接

在Docker环境下部署 Kibana 可视化工具 (从零到一搭建ELK日志)

2024-03-16

日常练习

什么是 Kibana？

Kibana是一个开源的分析与可视化平台，设计出来用于和Elasticsearch一起使用的。

你可以用kibana搜索、查看存放在Elasticsearch中的数据。Kibana与Elasticsearch的交互方式是各种不同的图表、表格、地图等，直观的展示数据，从而达到高级的数据分析与可视化的目的。

部署 Kibana 可视化工具

获取 kibana 镜像
1
sudo docker pull kibana:7.7.1

获取elasticsearch容器 ip

1 2	sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' es > 172.17.0.2

创建 kibana 配置文件
1
sudo mkdir -p /data/elk/kibana/

配置kibana.yml

elasticsearch.hosts 配置 ES 服务的地址

server.name: kibana
server.host: "0"
elasticsearch.hosts: ["http://172.17.0.2:9200"]
xpack.monitoring.ui.container.elasticsearch.enabled: true

启动服务

sudo docker run -d --restart=always --log-driver json-file --log-opt max-size=100m --log-opt max-file=2 --name kibana -p 5601:5601 -v /data/elk/kibana/kibana.yml:/usr/share/kibana/config/kibana.yml kibana:7.7.1

验证

在浏览器打开 [服务器ip]:5601

Kibana 控制台的界面如下所示，打开 kibana 时，首页会提示让你选择加入一些测试数据，点击 try our sample data 按钮就可以了。最后选择Dashboard，可以看到 sample data 的图表：

配置文件kibana.yml

服务的端口配置

属性名为：server.port默认是5601
允许远程访问的地址配置

属性名为：server.host默认为本机

如果我们需要把Kibana服务给远程主机访问，只需要在这个配置中填写远程的那台主机的ip地址，那如果我们希望所有的远程主机都能访问，那就填写0.0.0.0
连接Elasticsearch服务配置

属性名为：elasticsearch.url默认为连接到本机的elasticsearch，并且端口为9200，也就是为localhost:9200，如果我们Elasticsearch不是与Kibana安装在同一台主机上，或者Elasticsearch的端口号不是9200，就就需要修改这个配置了
Elasticsearch的用户名和密码

属性名为：elasticsearch.username和elasticsearch.password，默认是没有用户名和密码，如果elasticsearch是配置了用户名和密码的，那就需要配置这两行属性
切换中文配置

属性名为：i18n.locale: “zh-CN”

安装 Python 虚拟环境（venv）

2024-03-08

当日所学

背景

Python有很多虚拟环境解决方案，今天介绍一个Python自带的解决方案。

venv 模块

Python3.3 之后，可以用模块 venv 代替 virtualenv 工具，好处是不用单独安装，3.3 及之后的版本，都可以通过安装好的 Python 来创建虚拟环境:

python -m venv myvenv

可以在当前目录创建一个名为 myvenv 的虚拟环境

venv 有些参数，不过相比 virtualenv 少了些，这里简单介绍几个：

-without-pip: 不需要安装 pip，默认为安装
-clear：如果创建虚拟环境的目录已经有了其他虚拟环境，清楚重建

因为 venv 是依附于一个 Python 解析器创建的，所以不需要指定 Python 解释器版本

激活

虚拟环境创建好后，需要激活才能在当前命令行中使用，可以理解成将当前命令行环境中 PATH 变量的值替换掉

通过 virtualenv 和模块 venv 创建的虚拟环境，激活方式是一样的，即运行激活脚本

Windows 系统中，激活脚本路径是 <myvenv>\Scripts\activate.bat，如果是 powershell 命令行，脚本换成 Activate.ps1 , 注意将 <myvenv>换成你自己的虚拟环境目录
Linux 系统中，激活脚本路径是 <myvenv>/bin/activate，默认脚本没有执行权限，要么设置脚本为可执行，要么用 source 命令执行，例如

$ source myvenv/bin/activate

激活后，可以在命令行中看到虚拟环境标记，如上图

打印 PATH，命令如下：

Linux 下：

echo $PATH

Windows 下

echo %PATH%

可以看到创建的虚拟环境脚本目录被加载了最前面

退出

退出虚拟环境很简单，只需要执行 deactivate 命令就行，这个命令也在虚拟环境的脚本目录下，因为激活时，将脚本目录设置到 PATH 中了，所以可以直接使用

退出虚拟环境相当于将 PATH 恢复成原来的。

ElasticSearch入门篇-操作1

2024-03-07

日常练习

背景

如果一切正常，Elastic 就会在默认的9200端口运行。这时，打开另一个命令行窗口，请求该端口，会得到说明信息。

$ curl localhost:9200
{
  "name" : "10eaed25cea5",
  "cluster_name" : "my-es",
  "cluster_uuid" : "dk5aq0evTjaXc2LEHIGtsA",
  "version" : {
    "number" : "7.7.1",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "ad56dce891c901a492bb1ee393f12dfff473a423",
    "build_date" : "2020-05-28T16:30:01.040088Z",
    "build_snapshot" : false,
    "lucene_version" : "8.5.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

基本概念

Node 与 Cluster

Elastic 本质上是一个分布式数据库，允许多台服务器协同工作，每台服务器可以运行多个 Elastic 实例。

单个 Elastic 实例称为一个节点（node）。一组节点构成一个集群（cluster）。

Index

Elastic 会索引所有字段，经过处理后写入一个反向索引（Inverted Index）。查找数据的时候，直接查找该索引。

所以，Elastic 数据管理的顶层单位就叫做 Index（索引）。它是单个数据库的同义词。每个 Index （即数据库）的名字必须是小写。

下面的命令可以查看当前节点的所有 Index。

1	$ curl -X GET 'http://localhost:9200/_cat/indices?v'

显示下面的结果：

More...

在Docker环境下部署 Elasticsearch 数据库(从零到一搭建ELK日志)

2024-03-07

日常练习

背景

最近在玩 ELK 日志平台，它是 Elastic 公司推出的一整套日志收集、分析和展示的解决方案。只有学习了，操作了才能算真正的学会使用了，虽然看起来简单，但是里面的流程步骤还是很多的，将步骤和遇到的问题记录和总结下，今天主要分享下在Docker环境下部署 Elasticsearch 数据库。

什么是Elasticsearch？

ElasticStack简称为ES，是一个分布式、高扩展、高实时的搜索与数据分析引擎，是整个ELK架构的核心。

它能很方便的使大量数据具有搜索、分析和探索的能力。充分利用Elasticsearch的水平伸缩性，能使数据在生产环境变得更有价值。

Elasticsearch 的实现原理主要分为以下几个步骤，首先用户将数据提交到Elasticsearch 数据库中，再通过分词控制器去将对应的语句分词，将其权重和分词结果一并存入数据，当用户搜索数据时候，再根据权重将结果排名，打分，再将返回结果呈现给用户。

主要优点

处理方式灵活： elasticsearch是实时全文索引，具有强大的搜索功能；
配置相对简单： elasticsearch全部使用JSON接口，logstash使用模块配置，kibana的配置文件部分更简单；
检索性能高效： 基于优秀的设计，虽然每次查询都是实时，但是也可以达到百亿级数据的查询秒级响应；
集群线性扩展： elasticsearch和logstash都可以灵活线性扩展。

常用场景

全文搜索

Elasticsearch提供了全文搜索的功能，适用于电商商品搜索、App搜索、企业内部信息搜索、IT系统搜索等。

日志分析

Elasticsearch能够借助Beats、Logstash等快速对接各种常见的数据源，并通过集成的Kibana高效地完成日志的可视化分析，让日志产生价值。

运维监控

使用Elasticsearch结合Beats、Logstash或ElasticFlow将所有日志实时集中并构建索引，然后通过集成的Kibana灵活地运用数据构建可视化运维看板

安全分析

通过日志解决公司内部繁杂的安全审计工作，可通过Elasticsearch分析、检索海量历史日志，高效地完成安全审计工作

部署ES数据库

1.获取es镜像

1	sudo docker pull elasticsearch:7.7.1

2. 创造挂载目录

1
2
3

sudo mkdir -p /data/elk/es/config
sudo mkdir -p /data/elk/es/data
sudo mkdir -p /data/elk/es/logs

3.赋予权限

1	sudo chown -R 1000:1000 /data/elk/es

4.进入config创建配置文件

1 2	cd /data/elk/es/config sudo vim elasticsearch.yml

编辑 elasticsearch.yml 文件：

1
2
3

cluster.name: "my-es"
network.host: 0.0.0.0
http.port: 9200

5.启动服务

sudo docker run -it  -d -p 9200:9200 -p 9300:9300 --name es -e ES_JAVA_OPTS="-Xms84m -Xmx512m" -e "discovery.type=single-node" --restart=always -v /data/elk/es/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml -v /data/elk/es/data:/usr/share/elasticsearch/data -v /data/elk/es/logs:/usr/share/elasticsearch/logs elasticsearch:7.7.1

这里我们需要特别注意的是，做好容器内的文件与第二步创建的挂载目录进行文件夹的映射。

6.验证服务

1	curl 127.0.0.1:9200

看到这个，就表示我们已经安装成功了!

Python的列表，字典，元组，集合的区别和各自使用方法

2024-02-27

当日所学

背景

我在学习Python的时候，经常把几个数据类型搞混，下面就对 list，tuple，set，dict总结区别：

1.列表list []

列表是处理一组有序的数据结构，可以读写，添加和删除，或者搜索列表里的元素。因为可以添加和删除，所以称为可变的数据类型，即这种类型是可以被改变的，并且列表可以嵌套。

res = [1,2,'yihang']
#增加元素：extend和append
res.append(1)
res.extend('6')
#删除元素：del，pop，切片，remove
del res[1]
res.pop(1)#删除该位置上的元素，没有指定则是最后一个元素
res = res[:2]+res[3:]#切片
res.remove(2)#删除指定值的元素
#更改元素
res[1] = 100 #
#查元素
print(res[0])
print(res[1])

注意：如果想添加的一个元素是一个列表，那么append是将这个这个列表作为一个元素添加进来，而extend是将列中的元素一个一个添加进去

More...

算法练习：BM2 链表内指定区间反转

2024-02-26

日常练习

问题描述

将一个节点数为 size 链表 m 位置到 n 位置之间的区间反转，要求时间复杂度O(n)，空间复杂度O(1)。例如：给出的链表为 1→2→3→4→5→NULL, m=2,n=4,返回 1→4→3→2→5→NULL.

数据范围：链表长度 0<size≤10000，0<m≤n≤size，链表中每个节点的值满足 val∣≤10000

要求：时间复杂度 O(n) ，空间复杂度 O(n)

进阶：时间复杂度O(n)，空间复杂度O(1)

示例1

输入：

{1,2,3,4,5},2,4

返回值：

{1,4,3,2,5}

示例2

输入：

{5},1,1

返回值：

{5}

More...

Window上用包管理应用安装软件 — Chocolatey

2024-02-26

有趣资源

前言

在经历了多次重装系统、重新安装各种常用软件后，我突然萌生了一个想法，难道 Window 就没有一个像 Linux 一样可以使用一行命令安装软件的包管理器吗？

答案是：有的。

这里主要是介绍 Chocolatey，对其他包管理器感兴趣的可以自行查看。

安装使用

安装 Chocolatey 前，你需要对检查一下自身系统是否达到所需的要求。

Windows 7+ / Windows Server 2003+
PowerShell v2+
.NET Framework 4+

如果已经满足要求，就可以打开命令行 copy 下面的命令执行。

1
2

@powershell -NoProfile -ExecutionPolicy unrestricted -Command "(iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1'))) >$null 2>&1" && SET PATH=%PATH%;%ALLUSERSPROFILE%\chocolatey\bin

如果没报红字错误，应该是安装成功了。

1 2	choco -v # 0.10.11

接下来你可以打开下方的链接，查看一下可以安装的软件列表。

chocolatey.org/docs

比如，安装 Git

1	choco install git

安装 Node

1 2	choco install nodejs.install

也可以加上-y，安装时就不需要等待确认了

1 2	choco install firefox -y

基本用法

安装 choco install baretail
升级 choco upgrade baretail
卸载 choco uninstall baretail
搜索 choco search something
列出已安装 choco list -lo

关于更多用法可点击以下链接查看：
chocolatey.org/docs

修改默认安装目录

但是，Chocolatey 安装的软件默认存放在 C 盘，对于我这种不太喜欢把软件装到 C 盘的人来说是不能忍的。

虽然 Chocolatey 自己提供了一个修改默认安装目录的命令，但是需要注册（收费），对于资金充裕的同学我还是建议使用这种方法。

这里我提供另外一个思路，既然他存到默认安装目录，那我们把这个默认安装目录改一下就好了。

比如我要默认存放到E盘，打开注册表，把 HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion 下的：

ProgramFilesDir的值改为：E:\Program Files
ProgramFilesDir (x86)的值改为：E:\Program Files (x86)

参考链接：https://juejin.cn/post/6844903782854164487

2.How To Implement Data Warehouse Standards

a. Standard formulation

b. Standard discussion

c. Standard implementation

d. Standard enforcement and supervision

e. Standard refinement

1. Why We Need Data Warehouse Standards

语言反应器（Language Reactor）

什么是 Kibana？

部署 Kibana 可视化工具

配置文件kibana.yml

背景

venv 模块

激活

退出

背景

基本概念

Node 与 Cluster

Index

背景

什么是Elasticsearch？

主要优点

常用场景

部署ES数据库

1.获取es镜像

2. 创造挂载目录

3.赋予权限

4.进入config创建配置文件

5.启动服务

6.验证服务

背景

1.列表list []

问题描述

示例1

示例2

前言

安装使用

基本用法

修改默认安装目录