mysql炸裂函数实例

2021-03-20

Word count: 1.4k | Reading time≈ 6 min

首先需要了解一个函数：

substring_index（“待截取字符串”，“截取规则”，截取前n个字符），如果第三个参数为负数，则倒过来取前n个

举例：

# 取第一个:==>得到结果为： 15
SELECT SUBSTRING_INDEX('15,151,152,16',',',1);

# N可以为负数，表示倒数前N个字符串。（有负号的时候，可以将整个字符倒过来看）==>得到结果为： 16
SELECT SUBSTRING_INDEX('15,151,152,16',',',-1);

# 取前2个字符串，然后倒过来取第一个 ==>得到结果为： 151
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX('15,151,152,16',',',2),',',-1);

# 倒过来取前两个字符串，再正数第一个:==> 得到结果为：152
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX('15,151,152,16',',',-2),',',1);

如果有一个字符数组，我们想分别取第一个，第二个，第三个字符：

user1,user2,user3

1
2
3

substring_index(substring_index(processed_data,',',1),',',-1),
substring_index(substring_index(processed_data,',',2),',',-1),
substring_index(substring_index(processed_data,',',3),',',-1)

以此类推第n个字符：

1	substring_index(substring_index(processed_data,',',n),',',-1)

如果我们需要把前n个字符都取出来，那就要借助一个足够长从0单调递增的数列，把<=n的所有字符都提取出来

创建一个只有一列的单调递增表：

drop procedure test_200w;

DELIMITER $$
CREATE PROCEDURE test_200w(IN in_count INT) # 创建存储过程
BEGIN
    DECLARE COUNT INT DEFAULT 1;
# DECLARE SUM INT DEFAULT 0;
    WHILE COUNT <= in_count
        DO
            # SET SUM = SUM + COUNT;
            insert into test_200w values (COUNT);
            SET COUNT = COUNT + 1;
        END WHILE;
END $$
DELIMITER ;

drop table test_200w;

create table test_200w
(
    id bigint primary key default 0
);

call test_200w(200000);

select *
from test_200w;

下面就来利用这个表进行炸裂：

需求1：炸裂email

示例数据中的email字段示例：

[
  "last@roechling.com",
  "rep@roechling.com.sg",
  "info@roechling.com",
  "bewerbung@roechling.com",
  "brown@roechling.com",
  "bkoennecker@roechling.com",
  "info-xan@roechling.com",
  "ikhammarit@roechling.com",
  "dwalulik@roechling.com",
  "reg_thermo@roechling.com.sg",
  "ina.breitsprecher@roechling.com",
  "bewerbung.debre@roechling.com",
  "robaproducts@roechling.com",
  "reckerstorfer@roechling.com",
  "aluger@roechling.com",
  "datenschutz@roechling.com",
  "info.deneu@roechling.com",
  "rep@roechling.com",
  "zentrale.deneu@roechling.com",
  "hr-oep@roechling.com",
  "apueschel@roechling.com",
  "seminar-management@roechling.com",
  "admin@roechling.com.sg",
  "ibaumgaertel@roechling.com",
  "personal.waldachtal@roechling.com",
  "kbiedebach@roechling.com",
  "mfritzmann@roechling.com",
  "sales.deneu@roechling.com",
  "nikolaus_niepon@roechling.com",
  "lsieber@roechling.com",
  "d-jane@roechling.com",
  "rep@roechling.com.cn",
  "pstraub@roechling.com",
  "jane_d@roechling.com",
  "info@roechling.com.cn",
  "peter.walsh@roechling.com.email",
  "breitsprecher@roechling.com",
  "susanne.salomon@roechling.com",
  "johannes.salmuth@roechling.com",
  "angela.lin@roechling.com.sg",
  "gabriele.stegher@roechling.com",
  "jermel.jones@roechling.com",
  "rega@roechling.com",
  "lewis.carter@roechling.com",
  "kgina.breitsprecher@roechling.com",
  "jane.doe@roechling.com",
  "martina.anschuetz@roechling.com",
  "jtoepfer-linss@roechling.com",
  "peter.walsh@roechling.com",
  "witherspoon@roechling.com",
  "cachedkeegan@roechling.com",
  "chris@roechling.com",
  "chris_krebs@roechling.com",
  "heidi@roechling.com.email",
  "abernathy@roechling.com",
  "ludger@roechling.com",
  "liangtong@roechling.com.sg",
  "jtoepfer-linss@roechling.com.mit",
  "rep@roechling.com.company",
  "plastics.rep@roechling.com.sg",
  "bewerbung.deneu@roechling.com",
  "jtoepfer-linss@roechling.com.sv",
  "ludger.bartels@roechling.com",
  "admin@roechling.com",
  "james_andolina@roechling.com",
  "darren@roechling.com",
  "linlin.ng@roechling.com.sg",
  "cachedrep@roechling.com.sg",
  "georg.duffner@roechling.com",
  "km.pang@roechling.com.sg",
  "j-d@roechling.com",
  "djane@roechling.com",
  "odell@roechling.com",
  "info.debre@roechling.com"
]

这里首先要对数据进行清洗：清洗掉’[‘,’]’,’”‘,然后在join单调表，取id<=字符条数的所有字符进行炸裂

SELECT 公司名称,
               地址,
               网址,
               电话,
               主要管理人,
               雇员人数,
               成立时间,
               substring_index(substring_index(processed_data, ',', b.id), ',', - 1) 邮箱
        FROM (
            SELECT 公司名称,
            地址,
            网址,
            电话,
            主要管理人,
            雇员人数,
            成立时间,
            REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(邮箱, '[', ''), ']', ''), '"', ''), ' ', ''),
                            CHAR(10), ''), CHAR(13), '') processed_data
            FROM company_all
        ) temp
                 JOIN test_200w b
                      ON b.id <= (length(temp.processed_data) - length(REPLACE(temp.processed_data, ',', '')) + 1)
       union all select * from company_all where 邮箱 is null;

这里有一点需要特别注意：如果有的列email为null，那么这些列的length也是null，长度是无法作差的，所以会丢失数据，这里需要加上。结果如下：

需求2：炸裂email并取前5

方案一：

     SELECT 公司名称,
            地址,
            网址,
            电话,
            主要管理人,
            雇员人数,
            成立时间,
            substring_index(substring_index(processed_data, ',', b.id), ',', -1) 邮箱
     FROM (
              SELECT 公司名称,
                     地址,
                     网址,
                     电话,
                     主要管理人,
                     雇员人数,
                     成立时间,
                     substring_index(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(邮箱, '[', ''), ']', ''), '"', ''),' ', ''), CHAR(10), ''), CHAR(13), ''), ',',5) processed_data
              FROM company_all
          ) temp
              JOIN test_200w b
                   ON b.id <= (length(temp.processed_data) - length(REPLACE(temp.processed_data, ',', ''))+1)
union all select * from company_all where 邮箱 is null;

方案二：

SELECT 公司名称,
       地址,
       网址,
       电话,
       主要管理人,
       雇员人数,
       成立时间,
       substring_index(substring_index(processed_data, ',', b.id), ',', - 1) 邮箱
FROM (
         SELECT 公司名称,
                地址,
                网址,
                电话,
                主要管理人,
                雇员人数,
                成立时间,
                REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(邮箱, '[', ''), ']', ''), '"', ''), ' ', ''),
                                CHAR(10), ''), CHAR(13), '') processed_data
         FROM company_all
     ) temp
         JOIN test_200w b ON b.id <= (5)
group by 公司名称, 地址, 网址, 电话, 主要管理人, 雇员人数, 成立时间,
         substring_index(substring_index(processed_data, ',', b.id), ',', - 1);

效果如下：

Donate

Copyright： Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.