MySQL提取Json内部字段转储为数字

背景

用户眼科属性表记录数大概 986w，目的是把大概 29w 记录的属性值（json 格式）的其中八个字段解析为数字，转储为统计表的记录，用于图表分析。

以下结构、数据都大部分我瞎诌的，不可当真

用户眼科属性表结构如下

create table `property` (
  `id` int(11) not null auto_increment,
  `ownerid` int(11) not null comment '记录id或者模板id',
  `ownertype` tinyint(4) not null comment '类型。0：记录 1：模板',
  `recorderid` bigint(20) not null default '0' comment '记录者id',
  `userid` bigint(20) not null default '0' comment '用户id',
  `roleid` bigint(20) not null default '0' comment '角色id',
  `type` tinyint(4) not null comment '字段类型。0：文本 1：备选项 2：时间 3：图片 4：icd10 9：新图片',
  `name` varchar(128) not null default '' comment '字段名称',
  `value` mediumtext not null comment '字段值',
  primary key (`id`),
  unique key `idxowneridownertypenametype` (`ownertype`,`ownerid`,`name`,`type`) using btree,
  key `idxuseridroleidrecorderidname` (`userid`,`roleid`,`recorderid`,`name`)
) engine=innodb default charset=utf8mb4 comment='属性';

问题分析

1、属性值是 json 格式的，需要使用 json 操作函数处理

因为属性值是 json 格式的，如下。较大的一个 json，但是只需要其中 8 个字段值，提取出来分门别类归为不同统计指标下。

{   ......
    "sight": {
        "nakedeye": {
            "left": "0.9",
            "right": "0.6"
        },
        "correction": {
            "left": "1",
            "right": "1"
        }
    },
    ......
    "axiallength": {
        "left": "21",
        "right": "12"
    },
    "kornearadius": {
        "left": "34",
        "right": "33"
    },
    ......
}

所以，需要用到 json 操作函数：json_extract(value,’$.key1.key2′)。

但是需要注意的是这个函数提取的值是带””。比如对上述记录执行json_extract(value,’$.sight.nakedeye.left’)的结果是”22″；也可能字段值是空字符串，那结果就是””。

所以，需要使用 replace函数把结果中的 “” 删除掉，最后提取字段的表达式就是：replace(json_extract(value,’$.sight.nakedeye.left’),'”‘,”)。

如果字段不存在的话，结果就是 null；无论是外层 sight 不存在，或是内层 left 不存在。

2、字段内容不规范，乱七八糟

理想下，填写的都是规范数字，那经过上面那一步就可以提取完直接导入新表。

但是，现实很残酷，填的东西那叫一个乱七八糟。比如：

数字 + 备注：1(配合欠佳)、1-\+（我猜这是想表示偏高或偏低）
数字 + 单位：跟上面相似，1mm
多数值或区间：22.52/42.45、1-5
纯文本描述：不配合、无法记录
文本、数字混杂描述：较上次增长 10、<1、小于1、bd234/kd23

没办法，找产品和业务对情况，好在不多，就 4000 多条，大致扫一下心里有数。得出以下几条解决方案：

数字开头：数字开头都是正确记录的数据，省略掉文字描述即可
多数值或区间：取最前面的数即可
纯文本：说明没有数据，排除掉
文本、数字混杂：具体问题具体分析，把其他处理掉之后看还有多少

具体怎么做呢？

第一步：排除正常的数字数据和空数据

where `nakedeyeleft` regexp '[^0-9.]' = 1 // 这个已经可以排除 null 了
 and `nakedeyeleft` != ''

第二步：如果不包含数字，将其设置 null 或空字符串

set nakedeyeleft = if(nakedeyeleft not regexp '[0-9]', '', nakedeyeleft)

第三步：提取数字开头的数据的首个数值

set nakedeyeleft = if((nakedeyeleft + 0 = 0), nakedeyeleft, nakedeyeleft + 0)

结合起来就是

set nakedeyeleft = if(nakedeyeleft not regexp '[0-9]''', '', 
                      if((nakedeyeleft + 0 = 0), nakedeyeleft, nakedeyeleft + 0))
where `nakedeyeleft` regexp '[^0-9.]' = 1 // 这个已经可以排除 null 了
 and `nakedeyeleft` != ''

ps：处理一个字段的sql 看着就简单，但是因为批量一次处理 8 个字段，组合起来就很长。

千万注意不要写错字段。

最后剩下的就是第四类：文本、数字混杂，40 多条。

有些看着简单的，可以用正则自动化处理，比如<1、小于1。

记录的增长值，需要查找上次记录进行计算：较上次增长 10。

剩下有点复杂的，就需要人为处理，提取出可用数据，比如bd234/kd23

不知道看到这里的各位是不是也觉得有些麻烦呢？

我也以为咬着牙搞了，结果业务说直接处理成 0，到时候发现是 0 的话，可以通过页面重新保存的。

就不需要判断是不是数字打头了，直接 + 0；如果是数字打头，会保留开头的数字；否则 = 0。

那最后数据格式化sql：

update property 
set nakedeyeleft = if(nakedeyeleft not regexp '[0-9]''', '', nakedeyeleft + 0)
where `nakedeyeleft` regexp '[^0-9.]' = 1 // 这个已经可以排除 null 了
 and `nakedeyeleft` != '';

3.又要抽取内容、又要格式化，记录还有 900w+，太慢了

property 表有 900w+ 的数据，而所需记录的条件，只有name、ownertype、type是可知的，没法命中现有的索引。

如果直接查找的话，直接就是全表扫描，外加数据提取和格式化；更何况还需要关联其他表，补充统计指标的一些其他字段。

这种情况下，直接导入统计表的话，结果就是把两张表+关联表一起锁较长时间，期间没法更改和插入，这样不大现实。

减少扫描行数

做法一：给 name、ownertype、type 加上索引，将扫描记录缩减到 20 w。

但是问题是900w 数据加索引，用完需要删除索引（因为不是业务情况需要），就会导致两次波动；

再加上后续处理锁表时长，问题还是很大。

做法二：将一个记录较少的表做驱动表，这个表可以关联目标表。

create table `property` (
  `ownerid` int(11) not null comment '记录id或者模板id',
  `ownertype` tinyint(4) not null comment '类型。0：记录 1：模板',
  `type` tinyint(4) not null comment '字段类型。0：文本 1：备选项 2：时间 3：图片 4：icd10 9：新图片',
  `name` varchar(128) not null default '' comment '字段名称',
  `value` mediumtext not null comment '字段值',
    省略其他字段
  unique key `idxowneridownertypenametype` (`ownertype`,`ownerid`,`name`,`type`) using btree
) engine=innodb default charset=utf8mb4 comment='属性';

表中ownerid 可以关联到记录表，加上之前的条件name、ownertype、type，如此刚好命中并“idxowneridownertypenametype (ownertype,ownerid,name,type) 。

create table `medicalrecord` (
  `id` int(11) not null auto_increment,
  `name` varchar(50) not null default '' comment '记录名称',
  `type` tinyint(4) not null default '0' comment '记录类型。',
    省略其他字段
  key `idxname` (`name`) using btree
) engine=innodb  default charset=utf8mb4 comment='记录';

记录表可以通过 name=’眼科记录’命中索引idxname，扫描行数只有2w，加上属性表 29w，最后扫描行数只有 30w 左右，比之全表扫描属性表少了 30 倍！！！。

避免数据提取和格式化的锁表时长

因为存在 8 个字段，每个字段都需要提取和格式化，中间还需要进行判断。这样子一个 sql 里面同样的提取和格式化操作就要多次执行了。

所以，为了避免这样的问题，需要中间表暂存提取和格式化结果。

create table `propertytmp` (
  `id` int(11) not null auto_increment,
   `value` mediumtext not null comment '字段值',
  `nakedeyeleft` varchar(255) default null comment '视力-裸眼-左眼',
  `nakedeyeright` varchar(255) default null comment '视力-裸眼-右眼',
  `correctionleft` varchar(255) default null comment '视力-矫正-左眼',
  `correctionright` varchar(255) default null comment '视力-矫正-右眼',
  `axiallengthleft` varchar(255) default null comment '眼轴长度-左眼',
  `axiallengthright` varchar(255) default null comment '眼轴长度-右眼',
  `kornearadiusleft` varchar(255) default null comment '角膜曲率-左眼',
  `kornearadiusright` varchar(255) default null comment '角膜曲率-右眼',
  `updated` datetime not null comment '更新时间',
  `deleted` tinyint(1) not null default '0',
  primary key (`id`)
) engine=innodb  default charset=utf8mb4;

先将数据导入该表，在此基础上做提取，然后格式化。

最后执行结果比较

数据导入比较

结果：全表扫描属性表导入中间表（40s），属性表新增索引+导入（6s + 3s），关联导入（1.4s）。

因为需要关联其他表，并没有预测的那么理想。

中间表数据提取：7.5s

update `propertytmp` 
set nakedeyeleft = replace(json_extract(value,'$.sight.axiallength.left'),'"',''),
nakedeyeleft = replace(json_extract(value,'$.sight.nakedeye.left'),'"',''),
nakedeyeright = replace(json_extract(value,'$.sight.nakedeye.right'),'"',''),
correctionleft = replace(json_extract(value,'$.sight.correction.left'),'"',''),
correctionright = replace(json_extract(value,'$.sight.correction.right'),'"',''),
axiallengthleft = replace(json_extract(value,'$.axiallength.left'),'"',''),
axiallengthright = replace(json_extract(value,'$.axiallength.right'),'"',''),
kornearadiusleft = replace(json_extract(value,'$.kornearadius.left'),'"',''),
kornearadiusright = replace(json_extract(value,'$.kornearadius.right'),'"','');

中间表数据格式化：2.3s

正则判断比我想象的要快啊

update propertytmp 
set nakedeyeleft = if(nakedeyeleft not regexp '[0-9]' and nakedeyeleft != '', '', nakedeyeleft + 0), 
nakedeyeright = if(nakedeyeright not regexp '[0-9]' and nakedeyeright != '', '', nakedeyeright + 0), 
correctionleft = if(correctionleft not regexp '[0-9]' and correctionleft != '', '', correctionleft + 0),
correctionright = if(correctionright not regexp '[0-9]' and correctionright != '', '', correctionright + 0),
axiallengthleft = if(axiallengthleft not regexp '[0-9]' and axiallengthleft != '', '', axiallengthleft + 0),
axiallengthright = if(axiallengthright not regexp '[0-9]' and axiallengthright != '', '', axiallengthright + 0),
kornearadiusleft = if(kornearadiusleft not regexp '[0-9]' and kornearadiusleft != '', '', kornearadiusleft + 0),
kornearadiusright = if(kornearadiusright not regexp '[0-9]' and kornearadiusright != '', '', kornearadiusright + 0)
where (`nakedeyeleft` regexp '[^0-9.]' = 1
       and `nakedeyeleft` != '')
  or (`nakedeyeright` regexp '[^0-9.]' = 1
      and `nakedeyeright` != '')
  or (`correctionleft` regexp '[^0-9.]' = 1
      and `correctionleft` != '')
  or (`correctionright` regexp '[^0-9.]' = 1
      and `correctionright` != '')
  or (`axiallengthleft` regexp '[^0-9.]' = 1
      and `axiallengthleft` != '')
  or (`axiallengthright` regexp '[^0-9.]' = 1
      and `axiallengthright` != '')
  or (`kornearadiusleft` regexp '[^0-9.]' = 1
      and `kornearadiusleft` != '')
  or (`kornearadiusright` regexp '[^0-9.]' = 1
      and `kornearadiusright` != '');

统计指标中间表

因为实际导入统计指标表时，还需要排除为空数据，以及关联其他表做补充。

为了减少对指标表的影响，又建了指标表的中间表，结构完全一致，id自增是目标表 + 10000。

将属性中间表的数据导入指标中间表，最后直接 insert … select from，就很快了。

当然这步其实有点矫枉过正了，但是为了避免线上的一些波动，还是谨慎一些较好。

总结

这是一次简单的数据迁移经历记录。

没有索引优化、sql优化的内容，只是觉得大家需要有这种关注性能和对用户影响的考虑。

到此这篇关于mysql提取json内部字段转储为数字的文章就介绍到这了,更多相关mysql提取json转储为数字内容请搜索www.887551.com以前的文章或继续浏览下面的相关文章希望大家以后多多支持www.887551.com！